Loaders to import datasets into Open Spending

What are loaders

A loader is a way to import a dataset into Open Spending. Loaders use one or more dataset sources and create Entry, Entity and Classifier entries in the database. You can use whatever Python can read, for example csv, json, xls or xml files. We also provide helper functions to read data directly from Google Docs.

Note that we are working on a way to let users import csv files through the Open Spending web interface.

How to write a loader

In this section you will write a demo loader to import a simple csv file that provides all the data we need. The minimal set of data you need to import spendings is (named with their internal keys):

to
The recipient, which is an Entity (i.e. a source or sink for money).
from
The spender, also an Entity.
amount
The amount spent.
time
The time period to which the spending data refers. This could be years, months or days (but we usually analyze data per year).
currency
The currency the amount is quoted in.
classifiers (optional)
Information to classify a spending Entry, e.g. the region, department or sector it was spent on. This is not required but can be used to give the user different ways to explore the data.

Not all of this information has to be present in your data source. A common case is that the recipient (to) or spender (from) of all entries is the society, or that the currency is always the same, so you can hardcode such values in your data loader.
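
As a sketch, the minimal data for one spending entry could be collected in a plain Python dict. The keys are the internal names above; all values and the 'society' placeholder are illustrative, not real loader output:

```python
# A minimal spending entry using the internal keys described above.
# The values are made up for illustration; 'society' stands in for
# the default society entity (see "Default Society" below).
entry = {
    'to': 'society',             # the recipient
    'from': 'dfes',              # the spender
    'amount': 500000.0,          # the amount as a number, not a string
    'time': '2011',              # the time period the entry refers to
    'currency': 'EUR',           # hardcoded: all our entries are in EUR
    'region': 'North Yorkshire', # an optional classifier
}

# every required key is present
required = ['to', 'from', 'amount', 'time', 'currency']
assert all(key in entry for key in required)
```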

The demo source file is a csv file that contains one spending Entry per line. You know that the currency of all spendings is EUR. The file contains the following columns:

id
A distinct id for every spending Entry (e.g. a journal number).
spender_id
A distinct id for the spender.
spender_name
The name of the spender.
recipient_id
A distinct id for the recipient.
recipient_name
The name of the recipient.
region
The region the money is spent on.
sector
The sector the money is spent on.
date
The day the money was spent.
amount
The amount spent.

Here is your data (download demoloader.csv):

demoloader.csv
id spender_id spender_name recipient_id recipient_name region sector date amount
demo-sp-001 dfes Department for Education dtlr Department for the Regions North Yorkshire Social Protection 2010-01-01 1200000
demo-sp-002 dfes Department for Education dtlr Department for the Regions North Yorkshire Education 2010-02-01 800000
demo-sp-003 dfes Department for Education society General Public North Yorkshire Education 2011-03-01 500000
demo-sp-004 dtlr Department for the Regions society General Public Hartlepool Health 2011-04-01 1400000
demo-sp-005 dtlr Department for the Regions dfes Department for Education Hartlepool Health 2010-05-01 260000
demo-sp-006 dtlr Department for the Regions dfes Department for Education Hartlepool Social Protection 2011-06-01 1150000
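
To get a feel for the data, you could sum the amounts per spender with a few lines of standard-library Python. The rows below are copied from the table; only the two relevant columns are kept:

```python
import csv
import io
from collections import defaultdict

# The demo rows as csv text (a subset of the columns shown above).
data = """spender_id,amount
dfes,1200000
dfes,800000
dfes,500000
dtlr,1400000
dtlr,260000
dtlr,1150000
"""

totals = defaultdict(float)
for row in csv.DictReader(io.StringIO(data)):
    totals[row['spender_id']] += float(row['amount'])

print(totals['dfes'])  # 2500000.0
print(totals['dtlr'])  # 2810000.0
```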

Create a Loader

First you create a Loader. This is a class that lets you create Entry, Entity and Classifier objects in an Open Spending site easily, quickly and reliably, without deep knowledge of the stored data structures.

def make_loader():
    '''return a loader for our demo dataset'''
    loader = Loader(dataset_name=u'demodata',
                    unique_keys=['name'],
                    label=u'Demo data set',
                    description="A dataset to show how to write a data loader")
    return loader

  • We create the loader with four arguments:
    • dataset_name: the (internal) name of our dataset. It should be unique within the database.
    • unique_keys: the key (or combination of keys) that identifies one spending Entry uniquely. In this case the content of the id column is enough; you will use it as the name of an entry. Never use id as a key on any Dataset, Entry, Entity or Classifier; id has a special meaning internally.
    • label: the dataset label that you present to the user on the website.
    • description: a description you can present to the user. It is not strictly required, but we recommend adding one or two sentences.

At the end you return the instantiated loader to use it later. Internally the loader automatically creates a wdmmg.model.Dataset for you.

>>> loader = make_loader()
>>> loader.dataset.name
u'demodata'

See the Loader API for the available methods and the options to create it.

Read the data file

Now you can read the csv file. To do that, create a function that you can pass the filename to (i.e. you do not need to write a second function to read a second, similar file; you can use the same function multiple times). This is especially useful if you want to write tests for your data loader.

import csv

def read_data(filename):
    '''read a file with data for our demo dataset
    and return a list of dicts, one for every row.'''

    csv_file = open(filename)
    rows = csv.DictReader(csv_file)

    # convert all strings to unicode
    converted = []
    for row in rows:
        for key in row:
            row[key] = unicode(row[key], 'utf-8')
        converted.append(row)

    return converted

  • open(filename) opens the source file.
  • csv.DictReader creates a dict for every line in the file. The first line, the headers, is used for the keys of the dicts. See the Python documentation for details on how it works.
  • The loop converts all values from the csv file to unicode. Otherwise you may run into encoding errors later if the file contains non-ascii characters like French accents or German umlauts.
  • Finally the converted rows are returned as a list.

Note

If you want to work with massive amounts of data, you will want to use generator functions in many places and yield individual rows for performance reasons.
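
Such a generator variant of read_data could look like the following sketch (written for Python 3, where the csv module already returns text, so the unicode conversion step falls away):

```python
import csv

def iter_data(filename):
    '''yield one dict per row of the csv file, without
    loading the whole file into memory.'''
    with open(filename) as csv_file:
        for row in csv.DictReader(csv_file):
            yield row
```

Because the rows are yielded one at a time, the file has to stay open while you iterate; the with block takes care of closing it once the generator is exhausted.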

Now you have a list with one dict for every row:

>>> filename = '../wdmmg/tests/demoloader.csv'
>>> rows = read_data(filename)
>>> first_row = rows[0]
>>> first_row
{'amount': '1200000',
 'date': '2010-01-01',
 'id': 'demo-sp-001',
 'recipient_id': 'dtlr',
 'recipient_name': 'Department for the Regions',
 'region': 'North Yorkshire',
 'sector': 'Social Protection',
 'spender_id': 'dfes',
 'spender_name': 'Department for Education'}

Save the Entries into the database

Now that you have the dictionaries with the data you can save them into the database. You have to refine the data a bit, as the DictReader returns dicts where the keys are the column headers and all values are strings. But:

  1. The column id in the table will be the name of the entry you create.
  2. The amount is a number.
  3. date is a date. The main date of an Entry (the time axis of the dataset) is treated specially: it is saved on the Entry under the key time, in a special data structure.
  4. spender and recipient are not only text. In the dataset they are Entities. An Entity is a person or a thing (e.g. an administrative region, a company etc.) that can spend or receive money. Entities are not bound to a particular dataset; e.g. one entity that we want to be unique across all datasets is the society (see Default Society).
  5. region and sector are not only text. They can be used to classify each entry, and users will be able to navigate the data with the help of these classifiers. A Classifier is not bound to a dataset; it is part of a taxonomy. E.g. the European Union has a system to classify the spending entries of its organisations and members (COFOG). Usually you reuse an existing taxonomy or create one custom to your dataset, but you could even create Classifiers in different taxonomies.

Here we show you how to create custom ones.

So let’s prepare the data and save Entry, Entities and Classifiers.

def save_entry(loader, row):
    '''prepare a ``row`` and save the *entry*, the *entities*
    and the *classifiers* in the database.
    '''

    # This will hold the data we save as the entry later
    entry_data = {}

    # 1. The name of the *entry* is the id
    entry_data['name'] = row['id']

    # 2. convert the amount to a float and add it to entry_data
    amount = float(row['amount'])
    entry_data['amount'] = amount

    # 3. convert the date into the time structure we need and add it
    #    as 'time'
    date = row['date']
    timestructure = for_datestrings(date)
    entry_data['time'] = timestructure

    # 4. Turn the spender and the recipient into wdmmg.model.Entity objects
    #    and add them as 'from' and 'to'
    spender_id = row['spender_id']
    spender_name = row['spender_name']
    spender_entity = loader.create_entity(spender_id, spender_name)
    entry_data['from'] = spender_entity

    recipient_id = row['recipient_id']
    recipient_name = row['recipient_name']
    recipient_entity = loader.create_entity(recipient_id, recipient_name)
    entry_data['to'] = recipient_entity

    # 5. create wdmmg.model.Classifier objects for region and sector...
    region_name = row['region']
    region_classifier = loader.create_classifier(region_name, u'demotaxonomy')
    sector_name = row['sector']
    sector_classifier = loader.create_classifier(sector_name, u'demotaxonomy')

    # ... and classify the entry. The necessary information will be added
    #     to entry_data
    loader.classify_entry(entry_data, region_classifier, name=u'region')
    loader.classify_entry(entry_data, sector_classifier, name=u'sector')

    # 6. save the entry
    query_spec = loader.create_entry(**entry_data)

  • for_datestrings() saves the time in a special data structure. wdmmg.lib.times contains helper functions to create it.
  • Loader.create_entity() creates a wdmmg.model.Entity object and saves it to the database automatically. If an entity with the same name already exists it will be updated. The entity object is returned so you can use it.
  • Loader.create_classifier() creates a wdmmg.model.Classifier object and saves it to the database automatically. If a classifier with the same name already exists in the same taxonomy, it will be updated. The classifier object is returned so you can use it.
  • Loader.classify_entry() uses the classifiers to classify the entry_data. Additional information will be added to the dict; the name argument becomes the key under which the classification is stored and can be used to get more information about the meaning of the classifier.
  • Finally Loader.create_entry() creates the wdmmg.model.Entry. It will be saved in the database automatically, but for performance reasons create_entry() will not return an Entry object. It returns query_spec (a MongoDB query spec). If you have to do further work on the entry you can retrieve it with model.Entry.find_one(query_spec).

Dimensions - Describe the data for the users

Sweet. The system can read your data and you can make a loader and save entries and other data. So you’re nearly there.

Most of the data you save is pretty standard. Amount, sender, recipient, currency and date are generally understandable and used across all datasets. But for the special data in the demo dataset, region and sector, you need to describe how users should interpret them. The system also needs to know how a user should be able to analyze the data.

To describe the kind of data you store in the Entries use Loader.create_dimension().

def create_dimensions(loader):
    '''Describe which extra data we save beside the common data.'''

    loader.create_dimension(
        'sector', label='Sector',
        description=('The sector the money is spent in, as described in '
                     'catalog FOO published by the department BAR. '
                     'It should be mostly accurate, but it is possible '
                     'that the money was repurposed by the recipient.'))

    loader.create_dimension(
        'region', label="Region",
        description=("The money was appropriated to be used in "
                     "the named region. It will have been given to a number "
                     "of regional bureaus. However, some of these regional "
                     "bureaus have special functions that are not related "
                     "to the region."))

Put it all together

To wire it all together, write a load function that finds the source file and performs all the steps above. It is important to call Loader.compute_aggregates() at the end; this collects additional information in the database.

import os

def load():
    '''load the demo data into Open Spending'''
    directory = os.path.dirname(__file__)
    filename = os.path.join(directory, 'demoloader.csv')

    loader = make_loader()
    create_dimensions(loader)

    rows = read_data(filename)
    for row in rows:
        save_entry(loader, row)

    loader.compute_aggregates()

The complete loader can be downloaded here.

Loaders are normally used on the command line with the command:

paster load <loader_name>

To add your loader to the load command, add an entry point for the wdmmg.load group to the setup.py of the Python package you created. In this case the demo loader is located in wdmmg/tests/demoloader.py and the function is load:

setup(...
      entry_points="""
      [wdmmg.load]
      demo = wdmmg.tests.demoloader:load
      """
      )
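
The load command discovers loaders through this entry point group. To check that your entry point is registered you can list the group yourself; this sketch uses pkg_resources from setuptools (in an environment without any wdmmg packages installed the result is simply empty):

```python
import pkg_resources

# Iterate over everything registered under the wdmmg.load group.
# Each entry point knows its name and the function it resolves to.
loaders = {ep.name: ep for ep in pkg_resources.iter_entry_points('wdmmg.load')}
print(sorted(loaders))  # lists the names of all registered loaders
```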

Tests

If you write a loader that is used by other people you should also write tests to make sure everything works as expected. There might be changes in newer versions of the source file, or in newer versions of OpenSpending, that would go unnoticed if there were no tests.

You can find a set of tests in test_demoloader.py

'''These are the tests for the demoloader that is used in
the loader howto in /doc/loader.rst
'''

from unittest import TestCase

from wdmmg.model import init_serverside_js, Repository
from wdmmg.model.mongo import db, drop_db
from wdmmg.lib.loader import Loader


class TestDemoLoader(TestCase):

    def setUp(self):
        # create the collections so we have the
        # default indexes when we start
        Repository().delete_all()
        self.db = db()
        self.db.create_collection('entry')
        self.db.create_collection('entity')
        # upload server side javascript
        init_serverside_js()
        # fixme: we have to ease the tests for new loaders

    def tearDown(self):
        Repository().delete_all()
        drop_db()

    def test_create_loader(self):
        from wdmmg.tests.demoloader import make_loader
        loader = make_loader()
        self.assertTrue(isinstance(loader, Loader))
        self.assertEqual(loader.dataset.name, 'demodata')

    def test_read_data(self):
        from os.path import dirname, join
        from wdmmg.tests.demoloader import read_data

        directory = dirname(__file__)
        filename = join(directory, 'demoloader.csv')
        rows = read_data(filename)
        self.assertEqual(len(rows), 6)
        self.assertEqual(rows[0],
                         {'amount': '1200000',
                          'date': '2010-01-01',
                          'id': 'demo-sp-001',
                          'recipient_id': 'dtlr',
                          'recipient_name': 'Department for the Regions',
                          'region': 'North Yorkshire',
                          'sector': 'Social Protection',
                          'spender_id': 'dfes',
                          'spender_name': 'Department for Education'})

    def test_complete_run(self):
        from wdmmg.model import Entry
        from wdmmg.tests.demoloader import load

        load()
        # test that we can find one of our entries
        self.assertTrue(Entry.find_one({'dataset.name': 'demodata'}))

Tips and Tricks

Default Society

Often you have entries that name the society as the spender or recipient. In that case you should not use create_entity(); use Loader.get_default_society() instead.

Unique Keys/Names

The names of Entries must be unique within their dataset, the names of Classifiers unique across all classifiers in their taxonomy, and the names of Entities unique within an OpenSpending installation.

Most of the time this is not a problem; you often have unique ids, like journal numbers, for the Entries. If you do not have them, you can:

  • use functions from wdmmgext.load.utils to create keys
  • prefix keys you have with the name of your dataset
  • for Classifiers, create a new taxonomy specific to your dataset.
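
As an illustration of the prefixing approach, here is a tiny hypothetical helper (prefixed_name is not part of wdmmg; it only shows the idea):

```python
def prefixed_name(dataset_name, raw_id):
    '''Build a name that is unique across datasets by
    prefixing the raw id with the dataset name.'''
    return u'%s-%s' % (dataset_name, raw_id)

print(prefixed_name(u'demodata', u'001'))  # demodata-001
```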

Hierarchies

If you have a hierarchy, say:

  • Department: “Department Foo”
    • Subdepartment: “Subdepartment Bar”
      • Bureau “Bureau for extraordinary affairs”
foo = loader.create_entity('foo', "Department Foo")
bar = loader.create_entity('bar', "Subdepartment Bar")
bureau = loader.create_entity('extraordinary-affairs', "Bureau for ...")

# now we have to add cross references:
foo['subdepartment'] = bar
foo.save()

bar['department'] = foo
bar['bureau'] = bureau
bar.save()

bureau['subdepartment'] = bar
bureau.save()

Note the explicit save() calls: when you change an Entity, Entry or Classifier after you have created it, you need to save it explicitly.

Alternatively you can use just 'parent' and 'child' as keys (instead of 'department', 'subdepartment' and 'bureau').
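
Sketched with plain dicts standing in for the entity objects, the generic variant looks like this (with real entities you would still call save() after setting the keys):

```python
# plain dicts stand in for loader.create_entity() results here
foo = {'name': 'foo', 'label': "Department Foo"}
bar = {'name': 'bar', 'label': "Subdepartment Bar"}
bureau = {'name': 'extraordinary-affairs', 'label': "Bureau for ..."}

# the same hierarchy as before, expressed with generic keys
bar['parent'] = foo
bureau['parent'] = bar
foo['child'] = bar
bar['child'] = bureau
```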