Loaders to import datasets into Open Spending

What are loaders

A loader is a way to import a dataset into Open Spending. Loaders use one or more dataset sources and create Entry, Entity and Classifier entries in the database. You can use whatever Python can read, for example csv, json, xls or xml files. We also provide helper functions to read data directly from Google Docs.

Note that we are working on a way to let users import csv files through the Open Spending web interface.

How to write a loader

In this section you will write a demo loader to import a simple csv file that provides all the data we need. The minimal set of data you need to import spendings is (named with their internal keys):

to
The recipient, which is an Entity (i.e. a source or sink for money).
from
The spender, also an Entity.
amount
The amount spent.
time
The time period to which the spending data refers. This could be years, months or days (but we usually analyze data per year).
currency
The currency the amount is quoted in.
classifiers (optional)
Information to classify a spending Entry, e.g. the region, department or sector it was spent on. This is not required but can be used to give the user different ways to explore the data.

Not all of this information has to be present in your data source. A common case is that the recipient (to) or spender (from) of all entries is the society, or that the currency is always the same, so you can hardcode such values in your data loader.
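
As a sketch, the minimal data for one spending entry could be collected in a plain Python dict. The keys are the internal names above; all values and the 'society' placeholder are illustrative, not real loader output:

```python
# A minimal spending entry using the internal keys described above.
# The values are made up for illustration; 'society' stands in for
# the default society entity (see "Default Society" below).
entry = {
    'to': 'society',             # the recipient
    'from': 'dfes',              # the spender
    'amount': 500000.0,          # the amount as a number, not a string
    'time': '2011',              # the time period the entry refers to
    'currency': 'EUR',           # hardcoded: all our entries are in EUR
    'region': 'North Yorkshire', # an optional classifier
}

# every required key is present
required = ['to', 'from', 'amount', 'time', 'currency']
assert all(key in entry for key in required)
```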

The demo source file is a csv file that contains one spending Entry per line. You know that the currency of all spendings is EUR. The file contains the following columns:

id
A distinct id for every spending Entry (e.g. a journal number).
spender_id
A distinct id for the spender.
spender_name
The name of the spender.
recipient_id
A distinct id for the recipient.
recipient_name
The name of the recipient.
region
The region the money is spent on.
sector
The sector the money is spent on.
date
The day the money was spent.
amount
The amount spent.

Here is your data (download demoloader.csv):

demoloader.csv
id spender_id spender_name recipient_id recipient_name region sector date amount
demo-sp-001 dfes Department for Education dtlr Department for the Regions North Yorkshire Social Protection 2010-01-01 1200000
demo-sp-002 dfes Department for Education dtlr Department for the Regions North Yorkshire Education 2010-02-01 800000
demo-sp-003 dfes Department for Education society General Public North Yorkshire Education 2011-03-01 500000
demo-sp-004 dtlr Department for the Regions society General Public Hartlepool Health 2011-04-01 1400000
demo-sp-005 dtlr Department for the Regions dfes Department for Education Hartlepool Health 2010-05-01 260000
demo-sp-006 dtlr Department for the Regions dfes Department for Education Hartlepool Social Protection 2011-06-01 1150000
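
To get a feel for the data, you could sum the amounts per spender with a few lines of standard-library Python. The rows below are copied from the table; only the two relevant columns are kept:

```python
import csv
import io
from collections import defaultdict

# The demo rows as csv text (a subset of the columns shown above).
data = """spender_id,amount
dfes,1200000
dfes,800000
dfes,500000
dtlr,1400000
dtlr,260000
dtlr,1150000
"""

totals = defaultdict(float)
for row in csv.DictReader(io.StringIO(data)):
    totals[row['spender_id']] += float(row['amount'])

print(totals['dfes'])  # 2500000.0
print(totals['dtlr'])  # 2810000.0
```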

Create a Loader

First you create a Loader. This is a class that lets you create Entry, Entity and Classifier objects in an Open Spending site easily, quickly and reliably, without deep knowledge of the stored data structures.

def make_loader():
    '''return a loader for our demo dataset'''
    loader = Loader(dataset_name=u'demodata',
                    unique_keys=['name'],
                    label=u'Demo data set',
                    description="A dataset to show how to write a data loader")
    return loader

  • We create the loader with four arguments:
    • dataset_name: the (internal) name of our dataset. It should be unique within the database.
    • unique_keys: the key (or combination of keys) that identifies one spending Entry uniquely. In this case the content of the id column is enough; you will use it as the name of an entry. Never use id as a key on any Dataset, Entry, Entity or Classifier; id has a special meaning internally.
    • label: the dataset label that you present to the user on the website.
    • description: a description you can present to the user. It is not strictly required, but we recommend adding one or two sentences.

At the end you return the instantiated loader to use it later. Internally the loader automatically creates a wdmmg.model.Dataset for you.

>>> loader = make_loader()
>>> loader.dataset.name
u'demodata'

See the Loader API for the available methods and the options to create it.

Read the data file

Now you can read the csv file. To do that, create a function that you can pass the filename to (i.e. you do not need to write a second function to read a second, similar file; you can use the same function multiple times). This is especially useful if you want to write tests for your data loader.

import csv

def read_data(filename):
    '''read a file with data for our demo dataset
    and return a list of dicts, one for every row.'''

    csv_file = open(filename)
    rows = csv.DictReader(csv_file)

    # convert all strings to unicode
    converted = []
    for row in rows:
        for key in row:
            row[key] = unicode(row[key], 'utf-8')
        converted.append(row)

    return converted

  • open(filename) opens the source file.
  • csv.DictReader creates a dict for every line in the file. The first line, the headers, is used for the keys of the dicts. See the Python documentation for details on how it works.
  • The loop converts all values from the csv file to unicode. Otherwise you may run into encoding errors later if the file contains non-ascii characters like French accents or German umlauts.
  • Finally the converted rows are returned as a list.

Note

If you want to work with massive amounts of data, you will want to use generator functions in many places and yield individual rows for performance reasons.
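
Such a generator variant of read_data could look like the following sketch (written for Python 3, where the csv module already returns text, so the unicode conversion step falls away):

```python
import csv

def iter_data(filename):
    '''yield one dict per row of the csv file, without
    loading the whole file into memory.'''
    with open(filename) as csv_file:
        for row in csv.DictReader(csv_file):
            yield row
```

Because the rows are yielded one at a time, the file has to stay open while you iterate; the with block takes care of closing it once the generator is exhausted.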

Now you have a list with one dict for every row:

>>> filename = '../wdmmg/tests/demoloader.csv'
>>> rows = read_data(filename)
>>> first_row = rows[0]
>>> first_row
{'amount': '1200000',
 'date': '2010-01-01',
 'id': 'demo-sp-001',
 'recipient_id': 'dtlr',
 'recipient_name': 'Department for the Regions',
 'region': 'North Yorkshire',
 'sector': 'Social Protection',
 'spender_id': 'dfes',
 'spender_name': 'Department for Education'}

Save the Entries into the database

Now that you have the dictionaries with the data you can save them into the database. You have to refine the data a bit, as the DictReader returns dicts where the keys are the column headers and all values are strings. But:

  1. The column id in the table will be the name of the entry you create.
  2. The amount is a number.
  3. date is a date. The main date of an Entry (the time axis of the dataset) is treated specially: it is saved on the Entry under the key time, in a special data structure.
  4. spender and recipient are not only text. In the dataset they are Entities. An Entity is a person or a thing (e.g. an administrative region, a company etc.) that can spend or receive money. Entities are not bound to a particular dataset; e.g. one entity that we want to be unique across all datasets is the society (see Default Society).
  5. region and sector are not only text. They can be used to classify each entry, and users will be able to navigate the data with the help of these classifiers. A Classifier is not bound to a dataset; it is part of a taxonomy. E.g. the European Union has a system to classify the spending entries of its organisations and members (COFOG). Usually you reuse an existing taxonomy or create one custom to your dataset, but you could even create Classifiers in different taxonomies.

Here we show you how to create custom ones.

So let’s prepare the data and save Entry, Entities and Classifiers.

def save_entry(loader, row):
    '''prepare a ``row`` and save the *entry*, the *entities*
    and the *classifiers* in the database.
    '''

    # This will hold the data we save as the entry later
    entry_data = {}

    # 1. The name of the *entry* is the id
    entry_data['name'] = row['id']

    # 2. convert the amount to a float and add it to entry_data
    amount = float(row['amount'])
    entry_data['amount'] = amount

    # 3. convert the date into the time structure we need and add it
    #    as 'time'
    date = row['date']
    timestructure = for_datestrings(date)
    entry_data['time'] = timestructure

    # 4. Turn the spender and the recipient into wdmmg.model.Entity objects
    #    and add them as 'from' and 'to'
    spender_id = row['spender_id']
    spender_name = row['spender_name']
    spender_entity = loader.create_entity(spender_id, spender_name)
    entry_data['from'] = spender_entity

    recipient_id = row['recipient_id']
    recipient_name = row['recipient_name']
    recipient_entity = loader.create_entity(recipient_id, recipient_name)
    entry_data['to'] = recipient_entity

    # 5. create wdmmg.model.Classifier objects for region and sector...
    region_name = row['region']
    region_classifier = loader.create_classifier(region_name, u'demotaxonomy')
    sector_name = row['sector']
    sector_classifier = loader.create_classifier(sector_name, u'demotaxonomy')

    # ... and classify the entry. The necessary information will be added
    #     to entry_data
    loader.classify_entry(entry_data, region_classifier, name=u'region')
    loader.classify_entry(entry_data, sector_classifier, name=u'sector')

    # 6. save the entry
    query_spec = loader.create_entry(**entry_data)

  • for_datestrings() saves the time in a special data structure. wdmmg.lib.times contains helper functions to create it.
  • Loader.create_entity() creates a wdmmg.model.Entity object and saves it to the database automatically. If an entity with the same name already exists it will be updated. The entity object is returned so you can use it.
  • Loader.create_classifier() creates a wdmmg.model.Classifier object and saves it to the database automatically. If a classifier with the same name already exists in the same taxonomy, it will be updated. The classifier object is returned so you can use it.
  • Loader.classify_entry() uses the classifiers to classify the entry_data. Additional information will be added to the dict; the name argument becomes the key under which the classification is stored and can be used to get more information about the meaning of the classifier.
  • Finally Loader.create_entry() creates the wdmmg.model.Entry. It will be saved in the database automatically, but for performance reasons create_entry() will not return an Entry object. It returns query_spec (a MongoDB query spec). If you have to do further work on the entry you can retrieve it with model.Entry.find_one(query_spec).

Dimensions - Describe the data for the users

Sweet. The system can read your data and you can make a loader and save entries and other data. So you’re nearly there.

Most of the data you save is pretty standard. Amount, sender, recipient, currency and date are generally understandable and used across all datasets. But for the special data in the demo dataset, region and sector, you need to describe how users should interpret them. The system also needs to know how a user should be able to analyze the data.

To describe the kind of data you store in the Entries use Loader.create_dimension().

def create_dimensions(loader):
    '''Describe which extra data we save beside the common data.'''

    loader.create_dimension(
        'sector', label='Sector',
        description=('The sector the money is spent in, as described in '
                     'catalog FOO published by the department BAR. '
                     'It should be mostly accurate, but it is possible '
                     'that the money was repurposed by the recipient.'))

    loader.create_dimension(
        'region', label="Region",
        description=("The money was appropriated to be used in "
                     "the named region. It will have been given to a number "
                     "of regional bureaus. However, some of these regional "
                     "bureaus have special functions that are not related "
                     "to the region."))

Put it all together

To wire it all together, write a load function that finds the source file and performs all the steps above. It is important to call Loader.compute_aggregates() at the end; this collects additional information in the database.

import os

def load():
    '''load the demo data into Open Spending'''
    directory = os.path.dirname(__file__)
    filename = os.path.join(directory, 'demoloader.csv')

    loader = make_loader()
    create_dimensions(loader)

    rows = read_data(filename)
    for row in rows:
        save_entry(loader, row)

    loader.compute_aggregates()

The complete loader can be downloaded here.

Loaders are normally used on the command line with the command:

paster load <loader_name>

To add your loader to the load command, add an entry point for the wdmmg.load group to the setup.py of the Python package you created. In this case the demo loader is located in wdmmg/tests/demoloader.py and the function is load:

setup(...
      entry_points="""
      [wdmmg.load]
      demo = wdmmg.tests.demoloader:load
      """
      )
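
The load command discovers loaders through this entry point group. To check that your entry point is registered you can list the group yourself; this sketch uses pkg_resources from setuptools (in an environment without any wdmmg packages installed the result is simply empty):

```python
import pkg_resources

# Iterate over everything registered under the wdmmg.load group.
# Each entry point knows its name and the function it resolves to.
loaders = {ep.name: ep for ep in pkg_resources.iter_entry_points('wdmmg.load')}
print(sorted(loaders))  # lists the names of all registered loaders
```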

Tests

If you write a loader that is used by other people you should also write tests to make sure everything works as expected. There might be changes in newer versions of the source file, or in newer versions of OpenSpending, that would go unnoticed if there were no tests.

You can find a set of tests in test_demoloader.py

'''These are the tests for the demoloader that is used in
the loader howto in /doc/loader.rst
'''

from unittest import TestCase

from wdmmg.model import init_serverside_js, Repository
from wdmmg.model.mongo import db, drop_db
from wdmmg.lib.loader import Loader


class TestDemoLoader(TestCase):

    def setUp(self):
        # create the collections so we have the
        # default indexes when we start
        Repository().delete_all()
        self.db = db()
        self.db.create_collection('entry')
        self.db.create_collection('entity')
        # upload server side javascript
        init_serverside_js()
        # fixme: we have to ease the tests for new loaders

    def tearDown(self):
        Repository().delete_all()
        drop_db()

    def test_create_loader(self):
        from wdmmg.tests.demoloader import make_loader
        loader = make_loader()
        self.assertTrue(isinstance(loader, Loader))
        self.assertEqual(loader.dataset.name, 'demodata')

    def test_read_data(self):
        from os.path import dirname, join
        from wdmmg.tests.demoloader import read_data

        directory = dirname(__file__)
        filename = join(directory, 'demoloader.csv')
        rows = read_data(filename)
        self.assertEqual(len(rows), 6)
        self.assertEqual(rows[0],
                         {'amount': '1200000',
                          'date': '2010-01-01',
                          'id': 'demo-sp-001',
                          'recipient_id': 'dtlr',
                          'recipient_name': 'Department for the Regions',
                          'region': 'North Yorkshire',
                          'sector': 'Social Protection',
                          'spender_id': 'dfes',
                          'spender_name': 'Department for Education'})

    def test_complete_run(self):
        from wdmmg.model import Entry
        from wdmmg.tests.demoloader import load

        load()
        # test that we can find one of our entries
        self.assertTrue(Entry.find_one({'dataset.name': 'demodata'}))

Tips and Tricks

Default Society

Often you have entries that name the society as the spender or recipient. In that case you should not use create_entity(); use Loader.get_default_society() instead.

Unique Keys/Names

The names of Entries must be unique within their dataset, the names of Classifiers unique across all classifiers in their taxonomy, and the names of Entities unique within an OpenSpending installation.

Most of the time this is not a problem; you often have unique ids, like journal numbers, for the Entries. If you do not have them, you can:

  • use functions from wdmmgext.load.utils to create keys
  • prefix keys you have with the name of your dataset
  • for Classifiers, create a new taxonomy specific to your dataset.
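
As an illustration of the prefixing approach, here is a tiny hypothetical helper (prefixed_name is not part of wdmmg; it only shows the idea):

```python
def prefixed_name(dataset_name, raw_id):
    '''Build a name that is unique across datasets by
    prefixing the raw id with the dataset name.'''
    return u'%s-%s' % (dataset_name, raw_id)

print(prefixed_name(u'demodata', u'001'))  # demodata-001
```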

Hierarchies

If you have a hierarchy, say:

  • Department: “Department Foo”
    • Subdepartment: “Subdepartment Bar”
      • Bureau “Bureau for extraordinary affairs”
foo = loader.create_entity('foo', "Department Foo")
bar = loader.create_entity('bar', "Subdepartment Bar")
bureau = loader.create_entity('extraordinary-affairs', "Bureau for ...")

# now we have to add cross references:
foo['subdepartment'] = bar
foo.save()

bar['department'] = foo
bar['bureau'] = bureau
bar.save()

bureau['subdepartment'] = bar
bureau.save()

Note the explicit save() calls: when you change an Entity, Entry or Classifier after you have created it, you need to save it explicitly.

Alternatively you can use just 'parent' and 'child' as keys (instead of 'department', 'subdepartment' and 'bureau').
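
Sketched with plain dicts standing in for the entity objects, the generic variant looks like this (with real entities you would still call save() after setting the keys):

```python
# plain dicts stand in for loader.create_entity() results here
foo = {'name': 'foo', 'label': "Department Foo"}
bar = {'name': 'bar', 'label': "Subdepartment Bar"}
bureau = {'name': 'extraordinary-affairs', 'label': "Bureau for ..."}

# the same hierarchy as before, expressed with generic keys
bar['parent'] = foo
bureau['parent'] = bar
foo['child'] = bar
bar['child'] = bureau
```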