Loaders to import datasets into Open Spending ============================================= .. currentmodule:: wdmmg.lib.loader What are loaders ---------------- A loader is a way to import a :term:`dataset` into Open Spending. These loaders use one or more :term:`dataset sources` and create :term:`Entry`, :term:`Entity` and :term:`Classifier` entries in the :term:`database`. You can use whatever python can read,for example *csv*, *json*, *xls* or *xml* files. We also provide helper functions to read data directly from google docs. Note that we are working on a way to let users import *csv* files with through the Open Spending web interface. .. fixme: update status for csv importer How to write a loader --------------------- In this section you will write a *demo loader* to import a simple csv file that provides all the data we need. The minimal set of data you need for an import spendings is (named with their internal keys): ``to`` The recipient, which an :term:`Entity` (i.e. a source or sink for money). ``from`` The spender, also an :term:`Entity` . ``amount`` The amount spent ``time`` The time period to which the spending data refers. This could be years, month or days (but we usually analyze data per year. ) ``currency`` The currency the amount is quoted in. ``classifiers (optional)`` Information to classify a spending :term:`Entry`, e.g. the region department or sector it was spent on. This is not required but can be used to provide the user different ways to explore the data. Not all of this information has to be present in your data source. A common case is that for all spending data ``to`` or ``from `` is the society, or ``currency`` is alway the same, so you can hardcode values in your data loader. The :download:`demo source file <../wdmmg/tests/demoloader.csv>` is a csv file that contains one spending :term:`Entry` per line. You know that the ``currency`` of all spendings is *EUR*. The file contains the following columns: ``id`` A distinct id for every spending :term:`Entry` (a journal number/number of entry) ``spender_id`` A distinct id for the spender. ``spender_name`` The name of the spender. ``recipient_id`` A distinct id for the recipient. ``recipient_name`` The name of the recipient. ``region`` The region the money is spent on. ``sector`` The sector the money is spent on. ``date`` The day the money is spent on. ``amount`` The amount spent. Here is your data (:download:`download demoloader.csv<../wdmmg/tests/demoloader.csv>`): .. csv-table:: demoloader.csv :header: "id","spender_id","spender_name","recipient_id","recipient_name","region","sector","date","amount" "id","spender_id","spender_name","recipient_id","recipient_name","region","sector","date","amount" "demo-sp-001","dfes","Department for Education","dtlr","Department for the Regions","North Yorkshire","Social Protection","2010-01-01",1200000 "demo-sp-002","dfes","Department for Education","dtlr","Department for the Regions","North Yorkshire","Education","2010-02-01",800000 "demo-sp-003","dfes","Department for Education","society","General Public","North Yorkshire","Education","2011-03-01",500000 "demo-sp-004","dtlr","Department for the Regions","society","General Public","Hartlepool","Health","2011-04-01",1400000 "demo-sp-005","dtlr","Department for the Regions","dfes","Department for Education","Hartlepool","Heath","2010-05-01",260000 "demo-sp-006","dtlr","Department for the Regions","dfes","Department for Education","Hartlepool","Social Protection","2011-06-01",1150000 Create a Loader ''''''''''''''' .. testsetup:: demoloader from wdmmg.tests.demo import * First you create a :class:`Loader`. This is a class that lets you create :term:`Entry`, :term:`Entity` and :term:`Classifier` objects in an Open Spending site easily, quickly and reliably without a deep knowledge of the saved data structures. .. literalinclude:: ../wdmmg/tests/demoloader.py :pyobject: make_loader :linenos: * *Line 2-5*: We create the loader * ``dataset_name``: is the (internal) name of our dataset. This should be unique within the database. * ``unique_keys``: the key (or combination of keys) that can identify one spending :term:`Entry` uniquely. In this case it is enough to use the content of the id column. You will use that as the ``name`` of an entry. Do not use ``id`` on any *Dataset*, *Entry*, *Entity* or *Classifier* **ever**. ``id`` has a special meaning internally. * ``label`` The dataset label that you present to the user on the website. * ``description``: A description you can present to the user. It's not strictly required but we recommend to add 1-2 sentences. At the end you return the instanciated ``loader`` to use it later. Internally the loader automatically creates a :class:`wdmmg.model.Dataset` for you. .. doctest:: demoloader >>> loader = make_loader >>> loader.dataset.name u'demodata' See the :class:`Loader` API for the available methods and the options to create it. Read the data file '''''''''''''''''' Now you can read the csv file. To do that, create a function that you can pass the filename to (i.e. You don't need to write a second function to read a second, similar file, can use the same function multiple times). This is especially useful if you want to write tests for your data loader. .. literalinclude:: ../wdmmg/tests/demoloader.py :pyobject: read_data :linenos: * *Line 6*: Open the file * *Line 7*: Use it to create a :class:`csv.DictReader`. It will create a list with a ``dict`` for every line in the file. The first line, the headers, will be used as keys in the ``dict``. See the python documentation for details how it works. * *Lines 8-14*: Convert all values from the csv file to ``unicode`` values. Otherwise you may run into encoding errors later if the file contains non-ascii characters like french accent or german umlauts. * *Line 16*: Return the ``rows`` .. note:: If you want to work with massive amounts of data, you want to use generator functions in many places and ``yield`` the single rows for performance reasons. Now you have a list with one dict for every row: >>> filename = '../wdmmg/tests/demoloader.csv' >>> rows = read_data(filename) >>> first_row = rows.next() >>> first_row {'amount': '1200000', 'date': '2010-01-01', 'id': 'demo-sp-001', 'recipient_id': 'dtlr', 'recipient_name': 'Department for the Regions', 'region': 'North Yorkshire', 'sector': 'Social Protection', 'spender_id': 'dfes', 'spender_name': 'Department for Education'} Save the Entries into the database '''''''''''''''''''''''''''''''''' Now that you have the dictionaries with the data you can save it into the database. You have to refine the data a bit as the DictReader returns dicts where the keys are the column headers, and all values are strings. But: 1. The column ``id`` in the table will be the ``name`` of the *entry* you create. 2. The ``amount`` is a number. 3. ``date`` is a date. The main date of an *Entry* (the :term:`time axis` of the *dataset*) is treated specially. It is saved on the *Entry* with the key ``time`` and in a special data structure. 4. ``spender`` and ``recipient`` are not only text. In the *dataset* they are *Entities*. A :term:`Entity` is a person or a thing (e.g. an administrative region, a company etc) that can spend or receive money. *Entities* are not bound to a special *dataset*. E.g. one *entity* that we want to be unique across all datasets is the *society* (see :ref:`Default Society `) 5. ``region`` and ``sector`` are not only text. They can be used to *classify* each entry. Users will be able to navigate the data with the help of classifiers. A :term:`Classifier` is not bound to a dataset, it is part of a :term:`taxonomy`. E.g. the European Union has a system to classify spending entries of their Organisations and members (COFOG). Usually you reuse an existing *taxonomy* or create *one* custom to your *dataset*, but you could even create *Classifiers* in different *taxonomies*. Here we show you how to create custom ones. So let's prepare the data and save *Entry*, *Entities* and *Classifiers*. .. literalinclude:: ../wdmmg/tests/demoloader.py :pyobject: save_entry :linenos: * *Line 19*: Save the time in a special datastructure. :mod:`wdmmg.lib.times` contains helper functions to create it. * *Line 26, 31*: :meth:`Loader.create_entity` creates a :class:`wdmmg.model.Entity` object and saves it to the database automatically. If a *entity* with the same name exists already it will be updated. The *entity* object is returned so you can use it. * *Line 36, 38*: :meth:`Loader.create_classifier` creates a :class:`wdmmg.model.Classifier` object and saves it to the database automatically. If a *classifier* with the same *name* exists already in the same *taxonomy*, it will be updated. The *entity* object is returned so you can use it. * *Line 42, 43*: Use the *classifiers* to classify the *entry_data*. Additional information will be added to the dict. *name* to become the key of the classifier. This can be used to get more information about the meaning of the classifier. * *Line 46*: Finally create the :class:`wdmmg.model.Entry`. It will be saved automatically in the database but *create_entry* will not return an *Entry* object for performance reasons. It will return *query_spec* (a :term:`mongodb query spec`). If you have to do further work on the *entry* you can retrieve it with ``model.Entity.find_one(query_spec)``. Dimensions - Describe the data for the users '''''''''''''''''''''''''''''''''''''''''''' Sweet. The system can read your data and you can make a loader and save entries and other data. So you're nearly there. Most of the data you save is pretty standard. *Amount*, *sender*, *recipient*, *currency* and *date* are generally understandable and used across all datasets. But for the special data in the demo dataset, *region* and *sector*, you need to describe how users should interprete them. The system also need to know how a user should be able to analyze the data. To describe the kind of data you store in the Entries use Loader.create_dimension(). .. literalinclude:: ../wdmmg/tests/demoloader.py :pyobject: create_dimensions :linenos: Put it all together ''''''''''''''''''' To wire it all together, write a *load* function that finds the source file and does all the things above. It's important to call :meth:`Loader.compute_aggregates` at the end. This will collect additional information in the database. .. literalinclude:: ../wdmmg/tests/demoloader.py :pyobject: load :linenos: The complete loader can be :download:`downloaded here<../wdmmg/tests/demoloader.py>`. Loaders are normally used on the command line with the command:: paster load To add your loader to the load command, add an entry point to the python package you created in the *setup.py* for ``wdmmg.load``. In this case the demo loader is located in wdmmg/tests/demoloader.py and the function is ``load``: .. code-block:: python setup(... entry_points=""" [wdmmg.load] demo = wdmmg.tests.demoloader:load """ ) Tests ----- If you write a loader that is used by other people you should also write tests to make sure everything works as expected. There might be changes in newer versions of the source file or in newer versions of OpenSpending that will be unnoted if there are no tests. .. note: fixme: The tests are incomplete and should be extended. You can find a set of tests in :download:`test_demoloader.py <../wdmmg/tests/test_demoloader.py>` .. literalinclude:: ../wdmmg/tests/test_demoloader.py Tips and Tricks --------------- .. _defaultsociety: Default Society ''''''''''''''' Often you have *entities* that name the *Society* as the spender or recipient. You should not use ``create_entity()`` then but use :meth:`Loader.get_default_society`. Unique Keys/Names ''''''''''''''''' The names for *Entries* must be unique for the dataset, the name of *Classifiers* unique across all classifiers in the *taxonomy* and the name of *Entities* unique within an OpenSpending installation. This will most of the time be no problem. You often have unique id's like journal numbers for the *Entries*. If you do not have them, you can * use functions from :mod:`wdmmgext.load.utils` to create keys * prefix keys you have with the name of your dataset * For *Classifiers* create a new taxonomy specific for your dataset. Hierarchies ''''''''''' If you have a hierarchy, say: * Department: "Department Foo" * Subdepartment: "Subdepartment Bar" * Bureau "Bureau for extraordinary affairs" .. code-block:: python :linenos: foo = loader.create_entity('foo', "Department Foo") bar = loader.create_entity('bar', "Subdepartment Bar") bureau = loader.create_entity('extraordinary-affairs', "Bureau for ...") # now we have to add cross references: foo['subdepartment'] = bar foo.save() bar['department'] = foo bar['bureau'] = bureau bar.save() bureau['subdepartment'] = bar bureau.save() *Line 7, 11, 14*: When you change an *Entity*, *Entry* or *Classifier* after you have created it, you need to save them explicitly. Alternatively you can only use 'parent' and 'child' as keys (instead of 'department', 'subdepartment' and 'bureau').