Getting Started

Please keep in mind that this software is under active development. The API is stabilising, but there are no guarantees! We are always happy to hear about the experiences of anyone who tries to get going with ORDF; stories of success or failure, bug reports and suggestions are all encouraged via the usual channels.

Installation

ORDF is available from a Mercurial repository:

http://ordf.org/src/

The usual way of installing is to use virtualenv and pip. Setting up a basic environment is easy:

% virtualenv /work
% . /work/bin/activate
(ordf)% pip install ordf

If you want to use the FuXi reasoner and/or RDFLib’s SPARQL support then you should now run:

(ordf)% pip uninstall rdflib
(ordf)% pip install rdflib==2.4.2

If you do not need reasoning and want to use another SPARQL implementation (e.g. 4store), then you should be able to run with the newer 3.0.0 version of rdflib.
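In that case the newer release can be installed in the same way, for example:

(ordf)% pip uninstall rdflib
(ordf)% pip install rdflib==3.0.0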

If you are using 4store, make sure to install the branch that supports locking from http://github.com/wwaites/4store and the Python bindings from http://github.com/wwaites/py4s.
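For reference, getting these installed typically looks something like the following; the 4store build follows the usual autotools pattern, but the exact steps may differ, so consult the README in each repository:

% git clone http://github.com/wwaites/4store.git
% cd 4store
% ./autogen.sh && ./configure && make && make install

% git clone http://github.com/wwaites/py4s.git
% cd py4s
(ordf)% python setup.py install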

For development, you can do this instead of simply installing the ordf package:

(ordf)% pip install mercurial
(ordf)% pip install -e hg+http://ordf.org/src/#egg=ordf

Once you have done this, you should have the ordf source checked out in /work/src/ordf.

Running the Tests

To run the tests, you need to have installed the development checkout. In the source directory, /work/src/ordf, do:

(ordf)% pip install nose
(ordf)% python setup.py nosetests --verbosity=2 -s

NOTE: to run the fourstore tests you need to have a 4store instance serving a kb called ordf_test. To do this:

% 4s-backend-setup ordf_test
% 4s-backend ordf_test

Make sure that py4s from http://github.com/wwaites/py4s is installed; at least version 0.8 is required. See also http://wiki.github.com/wwaites/py4s/installing-py4s.

NOTE: to run the rabbitmq tests, rabbitmq-server needs to be running with an exchange ordf_test that can be accessed by the user guest/guest.
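A stock RabbitMQ installation is normally sufficient for this: the guest user has full permissions on the default virtual host out of the box, and the ordf_test exchange should be declared by the tests themselves when they connect. To check that the broker is running and that guest has the necessary permissions:

% rabbitmqctl status
% rabbitmqctl list_permissions -p /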

Building Documentation

To build this documentation:

(ordf)% pip install sphinx
(ordf)% python setup.py build_sphinx

and the documentation will be in /work/src/ordf/build/sphinx/html/index.html.

Configuration

There are two usual ways of using ordf. The first is as a library in another program, for example a Pylons application. The second is via an included command line program, also called ordf. There are example configurations in http://ordf.org/src/file/tip/examples/; simple.ini is a good starting point for quickly getting going with some persistent storage.
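For instance, assuming simple.ini has been copied into the current directory and configures a local pairtree store, a graph can be fetched and saved and then printed back out with the command line tool (the flags used here are described in the Examples section below):

(ordf)% ordf -c simple.ini -s -m "first import" http://dbpedia.org/resource/Margaret_Fuller
(ordf)% ordf -c simple.ini -t n3 http://dbpedia.org/resource/Margaret_Fuller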

A Testing Environment

A very simple configuration for a Pylons application might be to do all indexing and saving directly in the web request processing thread. This might not scale very well, particularly if adding a document to an index is a time-consuming operation, but it is typical for a development environment. A fragment of an appropriate development.ini:

ordf.readers = pairtree
ordf.writers = pairtree,fourstore,xapian

pairtree.args = %(here)s/data/pairtree
fourstore.args = somekb
xapian.args = 127.0.0.1:44332

If you are using the FourStore back-end, it is important to use the locking 4store branch and to have the py4s bindings installed. Native RDFLib storage can also be used by putting rdflib in place of fourstore in ordf.writers and configuring it with:

rdflib.args = %(here)s/data/rdflib_sleepycat
rdflib.store = Sleepycat

The Xapian back-end is usually run as a network daemon using the xapian-tcpsrv command. This takes care of marshalling read and write operations to the database so that we don’t have to do it ourselves. It is possible to run directly from the filesystem but it is likely that you will experience locking errors if you try.
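For example, to serve a writable Xapian database on the port used in the fragment above (the database path here is only illustrative):

% xapian-tcpsrv --writable --port 44332 /some/where/data/xapian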

A Production Environment

A production installation will normally have a message queueing service such as RabbitMQ and a front-end interface will be configured to send messages to it. There will be several back-end storage modules that each listen to a queue and take any action required whenever a message arrives. Please refer to the RabbitMQ documentation for instructions for installing and setting up the queueing daemon.
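On most systems, once the package is installed, starting the broker in the background is simply:

% rabbitmq-server -detached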

It is important to distinguish between back-ends used for reading and for writing. For example, a typical configuration fragment from a Pylons application might be:

ordf.readers = pairtree,fourstore,xapian
ordf.writers = rabbit

pairtree.args = %(here)s/data/pairtree
fourstore.args = somekb
xapian.args = 127.0.0.1:44332

rabbit.hostname = localhost
rabbit.userid = guest
rabbit.password = guest
rabbit.connect.exchange = changes

In this setup the various back-ends are set up for read-only operation, but they are still available to the Handler singleton in the application. Any write operations, however, are sent to the message queue for processing.

Note that because pairtree is used for reading, the storage is expected to be available on the local disk, via NFS or by some other mechanism. If other indices running on another host need access to it, suitable arrangements will need to be made.

The considerations described above for 4store and Xapian in the development environment apply equally here.

At this point we have a Pylons application running and reading information from the back-ends, while any write operations sit in a queue waiting to be processed. For each back-end we need to make a configuration file and then use the ordf program to run it.

Taking the ordf.handler.rdf.PairTree back-end first, an appropriate configuration file for ordf might look something like:

[app:main]

ordf.handler = ordf.handler.queue:RabbitHandler
ordf.handler.hostname = localhost
ordf.handler.userid = guest
ordf.handler.password = guest
ordf.connect.queue = pairtree

ordf.writers = pairtree
pairtree.args = /some/where/data/pairtree

We would then run ordf like so:

ordf -c pairtree.ini -l /var/log/ordf/pairtree.log

A similar arrangement would be used for the other back-ends, the main difference being the ordf.writers directive and any arguments that the back-end requires.
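For instance, a hypothetical xapian.ini along the same lines might look like this (the queue name is a free choice, here simply matching the back-end):

[app:main]

ordf.handler = ordf.handler.queue:RabbitHandler
ordf.handler.hostname = localhost
ordf.handler.userid = guest
ordf.handler.password = guest
ordf.connect.queue = xapian

ordf.writers = xapian
xapian.args = 127.0.0.1:44332

and it would be run in the same way:

ordf -c xapian.ini -l /var/log/ordf/xapian.log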

Configuring Inferencing

Configuring inferencing is slightly complicated because it normally involves listening to one message exchange and writing to another. A configuration file for ordf might look like this:

[app:main]

ordf.handler = ordf.handler.queue:RabbitHandler
ordf.handler.hostname = localhost
ordf.handler.userid = guest
ordf.handler.password = guest
ordf.connect.exchange = reason
ordf.connect.queue = fuxi

ordf.readers = fuxi,pairtree
ordf.writers = fuxi,rabbit

fuxi.args = ordf.vocab.rdfs
pairtree.args = /some/where/data/pairtree
rabbit.hostname = localhost
rabbit.userid = guest
rabbit.password = guest
rabbit.connect.exchange = index

This takes a little explaining. There are two exchanges, reason and index. When a graph is saved, it is first sent to the reason exchange where ordf is listening with this configuration file.

The fuxi handler is an instance of ordf.handler.fuxi.FuXiReasoner and expects an already complete store containing one or more changesets and one or more up-to-date graphs that they modify. The fuxi.args is a comma-separated list of modules, each of which exports an inference_rules() function that returns rules appropriate to that module. See the Inference Engines section of this manual.

When fuxi receives the store in its put() method, it runs a production rule engine over all of the graphs that are not changesets. It then makes a changeset containing any new statements it was able to infer. It prevents the original changes from continuing to the rabbit handler and instead substitutes the changeset it has made, together with the original changes.

In this way, there will normally be two changesets – the first containing the original changes and the second containing inferred statements.

The changeset that fuxi makes may be passed back to its own put() method while it is still inside that method. This is not a problem: fuxi is aware of the situation and simply returns without recursing, allowing the rabbit handler to forward the combined changes to the index exchange.

In order to feed the inference engine a richer set of facts, fuxi needs access to other graphs that may be referenced by the original ones. For example, given this rule and data (not bothering with namespaces):

{ ?x :authorOf ?y } => { ?y :author ?x } .
:LevTolstoy :authorOf :WarAndPeace .

fuxi can be expected to produce the triple:

:WarAndPeace :author :LevTolstoy .

However, the graph containing statements about :WarAndPeace may not itself be included in the changes. Because ordf.readers lists fuxi and pairtree in that order, the :WarAndPeace graph will first be looked for in fuxi's own store and then in pairtree.
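As with the other back-ends, this reasoning configuration is run with the ordf program; saving the fragment above as, say, fuxi.ini:

ordf -c fuxi.ini -l /var/log/ordf/fuxi.log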

Examples

In addition to listening to a queue and updating an index, the ordf command line tool can be used to pull a graph from the store or to save a graph to it. The following configuration file can be used in both cases:

[app:main]

ordf.readers = pairtree
ordf.writers = rabbit

pairtree.args = /some/where/data/pairtree
rabbit.hostname = localhost
rabbit.userid = guest
rabbit.password = guest
rabbit.connect.exchange = index

This uses the message queueing system but, so long as there are no locking issues to consider, it could just as easily use a list of writers as in the development environment above.

To retrieve a graph from the network or local filesystem and save it to the store:

ordf -c cmdline.ini -s -m "import from dbpedia" \
      http://dbpedia.org/resource/Margaret_Fuller

To print out the same graph in N3 format:

ordf -c cmdline.ini -t n3 http://dbpedia.org/resource/Margaret_Fuller

It is possible to use ordf to (re)build one or more indices. For example, if there is data in a pairtree store and one decides to add 4store, a configuration file like the following, named mk4s.ini:

[app:main]

ordf.readers = pairtree
ordf.writers = fourstore

pairtree.args = /some/where/data/pairtree
fourstore.args = kbname

can be used with ordf run like this:

ordf -c mk4s.ini --reindex

to populate the new index. Only one reader may be specified in this circumstance, but any number of writers may be used as usual.