
Welcome to Underverse!

Underverse is named after the destination and afterlife of the Necromongers from The Chronicles of Riddick. This module is so named because the author likes the movie.

The Underverse is an Object Document Mapper, or ODM. An ODM performs roughly the same functions for unstructured data as an ORM (Object Relational Mapper) does for relational data, minus the ‘relationships’...

In Python, SQLAlchemy is the ORM of choice. The Underverse follows the same design principles seen in SQLAlchemy, so if you’ve used it before, this module should feel familiar.

Note

Underverse is now beta due to the arrival of v0.3. However, there may still be uncovered bugs or unexpected errors waiting to be found. I’m continually working to increase the test coverage and to update the docs.

If bugs are found, please join the Google Group and post the issue. Suggestions are also welcome.

Warning

Underverse underwent a tremendous maturation process between versions 0.2.3 and 0.3 and again between 0.3.6 and 0.4.0. Therefore, if you downloaded an earlier version, it would greatly benefit you to upgrade to the latest version.

If you are using any mapreduce functionality, version 0.4.0 will likely fix any issues you have found.

GitHub

This module can now be found on GitHub.

What is Underverse?

At its core, the Underverse is a JSON-based document store built on top of SQLite. JSON was chosen over pickle because of its size, simplicity, readability and speed. These features, combined with SQLite, make for a simple and elegant solution. Everything in the Underverse is written to use Python’s generators to provide a streamlined process for data access and manipulation. The module provides ways to quickly load, update, group, map and filter data.
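The JSON-on-SQLite idea can be sketched in a few lines of plain Python. This is NOT Underverse's actual schema, just an illustration of the approach: documents are serialized to JSON text, stored in an ordinary SQLite table, and streamed back out with a generator that decodes one row at a time.

```python
import json
import sqlite3

# a throwaway in-memory database with a single text column for the JSON body
conn = sqlite3.connect(':memory:')
conn.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, body TEXT)")

# a document is just a dict, serialized to JSON before insertion
doc = {'name': 'ed', 'fullname': 'Ed Jones'}
conn.execute("INSERT INTO docs (body) VALUES (?)", (json.dumps(doc),))

# a generator streams documents back out, decoding lazily per row
def all_docs(conn):
    for (body,) in conn.execute("SELECT body FROM docs"):
        yield json.loads(body)

docs = list(all_docs(conn))
```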

To avoid some misconceptions, here’s what it is and what it’s not:

Underverse is:

  • Fast, easy, simple and clean
  • A zero-configuration solution
  • Extremely light-weight
  • Specialized
    • Think of a Knife, not a Low Orbit Ion Cannon
    • Agile, not encumbered by over-bearing features
  • Lean, very lean
  • A wicked-sick, data-analysis module
  • NOT a do-everything module
  • NOT into keeping up with the Joneses
  • NOT a NoSQL client-server solution
    • if you want one, you’d have to wrap it yourself (or use one that already works and save yourself a headache)
  • NOT a full NoSQL solution (at least not in the traditional sense)
    • it has no indexes
    • it has no built-in replication
    • it has no sharding-capability
  • NOT a solution for anyone with high-availability needs (again, not a client-server solution)
    • However, if an on-disk SQlite database is good enough for you, awesome.

The Underverse was designed to be an analysis module. It was engineered to be a defense against low-strafing questions on Friday at 4:45 PM. It was built to answer questions FAST. With this in mind and by using NoSQL / post-modern document storage principles, the Underverse can evolve over time, allowing for increasingly deeper questions to be asked of the data it houses.

With this in mind, Underverse has now reached a stable enough development status that it could be used as the backend for websites and other data storage platforms.

“You keep what you kill.” - the late Lord Marshal

Introduction

These next few sections outline possible points of familiarity for those who have worked with relational databases or other NoSQL data stores. I’ve also included a section comparing Underverse and SQLAlchemy. As I’ve said before, if you’ve used SQLAlchemy before then you should see some similarities.

Organization / Lingo

Here are the relationships between the objects found in Underverse and a traditional RDBMS, as well as their counterparts in NoSQL-speak:


RDBMS      NoSQL-speak   Underverse
Database   Database      Underverse
Table      Collection    Verse
ResultSet  -             SubVerse
ResultRow  Document      NecRow

SQLAlchemy Comparison

Underverse’s query syntax is similar to SQLAlchemy’s. However, Underverse removes some of the complexity of getting started.

But before I go any further, understand that Underverse is NOT meant to battle SQLAlchemy in hand-to-hand combat. There are many, many things that SQLAlchemy does that are top-notch in the ORM world, most of which will never be covered in Underverse. I have really enjoyed using SQLAlchemy in the past. However, some, perhaps even most, projects can get away with using a much lighter-weight approach to managing data. Underverse is meant to fill that niche.

Here’s a comparison for those of you who are familiar with SQLAlchemy:

SQLAlchemy:

from sqlalchemy import create_engine, Column, Integer, String
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import scoped_session, sessionmaker

Base = declarative_base()

class User(Base):
    __tablename__ = 'users'

    id = Column(Integer, primary_key=True)
    name = Column(String)
    fullname = Column(String)
    password = Column(String)

    def __init__(self, name, fullname, password):
        self.name = name
        self.fullname = fullname
        self.password = password

    def __repr__(self):
        return "<User('%s','%s', '%s')>" % (self.name, self.fullname, self.password)

session = scoped_session(sessionmaker(bind=create_engine('sqlite:///:memory:', echo=False)))

ed_user = User('ed', 'Ed Jones', '3d5_p455w0r6')
session.add(ed_user)

Let’s take a look at everything that’s happening. Before you can do anything, you have to define a data model. In the example, the User class is defined with name, fullname and password fields.

Next, a session is created to operate on. Then a User object is instantiated and finally added to the session.

Underverse:

from underverse import Underverse

uv = Underverse()
users = uv.users

ed_user = { 'name': 'ed', 'fullname':'Ed Jones', 'password':'3d5_p455w0r6'}
users.add(ed_user)

Ok, at first glance you can see a large difference in the lines of code required. Let’s talk about what Underverse is doing.

When uv = Underverse() is called, a connection to an in-memory SQLite database is made behind the scenes. Then a table is created by calling users = uv.users. No class structure or data model is needed to insert data. Then to insert data, you simply call users.add(ed_user).

However, with the advent of v0.3, class objects are officially supported. Therefore, you can also do this:

uv = Underverse()
test = uv.test

class User(object):
  def __init__(self, name, fullname, password):
      self.name = name
      self.fullname = fullname
      self.password = password

ed_user = User('ed', 'Ed Jones', '3d5_p455w0r6')
test.add(ed_user)

Storing class instances is as painless as can be. There is NO need for a base class (unless you have need of one, of course).

Note

Yes, I understand that SQLAlchemy is an ORM and Underverse is not. The above was simply to show the differences in loading data.

Please, don’t send me hate mail.

Querying Preview

Using the same SQLAlchemy code above, here’s how querying is handled in both frameworks. Again, this is NOT a replacement for SQLAlchemy, but rather a lighter-weight approach to data storage and retrieval.

SQLAlchemy:

session.query(User).filter(User.fullname=='Ed Jones')

This statement finds all rows in the users table where fullname == 'Ed Jones'.

Similarly, Underverse does this:

from underverse.model import Document
users.find(Document.fullname == 'Ed Jones')

When using Underverse, the Document class is the main query object. It handles all document queries for every document collection. There’s no need to create a class for every table or collection.

Additional Key-Value Functionality

Underverse now has basic key-value storage methods. This can be used for many things; for example, storing script options alongside the module’s document storage capabilities.

Here’s a preview of the key-value functionality. The KeyValue class will have a more detailed description of the existing capability.

uv = Underverse()
options = uv.options

# Add key by using the ``put`` method
options.put('key', 'value')

# Get the associated value
print options.get('key')

#This raises a KeyError.
print options.get('key2')

MapReduce Functionality

Note

v0.4.0 is the version of choice for MapReduce operations. This version contained several bug fixes and added capability for MapReduce.

Underverse has a powerful grouping capability built in. However, if you need more specific control over what is grouped or aggregated, then MapReduce is for you.

This module has 3 main functions for MapReduce:

  • map: aggregates documents based on a function
  • reduce: calls a reduce function for each group returned from a map
  • simple_reduce: simply calls a function to operate on the entire dataset

For convenience, Underverse also has a mapreduce function. This chains the map and reduce functions together for ease of use.
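The chain described above can be sketched in plain Python (this is conceptual, not Underverse's API): the map step buckets documents under the keys a mapper yields, and the reduce step collapses each bucket to a single value.

```python
# bucket (key, value) pairs yielded by a mapper function
def run_map(mapper, docs):
    groups = {}
    for key, value in mapper(docs):
        groups.setdefault(key, []).append(value)
    return groups

# collapse each bucket with a reducer function
def run_reduce(groups, reducer):
    return {key: reducer(values) for key, values in groups.items()}

docs = [{'name': 'ed', 'age': 30},
        {'name': 'ed', 'age': 30},
        {'name': 'mary', 'age': 25}]

# mirrors the doc's mapper style: yield a (key, value) tuple per document
def name_age_mapper(docs):
    for doc in docs:
        yield (doc['name'], doc['age']), doc

# reducing each group with len counts documents per (name, age) key
counts = run_reduce(run_map(name_age_mapper, docs), len)
# counts[('ed', 30)] == 2, counts[('mary', 25)] == 1
```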

See also

See the map, reduce and simple_reduce documentation below for examples.

Module Documentation

class underverse.Underverse(filename=None)

This class is the core of the module. It provides the interface for either an in-memory or on-disk SQLite database. In-memory databases are noticeably faster than on-disk files because of disk IO. However, this module provides functionality both to dump an in-memory database and to load data into one.

Getting started is as easy as:

# importing the module
from underverse import *

# creates a connection to the ether
uv = Underverse()

An on-disk data store can be created like so:

uv = Underverse('helion.db')

Note

The extension doesn’t matter, but traditionally SQLite databases are saved as either .db or .sqlite.

static add_json_ext(encoder, decoder, clazz)

Allows user to define custom JSON encoding and decoding.

Note

This is an advanced feature. This is only useful if you want certain objects to be encoded in a custom way. Most users will probably never need to use this functionality.

By using jsonpickle, the object encoding should be more robust and simpler.

# Here's a possible use case:
# The following code transforms a NumPy array into a JSON list and back again
# during the encoding and decoding process

import numpy as np

# the object is a numpy.ndarray; it's converted to a Python list before encoding
numpy_encoder = lambda obj: obj.tolist()

# the list is converted back to a numpy.ndarray object when decoding
numpy_decoder = lambda obj: np.array(obj)

Underverse.add_json_ext(numpy_encoder, numpy_decoder, np.ndarray)

User Defined Encoders:

With the advent of version 0.4.0, the object encoding and decoding process became much simpler. Instead of a home-grown solution for object encoding, I have opted to use jsonpickle.

Now, encoders must be a callable object, such as a lambda or function.

User Defined Decoders:

Decoders must also be callable objects.

Warning

The object encoding and decoding changed several times between versions 0.3.0 and 0.4.0. This functionality should be stable now that a good solution has been found by using jsonpickle.

close()

Closes the database connection

uv.close()
static create_mappers(*_cls)

This function is now defunct as of v0.4.0.

jsonpickle now handles the object encoding and decoding. Python objects should work automatically; however, if one doesn’t, use either add_json_ext OR the new register functionality.

dump(filename)

Dumps the underverse to a .sql file which can be loaded in the future.

uv.dump("backup.sql")
load(filename)

Loads a persisted underverse. Using this method to populate a collection is substantially faster than loading all the data using the add method.

uv.load("backup.sql")
new(name)

You can also create a table like this:

#this creates a new collection called 'data'
verse = uv.new('data')

#perhaps an easier and more elegant method is the following
table2 = uv.table2

The above statements create a two-column table in the SQLite data store.

static register(clazz, handler)

This registers a custom object handler with jsonpickle.

NumPy arrays are handled by default; nevertheless, this is how it works:

# import handlers
from underverse import handlers
import jsonpickle
import numpy as np

# create a custom handler for the encoding and decoding of NumPy arrays
class NumpyHandler(jsonpickle.handlers.BaseHandler):

  # 'flatten' encodes an object into a JSON compatible object
  # In this example, the NumPy array is converted to a list before encoding
  def flatten(self, obj, data):
    data['__numpy__'] = obj.tolist()
    return data

  # 'restore' has to perform the opposite function as 'flatten'
  # Here, a list is converted back to a NumPy array
  def restore(self, obj):
    import numpy as np
    return np.array(obj['__numpy__'])

# now the handler is registered
Underverse.register(np.ndarray, NumpyHandler)

# testing the encoding
pickled = jsonpickle.encode(np.arange(5))
print pickled               # {"py/object": "numpy.ndarray", "__numpy__": [0, 1, 2, 3, 4]}

# testing the decoding
unpickled = jsonpickle.decode(pickled)
print unpickled               # [0 1 2 3 4]
class underverse.Verse(connection, name)

A Verse is a class which represents a collection of similar data. If you’re familiar with traditional relational databases, it’s like a table. It can also be described as a document collection. In the background, it’s actually a SQLite table. This class allows for standard CRUD (create, read, update and delete) operations.

uv = Underverse()

#there are two ways to create a new document collection
members = uv.members

#or...
accounts = uv.new("accounts")

The code above creates a new collection if it doesn’t currently exist, or connects to an existing collection.

add(necro)

Adds a NecRow or a list of NecRows to the database.

uv = Underverse()
table = uv.helion

#you can either add one row at a time
table.add({'a':1, 'b': 2})

# or do bulk inserts
array = [
  {'a':1,'b':2,'c':3},
  {'a':4,'b':5,'c':6},
  {'a':7,'b':8,'c':9}]

table.add(array)

Performance Hint

Bulk inserts are noticeably faster.

add_column(array, name, commit=True)

Adds an array with column names to the dataset.

Note

The given array must be the same size as the entire collection. If this doesn’t fit with what you are trying to do, just use the included update functionality.

all()

Returns a SubVerse object containing all objects in the verse / collection

find(*filters)

Searches verse for documents that meet all the predicates given. A SubVerse instance is returned containing all the found documents. This is how querying is handled in the Underverse.

Usage:

#create in-memory database
uv = Underverse()

# this loads a previous dump of an Underverse instance (`uv.dump("backup.sql")`)
uv.load("backup.sql")

#Document instances are the 'query' objects in the Underverse.
#Unlike SQLAlchemy, each table doesn't need to have a model.
#The Underverse uses one model object for all tables...
#The 'country_code' is the attribute (or 'column' in traditional RDBMS) you are searching for.
#In this case, the coder is searching for all members in the US.
for person in uv.members.find(Document.country_code == "US"):
  print person

  #calculate stats
  #...

#closes the connection
uv.close()

Look at the Document documentation for all supported query operators.

find_one(*filters)

Searches the verse for the first document that satisfies all of the given predicates.

from_array(array, names, commit=True)

Adds an array with column names to the dataset.

uv = Underverse()
table = uv.collection

array = [[1,2,3],[4,5,6],[7,8,9]]

table.from_array(array, names=['x', 'y', 'z'])
get(key)

This gets a document based on the key or UUID.

from underverse import Underverse, KeyValue

uv = Underverse()
options = uv.options

options.put('key', 5)

option = options.get('key')
print option.value

# Key-Value pairs can also be used like any other documents...
for o in options:
  print o
  o.value2 = o.value * 2
  options.update(o)

# Basically, the following line is doing the same thing as a KeyValue instance
# n = NecRow({'uuid':'key', 'value':5})
glob(globs, case_sensitive=True, operand=' and ')

Searches the verse using SQLite GLOB statements. This is used to limit the number of documents being queried.

By importing and_ or or_ from underverse, you can perform more specific glob operations.

Example usage:

For example, say you have an unstructured CSV (comma separated values) file with network traffic. The file has several different types of logging messages with both received and transmitted messages. You need to search for a specific logging message, but there’s a lot of data and converting all the extra messages would be pointless and time consuming. Here’s how you would filter out the extra messages and then group by IP address.

uv = Underverse()
uv.load("logs.sql")
logs = uv.logs

#finds all WARNING messages that were received
warnings_received = logs.glob('rxnet', 'WARNING')

# groups the warnings by IP
for ip, warnings in warnings_received.groupby("ip"):
  print '\nIP Address: ' + str(ip)

  for warning in warnings:
    print ' - Description: ' + str(warning.desc)

Performance Hint

Use this whenever you can to gain a slight performance boost. Because this function uses SQLite’s GLOB functionality, the documents can be searched before they are converted to Python dictionaries.

However, querying and globbing speed can be increased by intelligently ordering the query parameters. In Underverse, if one condition fails, the record is skipped without processing the other conditions. Therefore, put the stricter conditions first.

# globbing - 2.297s
warnings_received = logs.glob('WARNING', 'rxnet')

# faster globbing - 1.141s
warnings_received = logs.glob('rxnet', 'WARNING')

# querying - 2.313s
warnings_received = logs.find(D.Message_Label == "WARNING", D.rx_tx_net == 'rxnet')

# faster querying - 2.218s
warnings_received = logs.find(D.rx_tx_net == 'rxnet', D.Message_Label == "WARNING")

The code above searched 50k+ network messages and the times shown are the min of 3 runs.
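The reason globbing is cheap can be sketched with the standard sqlite3 module (illustrative only; this is not Underverse's internal query). GLOB runs against the raw JSON text inside SQLite, so only matching rows ever pay the cost of being decoded into Python dictionaries:

```python
import json
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute("CREATE TABLE logs (body TEXT)")
rows = [
    {'dir': 'rxnet', 'level': 'WARNING', 'desc': 'late packet'},
    {'dir': 'txnet', 'level': 'WARNING', 'desc': 'retry'},
    {'dir': 'rxnet', 'level': 'INFO', 'desc': 'ok'},
]
conn.executemany("INSERT INTO logs VALUES (?)",
                 [(json.dumps(r),) for r in rows])

# SQLite scans the raw text; only rows matching both GLOB patterns
# are ever handed to json.loads
matched = [json.loads(body) for (body,) in conn.execute(
    "SELECT body FROM logs WHERE body GLOB ? AND body GLOB ?",
    ('*rxnet*', '*WARNING*'))]
# matched holds only the received WARNING row
```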

groupby(*attrs)

The grouping of similar data is essential to most data analysis operations. This function is similar in nature to the map function, with perhaps more readable syntax.

The groupby function doesn’t have as much power as the map function; it simply aggregates data based on one or more attributes.

The map capability allows for more freedom, with the possibility of calculated key-value pairs.

uv = Underverse()

members = uv.members

#insert data
#...
members.add(list_of_ppl)

for state, inhabitants in members.groupby('state'):
  print "State: %s" % state
  print " - Population: %s" % len(inhabitants)

The code above will print all the states that have any members as citizens. The ‘inhabitants’ variable is a SubVerse instance containing all the citizens for the given state.

You can group by more than one column as well, such as members.groupby('state', 'county').

join(right, alias)

This performs join operations on the current Verse.

limit(count)

Returns a user-defined limited number of documents.

docs = uv.data.limit(50)

Or..

docs = uv.data.find(Document.age > 25).limit(50)

Note

You can also chain the limit and skip functions together.

map(function)

Calls a map function on the entire collection.

Mapper functions must return or yield a two-value tuple. This tuple represents a key-value pair. The key will be used to aggregate similar data. The value can be anything, but remember you will have a list of these values for every key.

#groups documents by name and age
def name_mapper(array):

  # iterates through all the documents and yields name, age and the entire document
  # depending on what you are trying to accomplish, the document may not need to be returned
  # returning a '1' as the value instead of the document could be used with the 'sum' function to count as well
  for doc in array:
    yield (doc.name, doc.age), doc

# iterates through the map results
for name, age, ppl in uv.test.map(name_mapper):

  # ppl is a list of documents with the same name and age
  print name, age, len(ppl)

Note

Bug fixes in v0.4.0

mapreduce(mapper, reducer, expand=True, sort=False)

Calls map and reduce functions on the entire collection.

#groups documents by name and age
def name_mapper(array):

  # iterates through all the documents and yields name, age and the entire document
  # depending on what you are trying to accomplish, the document may not need to be returned
  # returning a '1' as the value instead of the document could be used with the 'sum' function to count as well
  for doc in array:
    yield (doc.name, doc.age), doc

# iterates through the mapreduce results
for name, age, count in test.mapreduce(name_mapper, len):

  print name, age, count

The code above aggregates the names and ages together, then calls len on the grouped data.

The expand option expands the keys from a map operation into a tuple for easy access along with the data it grouped.

The sort option has five possible values:

  • True: uses the built-in Python sorted function to sort the results
  • map: sorts the results on the mapped keys
  • reduce: sorts the results on the reduced values
  • -map: sorts the results on the mapped keys in reverse
  • -reduce: sorts the results on the reduced values in reverse

The desc or negative options allow for operations such as Top 5 most frequent ....

orderby(*attrs)

Orders the collection by one or more columns

uv.docs.orderby('name', '-age')

The orderby functionality now has ASC and DESC capability. Descending order is achieved by prepending a ‘-’ (negative sign or hyphen) to the column name.

However, you can also do this:

uv.docs.find(D.orderby('name', '-age'))
paginate(count)

Pages the collection.

If a collection has 5000 documents, calling verse.paginate(500) will return 10 pages of 500 documents each.
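The paging arithmetic can be sketched with a small, hypothetical helper (plain Python, not Underverse's implementation): slice the collection into fixed-size pages, with any remainder landing on the last page.

```python
# split a sequence into pages of `count` documents each
def paginate(docs, count):
    return [docs[i:i + count] for i in range(0, len(docs), count)]

# 5000 documents paged by 500 yields 10 full pages
pages = paginate(list(range(5000)), 500)
```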

purge(vacuum=False)

Removes all data in collection.

If you are using an on-disk database, you will notice that after the data has been purged, the size on disk is not smaller. This is due to SQLite’s handling of its data structures. The storage space is left to be used by future documents.

To remove the data from the disk permanently, use the vacuum option.

verse.purge(True)

Note

Vacuuming the database is an expensive task. You might want to use it sparingly.

purify(*cols)

Converts the given attributes to a NumPy recarray. This can provide for an easy transition to NumPy.

Warning

The ‘purify’ function requires that NumPy is installed.

Note

Design History:

Just as an FYI, before Underverse was started, I worked on another analysis module called bops. bops has similar functionality to Underverse; however, it is limited to rectangular datasets. Bops used NumPy as its foundation and was blazing fast because of it. However, it was also a memory hog. Many of the design features seen in Underverse were originally ‘fleshed out’ in bops.

I have rewritten the core of bops and added it as an extension to Underverse. It is called the QuasiDead.

This extension allows for a subset of a Verse to be shoved temporarily into NumPy to give the coder access to the added functionality NumPy provides. In order for this to work, the chosen attributes are sliced from the document collection and molded into a structured data set which NumPy can handle.

Each QuasiDead instance can also be grouped, ordered and filtered.

Performance Hint

‘Purifying’ data does not result in any noticeable speed increase in the grouping, filtering and sorting of data. To the contrary, it’s been my experience that Python’s built-in generators outperform NumPy. This is mainly due to the cost of creating a recarray from the document store. Therefore, the only reason to ‘purify’ your data would be to make use of some of NumPy’s functionality.
uv = Underverse()
uv.load('data.sql')

data = uv.data

# creates a numpy recarray with 4 columns captured from the documents in the 'data' collection
qd = data.purify('name', 'age', 'gender', 'country')

# the new QuasiDead instance has limited functionality outside of its core ``numpy.recarray``.
# However, grouping and ordering can be achieved the same way as with a document collection in Underverse.
for country, citizens in qd.groupby('country'):
  print country, len(citizens)

# numpy string array
names = qd.name

See also

Please look at the QuasiDead documentation for more information on what this unique class can be used for.

put(key, value)

This adds a KeyValue instance to the collection.

uv = Underverse()
options = uv.options

# Add key by using the ``put`` method
options.put('key', 'value')

Note

Enumerations

Enumerations can be created easily by using the key-value pair functionality.

uv = Underverse('data.db')
options = uv.options

# ENUMs can be created like this
options.put('DEVMODES', {'TEST':0,'DEV':1,'PROD':2})

# ENUMs can be retrieved like this
DEVMODES = options.get('DEVMODES')
print DEVMODES.PROD

# Or like this.
DEV = options.get('DEVMODES.DEV')
reduce(mapper_results, reducer, expand=True, sort=False)

This function ‘reduces’ the data returned from each map group. Reducers are meant to return a single value per group. However, due to Python’s typing you can return a list, dictionary or tuple, because they are objects themselves.

A few points are worth making. First, map results can be stored for later, meaning multiple reducers can run on the same results. Second, all reduce and mapreduce functions have both expand and sort options.

The expand option expands the keys from a map operation into a tuple for easy access along with the data it grouped.

The sort option has five possible values:

  • True: uses the built-in Python sorted function to sort the results
  • map: sorts the results on the mapped keys
  • reduce: sorts the results on the reduced values
  • -map: sorts the results on the mapped keys in reverse
  • -reduce: sorts the results on the reduced values in reverse

The descending or negative options allow for operations such as Top 5 most frequent ....
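The point about reusing stored map results can be sketched in plain Python (conceptual, not Underverse's API): group once, then run as many reducers as you like over the same buckets.

```python
# bucket (key, value) pairs yielded by a mapper function
def run_map(mapper, docs):
    groups = {}
    for key, value in mapper(docs):
        groups.setdefault(key, []).append(value)
    return groups

docs = [{'state': 'TX', 'age': 30},
        {'state': 'TX', 'age': 40},
        {'state': 'CA', 'age': 25}]

# one stored map result: ages bucketed by state
groups = run_map(lambda ds: ((d['state'], d['age']) for d in ds), docs)

# several reducers over the same buckets, no re-mapping needed
counts = {k: len(v) for k, v in groups.items()}   # population per state
oldest = {k: max(v) for k, v in groups.items()}   # max age per state
```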

Note

Bug fixes in v0.4.0

remove(necro)

Removes a document or a list of documents.

This function operates just like the ‘add’ and ‘update’ operations. It accepts either a single document or a list of documents.

simple_reduce(function)

Calls a reduce function on the entire collection.

Note

Bug fixes in v0.4.0

skip(count)

Returns all documents after the user-defined number have been skipped.

docs = uv.data.skip(50)

Or..

docs = uv.data.find(Document.age > 25).skip(50)

Note

You can also chain the limit and skip functions together.

unique(*attrs)

Finds all the unique values for one or more columns

uv.verse.unique('name', 'age')
update(necro)

Updates a document or a list of documents.

One possible usage..

Say you have a website where members can comment on blog posts. You want to run some stats on how many posts each user has commented on. But when you were building your site you didn’t think about that...

Let’s say that you have already loaded your members into an Underverse instance. Being the brilliant coder you are, you have also given each member a list of their comments as well. You just want to add a comment count. Here’s one way to do it.

# connect to SQlite database
uv = Underverse("web.db")

#select members table
members = uv.members

#update every member in collection
for member in members:

  #assuming each member has a list of comments under member.comments
  #the same syntax is used for adding a new attribute or updating an existing attribute
  member.comment_count = len(member.comments)

  #update document: collection.update(document)
  members.update(member)

Bulk operations are normally faster. This is how you could do the same thing using bulk updates.

uv = Underverse("web.db")
members = uv.members

#create a list of documents to be updated
updated = []

#update every member in the collection
for member in members:

  #assuming each member has a list of comments under member.comments
  #the same syntax is used for adding a new attribute or updating an existing attribute
  member.comment_count = len(member.comments)

  # add member to the list to be updated
  updated.append(member)

  # incremental updates can be achieved like so
  # updating every 2500 documents
  if len(updated) % 2500 == 0:
    members.update(updated)

    #don't forget to clear the queue
    updated = []

# commit any remaining documents in queue
members.update(updated)

Performance Hint

Bulk updates are faster because SQLite is handling more of the work. Any time you can limit function calls, do it.
class underverse.SubVerse(division)

If you are familiar with database programming, you can think of SubVerse as something similar to a ResultSet object. Any time a query is made of a Verse (or collection) object, this is what is returned. Therefore, this object contains most of the logic in terms of filtering, grouping and mapping data.

However, you can also use it without having to load data into an Underverse instance. It will operate on a list of dicts out of the box.

>>> dicts = [{'name':'ed', 'fullname':'Ed Jones', 'password':'3d5_p455w0r6'},
...     {'name':'wendy', 'fullname':'Wendy Williams', 'password':'foobar'},
...     {'name':'mary', 'fullname':'Mary Contrary', 'password':'xxg527'},
...     {'name':'fred', 'fullname':'Fred Flinstone', 'password':'blah'}]

>>> for user in SubVerse(dicts).find(Document.name == 'ed'):
...     print user
{'fullname': 'Ed Jones', 'password': '3d5_p455w0r6', 'name': 'ed'}

Notice that the found document doesn’t have UUID, created_at or updated_at keys. This is because it was never loaded into SQLite.

find(*filters)

Filters a list of NecRows, returning the documents that satisfy all of the given predicates.

See also

You can look at the find function for Verse to get an example on how to query.

find_one(*filters)

Executes the same way as the find function, but only returns the first document.

Returns None if no document is found matching conditions

See also

Look at the find_one function for Verse to see an example.

groupby(*attrs)

Grouping data can be extremely powerful in data analysis. Therefore, data grouping and aggregation in Underverse does not hold back nor does it disappoint.

See also

Please look at the groupby functionality for a Verse for a more comprehensive usage explanation.

join(right, alias)

This performs join operations on the current SubVerse.

limit(count)

Returns a user-defined limited number of documents.

map(function, wrap=True, expand=True)

Calls a map function on the subverse.

Mapper functions must return or yield a two-value tuple. This tuple represents a key-value pair. The key will be used to aggregate similar data. The value can be anything, but remember you will have a list of these values for every key.

See also

Please look at the map functionality for a Verse for a more comprehensive usage explanation.

Note

Bug fixes in v0.4.0

mapreduce(mapper, reducer, expand=True, sort=True)

This function calls the map and reduce functions and returns the results as a dictionary.

However, if the expand option is True, the results will be expanded into a list. This means that each row in the results will be a tuple, with the last value being the mapped data (returned as the value from the mapper function).

See also

Look at the documentation for Verse.mapreduce for a longer explanation.

Note

Bug fixes in v0.4.0

orderby(*attrs)

Orders a SubVerse by the attributes given

paginate(count)

Pages the collection.

If a collection has 5000 documents, calling verse.paginate(500) will return 10 pages of 500 documents each.

See also

Look at the paginate function for Verse to see an example.

purify(*cols)

Converts the given attributes to a NumPy recarray. This can provide an easy transition to NumPy for data filtering, grouping and manipulation.

See also

Please look at the Verse ‘purify’ documentation on how to use this functionality.

reduce(mapper_results, reducer, expand=True, sort=True)

This function ‘reduces’ the data returned from each map group. Reducers are meant to return a single value per group. However, due to Python’s dynamic typing you can return a list, dictionary or tuple, since they are objects themselves.

See also

Look at the documentation for Verse.reduce for a longer explanation.

Note

Bug fixes in v0.4.0

simple_reduce(reducer)

This function simply returns the result of passing the entire dataset to the reducing function.

col = uv.collection
count = col.simple_reduce(len)

Python’s len function is passed the entire dataset and returns the count. Reducers are meant to return a single result for the entire dataset. However, no one enforces that rule; you could return a list object and no one would care.

Note

Bug fixes in v0.4.0

skip(count)

Returns all documents after the user-defined number of documents has been skipped.

unique(*attrs)

Finds all the unique values for one or more columns

See also

Please look at the unique functionality for a Verse for a more comprehensive usage explanation.

class underverse.NecRow(*args, **kwargs)

Each document is based on a Python dict object. When interacting with SQlite, this class is the storage object. Python classes can be persisted using Underverse as long as they have a __dict__ attribute. If object is the superclass, you should be fine.

When documents are persisted, they are given a UUID as well as created_at and updated_at properties. This is why the data persisted is slightly larger than the raw data.

The persisted object can both get and set variables easily using the dot syntax.

#connect to existing document store
uv = Underverse("existing.db")

#each Verse (or SQlite table) has iterator functionality
for person in uv.people:

  #Get first name and last name
  first = person.first_name
  last = person.last_name

  #set full name
  person.full_name = "%s %s" % (first, last)
class underverse.KeyValue(key, value)

This class adds support for storing key-value documents.

The key-value storage is built on top of the current document infrastructure. Each document has a UUID and created_at and updated_at attributes.

Key-Value documents override the UUID variable with a custom user-defined value, whereas the ‘value’ portion is stored like all other documents.

Adding and getting keys and their values is simple:

uv = Underverse()
options = uv.options

# Add key by using the ``put`` method
options.put('key', 'value')

# Get the associated value
print options.get('key')

#This raises a KeyError.
print options.get('key2')

Here’s a more verbose way to do the same thing; the put/get example above is the easier approach:

from underverse import Underverse, KeyValue

uv = Underverse()
options = uv.options

n = KeyValue('key', 5)
options.add(n)

# Key-Values can also be used like any other document...
for o in options:
  print o
  o.value2 = o.value * 2
  options.update(o)

# Basically, the following line is doing the same thing as a KeyValue instance
# n = NecRow({'uuid':'key', 'value':5})

Querying

The search syntax is close to SQLAlchemy and the API is similar to MongoDB.

During the early development phases, I decided to mimic the MongoDB query syntax. However, after the initial version I found it to be more verbose than SQLAlchemy. At that time, I made a design change and converted everything to reflect the SQLAlchemy filter functionality. This proved to be better in the end because it’s both more readable and more concise. So it was a win on both fronts.

Here’s a list of all the currently supported conditional statements.

class underverse.model.Document(name)

This class is the main class for handling document queries.

Here are some query examples:

print Document.age < 5
print Document.age <= 5
print Document.age > 5
print Document.age >= 5
print Document.age == 5
print Document.age != 5
print Document.name.len(5)
print Document.age.btw(20, 50)
print Document.age
print Document.name.type(str)
print Document.name.in_(['Max'])
print Document.name.nin(['Max'])
print Document.name.match('Max')
print Document.name.search('ax')
print Document.name.nmatch('Max')
print Document.name.nsearch('Max')

# cheesy example
def email_search(value, domain=None):
        if domain in value:
                return True
        else:
                return False

#defining your own predicate functions...
# if the function returns 'True', the object will be included in the results; otherwise, it will be left out.
#also, any arguments passed to the 'udp' function will be forwarded to the given predicate as input.
for d in verse.find(Document.email.udp(email_search, domain='gmail')):
        print d

# for the above example, you could have done it like this.
for d in verse.find(Document.email.search('gmail')):
        print d

To query for the first matching object, use the find_one function instead of find:

# to find the first document, call find_one instead of find
d = verse.find_one(Document.email.udp(email_search, domain='gmail'))
print d

As of v0.3.5, compound or nested query objects are supported. Here’s what I mean by that:

uv = Underverse()
test = uv.test

# create user class
class User(NecRow):
  def __init__(self, name, password):
      self.name = name
      self.password = password

# create comment class
class Comment(object):
        def __init__(self, text):
                super(Comment, self).__init__()
                self.text = text

# create 5 users
for i in range(5):
        user = User('user%s' % i, 'password')
        user.comment = Comment('hey%s' % i)
        test.add(user)

# find all users that have posted a comment that equals 'hey1'
# notice the possibility of stacking attributes for a more powerful query
# in previous versions you would have had to write a UDP or UDF
for u in test(D.comment.text == 'hey1'):
        print "'%s'" % u.comment.text

# find_one also works
u = test.find_one(D.comment.text == 'hey1')
print u.comment.text
btw(left, right)

Finds documents where an attribute is between the given left and right inputs

from underverse.model import Document as D

# finds all users whose age is between 18 and 25
uv.users.find(D.age.btw(18, 25))
in_(value)

Finds documents where an attribute is in the given list

from underverse.model import Document as D

# finds all users whose name is either 'Max' or 'Tamara'
uv.users.find(D.name.in_(['Max', 'Tamara']))
in_list(value)

Finds documents where an attribute is in the given list

Note

This predicate calls list on the input before it is compared. This can be useful for passing other generators to the predicate. However, if the input value is already a list, just call D.in_ instead.

from underverse.model import Document as D

# finds all users whose name is either 'Max' or 'Tamara'
uv.users.find(D.name.in_list(['Max', 'Tamara']))
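Why materialize first? A plain-Python illustration: membership tests against a generator consume it, so repeated comparisons (one per document) would silently go wrong without the list call:

```python
names = (n for n in ['Max', 'Tamara'])  # a generator, not a list

# The first membership test consumes the generator up to the match...
print('Max' in names)  # True
# ...so a repeated test no longer sees the consumed values.
print('Max' in names)  # False

# Materializing once makes repeated membership tests safe.
names = list(n for n in ['Max', 'Tamara'])
print('Max' in names, 'Max' in names)  # True True
```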
in_set(value)

Finds documents where an attribute is in the given set.

Note

This predicate calls set on the input before it is compared. This can be useful for passing other generators to the predicate. However, if the input value is already a set, just call D.in_ instead.

from underverse.model import Document as D

# finds all users whose name is either 'Max' or 'Tamara'
uv.users.find(D.name.in_set(['Max', 'Tamara']))
len(value)

Finds documents where an attribute is a certain length

from underverse.model import Document as D

# finds all users whose names are only 3 characters long
uv.users.find(D.name.len(3))
static limit(value)

Limits the amount of documents found.

uv.docs.find(D.limit(50))
static limskip(_skip, _limit)

Skips the first few documents found and also limits the records found based on the given input.

# skips 50 records and returns the next 50
uv.docs.find(D.limskip(50, 50))
match(value)

Uses the re module to match data attributes. The regex must match the attribute’s value exactly. The given regex would fail if a port were attached to the IP attribute.

# re.compile('10\.2\.1\.\d+').match(doc[ip])
uv.docs.find(D.ip.match('10\.2\.1\.\d+'))
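The difference from the search predicate (documented below) can be seen with the re module directly. Assuming match requires the whole value to match, as described above, re.fullmatch gives comparable behavior:

```python
import re

pattern = re.compile(r'10\.2\.1\.\d+')

# An exact match succeeds on a bare IP but fails once a port is attached.
print(bool(pattern.fullmatch('10.2.1.5')))       # True
print(bool(pattern.fullmatch('10.2.1.5:8080')))  # False

# search only needs the pattern to occur somewhere in the value.
print(bool(pattern.search('10.2.1.5:8080')))     # True
```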
nin(value)

Finds all documents where the attribute is NOT in the given list. Look at in_ for an example.

nin_list(value)

Finds all documents where the attribute is NOT in the given list. Look at in_list for an explanation.

nin_set(value)

Finds all documents where the attribute is NOT in the given set. Look at in_set for an explanation.

nmatch(value)

This finds the opposite of the match predicate.

nsearch(value)

This finds the opposite of the search predicate.

static orderby(*args)

Orders documents by one or more columns.

uv.docs.find(D.orderby('name', '-age'))

The orderby functionality now has ASC and DESC capability. Descending order is achieved by prepending a - (negative sign or hyphen) to the column name.

Notice that the orderby is actually inside the find function.

However, you can also do this:

uv.docs.orderby('name', '-age')
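The multi-column ASC/DESC behavior can be emulated in plain Python with stable sorts (hypothetical data; a sketch, not Underverse's implementation):

```python
def orderby(docs, *attrs):
    """Sort by several columns; a leading '-' marks a descending column."""
    for attr in reversed(attrs):  # stable sorts compose right-to-left
        desc = attr.startswith('-')
        name = attr.lstrip('-')
        docs = sorted(docs, key=lambda d: d[name], reverse=desc)
    return docs

people = [{'name': 'Max', 'age': 25},
          {'name': 'Max', 'age': 31},
          {'name': 'Ana', 'age': 27}]

# name ascending, then age descending within each name
print(orderby(people, 'name', '-age'))
```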
search(value)

Uses the re module to search data attributes. The regex doesn’t have to match exactly. The given regex would NOT fail if a port were attached to the IP attribute.

# re.compile('10\.2\.1\.\d+').search(doc[ip])
uv.docs.find(D.ip.search('10\.2\.1\.\d+'))
static skip(value)

Skips the first few documents found based on the given input.

# skips 50 records
uv.docs.find(D.skip(50))
type(value)

Finds documents where an attribute’s type matches the input type

uv.users.find(D.age.type(int))

Note

This may not ever be used...

static udf(function, *args, **kwargs)

Passes each document to the user defined function along with any args and kwargs. This can be used to filter documents on multiple attributes of the data along with other advanced functionality.

Note

The UDF takes the entire collection and returns a subset of documents matching complex criteria. This differs from the UDP functionality in that the UDP only receives a single attribute of one document at a time.

def complex_filter(array):
        subset = []
        for doc in array:
                if some_ninja_math:
                        subset.append(doc)
        return subset

uv.docs.find(D.udf(complex_filter))

Or a real example...

# finds all docs where x**y > 4
def sq_filter(array, goal=2):
        subset = []
        for doc in array:
                if doc.x ** doc.y > goal:
                        subset.append(doc)
        return subset

for d in verse.find(Document.udf(sq_filter, 4)):
        print d
udp(function, *args, **kwargs)

User-defined predicates (UDPs) can be used if the existing comparison operators are not enough.

import math
from underverse.model import Document as D

#filters out documents where the sqrt of a selected attribute is between a given range
def sqrt_filter(doc, lower_bound=2, upper_bound=39):
        if upper_bound >= math.sqrt(doc) >= lower_bound:
                return True
        else:
                return False

# finds all docs whose 'some_number' attribute's sqrt is between 16 and 64
uv.docs.find(D.some_number.udp(sqrt_filter, lower_bound=16, upper_bound=64))

The UDP function only receives the ‘some_number’ attribute and must return a bool indicating whether to include or filter out each row. You’ll also notice that any *args or **kwargs are forwarded to the predicate. This allows for more flexibility and DRY code.

Note

If you are astute, you’ll quickly see the limitation of this method. However, it might be useful for some. Because the udp method only passes one attribute, more complex filters will not work easily. If you need a filter which requires multiple attributes of the data, use the udf method instead.

class underverse.model.Join(left, right, alias)

This class facilitates ‘join’ operations in Underverse. Traditionally, join operations are used to merge different datasets based on some common attribute. Underverse provides this functionality, as well as methods for performing MUCH more powerful merging.

There are currently 3 different options for joining different datasets: ON, UDP and UDF.

  • ON

    The ON functionality is designed to work similar to the JOIN in traditional SQL. However, unlike SQL, all matching documents are stored in the result for every original document.

    # SELECT * from posts p JOIN comments c ON p.id = c.post_id
    for post in posts.join(comments, 'comments').on(D.id, D.post_id):
    
            # prints all comments found in the join operation
            print post.comments
    

    In the example above, all results are stored in the comments variable.

  • UDP

    User-Defined Predicates are similar to the UDPs found in the find functionality. This method allows coders to customize the ON functionality as explained above. UDPs must return either True or False.

    These predicates are slower than the other two join methods: ON is roughly O(N) while UDPs are O(N^2), and UDPs are slightly slower than UDFs. In exchange, UDPs can express complex join operations with very little code.

    # This join finds all comments for each post that were created within the first 5 minutes of the post's creation
    for post in posts.join(comments, 'early_comments').udp(D.id, D.post_id, 
            lambda left, right: True if right.created_at - left.created_at < 300 else False):
    
            # prints all comments found in the join operation
            print post.early_comments
    

    You can also do this.

    def udp_(left, right):
            if right.created_at - left.created_at < 300:
                    return True
            return False
    
    # This join finds all related comments AND only includes early comments within 5 minutes of the creation of the post
    for post in posts.join(comments, 'early_comments').udp(D.id, D.post_id, udp_):
    
            # prints all comments found in the join operation
            print post.early_comments

    The code above finds all comments posted within 5 minutes of the post’s creation. UDPs act similarly to the ON clause in that they use two attributes to match with. However, they give some freedom to filter on more than the two original attributes.

    Note

    Traditional SQL

    UDPs can be compared to SQL like so:

    Imagine this query, SELECT * FROM posts p JOIN comments c ON p.id = c.post_id WHERE c.created_at - p.created_at < 300;.

    The SQL above performs the same action as the code posted in the section above.

    In the example above, all results are stored in the early_comments variable.

  • UDF

    This is by far the most powerful join operation in Underverse. UDFs do not have the ON functionality of UDPs. Therefore you can customize the JOIN logic ‘til Underverse come..

    Join UDFs give the coder full rein over the algorithm used to find the documents you are looking for. Therefore you can use binary searches and the like to perform joining much faster than a simple loop.

    def early_comments(left, right_array):
            values = []
            for r in right_array:
                    if r.post_id == left.id and r.created_at - left.created_at < 300:
                            values.append(r)
            return values
    
    # This join finds all related comments AND only includes early comments within 5 minutes of the creation of the post
    for post in posts.join(comments, 'early_comments').udf(early_comments):
    
            # prints all comments found in the join operation
            print post.early_comments
    

Note

Performance Hint

ON joins are probably going to be the fastest in most if not all cases. However, that is the least flexible method. UDPs are normally slower than UDFs just by the sheer number of function calls. UDFs are the most flexible in terms of customization, and they allow the join operation to be tailored to the data, which means special algorithms can be used to greatly increase its speed.
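The performance hint can be made concrete with a plain-Python sketch of the two strategies (hypothetical data): an ON-style join indexes the right side once and probes it, while a UDP-style join evaluates a predicate for every (left, right) pair.

```python
from collections import defaultdict

posts = [{'id': 1}, {'id': 2}]
comments = [{'post_id': 1, 'text': 'hey'},
            {'post_id': 1, 'text': 'ho'},
            {'post_id': 2, 'text': 'yo'}]

# ON-style: build a hash index over the right side once, then probe it (~O(N))
index = defaultdict(list)
for c in comments:
    index[c['post_id']].append(c)
for p in posts:
    p['comments'] = index[p['id']]

# UDP-style: evaluate a predicate for every (left, right) pair (O(N^2))
for p in posts:
    p['comments_udp'] = [c for c in comments if c['post_id'] == p['id']]

print(len(posts[0]['comments']), len(posts[1]['comments']))  # 2 1
```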

on(left, right, omit=False, default=[])

The ON methodology mimics what is found in traditional RDBMS and SQL.

See also

Look at the class description for usage

udf(udf)

User-Defined Functions provide for the most flexible join operations available in Underverse.

See also

Look at the class description for usage

udp(left, right, udp)

User-Defined Predicates provide for slightly more flexible joins than the ON method.

See also

Look at the class description for usage

class underverse.model.QuasiDead(recarray)

The QuasiDead are quoted to be those ‘who deprive themselves of all nourishment ... for mental advancement’.

Think of the unstructured aspect of your data that is being forfeited to gain added functionality from NumPy. More specifically, the QuasiDeads are user-defined subsets of the data contained in a document collection.

This data is turned into a NumPy recarray for perhaps more convenient data grouping, filtering and manipulation.

Warning

Because of the heavy reliance upon NumPy, this added functionality requires NumPy to be installed.

find(*bools)

This function selects all documents where the logical AND of the given conditions is true.

Note

This functionality requires NumPy to be installed as it uses NumPy as the back end.

uv = Underverse()
uv.load('data.sql')

users = uv.users

# you must define data attributes to use in the NumPy recarray
qd = users.purify('name', 'age', 'gender')

# finds all males
males = qd.find(qd.gender == 'M')

#finds all males who are less than or equal to 25 years of age
# multiple AND arguments are separated by commas
young_males = qd.find(qd.gender == 'M', qd.age <= 25)

# finds all users whose names are either 'Max' OR 'Tamara'
# notice that the OR syntax is slightly different
# OR conditions MUST be surrounded by parentheses and separated by a |
# instead of being comma delimited
young_males = qd.find((qd.name == 'Max') | (qd.name == 'Tamara'))
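Under the hood this is NumPy boolean indexing; the mechanism can be sketched directly on a recarray (hypothetical data; requires NumPy):

```python
import numpy as np

people = np.rec.fromrecords(
    [('Max', 25, 'M'), ('Tamara', 27, 'F'), ('Billy', 31, 'M')],
    names='name,age,gender')

# each comparison yields a boolean mask; ANDed masks select the rows
mask = (people.gender == 'M') & (people.age <= 25)
young_males = people[mask]
print(young_males.name.tolist())  # ['Max']
```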
groupby(*cols)

Groups a QuasiDead instance by one or more attributes. This works exactly the same way as a Verse groupby.

uv = Underverse()
uv.load('data.sql')

qd = uv.test.purify('name', 'age', 'gender')
for name, ppl in qd.groupby('name'):
        print name, len(ppl)
orderby(*args)

Orders a QuasiDead instance by one or more attributes

qd.orderby('name')
unique(*cols)

Finds all the unique combinations of one or more columns.

uv = Underverse()
uv.load('data.sql')

data = uv.data
qd = data.purify('name', 'age')
qd.unique('name', 'age')
class underverse.predicates.AND(*filters)

This provides for the logical AND-ing of conditions. This is the default behavior of Underverse.

However, this can be used in conjunction with the OR to perform more powerful queries.

# SELECT * FROM test WHERE 
#   (name = 'Billy' AND age = 31) OR 
#   (name = 'Zaphod' AND (age between 60 and 65)));
r = test.find(OR(AND(Document.name == 'Zaphod', Document.age.btw(60, 65)), AND(Document.name == 'Billy', Document.age == 31)))

Note

Any conditions separated by a comma in the find functions are AND-ed together.

class underverse.predicates.OR(*filters)

This class provides for the logical OR-ing of conditions.

from underverse.model import Document as D

# SELECT * FROM test WHERE 
#   ((age BETWEEN 30 AND 35) OR (age BETWEEN 60 AND 65)) AND 
#   (name = 'Billy' OR name = 'Zaphod'));
r = uv.users.find(OR(D.age.btw(30, 35), D.age.btw(60, 65)), OR(D.name == 'Billy', D.name == 'Zaphod'))

The code above selects all ‘users’ who are 30-35 OR 60-65 years old AND whose names are either ‘Billy’ OR ‘Zaphod’. The name filter can be simplified by using D.name.in_(['Billy', 'Zaphod']).
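The composition semantics can be sketched with plain Python predicates (hypothetical stand-ins for illustration; not Underverse's classes):

```python
# AND: every predicate must pass; OR: at least one must pass.
def AND(*preds):
    return lambda doc: all(p(doc) for p in preds)

def OR(*preds):
    return lambda doc: any(p(doc) for p in preds)

users = [{'name': 'Billy', 'age': 31},
         {'name': 'Zaphod', 'age': 62},
         {'name': 'Max', 'age': 25}]

pred = OR(AND(lambda d: d['name'] == 'Zaphod', lambda d: 60 <= d['age'] <= 65),
          AND(lambda d: d['name'] == 'Billy', lambda d: d['age'] == 31))

print([u['name'] for u in users if pred(u)])  # ['Billy', 'Zaphod']
```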

Change Log

0.4.1

  • Fixed bug in grouping of objects

0.4.0

  • Completely rewrote / refactored the object handling to use jsonpickle instead of the v0.3.0 home-grown solution
    • The result is much simpler and has more robust object handling
  • Changed JSON extension functionality to use jsonpickle
  • Added new package called handlers
  • Added AND / OR predicates for more complex queries
  • Test count increased to 86
  • Fixed issue with paginate returning empty first page
  • MapReduce changes
    • Fixed critical mapreduce bug in calling reduce
    • Renamed the Verse.reduce function to Verse.simple_reduce
      • This is to maintain compatibility and naming conventions between Verses and SubVerses
      • Added new reduce function to Verse objects, which allows map results to be reduced several times without having to re-run the map operation.
  • Fixed a few small bugs
  • Added more exceptions
  • Added documentation for map, reduce and mapreduce functionality

Warning

v0.4.0 breaks backward compatibility. This means that v0.4 might have problems loading/reading dumps or sqlite databases that were created with older versions. In-memory databases shouldn’t see any differences. Underverse now uses jsonpickle to handle object encoding.

Also, I removed the need to call Underverse.create_mappers when decoding objects. As long as the object has been imported things should work automatically.

0.3.6

  • Added Join capability
  • Added more robust object handling
  • Added built-in support for datetime.datetime, datetime.date and datetime.time objects
  • Added more tests
  • Expanded documentation
  • Fixed iterator bugs

0.3.5

  • Added Key-Value functionality (Including support for easy ENUMs)
  • Expanded query capability to include nested objects and complex object queries
  • Small fixes to object handling
  • Increased test count to 64
  • Added VACUUM option to purging

0.3.1

  • Tweaks to MapReduce functionality to make it actually usable
  • Added MapReduce tests

0.3a

  • Small changes to setup.py

0.3

  • Underverse is now officially beta
  • Greatly increased test coverage between v0.2.3 - v0.3 (current total: 45 tests)
  • Fully integrated ODM support
    • This means that Underverse can now actually be called an ODM. v0.3 offers full Object-to-Document functionality, including the auto-encoding of class instances into JSON objects, as well as easy decoding (a simple one-liner) of the JSON back into complete Python objects. And the kicker...all this is done without the need for a base class.
  • Added capability for user-defined encoding and decoding of classes. The built-in functionality should work for all but the most demanding of cases (classes defined in C/C++).
  • Added capability to ‘export’ to a Numpy recarray through a process I like to call ‘purification’. This can be used to easily gain any additional functionality provided by the amazing NumPy.
  • Rewrote ORDER BY functionality
    • Added complete support of ascending and descending sorting by multiple document attributes
  • Simplified the usage and improved readability of skip and limit functions
  • Added alternate query method by overloading the __call__ function for document collections (A.K.A. Verses)
  • Fixed bug in find_one function
  • Added initial, intelligent support for the storing of NumPy ndarrays
  • Fixed spelling errors in and increased breadth of documentation

0.2.3

  • Added more documentation to the existing query methods
  • Added initial support for UDFs

0.2.1 - 0.2.2

  • Removed remaining stream dependencies
  • Fixed bug in processing user defined predicates

0.2

  • Completely rewrote find / query methods
  • Can now stack conditions successfully
  • Fixed initial skip, limit, limskip and orderby functionality
  • Added initial reduce and mapreduce functionality
  • Removed stream dependency
0.1.1
  • fixed critical bug in inserting of Python dicts
0.1
  • Project Start

ToDo / Brain-storming

Top Priority

  • Add more documentation for:
    • the QuasiDeads (ie. ‘purify’ functionality)
    • the adding and decoding of Python objects
    • the MapReduce functionality

On the list

  • Add tests for groupby, map and reduce functionality
  • Add support for dynamic attribute naming (ie. naming predicates)
  • Possibly adding support for loading of delimited files
  • Customized data loading by using UDFs
  • Possible ENUM capability?
  • Through the use of preprocessors? (ie. UDFs / UDPs)
  • Data validation framework through an extension of unittest
  • RuleSets and Rules
  • RuleSet - parent class
  • has rules
  • rules can be applied to datasets which filter / flag conditional criterion
  • data attributes could be used to store rule results
    • rules -> RuleSet -> Rule
  • Is there a need for a @rule decorator?
