2. readdata — Data input from files

Currently, the only built-in way to load data into the library is from CSV files. Their structure is flexible to a degree, but it must be possible to convert every data row in the file into the form that the library's functions (especially the plots) expect. Specifically, the files have to fulfill these criteria:

  • the first line must contain the header, i.e. its cells must be column names.

  • each row will be converted into a dictionary that must have the key "time", whose value should be of the type datetime and contain the UTC timestamp of the given sample, and at least one further entry, which is the actual measurement result. If there is one measurement column which you would like to call "x", each sample of the dataset will look like this:

    {'time': datetime(...), 'x': ... }
    

    The naming of measurement columns is up to you, but a "time" value must be present and consist of a complete timestamp, a date and a time, for this sample. The date and time parts may be in separate columns and can be combined in a later step.

Loading data is done by creating an object of the type CSVReader and supplying the filename and some further parameters describing the internal structure of the CSV file.

After loading, the samples can be accessed by index. Iterating over the whole dataset is also supported:

data = CSVReader('data.txt', ...)

# access a single sample by index
print(data[0])

# iterate over all samples
for di in data:
    print(di)
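Because every sample is a plain dictionary, the usual list and dictionary idioms apply. The following is a minimal sketch that uses a hand-written list as a stand-in for a loaded dataset; the values are made up for illustration and nothing here is produced by CSVReader.

```python
from datetime import datetime

# Stand-in for a loaded dataset: a plain list of sample dictionaries
# (hypothetical values, not read from a file).
data = [
    {'time': datetime(2010, 1, 1, 10, 0, 0), 'x': 50},
    {'time': datetime(2010, 1, 1, 10, 5, 0), 'x': 42},
    {'time': datetime(2010, 1, 1, 10, 10, 0), 'x': 17},
]

first = data[0]                    # access by index
values = [s['x'] for s in data]    # iterate over all samples

# keep only the samples taken at or after 10:05 UTC
late = [s for s in data if s['time'] >= datetime(2010, 1, 1, 10, 5)]
```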

2.1. Classes

class readdata.ActivityDataset

This is the base class that implements the basic interface for datasets. It provides the functions necessary to access individual samples by index, iterate over the complete dataset, printing a summary of its contents and returning a portion of the data as a list of x-y-pairs (see analysis). To implement a new interface which is not yet provided (e.g. one that can load data from an SQL database), you should derive it from this class.
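The methods a subclass of ActivityDataset has to override are not listed here, so the following is only a stand-in that mimics the access pattern described above (index access and iteration). The class name InMemoryDataset and its internals are hypothetical and are not part of the library.

```python
from datetime import datetime


class InMemoryDataset(object):
    """Hypothetical stand-in; this is NOT readdata.ActivityDataset."""

    def __init__(self, samples):
        # keep the samples ordered by their timestamp
        self._samples = sorted(samples, key=lambda s: s['time'])

    def __len__(self):
        return len(self._samples)

    def __getitem__(self, index):
        return self._samples[index]

    def __iter__(self):
        return iter(self._samples)


ds = InMemoryDataset([
    {'time': datetime(2010, 1, 1, 10, 5), 'x': 42},
    {'time': datetime(2010, 1, 1, 10, 0), 'x': 50},
])
```

A real data source, such as an SQL database, would presumably fill the sample list in its constructor and inherit the rest from the base class.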

class readdata.CSVReader(filename, lat=None, lon=None, delimiter='\t', skiplines=0, columns=None, coltypes=None, format=None)

The filename parameter specifies the path to the file you want to load. lon and lat can be used to set the geographic coordinates, which are used by some plots (plot.actogram()) and are also mandatory if you want to look at differences between day and night data.

The remaining parameters are needed for correct interpretation of the file. One possibility is to supply each of columns, delimiter (if it’s not a tab), skiplines and coltypes:

delimiter
The character used to separate columns, with '\t' (tab) as default value.
skiplines
How many lines to skip after the header. If the first row with actual data immediately follows the header, skiplines should be 0. Any other number will lead to this many lines following the header being ignored.
columns

A dictionary that specifies which columns should be read in from the file and assigns shorter names to them (for convenience). The first line of the file must be a header. If the columns are labeled “UTC_Date”, “UTC_Time” and “Value”, the dictionary may look like this:

columns = {'date': 'UTC_Date', 'time': 'UTC_Time', 'val': 'Value'}

The short column names are not what ends up in the dataset; they are only used for the type conversion step (see coltypes below).

coltypes
This parameter has to be a function that takes one row, as a dictionary keyed by the short names given in columns, and converts it to the format that is expected for datasets, i.e. a dictionary with a "time" value and as many further values as you want to read in from your file. Since the input file is read in as text, individual fields have to be converted to proper types before usage.

Here is an example of how columns and coltypes work together:

# input file has this structure:
#
# UTC_Date    UTC_Time  Value
# 2010-01-01  10:00:00     50

from datetime import datetime

cols = {'date': 'UTC_Date', 'time': 'UTC_Time', 'val': 'Value'}

def my_coltypes(line):
    # line = {'date': '2010-01-01', 'time': '10:00:00', 'val': '50'}
    year, mon, day = line['date'].split('-')
    hour, min_, sec = line['time'].split(':')
    # convert values from string to int
    year, mon, day = int(year), int(mon), int(day)
    hour, min_, sec = int(hour), int(min_), int(sec)

    result = {}
    result['time'] = datetime(year, mon, day, hour, min_, sec)
    result['x'] = int(line['val'])

    return result

dataset = CSVReader(filename, ..., columns=cols, coltypes=my_coltypes)
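The manual splitting can also be left to the standard library. As a sketch, a coltypes function for the same file layout could use datetime.strptime; the format string assumes dates and times look exactly like those in the sample line.

```python
from datetime import datetime


def my_coltypes(line):
    # line = {'date': '2010-01-01', 'time': '10:00:00', 'val': '50'}
    stamp = line['date'] + ' ' + line['time']
    return {
        # assumed layout: YYYY-MM-DD HH:MM:SS
        'time': datetime.strptime(stamp, '%Y-%m-%d %H:%M:%S'),
        'x': int(line['val']),
    }


sample = my_coltypes({'date': '2010-01-01', 'time': '10:00:00', 'val': '50'})
```

strptime raises a ValueError on a malformed field, which tends to surface broken rows early.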

If your input file doesn’t have a header line, it is still possible to load it. Instead of a dictionary, the columns parameter can be a list with the numbers of the columns that you want to read in. In this case, the parameter line passed to the coltypes function will also be a list instead of a dictionary, and you can access its contents by index. Please note that the order of entries in columns matters: they will appear in line in the same order, even if it differs from their order in the file. For example:

# input file has this structure:
#
# UTC_Date    UTC_Time  Value1  Value2
# 2010-01-01  10:00:00      50      15

cols = [1, 0, 2]
# Value2 will not be imported

def my_coltypes(line):
    # line[0] = "UTC_Time", line[1] = "UTC_Date", line[2] = "Value1"
    ...
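One possible way to fill in such a function is sketched below. It assumes the same date and time layout as in the sample line above and stores the measurement under the name "x"; both choices are illustrative.

```python
from datetime import datetime

cols = [1, 0, 2]  # UTC_Time, UTC_Date, Value1 (Value2 is skipped)


def my_coltypes(line):
    # line = ['10:00:00', '2010-01-01', '50'], ordered as in cols
    stamp = line[1] + ' ' + line[0]
    return {
        # assumed layout: YYYY-MM-DD HH:MM:SS
        'time': datetime.strptime(stamp, '%Y-%m-%d %H:%M:%S'),
        'x': int(line[2]),
    }


sample = my_coltypes(['10:00:00', '2010-01-01', '50'])
```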
format

Alternatively, if your data is in a format already known to the module, you can simply set the format parameter and the values for the parameters above will be set accordingly. Currently, there is only one format:

format value    Description
“vas”           Produced by GPS Plus from Vectronic Aerospace when activity data is exported as “Spreadsheet”. Delimited by spaces, has unit descriptions in the second line. The columns “UTC_Date” and “UTC_Time” are combined to the key “time” in the resulting dataset; “ActivityX”, “ActivityY” and “Temp” are converted to “x”, “y” and “temp” respectively.

If there are other formats you are using frequently that you would like to see added to the module, send us a description or an example file.
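To make the interplay of delimiter, skiplines, columns and coltypes more concrete, here is a rough sketch of the kind of processing these parameters describe. The function read_rows is a hypothetical re-implementation written for illustration only; it is not CSVReader, and the library's internals may differ.

```python
import csv
import io
from datetime import datetime


def read_rows(text, delimiter='\t', skiplines=0, columns=None, coltypes=None):
    """Hypothetical illustration; not the library's CSVReader."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    header = next(reader)          # first line: column names
    for _ in range(skiplines):     # ignore lines following the header
        next(reader)
    # map each short name to the position of its column in the file
    index = {short: header.index(name) for short, name in columns.items()}
    samples = []
    for row in reader:
        if not row:
            continue               # skip blank lines
        line = {short: row[i] for short, i in index.items()}
        samples.append(coltypes(line))
    return samples


def to_sample(line):
    stamp = datetime.strptime(line['date'] + ' ' + line['time'],
                              '%Y-%m-%d %H:%M:%S')
    return {'time': stamp, 'x': int(line['val'])}


# made-up file content with a unit line after the header
text = (u"UTC_Date\tUTC_Time\tValue\n"
        u"[yyyy-mm-dd]\t[hh:mm:ss]\t[]\n"
        u"2010-01-01\t10:00:00\t50\n"
        u"2010-01-01\t10:05:00\t42\n")

cols = {'date': 'UTC_Date', 'time': 'UTC_Time', 'val': 'Value'}
samples = read_rows(text, skiplines=1, columns=cols, coltypes=to_sample)
```

Here skiplines=1 drops the unit line that follows the header, so only the two data rows reach the conversion function.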

After successful creation, a dataset object has these public properties and methods:

dt

Sampling interval duration. Its type is whatever results from subtracting the "time" values of two samples, i.e. timedelta if the "time" values are datetime objects, or float if they are floats.

dt_sec

Sampling interval in seconds as float.
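The relation between the two values can be illustrated with plain datetime arithmetic. The timestamps below are made up, and how the library derives dt from a dataset is not shown here.

```python
from datetime import datetime

t0 = datetime(2010, 1, 1, 10, 0, 0)
t1 = datetime(2010, 1, 1, 10, 5, 0)

dt = t1 - t0                  # a timedelta of five minutes
dt_sec = dt.total_seconds()   # the same interval as a float: 300.0
```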

lat

The geographic latitude of the study region, needed for sun line overlays in plot.actogram() and other sun related calculations (see analysis).

lon

The geographic longitude. See lat.

extract(key, start=datetime(2000, 1, 1), end=datetime.utcnow())

Return data for the specified period as pairs of date and value in the format expected by functions in analysis. Equivalent to analysis.extract().
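As a rough sketch of what such an extraction amounts to, the hypothetical helper extract_pairs below filters a list of samples by time and returns pairs of date and value. Treating both start and end as inclusive is an assumption; the library's actual boundary handling is not specified here.

```python
from datetime import datetime


def extract_pairs(samples, key, start, end):
    """Hypothetical stand-in for extract(); inclusive bounds are assumed."""
    return [(s['time'], s[key]) for s in samples
            if start <= s['time'] <= end]


# made-up samples for illustration
samples = [
    {'time': datetime(2010, 1, 1, 10, 0), 'x': 50},
    {'time': datetime(2010, 1, 1, 10, 5), 'x': 42},
    {'time': datetime(2010, 1, 1, 10, 10), 'x': 17},
]

pairs = extract_pairs(samples, 'x',
                      datetime(2010, 1, 1, 10, 5),
                      datetime(2010, 1, 1, 10, 10))
```

If the signature is literally as shown above, note that Python evaluates default arguments once, when the function is defined, so passing end explicitly may be the safer choice in long-running sessions.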

summary()

Print a short description of the dataset, including the covered date range and sampling period duration.
