4. analysis — Helper functions for data transformation

This module contains functions operating on lists of value pairs. This form of data organization makes it possible to perform complex transformations on data by splitting them into several smaller steps, each taking one list of value pairs as input and returning its result as another one.

Each item in such a list represents a point in two dimensions. The first value is the x coordinate, and may be of any type that supports ordering, e.g. int, float or datetime. The second value is a list of values that describe the y coordinate. Using a list of values instead of a single value might seem confusing at first, but this is what makes this kind of multiple-step transformation possible.

Suppose we want to plot a histogram of a number of floating point values between 0 and 10. The first step would be to split the sample into bins by taking the whole part of each number, and the second to count the values in each bin. After the first step each bin has an x coordinate (the whole part of the numbers in it) and a list of values expressing the y coordinate. For a histogram, the y coordinate would be the number of values in a bin. Alternatively, we could reduce each group of values to its mean and standard deviation, which would both be needed for a statistically meaningful representation of the original data. In this case we begin with a list of values (the sample data), and the result is another list containing the mean and standard deviation of each group.
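To make the two steps concrete, here is a rough sketch in plain Python that does not use this module at all. The sample values and the variable names (sample, bins, counts, summary) are made up for illustration:

```python
from statistics import mean, stdev

# Hypothetical sample data, chosen so that every bin has at least two values.
sample = [0.1, 0.4, 1.3, 1.9, 2.1, 2.6, 2.8]

# Step 1: group the values by their whole part.
bins = {}
for value in sample:
    bins.setdefault(int(value), []).append(value)

# Step 2a: for a histogram, reduce each bin to the number of values in it.
counts = [(x, [len(ys)]) for x, ys in sorted(bins.items())]

# Step 2b: alternatively, reduce each bin to [mean, standard deviation].
summary = [(x, [mean(ys), stdev(ys)]) for x, ys in sorted(bins.items())]

print(counts)  # prints [(0, [2]), (1, [2]), (2, [3])]
```

Both results are again lists of (x, [y values]) pairs, which is the shape the functions in this module work with.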

The function that performs this transformation is recode(). It takes three parameters: a list of value pairs as described above (data) and two functions, one for the x coordinate (fx) and one for y (fy). In the first step, all x coordinates are replaced by fx(x). Then all items with the same new x coordinate are merged by concatenating their y value lists into one, preserving the original order. Finally, each of these lists is passed to fy and replaced by the result fy(y), which should itself be a list of values if you want to use the resulting data as input for another step of the same kind.
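The three steps described above could be sketched roughly as follows. This is not the module's actual implementation; recode_sketch is a hypothetical stand-in, and it assumes that groups come out in order of first appearance, which the real recode() may or may not do:

```python
def recode_sketch(data, fx=None, fy=None):
    """Illustrative stand-in for recode(); not the module's real code."""
    groups = {}  # new x -> concatenated y list (dicts keep insertion order)
    for x, y in data:
        new_x = fx(x) if fx is not None else x
        # Items with the same new x are merged, keeping their original order.
        groups.setdefault(new_x, []).extend(y)
    # Finally, each merged y list is replaced by fy(y) if fy was given.
    return [(x, fy(y) if fy is not None else y) for x, y in groups.items()]


pairs = [(0.1, [0.1]), (0.4, [0.4]), (1.3, [1.3])]
print(recode_sketch(pairs, fx=int, fy=lambda y: [len(y)]))
# prints [(0, [2]), (1, [1])]
```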

For the histogram example, we start off with this:

# original sample:
data = [0.1, 0.4, 1.3, 2.1, 2.6, 2.8, ..., 9.1, 9.8]

First, we have to make a list of pairs. Since we only want to count the data, it doesn’t matter much what is in the y part. We could represent each value by a 1 and sum them up. But to remain flexible, we keep the original values and count them at the end:

data = [(di, [di]) for di in data]
# data = [(0.1, [0.1]), (0.4, [0.4]), ..., (9.8, [9.8])]

Now we want to consolidate values with the same whole part:

data = recode(data, fx=int)

The fx parameter has to be a function, in this case int(), that will convert the original floating point values into integers. int() is built in, but we can also define our own functions, e.g.:

def my_fx(x):
    return int(x)

data = recode(data, fx=my_fx)

Or, if the function is a short one, we can create it on the spot using lambda:

data = recode(data, fx=lambda x: int(x))

All three examples above produce the same result, which makes data look like this:

data = [(0, [0.1, 0.4]), (1, [1.3]), ..., (9, [9.1, 9.8])]

The remaining step is to replace each list of values with a list containing their number as the only item:

def my_fy(y):
    return [len(y)]

data = recode(data, fy=my_fy)

Or, equivalently:

data = recode(data, fy=lambda y: [len(y)])

This gives us the following result, which is all we need to create a histogram:

data = [(0, [2]), (1, [1]), (2, [3]), ..., (9, [2])]

The fx part of the transformation is executed before the fy part, and both can be performed in one call to recode(), turning the whole process into a one-liner:

data = recode(data, fx=int, fy=lambda y: [len(y)])

Now we have the data for the histogram, but it is not in a form that makes it easy to plot. It would make more sense to separate the x and y values at the end of the calculation into two lists that can then be plotted with matplotlib. For this purpose, there is a function called unpack():

x, y = unpack(data)
# x = [0, 1, 2, ..., 9]
# y = [2, 1, 3, ..., 2]

4.1. Functions

4.1.1. Data conversion from raw to lists of pairs

analysis.extract(dataset, key, start=datetime(2000, 1, 1), end=datetime.utcnow())

This function serves as the interface between raw datasets and all other functions in this module. It takes all samples that satisfy the (optional) date range specified by start and end and builds a list of pairs from them by using the "time" value as the x coordinate and key as y. The resulting list has the structure required by recode and other related functions.
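The format of a raw dataset is not described here, so the following is only a guess at what the conversion might look like. It assumes that a dataset is an iterable of dict-like samples whose "time" entry is a datetime, and that both ends of the date range are inclusive. extract_sketch and the "temp" key are hypothetical:

```python
from datetime import datetime


def extract_sketch(dataset, key, start=datetime(2000, 1, 1), end=None):
    """Illustrative stand-in for extract(); assumes dict-like samples."""
    pairs = []
    for sample in dataset:
        t = sample["time"]
        # Skip samples outside the range; end=None means no upper limit here.
        if t < start or (end is not None and t > end):
            continue
        pairs.append((t, [sample[key]]))
    return pairs


raw = [
    {"time": datetime(1999, 12, 31, 23, 0), "temp": 3.5},
    {"time": datetime(2024, 1, 1, 8, 0), "temp": 4.0},
    {"time": datetime(2024, 1, 1, 9, 0), "temp": 5.5},
]
print(extract_sketch(raw, "temp"))
```

With the default start, the first sample is dropped, and the remaining two become (time, [value]) pairs.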

analysis.by_date(data, key, start=datetime(2000, 1, 1), end=datetime.utcnow())

If you want to calculate values for each day of a period, the first step after calling extract() would be to group values belonging to the same day. This function performs both steps at once and is a shortcut for

data = extract(dataset, key)
data = recode(data, fx=lambda x: x.date())

analysis.by_time(data, key, start=datetime(2000, 1, 1), end=datetime.utcnow())

Another common task after calling extract() on a range of samples is to discard the dates and group samples by their time of day, e.g. followed by splitting them further into 1-hour bins (see quant_time()) and calculating their averages. This function does both steps and is similar to by_date(), but instead of discarding the time value of samples, it removes the date part. The x coordinates are of type timedelta.
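As an illustration of what removing the date part means, a datetime can be reduced to a timedelta since midnight like this. The helper time_of_day is hypothetical and only shows the idea; how by_time() does it internally is not documented here:

```python
from datetime import datetime, timedelta


def time_of_day(dt):
    """Return the time elapsed since midnight of dt's own day."""
    return timedelta(hours=dt.hour, minutes=dt.minute,
                     seconds=dt.second, microseconds=dt.microsecond)


stamps = [datetime(2024, 1, 1, 8, 30), datetime(2024, 1, 2, 8, 30)]
offsets = [time_of_day(s) for s in stamps]
print(offsets[0] == offsets[1])  # prints True
```

Samples taken at the same time on different days end up with the same x coordinate, so a later grouping step puts them into the same group.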

4.1.2. Transformations on lists of pairs

analysis.recode(data, fx=None, fy=None)

This function applies a transformation to the x coordinates of data first, followed by another one on the resulting y value lists. Both fx and fy are optional. See the tutorial above for an explanation of how it works.

analysis.quant_time(data, size=3600, start=timedelta(0), fy=None)

With this function you can split data that has timedelta x coordinates into bins of size seconds each. start specifies the beginning of the first bin: for example, to split data into groups of 2 hours, the first one starting at 01:00 and running until 03:00, set start to timedelta(seconds=3600) (1 hour after 00:00) and size to 7200 (2 hours). Optionally, fy can be applied to the resulting bins.
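The bin arithmetic behind the example above can be sketched as follows. The helper bin_start is hypothetical, and it assumes that each bin is identified by its start offset; which x coordinate quant_time() actually assigns to a bin is not stated here:

```python
from datetime import timedelta


def bin_start(x, size=3600, start=timedelta(0)):
    """Return the start of the bin of `size` seconds that contains x."""
    width = timedelta(seconds=size)
    index = (x - start) // width  # timedelta // timedelta yields an int
    return start + index * width


first = bin_start(timedelta(hours=2, minutes=15),
                  size=7200, start=timedelta(seconds=3600))
second = bin_start(timedelta(hours=3, minutes=30),
                   size=7200, start=timedelta(seconds=3600))
print(first)   # prints 1:00:00
print(second)  # prints 3:00:00
```

A function like this could be passed as fx to recode() to group items by bin.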

analysis.quant_time_raw(data, size=3600, start=timedelta(0), fy=None)

Same as quant_time(), except that x coordinates of data items are datetime values.

4.1.3. List content management

analysis.split(data, fx)

This function applies fx to all items in data, but instead of replacing the original x coordinates with the new ones, it makes groups with the same value for fx(x) by adding another level of hierarchy:

data = [(0.1, [...]), (0.4, [...]), (1.2, [...]), ...]
data = split(data, int)

# data = {0: [(0.1, [...]), (0.4, [...])],
#         1: [(1.2, [...])],
#         ...}

analysis.merge(data1, data2)

merge() combines both lists into one. For every x coordinate in data2 that is already present in data1, the y value list from data2 is appended to the one in data1.
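A minimal sketch of this behaviour is shown below. merge_sketch is a hypothetical stand-in, and it assumes that items from data2 whose x coordinate does not occur in data1 are simply added at the end, which the description above does not spell out:

```python
def merge_sketch(data1, data2):
    """Illustrative stand-in for merge(); not the module's real code."""
    merged = {}
    for x, y in list(data1) + list(data2):
        # extend() copies the values, so the input lists stay unchanged.
        merged.setdefault(x, []).extend(y)
    return list(merged.items())


a = [(0, [1]), (1, [2])]
b = [(1, [3]), (2, [4])]
print(merge_sketch(a, b))
# prints [(0, [1]), (1, [2, 3]), (2, [4])]
```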

analysis.unpack(data)

This function returns the contents of data as separate lists. The first one contains all x coordinates. The second one is either another list with all y coordinates, if y values are 1-element lists:

x, y = unpack([(x1, [y1]), (x2, [y2])])
# x = [x1, x2]
# y = [y1, y2]

or, if y values consist of more than one entry, they are grouped together by index (all first values in one list, all second values in the next one, etc.):

x, (y, z) = unpack([(x1, [y1, z1]), (x2, [y2, z2])])
# x = [x1, x2]
# y = [y1, y2]
# z = [z1, z2]
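Both cases can be reproduced with zip(). The function below is only a sketch of the described behaviour under the assumption that all y lists have the same length; unpack_sketch is a hypothetical name:

```python
def unpack_sketch(data):
    """Illustrative stand-in for unpack(); not the module's real code."""
    xs = [x for x, _ in data]
    # Transpose the y lists: all first values, all second values, etc.
    columns = [list(col) for col in zip(*(y for _, y in data))]
    if len(columns) == 1:
        return xs, columns[0]  # 1-element y lists: return a flat list
    return xs, columns


x, y = unpack_sketch([(0, [2]), (1, [1])])
print(x, y)  # prints [0, 1] [2, 1]

x2, (y2, z2) = unpack_sketch([(1, [2, 3]), (4, [5, 6])])
print(x2, y2, z2)  # prints [1, 4] [2, 5] [3, 6]
```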

4.1.4. Other functions

analysis.sliding_window(data, length, step=None, start=None)

This is a generator that yields slices of data. On each step it yields all items with x coordinates ranging from start to start + length, after which start is incremented by step. If step is not specified, it is set to length. If no start value is supplied, the x coordinate of the first item in data is used. For example:

sliding_window(data, 3, 1, 0)

would first return all values with x coordinates in [0, 3), on the next iteration in [1, 4), etc. until the end of data is reached.
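The windowing logic can be sketched as a plain generator. This is not the module's implementation: sliding_window_sketch is a hypothetical name, it assumes data is sorted by x, and it stops once start has moved past the last x coordinate, which is one possible reading of "until the end of data is reached":

```python
def sliding_window_sketch(data, length, step=None, start=None):
    """Illustrative stand-in for sliding_window(); assumes sorted data."""
    if not data:
        return
    if step is None:
        step = length
    if start is None:
        start = data[0][0]
    last_x = data[-1][0]
    while start <= last_x:
        # Half-open interval: start is included, start + length is not.
        yield [(x, y) for x, y in data if start <= x < start + length]
        start += step


points = [(i, [i]) for i in range(5)]
windows = list(sliding_window_sketch(points, 3, 1, 0))
print([x for x, _ in windows[0]])  # prints [0, 1, 2]
print([x for x, _ in windows[1]])  # prints [1, 2, 3]
```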

analysis.guess_dt(data)

If items in data are spaced in regular steps, you can use this function to find the step size. It samples the first 100 items and returns the most common difference between one item and the next in the sequence.
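One way to find the most common difference is collections.Counter. The sketch below assumes that the differences are taken between consecutive x coordinates; guess_dt_sketch and its sample_size parameter are hypothetical:

```python
from collections import Counter


def guess_dt_sketch(data, sample_size=100):
    """Illustrative stand-in for guess_dt(); not the module's real code."""
    xs = [x for x, _ in data[:sample_size]]
    diffs = [b - a for a, b in zip(xs, xs[1:])]
    if not diffs:
        return None  # fewer than two items: nothing to compare
    return Counter(diffs).most_common(1)[0][0]


series = [(x, [0]) for x in (0, 1, 2, 3, 5, 6)]
print(guess_dt_sketch(series))  # prints 1
```

The same approach works for datetime x coordinates, since their differences are hashable timedelta values.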

analysis.date_gaps(data, step=timedelta(1))

If data contains items with regular steps but some values are missing, this can lead to artifacts when e.g. plotting the data as a line chart: the value before the gap would be connected to the one after it, instead of there being an interruption in the line. This function fills gaps with lists of as many None values as there are in other items. It expects datetime values as x coordinates.
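The gap filling described above could look roughly like this. date_gaps_sketch is a hypothetical stand-in that assumes data is sorted by x and that the first item determines how many None values a filler item gets:

```python
from datetime import datetime, timedelta


def date_gaps_sketch(data, step=timedelta(1)):
    """Illustrative stand-in for date_gaps(); assumes sorted data."""
    if not data:
        return []
    width = len(data[0][1])  # number of y values per item
    filled = [data[0]]
    for x, y in data[1:]:
        expected = filled[-1][0] + step
        # Insert placeholder items until the next real item is reached.
        while expected < x:
            filled.append((expected, [None] * width))
            expected += step
        filled.append((x, y))
    return filled


daily = [
    (datetime(2024, 1, 1), [1.0]),
    (datetime(2024, 1, 2), [2.0]),
    (datetime(2024, 1, 4), [4.0]),
]
for item in date_gaps_sketch(daily):
    print(item)
```

The output has four items, with (datetime(2024, 1, 3), [None]) inserted for the missing day.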

analysis.close_gaps(data, step)

Similar to date_gaps(), but suitable for other types of x coordinates. The drawback is that the steps between x values must be precise. Due to rounding errors it might behave incorrectly if float values are used.
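The rounding problem is easy to reproduce in plain Python, independently of this module:

```python
import math

step = 0.1
x = 0.0
for _ in range(3):
    x += step  # three steps of 0.1 should land on 0.3

print(x == 0.3)              # prints False
print(math.isclose(x, 0.3))  # prints True
```

Because the accumulated value is not exactly 0.3, an exact comparison against the expected next x coordinate fails. Integer x coordinates (e.g. a count of steps) avoid this kind of mismatch.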