************************************************************
:mod:`analysis` --- Helper functions for data transformation
************************************************************

.. module:: analysis
   :synopsis: Helper functions for data transformation

This module contains functions operating on lists of value pairs. This form of
data organization makes it possible to perform complex transformations on data
by splitting them into several smaller steps, each taking one list of value
pairs as input and returning its result as another one.

Each item in such a list represents a point in two dimensions. The first value
is the x coordinate, and may be of any type that supports ordering, e.g.
``int``, ``float`` or ``datetime``. The second value is a list of values that
describe the y coordinate. Using a list of values instead of a single value
might seem confusing at first, but this is what makes this kind of multi-step
transformation possible.

Suppose we want to plot a histogram of a number of floating point values
between 0 and 10. The first step would be to split the sample into bins by
taking the whole part of each number, and then counting the values in the bins.
After the first step each bin has an x coordinate (the whole part of the
numbers in it) and a list of values expressing the y coordinate. For a
histogram, the y coordinate would be the number of values in a bin.

Or we could start with a group of values and return their mean and standard
deviation, which would both be needed for a statistically meaningful
representation of the original data. In this case we begin with a list of
values (the sample data), and the result is another list containing the mean
and standard deviation of each group.

The function that performs this transformation is :func:`recode`. It takes
three parameters: a list of value pairs as described above (``data``) and two
functions, one for the x coordinate (``fx``) and one for y (``fy``).
In the first step, all x coordinates are replaced by ``fx(x)``. Then all items
with the same new x coordinate are merged by concatenating their y value lists
into one, preserving the original order. Finally, each of these lists is passed
to ``fy`` and replaced by the result (``fy(y)``), which should again be a list
of values if you want to use the resulting data as input for another such step.

For the histogram example, we start off with this::

   # original sample:
   data = [0.1, 0.4, 1.3, 2.1, 2.6, 2.8, ..., 9.1, 9.8]

First, we have to make a list of pairs. Since we only want to count the data,
it doesn’t matter much what is in the ``y`` part. We could represent each value
by a 1 and sum them up. But to remain flexible, we keep the original values and
count them at the end::

   data = [(di, [di]) for di in data]
   # data = [(0.1, [0.1]), (0.4, [0.4]), ..., (9.8, [9.8])]

Now we want to consolidate values with the same whole part::

   data = recode(data, fx=int)

The ``fx`` parameter has to be a function, in this case :func:`int`, that
converts the original floating point values into integers.
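
The three steps described above (apply ``fx``, merge items with equal x, apply
``fy``) can be sketched in a few lines. ``recode_sketch`` is a hypothetical
stand-in written for this illustration, not the module's implementation. In
particular, it assumes that output items appear in the order in which each new
x coordinate is first seen.

```python
def recode_sketch(data, fx=None, fy=None):
    """Hypothetical illustration of the three steps; not the real recode."""
    groups = {}  # new x -> merged y values, in order of first appearance
    for x, y in data:
        new_x = fx(x) if fx is not None else x
        # Items with the same new x share one list; original order is kept.
        groups.setdefault(new_x, []).extend(y)
    if fy is not None:
        return [(x, fy(y)) for x, y in groups.items()]
    return list(groups.items())


sample = [0.1, 0.4, 1.3, 2.1, 2.6, 2.8]
pairs = [(v, [v]) for v in sample]
binned = recode_sketch(pairs, fx=int)
# binned == [(0, [0.1, 0.4]), (1, [1.3]), (2, [2.1, 2.6, 2.8])]
```

Leaving out ``fy`` keeps the merged value lists untouched, so the result can be
fed into another such step.
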
The :func:`int` function is built in, but we can also define our own
functions, e.g.::

   def my_fx(x):
       return int(x)

   data = recode(data, fx=my_fx)

Or, if the function is a short one, we can create it on the spot using
``lambda``::

   data = recode(data, fx=lambda x: int(x))

All three examples above produce the same result, which makes ``data`` look
like this::

   data = [(0, [0.1, 0.4]), (1, [1.3]), ..., (9, [9.1, 9.8])]

The remaining step is to replace each list of values with a list containing
their count as the only item::

   def my_fy(y):
       return [len(y)]

   data = recode(data, fy=my_fy)

Or, equivalently::

   data = recode(data, fy=lambda y: [len(y)])

This gives us the following result, which is all we need to create a
histogram::

   data = [(0, [2]), (1, [1]), (2, [3]), ..., (9, [2])]

The ``fx`` part of the transformation is executed before the ``fy`` part, and
both can be performed in one call to :func:`recode`, turning the whole process
into a one-liner::

   data = recode(data, fx=int, fy=lambda y: [len(y)])

Now we have the data for the histogram, but it is not in a form that makes it
easy to plot. It makes more sense to separate the x and y values at the end of
the calculation into two lists that can then be plotted with ``matplotlib``.
For this purpose, there is a function called :func:`unpack`::

   x, y = unpack(data)
   # x = [0, 1, 2, ..., 9]
   # y = [2, 1, 3, ..., 2]

Functions
=========

Data conversion from raw to lists of pairs
------------------------------------------

.. function:: extract(dataset, key, start=datetime(2000, 1, 1), end=datetime.utcnow())

   This function serves as the interface between raw datasets and all other
   functions in this module. It takes all samples that fall within the
   (optional) date range specified by ``start`` and ``end`` and builds a list
   of pairs from them, using the ``"time"`` value as the x coordinate and the
   ``key`` value as y. The resulting list has the structure required by
   ``recode`` and other related functions.

.. function:: by_date(data, key, start=datetime(2000, 1, 1), end=datetime.utcnow())

   If you want to calculate values for each day of a period, the first step
   after calling :func:`extract` would be to group values belonging to the same
   day. This function performs both steps at once and is a shortcut for::

      data = extract(dataset, key)
      data = recode(data, fx=lambda x: x.date())

.. function:: by_time(data, key, start=datetime(2000, 1, 1), end=datetime.utcnow())

   Another common task after extracting a range of samples with
   :func:`extract` is to discard the dates and group the samples by their time,
   e.g. followed by splitting them further into 1-hour bins (see
   :func:`quant_time`) and calculating their averages. This function does both
   steps and is similar to ``by_date``, but instead of discarding the time
   value of samples, it removes the date part. The x coordinates are of type
   ``timedelta``.

Transformations on lists of pairs
---------------------------------

.. function:: recode(data, fx=None, fy=None)

   This function first applies ``fx`` to the x coordinates of ``data`` and then
   applies ``fy`` to the resulting y value lists. Both ``fx`` and ``fy`` are
   optional. See the tutorial above for a detailed explanation.

.. function:: quant_time(data, size=3600, start=timedelta(0), fy=None)

   With this function you can split data that has ``timedelta`` x coordinates
   into bins of ``size`` seconds each. ``start`` specifies the beginning of the
   first bin. For example, if you want to split data into groups of 2 hours,
   the first one starting at 01:00 and lasting until 03:00, you would set
   ``start`` to ``timedelta(seconds=3600)`` (1 hour after 00:00) and ``size``
   to 7200 (2 hours). Optionally, ``fy`` can be applied to the resulting bins.

.. function:: quant_time_raw(data, size=3600, start=timedelta(0), fy=None)

   Same as :func:`quant_time`, except that x coordinates of data items are
   ``datetime`` values.

List content management
-----------------------

.. function:: split(data, fx)

   This function applies ``fx`` to the x coordinate of every item in ``data``,
   but instead of replacing the original x coordinates with the new ones, it
   groups items with the same value of ``fx(x)`` by adding another level of
   hierarchy::

      data = [(0.1, [...]), (0.4, [...]), (1.2, [...]), ...]
      data = split(data, int)
      # data = {0: [(0.1, [...]), (0.4, [...])],
      #         1: [(1.2, [...])],
      #         ...}

.. function:: merge(data1, data2)

   ``merge`` combines both lists into one. Where an x coordinate from ``data2``
   is already present in ``data1``, its y value list is appended to the one in
   ``data1``.

.. function:: unpack(data)

   This function returns the contents of ``data`` as separate lists. The first
   one contains all x coordinates. The second one is either another list with
   all y coordinates, if y values are 1-element lists::

      x, y = unpack([(x1, [y1]), (x2, [y2])])
      # x = [x1, x2]
      # y = [y1, y2]

   or, if y values consist of more than one entry, they are grouped together by
   index (all first values in one list, all second values in the next one,
   etc.)::

      x, (y, z) = unpack([(x1, [y1, z1]), (x2, [y2, z2])])
      # x = [x1, x2]
      # y = [y1, y2]
      # z = [z1, z2]

Other functions
---------------

.. function:: sliding_window(data, length, step=None, start=None)

   This is a generator that yields slices of ``data``. In each step, the
   yielded slice contains all items with x coordinates from ``start``
   (inclusive) to ``start + length`` (exclusive), after which ``start`` is
   incremented by ``step``. If ``step`` is not specified, it is set to
   ``length``. If no ``start`` value is supplied, the x coordinate of the first
   item in ``data`` is used. For example::

      sliding_window(data, 3, 1, 0)

   would first yield all values with x coordinates in [0, 3), on the next
   iteration in [1, 4), etc. until the end of ``data`` is reached.

.. function:: guess_dt(data)

   If items in ``data`` are spaced in regular steps, you can use this function
   to find the step size. It takes a sample of the first 100 items and returns
   the most common difference between one item and the next one in the
   sequence.

.. function:: date_gaps(data, step=timedelta(1))

   If ``data`` contains items with regular steps, but some values are missing,
   this can lead to artifacts when e.g. plotting the data as a line chart. The
   value before the gap would be connected to the one after it, instead of
   there being an interruption in the line. This function fills the gaps with
   lists of ``None`` values that are as long as the value lists of the other
   items. It expects ``datetime`` values as x coordinates.

.. function:: close_gaps(data, step)

   Similar to :func:`date_gaps`, but suitable for other types of x
   coordinates. The drawback is that the steps between x values must be
   precise. Due to rounding errors it might behave incorrectly if ``float``
   values are used.
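
The rounding caveat for :func:`close_gaps` can be reproduced in plain Python,
independently of this module. The sketch below is only an illustration: it
assumes that expected x values are obtained by repeatedly adding ``step``,
which may not be how :func:`close_gaps` works internally, and ``walk`` is a
hypothetical helper that does not exist in this module.

```python
from fractions import Fraction


def walk(start, step, count):
    """Hypothetical helper: build x values by repeated addition of step."""
    values = [start]
    for _ in range(count):
        values.append(values[-1] + step)
    return values


# With float steps the accumulated value drifts away from the literal 0.3,
# so an equality test against an existing x coordinate fails.
float_xs = walk(0.0, 0.1, 3)
print(float_xs[3] == 0.3)  # False

# Exact types such as Fraction (or plain integers) do not drift.
exact_xs = walk(Fraction(0), Fraction(1, 10), 3)
print(exact_xs[3] == Fraction(3, 10))  # True
```

If the x coordinates are fractional, converting them to integers or another
exact type before closing gaps is one possible way to avoid the problem.
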