************************************************************
:mod:`analysis` --- Helper functions for data transformation
************************************************************

.. module:: analysis
   :synopsis: Helper functions for data transformation

This module contains functions that operate on lists of value pairs. Organizing
data this way makes it possible to break a complex transformation into several
smaller steps, each of which takes one list of value pairs as input and returns
its result as another such list.

Each item in such a list represents a point in two dimensions. The first value
is the x coordinate and may be of any type that supports ordering, e.g.
``int``, ``float`` or ``datetime``. The second value is a list of values that
describe the y coordinate. Using a list of values instead of a single value
might seem confusing at first, but this is what makes this kind of multi-step
transformation possible.

Suppose we want to plot a histogram of a number of floating point values
between 0 and 10. The first step would be to split the sample into bins by
taking the whole part of each number, and the second to count the values in
each bin. After the first step each bin has an x coordinate (the whole part of
the numbers in it) and a list of values expressing the y coordinate. For a
histogram, the y coordinate would be the number of values in a bin.
Alternatively, we could reduce each group of values to its mean and standard
deviation, both of which are needed for a statistically meaningful
representation of the original data. In either case we begin with a list of
values (the sample data), and the result is another list, in the second case
containing the mean and standard deviation of each group.

The function that performs this transformation is :func:`recode`. It takes
three parameters: a list of value pairs as described above (``data``) and two
functions, one for the x coordinate (``fx``) and one for y (``fy``). In the
first step, all x coordinates are replaced by ``fx(x)``. Then all items with
the same new x coordinate are merged by concatenating their y value lists into
one while preserving the original order. Finally, each merged list is passed
to ``fy`` and replaced by the result (``fy(y)``), which should again be a list
of values if you want to use the resulting data as input for another such
step.
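
The three steps just described can be summarized in a short sketch. It only
illustrates the behaviour documented here and is not the module's actual
implementation: the names ``recode_sketch`` and ``mean_std`` are hypothetical,
and returning the groups in order of first occurrence is an assumption. The
example reduces each group to its mean and (population) standard deviation::

   from statistics import mean, pstdev


   def recode_sketch(data, fx=None, fy=None):
       """Illustrative stand-in for recode(); not the module's real code."""
       groups = {}  # dicts keep insertion order: groups appear as first seen
       for x, y in data:
           new_x = fx(x) if fx is not None else x
           # Items sharing a new x are united by concatenating their y lists.
           groups.setdefault(new_x, []).extend(y)
       if fy is None:
           return list(groups.items())
       return [(x, fy(y)) for x, y in groups.items()]


   def mean_std(y):
       # Returns a list again, so the result can feed another step.
       return [mean(y), pstdev(y)]


   sample = [(0.5, [1.0]), (0.75, [3.0]), (1.25, [5.0])]
   grouped = recode_sketch(sample, fx=int, fy=mean_std)
   # grouped == [(0, [2.0, 1.0]), (1, [5.0, 0.0])]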

For the histogram example, we start off with this::

   # original sample:
   data = [0.1, 0.4, 1.3, 2.1, 2.6, 2.8, ..., 9.1, 9.8]

First, we have to make a list of pairs. Since we only want to count the data,
it doesn’t matter much what is in the ``y`` part. We could represent each value
by a 1 and sum them up. But to remain flexible, we keep the original values and
count them at the end::

   data = [(di, [di]) for di in data]
   # data = [(0.1, [0.1]), (0.4, [0.4]), ..., (9.8, [9.8])]

Now we want to consolidate values with the same whole part::

   data = recode(data, fx=int)

The ``fx`` parameter has to be a function, in this case :func:`int`, that will
convert the original floating point values into integers. :func:`int` is built
in, but we can also define our own functions, e.g.::

   def my_fx(x):
       return int(x)

   data = recode(data, fx=my_fx)

Or, if the function is a short one, we can create it on the spot using
``lambda``::

   data = recode(data, fx=lambda x: int(x))

All three examples above produce the same result, after which ``data`` looks
like this::

   data = [(0, [0.1, 0.4]), (1, [1.3]), ..., (9, [9.1, 9.8])]

The remaining step is to replace each list of values with a list whose only
item is the number of values it contained::

   def my_fy(y):
       return [len(y)]

   data = recode(data, fy=my_fy)

Or, equivalently::

   data = recode(data, fy=lambda y: [len(y)])

This gives us the following result, which is all we need to create a
histogram::

   data = [(0, [2]), (1, [1]), (2, [3]), ..., (9, [2])]

The ``fx`` part of the transformation is executed before the ``fy`` part, and
both can be performed in one call to :func:`recode`, turning the whole process
into a one-liner::

   data = recode(data, fx=int, fy=lambda y: [len(y)])

Now we have the data for the histogram, but it is not in a form that makes it
easy to plot. It makes more sense to separate the x and y values at the end of
the calculation into two lists, which can then be plotted with ``matplotlib``.
For this purpose, there is a function called :func:`unpack`::

   x, y = unpack(data)
   # x = [0, 1, 2, ..., 9]
   # y = [2, 1, 3, ..., 2]

Functions
=========

Data conversion from raw to lists of pairs
------------------------------------------

.. function:: extract(dataset, key, start=datetime(2000, 1, 1), end=datetime.utcnow())

   This function serves as the interface between raw datasets and all other
   functions in this module. It takes all samples that satisfy the (optional)
   date range specified by ``start`` and ``end`` and builds a list of pairs
   from them by using the ``"time"`` value as the x coordinate and ``key`` as
   y. The resulting list has the structure required by ``recode`` and other
   related functions.
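
   How such a conversion might look is sketched below. The layout of a raw
   dataset is not described here, so the sketch assumes that it is an iterable
   of dictionaries, that both bounds are inclusive, and that they may be
   omitted. ``extract_sketch`` and the ``"temp"`` key are hypothetical names
   used for illustration only::

      from datetime import datetime


      def extract_sketch(dataset, key, start=None, end=None):
          """Hypothetical stand-in for extract(); assumes dict samples."""
          pairs = []
          for sample in dataset:
              t = sample["time"]
              # Assumed here: both bounds are inclusive and optional.
              if start is not None and t < start:
                  continue
              if end is not None and t > end:
                  continue
              # The y part is wrapped in a one-element list.
              pairs.append((t, [sample[key]]))
          return pairs


      dataset = [
          {"time": datetime(2020, 1, 1, 8, 0), "temp": 3.5},
          {"time": datetime(2020, 1, 1, 9, 0), "temp": 4.0},
          {"time": datetime(2020, 1, 2, 8, 0), "temp": 2.5},
      ]
      pairs = extract_sketch(dataset, "temp", start=datetime(2020, 1, 1, 9, 0))
      # pairs == [(datetime(2020, 1, 1, 9, 0), [4.0]),
      #           (datetime(2020, 1, 2, 8, 0), [2.5])]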

.. function:: by_date(data, key, start=datetime(2000, 1, 1), end=datetime.utcnow())

   If you want to calculate values for each day of a period, the first step
   after calling :func:`extract` would be to group values belonging to the same
   day. This function performs both steps at once and is a shortcut for ::

      data = extract(dataset, key)
      data = recode(data, fx=lambda x: x.date())

.. function:: by_time(data, key, start=datetime(2000, 1, 1), end=datetime.utcnow())

   Another common task after extracting a range of samples with
   :func:`extract` is to discard the dates and group the samples by time of
   day, e.g. in order to split them further into 1-hour bins (see
   :func:`quant_time`) and calculate their averages. This function performs
   both steps. It is similar to :func:`by_date`, but instead of discarding the
   time of each sample, it removes the date part. The resulting x coordinates
   are of type ``timedelta``.
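
   Removing the date part can be pictured as an ``fx`` function that maps each
   ``datetime`` to its offset from midnight of the same day. The helper name
   ``time_of_day`` is hypothetical, and whether the module computes the
   ``timedelta`` in exactly this way is an assumption::

      from datetime import datetime, timedelta


      def time_of_day(moment):
          """Return the time elapsed since midnight of the same day."""
          midnight = moment.replace(hour=0, minute=0, second=0, microsecond=0)
          return moment - midnight


      offset = time_of_day(datetime(2020, 3, 14, 15, 9, 26))
      # offset == timedelta(hours=15, minutes=9, seconds=26)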

Transformations on lists of pairs
---------------------------------

.. function:: recode(data, fx=None, fy=None)

   This function first applies ``fx`` to the x coordinates of ``data`` and
   merges items whose new x coordinates are equal, then applies ``fy`` to each
   resulting y value list. Both ``fx`` and ``fy`` are optional. See the
   tutorial above for a detailed explanation.

.. function:: quant_time(data, size=3600, start=timedelta(0), fy=None)

   This function splits data with ``timedelta`` x coordinates into bins of
   ``size`` seconds each. ``start`` specifies the beginning of the first bin.
   For example, to split data into groups of 2 hours, the first one running
   from 01:00 until 03:00, set ``start`` to ``timedelta(seconds=3600)`` (1 hour
   after 00:00) and ``size`` to 7200 (2 hours). Optionally, ``fy`` can be
   applied to the resulting bins.
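
   Which bin an item falls into can be worked out with floor division of
   ``timedelta`` values, as in the sketch below. ``bin_start`` is a
   hypothetical helper, and labelling each bin by its beginning is an
   assumption, since it is not stated here which x coordinate the bins
   receive::

      from datetime import timedelta


      def bin_start(x, size=3600, start=timedelta(0)):
          """Return the beginning of the bin that contains ``x``."""
          width = timedelta(seconds=size)
          index = (x - start) // width  # timedelta // timedelta gives an int
          return start + index * width


      # Two-hour bins, the first one running from 01:00 until 03:00.
      label = bin_start(timedelta(hours=4, minutes=30), size=7200,
                        start=timedelta(seconds=3600))
      # label == timedelta(hours=3)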

.. function:: quant_time_raw(data, size=3600, start=timedelta(0), fy=None)

   Same as :func:`quant_time`, except that x coordinates of data items are
   ``datetime`` values.

List content management
-----------------------

.. function:: split(data, fx)

   This function applies ``fx`` to all items in ``data``, but instead of
   replacing the original x coordinates with the new ones, it makes groups with
   the same value for ``fx(x)`` by adding another level of hierarchy::

      data = [(0.1, [...]), (0.4, [...]), (1.2, [...]), ...]
      data = split(data, int)

      # data = {0: [(0.1, [...]), (0.4, [...])],
      #         1: [(1.2, [...])],
      #         ...}
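
   The grouping shown above can be reproduced with a few lines of plain
   Python. This is a sketch under the assumption that the result is an
   ordinary ``dict``, as the example output suggests; ``split_sketch`` is a
   hypothetical name, not the module's implementation::

      def split_sketch(data, fx):
          """Hypothetical stand-in for split(): group items by ``fx(x)``."""
          groups = {}
          for x, y in data:
              # The original (x, y) pair is kept intact inside its group.
              groups.setdefault(fx(x), []).append((x, y))
          return groups


      data = [(0.1, [1]), (0.4, [2]), (1.2, [3])]
      groups = split_sketch(data, int)
      # groups == {0: [(0.1, [1]), (0.4, [2])], 1: [(1.2, [3])]}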

.. function:: merge(data1, data2)

   ``merge`` combines both lists into one. For every x coordinate of ``data2``
   that is already present in ``data1``, the y value list from ``data2`` is
   appended to the corresponding list from ``data1``.
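
   A minimal sketch of this behaviour follows. It assumes that items of
   ``data2`` whose x coordinate does not occur in ``data1`` are added at the
   end, and that the result keeps the order of first occurrence; neither
   detail is specified here. ``merge_sketch`` is a hypothetical name::

      def merge_sketch(data1, data2):
          """Hypothetical stand-in for merge(); result order is assumed."""
          merged = {}
          for x, y in list(data1) + list(data2):
              # Equal x coordinates share one y list; data2 values go last.
              merged.setdefault(x, []).extend(y)
          return list(merged.items())


      data1 = [(1, ["a"]), (2, ["b"])]
      data2 = [(2, ["c"]), (3, ["d"])]
      combined = merge_sketch(data1, data2)
      # combined == [(1, ["a"]), (2, ["b", "c"]), (3, ["d"])]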

.. function:: unpack(data)

   This function returns the contents of ``data`` as separate lists. The first
   one contains all x coordinates. The second one is either a flat list of all
   y coordinates, if the y values are one-element lists::

      x, y = unpack([(x1, [y1]), (x2, [y2])])
      # x = [x1, x2]
      # y = [y1, y2]

   or, if y values consist of more than one entry, they are grouped together by
   index (all first values in one list, all second values in the next one,
   etc.)::

      x, (y, z) = unpack([(x1, [y1, z1]), (x2, [y2, z2])])
      # x = [x1, x2]
      # y = [y1, y2]
      # z = [z1, z2]
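
   Both cases can be expressed with :func:`zip`, as in the sketch below. It
   assumes that all y value lists have the same length; ``unpack_sketch`` is a
   hypothetical name and not the module's implementation::

      def unpack_sketch(data):
          """Hypothetical stand-in for unpack(), covering both cases."""
          xs = [x for x, _ in data]
          # Transpose the y lists: one column per position in the lists.
          columns = [list(column) for column in zip(*(y for _, y in data))]
          if len(columns) == 1:
              return xs, columns[0]
          return xs, columns


      x, y = unpack_sketch([(0, [2]), (1, [1])])
      # x == [0, 1], y == [2, 1]
      x, (y, z) = unpack_sketch([(0, [2, 0.5]), (1, [1, 0.25])])
      # y == [2, 1], z == [0.5, 0.25]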

Other functions
---------------

.. function:: sliding_window(data, length, step=None, start=None)

   This is a generator that yields slices of ``data``. In each step it yields
   all items with x coordinates from ``start`` up to, but not including,
   ``start + length``, after which ``start`` is incremented by ``step``. If
   ``step`` is not specified, it is set to ``length``. If no ``start`` value is
   supplied, the x coordinate of the first item in ``data`` is used. For
   example::

      sliding_window(data, 3, 1, 0)
   
   would first yield all items with x coordinates in [0, 3), on the next
   iteration those in [1, 4), and so on until the end of ``data`` is reached.
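
   The windowing logic can be sketched as a plain generator. The sketch
   assumes that ``data`` is non-empty and sorted by x, and that iteration
   stops once ``start`` has moved past the last x coordinate; the exact stop
   condition is not specified here. ``sliding_window_sketch`` is a
   hypothetical name::

      def sliding_window_sketch(data, length, step=None, start=None):
          """Hypothetical stand-in for sliding_window(); sorted data only."""
          if step is None:
              step = length
          if start is None:
              start = data[0][0]
          last_x = data[-1][0]
          while start <= last_x:
              # Half-open interval [start, start + length).
              yield [(x, y) for x, y in data if start <= x < start + length]
              start += step


      data = [(0, [1]), (1, [2]), (2, [3]), (3, [4]), (4, [5])]
      windows = list(sliding_window_sketch(data, 3, 2, 0))
      # windows[0] covers [0, 3), windows[1] covers [2, 5),
      # windows[2] covers [4, 7) and holds only the last item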

.. function:: guess_dt(data)

   If the items in ``data`` are spaced at regular steps, this function can be
   used to determine the step size. It takes a sample of the first 100 items
   and returns the most common difference between one item and the next one in
   the sequence.
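
   One way to find the most common difference is
   :class:`collections.Counter`, as sketched below. The sketch assumes that
   the difference is taken between x coordinates and that ``data`` holds at
   least two items; ``guess_dt_sketch`` and its ``sample_size`` parameter are
   hypothetical::

      from collections import Counter


      def guess_dt_sketch(data, sample_size=100):
          """Hypothetical stand-in for guess_dt(); needs two or more items."""
          xs = [x for x, _ in data[:sample_size]]
          # Count how often each difference between neighbours occurs.
          steps = Counter(b - a for a, b in zip(xs, xs[1:]))
          return steps.most_common(1)[0][0]


      data = [(0, [1]), (5, [2]), (10, [3]), (20, [4]), (25, [5])]
      dt = guess_dt_sketch(data)
      # dt == 5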

.. function:: date_gaps(data, step=timedelta(1))

   If ``data`` contains items at regular steps but some values are missing,
   this can lead to artifacts, e.g. when plotting the data as a line chart: the
   value before the gap is connected to the one after it instead of the line
   being interrupted. This function fills the gaps with items whose y value
   lists contain as many ``None`` values as the lists of the other items. It
   expects ``datetime`` values as x coordinates.

.. function:: close_gaps(data, step)

   Similar to :func:`date_gaps`, but suitable for other types of x coordinates.
   The drawback is that the steps between x values must be precise. Due to
   rounding errors it might behave incorrectly if ``float`` values are used.
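
   The gap filling can be sketched as follows. The sketch assumes that
   ``data`` is non-empty, sorted by x, and that all y value lists have the
   same length; ``close_gaps_sketch`` is a hypothetical name and not the
   module's implementation::

      def close_gaps_sketch(data, step):
          """Hypothetical stand-in for close_gaps(); sorted data only."""
          width = len(data[0][1])  # number of y values per item
          filled = [data[0]]
          for x, y in data[1:]:
              expected = filled[-1][0] + step
              # Relies on exact arithmetic. With floats, rounding may leave
              # ``expected`` slightly off ``x`` and add a spurious item.
              while expected < x:
                  filled.append((expected, [None] * width))
                  expected += step
              filled.append((x, y))
          return filled


      data = [(0, [1.5]), (1, [2.5]), (4, [3.0])]
      filled = close_gaps_sketch(data, 1)
      # filled == [(0, [1.5]), (1, [2.5]), (2, [None]), (3, [None]),
      #            (4, [3.0])]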