Sampling
========

The DNASnout client uses sampling while going through input data. This
might be of general interest, so it is made easily accessible here.

We use a reservoir sampling strategy, which has the following
interesting properties:

- the data are iterated through once, and there is no need to keep all
  of them in memory
- at any point during the iteration we have a (uniform) random sample
  of the data seen so far

Random sampling
---------------

.. note::

   This module originally appeared in :py:mod:`dnasnout_client`.

In the context of NGS data, reservoir sampling is particularly
convenient because it lets one iterate through the data while having,
at any point in time, a random sample of the data seen so far.

.. code-block:: python

   # open a FASTQ file
   from ngs_plumbing import parsing
   filename = "mysequences.fastq"
   seq_data = parsing.open(filename)

   # create a ReservoirSample for (up to) 1,000 entries
   from ngs_plumbing import sampling
   nreads = 1000
   spl = sampling.ReservoirSample(nreads)

   # iterate through the entries (sequence data + quality)
   # in the FASTQ file, feeding each one to the sample
   for read in seq_data.iter_seqqual():
       spl.add(read)

We have also added a variation of reservoir sampling that uses a Bloom
filter: an item is considered for the reservoir only if it is not
already present in the sample and has not been seen before (both
conditions being tested with a Bloom filter). This sampling class
should be used whenever diversity-based sampling is sought and the
data contain highly duplicated entries.

.. code-block:: python

   # open a FASTQ file
   from ngs_plumbing import parsing
   filename = "mysequences.fastq"
   seq_data = parsing.open(filename)

   # create a ReservoirSampleBloom for (up to) 1,000 entries
   from ngs_plumbing import sampling
   nreads = 1000
   spl = sampling.ReservoirSampleBloom(nreads)

   # iterate through the entries (sequence data + quality)
   # in the FASTQ file, feeding each one to the sample
   for read in seq_data.iter_seqqual():
       spl.add(read)

*Normalized* random sampling
----------------------------

The constructor for :class:`ReservoirSampleBloom` accepts a named
parameter ``itemkey``, which can be any function that takes an item of
the kind one wishes to sample and returns a string. This is designed
to let one easily express custom approaches to filtering, or use
arbitrary data structures. For example, the default ``itemkey`` is
designed to work with the parsing features in this package, but it is
trivial to make it work with :mod:`biopython` sequences.

.. code-block:: python

   def biopython_sequence(item):
       # Biopython records carry a Seq object; the key must be a string
       return str(item.seq)

   spl = sampling.ReservoirSampleBloom(nreads,
                                       itemkey=biopython_sequence)

Another example is when only the first 50 bases should be considered
for the Bloom filter (one reason would be to prevent highly abundant
reads from filling up the sample just because high sequence
variability in the last part of the read makes them look unique).

.. code-block:: python

   def clipatfifty(item):
       # only the first 50 bases identify a read
       return item.sequence[:50]

   spl = sampling.ReservoirSampleBloom(nreads,
                                       itemkey=clipatfifty)

.. note::

   There is an obvious link with the general idea of using streams of
   entries in a pipeline (see :ref:`section-parsing`). Sampling is a
   way to reduce the amount of sequencing data to work with, and here
   we have a way to build the sample without holding all the data in
   memory.
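For readers curious how a reservoir can remain a uniform random sample
while using bounded memory, the core of the technique (often called
*Algorithm R*) is compact enough to sketch here. The class below is a
minimal, self-contained illustration of the general algorithm, not the
implementation behind :class:`ReservoirSample`; its names are made up
for the example.

.. code-block:: python

   import random

   class SimpleReservoir(object):
       """Uniform random sample of up to `size` items from a stream."""

       def __init__(self, size):
           self._size = size
           self._sample = list()
           self._n_seen = 0

       def add(self, item):
           self._n_seen += 1
           if len(self._sample) < self._size:
               # the reservoir is not full yet: keep everything
               self._sample.append(item)
           else:
               # keep the new item with probability size / n_seen,
               # replacing a uniformly chosen current member
               j = random.randrange(self._n_seen)
               if j < self._size:
                   self._sample[j] = item

At any point during the iteration ``_sample`` holds a uniform random
sample of the items added so far, which is the property relied on
throughout this section.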
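The Bloom-filter variant can likewise be pictured as gating the same
reservoir with a membership test. In the sketch below a plain ``set``
stands in for the Bloom filter (a real Bloom filter would use constant
memory at the price of occasional false positives); the class and its
``itemkey`` parameter mirror :class:`ReservoirSampleBloom` but are
hypothetical, not the package's actual code.

.. code-block:: python

   class SimpleReservoirBloom(SimpleReservoir):
       """Reservoir that only considers keys not seen before."""

       def __init__(self, size, itemkey=lambda item: item):
           SimpleReservoir.__init__(self, size)
           self._itemkey = itemkey
           self._seen = set()

       def add(self, item):
           key = self._itemkey(item)
           if key in self._seen:
               # duplicate key (with a real Bloom filter, possibly a
               # false positive): do not consider it for the sample
               return
           self._seen.add(key)
           SimpleReservoir.add(self, item)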