Sampling
========

The DNASnout client uses sampling while going through input data. This
might be of general interest, so it is made easily accessible here.

We use a reservoir sampling strategy, which has the following
interesting properties:

- the data are iterated through once, and there is no need to keep all
  of them in memory
- at any point during the iteration we have a (uniform) random sample
  of the data seen so far

Random sampling
---------------

.. note::

   This module originally appeared in :py:mod:`dnasnout_client`.

In the context of NGS data, reservoir sampling is particularly
convenient because it lets one iterate through the data while having,
at any point in time, a random sample of the data seen so far.

.. code-block:: python

   # open a FASTQ file
   from ngs_plumbing import parsing
   filename = "mysequences.fastq"
   seq_data = parsing.open(filename)

   # create a ReservoirSample for (up to) 1,000 entries
   from ngs_plumbing import sampling
   nreads = 1000
   spl = sampling.ReservoirSample(nreads)

   # iterate through the entries (sequence data + quality)
   # in the FASTQ file, feeding each one to the sample
   for read in seq_data.iter_seqqual():
       spl.add(read)

We have also added a variation of reservoir sampling that uses a Bloom
filter: an item is considered for the reservoir only if it is not
already present in the sample and has not been seen before (both
conditions being tested with a Bloom filter). This sampling class
should be used whenever diversity-based sampling is sought and the
data contain highly duplicated entries.

.. code-block:: python

   # open a FASTQ file
   from ngs_plumbing import parsing
   filename = "mysequences.fastq"
   seq_data = parsing.open(filename)

   # create a ReservoirSampleBloom for (up to) 1,000 entries
   from ngs_plumbing import sampling
   nreads = 1000
   spl = sampling.ReservoirSampleBloom(nreads)

   # iterate through the entries (sequence data + quality)
   # in the FASTQ file, feeding each one to the sample
   for read in seq_data.iter_seqqual():
       spl.add(read)

*Normalized* random sampling
----------------------------

The constructor for :class:`ReservoirSampleBloom` accepts a named
parameter ``itemkey``, which can be any function that takes an item of
the kind one wishes to sample and returns a string. This is designed
to let one easily express custom approaches to filtering, or use
arbitrary data structures. For example, the default ``itemkey`` is
designed to work with the parsing features in this package, but it is
trivial to make it work with :mod:`biopython` sequences.

.. code-block:: python

   def biopython_sequence(item):
       # Biopython records carry a Seq object; the key must be a string
       return str(item.seq)

   spl = sampling.ReservoirSampleBloom(nreads,
                                       itemkey=biopython_sequence)

Another example is when only the first 50 bases should be considered
for the Bloom filter (one reason would be to prevent highly abundant
reads from filling up the sample just because high sequence
variability in the last part of the read makes them look unique).

.. code-block:: python

   def clipatfifty(item):
       # only the first 50 bases identify a read
       return item.sequence[:50]

   spl = sampling.ReservoirSampleBloom(nreads,
                                       itemkey=clipatfifty)

.. note::

   There is an obvious link with the general idea of using streams of
   entries in a pipeline (see :ref:`section-parsing`). Sampling is a
   way to reduce the amount of sequencing data to work with, and here
   we have a way to build the sample without holding all the data in
   memory.
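For readers curious how a reservoir can remain a uniform random sample
while using bounded memory, the core of the technique (often called
*Algorithm R*) is compact enough to sketch here. The class below is a
minimal, self-contained illustration of the general algorithm, not the
implementation behind :class:`ReservoirSample`; its names are made up
for the example.

.. code-block:: python

   import random

   class SimpleReservoir(object):
       """Uniform random sample of up to `size` items from a stream."""

       def __init__(self, size):
           self._size = size
           self._sample = list()
           self._n_seen = 0

       def add(self, item):
           self._n_seen += 1
           if len(self._sample) < self._size:
               # the reservoir is not full yet: keep everything
               self._sample.append(item)
           else:
               # keep the new item with probability size / n_seen,
               # replacing a uniformly chosen current member
               j = random.randrange(self._n_seen)
               if j < self._size:
                   self._sample[j] = item

At any point during the iteration ``_sample`` holds a uniform random
sample of the items added so far, which is the property relied on
throughout this section.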
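The Bloom-filter variant can likewise be pictured as gating the same
reservoir with a membership test. In the sketch below a plain ``set``
stands in for the Bloom filter (a real Bloom filter would use constant
memory at the price of occasional false positives); the class and its
``itemkey`` parameter mirror :class:`ReservoirSampleBloom` but are
hypothetical, not the package's actual code.

.. code-block:: python

   class SimpleReservoirBloom(SimpleReservoir):
       """Reservoir that only considers keys not seen before."""

       def __init__(self, size, itemkey=lambda item: item):
           SimpleReservoir.__init__(self, size)
           self._itemkey = itemkey
           self._seen = set()

       def add(self, item):
           key = self._itemkey(item)
           if key in self._seen:
               # duplicate key (with a real Bloom filter, possibly a
               # false positive): do not consider it for the sample
               return
           self._seen.add(key)
           SimpleReservoir.add(self, item)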