.. _example_session_2:
Example 2: Differential expression scatterplots
===============================================
This example looks more closely at using the results table part of
:mod:`metaseq`, and highlights the flexibility in plotting afforded by
:mod:`metaseq`.
.. code:: python
# Enable in-line plots for this IPython Notebook
%matplotlib inline
Setup
-----
In this section, we'll get the example data for control and knockdown
samples, combine the data, and and create :class:`ResultsTable` object
out of them.
If you haven't already done so, run the
`download_metaseq_example_data.py` script, which will download and
prepare from public sources.
Import what we'll be using:
.. code:: python
import metaseq
from metaseq import example_filename
from metaseq.results_table import ResultsTable
import pandas
import numpy as np
import matplotlib
import pybedtools
import gffutils
from gffutils.helpers import asinterval
import os
We'll be using tables prepared from Cufflinks GTF output from GEO
entries GSM847565 and GSM847566. These represent results control and
ATF3 knockdown experiments in the K562 human cell line. You can read
more about the data on GEO; this example will be more about the features
of :mod:`metaseq` than the biology.
Let's get the example files:
.. code:: python
%%bash
example_dir="metaseq-example"
if [ -e $example_dir ]; then echo "already exists";
else
mkdir -p $example_dir
(cd $example_dir \
&& wget --progress=dot:giga https://raw.githubusercontent.com/daler/metaseq-example-data/master/metaseq-example-data.tar.gz \
&& tar -xzf metaseq-example-data.tar.gz \
&& rm metaseq-example-data.tar.gz)
fi
.. parsed-literal::
already exists
.. code:: python
data_dir = 'metaseq-example/data'
control_filename = os.path.join(data_dir, 'GSM847565_SL2585.table')
knockdown_filename = os.path.join(data_dir, 'GSM847566_SL2592.table')
Let's take a quick peak to see what these files look like:
.. code:: python
# System call; IPython only!
!head -n5 $control_filename
.. parsed-literal::
id score fpkm
ENST00000456328 108.293110992801 1.1183355144
ENST00000515242 87.2330185289764 0.8306172562
ENST00000518655 175.175608597549 2.3676823979
ENST00000473358 343.232678519699 9.7952652359
As documented at http://cufflinks.cbcb.umd.edu/manual.html#gtfout, the
`score` field indicates relative expression of one isoform compared to
other isoforms of the same gene, times 1000. The max score is 1000, and
an isoform with this score is considered the major isoform. A score of
800 would mean an isoform's FPKM is 0.8 that of the major isoform.
If you're working with DESeq results, the
:mod:`metaseq.results_table.DESeqResults` class is a nice wrapper
around those results with one-step import. But here, we'll construct a
`pandas.DataFrame` first and then create a `ResultsTable` object out
of it.
.. code:: python
# Create two pandas.DataFrames
control = pandas.read_table(control_filename, index_col=0)
knockdown = pandas.read_table(knockdown_filename, index_col=0)
Here's what the first few entries look like:
.. code:: python
control.head()
.. raw:: html
|
score |
fpkm |
id |
|
|
ENST00000456328 |
108.293111 |
1.118336 |
ENST00000515242 |
87.233019 |
0.830617 |
ENST00000518655 |
175.175609 |
2.367682 |
ENST00000473358 |
343.232679 |
9.795265 |
ENST00000408384 |
0.000000 |
0.000000 |
.. code:: python
knockdown.head()
.. raw:: html
|
score |
fpkm |
id |
|
|
ENST00000456328 |
290.752543 |
6.503301 |
ENST00000515242 |
253.364453 |
4.790326 |
ENST00000518655 |
23.190962 |
0.174388 |
ENST00000473358 |
510.475081 |
33.409877 |
ENST00000408384 |
0.000000 |
0.000000 |
These are two separate objects. It will be easier to work with the data
if we first combine the data into a single dataframe. For this we will
use standard `pandas` routines:
.. code:: python
# Merge control and knockdown into one DataFrame
df = pandas.merge(control, knockdown, left_index=True, right_index=True, suffixes=('_ct', '_kd'))
df.head()
.. raw:: html
|
score_ct |
fpkm_ct |
score_kd |
fpkm_kd |
id |
|
|
|
|
ENST00000456328 |
108.293111 |
1.118336 |
290.752543 |
6.503301 |
ENST00000515242 |
87.233019 |
0.830617 |
253.364453 |
4.790326 |
ENST00000518655 |
175.175609 |
2.367682 |
23.190962 |
0.174388 |
ENST00000473358 |
343.232679 |
9.795265 |
510.475081 |
33.409877 |
ENST00000408384 |
0.000000 |
0.000000 |
0.000000 |
0.000000 |
Now we'll create a :class:`metaseq.results_table.ResultsTable` out of
it:
.. code:: python
# Create a ResultsTable
d = ResultsTable(df)
:class:`ResultsTable` objects are wrappers around `pandas.DataFrame`
objects, and are useful for working with annotations and tablular data.
You can always access the `DataFrame` with the `.data` attribute:
.. code:: python
# DataFrame is always accessible via .data
print type(d), type(d.data)
.. parsed-literal::
The `metaseq` example data includes a GFF file of the genes on
chromosome 17 of the hg19 human genome assembly:
.. code:: python
# Get gene annotations for chr17
gtf = os.path.join(data_dir, 'Homo_sapiens.GRCh37.66_chr17.gtf')
print open(gtf).readline()
.. parsed-literal::
chr17 protein_coding exon 30898 31270 . - . gene_id "ENSG00000187939"; transcript_id "ENST00000343572"; exon_number "1"; gene_name "DOC2B"; gene_biotype "protein_coding"; transcript_name "DOC2B-201";
Subsetting data
---------------
The data we loaded from the knockdown experiment contains genes from all
chromosomes. For the sake of argument, let's say we're only interested
in the expression data for these genes on chr17. We can simply use
`pandas.DataFrame.ix` to subset dataframe by a list of genes. Note
that for this to work, the items in the list need to be in the index of
the dataframe. Since the data frame index consists of Ensembl transcript
IDs, we'll need to create a list of Ensembl transcript IDs on chromosome
17:
.. code:: python
# Get a list of transcript IDs on chr17, and subset the dataframe.
# Here we use pybedtools, but the list of names can come from anywhere
names = list(set([i['transcript_id'] for i in pybedtools.BedTool(gtf)]))
names.sort()
# Make a copy of d
d2 = d.copy()
# And subset
d2.data = d2.data.ix[names]
# How many did we omit?
print "original:", len(d.data)
print "chr17 subset:", len(d2.data)
.. parsed-literal::
original: 85699
chr17 subset: 5708
Scatterplots
------------
Let's plot some data. The :meth:`ResultsTable.scatter` method helps
with plotting genome-wide data, and offers lots of flexibility.
For its most basic usage, we need to at least supply `x` and `y`.
These are names of variables in the dataframe. We'll add more data
later, but for now, let's plot the FPKM of control vs knockdown:
.. code:: python
# Scatterplot of control vs knockdown FPKM
d2.scatter(
x='fpkm_ct',
y='fpkm_kd');
.. image:: example_session_2_files/example_session_2_30_0.png
If you're following along in a terminal with interactive `matplotlib`
plots, you can click on a point to see what gene it is. In this IPython
Notebook (and the HTML documentation generated from it), we don't have
that interactive ability. We can simulate it here by choosing a gene ID
to show, and then manually call the `_default_callback` like this:
.. code:: python
# arbitrary gene for demonstration purposes
interesting_gene = np.argmax(d2.fpkm_ct)
interesting_gene
.. parsed-literal::
'ENST00000253788'
.. code:: python
# What happens if you were to click on the points in an interactive session
d2._default_callback(interesting_gene)
.. parsed-literal::
score_ct 1047.517457
fpkm_ct 1422.448488
score_kd 1070.752317
fpkm_kd 1671.190119
Name: ENST00000253788, dtype: float64
Clicking around interactively on the points is a great way to get a feel
for the data.
OK, it looks like this plot could use log scaling. Recall though that
the `ResultsTable.scatter` method needs to have `x` and `y`
variables available in the dataframe. So one way to do this would be to
do something like this:
.. code:: python
# Adding extra variables gets verbose and cluttered
d2.data['log_fpkm_ct'] = np.log1p(d2.data.fpkm_ct)
But when playing around with different scales, this quickly pollutes the
dataframe with extra columns. Let's delete that column . . .
.. code:: python
# We'll use a better way, so delete it.
del d2.data['log_fpkm_ct']
. . . and show another way.
You may find it more streamlined to use the `xfunc` and/or `yfunc`
arguments. We can use any arbitrary function for these, and the axes
labels will reflect that.
Since we're about to start incrementally improving the figure by adding
additional keyword arguments (kwargs), the stuff we've already talked
about will be at the top, and a comment line like this will mark the
start of new stuff to pay attention to:
::
# ------------- (marks the start of new stuff)
Here's the next version of the scatterplot:
.. code:: python
# Scale x and y axes using log2(x + 1)
def log2p1(x):
return np.log2(x + 1)
d2.scatter(
x='fpkm_ct',
y='fpkm_kd',
#----------------
xfunc=log2p1,
yfunc=log2p1,
);
.. image:: example_session_2_files/example_session_2_39_0.png
Of course, we can specify axes labels either directly in the method call
with `xlab` or `ylab`, or after the fact using standard
`matplotlib` functionality:
.. code:: python
# Manually specify x and y labels
ax = d2.scatter(
x='fpkm_ct',
y='fpkm_kd',
xfunc=log2p1,
yfunc=log2p1,
#-----------------------------
# specify xlabel
xlab='Control, log2(FPKM + 1)'
);
# adjust the ylabel afterwards
ax.set_ylabel('Knockdown, log2(FPKM + 1)');
.. image:: example_session_2_files/example_session_2_41_0.png
Let's highlight some genes. How about those that change expression > 2
fold in upon knockdown in red, and < 2 fold in blue? While we're at it,
let's add another variable to the dataframe.
.. code:: python
# Crude differential expression detection....
d2.data['foldchange'] = d2.fpkm_kd / d2.fpkm_ct
up = (d2.foldchange > 2).values
dn = (d2.foldchange < 0.5).values
The way to highlight genes is with the `genes_to_highlight` argument.
OK, OK, it's a little bit of a misnomer here because we're actually
working with transcripts. But the idea is the same.
The `genes_to_highlight` argument takes a list of tuples. Each tuple
consists of two items: an index (boolean or integer, doesn't matter) and
a style dictionary. This dictionary is passed directly to
`matplotlib.scatter`, so you can use any supported arguments here.
.. note::
There are actually other kwargs you can use in the style dictionary -- for example, `names`, `marginal_histograms`, `xhist_kwargs`, and `yhist_kwargs`. These are `metaseq`-specific and will be explained later.
Here's the plot with up/downregulated genes highlighted:
.. code:: python
# Use the genes_to_highlight argument to show up/downregulated genes
# in different colors
d2.scatter(
x='fpkm_ct',
y='fpkm_kd',
xfunc=log2p1,
yfunc=log2p1,
xlab='Control, log2(FPKM + 1)',
ylab='Knockdown, log2(FPKM + 1)',
#-------------------------------
genes_to_highlight=[
(up, dict(color='#da3b3a')),
(dn, dict(color='#00748e'))]
);
.. image:: example_session_2_files/example_session_2_47_0.png
We can add a 1-to-1 line for reference:
.. code:: python
# Add a 1:1 line
d2.scatter(
x='fpkm_ct',
y='fpkm_kd',
xfunc=log2p1,
yfunc=log2p1,
xlab='Control, log2(FPKM + 1)',
ylab='Knockdown, log2(FPKM + 1)',
genes_to_highlight=[
(up, dict(color='#da3b3a')),
(dn, dict(color='#00748e'))],
#------------------------------------------
one_to_one=dict(color='r', linestyle='--'),
);
.. image:: example_session_2_files/example_session_2_49_0.png
Let's change the plot style a bit. The `general_kwargs` argument
determines the base style of all points. By default, it's
`dict(color='k', alpha=0.2, linewidths=0)`. Let's change the default
style to smaller gray dots, and make the red and blue stand out more by
adjusting their alpha:
.. code:: python
# Style changes:
# default gray small dots; make changed genes stand out more
d2.scatter(
x='fpkm_ct',
y='fpkm_kd',
xfunc=log2p1,
yfunc=log2p1,
xlab='Control, log2(FPKM + 1)',
ylab='Knockdown, log2(FPKM + 1)',
one_to_one=dict(color='k', linestyle=':'),
#------------------------------------------------------
genes_to_highlight=[
(up, dict(color='#da3b3a', alpha=0.8)),
(dn, dict(color='#00748e', alpha=0.8))],
general_kwargs=dict(marker='.', color='0.5', alpha=0.2, s=5),
);
.. image:: example_session_2_files/example_session_2_51_0.png
Marginal histograms
-------------------
:mod:`metaseq` also offers support for marginal histograms, which are
stacked up on either axes for each set of genes that were plotted. There
are lots of ways for configuring this. First, let's turn them on for
everything:
.. code:: python
# Add marginal histograms
d2.scatter(
x='fpkm_ct',
y='fpkm_kd',
xfunc=log2p1,
yfunc=log2p1,
xlab='Control, log2(FPKM + 1)',
ylab='Knockdown, log2(FPKM + 1)',
genes_to_highlight=[
(up, dict(color='#da3b3a', alpha=0.8)),
(dn, dict(color='#00748e', alpha=0.8))],
one_to_one=dict(color='k', linestyle=':'),
general_kwargs=dict(marker='.', color='0.5', alpha=0.2, s=5),
#------------------------------------------------------
marginal_histograms=True,
);
.. image:: example_session_2_files/example_session_2_53_0.png
As a contrived example to illustrate the flexibility for plotting
marginal histograms, lets:
- only show histograms for up/down regulated
- change the number of bins to 50
- remove the edge around each bar
.. code:: python
# Tweak the marginal histograms:
# 50 bins, don't show unchanged genes, and remove outlines
d2.scatter(
x='fpkm_ct',
y='fpkm_kd',
xfunc=log2p1,
yfunc=log2p1,
xlab='Control, log2(FPKM + 1)',
ylab='Knockdown, log2(FPKM + 1)',
one_to_one=dict(color='k', linestyle=':'),
general_kwargs=dict(marker='.', color='0.5', alpha=0.2, s=5),
#------------------------------------------------------
# Go back go disabling them globally...
marginal_histograms=False,
# ...and then turn them back on for each set of genes
# to highlight.
#
# By the way, genes_to_highlight is indented to better show the
# the structure.
genes_to_highlight=[
(
up,
dict(
color='#da3b3a', alpha=0.8,
marginal_histograms=True,
xhist_kwargs=dict(bins=50, linewidth=0),
yhist_kwargs=dict(bins=50, linewidth=0),
)
),
(
dn,
dict(
color='#00748e', alpha=0.8,
marginal_histograms=True,
xhist_kwargs=dict(bins=50, linewidth=0),
yhist_kwargs=dict(bins=50, linewidth=0),
)
)
],
);
.. image:: example_session_2_files/example_session_2_55_0.png
Let's clean up the plot by adding a legend (using `label` in
`genes_to_highlight`), and adding it outside the axes. While we're at
it we'll add a title, too.
There's a trick here -- for each set of genes, the histograms are
incrementally added on top of each other but the legend, lists them
going down. So we need to flip the order of legend entries to make it
nicely match the order of the histograms.
.. code:: python
matplotlib.rcParams['font.family'] = "Arial"
ax = d2.scatter(
x='fpkm_ct',
y='fpkm_kd',
xfunc=log2p1,
yfunc=log2p1,
xlab='Control, log2(FPKM + 1)',
ylab='Knockdown, log2(FPKM + 1)',
one_to_one=dict(color='k', linestyle=':'),
marginal_histograms=False,
#------------------------------------------------------
# add the "unchanged" label
general_kwargs=dict(marker='.', color='0.5', alpha=0.2, s=5, label='unchanged'),
genes_to_highlight=[
(
up,
dict(
color='#da3b3a', alpha=0.8,
marginal_histograms=True,
xhist_kwargs=dict(bins=50, linewidth=0),
yhist_kwargs=dict(bins=50, linewidth=0),
# add label
label='upregulated',
)
),
(
dn,
dict(
color='#00748e', alpha=0.8,
marginal_histograms=True,
xhist_kwargs=dict(bins=50, linewidth=0),
yhist_kwargs=dict(bins=50, linewidth=0),
# add label
label='downregulated'
)
)
],
);
# Get handles and labels, and then reverse their order
handles, legend_labels = ax.get_legend_handles_labels()
handles = handles[::-1]
legend_labels = legend_labels[::-1]
# Draw a legend using the flipped handles and labels.
leg = ax.legend(handles,
legend_labels,
# These values may take some tweaking.
# By default they are in axes coordinates, so this means
# the legend is slightly outside the axes.
loc=(1.01, 1.05),
# Various style fixes to default legend.
fontsize=9,
scatterpoints=1,
borderpad=0.1,
handletextpad=0.05,
frameon=False,
title='chr17 transcripts',
);
# Adjust the legend title after it's created
leg.get_title().set_weight('bold')
.. image:: example_session_2_files/example_session_2_57_0.png
We'd also like to add a title. But how to access the top-most axes?
Whenever the `scatter` method is called, the `MarginalHistograms`
object created as a by-product of the plotting is stored in the
`marginal` attribute. This, in turn, has a `top_hists` attribute,
and we can grab the last one created. While we're at it, let's
histograms axes as well.
.. code:: python
# Another trick: every time `d2.scatter` is called,
top_axes = d2.marginal.top_hists[-1]
top_axes.set_title('Differential expression, ATF3 knockdown');
for ax in d2.marginal.top_hists:
ax.set_ylabel('No.\ntranscripts', rotation=0, ha='right', va='center', size=8)
for ax in d2.marginal.right_hists:
ax.set_xlabel('No.\ntranscripts', rotation=-90, ha='left', va='top', size=8)
fig = ax.figure
fig.savefig('expression-demo.png')
fig
.. image:: example_session_2_files/example_session_2_59_0.png