Tuesday, January 28, 2014

What's the DEal? Differential Expression using RSEM

We've been looking for ways to analyze transcriptomes correctly, with sufficient power, not too many type I and II errors, and not much fuss.  For those relatively unfamiliar with performing differential expression analyses on RNA-seq data, a great review of the statistical methods employed to analyze these data can be found here.

What it all comes down to is the fundamental problem associated with RNA seq experiments -- the absence of a single transcript could be due to down regulation OR, could be due to the up regulation of ANOTHER gene.  That's right, what you are measuring are RELATIVE expression levels, and given libraries of the same size, you cannot accurately distinguish the first scenario from the second unless you've spiked the libraries with some standards of known quantity (which, interestingly enough, has been done before with success by Mary Ann Moran's group here).

From Mary Ann Moran's paper on the subject, we have this very nice depiction of the problems associated with sampling depth, relative number of reads, etc.:

It is very difficult to distinguish between samples 1 and 2 unless you can take into consideration library size or know, using internal standards how number of reads translates to number of copies.  Even if you use internal standards, it should be noted that RNAs have varying half-lives due to their own specific secondary structures, potential protective modifications, etc. Therefore, there will always be some stochasticity associated with the sampling that will reverberate in your final counts of reads.

I have been playing with the program RSEM to calculate both FPKM (fragments per killobase per million mapped reads, = [# of fragments]/[length of transcript in kilo base]/[million mapped reads]) and TPM (transcripts per million mapped reads) values.  In the RSEM publication, the authors convincingly argue that the TPM metric is a much better way of comparing between libraries -- much better than RPKM (or FPKM) alone.  The reason? Libraries are not all of the same size and it is necessarily the case that an increase in expression of any particular gene in one library will lead to the exclusion of other genes. Also, RSEM uses a statistical model to take into account the uncertainty associated with read mapping - especially in transcriptomics where multiple isoforms exist.  Oooh... also, RSEM doesn't require a reference genome -- awesome!

 RSEM's output provides both FPKM values as well as the TPM values, an estimated fraction of ttranscripts made up by a given gene.  I was curious to know how each of these measures would perform on environmental data -- one would assume that they would be correlated! I used RSEM (-rsem-calculate-expression –calc-ci –paired-end) on a set of illumina libraries and found...



that FPKM and TPM values are amazingly well correlated; within a library, sorting by FPKMs or TPMs will give you the same result.  But, what happens when you compare between libraries? Same answer. At least in the data I used, comparing two libraries using FPKM or TPMs results in the same answer with regards to differential expression. That said, I rest easier knowing that for the TPM values generated RSEM also provides 95% confidence intervals, helping me to better assess statistical differences between libraries.

In all that spare time you have, you can compare Edge-R's gene-specific bayesian modeling  found here to RSEM, the statistical software I'll be exploring today found here.
Oh, and here's another nice review, http://www.biomedcentral.com/content/pdf/gb-2010-11-12-220.pdf

No comments:

Post a Comment