Tuesday, January 28, 2014

What's the DEal? Differential Expression using RSEM

We've been looking for ways to analyze transcriptomes correctly, with sufficient power, not too many type I and II errors, and not much fuss.  For those relatively unfamiliar with performing differential expression analyses on RNA-seq data, a great review of the statistical methods employed to analyze these data can be found here.

What it all comes down to is the fundamental problem associated with RNA-seq experiments -- the apparent absence of a single transcript could be due to down-regulation of that gene OR to the up-regulation of ANOTHER gene.  That's right, what you are measuring are RELATIVE expression levels, and given libraries of the same size, you cannot accurately distinguish the first scenario from the second unless you've spiked the libraries with some standards of known quantity (which, interestingly enough, has been done before with success by Mary Ann Moran's group here).
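To make the relative-expression problem concrete, here is a minimal sketch with made-up numbers (the gene names and abundances are hypothetical, not from any real library). Only geneC changes between the two samples, yet the measured fractions of geneA and geneB drop as well:

```python
def relative_fractions(abundances):
    """Convert absolute transcript abundances into fractions of the total pool."""
    total = sum(abundances.values())
    return {gene: n / total for gene, n in abundances.items()}

# Hypothetical absolute abundances (copies per cell); only geneC differs.
sample_a = {"geneA": 100, "geneB": 100, "geneC": 100}
sample_b = {"geneA": 100, "geneB": 100, "geneC": 400}

frac_a = relative_fractions(sample_a)
frac_b = relative_fractions(sample_b)

for gene in sample_a:
    # geneA and geneB look "down-regulated" in sample_b even though
    # their absolute abundance is unchanged.
    print(gene, round(frac_a[gene], 3), round(frac_b[gene], 3))
```

Sequencing a fixed number of reads samples these fractions, not the absolute abundances, which is why the two scenarios look the same without an external reference.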

From Mary Ann Moran's paper on the subject, we have this very nice depiction of the problems associated with sampling depth, relative number of reads, etc.:

It is very difficult to distinguish between samples 1 and 2 unless you can take library size into consideration or know, using internal standards, how the number of reads translates to the number of copies.  Even if you use internal standards, it should be noted that RNAs have varying half-lives due to their own specific secondary structures, potential protective modifications, etc. Therefore, there will always be some stochasticity associated with the sampling that will reverberate in your final counts of reads.
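The arithmetic behind an internal standard can be sketched as follows. This is a simplified illustration with hypothetical numbers; it assumes the standard is recovered at the same rate as every native transcript, which (given the half-life caveat above) will not hold exactly in practice.

```python
def copies_per_read(standard_copies_added, standard_reads_recovered):
    """How many original molecules each sequenced read represents."""
    if standard_reads_recovered <= 0:
        raise ValueError("no reads recovered from the internal standard")
    return standard_copies_added / standard_reads_recovered

def estimate_copies(gene_reads, standard_copies_added, standard_reads_recovered):
    """Scale a gene's read count to an estimated absolute copy number."""
    scale = copies_per_read(standard_copies_added, standard_reads_recovered)
    return gene_reads * scale

# Hypothetical example: 1,000,000 standard molecules spiked in, 500 reads recovered.
scale = copies_per_read(1_000_000, 500)
gene_copies = estimate_copies(50, 1_000_000, 500)
print(scale, gene_copies)
```

With a scale factor like this in hand for each library, read counts can be compared as estimated copy numbers rather than as fractions of the library.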

I have been playing with the program RSEM to calculate both FPKM (fragments per kilobase per million mapped reads, = [# of fragments]/[length of transcript in kilobases]/[millions of mapped reads]) and TPM (transcripts per million) values.  In the RSEM publication, the authors convincingly argue that the TPM metric is a much better way of comparing between libraries -- much better than RPKM (or FPKM) alone.  The reason? Libraries are not all of the same size, and because the measurements are relative, an increase in expression of any particular gene in one library necessarily comes at the expense of the other genes. Also, RSEM uses a statistical model to take into account the uncertainty associated with read mapping - especially important in transcriptomics, where multiple isoforms exist.  Oooh... also, RSEM doesn't require a reference genome -- awesome!
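For anyone who wants to see the two metrics side by side, here is a minimal sketch that computes both from raw fragment counts and transcript lengths. The counts and lengths are hypothetical, and this is not RSEM's implementation: RSEM works from model-based expected counts and effective lengths, whereas this uses plain counts and full lengths.

```python
def fpkm(counts, lengths_bp):
    """Fragments per kilobase of transcript per million mapped fragments."""
    total_millions = sum(counts.values()) / 1e6
    return {g: counts[g] / (lengths_bp[g] / 1e3) / total_millions for g in counts}

def tpm(counts, lengths_bp):
    """Transcripts per million: length-normalized rates rescaled to sum to 1e6."""
    rates = {g: counts[g] / lengths_bp[g] for g in counts}
    total_rate = sum(rates.values())
    return {g: rates[g] / total_rate * 1e6 for g in counts}

# Hypothetical library: fragment counts and transcript lengths in base pairs.
counts = {"geneA": 500, "geneB": 1000, "geneC": 1500}
lengths_bp = {"geneA": 1000, "geneB": 4000, "geneC": 500}

fpkm_values = fpkm(counts, lengths_bp)
tpm_values = tpm(counts, lengths_bp)

# Within one library, TPM is just FPKM rescaled so that the values sum to 1e6.
fpkm_total = sum(fpkm_values.values())
tpm_from_fpkm = {g: v / fpkm_total * 1e6 for g, v in fpkm_values.items()}

for g in counts:
    print(g, round(fpkm_values[g], 1), round(tpm_values[g], 1))
```

Because TPM always sums to one million in every library, a TPM value reads directly as a share of the transcript pool, whereas the FPKM total varies from library to library.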

RSEM's output provides both FPKM values and TPM values, the latter being an estimate of the fraction of transcripts made up by a given gene.  I was curious to know how each of these measures would perform on environmental data -- one would assume that they would be correlated! I used RSEM (rsem-calculate-expression --calc-ci --paired-end) on a set of Illumina libraries and found...

that FPKM and TPM values are amazingly well correlated; within a library, sorting by FPKMs or TPMs will give you the same result.  But, what happens when you compare between libraries? Same answer. At least in the data I used, comparing two libraries using FPKMs or TPMs results in the same answer with regard to differential expression. That said, I rest easier knowing that, for the TPM values it generates, RSEM also provides 95% credibility intervals (that's what --calc-ci is for), helping me to better assess statistical differences between libraries.
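One quick (and deliberately conservative) way to use those intervals is to flag genes whose 95% intervals do not overlap between two libraries. The sketch below uses hypothetical interval bounds typed in by hand; it does not parse RSEM's output files, and non-overlap is a rough screening heuristic rather than a formal test of differential expression.

```python
def intervals_overlap(ci_a, ci_b):
    """Return True if two (lower, upper) intervals share any values."""
    (lo_a, hi_a), (lo_b, hi_b) = ci_a, ci_b
    return lo_a <= hi_b and lo_b <= hi_a

# Hypothetical 95% TPM intervals for the same genes in two libraries.
library_1 = {"geneA": (120.0, 180.0), "geneB": (40.0, 75.0)}
library_2 = {"geneA": (200.0, 260.0), "geneB": (60.0, 95.0)}

# Genes whose intervals are fully separated are candidates for follow-up.
candidates = [
    gene for gene in library_1
    if not intervals_overlap(library_1[gene], library_2[gene])
]
print(candidates)
```

Genes that pass this screen are worth a closer look with a proper differential expression method; genes that fail it may still differ, since overlapping intervals do not prove equality.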

In all that spare time you have, you can compare edgeR's gene-specific Bayesian modeling, found here, to RSEM, the statistical software I'll be exploring today, found here.
Oh, and here's another nice review, http://www.biomedcentral.com/content/pdf/gb-2010-11-12-220.pdf

Monday, January 20, 2014

How does an obligately intracellular symbiont maintain genetic diversity? The Wolbachia story

I recently had the pleasure of finally sitting down to read some publications (both open access!) on my favorite bacterium, Wolbachia pipientis.  These recent pubs interested me because they focused on the population genetics of Wolbachia within individual hosts, upon host transfer, and after many generations.  The BIG question that comes out of this body of work, in my mind, is how low-titer strains in the maternally transmitted population are maintained.  (We can discuss ongoing hypotheses at the end of this post.)

The first paper I'll tackle (Schneider et al.) asks if Wolbachia strains exist as diverse quasi-species within a host and reveals that diversity using host transfer techniques.  In "Uncovering Wolbachia Diversity upon Artificial Host Transfer", the authors use the cherry fruit fly Wolbachia (wCer strains) as the inoculum for injection of two new hosts: Drosophila simulans or Ceratitis capitata.  For those unfamiliar with the technique, what it comes down to is harvesting many, many embryos from your D. simulans, using differential centrifugation techniques to concentrate the Wolbachia fraction, and then injecting that fraction much as you would microinject a construct to make transgenic flies.

The cool thing about this paper is that they see cryptic polymorphisms rise after host transfer.  They looked at 150 generations after microinjection and saw a low-titer variant increase in frequency such that it was detectable via PCR.  Now, the data in this paper are entirely PCR based -- they sequenced amplified fragments and used them to detect SNPs.  That said, if found to be true, it suggests that the host and symbiont evolve really rapidly and that Wolbachia maintains diversity, even under conditions when it should be primarily maternally transmitted (lab stocks).

The second paper I'm highlighting (Symula et al., 2013) investigated the diversity of Wolbachia in tsetse fly populations and correlated Wolbachia haplotypes with specific host mtDNA haplotypes.  Their result = LOTS of Wolbachia diversity and evidence that these infections happened independently, multiple times.  The authors collected tsetse flies across a region in Africa and did an analysis of the Wolbachia MLST genes and groEL - they also looked at host mtDNA haplotypes. Again, they used PCR amplification and sequencing but were VERY conservative in their sequence post-processing (removing all recombinants, for example).  So, the data they present are potentially a lower-bound estimate of Wolbachia diversity.  The number of haplotypes found within each host was astounding (see Table 1).  In some cases, 6 different haplotypes were found within just 2 hosts!

Mechanisms for maintaining genetic diversity in a maternally transmitted symbiont? 

1) Bend the rules:
During my doctoral work, a lab member discovered that there was cryptic diversity within the maternally transmitted endosymbionts of the deep sea clams. In that work, they discovered that a low-frequency (0.02) symbiont haplotype existed in a population of clams that were geographically localized.  It was hypothesized there that the trick to maintaining diversity in this maternally transmitted symbiont was to basically bend the rules: occasionally, transmit your symbiont horizontally.  Since we find evidence of horizontal transmission in Wolbachia, this is one mechanism by which genetic diversity could be maintained in the population.

2) Increase mutation rates:
It would be theoretically possible for an endosymbiont to have such rapid rates of mutation that individual populations within a single host would exhibit variability detectable by the methods employed by Schneider et al. and Symula et al.  Evolutionary rates are elevated in endosymbionts, so this is a potential source of new genetic diversity for Wolbachia.

It will be quite interesting to see which of these scenarios (or whether both) plays a role in Wolbachia genomic evolution.  These changes in symbiont population dynamics and densities could potentially allow Wolbachia to colonize new hosts, potentially acting as a quasi-species (as seen in virus systems).