Thinking about sampling scales in host-associated microbiota

A recent publication by RT Jones, L.G. Sanchez, and N Feirer: "A cross-taxon analysis of insect-associated bacterial diversity" started me thinking about how we sample insects. You can see the article here at PLoSONE:

http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0061218

In this publication, the authors perform a large collection of diverse insects and then subsequently a 16S rRNA gene amplicon analysis. They use the V1-V2 region and 454 pyrosequencing. They use whole insects as the source of material for the DNA extractions and perform a set of bioinformatics processing steps downstream in an attempt to reduce artifacts that may lead to inflation of diversity estimates in their samples. From their published methods, this seems to be the order in which reads were processed:

1. Reads are truncated (to compare across the same 16S positions) and removed “low quality reads” using QIIME’s default settings – I am not sure if amplicon noise removal was used as implemented therein nor did they mention any alignment to any 16S model.

2. Sequences are binned at 97% identity using uclust and the authors picked the most abundant sequence for downstream analyses

3. Then, phylotypes that did not represent at least 1% of community membership within the sample were removed. I should point out that from their methods, at this point, it seems that each sample is a different total size so that this 1% will necessarily be a different total number of sequences removed from each.

4. Any sample that does that have at least 500 sequences is removed. Perhaps because that number was small – they say the minimum was 61 in some cases.

In the end, across all samples, they were left with 477 unique “phylotypes” (read: 97% identity OTUs). Their goals in this study were to identify links between diet and microbiome composition and also analyze the diversity of microbes found associated with insects of different, diverse genera. They do not find any effect of diet, at least based on their taxonomic classification of hosts (morphological) and published understandings of what these hosts might consume. See Figure 2 from the paper below. Let’s explore whether or not, given the sampling, this a surprising result.

Figure 2. Bray-Curtis cluster of insect species based on their associated bacterial communities (all 477 bacterial phylotypes used for Bray-Curtis analysis) and Z-scores of the 96 most abundant bacterial phylotypes with lowest scores in light blue and highest scores in dark blue.

Each column is a unique bacterial phylotype. Phylotypes are arranged according to taxonomic classification. Insect species are identified by a four-letter code (Table 1) with the first letter indicating the order, as follows: (C) Coleoptera, (D) Diptera, (H) Hemiptera, (I) Isoptera, (L) Lepidoptera, (N) Neuroptera, (S) Siphonoptera, and (Y) Hymenoptera.

doi:10.1371/journal.pone.0061218.g002

Sampling of insects

How many bacterial cells do you suppose cover the surfaces of a single insect? How many live within their cells in specialized organs or within their reproductive tracts? Estimates based on microscopic observations and qPCR data on single copy functional genes suggest that insects infected with bacterial symbionts harbor upwards of 10⁹bacterial cells (for that one, specific symbiont) [1, 2] and indeed, 200,000 Buchnera are estimated to be transmitted in each single generation [3]. Ok, so if your insect happens to be infected with a bacterial symbiont, your dataset will include this organism in abundance (as found by the study reviewed here for several of their samples – some of which were >90% Wolbachia). Also, these kinds of symbioses tend to be very species specific – a single rRNA gene phylotype is likely to be found for these bacteria. How many other bacterial cells do you think there are on and in this host? Do you think you can effectively sample them in the presence of an over-abundant single phylotype?

Now let’s consider the fact that when most of us sample insects, we grind up the entire body. This isn’t just a laziness issue, it’s due to the fact that insects are small, difficult to dissect, and often preserved in ways that make it even more difficult to perform the dissection post-mortem. How many different habitats do you think we are combining into a single sample when we grind up entire insects? At least, from my perspective (Wolbachia-centric and all), the reproductive tract and the gut are distinct and that’s not considering possible specialized organs. When we sampled the honey bee gut vs. the entire body of the bee, we saw considerable differences in terms of both diversity sampled and in the composition of the community, largely due to the rank abundance distribution of the reads and the ability to deeply sample the gut vs. a diverse set of habitats on the honey bee [4]. Now consider that in the Jones et al study, we got 500 sequences from each insect. Do we still think it reasonable to perform the kind of frequency-based OTU-culling performed here and by others?

Consider that for sampling the human microbiome, we very deeply probe – 2,000 sequences or more – individual stool samples (containing a subset of the gut, at best). How many OTUs do you think we would find if we ground up an entire human and used 1g of that homogenate to run a PCR and produced 500 sequences?

To be fair, the authors are aware of the sampling limitation of their work – the fact that they used whole insects – and address it head on here:

“Our research, however, has two limitations that may restrict our ability to detect the effect of diet on bacterial community composition: 1) the broad diversity of insect samples, and 2) the use of whole insects. A study focused on a less diverse group of insects with varied diets may be better suited to resolve the effects of taxonomy and diet on bacterial communities. By using whole insects, we maximized our detection of insect-associated bacteria but also included endosymbionts (i.e. intracellular bacteria) in our analyses.”

I would argue that by using whole insects they maximized their ability to detect very abundant symbionts but they do not sample all habitats on the host evenly.

What does frequency-based culling do to our data? Why do it?

One of the first studies that explored potential contaminants in the “rare-biosphere” is the E.coli sequencing paper by Kunin et al published back in 2009 [5]. In that study, the authors sequence a laboratory strain of E. coli (a mock community of a single organism with multiple rRNA operons). They find multiple OTUs that are not classified as E. coli and pass the QC filtration steps used by others previously. In their own words:

"These contaminants represent only 0.03% of the reads obtained in the present study and suggest that all PCR-based surveys that use broad-specificity primers will likely suffer from similar low-level background contamination, a point worth bearing in mind when interpreting rare biosphere data."[5]

How do we incorporate these results in our data interpretation from natural communities? How do we bias our dataset when we set an arbitrary threshold for how many times an OTU must occur before we consider it “real”?

Several studies following the Kunin et al analysis chose not to perform a frequency-based cull (or at least did not report it in their methods): the Dominguez-Bello infant study used 2,000 reads per sample [6], the Caporaso study used 3.1 million reads, deciding that 2,000 was sufficient to characterize that environment [7]. Some other authors have taken advantage of the rank abundance curve to quality filter their reads without completely deleting the rare biosphere [8], and others have pointed out that severe amplicon noise filtration, and indeed any diversity metric has to make some assumptions about the underlying population from which we are sampling (an unknown) and that amplicon noise reduction algorithms may not be based on actual read characteristics and can bias the data [9, 10].

All noise filtration pipelines not strictly focused on Phred scores and read-specific parameters make assumptions about the underlying, unsampled community. By removing OTUs that don’t exist at 1% or greater in your single sample you are necessarily making an assumption about the diversity you expect to find in that habitat and biasing the results. It’s one thing to consider removing singletons - reads that occur only once and may be contamination - but quite another to consider only OTUs at a larger, arbitrary frequency/percentage.

One of the results in the Jones et al paper really highlights the fact that they did not deeply sample their insects. Consider the termite (Isoptera:Termitidae) called Coptotermes formosanus. This wood-eating termite is host to many diverse bacteria and protists – it’s often used in microbiology courses focusing on symbiosis due to the neat morphologies and behaviors you can observe under the microscope after dissection of the hind gut. Others who have more deeply sampled the termite gut (admittedly, a different species) have found between 300 - 700 bacterial OTUs present there (depending on the primer set used see [11]), and a metagenomic sampling of Coptotermes formosanus itself also revealed a large amount of bacterial diversity [12]. Although we don’t have the raw number of OTUs found in the Coptotermes sample in the Jones et al study you can see the average richness in Table 2 is listed as 5.83. This is extremely low based on our understanding of the termite gut and expectations from these natural environments.

1. Ijichi, N., et al., Internal spatiotemporal population dynamics of infection with three Wolbachia strains in the adzuki bean beetle, Callosobruchus chinensis (Coleoptera: Bruchidae). Appl Environ Microbiol, 2002. 68(8): p. 4074-80.

2. Sakurai, M., et al., Rickettsia symbiont in the pea aphid Acyrthosiphon pisum: novel cellular tropism, effect on host fitness, and interaction with the essential symbiont Buchnera. Appl Environ Microbiol, 2005. 71(7): p. 4069-75.

3. Mira, A. and N.A. Moran, Estimating population size and transmission bottlenecks in maternally transmitted endosymbiotic bacteria. Microb Ecol, 2002. 44(2): p. 137-43.

4. Mattila, H.R., et al., Characterization of the active microbiotas associated with honey bees reveals healthier and broader communities when colonies are genetically diverse. PLoS One, 2012. 7(3): p. e32962.

5. Kunin, V., et al., Wrinkles in the rare biosphere: pyrosequencing errors can lead to artificial inflation of diversity estimates. Environ Microbiol, 2010. 12(1): p. 118-23.

6. Dominguez-Bello, M.G., et al., Delivery mode shapes the acquisition and structure of the initial microbiota across multiple body habitats in newborns. Proc Natl Acad Sci U S A, 2010. 107(26): p. 11971-5.

7. Caporaso, J.G., et al., Global patterns of 16S rRNA diversity at a depth of millions of sequences per sample. Proc Natl Acad Sci U S A, 2011. 108 Suppl 1: p. 4516-22.

8. Reeder, J. and R. Knight, Rapidly denoising pyrosequencing amplicon reads by exploiting rank-abundance distributions. Nat Methods, 2010. 7(9): p. 668-9.

9. Haegeman, B., et al., Robust estimation of microbial diversity in theory and in practice. ISME J, 2013.

10. Gaspar, J.M. and W.K. Thomas, Assessing the consequences of denoising marker-based metagenomic data. PLoS One, 2013. 8(3): p. e60458.

11. Engelbrektson, A., et al., Experimental factors affecting PCR-based estimates of microbial species richness and evenness. ISME J, 2010. 4(5): p. 642-7.

12. Xie, L., et al., Profiling the metatranscriptome of the protistan community in Coptotermes formosanus with emphasis on the lignocellulolytic system. Genomics, 2012. 99(4): p. 246-55.