Friday, October 5, 2012

Quick update - our paper is out

The contents of this blog + more detailed analyses and text are available in our BMC Microbiology paper here:

http://www.biomedcentral.com/1471-2180/12/221/abstract

One reviewer had a very interesting suggestion: they asked us to add the average bootstrap scores to our heat map figure so that readers could get a sense of sequences that may be "novel", that is, sequences that the RDPII-NBC hadn't seen before.  See the PLoS ONE article by Lan et al (http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0032491).  As a result you can clearly see that the greengenes training set is the most diverse and best at capturing the diversity found within the honey bee gut (Figure 2A).  That said, a fair number of unique sequences (>1000 out of ~4000) are still unclassified using this training set.  The classifications improve with the addition of honey bee gut specific sequences, as do average bootstrap scores (Figure 2B).  Also interesting, and expected, is that training sets based on the 16S clone library data generated previously by others were unable to fully classify the diversity found in the 454 dataset (Figure 2C).  This shows up both as a drop in confidence (a drop in average bootstrap scores across the board) and as the inability of the RDPII-NBC to classify 650 of these sequences.

Wednesday, August 8, 2012

The utility of bacterial nomenclature


Lately I've been thinking a lot about the "culture" behind bacterial nomenclature.  Of course there are the extremes: Shigella and Escherichia coli are classified as different genera but often the phenotype used to characterize these strains is horizontally transmitted (a plasmid harboring pathogenicity determinants) [1].  Then you've got the case of Wolbachia pipientis, the bug inside a bug that my lab currently investigates.  Researchers in the field have decided not to name each strain found in each distinct host, regardless of the divergence between strains [2] (but see [3] for an interesting counterpoint).  Obviously, the species concept in bacteria is extremely difficult to define and has been reviewed at length elsewhere [4-6].  What is absolutely true is that the markers we use to characterize diversity in the environment (be it the rRNA, core proteins, or enzymes) are simply that – markers.  They do not tell us whether or not these organisms are similar in function, phenotype, or genomic structure.  For example, organisms with nearly identical 16S rRNA genes (such as Shigella and Escherichia) can exhibit dramatically different phenotypes during infection.  However, I have yet to find a set of organisms for which 16S rRNA gene divergence doesn’t correlate with genomic divergence.  That is to say, although 16S rRNA gene similarity may obfuscate genomic divergence, differences at this locus necessarily correspond to genomic differences (please do respond if you know of a counter example!).

So, what is the point of naming bacterial species if we don’t have a species concept? The fundamental utility of nomenclature is to be certain that groups are discussing or researching the same organism: what Ralph Isberg’s lab calls Legionella pneumophila Philadelphia 1 should be the same strain (well, taking generations in two different labs into account) as what was sequenced back in 2004 [7].  Most labs utilize the 16S rRNA gene to taxonomically classify organisms.  Now, this is a tricky proposition to begin with because we don’t have a good sense of a bacterial species concept.  That said, a generally agreed upon threshold for the divergence between species based on the 16S rRNA gene is 3%. That is, we expect organisms that are of the same “species” to be 97% similar or more at that locus.  For example, if I isolated an organism with >97% identity to a Legionella strain, I’d name it after the known species.
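To make the 97% rule concrete, here is a minimal Python sketch. It assumes two 16S sequences that have already been aligned to the same length, and it skips any column containing a gap, which is only one of several conventions for computing identity. The function names are mine, for illustration only.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity over aligned columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("sequences must come from the same alignment")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a.upper(), aligned_b.upper()):
        if a == "-" or b == "-":
            continue  # assumption: gapped columns are ignored
        compared += 1
        if a == b:
            matches += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return 100.0 * matches / compared


def same_species(aligned_a: str, aligned_b: str, threshold: float = 97.0) -> bool:
    """Apply the conventional 16S cutoff: at or above the threshold counts as one 'species'."""
    return percent_identity(aligned_a, aligned_b) >= threshold
```

Two sequences that differ at 3 of 100 aligned positions would pass this test; 4 differences would not.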

What about when exploring relatively “novel” (read: underexplored) groups of bacteria?  I think it’s interesting to consider how researchers in that specific field deal with the taxonomic task ahead of them: do you lump groups together or do you split them? Do you come up with novel names or do you name them after isolates?  Let’s consider the honey bee gut microbiota, something our group has been investigating recently.  There are some clades, or phylogenetically related groups, that have been considered important by others in the field [In a previous blog post I presented a phylogeny of all near-full length 16S sequences in Genbank within this framework: here].  As you can tell from the phylogeny, these groups are quite diverse.  In fact, the percent divergence within each of these clades, based on the 16S rRNA gene, is quite large (above 10% by nearest neighbor clustering). So, these clades clearly represent something above the species level, perhaps the family or order level.  Interestingly, there are several isolated strains that clade with these groups (Table 1), many isolated from non-honey bee sources, suggesting that the groups may not be as bee-specific as previously thought.  Recently, two new genus and species names were proposed for the so-called Beta and Gamma-1 groups [8]; importantly, beyond Enterobacteriaceae bacterium Acj204, there are no previously named isolates within these two clades.  Because within these groups there are bacterial isolates that can be studied with regard to their metabolic capabilities (in some cases, their genome sequences have been completed, see ncbi accession #CP001562), we can begin to determine whether or not there are functional differences relevant to the classification of an organism as either Commensalibacter intestini or Saccharibacter floricola. For example, the pathogen Bartonella henselae sequence CP00156 (B. henselae) clades with the alpha-1 sequences, a group that is often found in honey bee colonies although the fitness effects on the host are unclear.  Is there a difference between Enterobacteriaceae bacterium Acj204 and Candidatus Gilliamella apicola? Clearly, the relevance of the taxonomic designation below the family level for these bee-specific groups remains to be determined.


Bee-specific group name | Strain taxonomic designation | Where isolated?
Alpha-2.2 | Saccharibacter floricola strain S-877 | Pollen
Alpha-2.1 | Commensalibacter intestini strain A911 | Drosophila melanogaster
Alpha-1 | Bartonella grahamii as4aup | Mouse gut
Firm-5 | Lactobacillus apis strain 1F1 | Honey bee
Gamma-1 | Enterobacteriaceae bacterium Acj204 | Honey bee





Friday, May 4, 2012

Creating a bee-specific database


Arguably, 454 pyrosequencing has revolutionized the field of microbial ecology.  Where it was once costly to generate libraries of a few hundred 16S rRNA gene sequences, 454 pyrosequencing allows researchers to deeply probe a microbial community at relatively little cost per sequence.  The ultimate goal of 454 pyrosequencing amplicon studies is to characterize a microbial community, either in terms of composition (DNA) or activity (RNA).  A large number of groups have been using the Ribosomal Database Project's Naïve Bayesian Classifier (RDP-NBC) to achieve this goal (Wang et al., 2007). The advantages are numerous but I'll list a few of the practical ones here: classification is straightforward (putting sequences in their taxonomic context), efficient (especially when considering tens of thousands of sequences) and does not require full length 16S sequences (making it an appropriate tool for pyrosequencing studies).  However, the NBC relies on an accurate training set – on reference sequences used to train the model and generate the classification results.  In a publication by Werner et al. (2011), the training set had a significant impact on classification, improving the classification of previously “unclassified” sequences and increasing the number classified to genus [1].

For environments that lack cultured isolates or are relatively unexplored, it can be difficult to find the appropriate training set to reveal the true taxonomic identity of the sequences extracted.  However, if previous clone libraries have generated full length, high-quality 16S sequences these can be added to the seed alignment and the taxonomy framework.  This is what I've aimed to do for the honey bee gut, using Mothur. In Mothur you can “tweak” the alignment seed for any particular environment, creating a custom database that will more accurately classify sequences of interest.  

To create a bee-specific alignment compatible with Mothur you need two files: a reference sequence database and a taxonomy file covering each of those sequences. To generate the database I downloaded all sequences that corresponded to accession numbers published in analyses of bee-associated microbiota and that were near full length (a 1250 bp length threshold); a total of 5,713 sequences were downloaded and 5,158 passed the threshold.  These sequences were clustered at 99% identity, reducing the dataset to 276 representatives.  This set of sequences was aligned using the SINA aligner (v 1.2.9, [2]) to the arb-silva SSU database (SSURef_108_SILVA_NR_99_11_10_11_opt_v2.arb) and visually inspected using ARB [3].  To generate a phylogeny I used this aligned sequence set as input to RAxML (GTR+g with 1000 bootstrap replicates) using a maximum likelihood framework [4].  Taking a quick look at this phylogeny, it is clear that the majority of sequences fall into the Firmicutes, Actinobacteria, or Proteobacteria.  Also, specific clades identified by previous groups are clearly marked on the tree (based on the literature). Now, it's difficult to classify novel sequences, so you might be asking yourself: how could you taxonomically classify them based on this tree?  Fine scale taxonomic placement (below phylum level) for relatively novel bacterial groups is difficult to accomplish and subject to some debate [5].  So, I queried the RDP for nearest cultured representatives.  If a cultured representative was >95% identical to the bee derived sequence, then that novel sequence was placed in the genus of the cultured representative.  If, however, no cultured isolate was more than 95% identical to a sequence, but the sequence claded with a cultured representative, it was placed in the same phylum, class, or order (depending on the group) and we noted incertae sedis in the taxonomy file.  
In addition to this de novo generation of taxonomic information for these bee sequences, if phylogenetic information (as established by Cox-Foster et al., 2007) was associated with any of these Genbank submissions, that information was also included in the taxonomy.  
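The placement rule described above can be sketched in Python as follows. This is an illustration of the logic rather than the script actually used; the function name, the list-of-ranks representation, and the "unclassified" fallback for sequences that neither match nor clade with an isolate are all assumptions.

```python
from typing import List, Optional


def place_sequence(best_identity: float,
                   isolate_lineage: List[str],
                   clade_lineage: Optional[List[str]] = None,
                   cutoff: float = 95.0) -> List[str]:
    """Return a lineage for a bee-derived sequence.

    best_identity   -- % identity to the nearest cultured representative
    isolate_lineage -- that representative's lineage, domain down to genus
    clade_lineage   -- higher ranks shared with a cultured representative the
                       sequence clades with on the tree (None if it clades with none)
    """
    if best_identity > cutoff:
        # Close enough to a cultured isolate: inherit its genus.
        return list(isolate_lineage)
    if clade_lineage:
        # No close isolate, but phylogenetic placement gives the higher ranks.
        return list(clade_lineage) + ["incertae_sedis"]
    # Assumption: anything else is left unclassified.
    return ["unclassified"]
```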

I then downloaded each of three pre-existing, Mothur-compatible training sets: 1) the RDP 16S rRNA reference v7 (9,662 sequences), 2) the Greengenes reference (84,414 sequences), and 3) the SILVA bacterial reference (14,956 sequences), each available on the Mothur WIKI page (http://www.mothur.org/wiki/Main_Page).  Each of these datasets comprises an unaligned sequence file and a taxonomy file.  To each of these I added the honey bee specific training set I generated.  Using each of these six alternative datasets (the three references, either with or without the honey bee specific sequences), I classified the honey bee gut microbiota generated in our recent publication [4] using the RDP-II Naive Bayesian Classifier [6] and a 60% confidence threshold.  
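As a rough illustration of the two steps above (adding the bee sequences to a reference taxonomy, then applying the 60% confidence threshold), here is a small Python sketch. It does not call Mothur or the RDP classifier; the data structures and function names are assumptions, and the per-rank bootstrap values in the usage note are invented for the example.

```python
def merge_taxonomies(reference, bee_specific):
    """Combine two {sequence_id: lineage} maps, refusing silent ID collisions."""
    overlap = reference.keys() & bee_specific.keys()
    if overlap:
        raise ValueError(f"duplicate sequence IDs: {sorted(overlap)}")
    return {**reference, **bee_specific}


def apply_cutoff(assignments, cutoff=60):
    """Trim a classification at the first rank whose bootstrap support is below cutoff.

    assignments is a list of (rank_name, bootstrap_percent) pairs ordered from
    domain downward; ranks from the first failure onward become 'unclassified'.
    """
    kept = []
    for name, support in assignments:
        if support < cutoff:
            break
        kept.append(name)
    return kept + ["unclassified"] * (len(assignments) - len(kept))
```

For instance, a read assigned as Bacteria (100), Proteobacteria (98), Gammaproteobacteria (85), Enterobacteriales (55), gamma-1 (40) would be reported only to class level at a 60% cutoff.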


Figure 1.  Phylogenetic relationships for the bacterial species included in the honey bee specific database (with bootstrap support indicated above branches if > 75%).  Class level designations are highlighted in red while lower taxonomic designations are indicated using arrows on nodes.  Specific clades identified previously in honey bees are colored in blue while novel clades identified here, including cultured isolates and well-described genera (such as Wolbachia), are colored in yellow.

What I find most interesting about this analysis is how well the addition of the bee-specific sequences helps to create congruence among the datasets (the Orbus classification by RDP notwithstanding).  Clearly, inclusion of environment specific sequences can increase the accuracy of the RDP-NBC.  I wanted to use this framework to explore fine-scale diversity (OTU level) within the gut.  



Figure 2. The effect of training set on the classification of sequences from the honey bee gut, visualized by a heat map.  Unique sequences (4,480) were classified using the NBC trained on either RDP, GG, or SILVA (A) or on one of three custom databases that include near full length honey bee-associated sequences (RDP+bees, GG+bees, SILVA+bees) (B).  The effect of including custom sequences is most obvious in the classification discordance between RDP, GG and SILVA (A) and their relative congruence when honey bee associated sequences are added to the training set (B).


Below I ask, how many individual unique sequences and how many likely "species" do we find in each of these families based on 97% clustering of operational taxonomic units (OTUs)? (Table 1). 

Table 1. For each family found within honey bee guts, the number of unique sequences and the number of 97% identity operational taxonomic units (OTUs) are shown.  The taxonomy shown here is based on classification by the RDP-NBC using the SILVA + honey bee sequences training set (available for anyone via email).  The most abundant families represent a large amount of fine-scale bacterial diversity.



Family | Num. unique sequences | OTUs
Enterobacteriaceae | 1621 | 175
gamma-1 | 436 | 48
beta | 532 | 35
Bifidobacteriaceae | 363 | 32
firm-5 | 929 | 32
firm-4 | 253 | 21
alpha-2.1 | 90 | 15
alpha-1 | 65 | 13
Lactobacilliaceae | 86 | 12
Flavobacteriaceae | 2 | 2
Leuconostocaceae | 2 | 2
Moraxellaceae | 6 | 2
Sphingomonadaceae | 2 | 2
Xanthomonadaceae | 2 | 2
Actinomycetaceae | 1 | 1
Aeromonadaceae | 1 | 1
alpha-2.2 | 10 | 1
Clostridiaceae | 2 | 1
Corynebacteriaceae | 1 | 1
Cytophagaceae | 1 | 1
Enterococcaceae | 9 | 1
Incertae_Sedis_XI | 1 | 1
Kineosporiaceae | 1 | 1
Nakamurellaceae | 1 | 1
Oxalobacteraceae | 1 | 1
Prevotellaceae | 1 | 1


One central goal of our previous study (see [7]) was to determine if there was a difference between colonies generated from promiscuous honey bee queens and those that were relatively chaste.  When we compared the OTU content between these two colony types, we found that the genetically diverse colonies host more diverse microflora [7].  This difference, based on the number of 97% identity clusters found within each colony, is independent of classification and was recapitulated using the SILVA + honey bee taxonomic classification (the 95% confidence interval (CI) for the mean difference in species diversity between colony types exceeded 0; 95% CI = 110, 102; mean = 106.25). 

The next question is, is this difference in microbiota composition attributable to any specific taxonomic group?  That is, within specific bacterial families, do we see differences between genetically diverse and genetically uniform colonies with regard to their OTU content?  I used the bootstrapped confidence interval analysis used in [7] to answer this question.  The difference in genetic diversity between colony types was found to affect the OTU-level diversity of specific bacterial groups (Table 2).  This suggests that fine-scale diversity within these honey-bee specific families may be ecologically relevant, and shouldn’t be ignored.  
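For readers unfamiliar with bootstrapped confidence intervals, here is a generic Python sketch of a percentile bootstrap for the difference between two group means. It is not the exact procedure from [7]; the resampling scheme, the number of replicates, and the function name are assumptions made for illustration.

```python
import random
from statistics import mean


def bootstrap_ci_mean_diff(group_a, group_b, n_boot=10000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for mean(group_a) - mean(group_b)."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    diffs = []
    for _ in range(n_boot):
        # Resample each group with replacement, keeping its original size.
        resample_a = rng.choices(group_a, k=len(group_a))
        resample_b = rng.choices(group_b, k=len(group_b))
        diffs.append(mean(resample_a) - mean(resample_b))
    diffs.sort()
    lower = diffs[int((alpha / 2) * n_boot)]
    upper = diffs[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper
```

Here group_a and group_b would be per-colony OTU counts for the two colony types. If the resulting interval excludes 0, the difference is treated as significant at the chosen level.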

Table 2. Total number of operational taxonomic units (97% ID) in either genetically uniform or genetically diverse colonies and classified as one of the honey bee specific taxonomic groups (mean number used in CI calculation in parentheses).  Statistically significant differences between colony types were observed for most of these families and their OTU content (indicated by an asterisk).

Taxon | Genetically Diverse | Genetically Uniform | Bootstrap 95% CI of mean difference
Firm-4* | 44 (36.10) | 25 (25.04) | (11.22, 10.90)
Firm-5* | 56 (45.13) | 46 (46.05) | (0.74, 1.09)
Alpha-2.1* | 21 (16.03) | 21 (21.04) | (4.92, 5.09)
Alpha-2.2 | 4 (4.05) | 4 (4.01) | (0.09, -0.005)
Alpha-1* | 16 (12.01) | 13 (13.06) | (0.96, 1.14)
Beta* | 60 (48.98) | 38 (37.99) | (11.12, 10.85)
Gamma-1* | 66 (52.73) | 51 (50.99) | (1.94, 1.55)

Which brings me to a final point, one that’s more of a rant really.  This is about lumping and splitting (you say “toma-toe” and I say “tomah-toe”) and what one calls a bacterial “species” (I’m not stepping into that minefield).  We don’t yet know what the observed % divergence at the 16S rRNA gene means in the honey bee gut microbiota.  Could these differences be primarily attributable to diversity between operons within a single strain?  This is unlikely; within the majority of bacterial genomes these operons evolve by concerted evolution and show <1% divergence between gene copies [8].  We are using the 16S rRNA gene as a marker for diversity, as a taxonomic tag.  This short tag is just that: a marker.  It could represent an enormous diversity at the genome level; we don’t yet know.  What we do know is that the microbial world is vast and that diversity is the norm: very few environments are characterized by low species abundance or “clonal” strains.  The fact that we are able to pick up a statistically significant signal between honey bee colonies based on OTUs suggests to me that there is more to investigate here.




Monday, April 2, 2012

Putting our honey bee data in context


Our group recently addressed the effect of within-colony genetic diversity on the associated microbial community of the honeybee Apis mellifera [1].  We obtained more than 70,000 pyrosequences from samples of whole worker bees, worker guts, and bee bread taken from 22 colonies (n=12 colonies were genetically diverse; n=10 colonies were genetically uniform).  Our research found that honey bee colonies benefit from the promiscuous mating of queens; diverse colonies were characterized by a reduction in potential pathogens and enrichment for possible probiotic species.  We used well established approaches for clustering (based on 97% sequence identity using average neighbor) and for de novo classification of short pyrosequences [2-5], as these sequences are arguably too short for robust phylogenetic analyses [6].  Our approach used the Naïve Bayesian Classifier trained on the Arb-Silva dataset and targeted diversity in the V1-V2 region. The utility of this approach is that we could, through alignment of the 16S rRNA gene, make hypotheses as to what these organisms in the bee gut may be doing and how they might be interacting, without a priori expectations of community composition.

In an important earlier study in 2007, Cox-Foster et al. published a phylogenetic framework for classifying the bacteria that are associated with honey bees [7]. The framework includes 8 phylotypes named Bifidobacterium, γ-1, γ-2, β-1, α-1, α-2, firm-4 and firm-5.  Unlike our de novo approach, these phylotypes were generated based on their groupings on a phylogenetic tree.  The method of analysis employed by Cox-Foster et al. differs from ours in that the diversity within each clade is not predetermined (that is, no % identity threshold is used in generating these groupings), and therefore taxa grouped into a single clade may be highly divergent (that is, below the traditional 97% identity threshold utilized in the field).  We analyzed the sequences generated in the earlier study and computationally formed % identity clusters to explore the amount of diversity within each clade, clustering the sequences at progressively higher divergence levels using complete linkage clustering (as implemented in RDP Classifier; Table 1) or nearest neighbor clustering (as implemented in blastclust -b T -L 0.9).  Indeed, each clade holds a relatively large amount of diversity; sequences within each clade are between 3% and 10% divergent (Table 1).  According to the methods utilized in Mattila et al. (2012), and using a 97% identity threshold that is more typical for the field, these clades would be considered to harbor numerous species and/or genera.
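The difference between the two clustering criteria can be seen in a toy Python sketch. This is a simple greedy agglomerative procedure on a precomputed percent-identity matrix, written for illustration; it is not the RDP or blastclust implementation, and those tools compute identity and coverage in their own ways.

```python
def cluster(identity, threshold, linkage="complete"):
    """Agglomerative clustering on a symmetric percent-identity matrix.

    complete linkage: two clusters merge only if their LEAST similar pair
    meets the threshold; nearest neighbor: their MOST similar pair suffices.
    Returns a list of clusters, each a list of sequence indices.
    """
    pick = min if linkage == "complete" else max
    clusters = [[i] for i in range(len(identity))]
    while True:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                score = pick(identity[i][j]
                             for i in clusters[x] for j in clusters[y])
                if score >= threshold and (best is None or score > best[0]):
                    best = (score, x, y)
        if best is None:
            return clusters  # nothing left that can be merged
        _, x, y = best
        clusters[x] += clusters.pop(y)  # y > x, so index x stays valid
```

With three sequences where A-B and B-C are 98% identical but A-C is only 95%, nearest neighbor at 97% chains all three into one cluster, while complete linkage keeps C separate.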

Table 1. The number of clusters generated by complete linkage clustering (or nearest neighbor clustering, in parentheses) of the 8 clades characterized in Cox-Foster et al. [7] as a function of percent identity.  Subclusters within clades suggest that these groupings are quite diverse and likely contain several different species/genera.  

Phylotype | 90% | 93% | 95% | 97%
Alpha-1 | 1 (3) | 1 (3) | 1 (3) | 2 (3)
Alpha-2.1 | 1 (2) | 1 (2) | 1 (2) | 2 (4)
Alpha-2.2 | 1 (1) | 1 (1) | 3 (1) | 5 (3)
Beta | 2 (7) | 2 (7) | 3 (8) | 5 (8)
Bifido | 1 (2) | 1 (2) | 1 (2) | 1 (2)
Firm-4 | 1 (2) | 2 (2) | 3 (4) | 5 (5)
Firm-5 | 1 (2) | 2 (2) | 2 (2) | 4 (2)
Gamma-1 | 2 (6) | 2 (6) | 4 (6) | 5 (7)
Gamma-2 | 1 (3) | 1 (3) | 1 (3) | 1 (3)

Below, we compare our data and our approach to that developed by Cox-Foster et al. [7] to contrast the level of diversity that is estimated by the two approaches.  We used blastn to identify which of our top 13 most prevalent OTUs (sequences that cluster at 97% identity) corresponded to the 8 clades mentioned above.  If one of our top OTUs was 100% identical to a representative sequence considered to be part of one of the 8 clades, we noted this in Table 2 below.
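One way to pull out the 100% identical matches is to filter tabular blastn output, as in the Python sketch below. It assumes the standard 12-column tab-delimited format (query ID, subject ID, percent identity, and so on); the function name and the OTU identifiers in any example input are hypothetical. Note that percent identity in this format refers to the aligned region only, so a stricter check would also compare the alignment length to the query length.

```python
def perfect_hits(blast_tabular: str):
    """Map each query to the subject accessions it matches at 100% identity."""
    hits = {}
    for line in blast_tabular.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment headers
        fields = line.split("\t")
        query, subject, pident = fields[0], fields[1], float(fields[2])
        if pident == 100.0:
            hits.setdefault(query, []).append(subject)
    return hits
```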

Table 2. Representative OTUs generated by Mattila et al., 2012, their % prevalence in the bee gut (ranked in terms of abundance) and top blast hit accession numbers from the Cox-Foster et al., 2007 phylotypes.  Where a particular OTU does not find homologs within the 8 phylotype framework proposed by Cox-Foster et al., we indicate that result with N/A.
Classification (Mattila et al., 2012) | Prevalence in the bee gut (Mattila et al., 2012) | Top blast hit (nr/nt) | Phylotype
Succinivibrionaceae | 38.8% | HE613303 | Gamma-1
Bowmanella | 14.3% | DQ837611 | Gamma-2
Oenococcus | 14.1% | HE613310 | Firm-5
Paralactobacillus | 10.2% | HM113331 | Firm-4
Unclass. Colwelliaceae | 6.4% | HM111973 | Gamma-1
Bifidobacterium | 4.7% | HM113282 | Bifido
Shimazuella | 3.2% | HE613282 | Firm-5
Enterobacter | 1.2% | JF208675 | N/A
Laribacter | 1.0% | JQ437500 | Beta
Saccharibacter | 0.92% | JQ437507 | Alpha-2.1
Rummeliibacillus | 0.52% | HM111947 | Firm-5
Atopobacter | 0.28% | HM113352 | Firm-4
Escherichia/Shigella | 0.17% | HE582599 | N/A


It is important to note that the 13 OTU representatives in Table 2 do not represent our complete dataset; of the 1,019 OTUs we reported in Mattila et al., 2012, only 358 find homologs (>98% ID) in previously published honey bee datasets.  Furthermore, Mattila et al., 2012 takes the step of classifying and analyzing these sequences at a finer taxonomic scale. Two fundamental questions remain to be addressed: 1) is this diversity relevant to honey bee health? and 2) is the level of divergence revealed by our study for the honeybee microbiome important for the function and stability of this community?  Our study suggests that fine scale diversity within the bacterial community (at 97% ID) may be important to the health of a colony; community diversity using this metric correlated with host genetic diversity and with the prevalence of the low frequency pathogens Melissococcus and Paenibacillus.

 References