Monday, July 29, 2013

PhyBin: tree binning by topology

Just submitted our first paper to PeerJ - the awesome new open-access journal aimed at turning the entire publishing establishment on its head.  I'm looking forward to a smooth review process -- hopefully as sleek and helpful as the submission process. <UPDATE: our paper was accepted.  I will post a link to the paper once it's online!>

Our manuscript presents PhyBin, a computer program aimed at binning precomputed sets of trees in Newick format, a file format produced by the majority of tree building software.  As we assert in the manuscript, PhyBin is a utility rather than a complete solution; it can serve as a component in many genomics pipelines, and provides a useful addition to the landscape of tools for dissecting and visualizing large numbers of trees.  After the user applies their chosen ortholog prediction and tree-building algorithms, PhyBin offers a quick way to visualize and browse the different evolutionary histories, either binned by topology and sorted by bin size, or in the form of a full hierarchical clustering based on Robinson-Foulds (RF) distance: i.e. a tree of trees.
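To make "binned by topology and sorted by bin size" concrete, here is a minimal Python sketch. It is not PhyBin's own code: the names `canonical` and `bin_by_topology` are made up for illustration, the trees are written directly as nested tuples rather than parsed from Newick, and they are treated as rooted, which is a simplifying assumption.

```python
from collections import defaultdict

def canonical(tree):
    """Return a canonical form of a tree by sorting children recursively.

    A leaf is a string (taxon name); an internal node is a tuple of subtrees.
    Two trees that differ only in the order of their children get the same form.
    """
    if isinstance(tree, str):
        return tree
    # repr gives a consistent sort key for a mix of strings and tuples
    return tuple(sorted((canonical(child) for child in tree), key=repr))

def bin_by_topology(trees):
    """Group tree names by canonical topology, largest bin first."""
    bins = defaultdict(list)
    for name, tree in trees.items():
        bins[canonical(tree)].append(name)
    return sorted(bins.values(), key=len, reverse=True)

# Hypothetical gene trees over four taxa
gene_trees = {
    "gene1": (("A", "B"), ("C", "D")),
    "gene2": (("D", "C"), ("B", "A")),  # same topology, children reordered
    "gene3": (("A", "C"), ("B", "D")),
}
bins = bin_by_topology(gene_trees)
```

Here `gene1` and `gene2` land in the same bin because they differ only in how their children are written down, while `gene3` gets a bin of its own.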

In the manuscript, we explore two functionalities in PhyBin: 1) the ability to bin trees with identical topologies and 2) the ability to cluster similar trees by RF distance.  Lots of folks interested in the "landscape" of topologies produced by orthologous genes across a genome use RF distance as a measure of topological similarity.  What is RF distance?  It is essentially the number of steps you'd have to take to turn one tree into another -- it's the edit distance between two topologies.  So, according to the original Robinson-Foulds publication, for example, the trees below (trees 1 and 2) are at edit distance 2 because in order to convert one to the other, you must collapse a node and then re-form it.
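For readers who like to see it in code, here is a rough Python sketch of RF distance using the common formulation: count the non-trivial splits (bipartitions of the taxa) that appear in one tree but not the other. This is only an illustration under my own assumptions, not how PhyBin computes it; `parse_newick`, `splits` and `rf_distance` are hypothetical helper names, and the tiny parser ignores branch lengths and internal labels and does not handle quoted names.

```python
def parse_newick(text):
    """Parse a simple Newick string into nested tuples of taxon names."""
    s = text.strip().rstrip(";")
    pos = 0

    def token():
        # Read characters up to the next structural symbol
        nonlocal pos
        start = pos
        while pos < len(s) and s[pos] not in ",()":
            pos += 1
        return s[start:pos]

    def node():
        nonlocal pos
        if s[pos] == "(":
            pos += 1  # skip "("
            children = [node()]
            while s[pos] == ",":
                pos += 1
                children.append(node())
            pos += 1  # skip ")"
            token()   # discard internal label and branch length
            return tuple(children)
        return token().split(":")[0].strip()  # leaf name without its length

    return node()

def leaf_set(tree):
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(c) for c in tree))

def splits(tree):
    """Collect the non-trivial splits of a tree, viewed as unrooted."""
    taxa = leaf_set(tree)
    ref = min(taxa)  # reference taxon used to pick one side of each split
    found = set()

    def walk(t):
        if isinstance(t, str):
            return frozenset([t])
        below = frozenset().union(*(walk(c) for c in t))
        # Keep only splits with at least two taxa on each side
        if 2 <= len(below) <= len(taxa) - 2:
            found.add(below if ref not in below else taxa - below)
        return below

    walk(tree)
    return found

def rf_distance(a, b):
    """Number of splits present in exactly one of the two trees."""
    if leaf_set(a) != leaf_set(b):
        raise ValueError("trees must have the same taxa")
    return len(splits(a) ^ splits(b))

t1 = parse_newick("((A,B),(C,D));")
t2 = parse_newick("((A,C),(B,D));")
d = rf_distance(t1, t2)
```

With these two four-taxon trees, `d` comes out as 2: one split has to be collapsed and a different one formed.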

PhyBin does some pretty neat pre-processing of trees to facilitate comparisons. For example, you can set a branch length threshold to collapse branches that are essentially noise in your dataset (say, from very closely related taxa).  It also checks your dataset for number of taxa and is quite robust to file formatting.  PhyBin then calculates the edit distance between every pair of trees in a large group (a distance matrix) and displays these distances as a tree of trees - a dendrogram that links the trees in the dataset to one another based on the edit distances between them (how you'd get from one to another).  In our manuscript, we used a set of orthologs generated from 10 published Wolbachia genomes.  Here's what the dendrogram looks like for those 508 trees without (A) and with (B) clustering by RF distance. In the figure, you can see that many of the trees in the Wolbachia ortholog set are similar: they fall into 9 large clusters, many of which support the monophyly of Wolbachia supergroups but others (in fact, a good fraction!) which do not.
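As a rough illustration of the branch-collapsing idea (again a sketch, not PhyBin's actual implementation), the Python snippet below removes internal branches shorter than a threshold by attaching their children directly to the parent, which turns a weakly supported resolution into a polytomy. The `(name, length, children)` node layout and the function name `collapse_short` are my own assumptions, as is the choice to leave leaf branches alone.

```python
def collapse_short(node, threshold):
    """Collapse internal branches shorter than `threshold`.

    A node is a (name, length, children) tuple; leaves have an empty
    children list and internal nodes may have name None.
    """
    name, length, children = node
    new_children = []
    for child in children:
        collapsed = collapse_short(child, threshold)
        _, child_length, grandchildren = collapsed
        if grandchildren and child_length < threshold:
            # Short internal branch: splice its children into this node
            new_children.extend(grandchildren)
        else:
            # Leaves and sufficiently long internal branches are kept
            new_children.append(collapsed)
    return (name, length, new_children)

# Hypothetical tree: the (A,B) grouping sits on a very short branch
tree = (None, 0.0, [
    (None, 0.001, [("A", 0.1, []), ("B", 0.2, [])]),
    ("C", 0.3, []),
    ("D", 0.0005, []),
])
result = collapse_short(tree, threshold=0.01)
```

After collapsing, A, B, C and D all hang directly off the root, so a grouping supported only by a near-zero branch no longer counts as a topological difference.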


Want to download it?  Check it out here





1 comment:

  1. hi..Im student from Informatics engineering, this article is very informative, thanks for sharing :)
