"The genomic era in which we live today has provided us with a deluge of DNA sequence data thanks to the rapid development of massive sequencing technologies," says Juan Montoya-Burgos. With billions of bytes of new data arriving every month, scientists set out to resolve the last remaining uncertainties in the evolutionary tree of life.
However, it soon emerged that due to the inherent complexity of the evolutionary process, certain patterns can be misinterpreted even when using our best methods on big datasets. In some cases, a large dataset could be even more misleading than its smaller counterpart.
One such case arises when the organisms being compared display very different rates of DNA evolution. Although this pattern arises naturally, it poses a challenge for current methods of inferring evolutionary history. The reason lies in multiple nucleotide changes at the same DNA position: when two fast-evolving species independently arrive at the same base, that convergent base can be misinterpreted as having been inherited from a common ancestor. In genomic-era datasets this misinterpretation is pervasive, and the evolutionary history of some species in the tree of life therefore remains obscured.
In the laboratory of Dr. Juan I. Montoya-Burgos of the Department of Genetics and Evolution and the Institute of Genetics and Genomics in Geneva (iGE3), researchers invented a method and developed an algorithm to tackle this problem. Tailored specifically to the large sequence datasets of the genomic era, the method uses an objective criterion to identify, for each gene, the subset of species that evolve at a homogeneous rate. With this information, large datasets can be built from which the misleading data have been removed.
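The per-gene subsampling idea described above can be illustrated with a minimal Python sketch. It is not the published LS³ implementation: the article does not spell out the objective criterion, so a simple fastest-to-slowest rate ratio is used here as a stand-in, and the function names, the `max_ratio` threshold, the `min_species` cutoff, and the example rates are all hypothetical.

```python
def homogeneous_subset(rates, max_ratio=2.0, min_species=3):
    """Return the species of one gene that evolve at similar rates.

    rates: dict mapping species name -> positive evolutionary rate for this gene.
    ASSUMPTION: "homogeneous" is taken to mean fastest <= max_ratio * slowest.
    This ratio test is a stand-in, not the criterion used by LS3.
    Returns a sorted list of species, or None if too few species remain.
    """
    kept = dict(rates)
    while len(kept) >= min_species:
        fastest = max(kept, key=kept.get)
        slowest = min(kept, key=kept.get)
        if kept[fastest] <= max_ratio * kept[slowest]:
            return sorted(kept)
        # Rates are still heterogeneous: drop the fastest species and retry.
        del kept[fastest]
    return None


def subsample(genes, max_ratio=2.0, min_species=3):
    """Build a dataset that keeps, per gene, only a rate-homogeneous subset.

    genes: dict mapping gene name -> {species: rate}.
    Genes for which no acceptable subset exists are left out entirely.
    """
    result = {}
    for gene, rates in genes.items():
        subset = homogeneous_subset(rates, max_ratio, min_species)
        if subset is not None:
            result[gene] = subset
    return result


# Hypothetical rates, for illustration only.
example = {
    "geneA": {"sp1": 0.10, "sp2": 0.12, "sp3": 0.11, "sp4": 0.90},
    "geneB": {"sp1": 0.10, "sp2": 0.50, "sp3": 2.50},
}
filtered = subsample(example)
```

In this toy input, the fast-evolving `sp4` is dropped from `geneA`, while `geneB` is excluded altogether because no group of at least three species passes the stand-in test. Removing the fastest species first is a design choice of this sketch, motivated by the article's point that fast-evolving lineages are the ones most prone to misleading convergent changes.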
The new algorithm, named Locus Specific Species Subsampling (LS³), was validated on simulated DNA sequence data. To demonstrate its usefulness on biological data, the researchers also applied LS³ to well-known DNA and protein sequence datasets in which heterogeneous evolutionary rates among species had misled the inference, resulting in incorrect evolutionary trees. In all cases, the LS³ algorithm succeeded in recovering the correct evolutionary tree from the datasets.
Developing such algorithms, which filter the useful information from the noise, is a crucial step towards fully understanding evolutionary history in the midst of the genomic data deluge. The method was published on February 23 in Molecular Biology and Evolution.