Mapping individual gene data on an evolutionary tree
 

B. Mirkin, T.I. Fenner, G. Loizou

________________________________________________________________________

 

Project outline and aims
 

Evolutionary trees are an important instrument in inter-genome analysis. Traditionally, computational biology focuses on the problem of tree building. This problem can be formulated as follows: given some data on a set of extant species, build a (rooted) tree whose leaves correspond to the extant species and whose interior nodes correspond to their ancestors, in such a way that more similar species are separated by more recent divergence events. This project is devoted to a related problem: developing methods for interpreting various types of data on the extant species by mapping them in a biologically meaningful way onto an evolutionary tree and annotating the tree nodes with relevant evolutionary events.

In particular, we are concerned with three specific projects:

  • With I. Muchnik (Rutgers, NJ, USA) and M. Vingron (currently Berlin, Germany), we develop mathematical models and algorithms addressing the following problem. Given an evolutionary species tree and a set of trees built on the same extant species according to similarity between individual gene families, find a mapping of the individual gene trees onto the species tree that exhibits the gene duplications and losses needed to account for the differences. We have developed a so-called annotating model for comparing gene and species trees and established its relations with two other existing models, the reconciled tree and the LCA mapping; see:

    O. Eulenstein, B. Mirkin, and M. Vingron (1997) Comparison of annotating duplication, tree mapping, and copying as methods to compare gene trees with species trees, in B. Mirkin, F. McMorris, F. Roberts, and A. Rzhetsky (Eds.) Mathematical Hierarchies and Biology, DIMACS Series, V. 37, Providence: AMS, 71-94.

    O. Eulenstein, B. Mirkin, and M. Vingron (1998) Duplication-based measures of difference between gene and species trees, Journal of Computational Biology, 5, 135-148.

    B. Mirkin (2004) Mapping gene family data onto evolutionary trees, in M. Chavent, O. Dordan, C. Lacomblez, M. Langlais, and B. Patouille (Eds.), Comptes rendus des 11es Rencontres de la Societe Francophone de Classification, University of Bordeaux, 61-68.
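
    As an illustration of the gene tree versus species tree problem described above, the following is a minimal Python sketch of the standard LCA mapping, not of the annotating model itself. It assumes binary trees written as nested tuples whose leaves are species names; the function names are hypothetical. Each gene tree node is mapped to the lowest common ancestor of the images of its children, and a node is counted as a duplication when its image coincides with the image of one of its children.

```python
def parent_and_depth(tree):
    """Return parent and depth maps for a species tree given as nested tuples."""
    parent, depth = {}, {}

    def walk(node, par, d):
        parent[node] = par
        depth[node] = d
        if isinstance(node, tuple):
            for child in node:
                walk(child, node, d + 1)

    walk(tree, None, 0)
    return parent, depth


def lca(a, b, parent, depth):
    """Lowest common ancestor of species tree nodes a and b."""
    while depth[a] > depth[b]:
        a = parent[a]
    while depth[b] > depth[a]:
        b = parent[b]
    while a != b:
        a, b = parent[a], parent[b]
    return a


def count_duplications(gene_tree, species_tree):
    """Count gene tree nodes that the LCA mapping marks as duplications."""
    parent, depth = parent_and_depth(species_tree)
    duplications = 0

    def map_node(g):
        nonlocal duplications
        if not isinstance(g, tuple):
            return g  # a leaf gene maps to the species it was sampled from
        left, right = (map_node(child) for child in g)
        image = lca(left, right, parent, depth)
        if image == left or image == right:
            duplications += 1  # image coincides with a child's image
        return image

    map_node(gene_tree)
    return duplications


species_tree = (("a", "b"), "c")
print(count_duplications((("a", "b"), "c"), species_tree))  # congruent gene tree
print(count_duplications((("a", "c"), "b"), species_tree))  # conflicting gene tree
```

    A gene tree congruent with the species tree needs no duplications, while the conflicting gene tree in the second call needs one, at its root.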

  • With E. Koonin and Y. Wolf (NCBI, Bethesda, USA), we develop algorithms and run computations addressing the following problem. Given an evolutionary species tree and patterns of presence/absence of a number of genes in the extant species, find hypothetical evolutionary scenarios explaining the patterns by gene emergence, horizontal transfer, and gene loss at various extant and ancestral tree nodes. We developed an algorithm for finding maximally parsimonious scenarios and applied it, with different gene gain penalty weights, to about 3000 COG phylogenetic patterns on a set of 26 species. The last universal common ancestor (LUCA) reconstructed with equal loss and gain penalty weights contained 572 genes and appeared most compatible with the hypothesised real ancestor, which led us to suggest that horizontal transfer was as frequent as gene loss. This approach is now being extended to handle the maximum likelihood criterion and to use information on similarities between proteins representing the same gene. This will help in better reconstructing the genome contents of ancestral species as well as in delineating gene histories; see:

    B. Mirkin, T. Fenner, M. Galperin and E. Koonin (2003) Algorithms for computing parsimonious evolutionary scenarios for genome evolution, the last universal common ancestor and dominance of horizontal gene transfer in the evolution of prokaryotes, BMC Evolutionary Biology, 3:2.

    B. Mirkin and E. Koonin (2003) A top-down method for building genome classification trees with linear binary hierarchies, in M. Janowitz, J.-F. Lapointe, F. McMorris, B. Mirkin, and F. Roberts (Eds.) Bioconsensus, DIMACS Series, V. 61, Providence: AMS, 97-112.

    K.S. Makarova, Y.I. Wolf, S.L. Mekhedov, B. Mirkin and E.V. Koonin (2005) Ancestral paralogs and pseudoparalogs and their role in the emergence of the eukaryotic cell, Nucleic Acids Research, 33(14), 4626-4638.
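
    To make the role of the gain penalty concrete, here is a minimal Python sketch of a generic Sankoff-style dynamic programme for one presence/absence pattern. It is not the algorithm of the papers above. It assumes a tree written as nested tuples with species names as leaves, charges `loss_penalty` for each loss and `gain_penalty` for each gain along a branch, and also charges one gain if the gene is present at the root (its emergence); the function name and this cost convention are assumptions made for illustration.

```python
def min_scenario_cost(tree, present, gain_penalty=1.0, loss_penalty=1.0):
    """Minimal total penalty of gains and losses explaining a presence pattern.

    tree    -- nested tuples with species names as leaves
    present -- set of extant species in which the gene is present
    """
    inf = float("inf")

    def costs(node):
        # Returns (cost if the gene is absent at node, cost if it is present).
        if not isinstance(node, tuple):
            return (inf, 0.0) if node in present else (0.0, inf)
        absent_cost = present_cost = 0.0
        for child in node:
            child_absent, child_present = costs(child)
            # Parent lacks the gene: the child either lacks it or gains it.
            absent_cost += min(child_absent, child_present + gain_penalty)
            # Parent has the gene: the child either keeps it or loses it.
            present_cost += min(child_present, child_absent + loss_penalty)
        return absent_cost, present_cost

    root_absent, root_present = costs(tree)
    # Assumption: presence at the root is paid for as one gain (emergence).
    return min(root_absent, root_present + gain_penalty)


tree = (("a", "b"), ("c", "d"))
print(min_scenario_cost(tree, {"a", "c"}, gain_penalty=1.0))
print(min_scenario_cost(tree, {"a", "c"}, gain_penalty=3.0))
```

    With equal penalties the pattern {a, c} is explained most cheaply by two independent gains at the leaves; with a gain penalty of 3 a single emergence at the root followed by two losses becomes cheaper, so the gene is placed in the root's reconstructed content.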

  • With P. Kellam (UCL, London, UK), we work on the analysis of evolutionary gains and losses of functions in herpesvirus genomes, partly supported by a Wellcome Trust grant (2004-2007). In this project, we have built an evolutionary tree over 30 animal herpesvirus genomes and annotated it with homologous protein families (HPFs) comprising about 3500 proteins (originally 740 HPFs in the VIDA database, later aggregated to 593 HPFs within the project), using the maximum parsimony and maximum likelihood reconstruction methods developed in the project. The annotation gave us functional reconstructions of all hypothetical ancestral genomes, of which currently the most important are the last ultimate ancestor of herpesviridae (HUCA) and the ancestors of the three major herpesvirus super-families: the alpha, beta and gamma herpesviridae. Our computational method, involving aggregation of protein families and mapping them to the tree, leads to a reconstruction of HUCA that closely matches the one produced manually by other authors using biological knowledge and intuition, which gives credibility to our other reconstructions.

    Two major discoveries are:

    (1) Clear-cut cases in which an ancestral protein sequence has diverged so much in the process of evolution that no similarity remains between its descendants in the extant species; we were still able to establish their orthology computationally by matching (i) our reconstruction results and (ii) information on gene arrangement in the genomes.

    (2) Surprisingly, in spite of considerable international efforts in determining functions of viral proteins, the function of 85% of those ancestral genes fine-tuning the separation of the beta and gamma super-families from the rest remains unknown.

    On the computational side, the VIDA database HPFs were updated by an extensive search through major bioinformatics databases, in the course of which we resolved numerous inconsistencies between different submissions. Our annotation of the tree with the original VIDA HPFs showed that further aggregation of the HPFs was needed. We therefore developed a novel clustering method involving protein neighbourhoods, majority lists, data recovery clustering and a shift of the similarity scale. The shift, a crucial parameter, was adjusted iteratively against domain knowledge: first by using HPFs with known functions, then by comparing the reconstructed histories with gene arrangements in the genomes. We also developed maximum likelihood versions of our approach involving either node-specific or constant probabilities of loss/gain.

    Some materials:

    B. Mirkin, R. Camargo, T. Fenner, G. Loizou, P. Kellam (2006) Aggregation of Homologous Protein Families (HPFs) for mapping them onto an evolutionary tree, MASAMB - Mathematical and Statistical Aspects of Molecular Biology, Dublin, April 2006.

    B. Mirkin, R. Camargo, T. Fenner, G. Loizou, P. Kellam (2006) Aggregating Homologous Protein Families in evolutionary reconstructions of herpesviruses, 2006 IEEE Symposium on Comp. Intelligence in Bioinformatics & Comp. Biology, 255-263, Toronto, September 2006.

    B. Mirkin, R. Camargo, T. Fenner, G. Loizou, P. Kellam (2007) Using domain knowledge and shift of origin in clustering similarity data (submitted).
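
    For the maximum likelihood variant with constant loss/gain probabilities mentioned above, the following minimal Python sketch shows how the likelihood of a single presence/absence pattern could be computed with the standard pruning recursion. It is an illustration under stated assumptions, not the method of the cited papers: the tree is given as nested tuples with genome names as leaves, every branch has the same gain probability `p_gain` and loss probability `p_loss`, and `p_root_present` is a hypothetical prior for presence at the root.

```python
def pattern_likelihood(tree, present, p_gain, p_loss, p_root_present=0.5):
    """Likelihood of a presence/absence pattern under constant gain/loss rates.

    tree    -- nested tuples with genome names as leaves
    present -- set of extant genomes in which the family is present
    """

    def partial(node):
        # Returns (P(data below | absent at node), P(data below | present)).
        if not isinstance(node, tuple):
            return (0.0, 1.0) if node in present else (1.0, 0.0)
        like_absent = like_present = 1.0
        for child in node:
            child_absent, child_present = partial(child)
            # Absent parent: child stays absent or gains the family.
            like_absent *= (1 - p_gain) * child_absent + p_gain * child_present
            # Present parent: child loses the family or keeps it.
            like_present *= p_loss * child_absent + (1 - p_loss) * child_present
        return like_absent, like_present

    root_absent, root_present = partial(tree)
    return (1 - p_root_present) * root_absent + p_root_present * root_present


tree = (("a", "b"), "c")
print(pattern_likelihood(tree, {"a", "b"}, p_gain=0.1, p_loss=0.2))
```

    Node-specific probabilities could be handled in the same recursion by looking up `p_gain` and `p_loss` per branch instead of passing constants.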

Subjects for student projects

    Extending the algorithm for parsimonious mapping of gain and loss events to unresolved evolutionary trees.

    Using similarities between proteins for selecting an evolutionary scenario of a gene.

    Finding and visualising evolutionary events for individual gene families.