Structural and Functional Annotation of Genomes through Synchronised Data Warehouses

Funding and Staffing Details

This project is funded by a 2-year BBSRC/EPSRC grant under the Bioinformatics Initiative. The investigators are Alex Poulovassilis and Nigel Martin of the Information Management and Web Technologies and Bioinformatics Groups in the School of Computer Science and Information Systems at Birkbeck, Christine Orengo of the Biomolecular Structure and Modelling Group in the Department of Biochemistry & Molecular Biology at UCL, and Peter Keller of the Macromolecular Structure Database group at the EBI. Dr Michael Maibaum is a postdoctoral researcher working on the project at Birkbeck and UCL. Dr Adrian Shepherd worked on the project before leaving to take up a lectureship in the Department of Crystallography, Birkbeck.

Project Aims

The aim of the project has been to develop data warehousing technology that provides an integrated view of primary structural data held in the MSD database at the European Bioinformatics Institute (EBI) and derived data held at UCL. This is a prototype of a potential future network in which the EBI acts as the central hub providing the basic data and many peripheral sites host derived data, constructed so that both basic and derived data can be accessed seamlessly in a single query.

During the project we have developed an extensible architecture that can be used to support the integration of such heterogeneous biological data sets. There are three major obstacles in such an endeavour: the use of different identifiers for the same biological entities, the diversity of the data models underpinning the biological data, and the requirement to keep the integrated data warehouse current in the face of data and schema changes in the source data sets. In our architecture, entities are categorised into clusters, allowing individual biological entities to be annotated with family-based data. For example, sequence-based clustering enables gene-family-based annotation of individual sequences.
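The idea of cluster-based annotation can be illustrated with a minimal Python sketch. It is not the project's implementation: the source names, identifiers, cluster ids and annotation strings below are all hypothetical, and the clustering itself is assumed to have been computed elsewhere. The sketch only shows how entities carrying different identifiers in different sources can inherit annotation through a shared cluster.

```python
from collections import defaultdict

# Hypothetical data: (source, identifier) pairs assigned to cluster ids.
# Two sources may use different identifiers for related entities.
cluster_of = {
    ("source_a", "A001"): "fam1",
    ("source_b", "X9"): "fam1",
    ("source_a", "A002"): "fam2",
}

# Hypothetical family-level annotations keyed by cluster id.
family_annotations = {
    "fam1": ["kinase-like family"],
    "fam2": [],
}


def members_by_cluster(cluster_of):
    """Invert the entity -> cluster map into cluster -> sorted entity list."""
    members = defaultdict(list)
    for entity, cluster in cluster_of.items():
        members[cluster].append(entity)
    return {cluster: sorted(entities) for cluster, entities in members.items()}


def annotate(entity, cluster_of, family_annotations):
    """Return the family-level annotations inherited by one entity.

    An entity that belongs to no cluster gets an empty list.
    """
    cluster = cluster_of.get(entity)
    if cluster is None:
        return []
    return list(family_annotations.get(cluster, []))
```

Here `annotate(("source_b", "X9"), cluster_of, family_annotations)` picks up the annotation attached to `fam1`, even though that annotation may have originated from an entity in a different source.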

We use the AutoMed data integration toolkit to store the schemas of the data sources and also the transformations from the source data into the data of the integrated warehouse. These transformations are generated semi-automatically by a process of schema matching and schema restructuring. The transformations can be used to update the warehouse data as entities change, are added, or are deleted in the data sources. The transformations can also be used to support the addition or removal of entire data sources, or evolutions in the schemas of the data sources or of the warehouse itself.
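As a rough illustration of how a stored transformation can keep a warehouse current, the following Python sketch replays source change events through a single source-to-warehouse mapping. It does not use the AutoMed API, and every name in it (`to_warehouse`, `apply_change`, the field names and the sample rows) is an assumption made for the example.

```python
def to_warehouse(source_row):
    """Hypothetical transformation: rename source fields to warehouse fields."""
    return {"entry_id": source_row["id"], "resolution": source_row["res"]}


def apply_change(warehouse, change, transform=to_warehouse):
    """Apply one source change event to the warehouse dict, keyed by entry id.

    A change is a (kind, row) pair where kind is 'insert', 'update' or 'delete'.
    """
    kind, row = change
    if kind in ("insert", "update"):
        record = transform(row)
        warehouse[record["entry_id"]] = record
    elif kind == "delete":
        # Deleting an entry that is already absent is treated as a no-op.
        warehouse.pop(row["id"], None)
    else:
        raise ValueError(f"unknown change kind: {kind}")
    return warehouse


# Hypothetical stream of changes observed in one data source.
warehouse = {}
changes = [
    ("insert", {"id": "e1", "res": 2.0}),
    ("insert", {"id": "e2", "res": 3.1}),
    ("update", {"id": "e1", "res": 1.8}),
    ("delete", {"id": "e2"}),
]
for change in changes:
    apply_change(warehouse, change)
```

Because the mapping lives in one place, a change to the source or warehouse schema in this sketch would only require replacing `to_warehouse`; the update logic itself stays the same.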

Further, we have developed mechanisms supporting the transfer and incremental update of the MSD database at remote sites. These mechanisms have been implemented successfully at the Birkbeck and UCL sites in Bloomsbury, and represent the first implementation of the MSD database and its incremental update mechanisms at sites outside the EBI.

Project Publications

BioMap: Gene Family based Integration of Heterogeneous Biological Databases using AutoMed Metadata. M. Maibaum, G. Rimon, C. Orengo, N. Martin, A. Poulovassilis, Proc. 15th International Workshop on Database and Expert Systems Application DEXA 2004, 384-388, (2004).

Cluster based integration of Heterogeneous Biological Databases using the AutoMed toolkit. M. Maibaum, L. Zamboulis, G. Rimon, C. Orengo, N. Martin, A. Poulovassilis, Proc. 2nd International Workshop on Data Integration in the Life Sciences DILS 2005, 191-207, (2005).