IST-2004-027510: Association Studies assisted by Inference and Semantic Technologies

Partners and funding
Centre for Research and Technology Hellas – Informatics and Telematics Institute (CERTH-ITI) Greece
RAMIT – Research in Advanced Medical Informatics and Telematics - University Hospital Ghent (RAMIT) Belgium
Charite Universitätsmedizin Berlin, Gynäkologie mit Hochschulambulanz (Charite, CBF) Germany
SWORD Technologies S.A. (SWORD) Luxembourg
Aristotle University of Thessaloniki, Lab of Medical Informatics (AUTH) Greece
Birkbeck College, University of London (Birkbeck) UK
Benchmark Performance Ltd (Benchmark) UK
Institute of Communication and Computer Systems - National Technical University of Athens (ICCS-NTUA) Greece
EbioIntel S.L. (EbioIntel) Spain
Pouliadis Associates Corporation (Pouliadis) Greece
Custodix (CUSTODIX) Belgium

EU funding: 2.6 Million Euros

Project summary

Cervical cancer is the second most common cancer worldwide, with 60,000 new cases and 30,000 deaths each year in Europe alone, despite significant progress in early diagnosis and treatment. Recent trends in medical research combine genetic with clinical data and perform association studies among environmental agents, virus characteristics and genetic attributes, in order to identify new markers of risk, diagnosis and prognosis. While the number of studies describing phenotype-genotype associations is rapidly increasing, progress is hindered by the fragmentation of the various efforts and their corresponding datasets.

The main objective of ASSIST is to facilitate research on cervical cancer through a system that will virtually unify multiple patient record repositories, physically located in different medical centers/hospitals. Innovative, knowledge-intensive semantic modelling, fuzzy inferencing and data mining techniques will be developed to this end. ASSIST’s inference engine will translate medical concepts into syntactic values that legacy systems may perceive and support the process of evaluating medical hypotheses and conducting association studies.

The unification of participating archives, containing both clinical and genetic data, into a single medical knowledge source will increase research flexibility by allowing the formation of study groups “on demand” and by recycling patient records in new studies. This approach is expected to benefit the study of gynecological neoplasias, whose evaluation requires long-term studies including referral to patients’ antecedents and descendants.

ASSIST will incorporate a quality assurance mechanism to resolve security and ethical issues. The consortium comprises four IT research partners, four developers, and three research hospitals. The gynecological clinics in these hospitals, already owning a sizeable amount of clinical and genetic data, will attempt to uncover relations between HPV, patient habits and patient genotype.

Project objectives

Cervical cancer is the second most common cancer worldwide. Infection by the human papillomavirus (HPV) is accepted as the central risk factor for cervical cancer; however, it is unlikely to be the sole cause for developing cancer. Findings indicate that other factors in addition to HPV infection are likely to be important determinants in cervical carcinogenesis. Ongoing research includes investigating the role of specific genetic and environmental factors in determining HPV-persistence and subsequent progression of disease. Association studies among (i) genetic characteristics (genotype) and (ii) environmental agents and virus characteristics (phenotype) can suggest pathogenetic mechanisms that will provide new markers of risk, diagnosis and prognosis, and possibly treatment. As in most common diseases, like heart disease and diabetes, the number of studies describing phenotype-genotype associations in cervical cancer is rapidly increasing; however, similar studies show variation in the underlying association between genotype and outcome between the populations studied. Association studies require large sets of patient phenotypic (clinical data on how the disease is expressed, i.e. virus characteristics, clinical test data, patient lifestyle, etc) and genotypic data (e.g. polymorphisms of a gene) all provided in a structured format. Genotypic data that has become available through molecular genetic testing has been used to study suspected gene polymorphisms for cervical cancer. However, due to cost limitations, in many studies patient sample sizes have been inadequate. Additionally, the phenotypic characteristics used in these studies are not standardised and are usually very limited. 
Although clinical records contain a wealth of patient data that could be associated with genotypic information, manually extracting all of it for a large number of patients and presenting it in a standardised form, in order to perform a clinical study, would be a very time consuming and expensive process.
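To make the notion of an association measure concrete, the following is a minimal sketch of one common statistic, the odds ratio with an approximate 95% confidence interval (Woolf's logit method), computed from a 2x2 table of genotype against disease outcome. The project text does not say which statistics ASSIST uses, so the choice of measure, the function name and the counts below are assumptions made purely for illustration.

```python
from math import exp, log, sqrt


def odds_ratio(a, b, c, d):
    """Odds ratio and approximate 95% CI for a 2x2 genotype/outcome table.

    a: carriers of the polymorphism with disease
    b: carriers without disease
    c: non-carriers with disease
    d: non-carriers without disease
    """
    if min(a, b, c, d) <= 0:
        # The logit interval is undefined when any cell is zero.
        raise ValueError("all four cell counts must be positive")
    estimate = (a * d) / (b * c)
    # Standard error of log(OR) under Woolf's approximation.
    se = sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = exp(log(estimate) - 1.96 * se)
    upper = exp(log(estimate) + 1.96 * se)
    return estimate, lower, upper


if __name__ == "__main__":
    # Hypothetical counts, not taken from any study.
    estimate, lower, upper = odds_ratio(30, 70, 15, 85)
    print(f"OR = {estimate:.2f}, 95% CI = ({lower:.2f}, {upper:.2f})")
```

The width of the interval shrinks as the cell counts grow, which is one way to see why small patient samples limit what a single-centre study can conclude.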

The overall objective of ASSIST is to provide cervical cancer researchers with an integrated environment that will virtually unify multiple patient record repositories, physically located at different laboratories, clinics and/or hospitals. This environment will enable researchers to combine phenotypic and genotypic data, utilise existing patient records from several clinics, and, eventually, perform biomedical research in a low-cost and time-efficient way. In fact, ASSIST will exploit its internal Medical Knowledge Base and Inference Engine in order to automate the process of evaluating medical hypotheses of the type used in Association Studies. From a medical research perspective, the above major objective of ASSIST translates into overcoming two stereotypes. The first is that of performing “research in isolation”, a practice that leads to mostly unreliable statistical analyses at an extremely high cost and/or after long patient data collection periods. The second is that of “disposable patient study groups”, i.e., collections of patient records gathered for a single experiment and never reused. Contrary to the above stereotypes, the prospect of ASSIST is to increase research flexibility by allowing the construction of study groups “on demand”, collecting data from multiple participating medical archives of possibly diverse nature, internal structure and location. On top of that, the collected patient records are expected to largely incorporate data gathered during regular operation of the participating clinics, including old examination results and past findings, in a reusable manner. This approach is expected to be of major benefit in studying gynecological neoplasias, including cervical carcinoma, whose evaluation requires long-term studies including referral to patients’ antecedents and descendants.

State of the art

ASSIST is expected to build on recent progress in medical informatics together with state-of-the-art approaches in knowledge representation and processing technologies. In fact, these two domains seem to converge in the sense that “knowledge-aware” medical information systems are expected to allow efficient search, possess combinatorial and summarization capabilities, and support medical decisions in a context-sensitive manner. These three characteristics are becoming necessary as (i) large numbers of electronic health records (EHR) are being produced by hospital information systems (HIS) at increasing rates, (ii) medical knowledge is being electronically encoded, and (iii) medical entities tend to have an increasing complexity, driven especially by the genomics revolution.

Facing the demand for “knowledge awareness”, research in medical informatics seeks solutions to three (sometimes overlapping) problems, namely: (a) Medical data representation, (b) Medical knowledge representation, and (c) Medical inference. As far as medical data representation is concerned, WG I of CEN/TC251 leads standardization in Europe. Work Group I, partnered by European industrial and academic groups, attempts to produce abstract and specific schemas for the formal description of patient records (incl. demographic data and history), examination results, as well as guidelines and protocols followed in medical practice. Their approach, greatly influenced by the work of the openEHR consortium, discriminates the conceptual structure (archetypes) of a medical entity from the corresponding application/location specific details (ontologic part of the description). Archetype Definition Language (ADL), an abstract conceptual language, has been proposed by openEHR for possible inclusion in the forthcoming ENV 13606 EHR standard. In parallel, UCL’s Centre for Health Informatics and Multiprofessional Education (CHIME) leads the EHRcom task force following up the Synapses project, which produced the Synapses Federated Healthcare Record (SynFHCR) hierarchy. A similar, independent approach is followed by the Systematic Software Engineering (SSE) system (of Aarhus County in Denmark). On the other hand, the attempts at representation of medical knowledge emerge as a natural evolution of the data abstraction approaches already mentioned.

Essentially, medical knowledge comprises two types of information: (i) formal abstract definitions of medical entities (such as diseases, symptoms, etc.) and (ii) implication rules of a “cause and effect” form. WG II of CEN/TC251 attempts to produce relevant standards. Among the most interesting approaches, the Galen framework and its contributors have offered software tools and the GRAIL language for defining and handling mainly type-(i) knowledge by means of their Common Reference Model ontology. The third type of problem, medical inference, is, in ASSIST’s perspective, the process of exploiting medical knowledge in order to “search in” or “make decisions on the basis of” collections of medical data. The latter can be of two major types: (i) medical information thesauri or (ii) real patient data of any origin. A number of projects and research publications address this problem for type-(i) data. An interesting approach to inference on type-(ii) data has been followed within the CLEF project ([CLEF]). ASSIST’s major contribution lies in this area too. In a broader perspective, research related to knowledge representation and processing is governed by the so-called “semantic approach”. The related research is mainly performed within the artificial intelligence, computer vision and theoretical informatics communities. It is greatly boosted by recent developments in the semantic annotation of audiovisual content (e.g. MPEG-21) and by the so-called semantic web initiative.
Methodologically, related scientific attempts split into four areas: (i) representation of the (collections of) semantic entities in various forms of semantic encyclopedias (i.e., knowledge bases), where fuzzy set theory, ontologies, formal description languages and tools including XML, RDF and OWL are currently employed, (ii) inference mechanisms, where fuzzy inference and description logic play the key role, (iii) supervised or even unsupervised instantiation of semantic definitions in terms of relations (usually fuzzy) between associated entities, where neurofuzzy techniques and data mining dominate, and (iv) instantiations of semantic encyclopedias for specific knowledge domains by the corresponding experts. ASSIST is expected to span all four areas.

In order to facilitate association studies (associating genotypic and phenotypic factors related to cervical cancer), ASSIST resorts to medical inference applied to real patient data. Following the semantic approach, ASSIST will build on the standards and research results briefly presented earlier in order to construct its Medical Knowledge Base. The targeted virtual unification of the participating archives and interpretation of their content relies upon the semantic indexing of their records. Unlike the conventional way of treating stored medical information as alphanumeric data structures whose interpretation is carried out by the human user, ASSIST’s inference engine will translate medical concepts into syntactic values that legacy systems may perceive.

In addition to this inference engine, the proposed architecture will incorporate two important interfacing modules. The first is the interface to its users (mainly medical researchers). It will offer expressive tools for posing their queries. The most interesting among these tools would be a graphical environment for submitting research hypotheses whose validity is to be evaluated. This interface will be medical knowledge aware in the sense that it will allow expression of domain specific queries and particular hypotheses by referring to medical ontologies (including gene related ones) contained in the Medical Knowledge Base. The second type of interface will support exchange of data between ASSIST’s core engine and the participating medical archives in a way transparent to the end user. These interfaces will appropriately translate and dispatch user queries to the specific “language” of each legacy archive. Translation will be supported by the semantic inference engine of ASSIST. Conversely, the collected responses will be reformatted to a uniform set of semantic and syntactic virtual records. The latter will be the actual statistical sample for testing medical hypotheses and drawing association measures. To develop the above components, the IT researchers in this project will work in close and constant collaboration with the medical research partners in the ASSIST consortium. In fact, each one of the three participating hospitals in Germany, Greece, and Belgium has been assigned a software R&D partner to elicit requirements and medical expertise and assist with installation and training at the later stages of the project. Upon successful completion, the ASSIST environment aspires to function as an important technology enabler for cervical cancer research by allowing any medical group active in this area to use its facilities and/or contribute their own results.
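The query translation and response unification described above can be illustrated with a small sketch. It is not ASSIST's actual design: the archive names, field names, coded values and the VirtualRecord type are all hypothetical, and a simple lookup table stands in for the semantic inference engine. The sketch only shows the general idea of posing one concept-level query, translating it into each archive's own "language", and returning uniform records.

```python
from dataclasses import dataclass

# Hypothetical mapping from shared medical concepts to the (field, value)
# pair each legacy archive uses to encode them.
ARCHIVE_MAPPINGS = {
    "hospital_a": {"hpv_positive": ("hpv_test", "POS"), "smoker": ("smoking", "Y")},
    "hospital_b": {"hpv_positive": ("HPV", "positive"), "smoker": ("tobacco_use", "yes")},
}

# Hypothetical name of the patient identifier field in each archive.
ID_FIELDS = {"hospital_a": "id", "hospital_b": "pid"}

# Hypothetical in-memory stand-ins for the legacy archives.
ARCHIVES = {
    "hospital_a": [
        {"id": "a1", "hpv_test": "POS", "smoking": "Y"},
        {"id": "a2", "hpv_test": "POS", "smoking": "N"},
        {"id": "a3", "hpv_test": "NEG", "smoking": "Y"},
    ],
    "hospital_b": [
        {"pid": "b1", "HPV": "positive", "tobacco_use": "yes"},
        {"pid": "b2", "HPV": "negative", "tobacco_use": "yes"},
    ],
}


@dataclass(frozen=True)
class VirtualRecord:
    """Uniform record returned to the researcher, whatever the source."""
    source: str
    patient_id: str
    concepts: frozenset


def to_virtual(archive, row):
    """Re-express a legacy row in terms of the shared concepts."""
    mapping = ARCHIVE_MAPPINGS[archive]
    present = frozenset(
        concept for concept, (field, value) in mapping.items()
        if row.get(field) == value
    )
    return VirtualRecord(archive, row[ID_FIELDS[archive]], present)


def query(concepts):
    """Return virtual records of patients matching all requested concepts."""
    results = []
    for archive, rows in ARCHIVES.items():
        # Translate each concept into this archive's field/value predicate.
        predicates = [ARCHIVE_MAPPINGS[archive][c] for c in concepts]
        for row in rows:
            if all(row.get(field) == value for field, value in predicates):
                results.append(to_virtual(archive, row))
    return results


if __name__ == "__main__":
    for record in query(["hpv_positive", "smoker"]):
        print(record.source, record.patient_id, sorted(record.concepts))
```

In this sketch the researcher never sees the per-archive field names; in the architecture described above, the fixed lookup table would be replaced by translation driven by the Medical Knowledge Base.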
Last update: Feb 10, 2006