DOME: A database for OMEs
Sequencing projects provide a great deal of information about the genome of a species. However, to gain insight into how a cell functions, we need to understand the functions of its genes, proteins and metabolites. Currently, functional genomics focuses on analyzing the expression profiles of the RNA and protein molecules in the cell. Recently this has started to include metabolite profiles, although only a handful of labs are doing this at present (we expect this number to grow). There is thus a need to integrate the data obtained from these three types of analysis. We believe that, in order to understand the system as a whole, it is extremely important to collect RNA, protein and metabolite profiles in such a way that they can be combined (i.e. by making sure that the experimental conditions are compatible). Such detailed molecular profiling invariably produces a large amount of data that can only be managed properly with a bioinformatic system. Such a system must provide the means to link these different data and to allow statistical analyses across data types. We have developed a relational database system, which we call DOME, specifically for this purpose.
DOME is composed of several sub-sections: one containing details about experimental design (metadata), one with raw data, one with processed data (i.e. analysis results), and finally an ontology describing the known molecular biology of the species of interest (called B-Net).
To allow data of several types to be merged, it is essential that the database captures metadata about the experiments. The database is general enough to store different kinds of experiments, such as time-series or steady-state experiments. The experimental metadata is what initially allows microarray, proteomic and metabolite profile samples to be combined. Raw data comprise mass and UV spectra, and microarray and gel images. Processed data consist of metabolite, protein and mRNA identities and relative levels. These fall into two major classes: separation-spectral data (metabolomics and proteomics) and probe data (microarray).
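The role of the experimental metadata in combining samples can be illustrated with a minimal sketch. The table and column names below (experiment, sample, data_type, treatment, time_point_min) and the sample values are hypothetical, chosen for illustration only; they are not the actual DOME schema. The sketch uses Python's built-in sqlite3 module and pairs samples of two data types that share the same experiment, treatment and time point.

```python
import sqlite3


def build_metadata_demo() -> sqlite3.Connection:
    """Create an in-memory database with a hypothetical metadata schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE experiment (
            id     INTEGER PRIMARY KEY,
            design TEXT NOT NULL            -- e.g. 'time series', 'steady state'
        );
        CREATE TABLE sample (
            id             INTEGER PRIMARY KEY,
            experiment_id  INTEGER NOT NULL REFERENCES experiment(id),
            data_type      TEXT NOT NULL,   -- 'microarray', 'proteomics', 'metabolomics'
            treatment      TEXT NOT NULL,
            time_point_min INTEGER NOT NULL
        );
        """
    )
    conn.execute("INSERT INTO experiment (id, design) VALUES (1, 'time series')")
    # Illustrative samples only; not real experimental data.
    conn.executemany(
        "INSERT INTO sample VALUES (?, ?, ?, ?, ?)",
        [
            (1, 1, "microarray", "heat shock", 0),
            (2, 1, "microarray", "heat shock", 30),
            (3, 1, "proteomics", "heat shock", 30),
            (4, 1, "metabolomics", "heat shock", 30),
            (5, 1, "metabolomics", "heat shock", 60),
        ],
    )
    return conn


def matched_samples(conn: sqlite3.Connection, type_a: str, type_b: str):
    """Pair samples of two data types taken under the same conditions."""
    return conn.execute(
        """
        SELECT a.id, b.id, a.treatment, a.time_point_min
        FROM sample AS a
        JOIN sample AS b
          ON  a.experiment_id  = b.experiment_id
          AND a.treatment      = b.treatment
          AND a.time_point_min = b.time_point_min
        WHERE a.data_type = ? AND b.data_type = ?
        ORDER BY a.time_point_min, a.id, b.id
        """,
        (type_a, type_b),
    ).fetchall()


if __name__ == "__main__":
    db = build_metadata_demo()
    for pair in matched_samples(db, "microarray", "metabolomics"):
        print(pair)
```

In this sketch only samples whose metadata agree are paired, so a microarray sample with no metabolomics counterpart at the same time point is simply left out of the combined analysis.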
An issue that becomes very important in functional genomics is that the bioinformatic system should be flexible enough to allow new methods of data analysis to be added. This is because research into which algorithms best reveal gene function from the mass of data produced is still very active. Our database system is designed to insulate the details of any analysis method from the rest of the system. This allows us to develop the database management independently of the data analysis and, more importantly, to extend the data analysis after the system has been deployed. The benefits of this approach will become evident as new algorithms are developed for the analysis of raw data. We emphasize that this is possible only because we designed the system to be extensible from the start.
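One common way to insulate analysis methods from the rest of a system is a registry of named methods behind a single entry point. The sketch below illustrates that general idea only; it is not a description of how DOME implements it, and the names register, run_analysis, mean_center and scale_to_max are hypothetical.

```python
from typing import Callable, Dict, List

# An analysis method maps a list of raw values to a list of processed values.
AnalysisFn = Callable[[List[float]], List[float]]

_REGISTRY: Dict[str, AnalysisFn] = {}


def register(name: str) -> Callable[[AnalysisFn], AnalysisFn]:
    """Decorator that adds an analysis method to the registry under a name."""
    def decorator(fn: AnalysisFn) -> AnalysisFn:
        _REGISTRY[name] = fn
        return fn
    return decorator


def run_analysis(name: str, raw: List[float]) -> List[float]:
    """Single entry point: callers never depend on a method's internals."""
    try:
        method = _REGISTRY[name]
    except KeyError:
        raise ValueError(f"unknown analysis method: {name}") from None
    return method(raw)


@register("mean_center")
def mean_center(values: List[float]) -> List[float]:
    """Subtract the mean from every value."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]


# A method added later needs no change to run_analysis or to its callers.
@register("scale_to_max")
def scale_to_max(values: List[float]) -> List[float]:
    """Divide every value by the largest value."""
    peak = max(values)
    return [v / peak for v in values]


if __name__ == "__main__":
    print(run_analysis("mean_center", [1.0, 2.0, 3.0]))
    print(run_analysis("scale_to_max", [2.0, 4.0]))
```

The design choice here is that the data management code only ever knows a method by its name, so new algorithms can be registered after deployment without touching it.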
For the gene expression data we have constructed a series of tables that capture the intricate details of these experiments and their results. This database, when made publicly available, will be MIAME compliant. For 2D-PAGE/MALDI-MS/ESI-MS proteomics results, we have constructed tables that make it easy to relate each spot in a gel to its approximate molecular weight, approximate pI, area, density and mass spectrum, and to the protein identity if this has already been determined. Metabolite profiling data consist of chromatograms from various instrument platforms that associate a mass or electromagnetic absorption spectrum with each chromatographic peak. Our schema allows any peak of a chromatogram to be annotated with a compound identification as the compounds comprising the peaks become known. Interrelated with this database is a reference database, B-NET, which contains the list of known genes, proteins, metabolites and enzymatic reactions and which forms an ontology for the annotation of results.
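The idea of annotating peaks as their compounds become known can be sketched with a nullable reference from a peak to a compound in a reference table. The schema below is a hypothetical illustration using Python's built-in sqlite3 module; the table names, column names, platform label and compound are assumptions, not the actual DOME or B-NET schema.

```python
import sqlite3
from typing import List

SCHEMA = """
CREATE TABLE compound (              -- stand-in for a reference list of metabolites
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE
);
CREATE TABLE chromatogram (
    id       INTEGER PRIMARY KEY,
    platform TEXT NOT NULL
);
CREATE TABLE peak (
    id               INTEGER PRIMARY KEY,
    chromatogram_id  INTEGER NOT NULL REFERENCES chromatogram(id),
    retention_time_s REAL NOT NULL,
    compound_id      INTEGER REFERENCES compound(id)  -- NULL until identified
);
"""


def build_peak_demo() -> sqlite3.Connection:
    """Create an in-memory database with one chromatogram and two unknown peaks."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    # Illustrative values only; not real measurements.
    conn.execute("INSERT INTO compound (id, name) VALUES (1, 'pyruvate')")
    conn.execute("INSERT INTO chromatogram (id, platform) VALUES (1, 'GC-MS')")
    conn.executemany(
        "INSERT INTO peak (id, chromatogram_id, retention_time_s) VALUES (?, ?, ?)",
        [(1, 1, 312.4), (2, 1, 455.0)],
    )
    return conn


def annotate_peak(conn: sqlite3.Connection, peak_id: int, compound_name: str) -> None:
    """Link a peak to a compound that already exists in the reference table."""
    row = conn.execute(
        "SELECT id FROM compound WHERE name = ?", (compound_name,)
    ).fetchone()
    if row is None:
        raise ValueError(f"compound not in reference table: {compound_name}")
    conn.execute("UPDATE peak SET compound_id = ? WHERE id = ?", (row[0], peak_id))


def unidentified_peaks(conn: sqlite3.Connection, chromatogram_id: int) -> List[int]:
    """Return the ids of peaks that have no compound annotation yet."""
    rows = conn.execute(
        "SELECT id FROM peak WHERE chromatogram_id = ? AND compound_id IS NULL "
        "ORDER BY id",
        (chromatogram_id,),
    ).fetchall()
    return [r[0] for r in rows]


if __name__ == "__main__":
    db = build_peak_demo()
    annotate_peak(db, 2, "pyruvate")
    print(unidentified_peaks(db, 1))
```

Because the annotation is a nullable link rather than a required field, peaks can be stored as soon as they are measured and identified later without changing the stored raw data.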
