Integration of data management and analysis for genome research
Technological advances in genome research have produced unprecedented volumes of genetic and molecular data that now provide the context for any biological research. However, data access, curation, and analysis have remained challenging areas for continued research and development and often prove to be the bottleneck for scientific progress.

Many a paper in bioinformatics, or even in general molecular biology, these days starts out just like the abstract above, with an acknowledgement of the explosive growth of molecular sequence, structure, and expression data. What fit nicely within the printed pages of a thin booklet only 25 years ago now comprises large and increasingly complex databases that are Web-accessible to the public. Figure 1 shows the growth of one major molecular sequence repository, GenBank, maintained at the U.S. National Center for Biotechnology Information (NCBI). The slope of the curve is indeed impressive. However, the actual size of the data sets would seem to be easily dwarfed by database sizes in commercial, governmental, or even other research fields.

What then is the real problem, if any, facing the biology community? I think there are many aspects to consider. One important facet is that the molecular databases themselves have evolved over the years, and surely many details of database design would, in hindsight, have been done differently. However, the rapid pace of new data acquisition has so far prevented any major redesign and reconstruction of the databases the community is accustomed to. Another critical point is that the data derive from a large variety of sources and are intrinsically heterogeneous. There are no uniform standards for data quality and annotation. In these notes I shall not further discuss the challenges faced by the large database providers; rather, I shall review the problem first from the point of view of a user and then suggest some approaches we have pursued to provide intermediate solutions.
Figure 1: Molecular Database Growth (from www.ncbi.nlm.nih.gov/Genbank/genbankstats.html).