Database-primer
Relational database concepts from computer science and information retrieval concepts from digital libraries are important for understanding biological databases. The design, development, and long-term management of biological databases are a core area of bioinformatics. Data contents include gene sequences, textual descriptions, attributes and ontology classifications, citations, and tabular data. These are often described as semi-structured data, and can be represented as tables, key-delimited records, and XML structures. Cross-references among databases are common and are made with database accession numbers.
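As a minimal sketch of how the same semi-structured entry can move between these representations, the Python below parses a key-delimited record into a dictionary and then serializes it as XML. The two-letter keys, the accession numbers, the database names, and the sequence are all invented for illustration and do not follow any particular database's real format.

```python
import xml.etree.ElementTree as ET

# Hypothetical key-delimited record. Every key, accession number, database
# name, and value here is invented for illustration only.
RECORD = """\
ID   EXAMPLE0001
DE   Hypothetical example gene
OS   Example organism
DR   OtherDB; XYZ123
DR   ThirdDB; ABC456
SQ   ATGGCCATTGTAATGGGCCGC
"""


def parse_record(text):
    """Parse 'KEY   value' lines into a dict mapping each key to a list of values."""
    fields = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        key, _, value = line.partition(" ")
        # A key may repeat (e.g. several cross-references), so collect a list.
        fields.setdefault(key, []).append(value.strip())
    return fields


def to_xml(fields):
    """Serialize the parsed fields as a flat XML document, one element per value."""
    root = ET.Element("record")
    for key, values in fields.items():
        for value in values:
            child = ET.SubElement(root, key)
            child.text = value
    return ET.tostring(root, encoding="unicode")


fields = parse_record(RECORD)
xml_text = to_xml(fields)
print(fields["DR"])  # the cross-reference lines, as a list
print(xml_text)
```

The repeated `DR` key shows why such records are called semi-structured: a field may occur zero, one, or many times, which a fixed-column table does not capture directly.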
Biological databases are an important tool that helps scientists understand and explain a host of biological phenomena, from the structure of biomolecules and their interactions, to the whole metabolism of organisms, to the evolution of species. This knowledge helps in the fight against diseases, assists in the development of medications, and aids in discovering basic relationships amongst species in the history of life.
Biological knowledge is distributed amongst many different general and specialized databases, which sometimes makes it difficult to keep the information consistent. Biological databases cross-reference one another by accession number as one way of linking related knowledge together.
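Cross-referencing by accession number can be sketched in relational terms with Python's built-in `sqlite3` module. This is only an illustration under stated assumptions: the table layout, the database names (`SeqDB`, `ProtDB`, `StructDB`), and the accession numbers are hypothetical and are not taken from any real resource.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical schema: one table of entries, one table of cross-references.
conn.executescript("""
CREATE TABLE entry (
    accession   TEXT PRIMARY KEY,
    source_db   TEXT NOT NULL,
    description TEXT
);
CREATE TABLE xref (
    from_accession TEXT NOT NULL REFERENCES entry(accession),
    to_db          TEXT NOT NULL,
    to_accession   TEXT NOT NULL
);
""")

# Invented example data; none of these accession numbers are real.
conn.executemany(
    "INSERT INTO entry VALUES (?, ?, ?)",
    [
        ("SEQ0001", "SeqDB", "Hypothetical gene sequence"),
        ("PROT0001", "ProtDB", "Hypothetical protein"),
    ],
)
conn.executemany(
    "INSERT INTO xref VALUES (?, ?, ?)",
    [
        ("SEQ0001", "ProtDB", "PROT0001"),
        ("SEQ0001", "StructDB", "STR0001"),  # target entry is not held locally
    ],
)


def linked_entries(conn, accession):
    """Return (target database, target accession, description) for each cross-reference."""
    return conn.execute(
        """
        SELECT x.to_db, x.to_accession, e.description
        FROM xref AS x
        LEFT JOIN entry AS e
               ON e.accession = x.to_accession AND e.source_db = x.to_db
        WHERE x.from_accession = ?
        ORDER BY x.to_db
        """,
        (accession,),
    ).fetchall()


for row in linked_entries(conn, "SEQ0001"):
    print(row)
```

The `LEFT JOIN` keeps cross-references whose target is missing locally; such a row comes back with a description of `None`, which is one simple way to spot the consistency problems described above.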
An important resource for finding biological databases is a special yearly issue of the journal Nucleic Acids Research (NAR). The Database Issue of NAR is freely available, and categorizes many of the publicly available online databases related to biology and bioinformatics.
Biological data comes in many formats, including text, sequence data, protein structures, and links. Each kind of data is available from particular sources, for example:
The International Nucleotide Sequence Database (INSD) (http://www.insdc.org/) consists of the following databases.
The three databases, DDBJ (Japan), GenBank (USA) and EMBL Nucleotide Sequence Database (Europe), are repositories for nucleotide sequence data from all organisms. All three databases accept nucleotide sequence submissions, and then exchange new and updated data on a daily basis to achieve optimal synchronisation between them. These three databases are primary databases, as they house original sequence data.
Strictly speaking, a metadatabase is a database of databases, rather than any one integration project or technology. Metadatabases collect data from different sources and usually make them available in a new and more convenient form, or with an emphasis on a particular disease or organism.
Genome databases collect organism genome sequences, annotate and analyze them, and provide public access. Some add curation of experimental literature to improve computed annotations. A genome database may hold the genomes of many species or the genome of a single model organism.
Because discovery in the area of protein structure has not advanced as quickly as discovery in the area of sequence data, owing to the 3D nature of protein structure, less information is available for it. Nonetheless, structure data can be accessed through the RCSB Protein Data Bank (http://www.pdb.org), SCOP (Structural Classification of Proteins), and CATH.