1. KEGG: Kyoto Encyclopedia of Genes and Genomes
1.1. KEGG Databases
KEGG is a bioinformatics resource for understanding the higher-order functions and utilities of biological systems, such as the cell, the organism, and the biosphere, from genomic and molecular information.
To link genomes to biological systems, the KEGG resource is organized into building blocks in the genomic space (KEGG GENES) and the chemical space (KEGG LIGAND), wiring diagrams of interaction networks and reaction networks (KEGG PATHWAY), and hierarchical classifications of various aspects of biological systems (KEGG BRITE).
Database | Content | Source |
---|---|---|
PATHWAY | Protein interaction and reaction networks for metabolism, various cellular processes, and human diseases | Manually entered from published materials |
GENES | GENES: Gene catalogs of complete genomes with manual annotation | Generated from RefSeq and other public resources with reannotation by KEGG |
GENES | DGENES: Gene catalogs of draft genomes with automatic annotation | |
GENES | EGENES: Gene catalogs (consensus contigs) of EST data with automatic annotation | |
GENES | GENOME: Genome maps and organism information | |
GENES | SSDB: Sequence similarities with best-hit information for identifying ortholog/paralog clusters and conserved gene clusters | Computationally derived from GENES by pairwise genome comparisons of all protein-coding genes |
GENES | EXPRESSION: Microarray gene expression profiles | Microarray data obtained by the Japanese groups |
LIGAND | COMPOUND: Chemical compound structures | Manually entered from published materials |
LIGAND | DRUG: Chemical structures of drugs | |
LIGAND | GLYCAN: Glycan structures | |
LIGAND | REACTION: Chemical reactions | |
LIGAND | RPAIR: Chemical structure transformation patterns | |
LIGAND | ENZYME: Enzyme nomenclature | Generated from IUBMB/IUPAC nomenclature |
BRITE | Functional hierarchies representing our knowledge on various aspects of biological systems including KO (KEGG Orthology) grouping | Manually entered from published materials |
See also: Kanehisa, M., Goto, S., Hattori, M., Aoki-Kinoshita, K.F., Itoh, M., Kawashima, S., Katayama, T., Araki, M., and Hirakawa, M.; From genomics to chemical genomics: new developments in KEGG. Nucleic Acids Res. 34, D354-357 (2006).
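The SSDB entry above mentions best-hit information for identifying ortholog clusters. One common way to use such information is to look for bidirectional (reciprocal) best hits between two genomes. The sketch below illustrates that idea only; the gene names and scores are invented, and it is not presented as the procedure KEGG itself uses.

```python
def best_hits(scores):
    """For each query gene, keep the target gene with the highest score."""
    best = {}
    for (query, target), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (target, score)
    return {query: target for query, (target, _) in best.items()}


def bidirectional_best_hits(a_vs_b, b_vs_a):
    """Gene pairs (a, b) where a's best hit is b and b's best hit is a."""
    forward = best_hits(a_vs_b)
    backward = best_hits(b_vs_a)
    return sorted((a, b) for a, b in forward.items() if backward.get(b) == a)


# Hypothetical similarity scores between genes of genome A and genome B.
a_vs_b = {("a1", "b1"): 250, ("a1", "b2"): 90, ("a2", "b2"): 120}
b_vs_a = {("b1", "a1"): 245, ("b2", "a1"): 95, ("b2", "a2"): 80}

pairs = bidirectional_best_hits(a_vs_b, b_vs_a)
print(pairs)  # [('a1', 'b1')]
```

Here a2's best hit is b2, but b2's best hit is a1, so only the a1/b1 pair is reciprocal.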
1.2. Graph Representation
KEGG is based on the concept of a graph for representing and manipulating data. Mathematically, a graph is a set of nodes (building blocks) and edges (interactions or relations). There are three types of graphs, covering the molecular objects of genes, proteins, and chemical compounds.
Graph | Node | Edge | Main Databases |
---|---|---|---|
Protein network | Protein (Gene product) | Generalized protein interaction (direct protein-protein interaction, gene expression relation, enzyme-enzyme relation) | PATHWAY |
Gene universe | Gene | Adjacency on chromosome, Sequence/structural similarity, Expression similarity, etc. | GENES, SSDB, KO |
Chemical universe | Chemical compound | Chemical reaction, Structural similarity | LIGAND |
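The node-and-edge view described above can be sketched as a small data structure. This is a minimal illustration, not KEGG's internal representation: the `Graph` class, the node names, and the relation labels are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Graph:
    """A graph as a set of nodes plus a set of labelled edges."""
    nodes: set = field(default_factory=set)
    # Each edge is a tuple (node_a, node_b, relation).
    edges: set = field(default_factory=set)

    def add_edge(self, a, b, relation):
        """Add an edge and register both endpoints as nodes."""
        self.nodes.update((a, b))
        self.edges.add((a, b, relation))

    def neighbors(self, node):
        """Nodes connected to `node`, treating edges as undirected."""
        found = set()
        for a, b, _ in self.edges:
            if a == node:
                found.add(b)
            elif b == node:
                found.add(a)
        return found


# Hypothetical protein network: nodes are proteins, edges are relations.
protein_network = Graph()
protein_network.add_edge("proteinA", "proteinB", "direct interaction")
protein_network.add_edge("proteinB", "proteinC", "enzyme-enzyme relation")

# Hypothetical chemical universe: nodes are compounds, edges are reactions.
chemical_universe = Graph()
chemical_universe.add_edge("compoundX", "compoundY", "chemical reaction")

print(sorted(protein_network.neighbors("proteinB")))  # ['proteinA', 'proteinC']
```

The same class serves all three graph types; only the meaning of the nodes and the edge labels changes.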
Another important concept in KEGG is the level of abstraction, which is represented by nested graphs. A nested graph is a graph whose nodes can themselves be graphs. Thus, a subgraph at one level corresponds to a node at a higher level. Examples include the following.
Higher-level node | Subgraph | Database |
---|---|---|
KEGG Orthology (KO) group | Set of genes | KO |
Pathway module | Set of proteins | PATHWAY |
Protein family | Set of proteins | BRITE |
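The nesting idea can be sketched as a node type whose members are themselves nodes. This is an illustrative assumption about how one might model it: the `Node` class and all identifiers are hypothetical, and edges are omitted to keep the sketch short.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A node that may contain a lower-level graph (its members)."""
    name: str
    members: list = field(default_factory=list)

    def leaves(self):
        """Names of all lowest-level nodes beneath this node, in order."""
        if not self.members:
            return [self.name]
        names = []
        for member in self.members:
            names.extend(member.leaves())
        return names


# A higher-level node (here a KO-like group) whose subgraph is a set of genes.
ko_group = Node("group_1", [Node("org1:geneA"), Node("org2:geneB")])

# Nesting can continue upward: this node contains the group plus one more gene.
higher = Node("level_2", [ko_group, Node("org1:geneC")])

print(higher.leaves())  # ['org1:geneA', 'org2:geneB', 'org1:geneC']
```

Collapsing `ko_group` to a single node gives the higher-level view; expanding it recovers the set of genes.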
1.3. Network Hierarchy
The protein network is the most distinctive data object in KEGG and is stored as a collection of pathway maps (diagrams) in the PATHWAY database.
Reflecting the map resolution, the KEGG protein network (that is, the PATHWAY database) is organized in a hierarchy.
The top two levels of the current hierarchy are as follows.
First Level | Second Level |
---|---|
Metabolism | Carbohydrate Metabolism; Energy Metabolism; Lipid Metabolism; Nucleotide Metabolism; Amino Acid Metabolism; Metabolism of Other Amino Acids; Glycan Biosynthesis and Metabolism; Biosynthesis of Polyketides and Nonribosomal Peptides; Metabolism of Cofactors and Vitamins; Biosynthesis of Secondary Metabolites; Biodegradation of Xenobiotics |
Genetic Information Processing | Transcription; Translation; Sorting and Degradation; Replication and Repair |
Environmental Information Processing | Membrane Transport; Signal Transduction; Signaling Molecules and Interaction |
Cellular Processes | Cell Motility; Cell Growth and Death; Cell Communication; Endocrine System; Immune System; Nervous System; Sensory System; Development; Behavior |
Human Diseases | Neurodegenerative Disorders; Infectious Diseases; Metabolic Disorders; Cancers |
1.4. KEGG Orthology (KO)
In KEGG, the integration of pathway information and genomic information was first achieved through EC numbers.
Once the EC numbers were correctly assigned to enzyme genes in the genome, organism-specific pathways could be generated automatically by matching against the networks of EC numbers (enzymes) in the reference metabolic pathways.
However, in order to incorporate non-metabolic pathways and to overcome various problems inherent in the enzyme nomenclature, a new scheme based on ortholog IDs was introduced to replace the EC numbers.
KO (KEGG Orthology) is a further extension of ortholog IDs based on not only the pathway maps but also the BRITE functional hierarchies, most notably classifications of protein families.
Identifier | Purpose |
---|---|
EC number | Mapping enzyme genes to metabolic pathways |
Ortholog ID | Mapping genes to both metabolic and regulatory pathways |
KO | Mapping genes to both pathways and BRITE hierarchies |
Thus, under the current KO system, the KO identifiers (K numbers) are placed at the fourth (lowest) level in the network hierarchy shown above, or at the lowest level of the BRITE hierarchy.
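The matching step described in this section, where identifiers assigned to genes are matched against the identifiers in a reference pathway, can be sketched as follows. This is a simplified illustration under stated assumptions: the K numbers and gene names are placeholders that are not meant to refer to real entries, and a reference pathway is reduced to a plain set of identifiers.

```python
# Reference pathway reduced to the set of K numbers on its nodes (placeholders).
reference_pathway = {"K00001", "K00002", "K00003", "K00004"}

# Hypothetical gene -> K number assignments for one organism.
gene_to_ko = {
    "org:gene1": "K00001",
    "org:gene2": "K00003",
    "org:gene3": "K00003",
    "org:gene4": "K99999",  # assigned, but not part of this pathway
}


def organism_pathway(reference, assignments):
    """Map each reference K number to the organism's genes assigned to it."""
    mapped = {}
    for gene, ko in sorted(assignments.items()):
        if ko in reference:
            mapped.setdefault(ko, []).append(gene)
    return mapped


pathway = organism_pathway(reference_pathway, gene_to_ko)
# Reference nodes with no gene in this organism.
missing = sorted(reference_pathway - set(pathway))

print(pathway)  # {'K00001': ['org:gene1'], 'K00003': ['org:gene2', 'org:gene3']}
print(missing)  # ['K00002', 'K00004']
```

The same matching works with EC numbers as the identifier, which corresponds to the original scheme for metabolic pathways.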
1.5. BRITE Functional Hierarchy
The BRITE database is a collection of hierarchical text files and binary relation files.
It is intended to supplement the PATHWAY database in two ways.
One is to computerize higher-level knowledge that cannot easily be represented as molecular interaction/reaction networks, in terms of the hierarchically structured vocabulary.
The other is to integrate our knowledge about the genomic space (K numbers) with different types of knowledge in the chemical space (C/D/G/R/A numbers in the LIGAND database).
The BRITE collection is currently categorized as follows.
Top Category | Second Category |
---|---|
Genes and Proteins | Network hierarchy; Protein families |
Compounds and Reactions | Compounds; Compound interactions; Reactions |
Drugs and Diseases | Drugs; Diseases |
Cells and Organisms | Organisms |
2. DBGET/LinkDB: Integrated Database Retrieval System
2.1. Web of Molecular Biology Data
DBGET/LinkDB is the backbone retrieval system for all GenomeNet databases, including a number of molecular biology databases that are mirrored at GenomeNet.
DBGET/LinkDB is based on a flat-file view of molecular biology databases, where the database is considered as a collection of entries.
Because each entry is given a unique entry name (or an accession number) within a database, the molecular biology databases in the world can be retrieved uniformly by the combination of the database name and the entry name:
database:entry
In KEGG an organism is a collection of genes, which may also be considered as a flat-file database.
Any gene or gene product (protein or RNA) in KEGG can thus be specified by the combination of the organism name and the gene name:
organism:gene
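The two-part naming scheme described above (a database or organism name combined with an entry or gene name) can be handled with a small parser. This is a minimal sketch that assumes the colon-separated form used for links later in this section; the `EntryRef` type and `parse_ref` function are hypothetical names, not part of DBGET.

```python
from typing import NamedTuple


class EntryRef(NamedTuple):
    database: str  # database name, or organism name for KEGG genes
    entry: str     # entry name or accession number, or gene name


def parse_ref(text: str) -> EntryRef:
    """Split a 'database:entry' string at the first colon."""
    database, sep, entry = text.strip().partition(":")
    if not sep or not database or not entry:
        raise ValueError(f"expected 'database:entry', got {text!r}")
    return EntryRef(database, entry)


ref = parse_ref("database1:entry1")
print(ref.database, ref.entry)  # database1 entry1
```

Splitting at the first colon only means an entry name that itself contains a colon is kept whole.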
When two data entries are related in any way, it is customary to incorporate cross-reference information in the molecular biology databases.
Examples include links between sequence data and literature data or between amino acid sequence data and nucleotide sequence data.
The link information between two entries is a binary relation represented by:
database1:entry1 --> database2:entry2
LinkDB is a collection of all such direct links in the GenomeNet databases as well as indirect links that are computationally obtained by combining multiple links and/or using links in reverse directions.
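The derivation of reverse and indirect links from direct links can be sketched with plain sets of pairs. This is an illustration of the idea only, not LinkDB's implementation: the function names are hypothetical, and the entries reuse the placeholder `databaseN:entryN` form from above.

```python
def reverse_links(links):
    """Use each link in the reverse direction: a --> b becomes b --> a."""
    return {(target, source) for source, target in links}


def compose(first, second):
    """Indirect links a --> c wherever a --> b is in first and b --> c in second."""
    by_source = {}
    for b, c in second:
        by_source.setdefault(b, set()).add(c)
    return {
        (a, c)
        for a, b in first
        for c in by_source.get(b, ())
        if a != c  # skip links from an entry back to itself
    }


# Direct links: binary relations between database entries.
direct = {
    ("database1:entry1", "database2:entry2"),
    ("database2:entry2", "database3:entry3"),
}

indirect = compose(direct, direct)
print(indirect)  # {('database1:entry1', 'database3:entry3')}

reverse = reverse_links(direct)
print(("database2:entry2", "database1:entry1") in reverse)  # True
```

Longer chains follow by composing the result with the direct links again.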
The web of molecular biology databases can itself be considered another type of graph, with database entries as nodes and cross-reference links as edges.
It is a huge graph somewhat similar to the World Wide Web (WWW).
Graph | Node | Edge |
---|---|---|
World Wide Web | Page | Hyperlink |
Web of molecular biology data | Database entry | Cross-reference link |
KEGG gene universe | Gene | Any relation between genes or gene products |
KEGG protein network | Protein | Protein interaction or relation in known pathways |
KEGG chemical universe | Chemical compound | Chemical reaction |
Chemical compound | Atom | Atomic bond |
Glycan structure | Monosaccharide | Glycosidic bond |
2.2. Databases Available
The following are the GenomeNet databases, many of which are updated daily.
Database | Content | Source |
---|---|---|
*DNA | Generic database name representing: GenBank+EMBL | |
*Protein | Generic database name representing: SwissProt+PIR+PRF+PDBSTR |
*nr-nt | Non-redundant DNA database constructed from GenBank and EMBL |
*nr-aa | Non-redundant Protein database constructed from SwissProt, TrEMBL, TrEMBL_new, PIR, PRF, and GenPept |
*RefSeq | Generic database name representing: RefNuc+RefPep | NCBI |
*RefNuc | NCBI reference nucleotide sequence database |
*RefPep | NCBI reference protein sequence database |
*GenBank | GenBank nucleic acid sequence database (including DDBJ) | NCBI |
*GenPept | Translated GenBank |
*EMBL | EMBL nucleic acid sequence database | EBI |
*SwissProt | SwissProt protein sequence database | ExPASy / EBI |
*TrEMBL | TrEMBL protein sequence database | EBI / ExPASy |
*TrEMBL_new | TrEMBL_new protein sequence database |
PIR | NBRF-PIR protein sequence database | NBRF |
PRF | PRF (Protein Research Foundation) protein sequence database | PRF |
*PDB | RCSB Protein Data Bank for 3D structures | RCSB |
*PDBSTR | Protein Data Bank reorganized as a sequence database |
EPD | Eukaryotic promoter database by Philipp Bucher | ISREC |
PROSITE | Dictionary of protein sites and patterns by Amos Bairoch | ExPASy |
BLOCKS | Blocks of conserved segments by Henikoff and Henikoff | FHCRC |
ProDom | Protein Domain database by Corpet, Gouzy, and Kahn | INRA |
PRINTS | Protein motif fingerprint database by Attwood et al. | UMBER |
Pfam | Protein families and motifs by Washington U. and Sanger Centre | Wash.U / Sanger |
*COMPOUND | Chemical compounds | Kyoto |
*DRUG | Chemical structures of drugs |
*GLYCAN | Carbohydrate structures |
*REACTION | Chemical reactions |
*RPAIR | Reactant pairs and alignments |
*ENZYME | Enzyme nomenclature |
*PATHWAY | KEGG pathway maps and ortholog group tables |
*KO | KEGG Orthology |
*GENES | KEGG gene catalogs |
GENOME | KEGG organisms |
DGENES | KEGG draft genome gene catalogs |
EGENES | KEGG EST gene catalogs (consensus contigs) |
EXPRESSION | KEGG microarray gene expression profiles |
VGENOME | Viral genomes reorganized from RefSeq |
VGENES | Viral genes reorganized from RefSeq |
OGENES | Organelle genes reorganized from RefSeq |
*OMIM | Online Mendelian Inheritance in Man | NCBI |
PMD | Protein mutation database by Ken Nishikawa | DDBJ |
AAindex | Amino acid index database by Kyoto U. | Kyoto |
LITDB | PRF protein/peptide literature database (published as Peptide Information) | PRF |
Medline | Biomedical literature database located at NCBI | NCBI |
*LinkDB | Database of database links maintained by Kyoto U. | |
* Daily/weekly updated databases
3. Computation Services
3.1. Sequence Analysis
System | Content | Developer | Server |
---|---|---|---|
BLAST | Sequence similarity search | NCBI | Kyoto |
FASTA | Sequence similarity search | W. Pearson | Kyoto |
MOTIF | Sequence motif search | ICR, Kyoto | Kyoto |
CLUSTALW | Multiple sequence alignment | D. Higgins et al. | Kyoto |
MAFFT | Multiple sequence alignment | K. Katoh | Kyoto |
PRRN | Multiple sequence alignment | O. Gotoh | Kyoto |
EGassembler | Generation of consensus contigs | Ali Masoudi-Nejad | Tokyo |
KAAS | Automatic genome annotation | Y. Moriya | Kyoto |
See also other computation services
3.2. Chemical Analysis
System | Content | Developer | Server |
---|---|---|---|
SIMCOMP | Compound similarity search | M. Hattori | Kyoto |
SUBCOMP | Compound substructure search | N. Tanaka | Kyoto |
KCaM | Glycan structure search | K.F. Aoki | Kyoto |
e-zyme | Reaction prediction | M. Kotera | Kyoto |
4. GenomeNet Addresses
Last updated: November 10, 2010