A recent thread by Peter on the BioSQL mailing list prompted some thinking about formalizing ontologies and terms in BioSQL. The current ad hoc solution is that BioPerl, Biopython and BioJava attempt to use the same naming schemes. The worry is that this is not documented, no one is in a hurry to document it, and we are essentially inventing another ontology.
BioSQL's approach of storing key/value pair information on items maps onto RDF triples as follows:
|BioSQL|RDF|
|---|---|
|Bioentry or Feature|Subject|
|Ontology|Namespace of predicate|
|Term|Predicate term, relative to namespace|
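As a rough illustration of the mapping above, here is a minimal Python sketch that turns one BioSQL-style key/value annotation into a triple. The function name `qualifier_to_triple`, the `Triple` type and the `urn:example:` subject identifier are hypothetical and exist only for this example. Treating the stored value as the object of the triple is my reading of the table rather than something BioSQL defines. The Dublin Core terms namespace is the only real identifier used.

```python
from typing import NamedTuple

# Dublin Core terms namespace, standing in for a BioSQL "ontology".
DC_TERMS = "http://purl.org/dc/terms/"


class Triple(NamedTuple):
    """A bare-bones RDF-style triple (hypothetical helper type)."""
    subject: str
    predicate: str
    obj: str


def qualifier_to_triple(item_uri, ontology_ns, term, value):
    """Map one key/value annotation on a bioentry or feature to a triple.

    item_uri    -- identifier of the bioentry or feature (the subject)
    ontology_ns -- namespace of the ontology holding the term
    term        -- term name, relative to that namespace
    value       -- the stored value (assumed here to be the object)
    """
    return Triple(item_uri, ontology_ns + term, value)


# Hypothetical subject identifier and value, for illustration only.
triple = qualifier_to_triple(
    "urn:example:bioentry:1", DC_TERMS, "alternative", "example name"
)
print(triple.predicate)
```

A real implementation would more likely build on an RDF library; plain tuples are used here only to keep the sketch self-contained.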
Thus, a good place to look for ontologies is in standards intended for RDF. Greg Tyrelle had the same idea a while ago and came up with an XSLT to transform GenBank XML to RDF, using primarily the Dublin Core vocabulary. On the biology side, the Sequence Ontology (SO) project provides an ontology meant for describing biological sequences. This includes a mapping to GenBank feature table names.
Using these as a starting point, I generated a mapping of GenBank names to names in the Dublin Core and SO ontologies. This is meant as a basis for standardizing and documenting naming in BioSQL. The mapping file thus far covers almost all of the header and feature keys, and more than half of the qualifier keys:
- Tab delimited mapping file
- All of the Python code that does the mapping and pulls information from associated files is available: github repository.
I would welcome suggestions for missing GenBank terms, as well as corrections on the terms mapped by hand.
Some notes on the mapping:
- Cross references to other identifiers are mapped with the Dublin Core term 'relation'. These can occur in many places in the GenBank format. Using a single term allows them to be flattened, with mapping values in the form 'database:identifier'. This is consistent with the GenBank /db_xref qualifier.
- Multiple names or descriptions of an item, also stored in multiple places in GenBank files, receive the Dublin Core term 'alternative'.
- Organism and taxonomy ontologies are a whole project unto themselves, so I didn't try to tackle them here.
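To make the first note concrete, the flattening of cross references could be sketched as below. The function names `flatten_xrefs` and `split_xref` are hypothetical and are not taken from the mapping code in the repository. The sample values simply follow the 'database:identifier' shape of a /db_xref qualifier.

```python
def flatten_xrefs(xrefs):
    """Turn (database, identifier) pairs into 'relation' key/value pairs.

    Every cross reference, wherever it was found in the record, ends up
    under the single Dublin Core term 'relation' with a value of the
    form 'database:identifier'.
    """
    return [("relation", f"{db}:{identifier}") for db, identifier in xrefs]


def split_xref(value):
    """Split a 'database:identifier' value back into its two parts.

    Only the first colon is treated as the separator, so identifiers
    that themselves contain colons are left intact.
    """
    db, _, identifier = value.partition(":")
    return db, identifier


# Illustrative cross references, as they might be collected from
# different parts of a record.
pairs = flatten_xrefs([("taxon", "9606"), ("GO", "GO:0005515")])
for term, value in pairs:
    print(term, value, split_xref(value))
```

Splitting on the first colon only is a design choice for this sketch: it keeps identifiers such as Gene Ontology accessions, which carry their own prefix, in one piece.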
Some other useful links for biological ontology mapping:
- GoPubMed, which uses the Dublin Core and Prism Standard vocabularies: Example record
- UniProt RDF, which uses primarily its own ontology: Example record
- GenBank feature table