Extracting keywords from biological text using Zemanta

By Brad Chapman

Increasingly, my daily work is shifting from a model of "let me do this analysis for you and give you back some data" to "let me provide an interface that allows you to do the analyses yourself." This is great as it allows closer collaboration with lab scientists, and also helps split up work so I can be involved in more projects. One common interface suggestion is a Google-like keyword search box: enter some text and find anything related to that term. In implementing this, I wanted to provide reasonable search suggestions by identifying keywords from gene descriptions. These can help frame researchers' questions, and provide clues about useful search terms for new users.

Here is an implementation of keyword extraction from biological text using the Zemanta semantic API. The function sends the text to the Zemanta REST interface and parses the JSON output with simplejson, which makes the result available as a Python dictionary.

[sourcecode language="python"]
import urllib
import urllib2

import simplejson

def query_zemanta(search_text):
    # Build the request arguments for the Zemanta suggest method.
    gateway = 'http://api.zemanta.com/services/rest/0.0/'
    args = {'method': 'zemanta.suggest',
            'api_key': 'YOUR_API_KEY',
            'text': search_text,
            'return_categories': 'dmoz',
            'return_images': 0,
            'return_rdf_links': 1,
            'format': 'json'}
    args_enc = urllib.urlencode(args)

    # Passing the encoded arguments as data sends them as a POST request.
    raw_output = urllib2.urlopen(gateway, args_enc).read()
    output = simplejson.loads(raw_output)

    print 'First article:', output['articles'][0]
    print 'Keywords:', [k['name'] for k in output['keywords']]
    # Show each linked phrase with its Wikipedia and RDF targets.
    for link in output['markup']['links']:
        print link['anchor']
        for target in link['target']:
            if target['type'] in ['wikipedia', 'rdf']:
                print '\t', target['title'], target['url']
    print 'Marked up text', output['markup']['text']
[/sourcecode]

Zemanta returns both keywords and links to online resources. Here is the keyword list for the test text:

Keywords: [u'Drosophila', u'Biology', u'Messenger RNA', u'Germ cell',
           u'Mitogen-activated protein kinase', u'Oocyte', 
           u'C-Jun N-terminal kinases', u'RNA']
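
To connect keyword lists like this back to the search box, here is a minimal sketch of one way to turn them into suggestions. It assumes the parsed responses for many gene descriptions have already been collected, and relies only on the `keywords` and `name` fields used above. The helper names (`build_suggestions`, `suggest`) and the prefix-matching approach are my own illustration rather than anything Zemanta provides, and the sample responses are hand-written stand-ins, not real API output.

```python
from collections import Counter


def keyword_names(output):
    """Return the keyword names from one parsed response dictionary."""
    return [k['name'] for k in output.get('keywords', [])]


def build_suggestions(responses):
    """Count how many descriptions each keyword appears in."""
    counts = Counter()
    for output in responses:
        # A set avoids counting a keyword twice for the same description.
        counts.update(set(keyword_names(output)))
    return counts


def suggest(counts, prefix, limit=5):
    """Return up to `limit` keywords starting with `prefix`, most common first."""
    prefix = prefix.lower()
    matches = [(name, n) for name, n in counts.items()
               if name.lower().startswith(prefix)]
    # Sort by descending count, then alphabetically so ties are stable.
    matches.sort(key=lambda item: (-item[1], item[0]))
    return [name for name, _ in matches[:limit]]


# Hand-written stand-ins for parsed responses; not real API output.
responses = [
    {'keywords': [{'name': 'Drosophila'}, {'name': 'Oocyte'}, {'name': 'RNA'}]},
    {'keywords': [{'name': 'RNA'}, {'name': 'Germ cell'}]},
]
counts = build_suggestions(responses)
print(suggest(counts, 'r'))
```

Ranking by the number of descriptions a keyword appears in favors terms that will actually return several genes; a real interface would probably also want to match inside multi-word keywords rather than only at the start.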

And here is the original testing text marked up with the automated links:

glh-2 encodes a putative DEAD-box RNA helicase that contains six CCHC zinc fingers and is homologous to Drosophila VASA, a germ-line-specific, ATP-dependent, RNA helicase; GLH-2 activity may also be required for the wild-type morphology of P granules and for localization of several protein components, but not accumulation of P granule mRNA components; GLH-2 interacts in vitro with itself and with KGB-1, a JNK-like MAP kinase; GLH-2 is a constitutive P granule component and thus, with the exception of mature sperm, is expressed in germ cells at all stages of development; GLH-2 is cytoplasmic in oocytes and the early embryo, while perinuclear in all later developmental stages as well as in the distal and medial regions of the hermaphrodite gonad; GLH-2 is expressed at barely detectable levels in males.

In addition to the keywords, Zemanta does an excellent job of automatically annotating the text with links to relevant resources:

  • Most impressively, the JNK acronym is correctly resolved to C-Jun N-terminal kinases, with a link to the Wikipedia article.
  • Zemanta also provides links to Freebase, an open database in RDF-ready format. One example is this useful link to Drosophila from which you could automatically extract NCBI Taxon IDs.
  • The one automated semantic mistake is KGB, which is linked to the KGB headquarters in Russia.
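
Since a miss like the KGB one is easy to spot by eye, one option is to collect the links into a simple lookup table and drop the anchors a curator rejects. The sketch below assumes the same response structure as the function above (`markup`, `links`, `anchor`, `target`, `type`, `title`, `url`). The function names, the `homepage` target type, and the example.org URLs are placeholders I made up for illustration; the sample dictionary is not real API output.

```python
def collect_links(output, wanted_types=('wikipedia', 'rdf')):
    """Map each anchor text to its (type, title, url) targets of the wanted types."""
    links = {}
    for link in output.get('markup', {}).get('links', []):
        targets = [(t['type'], t['title'], t['url'])
                   for t in link.get('target', [])
                   if t['type'] in wanted_types]
        if targets:
            links.setdefault(link['anchor'], []).extend(targets)
    return links


def drop_anchors(links, rejected):
    """Remove anchors that have been flagged as wrongly linked."""
    return dict((anchor, targets) for anchor, targets in links.items()
                if anchor not in rejected)


# Hand-written sample mirroring the response structure; URLs are placeholders.
sample = {'markup': {'links': [
    {'anchor': 'JNK', 'target': [
        {'type': 'wikipedia', 'title': 'C-Jun N-terminal kinases',
         'url': 'http://example.org/jnk'},
        {'type': 'homepage', 'title': 'Ignored',
         'url': 'http://example.org/other'}]},
    {'anchor': 'KGB', 'target': [
        {'type': 'wikipedia', 'title': 'KGB',
         'url': 'http://example.org/kgb'}]},
]}}

links = collect_links(sample)
cleaned = drop_anchors(links, ['KGB'])
print(sorted(cleaned))
```

Keeping the rejected anchors as an explicit list means the manual corrections stay separate from the automated output and can be reapplied when the text is re-annotated.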

Zemanta also supports links to NCBI in their latest release (as a funny semantic miscue, the blog link to NCBI is to the wrong NCBI). I did not get any NCBI links with this small example, but it is exciting to see they are thinking about scientific applications.

In addition to Zemanta, I also tested OpenCalais with the Python interface, which was not as successful. For the same text above, it returned only a single keyword: the incorrect KGB one. Their current focus appears to be on finance, but it is worth watching for future developments.
