Automated retrieval of expression data with Python and R

Brad Chapman bio photo By Brad Chapman Comment

In February, I'll be headed to Japan for BioHackathon 2010, which focuses on integrating biological datasets. In preparation, I've been thinking about use cases: when working with a set of data, what other information do you want to able to readily retrieve that will help you better answer your questions? One common set of data is gene expression information; you identify a group of genes of interest and are curious about their expression levels.

In an ideal semantic world, you would be able to build a query through a set of RDF triples like:

  • Expression set -> has organism -> Mus musculus
  • Mus musculus -> has cell type -> proB
  • proB -> has biological state -> wild type
  • Expression set -> has identifier -> A RefGene ID
  • A RefGene ID -> has expression value -> 7.65

So you could find expression sets unique to your cell type under standard conditions, and then identify the expression values of your genes. Ideally, expression sets would also have additional meta-data associated with them so you'd know what 7.65 meant in the context of the experiment.

Although we can't do this query through semantic channels, there are a number of toolkits that help us answer the question in an automated way. NCBI's Gene Expression Omnibus (GEO) provides an online resource hosting expression data. You can query the data sets with EUtils web API using Biopython. Once a data set of interest is identified, Bioconductor's GEOquery can retrieve the data sets and mappings to UCSC RefGene identifiers. All of this is tied together with the Rpy2 python interface to R.

At the top level, we define our goal, which is to retrieve expression data for mouse proB wild type cells:

[sourcecode language="python"]
def main():
organism = "Mus musculus"
cell_types = ["proB", "ProB", "pro-B"]
email = ""
save_dir = os.getcwd()
exp_data = get_geo_data(organism, cell_types, email, save_dir,

def _is_wild_type(result):
"""Check if a sample is wild type from the title.
return result.samples[0][0].startswith("WT")

We query GEO for the possible data sets based on our organism and cell type, and then search through the results for a dataset that matches our criteria. In real life work, your personal _is_wild_type function may be an interactive dialog that presents the biologist with possible experiments, allowing them to select the one of interest. Note also that we store our final result locally, as a Python pickle. Practically, this reduces our load on external servers by only querying them once:

[sourcecode language="python"]
def get_geo_data(organism, cell_types, email, save_dir, is_desired_result):
save_file = os.path.join(save_dir, "%s-results.pkl" % cell_types[0])
if not os.path.exists(save_file):
results = cell_type_gsms(organism, cell_types, email)
for result in results:
if is_desired_result(result):
with open(save_file, "w") as out_handle:
cPickle.dump(result, out_handle)

with open(save_file) as save_handle:
result = cPickle.load(save_handle)

The hard work of querying GEO and retrieving the results is done by Biopython's Entrez interface. After building up the query, the results are parsed into simple objects with a description of the expression set along with titles and identifiers for each of the samples that match our cell type:

[sourcecode language="python"]
def cell_type_gsms(organism, cell_types, email):
"""Use Entrez to retrieve GEO entries for an organism and cell type.
""" = email
search_term = "%s[ORGN] %s" % (organism, " OR ".join(cell_types))
print "Searching GEO and retrieving results: %s" % search_term

hits = []
handle = Entrez.esearch(db="gds", term=search_term)
results =
for geo_id in results['IdList']:
handle = Entrez.esummary(db="gds", id=geo_id)
summary =
samples = []
for sample in summary[0]['Samples']:
for cell_type in cell_types:
if sample['Title'].find(cell_type) >= 0:
samples.append((sample['Title'], sample['Accession']))
if len(samples) > 0:
hits.append(GEOResult(summary[0]['summary'], samples))
return hits

Once the experiment is selected we've accomplished the first part of our goal. Next, the plan is to get expression data values. We do this by building up a dictionary that maps RefGene IDs to expression values, so the results look like:

[sourcecode language="python"]
WT ProB, biological rep1
[('NM_177327', [7.7430266269999999, 6.4795213670000003, 8.8766985500000004]),
('NM_001008785', [7.4671954649999996, 5.4761453329999998]),
('NM_177325', [7.3312364040000002, 11.831475960000001]),
('NM_177323', [6.9779868059999997, 6.3926399939999996]),
('NM_177322', [5.0833683379999997])]

The actual hard work is retrieving the expression data and mappings to gene identifiers, and this is done using the GEOquery library in Bioconductor. Mapping information for expression values, and high level meta-data about the experiment, are stored in local files. The expression information is stored as a R style tab delimited data table, while the meta-data is stored in JSON format. Both sets of data are then present locally in readily readable formats for further processing:

[sourcecode language="python"]
def _write_gsm_map(self, gsm_id, meta_file, table_file):
"""Retrieve GEO expression values using Bioconductor, saving to a table.
robjects.r.assign("", gsm_id)
robjects.r.assign("table.file", table_file)
robjects.r.assign("meta.file", meta_file)
gsm <- getGEO(
write.table(Table(gsm), file = table.file, sep = "\t", row.names = FALSE,
col.names = TRUE)
cat(toJSON(Meta(gsm)), file = meta.file)

The previous function provides a mapping between probe names and expression levels. To be useful, we also need to map each of the probes on the expression array to biological gene identifiers. This is done through a second call to the GEO library to retrieve details for the expression platform. Again the data is presented in a R data frame, which we limit to only the items of interest and save as a tab delimited file:

[sourcecode language="python"]
def _write_gpl_map(self, gpl_id, gpl_file):
"""Retrieve GEO platform data using R and save to a table.
robjects.r.assign("", gpl_id)
robjects.r.assign("gpl.file", gpl_file)
gpl <- getGEO( <- subset(Table(gpl), select=c("ID", "RefSeq.Transcript.ID"))
write.table(, file = gpl.file, sep = "\t", row.names = FALSE,
col.names = TRUE)

Putting this all together, we download each of the mapping files and then parse them, building the connection: expression to probe ID to gene transcript IDs. The final dictionary maps these gene identifiers to the expression values and returns them:

[sourcecode language="python"]
def get_gsm_tx_values(self, gsm_id, save_dir):
"""Retrieve a map of transcripts to expression from a GEO GSM file.
gsm_meta_file = os.path.join(save_dir, "%s-meta.txt" % gsm_id)
gsm_table_file = os.path.join(save_dir, "%s-table.txt" % gsm_id)
if (not os.path.exists(gsm_meta_file) or
not os.path.exists(gsm_table_file)):
self._write_gsm_map(gsm_id, gsm_meta_file, gsm_table_file)

with open(gsm_meta_file) as in_handle:
gsm_meta = json.load(in_handle)
id_to_tx = self.get_gpl_map(gsm_meta['platform_id'], save_dir)
tx_to_vals = collections.defaultdict(list)
with open(gsm_table_file) as in_handle:
reader = csv.reader(in_handle, dialect='excel-tab') # header
for probe_id, probe_val in reader:
for tx_id in id_to_tx.get(probe_id, []):
return tx_to_vals

The full script puts all of these parts together into a working example. In the end we go from our query to the expression data we need, although the route is a bit more circuitous then the idealized one described at the start of the post. Simplifying this type of question based approach to data access is one challenge bioinformatics tool developers will continue to face.

comments powered by Disqus