Organization of literature using PubMed related articles

By Brad Chapman

When dealing with a long list of journal articles, what is the best method to organize them? I was confronted with this problem in designing an interface where users would pick specific papers and retrieve results tied to them. Presenting them as the raw list was unsatisfying; it is fine for users who know exactly what articles they want, but naive users would have a lot of difficulty finding relevant articles. Even for power users, a better classification system could help reveal items they may not have known about.

The approach I took was to group papers together based on similarity. The NCBI PubMed literature database has links to related articles, which it exposes programmatically through EUtils. Using the Biopython Entrez interface, the first step is to retrieve a dictionary of related IDs for each starting article, ordered by relevance:

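[sourcecode language="python"]
from Bio import Entrez

def _get_elink_related_ids(self, pmids):
    """Map each PMID to its related PMIDs from the input list, best first."""
    pmid_related = {}
    for pmid in pmids:
        handle = Entrez.elink(dbfrom='pubmed', db='pubmed', id=pmid)
        record = Entrez.read(handle)
        cur_ids = []
        for link_dict in record[0]['LinkSetDb'][0]['Link']:
            cur_ids.append((int(link_dict.get('Score', 0)),
                            link_dict['Id']))
        # Highest relevance score first.
        cur_ids.sort(reverse=True)
        # Keep only articles that are part of the list being organized.
        local_ids = [x[1] for x in cur_ids if x[1] in pmids]
        if pmid in local_ids:
            local_ids.remove(pmid)
        pmid_related[pmid] = local_ids
    return pmid_related
[/sourcecode]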
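
To see the ranking step without a network call, here is a minimal sketch that applies the same score sort to a hand-written record shaped like parsed ELink output. The `rank_related` helper, the `fake_record` contents, and the PMIDs are all hypothetical and exist only for illustration; real records come from `Entrez.read`.

```python
def rank_related(record, pmid, known_pmids):
    """Return linked IDs that are in known_pmids, highest score first."""
    links = record[0]["LinkSetDb"][0]["Link"]
    # Missing scores default to 0, as in the function above.
    scored = [(int(link.get("Score", 0)), link["Id"]) for link in links]
    scored.sort(reverse=True)
    # Drop the query article itself and anything outside our list.
    return [lid for _, lid in scored if lid in known_pmids and lid != pmid]


# Hypothetical stand-in for a parsed ELink record (made-up IDs and scores).
fake_record = [{"LinkSetDb": [{"Link": [
    {"Id": "100", "Score": "900"},  # the query article itself
    {"Id": "200", "Score": "350"},
    {"Id": "300", "Score": "720"},
    {"Id": "999", "Score": "800"},  # not in our list, so it is dropped
]}]}]

related = rank_related(fake_record, "100", {"100", "200", "300"})
print(related)  # ['300', '200']
```

Note that when a link carries no score, every entry ties at 0 and the sort no longer reflects relevance, so it is worth checking that the records you get back actually include scores.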

Trying to group directly based on this dictionary will often result in one large group, since many of the articles may be linked together through a few common articles. For instance, a review may be related to several other papers in non-overlapping areas. To make the results as useful as possible we define a maximum and minimum group size, and two parameters to filter the related lists:

  • overrep_thresh: The maximum fraction of all papers being grouped that a single item may be related to. For instance, a value of .25 means an article is kept as a related item only if it appears in the related lists of 25% or fewer of the total papers.
  • related_max: The maximum number of related papers to use for each article. Only the top-ranked related articles go into the grouping.

These parameters define a filter for our dictionary of related articles:

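[sourcecode language="python"]
import operator
from functools import reduce

def _filter_related(self, initial_dict, overrep_thresh, related_max):
    """Drop over-represented related IDs, then keep the top related_max."""
    final_dict = {}
    # Flatten all related lists so each ID's occurrences can be counted.
    all_vals = reduce(operator.add, initial_dict.values(), [])
    for item_id, item_vals in initial_dict.items():
        final_vals = [val for val in item_vals if
            float(all_vals.count(val)) / len(initial_dict) <= overrep_thresh]
        final_dict[item_id] = final_vals[:related_max]
    return final_dict
[/sourcecode]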
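
A small worked example may make the two filter parameters more concrete. This is a standalone sketch rather than the class method: `filter_related` is a hypothetical free function, it counts with `collections.Counter` instead of `list.count` (same ratio, fewer passes), and the toy data is invented, with "R" standing in for a review that everything links to.

```python
from collections import Counter


def filter_related(related, overrep_thresh, related_max):
    """Drop over-represented IDs, then keep the top related_max per article."""
    counts = Counter(val for vals in related.values() for val in vals)
    total = len(related)
    return {
        pmid: [v for v in vals if counts[v] / total <= overrep_thresh][:related_max]
        for pmid, vals in related.items()
    }


# Hypothetical toy data: "R" appears in 4 of the 5 related lists.
toy = {
    "A": ["R", "B"],
    "B": ["R", "A"],
    "C": ["R", "D"],
    "D": ["R", "C"],
    "R": ["A", "C"],
}

filtered = filter_related(toy, overrep_thresh=0.5, related_max=1)
print(filtered)
# {'A': ['B'], 'B': ['A'], 'C': ['D'], 'D': ['C'], 'R': ['A']}
```

Here "R" is related to 4/5 = 80% of the papers, above the 50% threshold, so it is removed from every list; `related_max=1` then trims the list for "R" itself down to its single best match.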

The filtered list is grouped using a generalized version of the examine_paralogs function used in an earlier post to group together location and duplication information. Sets combine any groups with overlapping articles:

[sourcecode language="python"]
def _groups_from_related_dict(self, related_dict):
    """Build groups of articles, merging any groups that share members."""
    cur_groups = []
    all_base = related_dict.keys()
    for base_id, cur_ids in related_dict.items():
        # Only consider related articles that are themselves being grouped.
        overlap = set(cur_ids) & set(all_base)
        if len(overlap) > 0:
            new_group = set(overlap | set([base_id]))
            is_unique = True
            for exist_i, exist_group in enumerate(cur_groups):
                if len(new_group & exist_group) > 0:
                    update_group = new_group | exist_group
                    cur_groups[exist_i] = update_group
                    is_unique = False
            if is_unique:
                cur_groups.append(new_group)
    return [list(g) for g in cur_groups]
[/sourcecode]
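
The set-merging step can be tried in isolation on a filtered dictionary. The sketch below is a standalone variant written for illustration: `groups_from_related` is a hypothetical free function, the data is invented, and it sorts each group for stable output, which the method above does not do.

```python
def groups_from_related(related):
    """Join each article with its in-list neighbours, merging overlapping sets."""
    groups = []
    for base_id, rel_ids in related.items():
        overlap = set(rel_ids) & set(related)
        if not overlap:
            continue  # nothing related within the list: no group this pass
        new_group = overlap | {base_id}
        merged = False
        for i, existing in enumerate(groups):
            if new_group & existing:
                groups[i] = existing | new_group
                merged = True
        if not merged:
            groups.append(new_group)
    return [sorted(g) for g in groups]


# Hypothetical filtered dictionary: two tight pairs and one unlinked article.
toy = {"A": ["B"], "B": ["A"], "C": ["D"], "D": ["C"], "E": []}

result = groups_from_related(toy)
print(result)  # [['A', 'B'], ['C', 'D']]
```

"E" has no related articles left after filtering, so it is not placed in any group here and remains available for a later, less stringent round.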

From this list, we extract the groups that satisfy our size criteria, the minimum and maximum group size, and remove their articles from further consideration:

[sourcecode language="python"]
def _collect_new_groups(self, pmid_related, groups):
    """Accept groups within the size limits; return them and what remains."""
    final_groups = []
    for group_items in groups:
        final_items = [i for i in group_items if i in pmid_related]
        if (len(final_items) >= self._min_group and
                len(final_items) <= self._max_group):
            final_groups.append(final_items)
            # Grouped articles are removed so they are not grouped again.
            for item in final_items:
                del pmid_related[item]
    final_related_dict = {}
    for pmid, related in pmid_related.items():
        final_related = [r for r in related if r in pmid_related]
        final_related_dict[pmid] = final_related
    return final_groups, final_related_dict
[/sourcecode]
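
As an illustration of the size check and the pruning that follows it, here is a standalone sketch on invented data. `collect_new_groups` is a hypothetical free function that takes the size limits as arguments, and unlike the method above it works on a copy of the dictionary instead of deleting from the caller's copy.

```python
def collect_new_groups(related, groups, min_group, max_group):
    """Accept groups within the size limits and drop their members from related."""
    remaining = dict(related)  # work on a copy; the input is left untouched
    accepted = []
    for group in groups:
        members = [g for g in group if g in remaining]
        if min_group <= len(members) <= max_group:
            accepted.append(members)
            for member in members:
                del remaining[member]
    # Related lists may only point at articles that are still ungrouped.
    pruned = {p: [r for r in rel if r in remaining]
              for p, rel in remaining.items()}
    return accepted, pruned


# Hypothetical inputs: one group of acceptable size and one that is too small.
related = {"A": ["B"], "B": ["A", "C"], "C": ["B"], "D": ["A"], "E": []}
groups = [["A", "B", "C"], ["D"]]

accepted, left = collect_new_groups(related, groups, min_group=2, max_group=4)
print(accepted)  # [['A', 'B', 'C']]
print(left)      # {'D': [], 'E': []}
```

The single-article group is rejected, and because "A" has already been placed, the link from "D" to "A" is pruned from the remaining dictionary.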

Using these functions, the main algorithm steps through a series of progressively less stringent parameters, picking out groups that fall within our size thresholds. Closely related journal articles are grouped first; more general papers with weaker associations are placed in groups in later rounds:

[sourcecode language="python"]
def get_pmid_groups(self, pmids):
    """Group PMIDs, relaxing the filter parameters until all are placed."""
    pmid_related = self._get_elink_related_ids(pmids)
    filter_params = self._filter_params[:]
    final_groups = []
    while len(pmid_related) > 0:
        if len(filter_params) == 0:
            raise ValueError("Ran out of parameters before finding groups")
        cur_thresh, cur_related = filter_params.pop(0)
        # Reuse the current parameters until they yield no new groups.
        while 1:
            filt_related = self._filter_related(pmid_related, cur_thresh,
                                                cur_related)
            groups = self._groups_from_related_dict(filt_related)
            new_groups, pmid_related = self._collect_new_groups(
                pmid_related, groups)
            final_groups.extend(new_groups)
            if len(new_groups) == 0:
                break
        # A small set of leftover articles becomes one final group.
        if 0 < len(pmid_related) < self._max_group:
            final_groups.append(list(pmid_related.keys()))
            pmid_related = {}
    return final_groups
[/sourcecode]
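
To watch the staged relaxation happen, the following self-contained sketch runs the same loop on an invented related-articles dictionary, so no Entrez call is needed. Everything here is an assumption made for illustration: the compact helper functions, the toy data with a hub article "H", and the two parameter stages. It is not the class from the repository.

```python
from collections import Counter


def filter_related(related, thresh, related_max):
    """Drop over-represented IDs and keep the top related_max per article."""
    counts = Counter(v for vals in related.values() for v in vals)
    return {p: [v for v in vals if counts[v] / len(related) <= thresh][:related_max]
            for p, vals in related.items()}


def build_groups(related):
    """Merge each article with its in-list neighbours into overlapping sets."""
    groups = []
    for base, vals in related.items():
        new = (set(vals) & set(related)) | {base}
        if len(new) == 1:
            continue  # no related articles inside the list
        hits = [g for g in groups if g & new]
        for g in hits:
            g |= new
        if not hits:
            groups.append(new)
    return groups


def group_pmids(related, params, min_group, max_group):
    """Step through (thresh, related_max) stages, strictest first."""
    related, params, final = dict(related), list(params), []
    while related:
        if not params:
            raise ValueError("Ran out of parameters before finding groups")
        thresh, related_max = params.pop(0)
        while True:
            accepted = []
            for g in build_groups(filter_related(related, thresh, related_max)):
                members = sorted(m for m in g if m in related)
                if min_group <= len(members) <= max_group:
                    accepted.append(members)
                    for m in members:
                        del related[m]
            related = {p: [r for r in v if r in related] for p, v in related.items()}
            final.extend(accepted)
            if not accepted:
                break
        if 0 < len(related) < max_group:
            final.append(sorted(related))  # leftovers form one last group
            related = {}
    return final


# Hypothetical data: two tight pairs, plus X and Y linked only through hub H.
toy = {"A": ["B", "H"], "B": ["A", "H"], "C": ["D", "H"], "D": ["C", "H"],
       "H": [], "X": ["H"], "Y": ["H"]}

grouped = group_pmids(toy, params=[(0.5, 1), (1.0, 2)], min_group=2, max_group=3)
print(grouped)  # [['A', 'B'], ['C', 'D'], ['H', 'X', 'Y']]
```

In the strict first stage "H" is over-represented and filtered out, so only the two pairs are grouped. Once those are removed, the second stage allows links to "H" and the remaining three articles come together.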

The full code wrapped up into a class is available from the GitHub repository.

This is one approach to automatically grouping a large list of literature to make interesting items more discoverable. With the work being done on full text indexing, the data underlying resources such as iHOP might be used to do these groupings even more effectively using similar algorithms.
