# Finding and displaying short reads clustered in the genome

Short read, or next-generation, sequencing is becoming increasingly prevalent in the day-to-day lives of bioinformatics researchers. Programs such as Bowtie, CloudBurst, Maq, and Arachne are available to align short reads to a reference genome. Once you have these alignments, the data analysis work begins. During a recent project, I was interested in looking at overlapping read groups across a genome; this post describes a method for collecting and visualizing those groups.

Starting with a file from an alignment program, the first step is to write a Python generator that returns information about the alignments. Since there are many different aligners, below is some Python pseudocode which describes the function:

[sourcecode language="python"]
def parse_alignment(align_file, errors_allowed=0):
with open(align_file) as align_handle:
for (read_id, match_id, match_start, match_end, errors) in \
if (errors <= errors_allowed):
[/sourcecode]

It parses a line, checks if it has an acceptable number of alignment errors, and yields a unique id for the read, an id for the match like a chromosome or contig name, and the start and end coordinates of the match. We can use this generator to build our representation of overlapping reads with the help of the excellent bx-python library. bx-python contains a Python and C implementation of clustered interval trees. The function below uses the interface to generate ClusterTree objects for each chromosome/contig in your alignment:

[sourcecode language="python"]
import collections
from bx.intervals.cluster import ClusterTree

def build_cluster_trees(align_generator, cluster_distance):
# arguments to ClusterTree are:
# - Distance in basepairs for two reads to be in the same cluster;
# for instance 20 would group all reads with 20bp of each other
# - Number of reads necessary for a group to be considered a cluster;
# 2 returns all groups with 2 or more overlapping reads
cluster_trees = collections.defaultdict(lambda:
ClusterTree(cluster_distance, 2))
for read_id, match_id, start, end in align_generator:
return dict(cluster_trees)
[/sourcecode]

The generated trees will readily provide a list of start and end coordinates for a clustered region, along with all of the reads that map to that region:

[sourcecode language="python"]
for chrom, cluster_tree in cluster_trees.items():
for start, end, read_ids in cluster_tree.getregions():
[/sourcecode]

My interest was in visualizing the reads in these clusters along with the frequency of each read. To do so, I generated two python dictionaries which map the read_id identifiers, used in the grouping, to the chromosome location of the read (location_db) and the number of times a read was found in the raw short read results (freq_db). Practically, these are implemented as BerkeleyDB key value stores, to avoid having to keep all of the read information in memory.

With this information, we can use matplotlib to visualize the reads. Frequencies are represented using a color map. Each read in a group is plotted using horizontal bar graphs:

[sourcecode language="python"]
import pylab

min_freq = min(l[2] for l in read_info)
max_freq = max(l[2] for l in read_info)
freq_cmap = pylab.get_cmap('Blues')
pylab.clf()
for rindex, (start, end, freq) in enumerate(read_info):
freq_percent = float(freq - min_freq) / float(max_freq - min_freq)
color = freq_cmap(freq_percent)
if freq in [min_freq, max_freq]:
label = "Frequency: %s" % (freq)
else:
label = None
if label and label not in labels_done:
labels_done.append(label)
else:
label = None
pylab.barh(rindex, end - start, left=start, height=0.8, color=color,
label=label)
pylab.title(cluster_name)
pylab.xlabel("Coordinates")