One common post processing step for next generation, or short read, sequencing is to remove adaptor sequences. Sample preparation for sequencing often involves selection of a particular fraction of interest like expressed genes, microRNAs, or copy number variant regions. Selecting, amplifying, and sequencing these regions can result in read sequences containing a common tag that needs to be identified and removed before continuing with other analysis steps like alignment to the genome. This is often the case when the reads of interest are small like in short RNAs or small sequence tags.

We start with a short known sequence (the adaptor) that will be present in our sequences of interest. For each read, we want to determine the position of this adaptor and remove it. Additionally, reads may contain the adaptor but differ by one or more bases; these differences are likely due to sequencing errors. So our solution needs to allow some fuzzy matching of the adaptor to the read sequences.

My first attempt involved using fuzzy string matching. In Python, several libraries exist to tackle this problem. The normal use case is in identifying misspelled words in text. For instance, you can calculate Levenshtein (edit) distance or use difflib in the standard library; Stack Overflow has a nice discussion of the different modules. Ultimately this approach proved to be a dead end; the properties that make English words similar do not overlap well with those differentiate DNA sequences.

That realization led me to tackle the problem by computing local alignments of the adaptor to each sequence. This effectively captures the biological intuition you need to trim a sequence with less than a specified number of errors. The downside to this rigorous approach is that it will be slower than purely string based methods.

The Biopython library contains a Python local alignment function suitable for quick alignment of short regions. We use this to align an adaptor region to a sequence and calculate the number of differences in the aligned region. To handle a common case where we can find the exact adaptor sequence, we first do an string match. This avoids the alignment overhead in many cases:

[sourcecode language="python"]
from Bio import pairwise2

gap_char = '-'
if exact_pos >= 0:
else:
seq_a, adaptor_a, score, start, end = pairwise2.align.localms(str(seq),
one_alignment_only=True, gap_char=gap_char)[0]
seq_region = seq_a[start:end]
matches = sum((1 if s == adapt_region[i] else 0) for i, s in
enumerate(seq_region))
# too many errors -- no trimming
if (len(adaptor) - matches) > num_errors:
return seq
# remove the adaptor sequence and return the result
else:
right_side)
[/sourcecode]

If the alignment contains fewer than the specified number of errors we've found an adaptor and remove it, returning the trimmed sequence. The attribute right_side allows us to specify whether the trimming should be done on the right (3' end) or the left (5' end):

[sourcecode language="python"]
"""Remove an adaptor region and all sequence to the right or left.
"""
if right_side:
try:
pos = seq.find(region)
# handle Biopython SeqRecords
except AttributeError:
pos = seq.seq.find(region)
return seq[:pos]
else:
try:
pos = seq.rfind(region)
# handle Biopython SeqRecords
except AttributeError:
pos = seq.seq.rfind(region)
return seq[pos+len(region):]
[/sourcecode]

Here is an example script using this function along with Biopython to trim reads from a FASTQ format file.Each read is analyzed to determine if it contains the adaptor, with up to 2 errors. If the adaptor is identified and trimmed, the final useful read is written to a FASTA format output file.

[sourcecode language="python"]
from __future__ import with_statement
import sys
import os

from Bio import SeqIO

def main(in_file):
num_errors = 2
base, ext = os.path.splitext(in_file)
out_file = "%s-trimmed.fa" % (base)

with open(in_file) as in_handle:
with open(out_file, "w") as out_handle:
for rec in SeqIO.parse(in_handle, "fastq"):