# Summarizing next-gen sequencing variation statistics with Hadoop using Cascalog

Improvements in next-generation sequencing technology are leading to ever increasing amounts of sequencing data. With this additional throughput comes the demand for algorithms and approaches that can easily scale. Hadoop offers an open source framework for batch processing large files. This post describes using Cascalog, a Hadoop query language written in Clojure, to investigate quality statistics for variant calling in deeply sequenced regions.

### Biological question

The goal is to improve a variation calling algorithm for next-generation sequencing data. We have a densely sequenced region, where each position has thousands of potential base calls. Each position may be a single base, or a mix of of majority and minority variants. We are filtering variants on 3 metrics of quality:

• Quality score -- The sequencing technology's assessment of the correctness of a base.
• K-mer score -- An estimate of the uniqueness of the region surrounding the base call position, built using khmer. Unique regions are more likely to be sequencing artifacts, while common regions are more likely to be real.
• Mapping score -- The aligner's estimate of the reliability of the read alignment.

Each read and position is in a tab delimited file that looks like:

951     G       31      0.0515130211584 198

The training data has a set of known variable positions, and details about how the current variant calling algorithm did at each position:

951     T       false_positive  0.7
953     A       true_positive   4.0


We wanted to generate summary statistics at each position of interest, and look for additional criteria that could be built into the calling algorithm.

### Writing cascalog queries

Cascalog is based on the Datalog rule language, a subset of Prolog. You describe the rules of a system and the query optimizer figures out how best to satisfy them; it requires a change of mindset from the more standard approach that you need to write detailed instructions about what to do.

Cascalog provides a high level of abstraction over Hadoop and Map-Reduce, so you focus entirely on writing the query. This post from Antonio Piccolboni compares several Hadoop languages; the post provides a nice side-by-side example of the brevity you can achieve with Cascalog.

The main query defines the outputs, retrieves input data from the snpdata and location target files described above, provides a count of reads at each position and base of interest, then averages the kmer, quality and mapping score metrics described earlier:

[sourcecode language="clojure"]
(defn calc-snpdata-stats [snpdata targets]
(??<- [?chr ?pos ?base ?count ?avg-score ?avg-kmer-pct
?avg-qual ?avg-map ?type]
(snpdata ?chr ?pos ?base ?qual ?kmer-pct ?map-score)
(targets ?chr ?pos ?base ?type)
(ops/count ?count)
(ops/avg ?kmer-pct :> ?avg-kmer-pct)
(ops/avg ?qual :> ?avg-qual)
(ops/avg ?map-score :> ?avg-map)
(combine-score ?kmer-pct ?qual ?map-score :> ?score)
(ops/avg ?score :> ?avg-score)))
[/sourcecode]

A big advantage of Cascalog is that it is just Clojure, so you can write custom queries in a full-featured language. The last two lines of the query define a custom score and its average at a position. The custom score is a linear combination of the min-max normalized scores:

[sourcecode language="clojure"]
(defn min-max-norm [score minv maxv]
(let [trunc-score-max (if (< score maxv) score maxv)
trunc-score (if (> trunc-score-max minv) trunc-score-max minv)]
(/ (- trunc-score minv) (- maxv minv))))

(defmapop combine-score [kmer-pct qual map-score]
(+ (min-max-norm kmer-pct 1e-5 0.10)
(min-max-norm qual 4.0 35.0)
(min-max-norm map-score 0.0 250.0)))
[/sourcecode]

The final part of the code involves parsing the files and producing the snpdata and targets inputs to the query. That code splits each line in the input file and assigns the parts to the variables of interest:

[sourcecode language="clojure"]
(defmapop parse-snpdata-line [line]
(let [[space pos base qual kmer-pct map-score] (split line #"\t")]
[space (Integer/parseInt pos) base (Integer/parseInt qual)
(Float/parseFloat kmer-pct) (Integer/parseInt map-score)]))

(defn snpdata-from-hfs [dir]
(let [source (hfs-textline dir)]
(<- [?chr ?pos ?base ?qual ?kmer-pct ?map-score]
(source ?line)
(parse-snpdata-line ?line :> ?chr ?pos ?base ?qual
?kmer-pct ?map-score))))
[/sourcecode]

The full project is available on GitHub. To run on a configured Hadoop system, you build the code, copy your input files to HDFS, then run:

% lein deps
% lein uberjar
% hadoop fs -put your_variation_data.tsv /tmp/snp-assess/data
% hadoop fs -put positions_of_interest.tsv /tmp/snp-assess/positions
snp_assess.core /tmp/snp-assess/data /tmp/snp-assess/positions


The same code can also run locally without Hadoop. This is extremely useful for testing and development, or for smaller datasets that do not require the distributed power of Hadoop:

% lein deps
% lein run :snp-data /directory/of/varation/data /directory/of/positions


Both approaches generate tabular output with our positions, counts, scores and average metrics:

| 951 | T |  3 | 0.9 | 2.0e-04 | 24.7 | 55.7  | false_positive |
| 953 | A | 10 | 1.5 | 1.6e-02 | 23.1 | 175.5 | true_positive  |