# Overview

The underlying work that goes into preparing good filters for variant calling is not always exposed to the biologists who use variants for making research and clinical decisions. Filter sets like the GATK best practice pipeline hard filters perform well and are widely used, but lack detailed background on the underlying truth sets and methods used to derive them.

As a result, some scientists treat variant calling and filtering as a solved problem. This creates a disconnect between researchers who work on the underlying algorithms and understand the filtering trade offs and imperfectness of our current analysis tools, and users who take variants as inputs to downstream analysis. Additionally, there is often room for improvement in filters, even widely used ones. We recently tweaked mapping quality filters for GATK from the defaults to provide an improvement in sensitivity without loss of precision.

To provide more clarity into work we do as part of bcbio development, this post describes improved filters for VarDict tumor/normal cancer variant calling. VarDict (v1.4.5) is a freely available small variant (SNP and insertions/deletions) caller from Zhongwu Lai and Jonathan Dry’s group in AstraZeneca Oncology. The new set of filters use the synthetic 4 dataset from the ICGC-TCGA DREAM challenge truth set for validation. This dataset is more difficult than the synthetic 3 dataset used in previous validations since it represents a heterogeneous tumor with 20% normal contamination and two lower frequency sub-clones.

The filters originally derived from the synthetic 3 dataset analysis result in VarDict underperforming on SNP sensitivity and precision, relative to MuTect (v1.1.5):

after adjustment with a new filtering strategy, VarDict is now on par with MuTect for SNP sensitivity and precision, and better than Scalpel (v0.5.1) on sensitivity for insertions and deletions. It also performs well in comparison with VarScan (2.4.1), FreeBayes (v0.9.21-26) and MuTect2 (3.5-0):

The improvement in VarDict come from improved filtering of lower frequency mutations. Detecting these is critical to understanding tumor heterogeneity, but is also more challenging. This is analogous to the way that filters for germline calling often primarily benefit detection of heterozygote calls.

We’ll describe the new filters that led to the improvement, showing plots used to identify the new linear combinations of filtering parameters. Cross-validation with a second evaluation shows that these are general improvements and provide similar benefits on other datasets. We’ll discuss the visual approach used here versus machine learning methods, and how we hope to avoid over-training and devise generally useful filters for detecting low frequency variants.

# Cross validation

To avoid over-optimizing to the DREAM datasets, we also compared caller accuracy against a mixture of two Genome in a Bottle datasets, NA12878 and NA24385, prepared by researchers at the Hartwig Medical Foundation. Using a mixture of the two samples as the tumor and NA24385 alone as a normal allows identification of NA12878 variants as somatic calls. For an mixture of the two samples with 30% NA12878 and 70% NA24385, the expected allele frequencies are 15% for NA12878 heterozygotes not present in NA24385 and 30% for NA12878 homozygotes not in NA24385. While not a real tumor, this approach exercises many of the features of somatic callers.

Our VarDict filter improvements carry over to this dataset and provide good resolution of both SNPs and Indels relative to the other callers:

While most callers have similar patterns of sensitivity and specificity in both the DREAM synthetic 4 and NA12878/NA24385 evaluations, MuTect2 (3.5-0) performs much differently between the two. MuTect2 had the best resolution of SNPs and Indels in the DREAM dataset, but performs the worst in the Genome in a Bottle mixture validation. This suggests that MuTect2, which is still under development at the Broad Institute and released as beta software for research use only, may currently be over optimized to features of the DREAM datasets.

# Filter development

The key insight for improving VarDict sensitivity was that providing a more lenient p-value filter (-P) to VarDict results in detection of many of the initially missing variants. However, setting a lenient p-value filter also results in poor precision. By using this as a starting point and introducing two new filters for low frequency variants, we were able to recover good precision without losing the sensitivity gains.

## Low frequency with low depth

The developed filter provides better discrimination of low frequency variants with low depth, starting with the the recommendation of 13X coverage to accurately detect heterozygote calls. Generalizing heterozygote allele frequency (0.5) and depth (13) to any allele frequency gives us a recommended coverage where variant allele frequency times depth at a position is six or more. We apply the new filters only in lower frequency (AF) and lower depth (DP) calls that fail to meet this threshold. Within this set of calls, we filter variants failing one of 3 other criteria for mapping quality (MQ), number of read mismatches (NM), low depth (DP) or low quality (QUAL). In pseudo-code, the new filter is:

((AF * DP < 6) &&
((MQ < 55.0 && NM > 1.0) ||
(MQ < 60.0 && NM > 2.0) ||
(DP < 10) ||
(QUAL < 45)))


Breaking this down into each component:

• Remove variants supported by reads with low mapping quality and multiple mismatches. Poorly mapped reads with multiple errors contribute to false positive low frequency calls. VarDict reports two useful metrics: the mean number of mismatches (NM) and mean mapping quality (MQ) for the reads covering a position in the genome. For bwa mapping qualities, we filter calls with (MQ < 55.0 and NM > 1.0 or MQ < 60.0 and NM > 2.0). Plotting the relationship between mapping quality (MQ) and number of mismatches (NM) shows the distributions of these metrics. In true positives they cluster near the edges of high mapping quality and a single mismatch:

In false positives there is a more even distribution of the values amongst the ranges of mapping quality and mismatches:

This filter is currently tuned to the bwa aligner (v0.7.12), which uses 60 as the maximum mapping quality. In bcbio, we only use this filter when using the bwa aligner, but we’d also like to generalize by testing with other alignment methods and associating poor mapping quality scores with genome mappability tracks.

• Low depth (DP < 10) – Low depth calls, coupled with AF * DP < 6, contain a disproportionate number of false positives.

• Low quality (QUAL < 45) – Similarly, low quality values correlate with false positive calls.

## Low frequency with poor quality

The first set of filters improved precision but still retained 3x more false positives compared with MuTect. To improve precision further, we looked at the remaining true and false positive calls and found false positives had a combination of low frequency (AF), low quality (QUAL) and high p-values (SSF). The pseudo-code for this filter is:

AF < 0.2 && QUAL < 55 && SSF > 0.06


Breaking down the distribution differences for each of these filters.

• Allele frequency < 0.2

• Quality < 55

• P-value (SSF) > 0.06

# Filter development

We identified these filters by manual inspection of metrics graphs. We run automated calling and validation with bcbio, then plot metrics and look for differences that discriminate true and false positive calls. Practically, we extract metrics from VCF files into tables using bio-vcf, then plot with pandas, seaborn and matplotlib. Since the base set of calls already optimize single metric filters, we start with linear combinations of metrics, subsetting and re-plotting as we identify sets of filters that do a good job of separating signal and noise.

The updated filters for VarDict improve sensitivity and precision on low frequency variants. SNP call rates are on par with MuTect, and VarDict is more sensitive and precise than Scalpel and other callers for insertions and deletions.

Our goal with using the DREAM synthetic truth sets is to discover generally useful metrics, rather than performing well on this particular dataset. To avoid over-training we use our best intuition to select metrics based on depth and quality which will apply generally across other samples. By cross-validating against a second dataset, a mixture of two Genome in a Bottle samples, we show that the results do generalize. Ideally, we’d also include a panel of additional test sets, including real tumors. A recent paper from Malachi Griffith, Chris Miller and colloborators at the McDonnell Genome Institute has an extensively characterized AML31 case with validated variants. The ICGC also published a well characterized set of validated calls on a high depth medulloblastoma sample. We plan to cross-validate on these datasets to continue to improve filtering approaches for heterogeneous, low frequency variants.

This approach is analagous to applying machine learning tools to discriminate true and false positives on the training set, and we’d like to incorporate additional automated approaches as we continue to improve filters. In the past we’ve used support vector machines (SVMs) to learn and apply filtering metrics. This approach worked well, but tended to over-train to specific datasets. Because it was difficult to extract the underlying relationships in the filter, we didn’t have a way to reliably estimate when a filter was general versus over-trained. We’d ideally like to use a hybrid approach where we use machine learning to identify potentially useful linear combinations of metrics, then refine these with manual inspection. We’d love feedback from others with experience doing this type of filtering, and would especially welcome suggestions for practical methods for apply machine learning methods to the initial metric selection steps.

This filter improvement methodology is general and can apply to other callers, although it’s not always easy to derive metrics that will effectively separate true and false calls. For instance, we also wanted to improve MuTect and Scalpel calling on this dataset. MuTect doesn’t offer quality or other annotation metrics to provide a way to build additional filters beyond their implemented heuristics. Scalpel does provide additional metrics and has additional true positive calls filtered by the default calling approach, so we could improve in sensitivity. However, we were unable to improve sensitivity without a large decrease in precision. Unfortunately improving callers is still often a laborious process done on a case-by-case basis. We hope to continue to improve the process by automating validation work and incorporating a large number of diverse validation sets.

# Running the analysis

Full details for running this comparison are available in the bcbio example pipeline documentation. The DREAM synthetic 4 dataset requires obtaining access through ICGC-DACO since the input data comes from real human tumors. It does challenge variant callers more than the unrestricted synthetic dataset 3 due to the added heterogeneity and lower frequency variants.

Once you have access to the input data, bcbio automatically runs alignment, variant calling and validation, producing the validation graphs shown above. The synthetic 4 example includes alignments for both build 37 and 38 as well as structural variant calling, so is a full tumor characterization you can use as a template for calling on your cancer samples. By combining validation and analysis tools we provide a integrated approach that both improves algorithms and runs real research samples.