Validating multiple cancer variant callers and prioritization in tumor-only samples

Overview

The post discusses work validating multiple cancer variant callers in bcbio-nextgen using a synthetic reference call set from the ICGC-TCGA DREAM challenge. We've previously validated germline variant calling methods, but cancer calling is additionally challenging. Tumor samples have mixed cellularity due to contaminating normal sample, and consist of multiple sub-clones with different somatic variations. Low-frequency sub-clonal variations can be critical to understand disease progression but are more difficult to detect with high sensitivity and precision.

Publicly available whole genome truth sets like the NA12878 Genome in a Bottle reference materials don't currently exist for cancer calling, but other groups have been working to prepare standards for use in evaluating callers. The DREAM challenge provides a set of synthetic datasets that include cellularity and multiple sub-clones. There is also on ongoing DREAM contest with real, non-simulated data. In addition, the ICGC has done extensive work on assessing the challenges in cancer variant calling and produced a detailed comparison of multiple variant calling pipelines. Finally, Bina recently released a simulation framework called varsim that they used to compare multiple callers as part of their cancer benchmarking. I'm excited about all this community benchmarking and hope this work can contribute to the goal of having fully consented real patient reference materials, leading to a good understanding of variant caller tradeoffs.

In this work, we evaluated cancer tumor/normal variant calling with synthetic dataset 3 from the DREAM challenge, using multiple approaches to detect SNPs, indels and structural variants. A majority voting ensemble approach combines inputs from multiple callers into a final callset with good sensitivity and precision. We also provide a prioritization method to enrich for somatic mutations in tumor samples without matched normals.

Cancer variant calling in bcbio is due to contributions and support from multiple members of the community:

• Miika Ahdesmaki and Justin Johnson at AstraZeneca collaborated with our group on integrating and evaluating multiple variant callers. Their financial support helped fund our time on this comparison, and their scientific contributions improved SNP and indel calling.
• Luca Beltrame integrated a number of cancer variant callers into bcbio and wrote the initial framework for somatic calling.
• Lorena Pantano integrated variant callers and performed benchmarking work. Members of our group, the Harvard Chan school Bioinformatics core, regularly contribute to bcbio development and validation work.
• The Wolfson Wohl Cancer Research Centre supported our work on validation of somatic callers.
• James Cuff, Paul Edmon and the team at Harvard FAS research computing provided compute infrastructure and support that enabled the large number of benchmarking runs it takes to get good quality calling.

I'm continually grateful to the community for all the contributions and support. The MuTect commit history for bcbio is a great example of multiple distributed collaborators working towards the shared goal of integrating and validating a caller.

Variant caller validation

We used a large collection of open source variant callers to detect SNPs, Indels and structural variants:

bcbio runs these callers and uses simple ensemble methods to combine small variants (SNPs, indels) and structural variants into final combined callsets. The new small variant ensemble method uses a simplified majority rule classifier that picks variants to pass based on being present in a configurable number of samples. This performs well and is faster than the previous implementation that made use of both this approach and a subsequent support vector machine step.

Using the 100x whole genome tumor/normal pair from DREAM synthetic dataset 3 we evaluated each of the callers for sensitivity and precision on small variants (SNPs and indels). This synthetic dataset contains 100% tumor cellularity with 3 subclones at allele frequencies of 50%, 33% and 20%.

In addition to the whole genome results, the validation album includes results from running against the same dataset limited to exome regions. This has identical patterns of sensitivity and precision. It runs quicker, so is useful for evaluating changes to filtering or program parameters.

We also looked at structural variant calls for larger deletions, duplications and inversions. Here is the precision and sensitivity for duplications across multiple size classes:

The full album of validation results includes the comparisons for deletion and inversion events. These comparisons measure the contributions of individual callers within an ensemble approach that attempts to maximize sensitivity and specificity for the combined callset. Keep in mind that each of the individual results make use of other caller information in filtering. Our goal is to create the best possible combined calls, rather than a platform for unbiased comparison of structural variant callers. We're also actively working on improving sensitivity and specificity for individual callers and expect the numbers to continue to evolve. For example, Zev Kronenberg added WHAM's ability to identify different classes of structural changes, and we're still in the process of improving the classifier.

Improvements in filtering

Our evaluation comparisons show best effort attempts to provide good quality calls for every caller. The final results often come from multiple rounds of improving sensitivity and precision by adjusting program parameters or downstream filtering. The goal of tightly integrating bcbio with validation is that the community can work on defining a set of parameters and filters that work best in multiple cases, and then use these directly within the same framework for processing real data.

In presenting the final results only, it may not be clear that plugging a specific tool into a custom bash script will not always produce the same results we see here. As an example, here are the improvements in FreeBayes sensitivity and precision from our initial implementation, presented over the exome regions of synthetic dataset 3:

The original implementation used a vcfsamplediff based approach to filtering, as recommended on the FreeBayes mailing list. The current, improved, version uses a custom filter based on genotype likelihoods, based on the approach in the speeseq pipeline.

In general, you can find all of the integration work for individual callers in the bcbio source code, broken down by caller. For instance, here is the integration work on MuTect. The goal is to make all of the configuration settings and filters fully transparent so users can understand how they work when using bcbio, or transplant them into their own custom pipelines.

Tumor-only prioritization

The above validations were all done on cancer calling with tumor and normal pairs. The filters to separate pre-existing germline mutations from cancer specific somatic mutations rely on the presence of a normal sample. In some cases, we don't have matched normal samples to do this filtering. Two common examples are FFPE samples and tumor cell lines. For these samples, we'd like to be able to prioritize likely tumor specific variations for followup using publicly available resources.

We implemented a prioritization strategy from tumor-only samples in bcbio that takes advantage of publicly available resources like COSMIC, ClinVar, 1000 genomes, ESP and ExAC. It uses GEMINI to annotate the initial tumor-only VCF calls with external annotations, then extracts these to prioritize variants with high or medium predicted impact, not present in 1000 genomes or ExAC at more than 1% in any subpopulation, or identified as pathenogenic in COSMIC or ClinVar.

Validating this prioritization strategy requires real tumor samples with known mutations. Our synthetic datasets are not useful here, since the variants do not necessarily model standard biological variability. You could spike in biologically relevant mutations, as done in the VarSim cancer simulated data, but this will bias towards our prioritization approach since both would use the same set of necessarily imperfect known variants and population level mutations.

We took the approach of using published tumor data with validated mutations. Andreas Sjödin identified a Hepatoblastoma exome sequencing paper with publicly available sample data and 23 validated cancer related variations across 5 samples. This is a baseline to help determine how stringent to be in removing potential germline variants.

The prioritization enriches variants of interest by 35-50x without losing sensitivity to confirmed variants:

sample caller confirmed enrichment additional filtered
HB2T freebayes 6 / 7 44x 1288 56046
HB2T mutect 6 / 7 48x 1014 47755
HB2T vardict 6 / 7 36x 1464 52090
HB3T freebayes 4 / 4 46x 1218 54997
HB3T mutect 4 / 4 49x 961 46894
HB3T vardict 4 / 4 35x 1511 51404
HB6T freebayes 4 / 4 43x 1314 56240
HB6T mutect 4 / 4 51x 946 47747
HB6T vardict 3 / 4 35x 1497 51625
HB8T freebayes 6 / 6 42x 1364 57121
HB8T mutect 6 / 6 47x 1053 48639
HB8T vardict 6 / 6 35x 1542 52642
HB9T freebayes 2 / 2 41x 1420 57582
HB9T mutect 2 / 2 44x 1142 49858
HB9T vardict 2 / 2 36x 1488 53098

We consistently missed one confirmed mutation in the HB2T sample. This variant, reported as a somatic mutation in an uncharacterized open reading frame (C2orf57), may actually be a germline mutation in the study sub-population. The variant is present at a 10% frequency in the East Asian population but only 2% in the overall population, based on data from both the ExAC and 1000 genomes projects. Although the ethnicity of the original samples is not reported, the study authors are all from China. This helps demonstrate the effectiveness of large population frequencies, stratified by population, in prioritizing and evaluating variant calls.

The major challenge with tumor-only prioritization approaches is that you can't expect to accurately filter private germline mutations that you won't find in genome-wide catalogs. With a sample set using a small number of validated variants it's hard to estimate the proportion of 'additional' variants in the table above that are germline false positives versus the proportion that are additional tumor-only mutations not specifically evaluated in the study. We plan to continue to refine filtering with additional validated datasets to help improve this discrimination.

Practically, bcbio automatically runs prioritization with all tumor-only analyses. It filters variants by adding a LowPriority filter to the output VCF so users can readily identify variants flagged during the prioritization.

Future work

This is a baseline for assessing SNP, indel and structural variant calls in cancer analyses. It also prioritizes impact variants in cases where we lack normal matched normals. We plan to continue to improve cancer variant calling in bcbio and some areas of future focus include:

• Informing variant calling using estimates of tumor purity and sub-clonal frequency. bcbio integrates CNVkit, a copy number caller, which exports read count segmentation data. Tools like THetA2, phyloWGS, PyLOH and sciClone use these inputs to estimate normal contamination and sub-clonal frequencies.
• Focusing on difficult to call genomic regions and provide additional mechanisms to better resolve and improve caller accuracy in these regions. Using small variant calls to define problematic genome areas with large structural rearrangements can help prioritize and target the use of more computationally expensive methods for copy number assessment, structural variant calling and de-novo assembly.
• Evaluating calling and tumor-only prioritization on Horizon reference standards. They contain a larger set of validated mutations at a variety of frequencies.

As always, we welcome suggestions, contributions and feedback.