Low concordance of multiple variant-calling pipelines: practical implications for exome and genome sequencing

Jason O'Rawe, Sun Guangqing, Wei Wang, Jinchu Hu, Paul Bodily, Lifeng Tian, Hakon Hakonarson, Evan Johnson, Zhi Wei, Tao Jiang, Kai Wang, Gholson Lyon and Yiyang Wu

Genome Medicine 2013, 5:28 doi:10.1186/gm432
Published: 27 March 2013

Abstract (provisional)

Background

To facilitate the clinical implementation of genomic medicine by next-generation sequencing, it will be critically important to obtain accurate and consistent variant calls on personal genomes. Multiple software tools for variant calling are available, but it is unclear how comparable these tools are or what their relative merits in real-world scenarios might be.

Methods

We sequenced 15 exomes from four families using the Illumina HiSeq 2000 platform and the Agilent SureSelect v.2 capture kit, with ~120X mean coverage. We analyzed the raw data using near-default parameters with five different alignment and variant-calling pipelines (SOAP, BWA-GATK, BWA-SNVer, GNUMAP, and BWA-SAMTools). We additionally sequenced a single whole genome using the Complete Genomics (CG) sequencing and analysis pipeline, with 95% of the exome region covered by 20 or more reads per base. Finally, we attempted to validate 919 SNVs and 841 indels, including similar fractions of GATK-only, SOAP-only, and shared calls, on the MiSeq platform by amplicon sequencing with ~5000X average coverage.

Results

SNV concordance between the five Illumina pipelines across all 15 exomes was 57.4%, while 0.5% to 5.1% of variants were called as unique to each pipeline. Indel concordance was only 26.8% between the three indel-calling pipelines, even after left-normalizing and intervalizing genomic coordinates by 20 base pairs. Of the CG variants that fall within the regions targeted by exome sequencing, 11% were not called by any of the Illumina-based exome analysis pipelines. Based on targeted amplicon sequencing on the MiSeq platform, 97.1%, 60.2% and 99.1% of the GATK-only, SOAP-only and shared SNVs, respectively, could be validated, but only 54.0%, 44.6% and 78.1% of the GATK-only, SOAP-only and shared indels could be validated. Additionally, our analysis of two families, one containing four individuals and the other containing seven, demonstrates the additional accuracy gained in variant discovery by having access to genetic data from a multi-generational family.
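
To make the concordance figures above concrete, the following Python sketch shows one way such numbers could be computed from per-pipeline call sets. It is an illustration under stated assumptions, not the authors' code: the abstract does not define concordance, so it is taken here to be the fraction of the union of calls that every pipeline reports, and "intervalizing by 20 base pairs" is approximated as matching indels whose positions lie within 20 bp of each other on the same chromosome. The function names and the variant tuple layout are hypothetical.

```python
from typing import Dict, Set, Tuple

# Hypothetical variant key: (chromosome, position, ref allele, alt allele).
Variant = Tuple[str, int, str, str]
# Hypothetical indel key after left-normalization: (chromosome, position).
Indel = Tuple[str, int]


def concordance(calls: Dict[str, Set[Variant]]) -> float:
    """Fraction of the union of calls reported by every pipeline.

    ASSUMPTION: concordance = |intersection| / |union|.
    """
    sets = list(calls.values())
    union = set().union(*sets)
    shared = set.intersection(*sets) if sets else set()
    return len(shared) / len(union) if union else 0.0


def unique_fraction(calls: Dict[str, Set[Variant]], name: str) -> float:
    """Fraction of the union of calls reported only by pipeline `name`."""
    union = set().union(*calls.values())
    others = set().union(*(s for n, s in calls.items() if n != name))
    return len(calls[name] - others) / len(union) if union else 0.0


def indels_matched(a: Set[Indel], b: Set[Indel], window: int = 20) -> Set[Indel]:
    """Indels in `a` with a counterpart in `b` on the same chromosome
    within `window` base pairs.

    ASSUMPTION: this stands in for the 20 bp intervalizing step.
    """
    return {
        (chrom, pos)
        for chrom, pos in a
        if any(chrom == c2 and abs(pos - p2) <= window for c2, p2 in b)
    }


if __name__ == "__main__":
    # Toy call sets, invented for illustration only.
    v1, v2, v3 = ("chr1", 100, "A", "G"), ("chr1", 200, "C", "T"), ("chr2", 50, "G", "A")
    v4, v5 = ("chr2", 75, "T", "C"), ("chr3", 10, "A", "T")
    toy = {"P1": {v1, v2, v3}, "P2": {v1, v2, v4}, "P3": {v1, v5}}
    print(concordance(toy))
    print(unique_fraction(toy, "P1"))
    print(indels_matched({("chr1", 100), ("chr1", 500)}, {("chr1", 115), ("chr2", 500)}))
```

A windowed match of this kind is deliberately lenient, since different callers may place the same indel at slightly different coordinates. That indel concordance stays low even with such tolerance is the point the Results make.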

Conclusions

Our results suggest that more caution should be exercised in genomic medicine settings when analyzing individual genomes, including scrutinizing both positive and negative findings, especially for indels. We advocate for renewed collection and sequencing of multi-generational families, so as to increase the overall accuracy of whole-genome analyses.