6 Realities of Genomic Research

June 19, 2015 by Dan Koboldt

The rise of next-generation sequencing has worked wonders for the fields of genetics and genomics. It’s also generated a considerable amount of hype about the power of genome sequencing, particularly the possibility of individualized medicine based on genetic information. The rapid advances in technology (most recently, the Illumina X Ten system) have made large-scale whole-genome sequencing studies feasible where they were once impossible. I’ve already written about some of the possible applications of inexpensive genome sequencing.

I’m as excited about this as anyone (with the possible exception of Illumina). Even so, we should keep in mind that not everything is unicorns and rainbows when it comes to genomic research. Here are some observations I’ve made about sequencing-empowered genomic research over the past few years.

1. There is never enough power

“Power” is a term that’s being discussed more and more as we plan large-scale sequencing studies of common disease. In essence, it answers the question, “What fraction of the associated variants can we detect with this study design, given the number of samples, inheritance pattern, penetrance, etc.?” Several years ago, when ambitious genome-wide association studies (GWAS) became feasible, there was a hope that much of the heritability of common disease could be attributed to common variants with minor allele frequencies of, say, 5% or more.

If that were true, it would have been very good news, because:

We could test such variants in large cohorts using high-density SNP arrays, which are inexpensive
Our power to detect associations was high because many samples in each cohort would carry the variants
Associated common variants would “explain” susceptibility in more individuals, narrowing the scope of follow-up.

GWAS efforts have revealed thousands of replicated genetic associations. However, it’s clear that a significant proportion of common disease risk comes from rare variants, which might be specific to an individual, family, or population. To achieve power to detect association for these rare variants, you need massive sample sizes (10,000 or more). You also need to use sequencing, since many of these variants will not be on SNP arrays (even exome-chip); some might never have been seen before.

Despite the falling costs of sequencing, cohorts of that size require a considerable investment.
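To get a feel for why rare variants demand such large cohorts, here is a minimal sketch that counts expected carriers at different allele frequencies. It is not a power calculation, only the intuition behind one. It assumes Hardy-Weinberg proportions, and the function names and the 0.1% "rare" frequency are illustrative choices of mine, not figures from any particular study.

```python
def carrier_probability(maf: float) -> float:
    """Chance that a diploid individual carries at least one minor allele.

    Assumes Hardy-Weinberg proportions (an assumption for this sketch).
    """
    return 1.0 - (1.0 - maf) ** 2


def expected_carriers(n_samples: int, maf: float) -> float:
    """Expected number of carriers in a cohort of n_samples individuals."""
    return n_samples * carrier_probability(maf)


if __name__ == "__main__":
    # 5% is the "common variant" cutoff mentioned in the text;
    # 0.1% is a hypothetical rare-variant frequency.
    for maf in (0.05, 0.001):
        for n in (1_000, 10_000):
            carriers = expected_carriers(n, maf)
            print(f"MAF {maf:.3f}, N = {n:>6}: about {carriers:.1f} carriers")
```

Under these assumptions, a 5% variant shows up in roughly a hundred people out of every thousand sequenced, while a 0.1% variant shows up in only a couple, which is why the sample sizes have to grow so much.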

2. There will be errors, both human and technical

If the power calculations call for sequencing 10,000 samples, you’d better pad that number in the production queue. Some samples will fail due to technical reasons, such as library failure or contamination. Others may fall victim to human or machine errors. We can address some failures (such as a sample swap) with computational approaches, but others will mean that a sample gets excluded.

The challenge of a 10,000 sample study is that, even with very low error/failure rates, the number of samples that must ultimately be excluded from the study might be a little shocking.
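The arithmetic is easy to sketch. The failure rates below are hypothetical, chosen only to show the scale of the problem, and the helper names are mine.

```python
import math


def expected_exclusions(n_queued: int, failure_rate: float) -> float:
    """Expected number of samples lost, given a per-sample failure rate."""
    return n_queued * failure_rate


def samples_to_queue(n_required: int, failure_rate: float) -> int:
    """Smallest queue size whose expected passing count reaches n_required."""
    if not 0.0 <= failure_rate < 1.0:
        raise ValueError("failure_rate must be in [0, 1)")
    return math.ceil(n_required / (1.0 - failure_rate))


if __name__ == "__main__":
    n_required = 10_000
    # Hypothetical combined technical + human failure rates.
    for rate in (0.01, 0.02, 0.05):
        lost = expected_exclusions(n_required, rate)
        queue = samples_to_queue(n_required, rate)
        print(f"rate {rate:.0%}: ~{lost:.0f} lost of {n_required}; queue {queue}")
```

This only covers the expected value. A real production plan would probably want some extra margin on top, since the actual number of failures will vary around that average.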

3. Signal to noise problems increase

One of the greatest advantages of whole genome sequencing is that it’s an unbiased survey of genetic variation. It lets us search for associations without any underlying assumptions like “associated variants must be in coding regions.” One potential disadvantage is that we’ll be looking at 3-4 million sequence variants in every genome.

Classic GWAS approaches rely on SNP arrays, which interrogate (on average) 700,000 to 1 million carefully selected, validated, assayable markers. The call rates on those platforms are usually >99%. Now we’re talking about genome-wide sequencing and variant detection. It means we’ll most likely be able to detect variants that contribute to disease risk, but we’ll also have to examine millions of variants that have no effect on it.

In contrast, a candidate gene study or even exome sequencing has the benefit of pre-selecting regions most likely to harbor functional variants. Not only are there fewer variants, but all things being equal they’re more likely to be relevant because they affect proteins.
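One way to see the cost of testing more variants is through a multiple-testing correction. The sketch below uses a Bonferroni correction, which is a simple and conservative choice I am using for illustration rather than a recommendation for any specific study. The test counts are assumptions: 1 million stands in for a SNP array, and 4 million for the per-genome variant count mentioned above (the number of distinct variants across a whole cohort would be larger still).

```python
def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test significance threshold that keeps the family-wise error at alpha."""
    if n_tests <= 0:
        raise ValueError("n_tests must be positive")
    return alpha / n_tests


if __name__ == "__main__":
    alpha = 0.05
    # Illustrative test counts (assumptions, see the text above).
    scenarios = {
        "SNP array (~1 million markers)": 1_000_000,
        "whole genome (~4 million variants)": 4_000_000,
    }
    for label, n_tests in scenarios.items():
        threshold = bonferroni_threshold(alpha, n_tests)
        print(f"{label}: p < {threshold:.2e}")
```

Every extra variant tested tightens the threshold that a true association has to clear, so the noise problem feeds directly back into the power problem.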

4. We can’t predict all variant consequences

Annotation tools such as VEP and ANNOVAR have come a long way towards helping us identify computationally which variants are most likely to be deleterious. However, their annotations are based on our knowledge of the genome and its functional elements (which remains incomplete) and our best guess as to which variations cause which effects.

Outside of the coding regions, we face an even greater challenge. That’s where most human genetic variation resides, including the substantial fraction expected to play regulatory roles in the genome. Thus, understanding the mechanism by which associated variants affect disease risk will be a long and difficult prospect. It will likely cost more time and resources than finding those variants in the first place.

5. There’s always a better informatics tool

The incredible power of next-gen sequencing required a new generation of analysis tools simply to handle the new nature and vast scale of data. We’ve done well to address many of the challenges, but developing these tools takes time. Keeping them relevant is a particular struggle, because sequencing technologies continue to rapidly evolve.

I remember a meeting a few years ago when we were working on Illumina short-read sequencing (36 bp reads, possibly even single-end) and wondering if we could find a way to build 100 bp contigs. I remember thinking, if we can get to 100 bp, we’ll be home free.

The current read length on Illumina X Ten is 150 bp. The MiSeq platform (while admittedly not for whole-genome sequencing) does 250 bp. And now that still seems far too short, especially to identify structural variation and interrogate the complex regions of the human genome (repeats, HLA, etc.).

6. You can spin a story about any gene

The huge investments and advances in the field of genetics over the past 50+ years have helped us build an incredible wealth of knowledge about genes and their relationships to human health. Granted, a large number of genes have no known function. Even so, with known disease associations, expression patterns, sequence similarities, pathway membership, and other sources of data, we have a lot to work with when it comes time to explain how a gene might be involved with a certain disease.

There’s a danger in that, because it gives us enough information to spin a story about any gene: to construct a plausible explanation of how variation in that gene could be involved in the phenotype of interest. We also have to admit that databases and the literature may contain false reports. For example, a recent examination by the ClinGen consortium found that hundreds of variants listed as pathogenic in the OMIM database are now being annotated as benign or of uncertain significance by clinical laboratories.

With great power comes great responsibility, and at this moment in genomics there is no greater power than large-scale whole-genome sequencing.

References
Rehm HL, Berg JS, Brooks LD, Bustamante CD, Evans JP, Landrum MJ, Ledbetter DH, Maglott DR, Martin CL, Nussbaum RL, Plon SE, Ramos EM, Sherry ST, Watson MS, & ClinGen (2015). ClinGen: the Clinical Genome Resource. The New England Journal of Medicine, 372 (23), 2235-42. PMID: 26014595