The DNA Data Deluge - IEEE Spectrum
The DNA Data Deluge
Fast, efficient genome sequencing machines are spewing out more data than geneticists can analyze
In June 2000, a press conference was held in the White House to announce an extraordinary feat: the completion of a draft of the human genome.
For the first time, researchers had read all 3 billion of the chemical “letters” that make up a human DNA molecule, which would allow geneticists to investigate how that chemical sequence codes for a human being. In his remarks, President Bill Clinton recalled the moment nearly 50 years prior when Francis Crick and James Watson first discovered the double-helix structure of DNA. “How far we have come since that day,” Clinton said.
But the president’s comment applies equally well to what has happened in the ensuing years. In little more than a decade, the cost of sequencing one human genome has dropped from hundreds of millions of dollars to just a few thousand dollars. Instead of taking years to sequence a single human genome, it now takes about 10 days to sequence a half dozen at a time using a high-capacity sequencing machine. Scientists have built rich catalogs of genomes from people around the world and have studied the genomes of individuals suffering from diseases; they are also making inventories of the genomes of microbes, plants, and animals. Sequencing is no longer something only wealthy companies and international consortia can afford to do. Now, thousands of benchtop sequencers sit in laboratories and hospitals across the globe.
DNA sequencing is on the path to becoming an everyday tool in life-science research and medicine. Institutions such as the Mayo Clinic and the New York Genome Center are beginning to sequence patients’ genomes in order to customize care according to their genetics. For example, sequencing can be used in the diagnosis and treatment of cancer, because the pattern of genetic abnormalities in a tumor can suggest a particular course of action, such as a certain chemotherapy drug and the appropriate dose. Many doctors hope that this kind of personalized medicine will lead to substantially improved outcomes and lower health-care costs.
But while much of the attention is focused on sequencing, that’s just the first step. A DNA sequencer doesn’t produce a complete genome that researchers can read like a book, nor does it highlight the most important stretches of the vast sequence. Instead, it generates something like an enormous stack of shredded newspapers, without any organization of the fragments. The stack is far too large to deal with manually, so the problem of sifting through all the fragments is delegated to computer programs. A sequencer, like a computer, is useless without software.
But there’s the catch. As sequencing machines improve and appear in more laboratories, the total computing burden is growing. It’s a problem that threatens to hold back this revolutionary technology. Computing, not sequencing, is now the slower and more costly aspect of genomics research. Consider this: Between 2008 and 2013, the performance of a single DNA sequencer increased about three- to fivefold per year. Using Moore’s Law as a benchmark, we might estimate that computer processors basically doubled in speed every two years over that same period. Sequencers are improving at a faster rate than computers are. Something must be done now, or else we’ll need to put vital research on hold while the necessary computational techniques catch up—or are invented.
How can we help scientists and doctors cope with the onslaught of data? This is a hot question among researchers in computational genomics, and there is no definitive answer yet. What is clear is that it will involve both better algorithms and a renewed focus on such “big data” approaches as parallelization, distributed data storage, fault tolerance, and economies of scale. In our own research, we’ve adapted tools and techniques used in text compression to create algorithms that can better package reams of genomic data. And to search through that information, we’ve borrowed a cloud computing model from companies that know their way around big data—companies like Google, Amazon.com, and Facebook.
Think of a DNA molecule as a string of beads. Each bead is one of four different nucleotides: adenine, thymine, cytosine, or guanine, which biologists refer to by the letters A, T, C, and G. Strings of these nucleotides encode the building instructions and control switches for proteins and other molecules that do the work of maintaining life. A specific string of nucleotides that encodes the instructions for a single protein is called a gene. Your body has about 22 000 genes that collectively determine your genetic makeup—including your eye color, body structure, susceptibility to diseases, and even some aspects of your personality.
Thus, many of an organism’s traits, abilities, and vulnerabilities hinge on the exact sequence of letters that make up the organism’s DNA molecule. For instance, if we know your unique DNA sequence, we can look up information about what diseases you’re predisposed to, or how you will respond to certain medicines.
The Human Genome Project’s goal was to sequence the 3 billion letters that make up the genome of a human being. Because humans are more than 99 percent genetically identical, this first genome has been used as a “reference” to guide future analyses. A larger, ongoing project is the 1000 Genomes Project, aimed at compiling a more comprehensive picture of how genomes vary among individuals and ethnic groups. For the U.S. National Institutes of Health’s Cancer Genome Atlas, researchers are sequencing samples from more than 20 different types of tumors to study how the mutated genomes present in cancer cells differ from normal genomes, and how they vary among different types of cancer.
Ideally, a DNA sequencer would simply take a biological sample and churn out, in order, the complete nucleotide sequence of the DNA molecule contained therein. At the moment, though, no sequencing technology is capable of this. Instead, modern sequencers produce a vast number of short strings of letters from the DNA. Each string is called a sequencing read, or “read” for short. A modern sequencer produces reads that are a few hundred or perhaps a few thousand nucleotides long.
The aggregate of the millions of reads generated by the sequencer covers the person’s entire genome many times over. For example, the HiSeq 2000 machine, made by the San Diego–based biotech company Illumina, is one of the most powerful sequencers available. It can sequence roughly 600 billion nucleotides in about a week—in the form of 6 billion reads of 100 nucleotides each. For comparison, an entire human genome contains 3 billion nucleotides. And the human genome isn’t a particularly long one—a pine tree genome has 24 billion nucleotides.
Thus our first daunting task upon receiving the reads is to stitch them together into longer, more interpretable units, such as genes. For a organism that has never been fully sequenced before, like the pine tree, it’s a massive challenge to assemble the genome from scratch, or de novo.
How can we assemble a genome for the first time if we have no knowledge of what the finished product should look like? Imagine taking 100 copies of the Charles Dickens novel A Tale of Two Cities and dropping them all into a paper shredder, yielding a huge number of snippets the size of fortune-cookie slips. The first step to reassembling the novel would be to find snippets that overlap: “It was the best” and “the best of times,” for example. A de novo assembly algorithm for DNA data does something analogous. It finds reads whose sequences “overlap” and records those overlaps in a huge diagram called an assembly graph. For a large genome, this graph can occupy many terabytes of RAM, and completing the genome sequence can require weeks or months of computation on a world-class supercomputer.