The DNA Data Deluge
Fast, efficient genome sequencing machines are spewing out more data than geneticists can analyze
In June 2000, at a White House ceremony marking the completion of a working draft of the human genome, researchers announced that for the first time they had read all 3 billion of the chemical “letters” that make up a human DNA molecule, an achievement that would allow geneticists to investigate how that chemical sequence codes for a human being. In his remarks, President Bill Clinton recalled the moment nearly 50 years prior when Francis Crick and James Watson first discovered the double-helix structure of DNA. “How far we have come since that day,” Clinton said.
But the president’s comment applies equally well to what has happened in the ensuing years. In little more than a decade, the cost of sequencing one human genome has dropped from hundreds of millions of dollars to just a few thousand dollars. Instead of taking years to sequence a single human genome, it now takes about 10 days to sequence a half dozen at a time using a high-capacity sequencing machine. Scientists have built rich catalogs of genomes from people around the world and have studied the genomes of individuals suffering from diseases; they are also making inventories of the genomes of microbes, plants, and animals. Sequencing is no longer something only wealthy companies and international consortia can afford to do. Now, thousands of benchtop sequencers sit in laboratories and hospitals across the globe.
DNA sequencing is on the path to becoming an everyday tool in life-science research and medicine. Institutions such as the Mayo Clinic and the New York Genome Center are beginning to sequence patients’ genomes in order to customize care according to their genetics. For example, sequencing can be used in the diagnosis and treatment of cancer, because the pattern of genetic abnormalities in a tumor can suggest a particular course of action, such as a certain chemotherapy drug and the appropriate dose. Many doctors hope that this kind of personalized medicine will lead to substantially improved outcomes and lower health-care costs.
But here’s the catch. As sequencing machines improve and appear in more laboratories, the total computing burden is growing. It’s a problem that threatens to hold back this revolutionary technology. Computing, not sequencing, is now the slower and more costly aspect of genomics research. Consider this: Between 2008 and 2013, the performance of a single DNA sequencer increased about three- to fivefold per year. Using Moore’s Law as a benchmark, we might estimate that computer processors basically doubled in speed every two years over that same period. Sequencers are improving at a faster rate than computers are. Something must be done now, or else we’ll need to put vital research on hold while the necessary computational techniques catch up—or are invented.
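To get a feel for how quickly that gap compounds, here is a rough back-of-the-envelope calculation in Python. The fourfold-per-year figure for sequencers is simply the midpoint of the range cited above, and the Moore’s Law rate is the usual doubling every two years; both are assumptions for illustration, not measurements.

```python
# Back-of-the-envelope comparison of sequencer throughput growth versus
# Moore's Law, using the figures cited above as assumptions: sequencers
# improve roughly 3- to 5-fold per year (we take 4x), while processor
# performance doubles about every two years.

SEQUENCER_GROWTH_PER_YEAR = 4.0          # assumed midpoint of 3-5x per year
MOORE_GROWTH_PER_YEAR = 2.0 ** (1 / 2)   # 2x every two years, about 1.41x/year

seq_gain = cpu_gain = 1.0
for year in range(2008, 2014):
    print(f"{year}: sequencing {seq_gain:8.1f}x   compute {cpu_gain:5.1f}x   "
          f"gap {seq_gain / cpu_gain:7.1f}x")
    seq_gain *= SEQUENCER_GROWTH_PER_YEAR
    cpu_gain *= MOORE_GROWTH_PER_YEAR
```

Even with these crude assumptions, sequencer throughput pulls ahead of processor performance by more than two orders of magnitude over the five-year span.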
How can we help scientists and doctors cope with the onslaught of data? This is a hot question among researchers in computational genomics, and there is no definitive answer yet. What is clear is that it will involve both better algorithms and a renewed focus on such “big data” approaches as parallelization, distributed data storage, fault tolerance, and economies of scale. In our own research, we’ve adapted tools and techniques used in text compression to create algorithms that can better package reams of genomic data. And to search through that information, we’ve borrowed a cloud computing model from companies that know their way around big data—companies like Google, Amazon.com, and Facebook.
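One reason genomic data compresses so well: a four-letter alphabet needs only 2 bits per nucleotide, compared with the 8 bits of an ordinary text character. The sketch below shows that basic bit-packing idea. It is only an illustration, not the compression scheme we use in our own tools, and the function names are invented here.

```python
# Illustrative 2-bit packing of a nucleotide string: A, C, G, T each map to
# 2 bits, so 4 nucleotides fit in 1 byte (a 4x saving over plain ASCII,
# before any further compression is applied).

CODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
BASE = "ACGT"

def pack(seq: str) -> bytes:
    """Pack a string over {A, C, G, T} into 2 bits per nucleotide."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        group = seq[i:i + 4]
        byte = 0
        for ch in group:
            byte = (byte << 2) | CODE[ch]
        # Left-align a final partial group so unpacking stays simple.
        byte <<= 2 * (4 - len(group))
        out.append(byte)
    return bytes(out)

def unpack(data: bytes, length: int) -> str:
    """Recover the first `length` nucleotides from packed bytes."""
    chars = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            chars.append(BASE[(byte >> shift) & 0b11])
    return "".join(chars[:length])

read = "ACGTACGTTTGACC"
assert unpack(pack(read), len(read)) == read
print(len(read), "nucleotides ->", len(pack(read)), "bytes")
```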
Think of a DNA molecule as a string of beads. Each bead is one of four different nucleotides: adenine, thymine, cytosine, or guanine, which biologists refer to by the letters A, T, C, and G. Strings of these nucleotides encode the building instructions and control switches for proteins and other molecules that do the work of maintaining life. A specific string of nucleotides that encodes the instructions for a single protein is called a gene. Your body has about 22 000 genes that collectively determine your genetic makeup—including your eye color, body structure, susceptibility to diseases, and even some aspects of your personality.
Thus, many of an organism’s traits, abilities, and vulnerabilities hinge on the exact sequence of letters that make up the organism’s DNA molecule. For instance, if we know your unique DNA sequence, we can look up information about what diseases you’re predisposed to, or how you will respond to certain medicines.
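In computational terms, then, a genome is just a very long string over a four-letter alphabet, and many biological questions reduce to string operations. Here is a toy illustration in Python; the sequence fragment and the motif searched for are made up.

```python
# A genome is, computationally, a long string over the alphabet {A, C, G, T}.
# This toy example stores a made-up stretch of sequence, checks whether it
# contains a short motif of interest, then tallies its composition.
from collections import Counter

genome_fragment = "TTACGGATCCAAGTACGTACGGATTT"   # hypothetical sequence
motif = "GGATCC"                                 # hypothetical motif

position = genome_fragment.find(motif)
if position >= 0:
    print(f"motif found at position {position}")
else:
    print("motif not found in this fragment")

print(Counter(genome_fragment))   # counts of each nucleotide
```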
Ideally, a DNA sequencer would simply take a biological sample and churn out, in order, the complete nucleotide sequence of the DNA molecule contained therein. At the moment, though, no sequencing technology is capable of this. Instead, modern sequencers produce a vast number of short strings of letters from the DNA. Each string is called a sequencing read, or “read” for short. A modern sequencer produces reads that are a few hundred or perhaps a few thousand nucleotides long.
The aggregate of the millions of reads generated by the sequencer covers the person’s entire genome many times over. For example, the HiSeq 2000 machine, made by the San Diego–based biotech company Illumina, is one of the most powerful sequencers available. It can sequence roughly 600 billion nucleotides in about a week—in the form of 6 billion reads of 100 nucleotides each. For comparison, an entire human genome contains 3 billion nucleotides. And the human genome isn’t a particularly long one—a pine tree genome has 24 billion nucleotides.
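Those figures translate directly into “coverage”: the average number of reads that span each position in the genome. A quick sanity check in Python, using the numbers quoted above:

```python
# Sanity-check the throughput figures quoted above.
reads_per_run = 6_000_000_000    # 6 billion reads per roughly week-long run
read_length = 100                # nucleotides per read
human_genome = 3_000_000_000     # nucleotides in a human genome

total_nucleotides = reads_per_run * read_length
print(f"total sequenced per run: {total_nucleotides:.2e} nucleotides")

# If all of that output came from a single human genome, the average depth
# of coverage (times each position is read) would be:
coverage = total_nucleotides / human_genome
print(f"average coverage of one genome: {coverage:.0f}x")
```

Spreading that output across a half dozen genomes, as mentioned earlier, still covers each one roughly 30 times over, which is why a single run can handle several samples at once.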
Thus our first daunting task upon receiving the reads is to stitch them together into longer, more interpretable units, such as genes. For an organism that has never been fully sequenced before, like the pine tree, it’s a massive challenge to assemble the genome from scratch, or de novo.
How can we assemble a genome for the first time if we have no knowledge of what the finished product should look like? Imagine taking 100 copies of the Charles Dickens novel A Tale of Two Cities and dropping them all into a paper shredder, yielding a huge number of snippets the size of fortune-cookie slips. The first step to reassembling the novel would be to find snippets that overlap: “It was the best” and “the best of times,” for example. A de novo assembly algorithm for DNA data does something analogous. It finds reads whose sequences “overlap” and records those overlaps in a huge diagram called an assembly graph. For a large genome, this graph can occupy many terabytes of RAM, and completing the genome sequence can require weeks or months of computation on a world-class supercomputer.
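A production assembler is vastly more sophisticated, but the core idea of finding suffix-to-prefix overlaps between reads, recording them as a graph, and merging them fits in a short sketch. The reads, the minimum-overlap threshold, and the greedy merging strategy below are illustrative choices, not a description of any particular assembler.

```python
# Toy de novo assembly: find suffix-prefix overlaps between reads, record
# them as edges of an overlap graph, then greedily merge the best overlaps.
# Real assemblers handle sequencing errors, repeats, and billions of reads;
# this sketch only shows the core idea.

def overlap(a: str, b: str, min_len: int) -> int:
    """Length of the longest suffix of `a` that is a prefix of `b`."""
    for length in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:length]):
            return length
    return 0

def build_overlap_graph(reads, min_len=3):
    """Edges (i, j, overlap_length) for every overlapping pair of reads."""
    edges = []
    for i, a in enumerate(reads):
        for j, b in enumerate(reads):
            if i != j:
                olen = overlap(a, b, min_len)
                if olen:
                    edges.append((i, j, olen))
    return edges

def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of reads with the largest overlap."""
    reads = list(reads)
    while True:
        best = max(
            ((i, j, overlap(a, b, min_len))
             for i, a in enumerate(reads)
             for j, b in enumerate(reads) if i != j),
            key=lambda edge: edge[2], default=(0, 0, 0))
        i, j, olen = best
        if olen < min_len:
            return reads
        merged = reads[i] + reads[j][olen:]
        reads = [r for k, r in enumerate(reads) if k not in (i, j)] + [merged]

# Shredded pieces of a short, made-up sequence:
reads = ["ACGTTGCA", "TGCATTAC", "TTACGGAT", "GGATCCAA"]
print(build_overlap_graph(reads, min_len=4))
print(greedy_assemble(reads, min_len=4))
```

On the four toy reads above, the graph contains a single chain of overlaps and the greedy merge recovers one contiguous sequence; on billions of error-prone reads from a repetitive genome, the same kind of graph balloons to the terabyte scale described above.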