Sep 1, 2013 (Vol. 33, No. 15)
Data Analysis: Today’s Next-Gen Sequencing Imperative
From our September 1 issue: Scientists are finding ways to decipher vast NGS datasets to get the information they need to transform medicine.
Without question, next-generation sequencing (NGS) has been immensely successful, even if it hasn’t transformed medicine just yet. Sequencing itself has become nearly trivial, and declining costs have made it widely accessible. More recently, the rise of affordable benchtop NGS instruments has promised to democratize the technology further, extending it from large sequencing centers into smaller labs and clinics.
Today, the bigger challenge is deciphering vast NGS datasets spanning wide-ranging data types—RNA-Seq, ChIP-Seq, exome sequencing, etc.—to inform biomedical questions ranging from evolutionary heritage to the functioning of cellular molecular machinery. NGS data analysis was among the many topics discussed at last month’s “NGX: Applying Next-Generation Sequencing” meeting.
Integrating various NGS data types into networks that are both manageable in size and likely to be true was the core of a talk from MIT’s Ernest Fraenkel, Ph.D., associate professor of biological engineering. Interpreting high-throughput data, he noted, can be like reading “The Hitchhiker’s Guide to the Galaxy,” in which the answer to life, the universe, and everything turns out to be the number 42.
“Suddenly you realize you didn’t understand what the question was,” Dr. Fraenkel said. “That’s often true of high-throughput data. We get different answers and we don’t know what they mean. Our integrative approach is to try to discover a biological process that gives rise to the experimental data we detect.”
Two recent papers describe his approach, which does not rely on published literature or traditional pathway analysis. Rather, the method uses only physical interactions. The basic idea is to connect those interactions into true biological networks of manageable size. Dr. Fraenkel uses a graph-based Prize-Collecting Steiner Tree (PCST) algorithm to build the networks.
Drawing upon roughly a quarter-million experimentally reported physical interactions, together with the data from a given NGS experiment, this PCST-based method identifies highly probable networks. Every interaction—including the quarter of a million he’s already gathered—is given a probability based on reliability factors such as the experimental method used and the number of times it has been reported. The number of possible network connections is huge. PCST winnows them down.
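As a rough illustration, a weighting scheme of this kind might look like the Python sketch below. The method-specific base weights and the noisy-OR-style aggregation are invented for illustration; they are not the Fraenkel lab’s actual scoring model.

```python
# Hypothetical confidence model (illustration only; not the published
# method). An interaction's reliability grows with independent reports,
# starting from a method-specific base weight.
METHOD_WEIGHTS = {
    "yeast_two_hybrid": 0.4,        # high-throughput but noisy
    "co_immunoprecipitation": 0.7,
    "co_crystal_structure": 0.95,   # direct structural evidence
}

def interaction_confidence(method: str, times_reported: int) -> float:
    base = METHOD_WEIGHTS.get(method, 0.3)  # default for unlisted methods
    # Noisy-OR: each independent report is another chance to be right.
    return 1.0 - (1.0 - base) ** times_reported
```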
“You collect as many prizes—high confidence interactions—as you can in the final network,” he said. “But if you just do that, you still get a hairball”—a dense, unusable network. So, “you tell it one more thing. You say every time you use an edge to connect something, you have to pay a price for the edge, and the price goes up the less reliable it is,” Dr. Fraenkel continued. “High-confidence edges are cheap; low-confidence edges are expensive. You ask it to collect as many prizes as possible while paying as little as possible for those edges. That forces the algorithm to decide whether or not it’s worth using a whole chain of connections to reach other data points—that entire chain has to be really high confidence.”
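A small greedy heuristic can make the prize-versus-cost trade-off concrete. The sketch below is only an illustration of the prize-collecting idea—the published method solves the PCST optimization itself—and all names and values in it are assumptions.

```python
# A minimal greedy sketch of the prize-and-cost trade-off (illustrative
# only). High-confidence interactions are cheap edges; experimental hits
# carry prizes that reward their inclusion in the final network.
import networkx as nx

def greedy_prize_collecting_tree(interactions, prizes):
    """interactions: iterable of (u, v, confidence) with confidence in (0, 1].
    prizes: dict mapping experimentally implicated nodes to prize values."""
    G = nx.Graph()
    for u, v, conf in interactions:
        G.add_edge(u, v, cost=1.0 - conf)  # unreliable edges cost more

    seeds = [n for n in prizes if n in G]
    if not seeds:
        return nx.Graph()
    tree = nx.Graph()
    tree.add_node(max(seeds, key=prizes.get))  # start at the biggest prize

    candidates = {n for n in seeds if n not in tree}
    while candidates:
        best = None
        for node in candidates:
            try:
                # Cheapest connection from the current tree to this node.
                cost, path = nx.multi_source_dijkstra(
                    G, set(tree.nodes), target=node, weight="cost")
            except nx.NetworkXNoPath:
                continue
            gain = prizes[node] - cost  # prize collected minus edges paid
            if gain > 0 and (best is None or gain > best[0]):
                best = (gain, path)
        if best is None:
            break  # no remaining prize is worth its connection cost
        nx.add_path(tree, best[1])
        candidates -= set(best[1])
    return tree
```

The design point Dr. Fraenkel describes is visible in the gain test: a node is connected only when its prize outweighs the cumulative cost of the cheapest—that is, highest-confidence—chain of edges reaching it.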
Dr. Fraenkel and his colleagues have set up a website with links to several tools, including their PCST tool. “You can upload a list of genes and press a button, and it sends an email back when it’s solved the problem,” he said.
Deciphering Sample Heterogeneity
Sample heterogeneity is a significant challenge when analyzing NGS data. The problem is well captured in a 2011 PLOS ONE paper from conference speaker Ting Gong, Ph.D., assistant professor of molecular carcinogenesis at the MD Anderson Cancer Center, and her colleagues.

“RNA prepared from heterogeneous tissue samples might contain only a fraction of the total cell subpopulation of interest. Consequently, the expression signal of any gene detected directly from a complex sample is a convolution of expressions of all present cell types,” Dr. Gong and her co-authors wrote. “If tissues or cells are used without consideration of such a mixing phenomenon, measurement of differential gene expression will certainly be confounded by the heterogeneous cell populations.”

She has since extended a version of her computational method, first developed for microarrays, to NGS data. Speaking at “NGX,” Dr. Gong discussed a statistical pipeline for distinguishing heterogeneous tissues and cell types based on RNA-Seq data. The method generates gene signatures from “pure” samples—the training data—and applies those signatures to estimate the mixing fractions of complex samples.

“We tested our methods on several well-controlled benchmark datasets with known mixing fractions of pure cell or tissue types and mRNA expression profiling data from samples collected in a clinical trial. Accurate agreement between predicted and actual mixing fractions was observed,” Dr. Gong noted in a published excerpt from the study. “In addition, our method was able to predict mixing fractions for more than ten species of circulating cells and to provide accurate estimates for relatively rare cell types.”

Dr. Gong’s group has released an open-source R software package, dubbed DeconRNASeq, for other researchers to use.
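The core deconvolution idea can be sketched compactly: treat a mixed sample’s expression vector as the product of a signature matrix (one column per pure cell type) and an unknown vector of mixing fractions, then recover those fractions by constrained least squares. The Python below is an illustrative analogue with made-up toy data, not the DeconRNASeq package itself, which formulates the problem as a quadratic program in R.

```python
# Illustrative deconvolution sketch (not DeconRNASeq's actual code).
# Model: mixed_expression ≈ signatures @ fractions, fractions >= 0.
import numpy as np
from scipy.optimize import nnls

def estimate_mixing_fractions(signatures, mixed_expression):
    """signatures: (n_genes, n_cell_types) array from 'pure' training samples.
    mixed_expression: (n_genes,) expression vector from the complex sample."""
    # Non-negative least squares enforces fractions >= 0; renormalizing
    # approximates the sum-to-one constraint (DeconRNASeq enforces both
    # simultaneously via quadratic programming).
    fractions, _residual = nnls(signatures, mixed_expression)
    return fractions / fractions.sum()

# Toy example: two cell types, three marker genes, a 70/30 mixture.
signatures = np.array([[10.0, 1.0],
                       [ 2.0, 8.0],
                       [ 5.0, 5.0]])
true_fractions = np.array([0.7, 0.3])
mixed = signatures @ true_fractions
print(estimate_mixing_fractions(signatures, mixed))  # ~ [0.7, 0.3]
```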
A Critical Mass

The New York Genome Center provides both computational work and data storage for collaborating institutions and commercial organizations. [Rita Rose Photography]
Already Delivering Surprises
Next-generation sequencing methods have revealed that foreign RNAs are common in human plasma. [Pacific Northwest Diabetes Research Institute (PNDRI)]