Sep 1, 2013 (Vol. 33, No. 15)
Data Analysis: Today’s Next-Gen Sequencing Imperative
From our September 1 issue: Scientists are finding ways to decipher vast NGS datasets to get the information they need to transform medicine.
Without question, next-generation sequencing (NGS) has been immensely successful, even if it hasn’t transformed medicine just yet. Sequencing itself has become nearly trivial, and declining costs have made it widely accessible. More recently, the rise of affordable benchtop NGS instruments has promised to democratize the technology further, extending it from large sequencing centers into smaller labs and clinics.
Today, the bigger challenge is deciphering vast NGS datasets spanning wide-ranging data types—RNA-Seq, ChIP-Seq, exome sequencing, etc.—to inform biomedical questions ranging from evolutionary heritage to the functioning of cellular molecular machinery. NGS data analysis was among the many topics discussed at last month’s “NGX: Applying Next-Generation Sequencing” meeting.
Integrating various NGS data types into networks that are both manageable in size and likely to be true was the core of a talk from MIT’s Ernest Fraenkel, Ph.D., associate professor of biological engineering. Interpreting high-throughput data, he noted, can be like reading “The Hitchhiker’s Guide to the Galaxy,” in which the answer to life, the universe, and everything turns out to be the number 42.
“Suddenly you realize you didn’t understand what the question was,” Dr. Fraenkel said. “That’s often true of high-throughput data. We get different answers and we don’t know what they mean. Our integrative approach is to try to discover a biological process that gives rise to the experimental data we detect.”
Two recent papers describe his approach, which does not rely on published literature or traditional pathway analysis. Rather, the method uses only physical interactions. The basic idea is to connect those interactions into true biological networks of manageable size. Dr. Fraenkel uses a graph-based Prize-Collecting Steiner Tree (PCST) algorithm to build the networks.
Drawing upon roughly a quarter-million experimentally reported physical interactions, together with the data from a given NGS experiment, this PCST-based method identifies highly probable networks. Every interaction—including the quarter of a million he’s already gathered—is given a probability based on reliability factors such as the experimental method used and the number of times it has been reported. The number of possible network connections is huge. PCST winnows them down.
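As a rough illustration, a weighting scheme of this kind might look like the Python sketch below. The method-specific base weights and the noisy-OR-style aggregation are invented for illustration; they are not the Fraenkel lab’s actual scoring model.

```python
# Hypothetical confidence model (illustration only; not the published
# method). An interaction's reliability grows with independent reports,
# starting from a method-specific base weight.
METHOD_WEIGHTS = {
    "yeast_two_hybrid": 0.4,        # high-throughput but noisy
    "co_immunoprecipitation": 0.7,
    "co_crystal_structure": 0.95,   # direct structural evidence
}

def interaction_confidence(method: str, times_reported: int) -> float:
    base = METHOD_WEIGHTS.get(method, 0.3)  # default for unlisted methods
    # Noisy-OR: each independent report is another chance to be right.
    return 1.0 - (1.0 - base) ** times_reported
```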
“You collect as many prizes—high confidence interactions—as you can in the final network,” he said. “But if you just do that, you still get a hairball”—a dense, unusable network. So, “you tell it one more thing. You say every time you use an edge to connect something, you have to pay a price for the edge, and the price goes up the less reliable it is,” Dr. Fraenkel continued. “High-confidence edges are cheap; low-confidence edges are expensive. You ask it to collect as many prizes as possible while paying as little as possible for those edges. That forces the algorithm to decide whether or not it’s worth using a whole chain of connections to reach other data points—that entire chain has to be really high confidence.”
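A small greedy heuristic can make the prize-versus-cost trade-off concrete. The sketch below is only an illustration of the prize-collecting idea—the published method solves the PCST optimization itself—and all names and values in it are assumptions.

```python
# A minimal greedy sketch of the prize-and-cost trade-off (illustrative
# only). High-confidence interactions are cheap edges; experimental hits
# carry prizes that reward their inclusion in the final network.
import networkx as nx

def greedy_prize_collecting_tree(interactions, prizes):
    """interactions: iterable of (u, v, confidence) with confidence in (0, 1].
    prizes: dict mapping experimentally implicated nodes to prize values."""
    G = nx.Graph()
    for u, v, conf in interactions:
        G.add_edge(u, v, cost=1.0 - conf)  # unreliable edges cost more

    seeds = [n for n in prizes if n in G]
    if not seeds:
        return nx.Graph()
    tree = nx.Graph()
    tree.add_node(max(seeds, key=prizes.get))  # start at the biggest prize

    candidates = {n for n in seeds if n not in tree}
    while candidates:
        best = None
        for node in candidates:
            try:
                # Cheapest connection from the current tree to this node.
                cost, path = nx.multi_source_dijkstra(
                    G, set(tree.nodes), target=node, weight="cost")
            except nx.NetworkXNoPath:
                continue
            gain = prizes[node] - cost  # prize collected minus edges paid
            if gain > 0 and (best is None or gain > best[0]):
                best = (gain, path)
        if best is None:
            break  # no remaining prize is worth its connection cost
        nx.add_path(tree, best[1])
        candidates -= set(best[1])
    return tree
```

The design point Dr. Fraenkel describes is visible in the gain test: a node is connected only when its prize outweighs the cumulative cost of the cheapest—that is, highest-confidence—chain of edges reaching it.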
Dr. Fraenkel and his colleagues have set up a website with links to several tools, including their PCST tool. “You can upload a list of genes and press a button, and it sends an email back when it’s solved the problem,” he said.
Deciphering Sample Heterogeneity
Sample heterogeneity is a significant challenge when analyzing NGS data. The problem is well captured in a 2011 PLOS ONE paper from conference speaker Ting Gong, Ph.D., assistant professor of molecular carcinogenesis at the MD Anderson Cancer Center, and her colleagues.

“RNA prepared from heterogeneous tissue samples might contain only a fraction of the total cell subpopulation of interest. Consequently, the expression signal of any gene detected directly from a complex sample is a convolution of expressions of all present cell types,” Dr. Gong and her co-authors wrote. “If tissues or cells are used without consideration of such a mixing phenomenon, measurement of differential gene expression will certainly be confounded by the heterogeneous cell populations.”

She has since extended a version of her computational method, first developed for microarrays, to NGS data. Speaking at “NGX,” Dr. Gong discussed a statistical pipeline for distinguishing heterogeneous tissues and cell types based on RNA-Seq data. The method generates gene signatures from “pure” samples—the training data—and applies those signatures to estimate the mixing fractions of complex samples.

“We tested our methods on several well-controlled benchmark datasets with known mixing fractions of pure cell or tissue types and mRNA expression profiling data from samples collected in a clinical trial. Accurate agreement between predicted and actual mixing fractions was observed,” Dr. Gong noted in a published excerpt from the study. “In addition, our method was able to predict mixing fractions for more than ten species of circulating cells and to provide accurate estimates for relatively rare cell types.”

Dr. Gong’s group has released an open-source R software package, dubbed DeconRNASeq, for other researchers to use.
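The core deconvolution idea can be sketched compactly: treat a mixed sample’s expression vector as the product of a signature matrix (one column per pure cell type) and an unknown vector of mixing fractions, then recover those fractions by constrained least squares. The Python below is an illustrative analogue with made-up toy data, not the DeconRNASeq package itself, which formulates the problem as a quadratic program in R.

```python
# Illustrative deconvolution sketch (not DeconRNASeq's actual code).
# Model: mixed_expression ≈ signatures @ fractions, fractions >= 0.
import numpy as np
from scipy.optimize import nnls

def estimate_mixing_fractions(signatures, mixed_expression):
    """signatures: (n_genes, n_cell_types) array from 'pure' training samples.
    mixed_expression: (n_genes,) expression vector from the complex sample."""
    # Non-negative least squares enforces fractions >= 0; renormalizing
    # approximates the sum-to-one constraint (DeconRNASeq enforces both
    # simultaneously via quadratic programming).
    fractions, _residual = nnls(signatures, mixed_expression)
    return fractions / fractions.sum()

# Toy example: two cell types, three marker genes, a 70/30 mixture.
signatures = np.array([[10.0, 1.0],
                       [ 2.0, 8.0],
                       [ 5.0, 5.0]])
true_fractions = np.array([0.7, 0.3])
mixed = signatures @ true_fractions
print(estimate_mixing_fractions(signatures, mixed))  # ~ [0.7, 0.3]
```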
A Critical Mass

The New York Genome Center provides both computational work and data storage for collaborating institutions and commercial organizations. [Rita Rose Photography]
Already Delivering Surprises
Next-generation sequencing methods have revealed that foreign RNAs are common in human plasma. [Pacific Northwest Diabetes Research Institute (PNDRI)]