CIENCIASMEDICASNEWS: Computing Reviews, the leading online review service for computing literature.

The Critical Need for Computational Intelligence in Human Genetics

Jason H. Moore
Dartmouth College

1. Introduction

An important goal of human genetics is to identify which genes and which specific DNA sequence variations (polymorphisms) play an important role in determining susceptibility to common diseases such as cancer, essential hypertension, and schizophrenia. Human genetics has been very successful in identifying rare DNA sequence variations (mutations) that predict with virtual certainty whether an individual will develop a rare disease such as cystic fibrosis or sickle cell anemia. However, this same level of success has not been observed for common diseases that represent the majority of the public health and economic burden. This general lack of success is due to the enormous complexity of common diseases, such as cardiovascular disease, that are determined by many genes and many environmental factors that interact in a nonlinear manner [1]. Gene-gene interactions (epistasis) and gene-environment interactions (plastic reaction norms) are important components of the genetic architecture of common diseases and are receiving increased attention as geneticists and epidemiologists move from studying diseases with simple etiologies to diseases with very complex etiologies. I will briefly introduce some of the computational challenges we are facing in human genetics, hopefully as a way to motivate computer scientists to establish meaningful collaborations with geneticists and epidemiologists on the front lines of identifying genetic risk factors for common human diseases. This new domain needs computational attention.

2. Technology

One of the challenges in studying common human diseases has been the general lack of technologies that allow us to measure the entire genome (all 3 × 10⁹ nucleotides) quickly and inexpensively. If susceptibility to disease is determined by many different genes working together in a network, then we need to be able to measure all relevant DNA sequence variations across the human genome. The situation has recently changed with the availability of chip-based technologies that make it feasible to measure a representative set of approximately 10⁶ polymorphisms that, due to the correlation structure in the genome (linkage disequilibrium), captures most of the relevant information. The accessibility of inexpensive chips has ushered in the era of the "genome-wide" or "whole-genome" association study that is expected to revolutionize human genetics [2,3]. Just around the corner are new technologies that will allow us to measure the entire genome of an individual for less than $1,000 [4]. Thus, in the near future, we will have access to all of the genetic information from each subject in a study.

3. Computational Challenges

In addition to the excitement about these new technologies, there are some important concerns [5] and some important challenges [6]. The first challenge that needs to be addressed is the development of powerful computational methods for modeling the relationship between combinations of polymorphisms and disease susceptibility. Characterizing the relationship between multiple interacting polymorphisms and disease susceptibility is much more difficult than assessing each polymorphism individually. Modeling nonlinear interactions between multiple attributes is a well-recognized problem in data mining and knowledge discovery [7].

The second challenge that needs to be overcome is the selection of attributes. Identifying the optimal combination of polymorphisms from an astronomical number of possible combinations by exhaustive search is computationally infeasible, especially when the polymorphisms do not have independent effects. The following example illustrates the computational magnitude of the problem. Let’s assume that 10⁶ polymorphisms have been measured. Let’s also assume that 1,000 computational evaluations can be completed in one second on a single processor and that 1,000 processors are available for use. Exhaustively evaluating all of the approximately 5.0 × 10¹¹ two-way combinations of polymorphisms would require approximately 5.8 days. Exhaustively evaluating all of the approximately 1.7 × 10¹⁷ three-way combinations of polymorphisms would require approximately 1,929,007 days, or roughly 5,300 years. This of course assumes a best-case scenario in which the genetic model of interest consists of only two or three important attributes or genetic variations. The reality is that common human diseases are likely the result of many genetic and environmental factors that interact in a complex manner in the absence of independent effects.
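
The arithmetic behind this example can be checked with a few lines of Python. The sketch below is only an illustration: the function name and the 365.25-day year are assumptions introduced here, while the 10⁶ polymorphisms, 1,000 evaluations per second, and 1,000 processors come from the example itself.

```python
from math import comb

SECONDS_PER_DAY = 86_400
DAYS_PER_YEAR = 365.25  # assumption: average calendar year


def exhaustive_search_days(n_polymorphisms, order,
                           evals_per_sec_per_cpu=1_000, n_cpus=1_000):
    """Return (number of combinations, days to evaluate each one once)."""
    n_combinations = comb(n_polymorphisms, order)
    seconds = n_combinations / (evals_per_sec_per_cpu * n_cpus)
    return n_combinations, seconds / SECONDS_PER_DAY


for order in (2, 3):
    n, days = exhaustive_search_days(10**6, order)
    print(f"{order}-way: {n:.2e} combinations, "
          f"{days:,.1f} days ({days / DAYS_PER_YEAR:,.1f} years)")
```

Under these assumptions the two-way search takes just under six days, while the three-way search takes about 1.9 million days, which is more than five thousand years.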

4. Computational Intelligence Solutions

The combination of concept difficulty and high data dimensionality is motivating the development of new computational methods that are able to embrace, rather than ignore, the complexity of the mapping relationship between the genome and disease outcomes. Many of these new methods fall under the umbrella of computational intelligence and make use of heuristic strategies such as neural networks and evolutionary computing. For example, the novel multifactor dimensionality reduction (MDR) approach was developed specifically to detect and characterize nonlinear interactions between multiple polymorphisms in the absence of independent effects using constructive induction [8]. Evolutionary computing methods that are able to incorporate expert knowledge about the problem domain have been used to select attributes for methods such as MDR in an intelligent manner to overcome the computational limitations described above [9]. Evolutionary computing approaches such as genetic algorithms and genetic programming have been developed over many years as automated discovery tools that evolve a solution to a problem by generating many solutions, picking the best ones, and generating variability in those solutions by exchanging or recombining pieces of solutions. A positive aspect of evolutionary computing in the domain of human genetics is its ability to search the solution space in parallel. However, success depends on the ability of the algorithm to assign a high fitness or quality to a given solution when only part of the best solution is present. These easily recognizable partial solutions are called building blocks. The problem with modeling nonlinear interactions is that good building blocks are not always apparent to the algorithms because partial solutions do not look any different from the background noise in the data. This is what Goldberg [10] and others have called a needle-in-a-haystack problem.
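
The building-block problem can be made concrete with a toy data set. The sketch below is a simplified illustration and not the published MDR procedure [8]: it pools genotype combinations into high-risk and low-risk groups by comparing case and control counts, but it omits cross-validation and everything else a real analysis would need. The attribute names, the binary genotype coding, and the XOR disease model are all assumptions made for the example.

```python
from collections import Counter
from itertools import combinations, product

NAMES = ("snp_a", "snp_b", "snp_c")  # hypothetical attribute names

# Toy data: every binary genotype combination, repeated 25 times (200 subjects).
# Disease status is the XOR of snp_a and snp_b; snp_c is pure noise.
rows = [dict(zip(NAMES, g)) for g in product((0, 1), repeat=3)] * 25
labels = [row["snp_a"] ^ row["snp_b"] for row in rows]


def pooled_accuracy(rows, labels, attrs):
    """Label each genotype cell high or low risk and score the fit."""
    cases, controls = Counter(), Counter()
    for row, label in zip(rows, labels):
        cell = tuple(row[a] for a in attrs)
        (cases if label == 1 else controls)[cell] += 1
    correct = 0
    for cell in set(cases) | set(controls):
        # A cell is high risk when its cases outnumber its controls.
        high_risk = cases[cell] > controls[cell]
        correct += cases[cell] if high_risk else controls[cell]
    return correct / len(rows)


for k in (1, 2):
    for attrs in combinations(NAMES, k):
        print(attrs, pooled_accuracy(rows, labels, attrs))
```

In this toy model each attribute on its own, including the two that jointly determine disease status, scores no better than chance, and only the pair snp_a and snp_b scores perfectly. A search that rewards partial solutions therefore has nothing to climb toward.
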
The challenge for an evolutionary computing algorithm, or any other search algorithm, is to find the genetic needle in an astronomical genome-wide haystack. In a recent publication [9], I described how expert knowledge from prior computational analyses or what is known about the biology of the genes being studied can be used to help guide stochastic search algorithms. These methods are examples of how domain knowledge can be used to confront the computational complexity of the problem. However, these approaches represent a starting point and not a mature solution to the problem. Further work in this area is needed.
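
One simple way to picture knowledge-guided search is to bias which attribute combinations a stochastic search tries first. The sketch below is a weighted random search, not a genetic algorithm and not the method described in [9]. The weights stand in for expert knowledge scores, and the attribute names, the weights, and the toy fitness function are all hypothetical.

```python
import random


def guided_pair_search(weights, fitness, n_draws=200, seed=0):
    """Sample attribute pairs in proportion to their weights; keep the best."""
    rng = random.Random(seed)
    names = list(weights)
    w = [weights[name] for name in names]
    best_pair, best_score = None, float("-inf")
    for _ in range(n_draws):
        first, second = rng.choices(names, weights=w, k=2)
        if first == second:
            continue  # a pair needs two distinct attributes
        pair = tuple(sorted((first, second)))
        score = fitness(pair)
        if score > best_score:
            best_pair, best_score = pair, score
    return best_pair, best_score


# Hypothetical expert-knowledge scores: higher means "more likely relevant".
weights = {"snp_a": 5.0, "snp_b": 5.0, "snp_c": 1.0, "snp_d": 1.0}


def toy_fitness(pair):
    """Stand-in for a real model evaluation; only one pair is informative."""
    return 1.0 if pair == ("snp_a", "snp_b") else 0.5


print(guided_pair_search(weights, toy_fitness))
```

With informative weights the interacting pair is drawn often, so a modest number of evaluations suffices. With uniform weights over 10⁶ attributes the same search would rarely draw the right pair at all, which is the haystack problem in miniature.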

5. The Future

As human genetics shifts away from a gene-centric research paradigm to one that is dominated by considering the genome and its three billion nucleotides in its entirety, there will be an increasing demand for methods and software from the field of computer science that recognize and embrace the complexity and size of the problem. Numerous opportunities exist in this domain and other closely related domains. Consider, for example, the newly established Human Microbiome Project [11,12], which will help researchers understand the diversity of microbes that exist in each human and how that microbial diversity relates to health and disease. An important goal of this project is to sample the microbial flora of different human subjects and then use high-throughput DNA sequencing methods to determine exactly which bacteria are present and how they are related to one another. This is a daunting computational challenge because there are currently multiple genomes from multiple different organisms that differ from human subject to human subject that need to be mined for patterns that predict health and disease. This is a hot new field that, like the other challenges discussed above, greatly needs the input of computer scientists.

In order to make an impact in the emerging area of understanding how the human genome and the genomes of symbiotic organisms relate to human biology, computer scientists on the cutting edge of artificial intelligence, machine learning, and data mining must collaborate with geneticists. The time is now to develop the algorithms and software that will be needed in several years to make sense of entire genomes of information measured in hundreds or thousands of subjects in population-based studies of human disease susceptibility. Only then will human genetics be able to extract the knowledge necessary to impact public health by translating research findings into new treatment and prevention strategies.

Table 1: Human genetics terminology

TERM	DEFINITION
Allele	A state defined by an individual nucleotide or combination of nucleotides on one of the two heritable DNA strands in the genome.
Association study	A common study design in genetics and epidemiology. The goal is to identify genetic and environmental risk factors by comparing the frequency of a particular factor between randomly sampled subjects with disease (cases) and randomly sampled subjects without the disease (controls).
Epistasis	The literal translation is "standing upon." In other words, epistasis is used to describe the effect of one gene standing upon or influencing the effect of one or more other genes. The distinction can be made between biological epistasis that describes physical interactions among biological molecules (for example, proteins) and statistical epistasis that describes deviations from linearity in a statistical model relating variability in multiple genes to variability in a disease outcome in a human population. Also called gene-gene interaction.
Genome	All genetic information encoded in DNA that is passed on to offspring. The human genome is comprised of approximately three billion individual nucleotides or chemical bases that contain approximately 20,000 protein-coding genes.
Genotype	A state defined by the two alleles at one specific location along the DNA.
Linkage disequilibrium	Used to measure the correlation between alleles at different locations along the DNA strand.
Mutation	An allele or genotype that is rare in the population (<1% frequency). Often, a change in the DNA caused by a chemical agent or exposure to radiation. Rare diseases such as cystic fibrosis are often caused by single mutations in a gene.
Phenotype	The biological state of an organism. Height, weight, and blood pressure are examples.
Polymorphism	An allele or genotype that is common in the population (≥ 1% frequency).
Reaction norm	The variability in phenotype (plasticity) that is possible for a single genotype across different environmental exposures. Also called gene-environment interaction.

Created: March 03 2008
Last updated: Feb 06 2013


1)	Sing, C.F., Stengård, J.H., Kardia, S.L. Genes, environment, and cardiovascular disease. Arteriosclerosis, Thrombosis, and Vascular Biology 23, 7 (2003) 1190-1196.
2)	Hirschhorn, J.N., Daly, M.J. Genome-wide association studies for common diseases and complex traits. Nature Reviews Genetics 6, 2 (2005) 95-108.
3)	Wang, W.Y., Barratt, B.J., Clayton, D.G., Todd, J.A. Genome-wide association studies: theoretical and practical concerns. Nature Reviews Genetics 6, 2 (2005) 109-118.
4)	Service, R.F. Gene sequencing. The race for the $1000 genome. Science 311, 5767 (2006) 1544-1546.
5)	Williams, S.M., Canter, J.A., Crawford, D.C., Moore, J.H., Ritchie, M.D., Haines, J.L. Problems with genome-wide association studies. Science 316, 5833 (2007) 1840-1842.
6)	Moore, J.H., Ritchie, M.D. The challenges of whole-genome approaches to common diseases. JAMA 291, 13 (2004) 1642-1643.
7)	Freitas, A. Understanding the crucial role of attribute interaction in data mining. Artificial Intelligence Review 16, 3 (2001) 177-199.
8)	Ritchie, M.D., Hahn, L.W., Roodi, N., Bailey, L.R., Dupont, W.D., Parl, F.F., Moore, J.H. Multifactor-dimensionality reduction reveals high-order interactions among estrogen-metabolism genes in sporadic breast cancer. The American Journal of Human Genetics 69, 1 (2001) 138-147.
9)	Moore, J.H. Genome-wide analysis of epistasis using multifactor dimensionality reduction: feature selection and construction in the domain of human genetics. In Knowledge discovery and data mining: challenges and realities, IGI Press (2007) 17-30.
10)	Goldberg, D.E. The design of innovation. Springer, 2002.
11)	http://nihroadmap.nih.gov/hmp/
12)	Turnbaugh, P.J., Ley, R.E., Hamady, M., Fraser-Liggett, C.M., Knight, R., Gordon, J.I. The human microbiome project. Nature 449 (2007) 804-810.