The Critical Need for Computational Intelligence in Human Genetics
Jason H. Moore
An important goal of human genetics is to identify which genes and which specific DNA sequence variations (polymorphisms) play an important role in determining susceptibility to common diseases such as cancer, essential hypertension, and schizophrenia. Human genetics has been very successful in identifying rare DNA sequence variations (mutations) that predict with virtual certainty whether an individual will develop a rare disease such as cystic fibrosis or sickle cell anemia. However, the same level of success has not been observed for common diseases, which account for the majority of the public health and economic burden. This general lack of success is due to the enormous complexity of common diseases such as cardiovascular disease, which are determined by many genes and many environmental factors interacting in a nonlinear manner. Gene-gene interactions (epistasis) and gene-environment interactions (plastic reaction norms) are important components of the genetic architecture of common diseases and are receiving increased attention as geneticists and epidemiologists move from studying diseases with simple etiologies to diseases with very complex etiologies. Here I briefly introduce some of the computational challenges we face in human genetics, in the hope of motivating computer scientists to establish meaningful collaborations with geneticists and epidemiologists on the front lines of identifying genetic risk factors for common human diseases. This new domain needs computational attention.
One of the challenges in studying common human diseases has been the general lack of technologies that allow us to measure the entire genome (all 3 × 10^9 nucleotides) quickly and inexpensively. If susceptibility to disease is determined by many different genes working together in a network, then we need to be able to measure all relevant DNA sequence variations across the human genome. This has recently changed with the availability of chip-based technologies that make it feasible to measure a representative set of approximately 10^6 polymorphisms that, due to the correlation structure in the genome (linkage disequilibrium), captures most of the relevant information. The accessibility of inexpensive chips has ushered in the era of the "genome-wide" or "whole-genome" association study that is expected to revolutionize human genetics [2,3]. Just around the corner are new technologies that will allow us to measure the entire genome of an individual for less than $1,000. Thus, in the near future, we will have access to all of the genetic information from each subject in the study.
3. Computational Challenges
In addition to the excitement about these new technologies, there are some important concerns and some important challenges. The first challenge that needs to be addressed is the development of powerful computational methods for modeling the relationship between combinations of polymorphisms and disease susceptibility. Characterizing the relationship between multiple interacting polymorphisms and disease susceptibility is much more difficult than assessing each polymorphism individually. Modeling nonlinear interactions between multiple attributes is a well-recognized problem in data mining and knowledge discovery.
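The difficulty can be made concrete with a classic two-locus epistasis model. In the sketch below (illustrative penetrance values, not from the article), disease risk depends entirely on the combination of genotypes at two loci, so each polymorphism examined on its own appears to carry no information at all:

```python
# Toy two-locus XOR-style penetrance model: disease risk depends on the
# combination of genotypes at loci A and B, yet each locus alone shows no
# marginal effect. Values are illustrative, chosen so that the marginals
# are flat under Hardy-Weinberg proportions with allele frequency 0.5.

# Genotype frequencies for a biallelic locus (AA, Aa, aa) at p = 0.5
freq = [0.25, 0.50, 0.25]

# Penetrance table: P(disease | genotype_A, genotype_B)
penetrance = [
    [0.0, 0.1, 0.0],
    [0.1, 0.0, 0.1],
    [0.0, 0.1, 0.0],
]

def marginal(locus):
    """Penetrance of each genotype at one locus, averaging over the other."""
    if locus == 0:
        return [sum(freq[b] * penetrance[a][b] for b in range(3)) for a in range(3)]
    return [sum(freq[a] * penetrance[a][b] for a in range(3)) for b in range(3)]

# Both marginals come out flat (0.05 for every genotype), so any method
# that tests one polymorphism at a time would rank both loci as useless.
print([round(x, 4) for x in marginal(0)])
print([round(x, 4) for x in marginal(1)])
```

This is exactly why assessing each polymorphism individually can miss the signal entirely: the effect exists only in the joint distribution.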
The second challenge that needs to be overcome is attribute selection. Exhaustively identifying the optimal combination of polymorphisms from an astronomical number of possible combinations is computationally infeasible, especially when the polymorphisms do not have independent effects. The following example illustrates the magnitude of the problem. Assume that 10^6 polymorphisms have been measured, that 1,000 computational evaluations can be completed in one second on a single processor, and that 1,000 processors are available for use. Exhaustively evaluating all of the approximately 5.0 × 10^11 two-way combinations of polymorphisms would then require approximately 5.8 days. Exhaustively evaluating all of the approximately 1.7 × 10^17 three-way combinations would require approximately 1,929,007 days, or more than 5,000 years. This of course assumes a best-case scenario in which the genetic model of interest consists of only two or three important attributes or genetic variations. The reality is that common human diseases are likely the result of many genetic and environmental factors that interact in a complex manner in the absence of independent effects.
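A quick back-of-envelope script (a sketch assuming the rates stated above: 10^6 polymorphisms and 10^6 evaluations per second in aggregate) makes the scale concrete:

```python
# Back-of-envelope check of the exhaustive-search arithmetic, assuming
# 10^6 measured polymorphisms and 1,000 evaluations/second on each of
# 1,000 processors (10^6 evaluations/second overall).
from math import comb

n = 10**6
evals_per_second = 1_000 * 1_000

two_way = comb(n, 2)    # ~5.0 * 10^11 unordered pairs
three_way = comb(n, 3)  # ~1.7 * 10^17 unordered triples

days_2 = two_way / evals_per_second / 86_400
days_3 = three_way / evals_per_second / 86_400

print(f"{two_way:.1e} pairs   -> {days_2:,.1f} days")
print(f"{three_way:.1e} triples -> {days_3 / 365.25:,.0f} years")
```

The jump from days to millennia between two-way and three-way models is the crux of the attribute-selection problem: each additional interacting attribute multiplies the search space by roughly another factor of n.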
4. Computational Intelligence Solutions
The combination of concept difficulty and high data dimensionality is motivating the development of new computational methods that are able to embrace, rather than ignore, the complexity of the mapping relationship between the genome and disease outcomes. Many of these new methods fall under the umbrella of computational intelligence and make use of heuristic strategies such as neural networks and evolutionary computing. For example, the novel multifactor dimensionality reduction (MDR) approach was developed specifically to detect and characterize nonlinear interactions between multiple polymorphisms in the absence of independent effects using constructive induction. Evolutionary computing methods that are able to incorporate expert knowledge about the problem domain have been used to select attributes for methods such as MDR in an intelligent manner, thereby addressing the computational limitations described above. Evolutionary computing approaches such as genetic algorithms and genetic programming have been developed over many years as automated discovery tools that evolve a solution to a problem by generating many candidate solutions, selecting the best ones, and introducing variability by exchanging or recombining pieces of those solutions. A positive aspect of evolutionary computing in the domain of human genetics is its ability to search the solution space in parallel. However, success depends on the ability of the algorithm to assign a high fitness or quality to a candidate solution when only part of the best solution is present. These easily recognizable partial solutions are called building blocks. The problem with modeling nonlinear interactions is that good building blocks are not always apparent to the algorithm, because partial solutions do not look any different from the background noise in the data. This is what Goldberg and others have called a needle-in-a-haystack problem.
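As a rough illustration of the generate/select/recombine loop described above, the following sketch applies a toy genetic algorithm to attribute selection on synthetic data containing one interacting SNP pair. The dataset, fitness function, and parameters are all invented for illustration; the crude per-cell majority classifier only loosely mimics what a method like MDR does, and whether this particular run recovers the causal pair depends on the random seed.

```python
# Toy genetic algorithm for attribute (SNP pair) selection on synthetic
# data with an XOR-like interacting pair. All settings are illustrative.
import random

random.seed(0)

N_SNPS, N_SAMPLES = 20, 400
CAUSAL = (3, 7)  # the interacting pair the search should ideally recover

# Synthetic genotypes (0/1/2); case status depends on the CAUSAL pair
# jointly (parity of the genotype sum), not on either SNP alone.
genotypes = [[random.randint(0, 2) for _ in range(N_SNPS)] for _ in range(N_SAMPLES)]
status = [(g[CAUSAL[0]] + g[CAUSAL[1]]) % 2 for g in genotypes]

def fitness(pair):
    """Accuracy of predicting status from the majority class of each
    two-locus genotype combination (a crude stand-in for MDR)."""
    a, b = pair
    cells = {}
    for g, s in zip(genotypes, status):
        cells.setdefault((g[a], g[b]), []).append(s)
    correct = sum(max(v.count(0), v.count(1)) for v in cells.values())
    return correct / N_SAMPLES

def evolve(generations=30, pop_size=50):
    """Generate many candidate pairs, keep the best, recombine and mutate."""
    pop = [tuple(random.sample(range(N_SNPS), 2)) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        survivors = pop[: pop_size // 2]          # truncation selection
        children = []
        for _ in range(pop_size - len(survivors)):
            p1, p2 = random.sample(survivors, 2)
            child = [p1[0], p2[1]]                # recombine pieces of two parents
            if random.random() < 0.3:             # mutate one attribute index
                child[random.randint(0, 1)] = random.randrange(N_SNPS)
            if child[0] == child[1]:              # keep the two indices distinct
                child[1] = (child[1] + 1) % N_SNPS
            children.append(tuple(child))
        pop = survivors + children
    return max(pop, key=fitness)

best = evolve()
print(sorted(best), fitness(best))
```

Note that this synthetic problem is deliberately friendly to the algorithm: a pair containing just one causal SNP already scores above background, so good building blocks are visible. In the genome-wide case described in the text, that is precisely what cannot be assumed.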
The challenge for an evolutionary computing algorithm, or any other search algorithm, is to find the genetic needle in an astronomical genome-wide haystack. In a recent publication, I described how expert knowledge from prior computational analyses, or from what is known about the biology of the genes being studied, can be used to help guide stochastic search algorithms. These methods are examples of how domain knowledge can be used to confront the computational complexity of the problem. However, these approaches represent a starting point rather than a mature solution, and further work in this area is needed.
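One simple way to inject expert knowledge into a stochastic search is sketched below: attributes are proposed in proportion to a prior score rather than uniformly. The prior scores here are made up; in practice they might come from a statistical pre-processing filter or from biological annotation.

```python
# Knowledge-guided attribute sampling: SNPs with higher (hypothetical)
# expert scores are proposed more often than under uniform sampling.
import random

random.seed(1)

snps = [f"snp{i}" for i in range(10)]
prior = [0.5, 0.5, 0.5, 5.0, 0.5, 0.5, 0.5, 5.0, 0.5, 0.5]  # invented scores

picks = random.choices(snps, weights=prior, k=10_000)

# A favored SNP is drawn ~5/14 of the time instead of the uniform 1/10.
print(picks.count("snp3") / len(picks))
print(picks.count("snp0") / len(picks))
```

The same weighting idea can steer initialization or mutation in an evolutionary algorithm, concentrating the search on regions of the haystack that prior evidence suggests are worth examining.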
5. The Future
As human genetics shifts away from a gene-centric research paradigm toward one that considers the genome and its three billion nucleotides in their entirety, there will be an increasing demand for methods and software from the field of computer science that recognize and embrace the complexity and size of the problem. Numerous opportunities exist in this domain and in closely related ones. Consider, for example, the newly established Human Microbiome Project [11,12], which will help researchers understand the diversity of microbes that exist in each human and how that microbial diversity relates to health and disease. An important goal of this project is to sample the microbial flora of different human subjects and then use high-throughput DNA sequencing to determine exactly which bacteria are present and how they are related to one another. This is a daunting computational challenge: each human subject harbors multiple genomes from multiple different organisms, these communities differ from subject to subject, and all of this sequence data must be mined for patterns that predict health and disease. This is a hot new field that, like the other challenges discussed above, greatly needs the input of computer scientists.
To make an impact in the emerging area of understanding how the human genome and the genomes of symbiotic organisms relate to human biology, computer scientists on the cutting edge of artificial intelligence, machine learning, and data mining must collaborate with geneticists. Now is the time to develop the algorithms and software that will be needed in several years to make sense of entire genomes of information measured in hundreds or thousands of subjects in population-based studies of human disease susceptibility. Only then will human genetics be able to extract the knowledge necessary to impact public health by translating research findings into new treatment and prevention strategies.
Table 1: Human genetics terminology