Chapter 13: Mining Electronic Health Records in the Genomics Era

Joshua C. Denny

Abstract

The combination of improved genomic analysis methods, decreasing genotyping costs, and increasing computing resources has led to an explosion of clinical genomic knowledge in the last decade. Similarly, healthcare systems are increasingly adopting robust electronic health record (EHR) systems that not only can improve health care, but also contain a vast repository of disease and treatment data that could be mined for genomic research. Indeed, institutions are creating EHR-linked DNA biobanks to enable genomic and pharmacogenomic research, using EHR data for phenotypic information. However, EHRs are designed primarily for clinical care, not research, so reuse of clinical EHR data for research purposes can be challenging. Difficulties in using EHR data include data availability, missing data, incorrect data, and vast quantities of unstructured narrative text. Structured information includes billing codes, most laboratory reports, and other variables such as physiologic measurements and demographic information. Significant information, however, remains locked within EHR narrative text documents, including clinical notes and certain categories of test results, such as pathology and radiology reports. For relatively rare observations, combinations of simple free-text searches and billing codes may prove adequate when followed by manual chart review. However, to extract the large cohorts necessary for genome-wide association studies, natural language processing methods to process narrative text data may be needed. Combinations of structured and unstructured textual data can be mined to generate high-validity collections of cases and controls for a given condition. Once high-quality cases and controls are identified, EHR-derived cases can be used for genomic discovery and validation. Since EHR data includes a broad sampling of clinically relevant phenotypic information, it may enable multiple genomic investigations upon a single set of genotyped individuals. This chapter reviews several examples of phenotype extraction and their application to genetic research, demonstrating a viable future for genomic discovery using EHR-linked data.

Citation: Denny JC (2012) Chapter 13: Mining Electronic Health Records in the Genomics Era. PLoS Comput Biol 8(12): e1002823. doi:10.1371/journal.pcbi.1002823
Editors: Fran Lewitter (Whitehead Institute, United States of America) and Maricel Kann (University of Maryland, Baltimore County, United States of America)
Published: December 27, 2012
Copyright: © 2012 Joshua C. Denny. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Funding: This article was supported in part by grants from the National Library of Medicine R01 LM 010685 and the National Human Genome Research Institute U01 HG004603. The funders had no role in the preparation of the manuscript.
Competing interests: The author has declared that no competing interests exist.
* E-mail: josh.denny@vanderbilt.edu

What to Learn in This Chapter

Describe the types of information available in Electronic Health Records (EHRs), and the relative sensitivity and positive predictive value of each
Describe the difference between unstructured and structured information in the EHR
Describe methods for developing accurate phenotype algorithms that integrate structured and unstructured EHR information, and the roles played by billing codes, laboratory values, medication data, and natural language processing
Describe recent uses of EHR-derived phenotypes to study genome-phenome relationships
Describe the cost advantages unique to EHR-linked biobanks, and the ability to reuse genetic data for many studies
Understand the role of EHRs in enabling phenome-wide association studies of genetic variants
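
The sensitivity and positive predictive value (PPV) named in the first objective can be made concrete with a small sketch. The Python below is illustrative only: the function name `evaluate_phenotype`, the patient identifiers, and the toy labels are assumptions introduced here, not code or data from this chapter. It compares the patients flagged by a hypothetical phenotype algorithm against a manual chart-review gold standard.

```python
def evaluate_phenotype(flagged, true_cases):
    """Return (sensitivity, ppv) for a phenotype algorithm.

    flagged: patient IDs the algorithm labels as cases.
    true_cases: patient IDs confirmed as cases by manual chart review.
    """
    flagged, true_cases = set(flagged), set(true_cases)
    true_positives = len(flagged & true_cases)
    # Sensitivity (recall): fraction of true cases that the algorithm found.
    sensitivity = true_positives / len(true_cases) if true_cases else float("nan")
    # PPV (precision): fraction of flagged patients who are true cases.
    ppv = true_positives / len(flagged) if flagged else float("nan")
    return sensitivity, ppv


# Hypothetical toy data: chart review confirms four cases.
chart_review_cases = {"p1", "p2", "p3", "p4"}
# A hypothetical billing-code-only rule flags five patients, three correctly.
billing_code_flags = {"p1", "p2", "p3", "p7", "p8"}

sensitivity, ppv = evaluate_phenotype(billing_code_flags, chart_review_cases)
print(f"sensitivity={sensitivity:.2f}, ppv={ppv:.2f}")
# prints "sensitivity=0.75, ppv=0.60"
```

Applying the same function to each data source in turn (billing codes, laboratory values, medication data, natural language processing output) is one possible way to compare their relative sensitivity and PPV.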