Natural Language Processing for Cancer Research Network: Surveillance Studies

This application addresses Broad Challenge Area: (10) Information Technology for Processing Health Care Data and specific Challenge Topic: 10-CA-107 Expand Spectrum of Cancer Surveillance through Informatics Approaches. The proposed project launches a collaborative effort to advance adoption within the HMO Cancer Research Network (CRN) of "industrial-strength" natural language processing (NLP) systems useful for mining valuable, research-grade information from unstructured clinical text. Such text is available for processing, now in the electronic medical record (EMR) systems of affiliated CRN health plans. The proposed NLP methods will create ongoing capacity to tap what has recently been described as "a treasure trove of historical unstructured data that provides essential information for the study of disease progression, treatment effectiveness and long-term outcomes" (5). The vision of advancing widespread NLP capacity across the CRN, as well as the approach we present here for implementing it, grew out of an in-depth strategic planning effort we completed in December 2008. That effort involved participants from six CRN sites guided by a blue-ribbon panel of NLP experts from three of the nation's leading centers of clinical NLP research: University of Pittsburgh Medical Center, Vanderbilt University, and Mayo Clinic. The vision is to deploy a powerful NLP system locally, manage it with newly hired and trained local NLP technical staff, and conduct NLP-based research projects initiated by local investigators, in consultation with higher-level external NLP experts. Our planning efforts suggest this collaborative model is feasible; we will test the model in the context of the proposed project. An important development in April 2009 yielded what we believe is a potentially transformative opportunity to accelerate adoption of NLP capacity in applied research settings: release of the open-source Clinical Text Analysis and Knowledge Extraction System (cTAKES) software. This software was the result of a collaborative effort between IBM and Mayo Clinic. Built on the same framework Mayo Clinic currently uses to process its repository of over 40 million clinical documents, cTAKES dramatically lowers the cost of adopting a comprehensive and flexible NLP system. Deployment and use of such systems was previously only feasible in institutions with large, academically-oriented biomedical informatics research programs. Still, other deployment challenges and the need to acquire NLP training for local staff present residual barriers to adopting comprehensive NLP systems such as cTAKES. In collaboration with five other CRN sites the proposed project mitigates these challenges in two ways: 1) it develops configurable open-source software modules needed to streamline and therefore reduce the cost of deploying cTAKES, and 2) it presents and tests a model for training local staff through hands-on NLP projects overseen by outside NLP expert consultants. The potential impact of this project is evident most clearly in the vast untapped opportunities for text mining represented in CRN-affiliated health plans, where EMR systems have been in place since at least 2005, and whose patients represent 4% of the U.S. population. Clinical text mining offers the potential to provide new or improved data elements for cancer surveillance and other types of research requiring information about patient functional status, medication side-effects, details of therapeutic approaches, and differential information about clinical findings. Another significant impact of this project is its plan to integrate into the cTAKES system an open-source de-identification tool based on state of the art, best of breed NLP approaches developed by the MITRE Corporation. De-identification of clinical text will make it easier for researchers to get access to clinical text, and will also facilitate multi-site collaborations while protecting patient privacy. Finally, if successful, the NL

<< Back to the list of CRN projects