Enhancing a biomedical information extraction system with dictionary mining and context disambiguation

IBM Journal of Research and Development, Sep-Nov 2004, by Mukherjea, S, Subramaniam, L V, Chanda, G, Sankararaman, S, et al.

Journals and conference proceedings represent the dominant mechanisms for reporting new biomedical results. The unstructured nature of such publications makes it difficult to utilize data mining or automated knowledge discovery techniques. Annotation (or markup) of these unstructured documents represents the first step in making these documents machine-analyzable. Often, however, the use of similar (or the same) labels for different entities and the use of different labels for the same entity makes entity extraction difficult in biomedical literature. In this paper we present a system called BioAnnotator for identifying and classifying biological terms in documents. BioAnnotator uses domain-based dictionary lookup for recognizing known terms and a rule engine for discovering new terms. We explain how the system uses a biomedical dictionary to learn extraction patterns for the rule engine and how it disambiguates biological terms that belong to multiple semantic classes.

1. Introduction

Biomedical information is growing explosively, and new and useful results are appearing daily in research publications. Many of these publications are available online, for example, in the PubMed MedLine database [1]. However, automatic extraction of useful information from these online sources remains a challenge because these documents are unstructured and expressed in a natural language form. To enable data mining and knowledge discovery from such documents, this data must be made available in a structured format. Because of the very large amounts of data being generated, it is difficult to have human curators extract all of the information and present it in a form usable by data mining and knowledge discovery tools.

Information extraction in the biomedical domain is a challenging task. A major problem is that because of inconsistent naming conventions, a term may be used to denote more than one semantic class. For example, p53 is used to specify both a gene and a protein. Another problem is that new biological terms are continuously being created. Therefore, although several biomedical dictionaries and ontologies have been developed, none of them are up to date with the latest advances in the domain.

We have developed a system called BioAnnotator [2] for identifying biological terms in the scientific literature and annotating the terms with their semantic classes. BioAnnotator first identifies terms that are already known by doing a lookup on various publicly available biomedical dictionaries such as Unified Medical Language System (UMLS) [3] and Locus Link [4]. It then attempts to identify new and unknown terms by using character- and word-level properties of biological terms in addition to contextual clues.
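
The two-stage flow described above (dictionary lookup first, then rules for unknown terms) can be illustrated with a minimal Python sketch. This is not BioAnnotator's implementation: the tiny dictionary, the letters-then-digits pattern, the context-clue table, and the function name annotate are all hypothetical stand-ins chosen for illustration. A real system would load entries from resources such as UMLS and would use a much richer rule set.

```python
import re

# Hypothetical toy dictionary (assumption); a real system would load
# entries from resources such as UMLS. A term may map to several classes.
DICTIONARY = {
    "p53": {"gene", "protein"},
    "insulin": {"protein"},
}

# Hypothetical character-level rule (assumption): letters followed by
# digits, a shape that is common in gene and protein names.
ALNUM_PATTERN = re.compile(r"^[A-Za-z]+[0-9]+[A-Za-z]*$")

# Hypothetical contextual clues (assumption): the word right after a
# candidate term suggests its semantic class.
CONTEXT_CLUES = {"gene": "gene", "protein": "protein", "kinase": "protein"}


def annotate(text):
    """Return (term, classes, source) tuples for recognized terms."""
    tokens = re.findall(r"[A-Za-z0-9-]+", text)
    annotations = []
    for i, token in enumerate(tokens):
        key = token.lower()
        if key in DICTIONARY:
            # Stage 1: known term found by dictionary lookup.
            annotations.append((token, sorted(DICTIONARY[key]), "dictionary"))
        elif ALNUM_PATTERN.match(token):
            # Stage 2: unknown term proposed by the character-level rule,
            # classified by the following word when it is a known clue.
            next_word = tokens[i + 1].lower() if i + 1 < len(tokens) else ""
            label = CONTEXT_CLUES.get(next_word, "unknown")
            annotations.append((token, [label], "rule"))
    return annotations


if __name__ == "__main__":
    for item in annotate("The p53 protein interacts with the cdc25 gene and insulin."):
        print(item)
```

In this sketch the dictionary hit for p53 returns both of its classes, which is the ambiguity discussed above; the sketch makes no attempt to resolve it.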

In this paper, we discuss how BioAnnotator handles some of the challenges of biomedical information extraction. The next section cites related work, and section 3 gives an overview of BioAnnotator. Section 4 describes how we use UMLS to discover some extraction patterns for the rule engine. Section 5 explains our technique of determining the semantic class of ambiguous biological terms on the basis of the surrounding context. Finally, section 6 is the conclusion.

2. Related work

The task of extracting biological terms from scientific documents can be considered similar to the named entity task in the Message Understanding Conference (MUC) evaluation exercises [5]. Many biomedical information extraction methods thus represent adaptations of methods originally proposed for MUC [6].

Biological term extraction systems can be broadly divided into two types: those with a rule base and those with a learning method. In [7], protein names are identified in biological papers using hand-coded rules. A rule-based approach combined with dictionary lookup for term recognition and classification is given in [8]. In [9], supervised learning methods based on hidden Markov models are used. In [10], statistical approaches based on word distributions in a large corpus are used to find biological terms. In [11], an entropy-based approach combined with morphological rules is used for finding terms. An excellent overview of the field is given in [12].

BioAnnotator uses a rule engine as well as biomedical dictionaries for identifying biological terms. Some of the previous rule-based systems have tuned their rules for identifying a small class of terms. For example, [7] has created rules for finding only proteins. On the other hand, BioAnnotator attempts to identify all possible biological terms. Instead of simply using hand-coded rules as in previous systems, we have also used UMLS to "learn" some patterns for extracting biological terms. Moreover, the system is designed so that the rules can easily be modified to identify a different class of entities. In contrast to most of the previous systems, BioAnnotator has also been evaluated using a publicly available corpus. As stated in [12], good evaluation of the existing systems is one of the main challenges in this domain.

 

