Enhancing a biomedical information extraction system with dictionary mining and context disambiguation

IBM Journal of Research and Development, Sep-Nov 2004 by Mukherjea, S, Subramaniam, L V, Chanda, G, Sankararaman, S, et al.

* Many biological terms contain uppercase letters, numerical figures, and non-alphabetical characters (for example, PTEN, c-N-ras, CD4-Positive T-Lymphocytes). To identify these terms, we use the following two regular expressions:

The first regular expression matches any word that contains uppercase letters, digits, or special characters. The second regular expression filters out some non-biological words that the first would otherwise accept, such as 10,000, H.D.Smith, and proper nouns.
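The two expressions themselves are not reproduced in this copy of the article. As an illustration only, the two-stage idea (accept words with unusual characters, then reject obvious non-biological ones) might be sketched as follows. The patterns `CANDIDATE` and `NON_BIOLOGICAL` and the function `is_candidate_term` are hypothetical and are not the expressions used in BioAnnotator.

```python
import re

# Hypothetical stand-in for the first expression: the word contains at least
# one character that is not a lowercase letter (uppercase, digit, or symbol).
CANDIDATE = re.compile(r"[^a-z\s]")

# Hypothetical stand-in for the second expression: plain numbers (10,000),
# initials followed by a surname (H.D.Smith), and simple capitalized words
# (a rough proxy for proper nouns).
NON_BIOLOGICAL = re.compile(
    r"\d[\d,.]*"                  # 10,000 or 3.5
    r"|(?:[A-Z]\.)+[A-Z][a-z]+"   # H.D.Smith
    r"|[A-Z][a-z]+"               # London
)


def is_candidate_term(word: str) -> bool:
    """Return True if the word passes the first pattern and is not filtered."""
    if not CANDIDATE.search(word):
        return False
    return NON_BIOLOGICAL.fullmatch(word) is None


if __name__ == "__main__":
    words = ["PTEN", "c-N-ras", "CD4-Positive", "10,000", "H.D.Smith", "London", "cells"]
    print([w for w in words if is_candidate_term(w)])
```

A filter this crude also rejects capitalized biological names at the start of a sentence, which is one reason such a pattern can only remove some of the false candidates.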

* In the documents many biological concept names are preceded or followed by keywords or signals that give an indication of their class (for example, p16 tumor suppressor gene, pancreatic alpha cells, proteins Rac1 and Cdc42). We have formed regular expressions for such signal words:

* Biological concept names often contain prefixes and suffixes that give an indication of their class [17]. For example, many protein names end with ase (for example, amylase). Therefore, the following regular expression is useful:

As discussed in the next section, we try to learn these prefixes and suffixes from UMLS.

The input to the rule engine is a noun phrase with all of the leading and trailing stop words removed. The input phrase is matched against the regular expressions. If there is no match, the individual words of the phrase are matched against the regular expressions. Each match is given a score based on the importance of the regular expression (also specified in the rule base). If the overall score is greater than a threshold specified in the configuration file, the phrase is considered to be a biological term. For example, in the phrase protein-kinase proto oncogene, protein-kinase is a protein signal and oncogene is a gene signal. If the combined score of the two matches is greater than the threshold, the phrase is considered to be a biological term. (Note that BioAnnotator outputs one annotation per term, even if the subterms are themselves identified as biologically relevant.)
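The scoring procedure described above can be sketched roughly as follows. This is a minimal illustration, not the BioAnnotator rule base: the `Rule` type, the two patterns, their scores, and the `THRESHOLD` value are all assumptions chosen so that the example phrase from the text works through end to end.

```python
import re
from typing import List, NamedTuple


class Rule(NamedTuple):
    pattern: re.Pattern  # regular expression from the (hypothetical) rule base
    label: str           # class the rule signals, e.g. "protein" or "gene"
    score: float         # importance of the rule


# Hypothetical rules and scores, for illustration only.
RULES: List[Rule] = [
    Rule(re.compile(r"(?:[\w-]+-)?kinase|proteins?", re.I), "protein", 0.6),
    Rule(re.compile(r"oncogenes?|genes?", re.I), "gene", 0.6),
]

# Hypothetical threshold; the article reads it from a configuration file.
THRESHOLD = 1.0


def score_phrase(phrase: str, rules: List[Rule] = RULES) -> float:
    """Score the whole phrase first; fall back to word-by-word matching."""
    whole = [r for r in rules if r.pattern.fullmatch(phrase)]
    if whole:
        return sum(r.score for r in whole)
    total = 0.0
    for word in phrase.split():
        total += sum(r.score for r in rules if r.pattern.fullmatch(word))
    return total


def is_biological_term(phrase: str, threshold: float = THRESHOLD) -> bool:
    """A phrase is accepted when its overall score exceeds the threshold."""
    return score_phrase(phrase) > threshold


if __name__ == "__main__":
    for phrase in ["protein-kinase proto oncogene", "proto oncogene", "clinical study"]:
        print(phrase, is_biological_term(phrase))
```

With these assumed scores, the example phrase is accepted only because both the protein signal and the gene signal fire; a single signal alone stays below the threshold.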

Evaluation

We performed a formal evaluation of BioAnnotator term extraction using the publicly available GENIA 1.1 corpus [18]. This corpus contains abstracts of 670 research papers as well as a list of the biological terms manually identified in them by human experts.

The BioAnnotator results are compared with the manual annotations. When a term from BioAnnotator is matched with a human-annotated term, one can look for an exact or approximate match. For an exact match, the annotations from BioAnnotator and the experts must be identical. For an approximate match, one of the annotations must be a substring of the other. Table 1 summarizes the results. For a system that finds m correct terms and n incorrect terms, the precision is m/(m + n); for a document containing p biological terms, the recall is m/p. We also show the F-score, which is the harmonic mean of precision and recall, calculated as (2 * Precision * Recall)/(Precision + Recall). Further details about the evaluation are presented in [2].
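As a worked illustration of these definitions, the following sketch computes precision, recall, and F-score under both matching modes. The function names and the sample term lists are invented for the example and are not taken from the GENIA evaluation. One simplification is assumed: each predicted term is counted as correct if it matches any gold term, so under approximate matching several predictions could match the same gold term.

```python
from typing import List, Tuple


def is_match(predicted: str, gold: str, exact: bool = True) -> bool:
    """Exact: identical strings. Approximate: one is a substring of the other."""
    if exact:
        return predicted == gold
    return predicted in gold or gold in predicted


def evaluate(predicted: List[str], gold: List[str], exact: bool = True) -> Tuple[float, float, float]:
    """Return (precision, recall, F-score) using m/(m + n) and m/p."""
    m = sum(1 for t in predicted if any(is_match(t, g, exact) for g in gold))
    n = len(predicted) - m   # incorrect terms
    p = len(gold)            # biological terms in the document
    precision = m / (m + n) if predicted else 0.0
    recall = m / p if p else 0.0
    denom = precision + recall
    f_score = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f_score


if __name__ == "__main__":
    # Invented sample data for illustration.
    predicted = ["PTEN", "alpha cells", "study"]
    gold = ["PTEN", "pancreatic alpha cells", "Rac1", "Cdc42"]
    print("exact:", evaluate(predicted, gold, exact=True))
    print("approximate:", evaluate(predicted, gold, exact=False))
```

In this sample, exact matching gives m = 1, n = 2, p = 4 (precision 1/3, recall 1/4, F-score 2/7), while approximate matching also credits "alpha cells" against "pancreatic alpha cells", giving precision 2/3, recall 1/2, and F-score 4/7.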

4. Learning affixes from UMLS


 

