Enhancing a biomedical information extraction system with dictionary mining and context disambiguation

IBM Journal of Research and Development, Sep-Nov 2004, by S. Mukherjea, L. V. Subramaniam, G. Chanda, S. Sankararaman, et al.

* Because of the small size of the test set, it may not be representative.

* The locally disambiguated training instances may not be sufficient as examples for all cases.

* The system assigns senses to cases which are not truly classifiable (e.g., tumor suppressor gene activity).

However, the use of persistence of sense and rules greatly improves performance.

6. Conclusion

In this paper we have presented a biological annotation system which uses a variety of knowledge sources along with syntactic information, term properties, and contextual clues to identify and classify known and new terms. BioAnnotator attempts to mine biomedical knowledge sources to learn extraction patterns in order to identify new and unknown biological terms. We have also developed a technique for context disambiguation. Our evaluation shows that the system has good precision and recall.

Systems such as BioAnnotator that extract and classify terms from biomedical literature can support more complex text analysis tasks for researchers. We believe that the semistructured documents that BioAnnotator produces provide opportunities for additional types of knowledge discovery from biomedical document corpora. For example, to extract relations between biological entities (such as protein-protein interactions), it is first necessary to recognize and classify the entities taking part in the interactions [21]. Term extraction is also useful for automatically updating biomedical databases such as SwissProt [22], which are at present largely hand-curated. We have also developed a Web-based application for semantic search of online biomedical research publications based on the annotated documents. A traditional keyword-based search retrieves only publications that contain the specified keywords. Thus, a search for "genes" will not retrieve a highly relevant document that discusses p53 (a gene) unless the document also contains that word. In contrast, because we have annotated the documents with the semantic classes of the biological terms, our semantic search application can retrieve the relevant documents. Another application of BioAnnotator's annotations, discovering the biological significance of gene clusters, is presented in [2].
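The difference between keyword search and search over semantic annotations can be illustrated with a minimal Python sketch. It is not the Web application described above: the `AnnotatedDocument` structure, the function names, and the two toy documents are assumptions made only for illustration, and the annotations are entered by hand rather than produced by BioAnnotator.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class AnnotatedDocument:
    """A document plus term -> semantic class annotations (hypothetical format)."""
    doc_id: str
    text: str
    annotations: Dict[str, str] = field(default_factory=dict)


def keyword_search(docs: List[AnnotatedDocument], keyword: str) -> List[str]:
    """Return ids of documents whose text literally contains the keyword."""
    needle = keyword.lower()
    return [d.doc_id for d in docs if needle in d.text.lower()]


def semantic_search(docs: List[AnnotatedDocument], semantic_class: str) -> List[str]:
    """Return ids of documents containing any term annotated with the class."""
    wanted = semantic_class.lower()
    return [
        d.doc_id
        for d in docs
        if any(cls.lower() == wanted for cls in d.annotations.values())
    ]


# Toy, hand-annotated documents (illustrative only).
docs = [
    AnnotatedDocument("doc1", "p53 is mutated in many tumors.", {"p53": "gene"}),
    AnnotatedDocument("doc2", "Leucocytes were counted daily.", {"Leucocytes": "cell"}),
]

print(keyword_search(docs, "gene"))   # doc1 is missed: the word never appears in it
print(semantic_search(docs, "gene"))  # doc1 is found through the p53 annotation
```

A real system would index the annotations rather than scan every document, but the retrieval logic is the same: the query is matched against semantic classes instead of surface strings.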

Future work is planned along various lines:

* At present BioAnnotator learns only affixes of biological terms from UMLS. Besides prefixes and suffixes, biological concept names often contain root forms that indicate their class. For example, many cell names contain blast, cyt, or phore (as in leucocytes). We must extend our algorithm to learn these root forms. We must also develop techniques to learn other types of patterns that can be used to identify biological terms; for instance, protein, cell, or gene signals can be determined from an annotated corpus.

* Our present evaluation of the UMLS mining technique determines only whether the discovered prefixes and suffixes are biologically relevant. We must also evaluate whether the discovered prefixes and suffixes are relevant for a particular semantic type.
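One way the two items above might be approached is sketched below in Python. This is an assumption-laden illustration, not the BioAnnotator algorithm: the function `mine_root_forms`, its thresholds, and the toy term lists are all hypothetical, and a real experiment would draw the terms from UMLS semantic types. The sketch counts substrings that occur anywhere inside a term (so it covers prefixes, suffixes, and roots alike) and keeps those that are frequent in one class while rarely occurring in the others, which gives a crude per-class relevance score.

```python
from collections import Counter, defaultdict


def mine_root_forms(terms_by_class, min_len=3, max_len=5, min_count=2, min_share=0.8):
    """Find substrings that are frequent in one class and rare in the others.

    terms_by_class maps a semantic class name to a list of term strings.
    Returns {class: [(substring, count_in_class, share_of_all_occurrences), ...]}.
    """
    counts = defaultdict(Counter)  # class -> substring -> number of terms containing it
    for cls, terms in terms_by_class.items():
        for term in terms:
            t = term.lower()
            seen = {
                t[i:i + n]
                for n in range(min_len, max_len + 1)
                for i in range(len(t) - n + 1)
            }
            counts[cls].update(seen)  # each substring counts once per term

    totals = Counter()  # substring -> number of terms containing it, over all classes
    for per_class in counts.values():
        totals.update(per_class)

    result = {}
    for cls, per_class in counts.items():
        kept = [
            (sub, c, c / totals[sub])
            for sub, c in per_class.items()
            if c >= min_count and c / totals[sub] >= min_share
        ]
        # Most frequent first; longer substrings break ties.
        result[cls] = sorted(kept, key=lambda x: (-x[1], -len(x[0]), x[0]))
    return result


# Toy term lists (made up for illustration; not taken from UMLS).
toy = {
    "cell": ["leucocyte", "erythrocyte", "osteoblast", "fibroblast"],
    "protein": ["kinase", "protease", "polymerase"],
}
roots = mine_root_forms(toy)
print([sub for sub, _, _ in roots["cell"]][:5])
```

On realistic vocabularies the share threshold does the work of the second item: a substring that is common across many semantic types receives a low share for each of them and is dropped, even if it is biologically meaningful in general.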

 


Content provided in partnership with ProQuest