Enhancing a biomedical information extraction system with dictionary mining and context disambiguation

IBM Journal of Research and Development, Sep-Nov 2004, by Mukherjea, S, Subramaniam, L V, Chanda, G, Sankararaman, S, et al.

* Our current technique of context disambiguation is restricted to proteins, genes, and RNA. We must extend the algorithm so that it can disambiguate among other semantic classes as well. Moreover, since many of the gene names in LocusLink are also common English words (for example, we, high, star), an effective method is required for deciding whether an occurrence of such a word is a gene name or the ordinary English word. We plan to use the part of speech of the words for this purpose.

* Acquiring labeled data for learning is extremely difficult in this domain. In this work we have exploited the fact that explicitly disambiguated terms are often present in documents. In the future we plan to utilize metadata, such as MeSH descriptors assigned manually to each MedLine abstract, for disambiguation.

1 Although the parser can identify other types of phrases such as noun prepositional phrases and verb groups, our evaluation shows that using only noun phrases gives the best results.

2 Our regular expressions follow the java.util.regex convention.

3 The term context window size refers to the number of words examined at either side of the phrase being disambiguated.
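The definition in footnote 3 can be illustrated with a minimal sketch. It assumes whitespace tokenization and a phrase given as a token span; the function name `context_window` and the sample sentence are hypothetical and are not taken from the system described in the paper.

```python
def context_window(tokens, start, end, size):
    """Return the words within `size` positions on each side of the phrase.

    `tokens[start:end]` is the phrase being disambiguated. Both the
    whitespace tokenization and this helper are illustrative assumptions.
    """
    left = tokens[max(0, start - size):start]   # up to `size` words before the phrase
    right = tokens[end:end + size]              # up to `size` words after the phrase
    return left, right


tokens = "expression of the p53 gene was reduced in tumor cells".split()
# The phrase "p53" occupies tokens[3:4]; examine two words on each side.
left, right = context_window(tokens, 3, 4, 2)
print(left, right)
```

With a window size of 2, the words examined around "p53" are "of the" on the left and "gene was" on the right; near a sentence boundary the window is simply shorter.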

References

1. MedLine; see http://www.ncbi.nlm.nih.gov/PubMed/.

2. L. V. Subramaniam, S. Mukherjea, P. Kankar, B. Srivastava, V. Batra, P. Kamesam, and R. Kothari, "Information Extraction from Biomedical Literature: Methodology, Evaluation and an Application," Proceedings of the ACM Conference on Information and Knowledge Management, New Orleans, 2003, pp. 410-417.

3. UMLS; see http://umlsks.nlm.nih.gov/.

4. LocusLink; see http://www.ncbi.nlm.nih.gov/locuslink/.

5. Proceedings of the Sixth Message Understanding Conference (MUC), Columbia, MD; Morgan Kaufmann Publishers, 1995.

6. K. Humphreys, G. Demetriou, and R. Gaizauskas, "Two Applications of Information Extraction to Biological Science: Enzyme Interactions and Protein Structures," Proceedings of the Pacific Symposium on Biocomputing, Hawaii, 2000, pp. 502-513.

7. K. Fukuda, T. Tsunoda, A. Tamura, and T. Takagi, "Toward Information Extraction: Identifying Protein Names from Biological Papers," Proceedings of the Pacific Symposium on Biocomputing, Hawaii, 1998, pp. 707-718.

8. R. Gaizauskas, G. Demetriou, and K. Humphreys, "Term Recognition and Classification in Biological Science Journal Articles," Proceedings of the Computational Terminology for Medical and Biological Applications: Workshop of the 2nd International Conference on Natural Language Processing, Patras, Greece, 1998, pp. 37-44.

9. N. Collier, C. Nobata, and J. Tsujii, "Extracting the Names of Genes and Gene Products with a Hidden Markov Model," Proceedings of the 18th International Conference on Computational Linguistics, Saarbrucken, Germany, 2000, pp. 201-207.

10. M. Andrade and A. Valencia, "Automatic Extraction of Keywords from Scientific Text: Application to the Knowledge Domain of Protein Families," Bioinformatics 14, No. 7, 600-607 (1998).

11. Y. Park, "Identification of Probable Real Words: An Entropy-based Approach," Proceedings of the Association for Computational Linguistics (ACL-02) Workshop on Unsupervised Lexical Acquisition, Saarbrucken, Germany, 2002, pp. 1-8.

 

