Enhancing a biomedical information extraction system with dictionary mining and context disambiguation
IBM Journal of Research and Development, Sep-Nov 2004 by Mukherjea, S, Subramaniam, L V, Chanda, G, Sankararaman, S, Et al
* Our current technique of context disambiguation is restricted to proteins, genes, and RNA. We must extend the algorithm so that it can also disambiguate among other semantic classes. Moreover, since many of the gene names in LocusLink are common English words (for example, we, high, star), an effective method of distinguishing a gene name from the same word used in its ordinary English sense is required. We plan to use the part of speech of the words for this purpose.
* Acquiring labeled data for learning is extremely difficult in this domain. In this work we have exploited the fact that explicitly disambiguated terms are often present in documents. In the future we plan to utilize metadata, such as MeSH descriptors assigned manually to each MedLine abstract, for disambiguation.
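The part-of-speech idea in the first item above can be illustrated with a minimal sketch. It is not the authors' method, which the text leaves as future work. It assumes the input has already been tagged by some part-of-speech tagger using Penn Treebank-style tags, and it treats a dictionary match as a gene candidate only when the token is tagged as a noun. The names `GENE_NAMES` and `gene_candidates` and the three-entry dictionary are hypothetical.

```python
from typing import Iterable, List, Set, Tuple

# Hypothetical mini-dictionary for illustration; the real LocusLink list is far larger.
GENE_NAMES: Set[str] = {"we", "high", "star"}


def gene_candidates(
    tagged: Iterable[Tuple[str, str]],
    gene_names: Set[str] = GENE_NAMES,
) -> List[str]:
    """Return dictionary matches that are tagged as nouns.

    `tagged` is a sequence of (token, tag) pairs. Tags are assumed to be
    Penn Treebank-style, so every noun tag starts with "NN".
    """
    kept = []
    for token, tag in tagged:
        if token.lower() in gene_names and tag.startswith("NN"):
            kept.append(token)
    return kept


# Hand-tagged example sentence (the tags are assumptions, not tagger output).
sentence = [
    ("We", "PRP"), ("observed", "VBD"), ("high", "JJ"),
    ("expression", "NN"), ("of", "IN"), ("star", "NN"),
]
print(gene_candidates(sentence))  # prints ['star']
```

Here "We" (pronoun) and "high" (adjective) are rejected even though both appear in the dictionary, while "star" survives because it is tagged as a noun. A noun-only rule is deliberately crude; a real system would likely combine it with the context-based evidence described in the paper.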
1 Although the parser can identify other types of phrases such as noun prepositional phrases and verb groups, our evaluation shows that using only noun phrases gives the best results.
2 Our regular expressions follow the java.util.regex convention.
3 The term context window size refers to the number of words examined at either side of the phrase being disambiguated.
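The definition in footnote 3 can be made concrete with a short sketch. It assumes the text is already tokenized into words and that the phrase occupies the half-open token range `[start, end)`. The function name `context_window` and the example sentence are illustrative and do not come from the paper.

```python
from typing import List, Sequence, Tuple


def context_window(
    tokens: Sequence[str], start: int, end: int, size: int
) -> Tuple[List[str], List[str]]:
    """Return up to `size` words on each side of tokens[start:end].

    The window is truncated at the beginning and end of the text.
    """
    left = list(tokens[max(0, start - size):start])
    right = list(tokens[end:end + size])
    return left, right


tokens = "the expression of p53 protein was increased".split()
# The phrase being disambiguated is "p53", i.e. tokens[3:4]; window size is 2.
left, right = context_window(tokens, 3, 4, 2)
print(left, right)  # prints ['expression', 'of'] ['protein', 'was']
```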
References
1. MedLine; see http://www.ncbi.nlm.nih.gov/PubMed/.
2. L. V. Subramaniam, S. Mukherjea, P. Kankar, B. Srivastava, V. Batra, P. Kamesam, and R. Kothari, "Information Extraction from Biomedical Literature: Methodology, Evaluation and an Application," Proceedings of the ACM Conference on Information and Knowledge Management, New Orleans, 2003, pp. 410-417.
3. UMLS; see http://umlsks.nlm.nih.gov/.
4. LocusLink; see http://www.ncbi.nlm.nih.gov/locuslink/.
5. Proceedings of the Sixth Message Understanding Conference (MUC), Columbia, MD; Morgan Kaufmann Publishers, 1995.
6. K. Humphreys, G. Demetriou, and R. Gaizauskas, "Two Applications of Information Extraction to Biological Science: Enzyme Interactions and Protein Structures," Proceedings of the Pacific Symposium on Biocomputing, Hawaii, 2000, pp. 502-513.
7. K. Fukuda, T. Tsunoda, A. Tamura, and T. Takagi, "Toward Information Extraction: Identifying Protein Names from Biological Papers," Proceedings of the Pacific Symposium on Biocomputing, Hawaii, 1998, pp. 707-718.
8. R. Gaizauskas, G. Demetriou, and K. Humphreys, "Term Recognition and Classification in Biological Science Journal Articles," Proceedings of the Computational Terminology for Medical and Biological Applications: Workshop of the 2nd International Conference on Natural Language Processing, Patras, Greece, 1998, pp. 37-44.
9. N. Collier, C. Nobata, and J. Tsujii, "Extracting the Names of Genes and Gene Products with a Hidden Markov Model," Proceedings of the 18th International Conference on Computational Linguistics, Saarbrucken, Germany, 2000, pp. 201-207.
10. M. Andrade and A. Valencia, "Automatic Extraction of Keywords from Scientific Text: Application to the Knowledge Domain of Protein Families," BioInform. 4, No. 7, 600-607 (1998).
11. Y. Park, "Identification of Probable Real Words: An Entropy-based Approach," Proceedings of the Association for Computational Linguistics (ACL-02) Workshop on Unsupervised Lexical Acquisition, Saarbrucken, Germany, 2002, pp. 1-8.



