Enhancing a biomedical information extraction system with dictionary mining and context disambiguation
IBM Journal of Research and Development, Sep-Nov 2004 by Mukherjea, S, Subramaniam, L V, Chanda, G, Sankararaman, S, Et al
* Many biological terms contain uppercase letters, numerical figures, and non-alphabetical characters (for example, PTEN, c-N-ras, CD4-Positive T-Lymphocytes). To identify these terms, we use the following two regular expressions:
The first regular expression matches any word that contains uppercase letters, numbers, or special characters. The second regular expression filters out some non-biological words, such as 10,000, H.D.Smith, and proper nouns.
* In the documents many biological concept names are preceded or followed by keywords or signals that give an indication of their class (for example, p16 tumor suppressor gene, pancreatic alpha cells, proteins Rac1 and Cdc42). We have formed regular expressions for such signal words:
* Biological concept names often contain prefixes and suffixes that give an indication of their class [17]. For example, many protein names end with the suffix -ase (for example, amylase). Therefore, the following regular expression is useful:
As discussed in the next section, we try to learn these prefixes and suffixes from UMLS.
The input to the rule engine is a noun phrase with all of the leading and trailing stop words removed. The input phrase is matched against the regular expressions. If there is no match, the individual words of the phrase are matched against the regular expressions. Each match is given a score based on the importance of the regular expression (also specified in the rule base). If the overall score is greater than a threshold specified in the configuration file, the phrase is considered to be a biological term. For example, in the phrase protein-kinase proto oncogene, protein-kinase is a protein signal and oncogene is a gene signal. If the combined score of the two matches is greater than the threshold, the phrase is considered to be a biological term. (Note that BioAnnotator outputs one annotation per term, even if the subterms are themselves identified as biologically relevant.)
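The scoring procedure described above can be sketched as follows. This is a minimal illustration only: the patterns, scores, threshold, and stop-word list are hypothetical stand-ins, not the actual BioAnnotator rule base or configuration.

```python
import re

# Hypothetical rule base: (class label, pattern, score).
# These patterns and weights are assumptions made for illustration;
# they are not the rules used by BioAnnotator.
RULES = [
    ("protein", re.compile(r"[\w-]*(?:protein|kinase)[\w-]*", re.I), 0.6),
    ("gene", re.compile(r"[\w-]*gene", re.I), 0.6),
    ("protein", re.compile(r"\w+ase", re.I), 0.5),
]
THRESHOLD = 1.0  # assumed value; the paper reads it from a configuration file
STOP_WORDS = {"the", "a", "an", "of", "in", "and"}  # assumed, tiny list


def strip_stop_words(phrase):
    """Remove leading and trailing stop words, keeping interior ones."""
    words = phrase.split()
    while words and words[0].lower() in STOP_WORDS:
        words.pop(0)
    while words and words[-1].lower() in STOP_WORDS:
        words.pop()
    return words


def score_phrase(phrase):
    """Score the whole phrase; if nothing matches, score its words."""
    words = strip_stop_words(phrase)
    text = " ".join(words)
    # First try every rule against the complete phrase.
    total = sum(score for _, rx, score in RULES if rx.fullmatch(text))
    if total == 0:
        # No phrase-level match: fall back to the individual words.
        total = sum(score for word in words
                    for _, rx, score in RULES if rx.fullmatch(word))
    return total


def is_biological_term(phrase):
    return score_phrase(phrase) > THRESHOLD
```

With these assumed weights, the phrase protein-kinase proto oncogene collects one protein-signal match and one gene-signal match at the word level, and the combined score exceeds the threshold.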
Evaluation
We performed a formal evaluation of BioAnnotator term extraction using the publicly available GENIA 1.1 corpus [18]. This corpus contains abstracts of 670 research papers as well as a list of the biological terms manually identified in them by human experts.
The BioAnnotator results are compared with the manual annotations. When a term from BioAnnotator is matched with a human-annotated term, one can look for an exact or approximate match. For an exact match, the annotations from BioAnnotator and the experts should match exactly. For an approximate match, one of the annotations should be a substring of the other. Table 1 summarizes the results. For a system that finds m correct terms and n incorrect terms, the precision is m/(m + n); for a document containing p biological terms, the recall is m/p. We also show the F-score, which is the harmonic mean of precision and recall, calculated as (2 * Precision * Recall)/(Precision + Recall). Further details about the evaluation are presented in [2].
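The metrics defined above can be computed as in the following sketch. The function names and the sample term lists are hypothetical, and counting correct terms from the system's side is a simplifying assumption (the source does not say how several system terms matching one expert term are handled).

```python
def is_match(predicted, gold, approximate=False):
    """Exact match, or substring match in either direction if approximate."""
    if predicted == gold:
        return True
    return approximate and (predicted in gold or gold in predicted)


def evaluate(predicted_terms, gold_terms, approximate=False):
    """Return (precision, recall, f_score) for one set of annotations."""
    # m: system terms that match some expert term; n: those that match none.
    m = sum(1 for pred in predicted_terms
            if any(is_match(pred, gold, approximate) for gold in gold_terms))
    n = len(predicted_terms) - m
    p = len(gold_terms)
    precision = m / (m + n) if (m + n) else 0.0
    recall = m / p if p else 0.0
    denom = precision + recall
    f_score = (2 * precision * recall) / denom if denom else 0.0
    return precision, recall, f_score


# Made-up example data, not taken from the GENIA corpus.
predicted = ["PTEN", "tumor suppressor gene", "Rac1", "report"]
gold = ["PTEN", "p16 tumor suppressor gene", "Rac1", "Cdc42"]

exact = evaluate(predicted, gold)
approx = evaluate(predicted, gold, approximate=True)
```

In this made-up example, approximate matching additionally credits tumor suppressor gene, because it is a substring of the expert annotation p16 tumor suppressor gene, so all three measures rise relative to exact matching.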
4. Learning affixes from UMLS



