Template Mining for Information Extraction from Digital Documents
Library Trends, Summer, 1999 by Gobinda G. Chowdhury
ABSTRACT
WITH THE RAPID GROWTH OF DIGITAL INFORMATION RESOURCES, information extraction (IE)--the process of automatically extracting information from natural language texts--is becoming more important. A number of IE systems, particularly in the areas of news/fact retrieval and in domain-specific areas, such as in chemical and patent information retrieval, have been developed in the recent past using the template mining approach that involves a natural language processing (NLP) technique to extract data directly from text if either the data and/or text surrounding the data form recognizable patterns. When text matches a template, the system extracts data according to the instructions associated with that template. This article briefly reviews template mining research. It also shows how templates are used in Web search engines--such as Alta Vista--and in meta-search engines--such as Ask Jeeves--for helping end-users generate natural language search expressions. Some potential areas of application of template mining for extraction of different kinds of information from digital documents are highlighted, and how such applications are used are indicated. It is suggested that, in order to facilitate template mining, standardization in the presentation, and layout of information within digital documents has to be ensured, and this can be done by generating various templates that authors can easily download and use while preparing digital documents.
INFORMATION EXTRACTION AND TEMPLATE MINING
Information extraction (IE), the process of automatically extracting information from natural language texts, is gaining more and more importance due to the fast growth of digital information resources. Most work on IE has emerged from research into rule-based systems in natural language processing. Croft (1995) suggested that IE techniques, primarily developed in the context of the Advanced Research Projects Agency (ARPA) Message Understanding Conferences (MUCs), are designed to identify database entities, attributes, and relationships in full text. Gaizauskas and Wilks (1998) defined IE as the activity of automatically extracting pre-specified sorts of information from short natural language texts typically, but by no means exclusively, newswire articles. Although works related to IE date back to the 1960s, perhaps the first detailed review of IE as an area of research interest in its own right was by Cowie and Lehnert (1996). However, a detailed review dividing the literature on IE into three different groups--namely, the early work on template filling, the Message Understanding Conferences (MUCs), and other works on information extraction--has recently been published by Gaizauskas and Wilks (1998).
Template mining is a particular technique used in IE. Lawson et al. (1996) defined template mining as a natural language processing (NLP) technique used to extract data directly from text if either the data and/ or text surrounding the data form recognizable patterns. When text matches a template, the system extracts data according to instructions associated with that template. Although different techniques are used for information extraction and knowledge discovery--as described by Cowie and Lehnert (1996), Gaizauskas and Wilks (1998), and Vickery (1997)--template mining is probably the oldest information extraction technique. Gaizauskas and Wilks (1998) reported that templates were used to extract data from natural language texts against which "fact retrieval" could be carried out in the Linguistic String Project at New York University that began in the mid-1960s and continued into the 1980s (reported by Sager, 1981). Numerous studies have been conducted, though most of them are domain-specific, using templates for extracting information from texts. This article briefly reviews some of these works. It also shows how templates are used for information retrieval purposes in major Web search engines like AltaVista (http:// www.altavista.com). This discussion proposes that template mining has great potential in extracting different kinds of information from documents in a digital library environment. To justify this proposition, this article reports some preliminary tests carried out on digital documents, more specifically on some articles published in the D-Lib Magazine (http://www.dilib.org/dilib).
WORKS ON TEMPLATE MINING
Template mining has been used successfully in different area:
* extraction of proper names by Coates-Stephens (1992), Wakao et al. (1996), and by Cowey and Lehnert (1996);
* extraction of facts from press releases related to company and financial information in systems like ATRANS (Lytinen & Gershman, 1986), SCISOR (Jacobs & Rau, 1990), JASPER (Andersen, et al., 1992; Andersen & Huettner, 1994), LOLITA (Costantino, Morgan, & Collingham, 1996), and FIES (Chong & Goh, 1997);
* abstracting scientific papers by Jones and Paice (1992);
* summarizing new product information by Shuldberg et al. (1993);
* extraction of data from analytical chemistry papers by Postma et al. (1990a, 1990b) and Postma and Kateman (1993);
Most Recent Reference Articles
- ARAB EUROPEAN RELATIONS - Dec 22 - Russia Denies Selling Missile System To Iran
- EGYPT - Dec 29 - Opposition Says Mubarak Blessed Israeli Attacks
- ARAB AFFAIRS - Dec 22 - Syria Will Eventually Move To Direct Talks With Israel
- ARAB AFFAIRS - Dec 30 - GCC Denounces Massacre
- ARAB ISRAELI RELATIONS - Israel Issues An Appeal To Palestinians In Gaza
Most Recent Reference Publications
Most Popular Reference Articles
- The Greek chorus, Jimmy the Greek got it wrong but so did his critics - Jimmy Snyder and his views on pro sports and race
- How Tyler Perry rose from homelessness to a $5 million mansion
- Credit card debt on college campuses: causes, consequences, and solutions
- 9 questions to ask your new lover: what you were afraid to ask, but always wanted to know
- Living by the word: light the candles


