Template Mining for Information Extraction from Digital Documents

Library Trends, Summer, 1999 by Gobinda G. Chowdhury

ABSTRACT

WITH THE RAPID GROWTH OF DIGITAL INFORMATION RESOURCES, information extraction (IE)--the process of automatically extracting information from natural language texts--is becoming more important. A number of IE systems, particularly in news/fact retrieval and in domain-specific areas such as chemical and patent information retrieval, have been developed in the recent past using the template mining approach. Template mining is a natural language processing (NLP) technique that extracts data directly from text when the data, the text surrounding the data, or both form recognizable patterns. When text matches a template, the system extracts data according to the instructions associated with that template. This article briefly reviews template mining research. It also shows how templates are used in Web search engines--such as AltaVista--and in meta-search engines--such as Ask Jeeves--to help end-users generate natural language search expressions. Some potential areas of application of template mining for extracting different kinds of information from digital documents are highlighted, and the ways in which such applications can be used are indicated. It is suggested that, in order to facilitate template mining, standardization in the presentation and layout of information within digital documents has to be ensured, and that this can be done by generating various templates that authors can easily download and use while preparing digital documents.

INFORMATION EXTRACTION AND TEMPLATE MINING

Information extraction (IE), the process of automatically extracting information from natural language texts, is growing in importance due to the fast growth of digital information resources. Most work on IE has emerged from research into rule-based systems in natural language processing. Croft (1995) suggested that IE techniques, primarily developed in the context of the Advanced Research Projects Agency (ARPA) Message Understanding Conferences (MUCs), are designed to identify database entities, attributes, and relationships in full text. Gaizauskas and Wilks (1998) defined IE as the activity of automatically extracting pre-specified sorts of information from short natural language texts, typically, but by no means exclusively, newswire articles. Although works related to IE date back to the 1960s, perhaps the first detailed review of IE as an area of research interest in its own right was by Cowie and Lehnert (1996). More recently, Gaizauskas and Wilks (1998) published a detailed review dividing the literature on IE into three groups: the early work on template filling, the Message Understanding Conferences (MUCs), and other works on information extraction.

Template mining is a particular technique used in IE. Lawson et al. (1996) defined template mining as a natural language processing (NLP) technique used to extract data directly from text when the data, the text surrounding the data, or both form recognizable patterns. When text matches a template, the system extracts data according to instructions associated with that template. Although different techniques are used for information extraction and knowledge discovery--as described by Cowie and Lehnert (1996), Gaizauskas and Wilks (1998), and Vickery (1997)--template mining is probably the oldest information extraction technique. Gaizauskas and Wilks (1998) reported that templates were used to extract data from natural language texts, against which "fact retrieval" could be carried out, in the Linguistic String Project at New York University, which began in the mid-1960s and continued into the 1980s (reported by Sager, 1981). Numerous studies, most of them domain-specific, have used templates to extract information from texts. This article briefly reviews some of these works. It also shows how templates are used for information retrieval purposes in major Web search engines like AltaVista (http://www.altavista.com). This discussion proposes that template mining has great potential for extracting different kinds of information from documents in a digital library environment. To justify this proposition, this article reports some preliminary tests carried out on digital documents, more specifically on some articles published in the D-Lib Magazine (http://www.dilib.org/dilib).
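
The definition above (text matches a template, and data are then extracted according to the instructions attached to that template) can be illustrated with a minimal sketch in Python. It is not taken from any of the systems reviewed in this article. The two templates, their names, the slot names, and the sample sentence are all hypothetical, and regular expressions with named groups stand in for whatever pattern formalism a real system would use.

```python
import re

# Hypothetical templates for illustration only. Each template pairs a
# surface pattern with named slots; the named groups play the role of the
# "instructions" that say which matched text fills which slot.
NAME = r"[A-Z][\w&]*(?: [A-Z][\w&]*)*"  # one or more capitalized words

TEMPLATES = [
    {
        "name": "acquisition",
        "pattern": re.compile(
            rf"(?P<buyer>{NAME}) acquired (?P<target>{NAME}) for "
            r"\$(?P<amount>[\d.]+) (?P<unit>million|billion)"
        ),
    },
    {
        "name": "appointment",
        "pattern": re.compile(
            r"(?P<person>[A-Z][a-z]+ [A-Z][a-z]+) was appointed "
            rf"(?P<post>[a-z ]+?) of (?P<company>{NAME})"
        ),
    },
]


def mine(text):
    """Return one record per template match found in the text."""
    records = []
    for template in TEMPLATES:
        for match in template["pattern"].finditer(text):
            # Start the record with the template name, then copy every
            # named slot captured by the pattern.
            record = {"template": template["name"]}
            record.update(match.groupdict())
            records.append(record)
    return records


# Hypothetical sample text, not drawn from any corpus cited here.
sample = (
    "Acme Corp acquired Widget Works for $12.5 million. "
    "Jane Smith was appointed chief executive of Acme Corp."
)
records = mine(sample)
```

Each record is a dictionary of slot names and the strings that filled them, which is the kind of structured output against which fact retrieval could then be carried out. Patterns this rigid miss any sentence that is phrased differently, which is one reason template mining tends to work best on text with a standardized layout.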

WORKS ON TEMPLATE MINING

Template mining has been used successfully in different areas:

* extraction of proper names by Coates-Stephens (1992), Wakao et al. (1996), and by Cowie and Lehnert (1996);

* extraction of facts from press releases related to company and financial information in systems like ATRANS (Lytinen & Gershman, 1986), SCISOR (Jacobs & Rau, 1990), JASPER (Andersen, et al., 1992; Andersen & Huettner, 1994), LOLITA (Costantino, Morgan, & Collingham, 1996), and FIES (Chong & Goh, 1997);

* abstracting scientific papers by Jones and Paice (1992);

* summarizing new product information by Shuldberg et al. (1993);

* extraction of data from analytical chemistry papers by Postma et al. (1990a, 1990b) and Postma and Kateman (1993);

 
