Interactive and automatic query expansion: a comparative study with an application on Arabic

American Journal of Applied Sciences, Nov, 2008 by Ghassan Kanaan, Riyad Al-Shalabi, Sameh Ghwanmeh, Basel Bani-Ismail

INTRODUCTION

Information Retrieval (IR) is the task of selecting documents from a database in response to a user's query and ranking them according to relevance. This has usually been accomplished using statistical methods (often coupled with manual encoding) that:

* Select terms (words, phrases and other units) from documents that are deemed to best represent their content

* Create an inverted index file (or files) that provides easy access to documents containing these terms
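
The two steps above can be illustrated with a minimal sketch. It assumes whitespace tokenization with lowercasing as a stand-in for real term selection; the function name `build_inverted_index` and the sample documents are hypothetical and not taken from the study.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the sorted list of ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        # Assumption: splitting on whitespace stands in for term selection.
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

# Hypothetical toy collection, for illustration only.
docs = {
    1: "query expansion improves retrieval",
    2: "arabic retrieval is challenging",
    3: "interactive query expansion",
}
index = build_inverted_index(docs)
```

Looking up a term in `index` then returns the documents that contain it without scanning the whole collection.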

A subsequent search process attempts to match preprocessed user queries against term-based representations of documents, in each case determining a degree of relevance between the two, which depends on the number and types of matching terms. A search is successful if it returns as many relevant documents as possible, with as few non-relevant documents as possible. In addition, the relevant documents should be ranked ahead of non-relevant ones (1).

Query expansion techniques aim to improve a user's search by adding new query terms to an existing query. A standard method of performing query expansion is to use relevance information from the user, that is, those documents the user has assessed as containing relevant information. The content of these relevant documents can be used to form a set of possible expansion terms, ranked by some measure that describes how useful the terms might be in attracting more relevant documents (2). All or some of these expansion terms can be added to the query either by the user, which is interactive query expansion (IQE), or by the retrieval system, which is automatic query expansion (AQE) (3). One argument in favor of AQE is that the system has access to more statistical information on the relative utility of expansion terms and can make a better selection of which terms to add to the user's query. The main argument in favor of IQE is that it gives more control to the user. Since it is the user who decides the criteria for relevance in a search, the user should be able to make better decisions on which terms are likely to be useful (4).
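
As a rough illustration of ranking expansion terms from relevant documents, the sketch below scores each candidate by the number of relevant documents containing it, multiplied by an inverse document frequency factor. This scoring measure is an assumption chosen for simplicity; the text only says the terms are "ranked by some measure", and the function name and sample data are hypothetical.

```python
import math
from collections import Counter

def rank_expansion_terms(relevant_docs, all_docs, query_terms):
    """Rank candidate terms from relevant documents, highest score first.

    relevant_docs must be a subset of all_docs; each document is a list of terms.
    """
    n_docs = len(all_docs)
    # Document frequency of each term over the whole collection.
    df = Counter(t for doc in all_docs for t in set(doc))
    # Number of relevant documents containing each term.
    rel_df = Counter(t for doc in relevant_docs for t in set(doc))
    # Assumed measure: relevant-document count times log inverse document frequency.
    scores = {
        t: r * math.log(n_docs / df[t])
        for t, r in rel_df.items()
        if t not in query_terms
    }
    # Ties are broken alphabetically so the ordering is deterministic.
    return sorted(scores, key=lambda t: (-scores[t], t))

# Hypothetical toy data: the first two documents were judged relevant.
all_docs = [
    ["query", "expansion", "feedback"],
    ["query", "feedback", "relevance"],
    ["arabic", "stemming"],
    ["query", "arabic"],
    ["stemming", "morphology"],
]
ranked = rank_expansion_terms(all_docs[:2], all_docs, {"query"})
```

Under AQE the system would add a prefix of `ranked` to the query; under IQE the same list would be shown to the user, who picks the terms to add.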

In this research the potential effectiveness of interactive query expansion has been examined. A series of experiments was carried out using 242 Arabic abstracts from the Saudi Arabian National Computer Conference. The experiments were conducted to provide a clear comparison between the AQE (collection dependent) strategy and IQE techniques. An evaluation was performed to find the value of n, the number of expansion terms added, that gives the optimal average precision over the whole query set in the AQE strategy.

Automatic query expansion strategies: In (3) the author discusses three AQE techniques. These techniques act as baseline performance measures for comparing AQE and IQE:

Collection independent expansion: A common approach to AQE is to add a fixed number of terms, n, to each query.

Collection dependent expansion: When using a specific test collection we can calculate a better value of n, one that is specific to the test collection used. To calculate n for each collection, we compared the average precision over all the queries used in the collection after the addition of the top n expansion terms, where n varies from 1 to 15. The value of n that gives the optimal average precision for the whole query set was taken to be the value of n for each query in the collection. These values could not be calculated in an operational environment, where the full set of submitted queries is not known in advance. However, this gives a stricter AQE baseline measure, as the value of n is optimal for the collection used.
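
The selection of n described above can be sketched as follows. This is a minimal illustration, assuming non-interpolated average precision and a hypothetical `search` callable that maps a list of terms to a ranked list of document ids; neither the function names nor the tie-breaking rule (smallest n wins) comes from the study.

```python
def average_precision(ranking, relevant):
    """Non-interpolated average precision of one ranked list of doc ids."""
    hits, total = 0, 0.0
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0

def best_collection_n(queries, search, max_n=15):
    """Return the n in 1..max_n that maximizes mean average precision.

    queries: list of (query_terms, ranked_expansion_terms, relevant_doc_ids)
    search:  hypothetical callable, list of terms -> ranked list of doc ids
    """
    def mean_ap(n):
        # Expand every query with its own top-n terms, then average the APs.
        aps = [
            average_precision(search(q + exp[:n]), rel)
            for q, exp, rel in queries
        ]
        return sum(aps) / len(aps)

    # max() keeps the first maximum, so ties go to the smallest n.
    return max(range(1, max_n + 1), key=mean_ap)
```

Because the same n is applied to every query, the result is optimal only on average over the query set.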

Query dependent expansion: The collection dependent expansion strategy adds a fixed number of terms to each query within a test collection. This is optimal for the entire query set, but may be sub-optimal for individual queries. This means some queries may give better retrieval effectiveness for greater or smaller values of n. The query dependent expansion strategy calculates which value of n is optimal for individual queries. This may be implemented in an operational retrieval system by, for example, setting a threshold on the expansion term weights.
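
The thresholding idea mentioned above might look like the sketch below. The function name, the term weights, and the threshold value are all hypothetical; the text does not specify how the weights are computed or what threshold would be used.

```python
def expand_by_threshold(query_terms, weighted_terms, threshold):
    """Add every candidate term whose weight is at least `threshold`.

    weighted_terms: dict mapping candidate expansion term -> weight
    """
    # Order candidates by descending weight, alphabetically on ties.
    candidates = sorted(weighted_terms.items(), key=lambda kv: (-kv[1], kv[0]))
    added = [
        term for term, weight in candidates
        if weight >= threshold and term not in query_terms
    ]
    return list(query_terms) + added

# Hypothetical weights for one query.
expanded = expand_by_threshold(
    ["query"],
    {"feedback": 1.8, "expansion": 1.6, "corpus": 0.4},
    threshold=1.0,
)
```

With a fixed threshold, the number of terms added differs from query to query, which is the point of the query dependent strategy.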

Why Arabic language?: Previous studies show that the Arabic language is one of the most widely used languages in the world, yet there are relatively few studies on the retrieval of Arabic documents in the literature. Furthermore, the lack of a realistically large test corpus has been a problem in past studies on Arabic retrieval (5). In this research work we will explore a few strategies for the retrieval of Arabic documents, using the recently available TREC Arabic corpus for evaluation. Arabic is a challenging language for Information Retrieval (IR) for a number of reasons. The following problems make exact keyword matching ineffective for Arabic retrieval:

* Orthographic variations are prevalent in Arabic; certain combinations of characters can be written in different ways. For example, sometimes in glyphs combining HAMZA or MADDA with ALEF the HAMZA or MADDA is dropped, rendering it ambiguous as to whether the HAMZA or MADDA is present, examples: [TEXT NOT REPRODUCIBLE IN ASCII]
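
One common way to reduce this ambiguity is to map the ALEF variants to bare ALEF before indexing and querying. The sketch below is an illustration under that assumption; the text does not say whether the authors applied this normalization, and the name `normalize_alef` is hypothetical.

```python
# Map ALEF combined with HAMZA or MADDA to bare ALEF (U+0627).
ALEF_VARIANTS = {
    "\u0622": "\u0627",  # ALEF WITH MADDA ABOVE -> ALEF
    "\u0623": "\u0627",  # ALEF WITH HAMZA ABOVE -> ALEF
    "\u0625": "\u0627",  # ALEF WITH HAMZA BELOW -> ALEF
}
_ALEF_TABLE = str.maketrans(ALEF_VARIANTS)

def normalize_alef(text):
    """Return text with ALEF variants replaced by bare ALEF."""
    return text.translate(_ALEF_TABLE)

# The same word written with and without HAMZA normalizes to one form.
with_hamza = "\u0623\u062D\u0645\u062F"
without_hamza = "\u0627\u062D\u0645\u062F"
same = normalize_alef(with_hamza) == normalize_alef(without_hamza)
```

Applying the same function to both documents and queries lets the two spellings match under exact keyword comparison.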


 


Content provided in partnership with Thompson Gale