Interactive and automatic query expansion: a comparative study with an application on Arabic
American Journal of Applied Sciences, Nov, 2008 by Ghassan Kanaan, Riyad Al-Shalabi, Sameh Ghwanmeh, Basel Bani-Ismail
INTRODUCTION
Information Retrieval (IR) is the task of selecting documents from a database in response to a user's query and ranking them according to relevance. This has usually been accomplished using statistical methods (often coupled with manual encoding) that:
* Select terms (words, phrases and other units) from documents that are deemed to best represent their content
* Create an inverted index file (or files) that provides easy access to documents containing these terms
A subsequent search process attempts to match preprocessed user queries against term-based representations of documents, in each case determining a degree of relevance between the two that depends on the number and types of matching terms. A search is successful if it returns as many relevant documents as possible with as few non-relevant documents as possible. In addition, the relevant documents should be ranked ahead of the non-relevant ones (1).
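The indexing and matching steps described above can be illustrated with a minimal Python sketch. It is not the system used in this study: the function names, the whitespace tokenizer, the toy documents and the scoring rule (a count of distinct matching query terms) are all assumptions made for illustration.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the set of ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():  # assumed tokenizer: whitespace split
            index[term].add(doc_id)
    return index

def rank_by_matching_terms(index, query):
    """Rank documents by the number of distinct query terms they contain."""
    scores = defaultdict(int)
    for term in set(query.lower().split()):
        for doc_id in index.get(term, ()):
            scores[doc_id] += 1
    # Highest score first; ties are broken by document id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))

# Hypothetical toy collection.
docs = {
    "d1": "query expansion improves retrieval",
    "d2": "arabic retrieval is challenging",
    "d3": "speed control of a motor",
}
index = build_inverted_index(docs)
ranking = rank_by_matching_terms(index, "arabic query retrieval")
print(ranking)
```

Documents that share no term with the query never enter the ranking, which is why the index lookup is cheaper than scanning every document.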
Query expansion techniques aim to improve a user's search by adding new query terms to an existing query. A standard method of performing query expansion is to use relevance information from the user, that is, the documents a user has assessed as containing relevant information. The content of these relevant documents can be used to form a set of possible expansion terms, ranked by some measure that describes how useful the terms might be in attracting more relevant documents (2). All or some of these expansion terms can be added to the query either by the user (interactive query expansion, IQE) or by the retrieval system (automatic query expansion, AQE) (3). One argument in favor of AQE is that the system has access to more statistical information on the relative utility of expansion terms and can therefore make a better selection of which terms to add to the user's query. The main argument in favor of IQE is that it gives more control to the user: since it is the user who decides the criteria for relevance in a search, the user should be able to make better decisions on which terms are likely to be useful (4).
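As a rough sketch of how expansion terms might be ranked from relevant documents, the Python below scores each candidate term by the number of relevant documents containing it, multiplied by an inverse document frequency factor. This weighting is an assumption chosen for illustration; the text above only says the terms are "ranked by some measure". The function names and the toy collection are likewise hypothetical.

```python
import math
from collections import Counter

def rank_expansion_terms(relevant_docs, collection, query_terms):
    """Rank candidate expansion terms drawn from the relevant documents.

    Assumed weight: r * log(N / df), where r is the number of relevant
    documents containing the term, df its document frequency in the
    collection and N the collection size. The relevant documents are
    assumed to belong to the collection, so df >= 1 for every candidate.
    """
    n_docs = len(collection)
    df = Counter()
    for text in collection:
        df.update(set(text.split()))
    r = Counter()
    for text in relevant_docs:
        r.update(set(text.split()))
    weights = {
        term: r[term] * math.log(n_docs / df[term])
        for term in r
        if term not in query_terms  # do not re-add original query terms
    }
    return sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))

def expand_query(query_terms, ranked_terms, n):
    """AQE-style expansion: append the top n ranked terms to the query."""
    return list(query_terms) + [term for term, _ in ranked_terms[:n]]

# Hypothetical toy data; the first two documents were judged relevant.
collection = [
    "arabic stemming retrieval",
    "arabic morphology stemming",
    "motor speed control",
    "economic growth asia",
    "query expansion feedback",
]
ranked = rank_expansion_terms(collection[:2], collection, ["arabic"])
expanded = expand_query(["arabic"], ranked, n=1)
print(ranked)
print(expanded)
```

In an IQE setting the ranked list would be shown to the user, who picks the terms to add; in an AQE setting the system calls something like `expand_query` with a fixed n.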
In this research, the potential effectiveness of interactive query expansion is examined. A series of experiments was carried out using 242 Arabic abstracts from the Saudi Arabian National Computer Conference. The experiments were conducted to provide a clear comparison between the AQE (collection dependent) strategy and IQE techniques. An evaluation was performed to find the value of n in the AQE strategy that gives the optimal average precision over the whole query set.
Automatic query expansion strategies: In (3) the author discusses three AQE techniques. These techniques act as baseline performance measures for comparing AQE and IQE:
Collection independent expansion: A common approach to AQE is to add a fixed number of terms, n, to each query.
Collection dependent expansion: When using a specific test collection we can calculate a better value of n, one that is specific to the test collection used. To calculate n for each collection, we compared the average precision over all the queries used in that collection after the addition of the top n expansion terms, where n varies from 1 to 15. The value of n that gives the optimal average precision for the whole query set was taken to be the value of n for each query in the collection. These values could not be calculated in an operational environment, where the full set of submitted queries is not known in advance. However, this strategy gives a stricter AQE baseline measure, as the value of n is optimal for the collection used.
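The selection of n described above can be sketched as a simple search over n = 1 to 15. The code below is an illustration only: it assumes non-interpolated average precision (the text does not say which variant was used), and `search_expanded` is a hypothetical callable that runs a query expanded with its top n terms and returns a ranked list of document ids. The toy search function is a stand-in, not a result from the study.

```python
def average_precision(ranking, relevant):
    """Non-interpolated average precision of one ranked list (assumed variant)."""
    if not relevant:
        return 0.0
    hits, total = 0, 0.0
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank  # precision at this relevant document
    return total / len(relevant)

def best_collection_n(queries, relevant_by_query, search_expanded,
                      candidates=range(1, 16)):
    """Return (n, mean AP) for the n that maximises mean AP over all queries."""
    best_n, best_map = None, -1.0
    for n in candidates:
        aps = [average_precision(search_expanded(q, n), relevant_by_query[q])
               for q in queries]
        mean_ap = sum(aps) / len(aps)
        if mean_ap > best_map:  # keep the smallest n on ties
            best_n, best_map = n, mean_ap
    return best_n, best_map

# Hypothetical stand-in for a retrieval run with n expansion terms.
def toy_search(query, n):
    if n < 3:
        return ["x1", "r1", "r2"]
    if n == 3:
        return ["r1", "r2", "x1"]
    return ["r1", "x1", "r2"]

best = best_collection_n(["q1"], {"q1": {"r1", "r2"}}, toy_search)
print(best)
```

Because the search needs relevance judgments for every query, it is only possible on a test collection, which matches the remark that these values could not be calculated in an operational environment.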
Query dependent expansion: The collection dependent expansion strategy adds a fixed number of terms to each query within a test collection. This is optimal for the entire query set, but may be sub-optimal for individual queries. This means some queries may give better retrieval effectiveness for greater or smaller values of n. The query dependent expansion strategy calculates which value of n is optimal for individual queries. This may be implemented in an operational retrieval system by, for example, setting a threshold on the expansion term weights.
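The threshold idea mentioned above might look like the following sketch, in which the number of terms added differs per query because it depends on how many expansion terms clear a fixed weight. The function name, the weights and the threshold value are hypothetical.

```python
def select_by_threshold(ranked_terms, threshold):
    """Keep the expansion terms whose weight is at least `threshold`."""
    return [term for term, weight in ranked_terms if weight >= threshold]

# Hypothetical ranked expansion terms (term, weight) for two queries.
ranked_q1 = [("stemming", 1.8), ("morphology", 1.6), ("retrieval", 0.4)]
ranked_q2 = [("corpus", 0.9), ("index", 0.2)]

terms_q1 = select_by_threshold(ranked_q1, threshold=0.5)
terms_q2 = select_by_threshold(ranked_q2, threshold=0.5)
print(len(terms_q1), terms_q1)
print(len(terms_q2), terms_q2)
```

With the same threshold, the first query receives two expansion terms and the second only one, so n is no longer fixed across the query set.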
Why the Arabic language?: Previous studies show that Arabic is one of the most widely used languages in the world, yet there are relatively few studies on the retrieval of Arabic documents in the literature. Furthermore, the lack of a realistically large test corpus has been a problem in past studies on Arabic retrieval (5). In this research work we explore a few strategies for the retrieval of Arabic documents, using the recently available TREC Arabic corpus for evaluation. Arabic is a challenging language for Information Retrieval (IR) for a number of reasons. The following problems make exact keyword matching ineffective for Arabic retrieval:
* Orthographic variations are prevalent in Arabic; certain combinations of characters can be written in different ways. For example, sometimes in glyphs combining HAMZA or MADDA with ALEF the HAMZA or MADDA is dropped, rendering it ambiguous as to whether the HAMZA or MADDA is present, examples: [TEXT NOT REPRODUCIBLE IN ASCII]
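One common way to reduce the effect of this kind of variation is to map the ALEF forms carrying HAMZA or MADDA to bare ALEF before indexing and querying. The sketch below illustrates that idea only; it is not taken from this study, the function name is hypothetical, and the sample word is our own illustration rather than one of the examples omitted above.

```python
# Unicode code points for the ALEF forms involved.
ALEF = "\u0627"  # ARABIC LETTER ALEF
ALEF_VARIANTS = {
    "\u0622": ALEF,  # ARABIC LETTER ALEF WITH MADDA ABOVE
    "\u0623": ALEF,  # ARABIC LETTER ALEF WITH HAMZA ABOVE
    "\u0625": ALEF,  # ARABIC LETTER ALEF WITH HAMZA BELOW
}
_ALEF_TABLE = str.maketrans(ALEF_VARIANTS)

def normalize_alef(text):
    """Replace ALEF forms carrying HAMZA or MADDA with bare ALEF."""
    return text.translate(_ALEF_TABLE)

# Illustrative word written with and without HAMZA on the initial ALEF.
with_hamza = "\u0623\u062d\u0645\u062f"
without_hamza = "\u0627\u062d\u0645\u062f"
print(normalize_alef(with_hamza) == without_hamza)
```

Applying the same normalization to both documents and queries lets the two spellings match under exact keyword comparison, at the cost of merging words that differ only in these marks.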


