
The eShopmonitor: A comprehensive data extraction tool for monitoring Web sites

IBM Journal of Research and Development, Sep-Nov 2004, by Agrawal, N., Ananthanarayanan, R., Gupta, R., Joshi, S., et al.

Typical commercial Web sites publish information from multiple back-end data sources; these data sources are also updated very frequently. Given the size of most commercial sites today, it becomes essential to have an automated means of checking for correctness and consistency of data. The eShopmonitor allows users to specify items of interest to be tracked, monitors these items on the Web pages, and reports on any changes observed. Our solution comprises a crawler, a miner, a reporter, and a user component that work together to achieve the above functionality. The miner learns to locate the items of interest on a class of pages based on just one sample supplied by the user, via the user interface (UI) provided. The learning algorithm is based on the XPaths of the Document Object Model (DOM) of the page.

1. Introduction

Reliability, timeliness, and correctness of information on Web sites are matters of concern to every system administrator. In the case of commercial sites, inaccuracies and factually incorrect information on the Web site could lead to serious losses and legal problems, apart from losing customer interest and goodwill. For example, an airline website could erroneously offer air tickets for unusually low prices, or an online store might display products with incorrect prices.1 Further, the data displayed on Web pages might be derived from multiple dynamic upstream data sources, and errors could creep in because of some obscure error at the database end. Even when the pages are statically generated, there could be changes to the links of the page, leading to problems of missing and inconsistent links. While different kinds of checks may be enforced at the database level to ensure consistency and correctness of data, these are not always sufficient to trap all errors that arise on the Web pages. For instance, missing links, incorrect links, and incorrect or missing images are some of the problems that may arise on the site even when the database has been checked for consistency and correctness.

In view of these problems, it would be useful for the site owner or administrator to have a tool to monitor important Web pages and detect anomalies of the kinds mentioned above. To make this notion more concrete, let us define a field or item of interest as an HTML element whose value is of high importance to the Web site. For example, HTML elements that contain a product name, product price, promotion, or discount offer are all fields of interest. In other words, a field of interest is any HTML element whose value should not display anomalous behavior. Since the values of these fields of interest must be extracted or "mined" from the Web page, we call the anomalies concerning these fields of interest mining anomalies. Other anomalies are called crawling anomalies, since they can be detected by a crawler. To summarize, we are interested in detecting the following kinds of anomalies:

1. Mining-related anomalies; for example,

* Identify pages where some specific item of information has changed, e.g., pages where 'ProductPrice' has changed by more than 10%.

* Identify pages where the value of some specific item of interest is x, e.g., pages where 'ProductAvailability' equals 'Not In Stock'. Instead of =, other operations such as ≥, ≤, and substring can also be used.

* Identify pages where a particular field of interest is missing, e.g., pages where 'ProductImage' is missing.

2. Crawling-related anomalies; for example,

* Navigational anomalies; e.g., there is no click-path between the laptop sub-site and the CD-ROM drive sub-site.

* Missing or broken links.

* Documents returning HTTP 404 (Page not found) or HTTP 50x (internal server errors), etc.
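
As an illustration of the crawling-related anomalies listed above, the following minimal Python sketch checks for a click-path between two pages, classifies HTTP status codes, and lists broken links. It is not the eShopmonitor crawler. It assumes the crawl has already produced a link graph (a mapping from each page to the pages it links to) and a mapping from each fetched page to its HTTP status. The function names and data layout are hypothetical.

```python
from collections import deque


def has_click_path(links, start, target):
    """Breadth-first search over the link graph.

    Returns True if `target` can be reached from `start` by following links.
    `links` maps a page to the pages it links to (assumed data layout).
    """
    seen = {start}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        if page == target:
            return True
        for nxt in links.get(page, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False


def classify_status(status):
    """Map an HTTP status code to an anomaly label, or None if not anomalous."""
    if status == 404:
        return "page not found"
    if 500 <= status <= 509:  # the "50x" family mentioned in the text
        return "internal server error"
    return None


def broken_links(links, statuses):
    """List (source, target) pairs whose target is unfetched or anomalous."""
    broken = []
    for source, targets in links.items():
        for target in targets:
            status = statuses.get(target)
            if status is None or classify_status(status) is not None:
                broken.append((source, target))
    return broken
```

A navigational anomaly such as "no click-path between the laptop sub-site and the CD-ROM drive sub-site" then corresponds to `has_click_path` returning False for the two entry pages.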

Anomalies can also be composed of more than one field of interest; for example, 'ProductName' = 'Desktop' and 'ProductPrice' ≤ $100. Generalizing this concept, an anomaly can be considered as a query with a set of constraints connected by logical operators. On execution, the queries may retrieve some results, which can then be classified as anomalies or non-anomalies by the user.
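
The view of an anomaly as a query over constraints can be made concrete with a small Python sketch. It is only an illustration under stated assumptions, not the query facility of the eShopmonitor. It assumes each page's mined fields are held in a dictionary, and the helper names (`constraint`, `all_of`, `any_of`, `negate`, `run_query`) are hypothetical.

```python
import operator

# Comparison operators named in the text: =, >=, <=, and substring.
OPS = {
    "=": operator.eq,
    ">=": operator.ge,
    "<=": operator.le,
    "substring": lambda value, needle: needle in value,
}


def constraint(field, op, expected):
    """Build a predicate over one page record (dict of field name -> value)."""
    compare = OPS[op]

    def check(record):
        # A field that was not mined from the page cannot satisfy a comparison.
        return field in record and compare(record[field], expected)

    return check


def all_of(*preds):
    """Logical AND of constraints."""
    return lambda record: all(p(record) for p in preds)


def any_of(*preds):
    """Logical OR of constraints."""
    return lambda record: any(p(record) for p in preds)


def negate(pred):
    """Logical NOT of a constraint."""
    return lambda record: not pred(record)


def run_query(pages, pred):
    """Return the URLs of pages whose mined record satisfies the query."""
    return [url for url, record in pages.items() if pred(record)]
```

With this sketch, the example from the text would be written as `all_of(constraint("ProductName", "=", "Desktop"), constraint("ProductPrice", "<=", 100))`. The pages returned by `run_query` are only candidates; as noted above, the user still decides which of them are real anomalies.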

From the usage point of view, note that such a tool is of interest to the following kinds of users:

1. The Webmaster, who wants to eliminate broken links, internal server errors, etc.

2. The content manager, who wants to ensure that a) correct information is displayed, b) there is no missing information, c) there is no mismatching of information, and d) there are no anomalous changes in any field of interest.

3. The marketing researcher, who can use such a tool to study a competitor's Web site. For example, the laptop promotions offered by a competitor over the last month or so can be followed.

We have built the eShopmonitor to perform these tasks in a comprehensive end-to-end manner. The solution comprises three major components: a crawler, which retrieves pages of interest to the user; a miner, which allows the user to specify the fields of interest in the different kinds of pages and subsequently extracts these fields from the crawled pages; and a reporter, which generates reports on the gathered information. Further, the crawled data can be compared with snapshots of previously crawled data in order to detect anomalies and/or interesting changes in the fields of interest. To allow this, the eShopmonitor stores the last 30 crawls.
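
The snapshot comparison described above can be sketched as follows. This is a minimal illustration, not the storage or reporting code of the eShopmonitor. It assumes each crawl snapshot is a dictionary mapping a page URL to its mined fields. The class `CrawlHistory`, the function `diff_field`, and the default 10% threshold (taken from the 'ProductPrice' example earlier) are hypothetical choices.

```python
from collections import deque


class CrawlHistory:
    """Keeps only the most recent crawl snapshots (30 by default)."""

    def __init__(self, max_crawls=30):
        # A bounded deque silently discards the oldest snapshot when full.
        self._snapshots = deque(maxlen=max_crawls)

    def add(self, snapshot):
        self._snapshots.append(snapshot)

    def __len__(self):
        return len(self._snapshots)

    def latest_pair(self):
        """Return (previous, current) snapshots, or None if fewer than two exist."""
        if len(self._snapshots) < 2:
            return None
        return self._snapshots[-2], self._snapshots[-1]


def diff_field(old, new, field, threshold=0.10):
    """Report pages where a numeric field went missing or changed by more
    than `threshold` (as a fraction of the old value) between two snapshots."""
    report = []
    for url, old_record in old.items():
        old_value = old_record.get(field)
        if old_value is None:
            continue  # nothing to compare against
        new_value = new.get(url, {}).get(field)
        if new_value is None:
            report.append((url, "missing"))
        elif old_value != 0 and abs(new_value - old_value) / abs(old_value) > threshold:
            report.append((url, "changed"))
    return report
```

A reporter built along these lines would call `latest_pair()` after each crawl and pass the two snapshots to `diff_field` for every numeric field of interest.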


 


Content provided in partnership with ProQuest