Lessons learned with Arc, an OAI-PMH service provider

Library Trends, Spring, 2005 by Xiaoming Liu, Kurt Maly, Michael L. Nelson, Mohammad Zubair

[FIGURE 3 OMITTED]

But the OAI-PMH model saves the considerable overhead of establishing TCP/IP and HTTP connections for documents that have not changed. Instead of asking each of the 100 files whether its modification date has changed, the harvester asks the OAI-PMH interface which files have changed, and the interface responds with only the files that meet the criteria (see Figure 4).
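As a minimal sketch of the single request described above, the following Python builds an OAI-PMH ListRecords URL that asks a repository for everything changed since a given date. The verb and the `from` and `metadataPrefix` arguments come from the protocol; the base URL `http://example.org/oai` and the helper name `list_records_url` are hypothetical and used only for illustration.

```python
from urllib.parse import urlencode


def list_records_url(base_url, since, metadata_prefix="oai_dc"):
    """Build a ListRecords request for records changed on or after `since`.

    `since` is a UTC date string in YYYY-MM-DD form, the day-level
    granularity every OAI-PMH repository is required to accept.
    """
    params = {
        "verb": "ListRecords",
        "metadataPrefix": metadata_prefix,
        "from": since,
    }
    return base_url + "?" + urlencode(params)


# Hypothetical repository address, for illustration only.
url = list_records_url("http://example.org/oai", "2005-01-01")
print(url)
```

One such request replaces the per-file modification checks a conventional crawler would issue, which is the saving the paragraph above describes.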

[FIGURE 4 OMITTED]

Given the parameters stated above, Table 2 shows the relative load placed by each method. If the Web site is larger, say 1,000 or 10,000 files, the unnecessary network traffic avoided with OAI-PMH would be even greater. Even if Web sites updated their content (or added new content) more rapidly, the OAI-PMH approach would still reduce the number of connections by a factor equal to the number of files batched together in each response (in this example, a factor of 10). Not only would this reduce the network load for the robots and Web sites, it would also allow for much quicker harvesting of updates and thus more up-to-date Web indices.
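The connection counts behind this comparison can be sketched with a few lines of arithmetic. This is an illustrative model, not a reproduction of Table 2: the batch size of 10 is taken from the example in the text, while the share of files that changed (10 percent here) is an assumed value, and both function names are hypothetical.

```python
import math


def conventional_connections(total_files):
    # A conventional robot checks every file, changed or not:
    # one connection per file.
    return total_files


def oai_pmh_connections(changed_files, batch_size):
    # The harvester always issues at least one request to ask what changed,
    # then one request per batch of changed records.
    return max(1, math.ceil(changed_files / batch_size))


BATCH_SIZE = 10  # records per response, as in the example in the text

for total in (100, 1000, 10000):
    changed = total // 10  # ASSUMPTION: 10% of files changed since last visit
    print(total,
          conventional_connections(total),
          oai_pmh_connections(changed, BATCH_SIZE))
```

Even in the worst case, where every file has changed, `oai_pmh_connections(total, BATCH_SIZE)` is smaller than the conventional count by the batch size, which is the factor-of-10 reduction stated above.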

Data providers differ in data volume, partition definition, service implementation quality, and network connection quality, and all of these factors influence the harvesting procedure. Harvesting historical data and harvesting newly published data also have different requirements. When a service provider harvests a data provider for the first time, all past data (historical data) needs to be harvested, followed by periodic harvesting to keep the data current. Historical data harvests are high volume and more stable. The harvesting process can run once or, as is usually preferred by large archives, as a sequence of chunk-based harvests to reduce data provider overhead. For newly published data, volume is not a major problem, but the scheduler must be able to harvest new data as soon as possible and guarantee completeness. The OAI-PMH provides flexibility in choosing the harvesting strategy, although optimizing it remains an open question. (1)
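One way to picture a chunk-based historical harvest is to split the archive's date range into consecutive windows and issue one selective request per window, using the protocol's `from` and `until` arguments (both inclusive). The sketch below only computes the windows; the 30-day chunk size and the function name `harvest_windows` are assumptions for illustration, not a description of how Arc schedules its harvests.

```python
from datetime import date, timedelta


def harvest_windows(start, end, days_per_chunk):
    """Yield (from, until) date strings covering start..end with no gaps.

    Both bounds of each window are inclusive, matching the semantics of the
    OAI-PMH `from` and `until` arguments, so consecutive windows do not
    overlap and no day is skipped.
    """
    one_day = timedelta(days=1)
    cursor = start
    while cursor <= end:
        until = min(cursor + timedelta(days=days_per_chunk) - one_day, end)
        yield cursor.isoformat(), until.isoformat()
        cursor = until + one_day


# ASSUMPTION: a three-month backlog harvested in 30-day chunks.
windows = list(harvest_windows(date(2004, 1, 1), date(2004, 3, 31), 30))
for from_date, until_date in windows:
    print(from_date, until_date)
```

Once the backlog is covered, the same function with a one-day chunk ending today would describe the periodic harvest of newly published data.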

Scalability Through Hierarchical Harvesting and "Aggregators"

A service provider can also act as a data provider, disseminating metadata harvested from other data providers. This allows for the hierarchical harvesting of content and removes the limitation of having all data providers at the same "level." This structure offers a great deal of flexibility in how information is filtered and interconnected between data providers and service providers. While hierarchical harvesting was not originally part of the OAI-PMH, there was nothing in the protocol that prohibited it. Arc was the first service provider to introduce hierarchical harvesting, and services that provide hierarchical harvesting are now known as "aggregators."

Aggregators may normalize, correct, transform, or otherwise change the harvested metadata. Thus, the re-exposed data might not be the same data harvested from the original data providers. Unless the metadata exposed by the aggregator is completely unchanged, the aggregator must issue new identifiers for the OAI-PMH records it makes available for harvesting. The OAI-PMH defines provenance containers to assist in the de-duplication of metadata harvested from various sources. Guidelines have since been written to assist in the development of OAI-PMH proxies, caches, and aggregators (Lagoze, Van de Sompel, Nelson, & Warner, 2002b).
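The identifier and provenance rule can be sketched as follows. This is a simplified in-memory model, not Arc's implementation: the `Record` class, the `re_expose` function, the identifier prefix scheme, and the dictionary layout are all hypothetical. The provenance keys are loosely modeled on the origin description in the OAI provenance guidelines, and the choice to restamp altered records with the harvest date is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class Record:
    identifier: str
    datestamp: str
    metadata: dict
    provenance: list = field(default_factory=list)


def re_expose(record, origin_base_url, harvest_date, transform, id_prefix):
    """Return the record an aggregator would expose for a harvested record."""
    new_metadata = transform(dict(record.metadata))
    altered = new_metadata != record.metadata
    # Remember where the record came from so that downstream harvesters can
    # detect duplicates arriving through different paths.
    origin = {
        "baseURL": origin_base_url,
        "identifier": record.identifier,
        "datestamp": record.datestamp,
        "harvestDate": harvest_date,
        "altered": altered,
    }
    return Record(
        # A new identifier is issued only when the metadata was changed.
        identifier=id_prefix + record.identifier if altered else record.identifier,
        datestamp=harvest_date if altered else record.datestamp,
        metadata=new_metadata,
        provenance=record.provenance + [origin],
    )


def strip_whitespace(metadata):
    # Example normalization step: trim stray whitespace from every value.
    return {key: value.strip() for key, value in metadata.items()}


harvested = Record("oai:example.org:1", "2004-05-01", {"title": " Arc "})
exposed = re_expose(harvested, "http://example.org/oai", "2005-01-15",
                    strip_whitespace, "oai:aggregator.example:")
print(exposed.identifier, exposed.provenance[0]["altered"])
```

Because each hop appends its own origin entry, a record that passes through several aggregators carries its full harvesting history with it.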

 


Content provided in partnership with Thompson Gale