Lessons learned with Arc, an OAI-PMH service provider

Library Trends, Spring, 2005 by Xiaoming Liu, Kurt Maly, Michael L. Nelson, Mohammad Zubair

* harvesting and parsing richer metadata formats and full text besides DC

* automatic citation linking across heterogeneous digital archives (APS, arXiv, and other physics collections)

* normalizing and automating the generation of missing metadata fields (for example, subject)

* equation searching--displaying and searching equations that are embedded in metadata fields such as title or abstract in formats such as LaTeX or MathML. Equations in these formats are not easy for users to browse or view.

* post-processing after search--the varying quality of metadata makes it difficult to build a unified search interface. Post-processing provides an alternative way to take advantage of richer metadata sets without a complex search interface.

The development of the Archon system makes it clear that a rich search environment can be developed in a metadata harvesting system. However, it also reveals a lack of standards in this environment for metadata formats, controlled vocabularies, citation and reference information, and equation expression, which places the burden on service providers to understand all proprietary formats.

DP9

DP9 is an open-source gateway service that allows general search engines (for example, Google and Inktomi) to index OAI-PMH-compliant archives (Liu, Maly, Zubair, & Nelson, 2002). DP9 does this by providing a persistent URL for each repository record and, when that URL is requested, converting it into an OAI-PMH request to the appropriate repository. Search engines that do not support the OAI-PMH can thus index the "deep Web" contained within OAI-PMH-compliant repositories.
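The translation from a persistent URL to an OAI-PMH request can be illustrated with a minimal sketch. This is not DP9's actual code: the path layout (`/getrecord/<metadataPrefix>/<repository>/<identifier>`), the `BASE_URLS` registry, and the function name `to_oai_request` are assumptions made for illustration. Only the `GetRecord` verb and its `identifier` and `metadataPrefix` arguments come from the OAI-PMH protocol itself.

```python
from urllib.parse import urlencode

# Hypothetical registry: repository name -> OAI-PMH base URL.
BASE_URLS = {"example-archive": "http://archive.example.org/oai"}


def to_oai_request(persistent_path: str) -> str:
    """Translate a gateway path into an OAI-PMH GetRecord request URL.

    Assumed path layout (not taken from DP9):
        /getrecord/<metadataPrefix>/<repository>/<identifier>
    """
    # Split at most three times so identifiers containing "/" stay intact.
    parts = persistent_path.strip("/").split("/", 3)
    if len(parts) != 4 or parts[0] != "getrecord":
        raise ValueError(f"unrecognized persistent path: {persistent_path!r}")
    _, prefix, repository, identifier = parts

    if repository not in BASE_URLS:
        raise KeyError(f"unknown repository: {repository!r}")

    # GetRecord requires both 'identifier' and 'metadataPrefix'.
    query = urlencode(
        {"verb": "GetRecord", "identifier": identifier, "metadataPrefix": prefix}
    )
    return f"{BASE_URLS[repository]}?{query}"


url = to_oai_request("/getrecord/oai_dc/example-archive/oai:example.org:123")
```

Under these assumptions, a crawler only ever sees the stable gateway path; the OAI-PMH request is built on demand when the path is fetched.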

Currently, indexing OAI collections via an Internet search engine is difficult because Web crawlers cannot access the full content of an archive, are unaware of OAI-PMH, and cannot handle XML content very well. DP9 solves these problems by defining persistent URLs for all OAI-PMH records and dynamically creating a series of HTML pages according to a crawler's requests. DP9 provides an entry page and, if a Web crawler finds this entry page, the crawler can follow the links on this page and index all records in a data provider. DP9 also supports a simple name resolution service: once an OAI Identifier is given, DP9 responds with an HTML page, a raw XML file, or forwards the request to the appropriate data provider.
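As a rough sketch of the crawler-facing side, the snippet below renders a plain HTML page whose links lead to one record each, so that a crawler following the links reaches every record. It is illustrative only and does not reproduce DP9's pages: the function name `render_link_page` and the link layout `/getrecord/<metadataPrefix>/<repository>/<identifier>` are hypothetical, and the identifiers are assumed to have been obtained already (for example, from an OAI-PMH `ListIdentifiers` response).

```python
from html import escape
from urllib.parse import quote


def render_link_page(repository: str, identifiers: list[str],
                     prefix: str = "oai_dc") -> str:
    """Build a crawler-friendly HTML page linking to each record.

    The link layout is an assumption for illustration, not DP9's scheme.
    """
    items = []
    for ident in identifiers:
        # Percent-encode each path segment; keep ":" readable in identifiers.
        href = (f"/getrecord/{quote(prefix, safe='')}/"
                f"{quote(repository, safe='')}/{quote(ident, safe=':')}")
        items.append(f'<li><a href="{escape(href)}">{escape(ident)}</a></li>')
    return "<html><body><ul>\n" + "\n".join(items) + "\n</ul></body></html>"


page = render_link_page("example-archive",
                        ["oai:example.org:1", "oai:example.org:2"])
```

Plain anchor tags are used deliberately here, since the text notes that crawlers handle XML poorly and cannot issue OAI-PMH requests themselves.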

Various caching mechanisms are also implemented in the DP9 service to make it more crawler friendly. However, because of the limitations of Web-crawling technology, there is no guarantee that a document will be indexed by any given Web crawler. Further research, such as the mod_oai project (http://www.modoai.org/), is underway to ensure better integration between OAI-PMH and Web-crawling technology.

Digital Library Grid

When dealing with a large number of data providers and documents, we discovered that it is necessary to parallelize each individual module of the Arc system, such as metadata harvesting, indexing, and searching. This leads to the Mellon-funded Digital Library Grid project (http://saturn.seven.research.odu.edu/grid/index_new). The objective of the Digital Library Grid project is to develop a high-performance federated search service that exploits the resources of a grid. It will make available a large amount of information that is distributed amongst heterogeneous digital libraries. In this project, we are developing the software tools to
