Lessons learned with Arc, an OAI-PMH service provider
Library Trends, Spring, 2005 by Xiaoming Liu, Kurt Maly, Michael L. Nelson, Mohammad Zubair
* harvesting and parsing richer metadata formats and full text besides DC
* automatic citation linking across heterogeneous digital archives (APS, arXiv, and other physics collections)
* normalizing and automating the generation of missing metadata fields (for example, subject)
* equation searching--displaying and searching equations that are embedded in metadata fields such as title or abstract in formats such as LaTeX or MathML. Equations in these formats are not easy for users to browse or view.
* post-processing after search--the varying quality of metadata makes it difficult to build a unified search interface. Post-processing provides an alternative way to take advantage of richer metadata sets without a complex search interface.
The development of the Archon system makes it clear that a rich search environment can be developed in a metadata harvesting system. However, it also reveals a lack of standards in this environment, such as metadata formats, controlled vocabulary, citation and reference information, and standard equation expression, which places the burden on service providers to understand all proprietary formats.
DP9
DP9 is an open-source gateway service that allows general search engines (for example, Google and Inktomi) to index OAI-PMH-compliant archives (Liu, Maly, Zubair, & Nelson, 2002). DP9 does this by providing a persistent URL for each repository record and, when that URL is requested, converting it into an OAI-PMH request to the appropriate repository. Search engines that do not support the OAI-PMH can thus index the "deep Web" contained within OAI-PMH-compliant repositories.
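The URL-to-request conversion can be sketched as follows. This is a minimal illustration, not DP9's actual code: the `/dp9/getrecord/<metadataPrefix>/<identifier>` path layout, the `REPOSITORIES` table, the `example.org` base URL, and the function name are all assumptions made for this example. Only the `GetRecord` verb and its `identifier` and `metadataPrefix` arguments come from the OAI-PMH specification.

```python
from urllib.parse import unquote, urlencode

# Hypothetical table: repository name (the middle part of an identifier of
# the form oai:<repository>:<local-id>) -> that repository's OAI-PMH base URL.
REPOSITORIES = {
    "example.org": "http://example.org/oai",
}

# Hypothetical persistent-URL layout: /dp9/getrecord/<metadataPrefix>/<identifier>
RECORD_PREFIX = "/dp9/getrecord/"


def to_oai_request(path: str) -> str:
    """Translate a persistent gateway path into an OAI-PMH GetRecord URL."""
    if not path.startswith(RECORD_PREFIX):
        raise ValueError(f"not a record URL: {path}")

    # Split "<metadataPrefix>/<identifier>" at the first slash only.
    metadata_prefix, _, identifier = path[len(RECORD_PREFIX):].partition("/")
    identifier = unquote(identifier)

    parts = identifier.split(":", 2)
    if len(parts) != 3 or parts[0] != "oai":
        raise ValueError(f"unexpected identifier form: {identifier}")
    if parts[1] not in REPOSITORIES:
        raise ValueError(f"unknown repository: {parts[1]}")

    query = urlencode({
        "verb": "GetRecord",
        "identifier": identifier,
        "metadataPrefix": metadata_prefix,
    })
    return f"{REPOSITORIES[parts[1]]}?{query}"


print(to_oai_request("/dp9/getrecord/oai_dc/oai:example.org:1234"))
```

Because the persistent URL carries everything needed to rebuild the request, a gateway built this way would not need to store the records themselves; it only needs to know each repository's base URL.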
Currently, indexing OAI collections via an Internet search engine is difficult because Web crawlers cannot access the full content of an archive, are unaware of OAI-PMH, and do not handle XML content well. DP9 solves these problems by defining persistent URLs for all OAI-PMH records and dynamically creating a series of HTML pages according to a crawler's requests. DP9 provides an entry page and, if a Web crawler finds this entry page, the crawler can follow the links on this page and index all records in a data provider. DP9 also supports a simple name resolution service: once an OAI Identifier is given, DP9 responds with an HTML page, a raw XML file, or forwards the request to the appropriate data provider.
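The crawler-facing pages described here could be generated along the following lines. This is a sketch under stated assumptions rather than DP9's implementation: it assumes the record identifiers have already been obtained from the data provider (for instance with an OAI-PMH `ListIdentifiers` request), and the `/dp9/getrecord/oai_dc/` and `/dp9/list/<n>` URL layouts, the page size, and the function name are hypothetical.

```python
from html import escape
from urllib.parse import quote


def entry_page(identifiers, page, page_size=100,
               record_base="/dp9/getrecord/oai_dc/",  # hypothetical layout
               list_base="/dp9/list/"):               # hypothetical layout
    """Render one page of plain HTML links that a Web crawler can follow."""
    start = page * page_size
    chunk = identifiers[start:start + page_size]

    # One ordinary <a href> per record, so no OAI-PMH or XML support is needed.
    items = [
        f'<li><a href="{record_base}{quote(ident, safe=":")}">'
        f"{escape(ident)}</a></li>"
        for ident in chunk
    ]

    # Link to the next page only if more records remain.
    if start + page_size < len(identifiers):
        items.append(f'<li><a href="{list_base}{page + 1}">next</a></li>')

    return "<html><body><ul>\n" + "\n".join(items) + "\n</ul></body></html>"


ids = ["oai:example.org:1", "oai:example.org:2", "oai:example.org:3"]
print(entry_page(ids, page=0, page_size=2))
```

Splitting the list into pages keeps each HTML document small, and the "next" link lets a crawler that starts at the first page eventually reach every record.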
Various caching mechanisms are also implemented in the DP9 service to make it more crawler friendly. However, due to the limitations of Web-crawling technology, there is no guarantee that a document will be indexed by any given Web crawler. Further research, such as the mod_oai project (http://www.modoai.org/), is underway to ensure better integration between OAI-PMH and Web-crawling technology.
Digital Library Grid
When dealing with a large number of data providers and documents, we discovered that it is necessary to parallelize each individual module of the Arc system, such as metadata harvesting, indexing, and searching. This led to the Mellon-funded Digital Library Grid project (http://saturn.seven.research.odu.edu/grid/index_new). The objective of the Digital Library Grid project is to develop a high-performance federated search service that exploits the resources of a grid. It will make available a large amount of information that is distributed among heterogeneous digital libraries. In this project, we are developing the software tools to



