Web Archiving for the Rest of Us: How to Collect and Manage Websites Using Free and Easy Software

Computers in Libraries, Sep 2009 by Dunn, Katharine, Szydlowski, Nick

If you want to maintain your web archive into the future, you will nee

How ephemeral is content on the internet? Can a library reasonably expect to collect each new version of its website or blog as it would collect every issue of a magazine or journal? Large-scale projects such as the Internet Archive (www.archive.org) send out crawlers to gather snapshots of much of the web. This massive collection of archived websites may include content of interest to your patrons. But if you want to control exactly when and what is archived, relying on someone else to do the archiving isn't ideal.

Though large-scale web archiving and preservation are outside the mission of most libraries, it is possible for even the smallest operations to maintain an archive of a group of sites. This article outlines a simple workflow for Macs that uses free software and requires only the most basic technical expertise. There is no programming involved, and you won't have to touch a command-line interface. This is web archiving for the rest of us.

We developed this process in response to an assignment at the Simmons College Graduate School of Library and Information Science. At Simmons, Katharine is a student and an editorial fellow, and Nick is a student while he works as a library assistant in preservation services at the Massachusetts Institute of Technology (MIT) Libraries. We have put this process to the test on relatively static webpages, blogs, and even the social networking site Twitter.

Workflow

There are just three things you need to do to create an archive of websites. (Preservation is another matter; we'll discuss this briefly at the end of the article.)

1. Harvesting: First you must acquire, or harvest, the content you are collecting. Because many websites change frequently, you will need to reacquire a current version of the site at some designated interval.

2. Version Control: Once you have harvested multiple versions of a site, you will need to keep track of the versions so you can determine which iterations are different enough from one another to be worth keeping.

3. Presentation: The reason to archive sites is to use them again, right? How you choose to make your new web archive available to your patrons will depend on what resources you have at your disposal. But it is possible to view the files on a computer (PC or Mac), burn them to a CD, or even place them on a server for remote viewing.
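Nothing in this workflow requires programming, but readers who are comfortable with a little scripting may find it helpful to see what the version-control step amounts to. The following is a minimal Python sketch, not part of the workflow described in this article: it assumes each harvest has been saved in its own folder, and it lists which files were added, removed, or changed between two harvests. The function names and folder names are hypothetical.

```python
import hashlib
from pathlib import Path


def snapshot_digests(folder):
    """Map each file's path (relative to the folder) to a SHA-256 digest of its contents."""
    root = Path(folder)
    return {
        p.relative_to(root).as_posix(): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def compare_snapshots(old_folder, new_folder):
    """Report which files differ between two harvested copies of a site."""
    old = snapshot_digests(old_folder)
    new = snapshot_digests(new_folder)
    return {
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        "changed": sorted(k for k in old.keys() & new.keys() if old[k] != new[k]),
    }
```

Calling compare_snapshots("harvest-old", "harvest-new") on two such folders returns three lists. If all three are empty, the newer harvest is identical to the older one and may not be worth keeping.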

Harvesting

To acquire entire sites efficiently, we chose a piece of software called SiteSucker. This software was designed to rapidly download entire websites and place the files on your hard drive in a folder structure that mirrors the way the site was arranged on its server. SiteSucker is donation-ware and can be downloaded from www.sitesucker.us. PC and Linux users will find that the free software HTTrack (available at www.httrack.com) offers similar functionality.

To download a site using SiteSucker, simply type the URL of the site into the box marked "Web URL" and press the "Download" button. SiteSucker will follow the links on each page, downloading every file that a visitor to the site could reach and saving it to the folders you designate on your hard drive.
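To make the idea of a mirrored folder structure concrete, here is a small Python sketch of how a page's URL might map to a location on disk. It illustrates the general idea only and is not SiteSucker's actual logic. The function name mirror_path, the root folder name "archive", and the choice to save directory-style URLs as index.html are all assumptions made for the example.

```python
from pathlib import PurePosixPath
from urllib.parse import urlsplit


def mirror_path(url, root="archive"):
    """Return a local path that mirrors the URL's host and folder structure."""
    parts = urlsplit(url)
    path = parts.path
    # Assumption: a URL that names a folder is stored as that folder's index.html.
    if path == "" or path.endswith("/"):
        path += "index.html"
    return str(PurePosixPath(root, parts.netloc, path.lstrip("/")))


print(mirror_path("http://www.example.org/about/staff.html"))
print(mirror_path("http://www.example.org/blog/"))
```

Under these assumptions, the first call yields archive/www.example.org/about/staff.html and the second yields archive/www.example.org/blog/index.html, so the saved copy can be browsed folder by folder just as the site was arranged on its server.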

Clicking on the "Settings" icon gives you much greater control over the downloaded files. For example, it is possible to specify which file extensions the program should download and which it should ignore. Use the settings to limit or expand the number of files downloaded and to ensure that you get the files you need. It is possible to save your settings so that each time you download a page, SiteSucker performs the same tasks.
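The effect of a file-extension setting can be pictured with a short Python sketch. This is a hypothetical illustration of extension filtering in general, not a description of how SiteSucker implements its settings. The list of allowed extensions and the decision to accept URLs with no extension at all (which are often pages) are assumptions.

```python
from pathlib import PurePosixPath
from urllib.parse import urlsplit

# Assumed example list; a real archive would tailor this to the sites it collects.
ALLOWED_EXTENSIONS = {".html", ".htm", ".css", ".jpg", ".png", ".pdf"}


def should_download(url, allowed=ALLOWED_EXTENSIONS):
    """Return True if the URL's file extension is on the allowed list."""
    suffix = PurePosixPath(urlsplit(url).path).suffix.lower()
    # Assumption: URLs without an extension are treated as pages and kept.
    return suffix == "" or suffix in allowed
```

With this list, a link to a .mov file would be skipped while pages, style sheets, images, and PDFs would be kept. Shortening the list limits the harvest, and lengthening it expands the harvest.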

When you're learning to use SiteSucker, it is best to start with a target website that consists primarily of plain HTML rather than a site where content is held in a database and has pages generated by scripts. If you have trouble determining the difference, look for pages that end with the file extensions .htm or .html - these are more likely to be plain HTML. SiteSucker cannot follow links that appear within JavaScript, so some types of sites will not render correctly when harvested with this program. We discuss the details of using SiteSucker on database-backed sites later.

Version Control

In the software development community, the term "version control" refers to the process of managing the updates of files and documents, such as software code, that may be edited or changed many times by multiple users. Version control software such as CVS or Subversion makes it possible to return to any past version of a file. In essence, version control gives you intellectual control over the content you collect - and if you have born-digital content, it can be essential when performing traditional library and archive tasks such as cataloging, appraisal, and providing access. Version control helps us reach one of the end goals of web archiving: making past versions available to users.

When we first started working on this project, we attempted to use Subversion, a popular and free version control package, to track different iterations of the sites we downloaded. However, as nonprogrammers unaccustomed to using a command-line interface, we struggled to produce usable results with Subversion. As a tool created for the software development community, it is perhaps best suited for those with some programming skills.


 


Content provided in partnership with ProQuest