Review of and notes from Julien Masanes's chapter on web archiving in Deegan and Tanner's Digital Preservation (2006). This book also contains a chapter by Elisa Mason looking at some web archiving case studies. Available at Amazon.

Masanes asserts that the web is a dynamic information space; web archiving consists of capturing a local, limited version of one small part of this space, and freezing it. This is well tricky, for various reasons. 

  • Much of the web’s content is dynamic: a URI can point not only to a simple flat page, but also to a virtual page, really just a set of parameters that the server interprets to generate the displayed content.
  • The technologies behind these dynamic pages themselves evolve rapidly, because it is so easy for people to code up a new protocol and share it around.
  • The concept of a definitive or final version does not apply to the web: web documents are always prone to deletion or updating. Any preservation undertaken is therefore by its nature a sampling done at a specific time.
  • The navigation paths are embedded within the documents themselves, so a web archive has to be constructed in such a way that this mechanism still works for preserved pages.
  • Automated crawlers or spiders often take a few days to navigate large websites. This can lead to temporal incoherence: by the time the crawler reaches a page linked from the home page, that page may have changed since the home page itself was crawled.
  • Size. Size. Size. An organisation’s website can easily run into hundreds of thousands of pages, comprising billions of individual objects such as graphics etc.

 

Gathering content should not be a real problem. It is a piece of cake to copy the files across from a server (although there is the possibility of failing to copy orphaned, un-linked pages, because http does not provide a full list of the documents within a directory, unlike ftp). No, the real problem lies in recreating the original architecture of the website. Because this architecture is dependent on a specific OS, server configuration and app environment, it would “even be difficult to re-create from scratch for their designers and managers” (p.84). Web archiving strategies usually transform the archived site into something a little more preservable.
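
Just to make the orphan-page point concrete to myself: a harvester can only follow links, so anything that nothing links to never gets copied. Here is a very rough sketch of the idea in Python (my own illustration, nothing from the chapter; the URL handling is deliberately naive):

```python
# My own illustration, not anything from the chapter: a naive link-following
# harvester. Because HTTP offers no directory listing, the only pages it can
# ever copy are those some other page links to.
import urllib.parse
import urllib.request
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect the href values of all <a> tags on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def harvest(start_url, limit=500):
    """Breadth-first copy of pages reachable from start_url (same host only)."""
    host = urllib.parse.urlparse(start_url).netloc
    queue, seen, pages = [start_url], set(), {}
    while queue and len(pages) < limit:
        url = queue.pop(0)
        if url in seen:
            continue
        seen.add(url)
        try:
            with urllib.request.urlopen(url) as response:
                html = response.read().decode("utf-8", errors="replace")
        except (OSError, ValueError):
            continue                        # dead link, server error, odd scheme
        pages[url] = html
        extractor = LinkExtractor()
        extractor.feed(html)
        for href in extractor.links:
            absolute = urllib.parse.urljoin(url, href).split("#")[0]
            if urllib.parse.urlparse(absolute).netloc == host:
                queue.append(absolute)
    return pages    # an orphaned, un-linked page simply never turns up here
```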

There are three available strategies.

Strategy 1: local file system archive 

This simply involves making a local copy of the site’s files and navigating through the copy in a pseudo-web manner, by pointing the browser at file:// URLs rather than http:// ones. All the links within the documents need converting to relative ones on the local system. (This is the approach I took myself with my Cardboard Stonehenge blog a few months ago, copying it onto my home PC’s hard drive and making the links relative.)

This is OK for small-ish sites but Masanes points out a few concerns. Firstly, file-naming conventions are different from web ones, so you might need to do more than just replace http with file. Secondly, making the embedded links relative might be non-trivial to do, and it certainly diminishes the authenticity of the preserved file as a record. Thirdly, the hierarchical structure of a file system may not be the best way to capture something as fluid and as dynamic as a website, anyway. A file system is designed for a particular task, and it can easily break down when dealing with billions of files.
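
To see why the link-rewriting step is fiddly, here is a crude sketch of what Strategy 1’s relativisation amounts to (my own illustration, assuming a hypothetical site at www.example.org; a real tool would also have to handle src attributes, query strings, badly formed markup and so on):

```python
# A crude, regex-based sketch of the relativisation step (my illustration,
# assuming a hypothetical site at www.example.org). It only touches fully
# qualified href links back into the site, which is exactly why the real
# job is harder than it looks.
import os
import re
import urllib.parse

SITE_HOST = "www.example.org"               # hypothetical host being archived
HREF_RE = re.compile(r'href="([^"]+)"')

def relativise(html, current_path):
    """Rewrite absolute links into SITE_HOST relative to current_path."""
    def repl(match):
        target = urllib.parse.urlparse(match.group(1))
        if target.netloc != SITE_HOST:
            return match.group(0)           # external or already-relative link
        local = target.path.lstrip("/") or "index.html"
        rel = os.path.relpath(local, os.path.dirname(current_path))
        return 'href="%s"' % rel
    return HREF_RE.sub(repl, html)

# A page saved as news/2008/april.html that links to
# http://www.example.org/contact.html ends up with href="../../contact.html".
```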

Strategy 2: web-served archive

This involves storing all the harvested files from a website within a single container file. The container captures relevant metadata too, including any transformations made. Offsets of individual records are stored in a separate searchable index. Requested records are passed from the file index system to a web server which then sends them to the client. This approach is more complicated and requires more software, but it can cope with enormous amounts of data (see the Internet Archive).
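
The container-plus-index idea is easier to grasp with a toy example. The real thing is the Internet Archive’s ARC/WARC container format; the sketch below is just my own Python illustration of writing records at known offsets and seeking back to them, with a web server assumed to sit in front of fetch_record and replay the bytes to the browser:

```python
# A toy, JSON-flavoured illustration of the container-plus-offset-index idea
# (mine, not the chapter's; the real containers are the ARC/WARC formats).
import json
import os

def append_record(container_path, index_path, url, body, metadata):
    """Append one harvested document to the container and index its offset."""
    with open(container_path, "ab") as container, open(index_path, "a") as index:
        container.seek(0, os.SEEK_END)
        offset = container.tell()
        header = json.dumps({"url": url, "length": len(body), **metadata})
        container.write(header.encode("utf-8") + b"\n" + body + b"\n")
        index.write("%s\t%d\n" % (url, offset))

def fetch_record(container_path, index_path, url):
    """Look the URL up in the index, seek to its offset, return the record."""
    with open(index_path) as index:
        offsets = dict(line.rsplit("\t", 1) for line in index.read().splitlines())
    with open(container_path, "rb") as container:
        container.seek(int(offsets[url]))
        header = json.loads(container.readline())
        return header, container.read(header["length"])

# e.g. append_record("site.dat", "site.idx", "http://www.example.org/",
#                    b"<html>...</html>", {"capture_date": "2008-04-30"})
```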

Strategy 3: non-web archive

This involves abandoning any attempt to navigate the preserved site through link-based hypertext and replacing it with something else: a catalogue-based logic, for example, or even transforming an entire site into a single flat-page PDF document. This makes sense only if the hypertext nature of the pages is irrelevant, for example for sites created by digitising a paper-based catalogue or collection.

Site-centric archiving 

The main sampling approach relevant to my research purposes is site-centric archiving (rather than topic- or domain-centric). Organisations often discover a legal obligation to keep a record of what their website claims or has claimed. Such sites can be archived using a simple website copier such as HTTrack, or the archiving functionality of their CMS.

Extensive or intensive?

A small site can be captured completely. But when faced with a massive amount of data (like the Internet itself) some sort of prioritisation has to be worked out. Because web crawlers do not cope well with the deep web, strategies like that adopted by the Internet Archive tend to go for extensive rather than intensive completeness.
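
In crawler terms the extensive choice boils down to something like a per-site page budget: take a bounded slice of many sites rather than exhausting any one of them. A toy sketch of my own, with a made-up budget figure:

```python
# A toy expression of the "extensive" choice (mine, with a made-up budget):
# cap the pages taken per site so the crawl spreads widely across many sites
# rather than digging deeply into any single one.
from collections import Counter

PER_SITE_BUDGET = 50                # hypothetical cap; smaller cap = more extensive
pages_taken = Counter()

def should_fetch(host):
    """Allow a fetch only while the site is still under its page budget."""
    if pages_taken[host] >= PER_SITE_BUDGET:
        return False                # budget spent: breadth beats depth here
    pages_taken[host] += 1
    return True
```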

Metadata…

The strategy and approach adopted need to be spelled out clearly in relevant metadata, alongside other metadata elements. Masanes has worked out a potential web archiving metadata scheme, the IIPC Web Archiving Metadata Set.
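
Purely to fix the idea in my own head, the kind of provenance worth recording might look like the record below. The field names are my own invention for illustration, not the actual IIPC element set:

```python
# Hypothetical provenance record for one harvested site. These field names
# are my own invention, NOT the IIPC Web Archiving Metadata Set.
capture_metadata = {
    "target_url": "http://www.example.org/",
    "capture_date": "2008-04-30T14:00:00Z",
    "harvesting_tool": "HTTrack 3.x",                  # assumed tool and version
    "strategy": "site-centric, local file system archive",
    "transformations": ["absolute links rewritten as relative"],
    "known_gaps": "database-driven (deep web) content not captured",
}
```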

Alan’s thoughts 

It would have been nice to see some discussion by Masanes about the technicalities of web archiving. Websites are accessed through browsers, and any web archive worthy of the name is going to be accessed through a browser too. This has worked OK so far because HTML hasn’t changed terribly much over the past 12 years. So your browser today can easily open a well-formed HTML page harvested from 1996. (A badly-formed page might still be unopenable, however.) But no one knows what is going to happen to HTML, Javascript, Flash, Silverlight etc. over the next 100 years. Or even just the next 12. The normal browser of 2098 might be unable to open today’s web pages even if they have been preserved. Instead, the 2098 browser will need to run an emulator so that it thinks it’s a 2008 browser. This might be an unlikely scenario, but it would be interesting to discuss just how unlikely.

It occurs to me too that there are IPR and moral issues affecting web preservation which do not affect paper preservation, nor (arguably) other forms of digital preservation. These issues arise from the web’s nature as a publishing medium.

To take an extreme example, it is not a crime for an archive repository simply to store a libellous document, as far as I am aware. It is not a crime either to make this document available to a researcher in a searchroom. But it would be a crime if the repository then published the text of the document, because it is the publication aspect which the law is interested in.

This has implications for a web-served archive, in that delivery of the document to the searcher takes the form of online publication. The Scientologists realised this, which is why they succeeded in getting the Internet Archive to remove harvested websites which were derogatory about Scientology. It is pointless arguing that a harvested site is “old”, because it’s the publication of it which matters, and online publication is always “now”. The Internet Archive might preserve a defunct harvested site from 1996, but when it makes the site available to a researcher today then it’s legally a live, current publication.
