Review of and notes from Julien Masanes's chapter on web archiving in Deegan and Tanner's Digital Preservation (2006). This book also contains a chapter by Elisa Mason looking at some web archiving case studies. 

Masanes asserts that the web is a dynamic information space; web archiving consists of capturing a local, limited version of one small part of this space, and freezing it. This is well tricky, for various reasons. 

  • Much of the web’s content is dynamic: a URI can link not only to a simple flat page, but also to a virtual page, which is really a set of parameters which is then interpreted to generate the displayed content.
  • The technologies behind these dynamic pages evolve rapidly, because it is so easy for people to code up a new protocol and share it around.
  • The concept of a definitive or final version does not apply to the web: web documents are always prone to deletion or updating. Any preservation undertaken is therefore by its nature a sampling done at a specific time.
  • The navigation paths are embedded within the documents themselves, so a web archive has to be constructed in such a way that this mechanism still works for preserved pages.
  • Automated crawlers or spiders often take a few days to navigate large websites. This can lead to temporal incoherence, in that a page linked from the home page may be different from how it was when the home page itself was crawled.
  • Size. Size. Size. An organisation’s website can easily run into hundreds of thousands of pages, comprising billions of individual objects such as graphics.
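
To make the point about embedded navigation paths a bit more concrete, here is a minimal sketch of link rewriting, assuming the crawler has already saved some pages locally. It is not taken from Masanes's chapter: the `archive/` directory layout, the function names and the example URLs are all hypothetical, and a real tool would use a proper HTML parser rather than a regular expression.

```python
import re
from urllib.parse import urljoin, urlparse

# Matches double-quoted href attributes only; good enough for a sketch.
HREF = re.compile(r'href="([^"]+)"')


def local_path(url: str) -> str:
    """Map an absolute URL to a (hypothetical) path inside the archive."""
    parts = urlparse(url)
    path = parts.path or "/"
    if path.endswith("/"):
        path += "index.html"
    return f"archive/{parts.netloc}{path}"


def rewrite_links(html: str, page_url: str, captured: set[str]) -> str:
    """Point href values at local copies when the target was captured."""

    def swap(match: re.Match) -> str:
        # Resolve relative links against the URL the page was crawled from.
        target = urljoin(page_url, match.group(1))
        if target in captured:
            return f'href="{local_path(target)}"'
        return match.group(0)  # leave links to uncaptured pages untouched

    return HREF.sub(swap, html)


page = '<a href="/about">About</a> <a href="http://other.example/">Other</a>'
captured = {"http://example.org/about"}
print(rewrite_links(page, "http://example.org/", captured))
```

Note that `local_path` ignores the query string, so two "virtual" pages generated from different parameters would collide on the same local file. That is exactly the dynamic-content problem from the first bullet, and this sketch does not attempt to solve it.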

Review in JSA vol 28 no 2, October 2007, by Caroline Shenton.

Brown’s purpose is to provide a broad overview of the subject, aimed at policy makers and webmasters, although Shenton points out that the book would be useful to ICT professionals too. Brown avoids discussing the details of technical methodologies, so that the book does not quickly become outdated. He covers the models and processes for selection, the main methods of web archiving, QA and cataloguing issues, legal issues, and some speculations about the future. Shenton thinks the only significant aspect Brown has missed is cost. Overall she rates it highly, so I’ll need to add this one to my reading list.

Digital Preservation (Digital Futures Series) (Hardcover), by Marilyn Deegan (Editor), Simon Tanner (Editor). Hardcover: 260 pages; Publisher: Facet Publishing (18 Sep 2006); ISBN-10: 1856044858. Available at Amazon.

This is the most recent book published in the UK on digital preservation, and if I can speak from a parochial viewpoint for a bit, it’s nice to have a UK slant on things, with details given about UK projects. This means that Digital Preservation contains some practical information which is not present in Borghoff et al.

By Stuart D. Lee, 2002. Reviewed by Richard M. Davis in JSA vol 23 no 2, 2002.

Aimed at librarians and information science students, so it deals mainly with material published in electronic formats, within a library context. Recommends using published, open standards for data storage and exchange, as the best way to preserve data beyond the life of the host system. ‘But of course publishers have much the same reservations about giving us those sorts of freedoms as record companies do about us ripping and burning our own CDs!’