You are currently browsing the monthly archive for March 2008.
The Koninklijke Bibliotheek in the Netherlands has produced a report Evaluating File Formats for Long-Term Preservation, available here, which introduces an evaluative scheme for assessing the fitness of a file format for preservation, and which then applies this scheme to two example formats, specifically MS Word 97-2003 doc format and PDF/A. Of course, identifying the winner of these two particular formats is easy (it might have been more interesting to see a closer contest such as ODF vs PDF/A) but it’s still an interesting exercise. The report was written by Judith Rog and Caroline van Wijk.
Each file format is awarded a score on a particular criterion, such as “adoption: world wide usage” or “robustness: support for file corruption detection” and so on. The scores are weighted and then added together to give a total score. This total score then provides a quantifiable evaluation of how useful the format is as a way to preserve digital information for the long term.
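The weighted-sum evaluation described above can be sketched in a few lines of Python. The criterion names, weights, and scores below are invented for illustration only, not the actual values from the KB report:

```python
# Illustrative sketch of a weighted-scoring evaluation.
# Criteria, weights, and scores are made-up examples, not the
# figures used by Rog and van Wijk.

def total_score(scores, weights):
    """Weighted sum of per-criterion scores for one format."""
    return sum(scores[c] * weights[c] for c in scores)

weights = {"adoption": 3, "robustness": 2}       # relative importance
pdf_a   = {"adoption": 8, "robustness": 9}       # hypothetical scores
doc     = {"adoption": 9, "robustness": 4}

print(total_score(pdf_a, weights))  # 8*3 + 9*2 = 42
print(total_score(doc, weights))    # 9*3 + 4*2 = 35
```

The point of the scheme is that the two totals are directly comparable, so the "winner" falls out of the arithmetic rather than a qualitative judgement.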
Review of and notes from Julien Masanes’s chapter on web archiving in Deegan and Tanner’s Digital Preservation (2006). This book also contains a chapter by Elisa Mason looking at some web archiving case studies. Available at Amazon.
Masanes asserts that the web is a dynamic information space; web archiving consists of capturing a local, limited version of one small part of this space, and freezing it. This is well tricky, for various reasons.
- Much of the web’s content is dynamic: a URI can point not only to a simple flat page, but also to a virtual page — in effect a set of parameters that the server interprets to generate the displayed content.
- The technologies behind these dynamic pages evolve rapidly, because it is so easy for people to code up a new protocol and share it around.
- The concept of a definitive or final version does not apply to the web: web documents are always prone to deletion or updating. Any preservation undertaken is therefore by its nature a sampling done at a specific time.
- The navigation paths are embedded within the documents themselves, so a web archive has to be constructed in such a way that this mechanism still works for preserved pages.
- Automated crawlers or spiders often take a few days to navigate large websites. This can lead to temporal incoherence, in that a page linked from the home page may be different from how it was when the home page itself was crawled.
- Size. Size. Size. An organisation’s website can easily run into hundreds of thousands of pages, comprising billions of individual objects such as graphics etc.
There are a handful of different strategies for archiving websites, of which a web-served archive is just one. The best-known example of a web-served archive is the Internet Archive.
The IA stores harvested websites in WARC container files. A WARC file holds a sequence of web pages and headers, the headers describing the content and length of each harvested page. A WARC can also contain secondary content, such as assigned metadata and transformations of original files.
Each record has an offset which is stored in an index ordered by URI. This means that it should be possible to rapidly extract individual files based on their URI. The selected files then get sent to a web server which forwards them to the client.
Doing it this way allows the naming of individual web pages to be preserved. It also scales up pretty well (the IA has a colossal amount of information).
Podcast available as a video podcast from the NEWS! section at http://www.liv.ac.uk/lucas/
Duranti points out that digital preservation places some new obligations upon archivists in addition to the ones recognised under paper preservation theory, mainly to do with authenticity. The archivist has to become a “designated trusted custodian,” with input into record decisions at the very beginning of the record lifecycle. Relevant traditional archivist responsibilities include: