The National Diet Library of Japan (NDL) is pleased to announce that “Ensuring long-term preservation and usability of digital information” has been published on its website. This page describes the need to ensure long-term preservation and accessibility of digital information, including Internet resources and packaged digital publications such as CDs, DVDs and software.

The main contents are as follows:

  • Summaries of studies conducted by the NDL for ensuring long-term access to digital information.
  • Introduction of the NDL Digital Archiving System.
  • Links to sources of information on international standards, guidelines, projects, papers and reports relating to the long-term use and preservation of digital information.
  • “The Long-term accessibility of packaged digital publications (NDL Research Report No.6)” is also available in English as a PDF file (518KB). It compiles the FY2003 and FY2004 studies on the usability of packaged digital publications.

This is the Paradigm (Personal Archives Accessible in DIGital Media) project workbook, published by the Bodleian Library in 2007. It’s available on the web, free, from the Paradigm project website, but you can now get a printed version too, which is much easier to read over a nice cup of coffee, especially as it is nearly 300 pages long.

Paradigm was a project exploring the issues involved in the long term preservation of personal digital archives, by examining in particular the archives of contemporary UK politicians. Politicians and their offices produce a chaotic welter of digital media in various formats and in a variety of states of semi-organisation, so the Paradigm project is extraordinarily useful to repositories having to deal with electronic media which they accession from outside bodies (like where I work). The workbook is enriched by a section on legal issues surrounding digital preservation, and the appendices contain paperwork templates for digital repositories, such as a model gift agreement. Top stuff.

On Monday I attended the What to Preserve? The Significant Properties of Digital Objects conference at the British Library conference centre, jointly organised by JISC, the BL and the DPC. It was particularly nice to meet some people there whom I had previously only known through email. Here are my own notes on some of what was discussed on the day.

Read the rest of this entry »

The Koninklijke Bibliotheek in the Netherlands has produced a report, Evaluating File Formats for Long-Term Preservation, available here, which introduces an evaluative scheme for assessing the fitness of a file format for preservation, and then applies this scheme to two example formats: MS Word 97-2003 doc format and PDF/A. Of course, picking the winner between these two particular formats is easy (it might have been more interesting to see a closer contest, such as ODF vs PDF/A), but it’s still an interesting exercise. The report was written by Judith Rog and Caroline van Wijk.

The scheme

Each file format is awarded a score against each criterion, such as “adoption: world wide usage” or “robustness: support for file corruption detection” and so on. The scores are weighted and then added together, and the resulting total provides a quantifiable evaluation of how suitable the format is for preserving digital information over the long term.
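
To make the arithmetic concrete, here is a minimal Python sketch of this kind of weighted scoring. The criterion names, weights and scores below are invented for illustration; they are not the values used in the KB report.

```python
# Illustrative only: criterion names, weights and scores are made up for
# this sketch, not taken from the KB report.
criteria = {
    # criterion: (weight, score awarded to the format on an assumed 0-2 scale)
    "adoption: world wide usage":                   (3, 2),
    "robustness: support for corruption detection": (2, 1),
    "openness: publicly available specification":   (3, 2),
}

total = sum(weight * score for weight, score in criteria.values())
maximum = sum(weight * 2 for weight, _ in criteria.values())

print(f"weighted total: {total} out of a possible {maximum}")
```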

Read the rest of this entry »

Review of and notes from Julien Masanes’s chapter on web archiving in Deegan and Tanner’s Digital Preservation (2006). This book also contains a chapter by Elisa Mason looking at some web archiving case studies. Available at Amazon.

Masanes asserts that the web is a dynamic information space; web archiving consists of capturing a local, limited version of one small part of this space, and freezing it. This is well tricky, for various reasons. 

  • Much of the web’s content is dynamic: a URI can link not only to a simple flat page but also to a virtual page, which is really a set of parameters that is interpreted on the fly to generate the displayed content.
  • The actual dynamics of these dynamic pages evolve rapidly, because it is so easy for people to code up a new protocol and share it around.
  • The concept of a definitive or final version does not apply to the web: web documents are always prone to deletion or updating. Any preservation undertaken is therefore by its nature a sampling done at a specific time.
  • The navigation paths are embedded within the documents themselves, so a web archive has to be constructed in such a way that this mechanism still works for preserved pages (see the link-rewriting sketch after this list).
  • Automated crawlers or spiders often take a few days to navigate large websites. This can lead to temporal incoherence, in that a page linked from the home page may be different from how it was when the home page itself was crawled.
  • Size. Size. Size. An organisation’s website can easily run into hundreds of thousands of pages, comprising billions of individual objects such as graphics etc.
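
As a small illustration of what keeping that navigation mechanism working can involve (this sketch is mine, not Masanes’s, and the archive prefix is made up), absolute links can be rewritten so that they point back into the archive rather than out to the live web:

```python
import re

# Hypothetical prefix under which the archived copies are served.
ARCHIVE_PREFIX = "/archive/20080215/"

def rewrite_links(html: str) -> str:
    """Rewrite absolute http(s) href/src attributes so navigation stays inside the archive."""
    return re.sub(
        r'(href|src)="(https?://[^"]+)"',
        lambda m: f'{m.group(1)}="{ARCHIVE_PREFIX}{m.group(2)}"',
        html,
    )

print(rewrite_links('<a href="http://example.org/page2.html">next</a>'))
# -> <a href="/archive/20080215/http://example.org/page2.html">next</a>
```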

Read the rest of this entry »

There are a handful of different strategies for archiving websites, of which a web-served archive is just one. The best example of a web-served archive is the Internet Archive.

Tech

The IA stores harvested websites in WARC container files. A WARC file holds a sequence of web pages and headers, with the headers describing the content and length of each harvested page. A WARC also contains secondary content, such as assigned metadata and transformations of the original files.
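
As a rough illustration of that record structure, here is a minimal sketch that walks the records of a made-up WARC file with the Python warcio library. This is not the IA’s own tooling, just one convenient modern way of peeking inside a WARC.

```python
from warcio.archiveiterator import ArchiveIterator

# Iterate over every record in a hypothetical WARC file and report what it holds.
with open("example.warc.gz", "rb") as stream:
    for record in ArchiveIterator(stream):
        uri = record.rec_headers.get_header("WARC-Target-URI")
        length = record.rec_headers.get_header("Content-Length")
        # 'response' records carry harvested pages; 'metadata' and 'conversion'
        # records carry the secondary content mentioned above.
        print(record.rec_type, uri, length)
```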

Each record has an offset which is stored in an index ordered by URI. This means that it should be possible to rapidly extract individual files based on their URI. The selected files then get sent to a web server which forwards them to the client.
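
A minimal sketch of that lookup step, using an invented in-memory index of (URI, offset) pairs rather than the IA’s real index format:

```python
import bisect

# Hypothetical index: byte offsets of records in site.warc, sorted by URI.
index = [
    ("http://example.org/", 0),
    ("http://example.org/about.html", 14532),
    ("http://example.org/img/logo.png", 40210),
]
uris = [uri for uri, _ in index]

def lookup(uri):
    """Binary-search the sorted index and return the record's offset, or None."""
    i = bisect.bisect_left(uris, uri)
    if i < len(uris) and uris[i] == uri:
        return index[i][1]
    return None

offset = lookup("http://example.org/about.html")
if offset is not None:
    with open("site.warc", "rb") as f:
        f.seek(offset)  # jump straight to the record
        # ...parse one WARC record from here and hand it to the web server
```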

Doing it this way allows the naming of individual web pages to be preserved. It also scales up pretty well (the IA has a colossal amount of information).

Read the rest of this entry »

The podcast is available as video from the NEWS! section at http://www.liv.ac.uk/lucas/

Duranti points out that digital preservation places some new obligations upon archivists in addition to the ones recognised under paper preservation theory, mainly to do with authenticity. The archivist has to become a “designated trusted custodian,” with input into record decisions at the very beginning of the record lifecycle. Relevant traditional archivist responsibilities include:

Read the rest of this entry »

Geoffrey Brown of the Indiana University Department of Computer Science has a nice presentation available online which talks about the CIC Floppy Disk Project, and which along the way argues the case for emulation. The CIC FDP is intended to make publications deposited with federal US libraries available via FTP over the Web. In many cases this means saving not just the contents of the floppy, but also the applications needed to make the contents readable. One of his diagrams makes the point that emulation results in two separate repositories, one for documents and the other for software.

The project doesn’t appear to be strict emulation, in that some leeway is allowed. For instance, slide no. 16 lists the software necessary for the project, one item of which is Windows 98, even though “most disks were for msdos, Win 3.1”. I take that to mean that while most floppies were created on Win 3.1, they work just as well in Win 98, so let’s use Win 98 instead. Strict emulation theory probably isn’t too happy with that.

Slide 21 is the most interesting as it contains a handy summary of the problems of migration:

  • Loss of information (e.g. word edits)
  • Loss of fidelity (e.g. “WordPerfect to Word isn’t very good”). WordPerfect is one of the apps listed earlier as necessary for their emulation.
  • Loss of authenticity: users of a migrated document need access to the original to verify authenticity [AA: but this depends on how you define authenticity, surely?]
  • Not always technically possible (e.g. closed proprietary formats)
  • Not always practically feasible (e.g. costs may be too high)
  • Emulation may be necessary anyway to enable migration.

Dioscuri is (as far as I am aware) the first ever emulator created specifically with long-term digital preservation in mind. It is available for download from Sourceforge, and the project’s own website is here.

This awesomely ambitious project began in 2005 as a co-operative venture between the Dutch National Archives and the Koninklijke Bibliotheek. The first working example came out in late 2007. The project has now been subsumed within the European PLANETS project.

Read the rest of this entry »