You are currently browsing the monthly archive for February 2008.
Geoffrey Brown of the Indiana University Department of Computer Science has a nice presentation available online about the CIC Floppy Disk Project, which along the way argues the case for emulation. The CIC FDP is intended to make publications deposited on floppy disk with US federal depository libraries available over the Web. In many cases this means saving not just the contents of the floppy, but also the applications needed to make those contents readable. One of his diagrams makes the point that emulation results in two separate repositories, one for documents and the other for software.
The project doesn’t appear to be strict emulation, in that some leeway is allowed. For instance, slide 16 lists the software necessary for the project, one item being Windows 98, even though “most disks were for MS-DOS, Win 3.1”. I take that to mean that while most floppies were created under Win 3.1, they work just as well in Win 98, so Win 98 is used instead. Strict emulation theory probably isn’t too happy with that.
Slide 21 is the most interesting as it contains a handy summary of the problems of migration:
- Loss of information (e.g. word edits)
- Loss of fidelity (e.g. “WordPerfect to Word isn’t very good”). WordPerfect is one of the apps listed earlier as necessary for their emulation.
- Loss of authenticity: users of a migrated document need access to the original to verify authenticity [AA: but this depends on how you define authenticity, surely?]
- Not always technically possible (e.g. closed proprietary formats)
- Not always practically feasible (e.g. costs may be too high)
- Emulation may be necessary anyway to enable migration.
Dioscuri is (as far as I am aware) the first ever emulator created specifically with long-term digital preservation in mind. It is available for download from Sourceforge, and the project’s own website is here.
This awesomely ambitious project began in 2005 as a co-operative venture between the Dutch National Archives and the Koninklijke Bibliotheek. The first working example came out in late 2007. The project has now been subsumed within the European PLANETS project.
Time for a round-up:
- Always keep the original bitstream of a digital resource, no matter what application it was created in, proprietary or not.
- Never let the user consult the original. She can consult a copy. This copy does not have to be in the same format as the original, unless the user herself demands it.
- When the time comes to migrate the data, migrate it to an open, XML-based format, if possible. But still keep the original.
- Keep metadata separate from the original data object.
- Store everything on a dedicated server, with backups, rather than offline media.
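The round-up above can be sketched as a small ingest routine. This is only an illustrative sketch, not anyone's actual repository software: the directory layout, file names and `metadata.json` format are all hypothetical, chosen just to show the original bitstream kept untouched, the user-facing copy kept separate, and the metadata kept apart from the data object.

```python
# Sketch of an ingest routine following the round-up principles above.
# The layout (original/, access/, metadata.json) is hypothetical.
import hashlib
import json
import shutil
from pathlib import Path

def ingest(source: Path, repository: Path) -> Path:
    """Deposit `source` into `repository`, preserving the original bytes."""
    item_dir = repository / source.stem
    (item_dir / "original").mkdir(parents=True, exist_ok=True)
    (item_dir / "access").mkdir(exist_ok=True)

    # 1. Always keep the original bitstream, byte for byte.
    original = item_dir / "original" / source.name
    shutil.copy2(source, original)

    # 2. The user consults a copy, never the original. (A real system
    #    might migrate this copy to another format; the original stays.)
    shutil.copy2(source, item_dir / "access" / source.name)

    # 3. Keep metadata separate from the original data object.
    metadata = {
        "source_name": source.name,
        "size_bytes": original.stat().st_size,
        "sha256": hashlib.sha256(original.read_bytes()).hexdigest(),
    }
    (item_dir / "metadata.json").write_text(json.dumps(metadata, indent=2))
    return item_dir
```

Note the access copy here is just a second byte-identical copy; the point is only that readers never touch the original, whatever format the access copy ends up in.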
And remember the two laws of digital preservation, which follow irrefutably from the fact that no one has ever come back from the future, or ever will, to tell us what will work:
No one knows anything.
No one will ever know anything.
That’s the position today, at least.
Publisher: John Wiley &amp; Sons (23 Jan 2004); ISBN-10: 0471453803. Available from Amazon.
OK, so it’s a book about digital security, not about digital preservation. But if there were a book on digital preservation as well written as this one, I doubt we would have any problems getting our message across. Well worth reading.
There are two particular aspects which jumped out as being indirectly relevant to digital preservation concerns, both to do with the interaction of humans with computers:
There is no such thing as a computer system; there are only computer-with-human systems. Well, I’m paraphrasing Schneier there, but it’s the sort of thing he would say, and he argues that it is the case. It is pointless to buy a digital security package and then leave the password on a Post-it note gummed to the monitor. It is pointless to invest in 128-bit encryption if the password you choose is your cat’s name. It is pointless to set up a cutting-edge firewall if you pay your staff so little that they will be bribed by a guy in the pub to burn the data onto a CD anyway. Schneier is making the point that an ICT system, by itself, is meaningless: it exists in a world full of humans, and we need to make sure the human elements are as trustworthy as the technical ones. This strikes me as being indirectly relevant to digital preservation. We argue lots about technical aspects – emulation, migration, file formats, metadata, XML etc – but we need to train ourselves up in human psychology and understand exactly how people will interact with our proposed systems.
Humans don’t do work on data; only programs do. (Another paraphrase there.) Schneier’s explicit point is about encryption, such as PGP. Very often you read statements like “Alice encrypts a message with Bob’s public key, which Bob can then decrypt because he has his own private key.” But in reality, nothing of the sort ever happens. Instead Alice presses a key on her computer. An application then encrypts the message. Nor does Bob decrypt. Instead he presses a key on his own computer, and the computer does the decryption. Alice is trusting her computer, her OS and the app to do their job, and trusting that the encryption software company haven’t rigged up a backdoor. Bob, too, is trusting a whole load of people he has never met, purely because he has bought their software.
There is an analogy here with digital preservation, as Schneier’s point can be extrapolated across to migration and emulation. When someone says “we can emulate X on Y”, what they actually mean is “there is a company claiming that X can be emulated on Y, and I am trusting them.” Or: “there is a company claiming that their software can automatically migrate 1,000,000 files from file format X to file format Y with no loss of information content, and I am trusting them.” Or: “there is a company claiming that their checksum software proves fixity in refreshing data, and I am trusting them.” Ultimately we do not trust the technology, we have to trust the people behind the technology.
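For the checksum case, one partial remedy is to verify fixity with a second, independent implementation rather than relying solely on the vendor's tool. Here is a minimal sketch using Python's standard `hashlib`, assuming a SHA-256 digest was recorded at ingest; note this only shifts the trust, from the vendor to `hashlib` and to whoever recorded the original digest, which is precisely Schneier's point.

```python
# Independent fixity check, assuming a SHA-256 digest recorded at ingest.
import hashlib
from pathlib import Path

def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash the file in chunks so large objects need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        while chunk := f.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()

def verify_refresh(path: Path, recorded_digest: str) -> bool:
    """True if the refreshed copy still matches the digest taken at ingest."""
    return sha256_of(path) == recorded_digest
```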
Back in January 2004 Brian Lavoie of the OCLC produced a Technology Watch Report for the Digital Preservation Coalition which gives an introduction to, and commentary on, the OAIS model. It’s a useful paper to read, as it is always interesting to discover what someone else thinks of OAIS and what its main aspects and implications are. The report is available on the DPC website here.
The points made by Brian which I found most illuminating are:
Intellectual Property Rights. It is not enough to acquire the objects themselves. If the OAIS has to migrate the object to a new format, it must have the explicit right to do so.
Designated Community. The DC lies at the core of the whole digital preservation exercise. “It is the scope of the Designated Community that determines both the contents of the OAIS and the forms in which the contents are preserved, such that they remain available to, and independently understandable by, the Designated Community.” In other words, the DC is not determined by the archive’s holdings, but vice versa.
Brian also points out the difference between Consumers and the DC, a distinction which had passed me by. The contents of an OAIS might be freely available to anyone, but the DC is only that group of individuals possessing sufficient specialised knowledge to use the archived objects without expert assistance. For instance, the DC for a financial organisation might be financial professionals. The Representation Information needs to be geared towards them, even if the same info can be viewed by the general public. If you actually define the DC as the general public then your RI needs to become much larger and more comprehensive.
Archival storage. Neither Producers nor Consumers directly access the AIPs, or have any form of direct contact with the Archival Storage entity.
The meaning of “OAIS-compliant.” Brian says this is actually a very vague term, and might only mean that a digital repository’s architecture and data model can be mapped across to OAIS in some way. Organisations sometimes claim OAIS compliance without clarifying what they actually mean.
Most creators of digital records do not care tuppence about the long term preservation of their documents, which is why people in the digi pres field continually try to raise awareness of the issues.
Which prompts a question – does successful emulation undermine our efforts? If the creators of records believe that someone 75 years from now will create a successful emulator which will run Excel 2003 (say), then there is no pressure on them to create their records now in any other format, is there? Creators can carry on creating records in closed, proprietary formats to their hearts’ content. Every new report of a successful emulation project is yet another nail in the coffin of trying to persuade creators to use different formats.
News article about this available here at Newsfactor.com.
The Blue Ribbon Task Force on Sustainable Digital Preservation and Access is yet another project looking at how we can store things for “aeons.” Although they have only just begun, it seems likely from the article that they will (a) recommend the migration route rather than the emulation one, and (b) suggest that the data be stored on a network of scattered digital repositories.
The e-GIF is the UK’s e-government interoperability framework. One of its guiding principles is that UK public sector institutions should not become dependent on non-interoperable software products, because that could lead to monopoly or market failure.
In practice e-GIF explicitly states that (a) UK bodies should adopt XML as the primary standard for data integration and presentation, and (b) they should adopt the e-Government Metadata Standard (eGMS) for metadata. Adherence to e-GIF is mandatory for UK public institutions. The e-GIF Accreditation Authority checks compliance.
Key eGIF documents are available here.
Approved file formats
The useful info for digital preservation is contained in Section 7 of the Technical Standards Catalogue, version 6.2 (September 2005), available here. This section contains a list of approved file formats for various purposes. The approved e-GIF formats are:
text and word processing: rtf, txt, htm, doc, pdf, nsf, mht
spreadsheets: csv, xls
presentation: ppt, pps
images: jpg, gif, png, tif, ecw
vector: svg, vml
audio/video: mp3, wav, avi, mov, qt, asf, wma, wmv, swf, ra, ram, rmm, Ogg Vorbis and a few others
compression: zip, gz, tgz, tar
character sets: Unicode and ISO 10646.
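A list like this lends itself to a mechanical check at deposit time. Below is a small sketch of such a lookup: the table is transcribed from the list above (the catalogue’s “and a few others” are deliberately not guessed at), and the function name and return shape are my own invention, not anything from the e-GIF documents.

```python
# Lookup over the e-GIF approved formats listed above. The category names
# and the helper are hypothetical; the extensions come from the post.
from pathlib import Path

EGIF_APPROVED = {
    "text and word processing": {"rtf", "txt", "htm", "doc", "pdf", "nsf", "mht"},
    "spreadsheets": {"csv", "xls"},
    "presentation": {"ppt", "pps"},
    "images": {"jpg", "gif", "png", "tif", "ecw"},
    "vector": {"svg", "vml"},
    "audio/video": {"mp3", "wav", "avi", "mov", "qt", "asf", "wma",
                    "wmv", "swf", "ra", "ram", "rmm", "ogg"},
    "compression": {"zip", "gz", "tgz", "tar"},
}

def approved_categories(filename: str) -> list[str]:
    """Return the e-GIF categories (if any) that approve this extension."""
    ext = Path(filename).suffix.lstrip(".").lower()
    return [cat for cat, exts in EGIF_APPROVED.items() if ext in exts]
```

So `approved_categories("report.pdf")` flags the deposit as approved under text and word processing, while an unlisted extension returns an empty list and could be queried with the depositor.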