I have added a new page called “So you want to keep all your stuff?”, which is aimed at the home PC user. I’ve tried to make it as straightforward as possible, in the hope that anyone who stumbles across it will get a useful guide on how to preserve their own documents.

Any comments about how to improve it, or make it clearer, are very welcome.

Notes from Borghoff et al. Emulation has some notable advantages over migration, not least that it guarantees the greatest possible authenticity. The document’s original bitstream will always remain unchanged. All (!) we have to do is make sure that a working copy of the original app is available. As it’s impossible to keep the hardware running, we have to emulate the original system on new systems.

In theory there are no limitations on the format of the record; even dynamic behaviour should be preserved ok. But there are three massive worries with emulation: (a) can it be achieved at reasonable cost? (b) is it possible to resolve all the copyright and legal issues involved in running software programs over decades? and (c) will the human-computer interface of the long-term future be able to cope with the mouse-and-keyboard interface of today’s applications? The only realistic way to answer (c) would be to create a “vernacular copy” (p.78) but this strikes me as migration under a different name – just my own thought.

“DigitalPreservationEurope is pleased to announce the release of the second in a series of thought provoking and controversial position papers on a range of issues surrounding digital preservation, ‘So Where is the Black Hole in our Collective Memory?’. It is our intention that these papers will promote vigorous debate within the digital preservation community and encourage people to think about digital preservation in new and innovative ways by exploring and challenging the received wisdom.

Harvey’s position paper asks important questions: Have the digital preservation community cried wolf too often? Are our strident, alarmist proclamations about the loss of digital materials too extreme? He argues that our inability to bring evidence to bear in support of such claims leave us exposed and easily overlooked.

The paper comes out with the standard revisionist line, i.e. that examples of data loss are in fact examples of near data loss, or indeed data recovery. Useful to have a summary of the Usual Suspects: Viking lander (data recovered), BBC Domesday (data recovered), first email [AA who cares?], first website [AA ditto], 1960 US census data (data recovered). I have my own experience of this with FIF images.

The paper, however, does not mention that these data archaeology projects were expensive: good digital preservation policies would have prevented the data from becoming endangered in the first place. Moreover, these were all successful data projects. I wonder if there are examples out there where the data archaeology was left too long?

American legal firms are now using online storage for their digital preservation:

But is this such a good idea? The impression I get is that the files are kept in their original file formats. If the file format turns out to be unreadable 10 years later, that is not the online storage company’s problem, but the legal firms’ problem?

Here’s what got approved last year:

  • Records are always accepted for preservation if they (a) meet the terms of the normal collecting policy and (b) are in a format openable on the current IT platform. If the records meet condition (a) but not (b), the accession will be discussed first by the Technical Services manager.
  • Records are preserved only in popular or well-supported file formats, whether proprietary or not (e.g. .doc, .jpeg). The full list appears below.
  • Accessions are revisited every 12 months to check that the file format is not in danger of becoming obsolete. If the format is in danger, then the records are migrated to a replacement format.

What could be added, perhaps:

  • The end user customer will not be able to consult the original record, only a copy of that record.
  • Records will not necessarily be made available in the same format that they were created in.
  • The original bitstream of the record will be kept alongside any migrated versions, to enable a future emulation to be carried out, if that is deemed necessary.
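One way to confirm that a kept original bitstream really has stayed unchanged is to record a checksum at accession and recompute it at each review. This is only a minimal sketch: the choice of SHA-256 and the function names `bitstream_digest` and `is_unchanged` are my own assumptions, not part of the approved policy.

```python
import hashlib


def bitstream_digest(path, chunk_size=65536):
    """Return the SHA-256 hex digest of the file at `path`, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in chunks so large records do not have to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_unchanged(path, recorded_digest):
    """True if the file still matches the digest recorded at accession."""
    return bitstream_digest(path) == recorded_digest
```

The digest would be stored with the accession record; a mismatch at a later review means the bitstream is no longer the one that was accessioned.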

Approved file formats:

bmp image file
csv comma separated value
doc document file
dot document template
gif image file
htm web page
html web page
jpeg image file
jpg image file
mdb database file
pdf portable document format
ppt presentation file
prn space delimited spreadsheet
psd image file
pub desktop publishing file
rtf rich text format
tab tab delimited spreadsheet
tif image file
tsv tab separated value
txt text file
wav windows audio file
wma windows media file
xls spreadsheet
xlt spreadsheet template
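The format condition in the policy could be checked automatically at accession. The sketch below is illustrative only: it judges a file purely by its extension, which is an assumption on my part (an extension can be wrong or missing), and the names `is_approved` and `needs_discussion` are hypothetical.

```python
from pathlib import Path

# Extensions copied from the approved list above.
APPROVED_EXTENSIONS = {
    "bmp", "csv", "doc", "dot", "gif", "htm", "html", "jpeg", "jpg", "mdb",
    "pdf", "ppt", "prn", "psd", "pub", "rtf", "tab", "tif", "tsv", "txt",
    "wav", "wma", "xls", "xlt",
}


def is_approved(filename):
    """True if the file's extension (case-insensitive) is on the approved list."""
    return Path(filename).suffix.lower().lstrip(".") in APPROVED_EXTENSIONS


def needs_discussion(filenames):
    """Return the files that fall outside the list and need discussing first."""
    return [name for name in filenames if not is_approved(name)]
```

For example, `needs_discussion(["minutes.doc", "scan.fif"])` would pick out only `scan.fif`. The same check could be rerun at the 12-month review against whatever the approved list has become.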

I’m not aware of any costings which have actually been done, but my gut feeling is that the balance of digital vs. paper comes out in favour of digital. (By “paper” I’m including parchment, photographs etc. too.)

1. The biggest single ongoing cost in any repository is staffing. A paper-based archives service has to run searchrooms for users to consult the materials, where users are supervised and security is ensured. So, paper-based repositories have to employ receptionists, searchroom assistants, relief staff to cover when other staff are away etc. A digital repository which makes its assets available over the web does not incur any of these costs.

There is the issue of authenticity. The individual printing out the record often has a certain level of control over how that document is printed: fields or text can be removed from the printed version even if they remain in the digital original. Printing from spreadsheets usually results in the paper copy having only values and calculated data, not the formulas, or comments. This means that a paper document cannot necessarily be trusted as a full and complete equivalent of a digital record. Yet many people will allow the digital original to be deleted, or get lost, after the paper copy has been created. This may not be an issue for your home computer, but it may well be an issue in an organisation where different members of staff are printing different things.

How do you access paper? You need a supervised searchroom, really, with all the costs that entails, plus BS5454 storage. Digital preservation is actually cheaper than paper, if properly handled.

Digital records have a feature not present in paper ones, namely behaviour. A paper document is a fixed item, but digital documents are sometimes interactive, and for some of these the behaviour is an essential part of the meaning. Spreadsheets are a good example.

Also, for some organisations there is a legal aspect. If the original document is digital, then it has to be preserved digitally.

Unlike Microsoft’s suite, which creates files in proprietary formats, OpenOffice’s files are in formats which are open. ODF (Open Document Format) is an ISO standard and a European Union recommendation. OpenOffice Writer can itself convert a file from DOC to ODT. Unfortunately the conversion doesn’t always work. (Give examples?)

“The other disadvantage of Open Document Format is that even for simple documents it is extremely complex. For example, unzipping a one-page document of about 120 words results in a collection of files totalling 300K in size. This makes it relatively difficult to locate the meaningful content and structure and transform it into other formats for viewing or other uses. Instead of leaving documents in this complex format and having a hard job writing converters (XSLT stylesheets) for all possible future uses, it would be better to store documents in a simple, clear, well-structured format that makes converters easier to write.” (Ian Barnes of the Australian National University, Preservation of word processing documents (2006), available at here, accessed 29.11.07.)
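Barnes’s point about unzipping is easy to see for yourself, because an ODF file is a ZIP package whose text lives in a member called content.xml. The sketch below lists what is inside a package and pulls out the bare text by crudely stripping the XML tags. The helper names are hypothetical, and `make_stand_in` builds a deliberately tiny stand-in rather than a valid ODT, so that the example runs without needing a real document.

```python
import io
import re
import zipfile


def list_members(package_bytes):
    """Return (name, uncompressed size) for every file inside the package."""
    with zipfile.ZipFile(io.BytesIO(package_bytes)) as package:
        return [(info.filename, info.file_size) for info in package.infolist()]


def plain_text(package_bytes):
    """Crudely extract the text of content.xml by stripping every XML tag."""
    with zipfile.ZipFile(io.BytesIO(package_bytes)) as package:
        xml = package.read("content.xml").decode("utf-8")
    return " ".join(re.sub(r"<[^>]+>", " ", xml).split())


def make_stand_in():
    """Build a tiny stand-in package; a real ODT holds many more files."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as package:
        package.writestr("mimetype", "application/vnd.oasis.opendocument.text")
        package.writestr("content.xml", "<doc><p>Hello</p><p>archive world</p></doc>")
    return buffer.getvalue()
```

Running `list_members` on a real ODT read from disk shows the collection of files Barnes describes. Tag-stripping recovers the words but throws away all structure, which is rather his point about converters being hard work.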

Issues with the ZIP format too. ZIP is ok now, as ZIP files can be opened by any major platform, and that doesn’t look as if it is going to change. On the other hand, a corruption in the file can result in the loss of the entire file.
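Corruption of this kind can at least be detected, since ZIP stores a CRC-32 for each member. A minimal sketch using Python’s standard `zipfile` module follows. The helper names are hypothetical and the archive is built in memory, stored uncompressed, purely to keep the demonstration simple. Note that this only detects damage; it does nothing to repair it.

```python
import io
import zipfile


def make_archive(name, payload):
    """Build a one-file ZIP in memory, stored uncompressed for the demo."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_STORED) as archive:
        archive.writestr(name, payload)
    return buffer.getvalue()


def first_damaged_member(archive_bytes):
    """Return the name of the first member failing its CRC check, or None."""
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as archive:
        return archive.testzip()


def flip_byte(data, offset):
    """Return a copy of `data` with one byte altered, simulating corruption."""
    damaged = bytearray(data)
    damaged[offset] ^= 0xFF
    return bytes(damaged)
```

Flipping a single byte inside the stored payload is enough for `first_damaged_member` to name the affected file, which makes a check like this a candidate for the periodic review of accessions.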

RTF is Rich Text Format. Microsoft have published the format, so in that sense it’s open, but it has some ‘quirks.’ Better check all this.

The National Library of Australia has saved much of its stuff in RTF.