On Monday I attended the What to Preserve? The Significant Properties of Digital Objects conference at the British Library conference centre, jointly organised by JISC, the BL and the DPC. It was particularly nice to meet some people there whom I had previously only known through email. Here are my own notes on some of what was discussed on the day.

General approaches to digital preservation

Andrew Wilson (National Archives of Australia) said that there are four basic approaches:

  • techno-centric: keeping the original hardware and software going for as long as possible
  • data-centric: keeping the data at the expense of the original application, such as converting records to PDF – this is the approach the NAA use
  • process-centric: keeping the original applications and processes at the (possible) expense of easy re-use of the data, ie. emulation of old hardware and software
  • post hoc: not doing anything, and then trying to dig out the data forensically when it’s needed by using expensive data archaeology techniques.

In practice most organisations use migration, and three general trends have emerged within this approach:

  • migration at obsolescence ie. at the last possible moment, when the file format is unsupported and beginning to expire. The records are migrated to new file formats or to current versions of old file formats
  • migration on ingest ie. when accepted by the digital repository the record is migrated to one of a small range of approved file formats (PDF, TIFF etc)
  • migration on request ie. not migrating it upfront at all, but only worrying about it when someone asks (and you therefore may need to use data archaeology techniques).
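The "migration on ingest" approach amounts to a normalisation policy: every incoming record is mapped to one of a small set of approved preservation formats. As a rough sketch (the mapping and function name here are my own invention, not anything from the talks), it might look like:

```python
import os

# Hypothetical normalisation policy for migration on ingest: each source
# extension maps to an approved preservation format. A real repository
# would key on identified format (via a tool like DROID), not extension.
APPROVED = {".doc": ".pdf", ".docx": ".pdf", ".bmp": ".tif", ".xls": ".csv"}

def target_name(filename: str) -> str:
    """Return the name a record would have after normalisation on ingest."""
    stem, ext = os.path.splitext(filename)
    # Formats already on the approved list (or unknown ones) pass through.
    return stem + APPROVED.get(ext.lower(), ext)
```

The point of keeping the approved list small is that the repository then only has a handful of formats to monitor for obsolescence, rather than everything its depositors happen to use.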

Authenticity and significant properties

In his keynote, AW said authenticity is the big issue. Authenticity cannot lie in the bitstream, because migration to any new format replaces the old bitstream with a new one. A bitstream approach misses the point anyway: a digital record is a “performance” in the sense that it does not exist as an object in the real world, but only as data + software + hardware = the performance. What we are trying to preserve is the authenticity of the performance as experienced by the original user, not the authenticity of the data bitstream. This is why we need to focus on significant properties. We have to decide which aspects of the original file need preserving, and which can be lost.

Steven Grace of the InSPECT project stated that the problem with significant properties is that significance is a human concept, and can only be determined by humans. An automatic file analyser (if there is such a thing) might be able to tell us that a given file uses a given colourspace, for instance. But it cannot tell us whether those colours are significant or not. If it’s a piece of artwork or design, they probably are. If it’s a map or a graph, well they may be significant, but then again they may not be (different areas could be represented with different styles of monochrome hatching or shading, with no actual information loss). It might even be the case that the graph is irrelevant, and that it’s the underlying data values which need preserving. We need to ask the creator.
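The kind of automatic file analysis Grace alludes to is mechanical property extraction. For example, a PNG file declares its colour model in a single byte of its IHDR chunk, so a tool can report the colourspace trivially (this toy parser is my own illustration; real analysers such as DROID or JHOVE do far more validation):

```python
PNG_SIG = b"\x89PNG\r\n\x1a\n"

# Colour types as defined in the PNG specification's IHDR chunk.
COLOUR_TYPES = {0: "greyscale", 2: "truecolour", 3: "indexed",
                4: "greyscale with alpha", 6: "truecolour with alpha"}

def png_colour_type(data: bytes) -> str:
    """Read the colour type byte from a PNG's IHDR chunk.
    Assumes well-formed input, for illustration only."""
    if data[:8] != PNG_SIG:
        raise ValueError("not a PNG")
    # After the 8-byte signature come the 4-byte chunk length and the
    # name b"IHDR"; the IHDR body then holds width (4 bytes), height (4),
    # bit depth (1) and colour type (1) - byte 25 of the file.
    return COLOUR_TYPES.get(data[25], "unknown")
```

Which is exactly Grace's point: the machine can tell you the file is truecolour, but only a human (ideally the creator) can tell you whether that matters.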

So it’s a very difficult thing to automate. Nevertheless automation is exactly what the EU-funded PLANETS project is looking at, said Adrian Brown. Other speakers pointed out that automation has to happen, in the long run, otherwise we are all doomed.

Cal Lee (University of North Carolina) split significant properties into four types:

  • Supported properties: ones inherent in the file format, eg the Excel format supports font styles while CSV doesn’t, and PDF/A does not support embedded moving images. These are all defined by the file format specs, so a human doesn’t really need to collect them.
  • Observed properties: the supported properties which actually have been used by the records’ creators. Not every spreadsheet will use all the functions Excel is capable of. This is much trickier.
  • Measured properties: ones which an automated system for checking significant properties (if there was one) could look for.
  • Intended properties: those properties which the creator actually intended to convey.

During the discussion at the end of the day someone pointed out that significant properties tie in with the designated community, as well. “Significant” means “significant to somebody” so how we define our user has a bearing on what properties we keep. This then spiralled into a discussion about AIPs vs DIPs. The DIP needs only to have those properties significant for that user – a different user could have a different DIP derived from the same AIP. Which means that an AIP has to keep everything, pretty much, rather than just those aspects we deem to be significant. We cannot predict what future users might want, so we have to keep it all. Chris Rusbridge then pointed out that this ramps up the cost of digital preservation, because the more properties we try to keep, the harder work it all is, and the more expensive it becomes.

Interesting snippets about this, that and the other

  • Mike Stapleton pointed out that there’s an issue with migrating interlaced TV images to progressive digital formats. A video signal is a moving beam of light. It starts at the top left of the screen and zooms through the rows down to the opposite corner quicker than the human eye can see, but only for every other row. It then does this a second time for the alternate rows. This combination of two passes creates what appears to be a single frame, hence “interlaced.” The problem arises because most digital file formats are progressive, not interlaced: the two passes are woven together into an artificial single frame, but of course the original TV image is actually two passes, 1/50th of a second apart. For fast-moving subjects there’s a difference between how things look on the first pass and how they look on the second, and a progressive migration blurs the two together. Ideally, therefore, we have to emulate rather than migrate, ie. in the space year 2100 someone will have to emulate an interlaced system to view the original images properly. But that might not be possible. It would be easier and cheaper to decide just to live with the blur (significance, again).
  • MS also pointed out that lossless migration is easier to automate than lossy. Most re-encoding of moving image files takes up real time, if not more (eg a two hour movie takes at least two hours to re-encode). If you have to watch the movie to check that the lossy quality is still acceptable then that’s yet another two hours gone. But with lossless encoding you don’t have to check for quality, by definition.
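The interlacing problem above can be sketched in a few lines. The "pixel" data here is entirely made up, just labels recording which pass each row came from:

```python
# Toy sketch of weaving two interlaced fields into one progressive frame.
# Each row is a list of pixel labels; the two fields were captured
# 1/50th of a second apart.
field_1 = [[f"t0r{r}"] for r in (0, 2, 4)]  # first pass: even rows, at time t0
field_2 = [[f"t1r{r}"] for r in (1, 3, 5)]  # second pass: odd rows, 1/50 s later

# Progressive migration locks both passes into a single frame.
frame = [None] * (len(field_1) + len(field_2))
frame[0::2] = field_1  # rows 0, 2, 4 show the scene at t0
frame[1::2] = field_2  # rows 1, 3, 5 show the scene at t0 + 1/50 s
```

The resulting "frame" mixes two moments in time in alternating rows, which is exactly the combing/blur artifact Stapleton described: for a fast-moving subject, the woven frame matches neither moment, and a progressive migration freezes that artifact in permanently.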

All the presentations are on the DPC website.