The Koninklijke Bibliotheek in the Netherlands has produced a report Evaluating File Formats for Long-Term Preservation, available here, which introduces an evaluative scheme for assessing the fitness of a file format for preservation, and which then applies this scheme to two example formats, specifically MS Word 97-2003 doc format and PDF/A. Of course, identifying the winner of these two particular formats is easy (it might have been more interesting to see a closer contest such as ODF vs PDF/A) but it’s still an interesting exercise. The report was written by Judith Rog and Caroline van Wijk.

The scheme

Each file format is awarded a score on a particular criterion, such as “adoption: world wide usage” or “robustness: support for file corruption detection” and so on. The scores are weighted and then added together to give a total score. This total score then provides a quantifiable evaluation of how useful the format is as a way to preserve digital information for the long term.
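
As a rough illustration of the arithmetic, here is a minimal Python sketch of a weighted scoring scheme of this kind. The weights, maximum scores and example marks below are invented for illustration and are not the figures from the KB report; the names Criterion and weighted_percentage are likewise hypothetical. Only the two quoted criterion names come from the report as cited above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    name: str
    weight: int     # importance assigned by the assessing organisation (assumed value)
    max_score: int  # highest raw mark available on this criterion (assumed value)


def weighted_percentage(criteria, scores):
    """Return the weighted total as a percentage of the maximum achievable."""
    total = 0
    maximum = 0
    for c in criteria:
        raw = scores[c.name]
        if not 0 <= raw <= c.max_score:
            raise ValueError(f"score for {c.name!r} out of range: {raw}")
        total += c.weight * raw
        maximum += c.weight * c.max_score
    return 100.0 * total / maximum


# Hypothetical weights and marks, NOT taken from the KB report.
CRITERIA = [
    Criterion("adoption: world wide usage", weight=2, max_score=2),
    Criterion("robustness: support for file corruption detection", weight=1, max_score=2),
    Criterion("not dependent on specific hardware", weight=4, max_score=2),
]

example_scores = {
    "adoption: world wide usage": 1,
    "robustness: support for file corruption detection": 0,
    "not dependent on specific hardware": 2,
}

result = weighted_percentage(CRITERIA, example_scores)
print(f"{result:.1f}%")
```

Changing a single weight or mark changes the percentage, which is why two organisations using the same criteria can arrive at different totals for the same format.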

The KB stress that their scoring and weighting “might be very specific to the KB,” especially given that the KB does not need editing functionality after publication, nor is it the main point of distribution for much of what it collects. The particular marks awarded by the KB reflect the KB’s own assessments: a different organisation might award different individual marks, resulting in a different total score.

The overall scheme is excellent, and really our sector needs to undertake this sort of assessment on all possible archive formats, and share our scorings.

MS Word 97-2003 vs PDF/A: a commentary on the results

The KB report does not comment on the results, so let’s do that ourselves. KB’s scorings resulted in doc format achieving a 22% score and PDF/A achieving 89%. PDF/A is therefore the winner, which we all already knew, but it is nice to see it quantified.

This overall score, however, hides some interesting individual criterion scores.

Firstly, PDF/A and doc both scored zero on a number of criteria. Both formats are equally rubbish on (for instance) robustness against a single point of failure.

Secondly, there are a number of criteria where both formats have the same score, such as worldwide usage (both scored 2 out of 4) or backward compatibility (both scoring 2 out of 2).

Thirdly, there are some criteria where I think doc has been judged rather harshly. Doc is awarded 0 out of 8 on the “not dependent on specific hardware” criterion, for example, and similarly another 0 out of 8 for “not dependent on specific operating systems.” If by that the KB mean “dependent on the operating system used by 90% of the world’s desktop computers” well then, yes, it is dependent. But actually, doc is not that dependent on Windows, either, because you can buy an Apple Mac version of Word, which will open Word 97-2003 files ok. The OpenOffice suite on my home PC can open these files ok too. So we’re probably talking 95% of the world’s desktops. A score of absolute zero therefore seems brutal for a format which has a reasonable chance of being openable by the first random person you meet on the street.

Is PDF/A any good?

This is a wholly different question, one for a different blog posting. The report only claims PDF/A is a better long term preservation format than Word 97-2003. It doesn’t say that PDF/A is better than (say) PDF 1.6, which would be a closer battle. To their credit, the report’s authors point out that at least one digital preservation institution has rejected PDF/A.