You might think that word processed documents would be simple to preserve, but in fact they are not. Even the simplest documents contain tabs, bullets, indents, images, URL links, font changes, quotes, section headings, endnotes, and embedded active content (such as spreadsheet cells). You could save the whole thing as a plain text file, but you would lose all of these features.
Yet at their core, word processed documents are too simple. They are flat, by which I mean non-hierarchical. Sections follow each other in sequence, heading and text and heading and text, but ultimately the word processing application only sees these sections in terms of their appearance. Word itself doesn’t understand that your fourth paragraph is a sub-division of the first paragraph. It is not a database. But 75 years from now, it is the structure of the document that people will be interested in, not the appearance. A word processing application takes your content and makes it look pretty.
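To make the flat-versus-hierarchical point concrete, here is a minimal sketch in Python. It assumes a made-up flat document, a list of (level, text) pairs where level 0 is body text and 1, 2, ... are heading levels, and nests it into XML sections using the standard library's `xml.etree.ElementTree`. The `FLAT` data, the `to_tree` function, and the `document`/`section`/`para` element names are all hypothetical; a real word processor file stores styles, not tidy levels, so treat this as an illustration rather than a conversion tool.

```python
import xml.etree.ElementTree as ET

# Hypothetical flat document: (level, text) pairs in reading order.
# Level 0 is body text; 1, 2, ... are heading levels (an assumption).
FLAT = [
    (1, "Introduction"),
    (0, "Why preservation is hard."),
    (2, "Formatting"),
    (0, "Tabs, bullets, fonts."),
    (1, "Conclusion"),
    (0, "Structure outlives appearance."),
]


def to_tree(paragraphs):
    """Nest a flat heading/text sequence into <section> elements."""
    root = ET.Element("document")
    # Stack of (level, element) for the sections that are currently open.
    stack = [(0, root)]
    for level, text in paragraphs:
        if level == 0:
            # Body text belongs to the innermost open section.
            ET.SubElement(stack[-1][1], "para").text = text
            continue
        # A new heading closes every open section at the same or deeper level.
        while len(stack) > 1 and stack[-1][0] >= level:
            stack.pop()
        section = ET.SubElement(stack[-1][1], "section", title=text)
        stack.append((level, section))
    return root


root = to_tree(FLAT)
print(ET.tostring(root, encoding="unicode"))
```

In the resulting tree, the "Formatting" section sits inside "Introduction" rather than merely following it, which is exactly the relationship a purely appearance-based format does not record.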

4 comments
January 18, 2008 at 12:20 pm
Wouter Kool
1. Word processor files sometimes do contain structure. In MS Word it is possible to use heading styles to generate a hierarchical table of contents. Admittedly, it is not the degree of structure you might find in, for instance, an XML document.
When you convert the file to a more preservable format, for instance by printing to PDF, you risk losing structural information.
2. 75 years from now art historians might be interested in document formatting…
January 19, 2008 at 9:12 pm
alanake
1. Yes, heavyweight word processors do understand some presentational structure, and I agree that this would be lost in a conversion to XML.
2. Very possibly! But we have to pay for digital preservation now, and it’s not cheap. If we work for an organisation that produces (say) 10,000 records a year worthy of permanent archival preservation, then that’s a lot of free data we’re giving the art historians in 2083. It might be more cost effective to convert everything to XML and just save a handful of JPEG images to show “what an old document of 2008 looked like.”
A lot depends on what we understand our Designated Community of users to be. If we define the DC user group as including historians of presentational formatting, then the answer is Yes, we need to try to preserve all behavioural features of our records, and spend lots of money. If, however, we define our DC user group as just being bureaucrats within our own organisation looking for old data which they can import into the contemporary systems of their own timeframe, then we should just export everything to XML, and save our organisation a great deal of effort and money.
January 22, 2008 at 12:56 pm
Wouter Kool
Hi Alan,
I work at a national library (KB) and libraries have different designated communities from archives. So this explains the difference in outlook.
By the way, interesting blog! I am working on a migration project at KB. Perhaps we could exchange some of our findings.
January 22, 2008 at 8:55 pm
alanake
You work at the Koninklijke Bibliotheek?! – you guys rule! I am still battling my way through the recent RAND report about KB (hey, it’s over 140 pages long so it’ll take a bit of time). I don’t think I could share as much with you as you could with me – we’re still at a very early stage here, I’m just tasked to research the field – but you’re welcome to email me if you like.