
And by long term, I mean after the end of technological civilisation, as explored in my earlier post here.

Richard Heinberg’s Post-Carbon Institute blog has a post about the importance of librarians and archivists in trying to keep cultural knowledge alive after the power goes out. It’s a bit doom-laden, even for me, but he is utterly right in saying that “digitization represents a huge bet on society’s ability to keep the lights on forever.” Read his post here.

The excellent Low Tech Magazine has an article looking at the monster footprint of digital technology, pointing out that digitisation absorbs far more power than we like to think, and that the seemingly inevitable rise of digital resources has only been made possible by cheap energy.

Well, the short answer is 2037, plus or minus a few years.

That’s not a flippant answer, either. All forms of digital preservation really will stop around 2037, unless some kind of energy supply breakthrough happens.

How do we arrive at this date?

Let’s go through this step by step.

Technological collapse

The fundamental premise behind what follows is that digital preservation cannot survive the collapse of our technological civilisation. If you disagree with that premise, that’s fine, but you might as well stop reading now. It is always good to be clear about our premises before we begin. If you agree with the premise, then let’s carry on.

Paper records, if stored or hidden in a substantial box, can last centuries without any active preservation measures being undertaken. The civilisation which created those paper records might collapse, but the box could survive. A future civilisation can then discover the box, realise there is a message-bearing medium inside, and work out what it says. (It’s even better if the records are on stone. There’s a gap of well over two thousand years between the Persian inscriptions at Behistun and the 19th century European explorers who mapped and transcribed them, but that gap did not stop linguists from deciphering the cuneiform messages.)

Digital media have much shorter timespans. It is doubtful that a hard drive will be able to spin and deliver its data a few centuries after our society has collapsed. Technologically-dependent data storage therefore cannot survive massive societal collapse in the same way that non-dependent data can. 

We have arrived at the first possible answer to our question, which we will call answer A1:

A1: Digital preservation will come to an end when technological civilisation comes to an end.

Now let’s start to pin this down. When will technological civilisation come to an end?

Modern computing is wholly dependent upon hardware which in turn is wholly dependent upon fossil fuels for its creation and maintenance. The servers or CDs which preserve our data incorporate plastics which have been refined from crude oil supplied by OPEC. The dust-free clean rooms in which the chips are made are kept clean by energy derived from burning hydrocarbons. The finished computers are distributed globally by diesel-burning ships, which deliver them to ports from which the machines are then placed onto diesel-fuelled trucks for final distribution to warehouses and shops. The world’s ICT infrastructure is maintained by people who get to and from work in vehicles powered by petrol. Without crude oil none of this would happen.

The oil basis of modern ICT is an issue which gets raised from time to time. In December 2007 New Scientist reported that “computer servers are at least as great a threat to the climate as SUVs or the global aviation industry,” due to their carbon footprint. A 2004 United Nations study showed that the construction of an average 24-kilogram computer and 27-centimetre monitor requires at least 240 kilograms of fossil fuel, 22 kilograms of chemicals and 1,500 kilograms of water – or 1.8 tons in total, the equivalent of a sports utility vehicle.

Take away all this oil, plastic, petrol and diesel, and the world’s ICT structure becomes unsustainable. Motherboards become trickier to manufacture if you only have wood and brass. Gathering together the components and then distributing the finished machines becomes harder if you are dependent on sailing ships, horse-drawn carriages and barges for transport.

We can now refine our earlier answer. If we agree that digital preservation will come to an end when modern technological civilisation comes to an end, and if we then agree that modern technology is currently wholly dependent on oil and oil-derived plastics for its maintenance, then we arrive at the following statement:

A2: Digital preservation will end when the oil supply comes to an end.

But when will the oil supply come to an end? Never, in a sense, because at some point it will become too uneconomical for the world to drill out the last remaining drops. There will always be some oil left in the earth. Sadly that’s no help to us, because we will be back in the stone age by then.

A better question is: when will the oil supply start to run out? That’s the point at which civilisation crashes; the point at which any particular country can only increase its own supply of oil and plastic by taking oil and plastic away from another country. And from that date, year on year, there will be less oil and plastic than the year before.

I’m no geologist, so let’s go to the experts on this one. The EIA (Energy Information Administration) is the energy data arm of the US government. In 2004 the EIA published a report on Long Term Oil Prospects, which looked at exactly this question. The report’s authors considered a number of likely scenarios for both (a) the total amount of oil in the ground and (b) the increase in demand for oil as time progresses. Then they mapped out all these scenarios.

The report’s graph shows the three main scenarios, with the central one being the likeliest, as it is based on a world total oil production figure of about 3 trillion barrels, which is the US Geological Survey’s assessment. The overall curve has a shark-fin shape: world oil supply rises at a 2% annual growth rate until it peaks and then suddenly falls, when the world’s oil wells can no longer meet demand. The peak comes in 2037.
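To see roughly where a date like that comes from, here is a back-of-the-envelope sketch. This is my own reconstruction with illustrative round numbers, not the EIA’s actual model: world production of about 27.4 billion barrels a year around 2000 (roughly 75 million barrels a day), about 0.9 trillion barrels already produced out of the 3 trillion total, 2% annual demand growth, and the common modelling assumption that production peaks once the reserves-to-production ratio falls to 10.

```python
# All figures are illustrative round numbers, not the EIA's own inputs.
production = 27.4            # billion barrels per year, circa 2000 (~75 Mb/d)
remaining = 3000.0 - 900.0   # billion barrels still recoverable
year = 2000

# Demand grows 2% a year until the reserves-to-production ratio
# drops to 10, at which point the wells can no longer keep up.
while remaining / production > 10:
    remaining -= production
    production *= 1.02
    year += 1

print(year)
```

With these inputs the loop exits in the late 2030s, which is at least in the right neighbourhood of the report’s 2037 peak; nudge any of the assumptions and the date moves by a few years, which is exactly the “plus or minus” above.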

So, for our purposes we can say that, as digital preservation will come to an end when the oil supply comes to an end, and as the oil supply will come to an end in 2037, we can then say that:

A3: Digital preservation will come to an end in 2037.

Certainly we will have bigger problems in 2037 than simply digital preservation. One problem which springs to my mind, as a UK citizen, is starvation. Much of our food in the UK is grown overseas (using fossil fuel-based fertiliser) and then shipped across. U-boat warfare almost starved Britain in the 1940s, yet back then we had more land under arable cultivation, a smaller population to feed, and a bigger proportion of our population was involved in agriculture. When the world’s shipping stops in 2037 digital preservation will be less of a priority than personal survival. As Peter Goodchild recently wrote, when the oil supply stops “our descendants will be smashing computers to get pieces of metal they can use as arrowheads.”

That doesn’t sound very optimistic.

No, but the oil might not run out in 2037. It might run out later (although the crash will be bigger).

Some people, such as the Peak Oil crowd who hang around at The Oil Drum, think the oil supply might be running out just about now, but I’m no energy expert so I’ll stick with the US government’s EIA on this one.

The longer we have, the better our prospects for long-term digital archiving, because it gives the world more time to create and roll out a new ICT infrastructure, one which doesn’t use oil-derived plastics, or depend on oil for distribution, power and maintenance. The shorter we have, the worse our prospects will be.

Local authority archive services in the UK tend to have very good collections of parish records and local government records, but have poorer collections of business archives. In my experience business records are only deposited when the business itself has been liquidated, or been taken over, or when it has simply vanished from a building and another company has moved into the premises to discover a heap of old paper and ledgers. I have taken part in at least one instance of “rescue” archiving of this nature, when our team waded through mud inside a soon-to-be-demolished warehouse, picking 19th century volumes out of the dirt.

This kind of rescue activity only happens because it’s easier for the new company, or for the official receivers, to persuade an archives service to sort out all this old paper than it would be for them to sort it out themselves.

But the reverse is true for digital archives. When a modern business goes bust, many of its records will exist only in electronic form. The receivers’ primary job is to identify and sell the assets of the business and dispose of the remainder. For most businesses, the only electronic record which would have value as a saleable asset would be the list of its customers’ contact details. Privacy policies now explicitly state that customers’ information may be sold in the event of bankruptcy. Here’s one I found at random:

Business sale
“If [Name of company] Ltd or any of its lines of business is sold, pledged or disposed of as a going concern, whether by merger, sale of assets, bankruptcy or otherwise, then the user database of [the company] could be sold as part of that transaction and all User Information accessed by such successor or purchaser.”

Customers’ details, therefore, are not going to be handed over free to the nearby archives office. They will instead be sold to the highest commercial bidder. What about all the other electronic files – the personnel records, publicity photos, advertising material, accounts?

I suspect that most official receivers would treat the ICT hardware as either another asset for resale (if the kit was recent or high spec), or they would dispose of it to professional ICT equipment salvagers, who would ship it over to the developing world. Either way the data on the computers would get wiped. It is unlikely that they would contact their local archivists to suggest we come over with a truck to remove all the old servers and PCs. It also seems unlikely that they would wait for us to turn up with portable hard drives, power up all the old equipment, and work out what data we want to transfer across from dozens of separate servers and desktop machines.

The inheriting organisation will always be under pressure to take the easiest and cheapest way to dispose of a predecessor’s assets, which in practice probably means that data will be wiped and the hardware sold on. We are therefore looking at potentially a very large loss of historical business data.

Bankruptcy has been recognised in the past as a threat to digital preservation: read this from “Requirements for Digital Preservation Systems: A Bottom-Up Approach” in D-Lib magazine, back in November 2005:

Organizational Failure. The system view of digital preservation must include not merely the technology but the organization in which it is embedded. These organizations may die out, perhaps through bankruptcy, or their missions may change. This may deprive the digital preservation technology of the support it needs to survive. System planning must envisage the possibility of the asset represented by the preserved content being transferred to a successor organization, or otherwise being properly disposed of. For each of these types of failure, it is necessary to trade off the cost of defense against the level of system degradation under the threat that is regarded as acceptable for that cost.

Paper records can easily survive organizational failure, but electronic records will have a much tougher time.

These thoughts have been prompted by the slowdown in Western economies and the possibility of a forthcoming global recession. Some businesses have already gone bust, and the omens are not good for many others. For 99% of these companies, their electronic records will evaporate with the death of the business.

It’s possible that I’m worrying unnecessarily here. I would love to hear if anyone knows of instances where local archive repositories have been offered electronic data from former businesses which have gone bankrupt, rather than from live, viable companies.

Yesterday I visited Gloucestershire Archives to have a look at their GAIP (Gloucestershire Archives Ingest Package) software.

GAIP is a little Perl app which is open source and nicely platform independent (yesterday we saw it in action on both XP and Fedora). Using GAIP, you can take a digital file, or a collection of files, and create a non-proprietary preservation version of it, which is then kept in a .tgz file containing the preservation version, the original, the metadata, and a log of alterations. Currently it works with image files, so that GAIP can create a .tgz containing the original bmp (for instance) as well as the png which it has created. GAIP can then also create a publication version of the image, usually a JPEG. Gloucestershire Archives are intending to expand GAIP to cover other sorts of files too: it depends on what sorts of converters they can track down.

At present GAIP uses a command line interface which isn’t terribly friendly, but this can easily be improved.
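For anyone curious what such an ingest package might look like, here is a minimal Python sketch of the same idea. This is my own illustration, not GAIP’s actual code (which is Perl), and the format conversion itself is omitted, since that depends on whichever external converter is available.

```python
import hashlib
import json
import tarfile
import time
from pathlib import Path

def make_ingest_package(original: Path, converted: Path, out_dir: Path) -> Path:
    """Bundle an original file, its preservation copy, metadata and a
    log of alterations into a single .tgz, in the spirit of GAIP."""
    metadata = {
        "original": original.name,
        "preservation_copy": converted.name,
        "sha1_of_original": hashlib.sha1(original.read_bytes()).hexdigest(),
    }
    log = (f"{time.strftime('%Y-%m-%d')}: converted "
           f"{original.name} to {converted.name}\n")

    meta_file = out_dir / "metadata.json"
    log_file = out_dir / "log.txt"
    meta_file.write_text(json.dumps(metadata, indent=2))
    log_file.write_text(log)

    # One self-describing package: original + preservation copy + metadata + log.
    package = out_dir / (original.stem + ".tgz")
    with tarfile.open(package, "w:gz") as tgz:
        for f in (original, converted, meta_file, log_file):
            tgz.add(f, arcname=f.name)
    return package
```

The appeal of this approach is exactly what struck me at Gloucestershire: everything needed to understand the file travels with it in one archive, and no big-budget national infrastructure is required.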

From my point of view, I was glad to have a play with GAIP as it has rekindled my optimism about low-level digital preservation. I have been in a sulk for a couple of months because the only likely solutions seemed to be big-budget applications set up by (and therefore controlled by) national-level organisations. GAIP however is a ray of local light, a sign that UK local authorities might be able to develop in-house and low budget solutions which are realistic to our own specific contexts.

The problems of domain crawls

  • The legal approach is usually a risk-based one, i.e. harvest the content and then take it down if the creator objects. If a creator starts a court case, this approach could prove very expensive.
  • The approach has no basis in law. Copying any site without the content creator’s permission is an illegal act.
  • There is no guarantee that a site discoverable today in the repository will still be there in the future, as the content creator may have requested its removal.
  • There is no 1:1 engagement with the content creators, frequently no engagement at all.
  • The crawl ignores lots of valuable content. Not everything relevant to UK web history actually has a “.uk” domain.
  • Domain crawls are slow, and miss much of the web’s fleeting, at-risk, or semantic content.

The problems of active selection

  • The selective approach is usually a permissions-based one, i.e. approach the content creator first and ask for permission to archive. But this demands engagement with the creator, which is time-consuming, and which in turn drives the policy to become even more selective than the repository originally envisaged. So the result is usually small-scale harvesting.
  • Creators may not understand the purpose or urgency of archiving.
  • Creators may say No, in which case the efforts made to engage with them have been fruitless.
  • Many sites are not selected.
  • The repository may not have the resources to re-evaluate selection decisions. Therefore, once a site has been rejected, it may continue to be rejected, even though its content has changed.
  • The repository needs a policy on whether to continue archiving a site whose content accruals are no longer useful. But this constant oversight of the harvesting schedule requires resources.

Review of and notes from Julien Masanes’s chapter on web archiving in Deegan and Tanner’s Digital Preservation (2006). This book also contains a chapter by Elisa Mason looking at some web archiving case studies. Available at Amazon.

Masanes asserts that the web is a dynamic information space; web archiving consists of capturing a local, limited version of one small part of this space, and freezing it. This is well tricky, for various reasons. 

  • Much of the web’s content is dynamic: a URI can link not only to a simple flat page, but also to a virtual page, which is really a set of parameters which is then interpreted to generate the displayed content.
  • The actual dynamics of these dynamic pages evolve rapidly, because it is so easy for people to code up a new protocol and share it around.
  • The concept of a definitive or final version does not apply to the web: web documents are always prone to deletion or updating. Any preservation undertaken is therefore by its nature a sampling done at a specific time.
  • The navigation paths are embedded within the documents themselves, so a web archive has to be constructed in such a way that this mechanism still works for preserved pages.
  • Automated crawlers or spiders often take a few days to navigate large websites. This can lead to temporal incoherence, in that a page linked from the home page may be different from how it was when the home page itself was crawled.
  • Size. Size. Size. An organisation’s website can easily run into hundreds of thousands of pages, comprising millions of individual objects such as graphics, stylesheets and scripts.
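The point about navigation paths deserves a small illustration: an archive has to rewrite the links inside each preserved page so that they point at the archived copies rather than the live web. Here is a crude sketch of the idea (my own toy example with an invented domain; real web archives do this far more robustly, and a proper HTML parser would be safer than a regular expression):

```python
import re
from urllib.parse import urlparse

def localise_links(html: str, site: str) -> str:
    """Rewrite absolute links to `site` so that they point at local
    archived copies; links to other sites are left untouched."""
    def repl(match):
        url = match.group(2)
        parsed = urlparse(url)
        if parsed.netloc == site:
            path = parsed.path or "/index.html"
            if path.endswith("/"):
                path += "index.html"
            return f'{match.group(1)}"archive{path}"'
        return match.group(0)  # external link: leave as-is
    return re.sub(r'(href=)"([^"]+)"', repl, html)

page = '<a href="http://www.example.gov.uk/minutes/2007.html">Minutes</a>'
print(localise_links(page, "www.example.gov.uk"))
# The href now points into the local archive directory instead of the live site.
```

Multiply that little transformation by millions of objects, each crawled at a slightly different moment, and the temporal incoherence problem above becomes obvious.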


Time for a round-up:

  • Always keep the original bitstream of a digital resource, no matter what application it was created in, proprietary or not.
  • Never let the user consult the original. She can consult a copy. This copy does not have to be in the same format as the original, unless the user herself demands it.
  • When the time comes to migrate the data, migrate it to an open, XML-based format, if possible. But still keep the original.
  • Keep metadata separate from the original data object.
  • Store everything on a dedicated server, with backups, rather than offline media.
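As a concrete illustration of the first, second and fourth rules, here is a hypothetical sketch (the function names and directory layout are my own invention, not any particular repository’s practice):

```python
import hashlib
import json
import shutil
from pathlib import Path

def ingest(source: Path, store: Path) -> Path:
    """Keep the original bitstream untouched (rule 1), with its
    metadata in a separate sidecar file (rule 4)."""
    dest = store / "originals" / source.name
    dest.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy2(source, dest)
    dest.chmod(0o444)  # the original is read-only from now on

    sidecar = store / "metadata" / (source.name + ".json")
    sidecar.parent.mkdir(parents=True, exist_ok=True)
    sidecar.write_text(json.dumps({
        "filename": source.name,
        "size": dest.stat().st_size,
        "sha256": hashlib.sha256(dest.read_bytes()).hexdigest(),
    }, indent=2))
    return dest

def copy_for_user(original: Path, workspace: Path) -> Path:
    """Users consult a copy, never the original (rule 2)."""
    workspace.mkdir(parents=True, exist_ok=True)
    return Path(shutil.copy2(original, workspace / original.name))
```

The design point is simply that nothing the user (or a migration tool) does can ever touch the ingested bitstream, and the metadata can be regenerated or extended without rewriting the object it describes.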

And remember the two laws of digital preservation, which follow irrefutably from the fact that no one has or will ever come back in time from the future to tell us what will work:

  • No one knows anything.
  • No one will ever know anything. 

That’s the position today, at least.

Publisher: John Wiley & Sons (23 Jan 2004); ISBN-10: 0471453803. Available from Amazon.

OK, so it’s a book about digital security, not about digital preservation. But if there was a book on digital preservation as well written as this then I doubt we would have any problems in getting our message across. Well worth reading.

There are two particular aspects which jumped out as being indirectly relevant to digital preservation concerns, both to do with the interaction of humans with computers:

There is no such thing as a computer system; there are only computer-with-human systems. Well, I’m paraphrasing Schneier there, but it’s the sort of thing he would say, and he argues that it is the case. It is pointless to buy a digital security package and then leave the password on a Post-it note gummed to the monitor. It is pointless to invest in 128-bit encryption if the password you choose will be your cat’s name. It is pointless to set up a cutting edge firewall if you pay your staff so little that they will be bribed by a guy in the pub to burn the data onto a CD anyway. Schneier is making the point that an ICT system, by itself, is meaningless: it exists in a world full of humans, and we need to make sure the human elements are as trustworthy as the technical ones. This strikes me as being indirectly relevant to digital preservation. We argue lots about technical aspects – emulation, migration, file formats, metadata, XML etc – but we need to train ourselves up in human psychology and understand exactly how people will interact with our proposed systems.

Humans don’t do work on data; only programs do. (Another paraphrase there.) Schneier’s explicit point is about encryption, such as PGP. Very often you read statements like “Alice encrypts a message with Bob’s public key, which Bob can then decrypt because he has his own private key.” But in reality, nothing of the sort ever happens. Instead Alice presses a key on her computer. An application then encrypts the message. Nor does Bob decrypt. Instead he presses a key on his own computer, and the computer does the decrypt. Alice is trusting her computer, her OS and the app to do their job, and trusting that the encryption software company haven’t rigged up a backdoor. Bob, too, is trusting a whole load of people that he has never met, purely because he has bought their software.

There is an analogy here with digital preservation, as Schneier’s point can be extrapolated across to migration and emulation. When someone says “we can emulate X on Y” what they actually mean is “there is a company claiming that X can be emulated on Y, and I am trusting them.” Or: “there is a company claiming that their software can automatically migrate 1,000,000 files from file format X to file format Y with no loss of information content, and I am trusting them.” Or: “there is a company claiming that their checksum software proves fixity in refreshing data, and I am trusting them.” Ultimately we do not trust the technology; we have to trust the people behind the technology.
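To make that last trust relationship concrete: underneath the marketing, a fixity check is nothing more exotic than recomputing a hash and comparing it with the value recorded at ingest. A minimal sketch using SHA-256 (the function name is my own):

```python
import hashlib
from pathlib import Path

def fixity_check(path: Path, recorded_sha256: str) -> bool:
    """Recompute the file's checksum and compare it with the value
    recorded when the file was first ingested. A mismatch means the
    bitstream has changed somewhere along the way."""
    actual = hashlib.sha256(path.read_bytes()).hexdigest()
    return actual == recorded_sha256
```

The sting in Schneier’s point still applies, of course: even here we are trusting the hash implementation, the operating system and the disk to report honestly.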

Most creators of digital records do not care tuppence about the long term preservation of their documents, which is why people in the digi pres field continually try to raise awareness of the issues.

Which prompts a question – does successful emulation undermine our efforts? If the creators of records believe that someone 75 years from now will create a successful emulator which will run Excel 2003 (say), then there is no pressure on them to create their records now in any other format, is there? Creators can carry on creating records in closed, proprietary formats, to their hearts’ content. Every new report of a successful emulation project is yet another nail in the coffin of trying to persuade creators to use different formats.

Picture this: 1

I have a cup of coffee. I write a message on the froth, by sprinkling cocoa powder in a pattern which spells some words. Even though I know the likely lifetime of this record will be just five minutes, I take the cup of coffee to the archive. I tell the staff to preserve my cup of coffee forever, because it might be needed 100 years from now for legal reasons. I tell them they must preserve the full functionality of the message in the coffee: its look and feel, its taste, its warmth, and the wording on top.

Picture this: 2

I am walking along the beach. I write a message in the wet sand with my toes. Even though I know the lifetime of this record will be a few days at most, I dig the sand carefully out from the beach, and carry it to the archive. I tell the staff to preserve my sand forever, because it might be needed 100 years from now for legal reasons. I tell them they must preserve the full functionality of the message in the sand: its look and feel, the sound of the waves, the glint of the afternoon sunshine on the grains.

Picture this: 3

I am reading a cheap tabloid newspaper, printed on highly acidic pulp. I write a message on the margin, with an old felt-tip pen. Even though I know the lifetime of the record will be a decade at most, I rip it from the newspaper and take it to the archive. I tell the staff to preserve my newspaper message forever, because it might be needed 100 years from now for legal reasons. I tell them to preserve the full functionality of the message: its look and feel, the clarity of the felt-tip ink, the pure whiteness of the paper itself.

Picture this: 4

I am sitting at a computer. I write a message using a proprietary, closed format word processing program. Even though I know the lifetime of this record will be only a decade or two at most, I email it to the archive. I tell the staff there to preserve my electronic document forever, because it might be needed 100 years from now for legal reasons. I tell them they must preserve the full functionality of the message: its look and feel, its behavior, its working hyperlinks, its hidden structure.