And by long term, I mean after the end of technological civilisation, as explored in my earlier post here.

Richard Heinberg’s Post-Carbon Institute blog has a post about the importance of librarians and archivists in trying to keep cultural knowledge alive after the power goes out. It’s a bit doom-laden, even for me, but he is utterly right in saying that “digitization represents a huge bet on society’s ability to keep the lights on forever.” Read his post here.

The excellent Low Tech Magazine has an article looking at the monster footprint of digital technology, pointing out that digitisation absorbs far more power than we like to think, and that the seemingly inevitable rise of digital resources has only been made possible by cheap energy.

RODA

RODA (Repository of Authentic Digital Objects) is a Portuguese initiative to preserve authentic digital objects produced by government. It is based on Fedora Commons, and supports the preservation of text documents, raster images, relational databases, video and audio. It runs as a Java web application, accessed through a suitable browser. RODA’s core preservation strategy is migration, but it keeps the original representation too, so it should still be possible to open the original files on emulated systems.

It’s in its final stages of preparation now; a demo is available at http://roda.di.uminho.pt/?locale=en#home. I’ve created the screengrabs below myself while exploring the demo.

My notes are very brief. If you go along to the demo you will discover that two of the PDF documents preserved there are papers explaining more about the principles, systems and strategy behind the RODA project.

RODA is OAIS-compliant, so let’s run through this in OAIS order.

The SIP: this comprises the digital original and its metadata, all inside a METS envelope which is then zipped. Preservation metadata is a PREMIS record and descriptive metadata is in a segment of EAD. Technical metadata would also be nice, but RODA’s creators say it “is not mandatory as [it] is seldom created by producers.”

Files included in the SIP are accompanied by checksums and are checked for viruses.  Neatly, there are a number of ways that producers can create SIPs, one of which is a dedicated app called RODA-in.
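To make the SIP structure a bit more concrete, here is a minimal sketch (mine, not RODA’s) of how such a package might be assembled. The file names, the METS layout and the choice of MD5 are my own assumptions for illustration only:

```python
# Hypothetical sketch of building a RODA-style SIP: a zip holding the
# original files plus a METS envelope pointing at PREMIS and EAD metadata.
# File names and metadata layout are guesses, not RODA's actual spec.
import hashlib
import zipfile
from pathlib import Path

def checksum(path: Path) -> str:
    """MD5 checksum of a file, read in chunks."""
    h = hashlib.md5()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

def build_sip(files: list[Path], ead_xml: str, premis_xml: str, out: Path) -> None:
    # Minimal METS skeleton: one <file> entry per original, with its checksum.
    file_entries = "\n".join(
        f'    <file ID="{p.name}" CHECKSUM="{checksum(p)}" CHECKSUMTYPE="MD5"/>'
        for p in files
    )
    mets = (
        '<mets xmlns="http://www.loc.gov/METS/">\n'
        f"  <fileSec>\n{file_entries}\n  </fileSec>\n"
        "</mets>\n"
    )
    with zipfile.ZipFile(out, "w") as z:
        z.writestr("METS.xml", mets)
        z.writestr("metadata/EAD.xml", ead_xml)        # descriptive metadata
        z.writestr("metadata/PREMIS.xml", premis_xml)  # preservation metadata
        for p in files:
            z.write(p, f"representation/{p.name}")     # the digital originals
```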

Ingest. The system logs all SIPs which are in progress

Files in non-approved preservation formats (eg JPGs) are then normalised into formats which are approved (eg TIFFs). At that point they become AIPs. Approved formats are PDF/A for text (and for PowerPoint presentations too, to judge from the examples on the demo), TIFF for images, MPEG-2 for video, WAV for audio, and DBML, this last one being an XML schema devised by the RODA team themselves for databases. Files in other formats are normalised by going through a normalisation plugin; “plugins can easily be created to allow ingestion of other formats not in the list.”
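Here’s a hedged sketch of the plugin idea as I understand it; the registry and function names are invented, and RODA’s real plugin interface will certainly differ:

```python
# A sketch of format normalisation as RODA describes it: files in
# non-approved formats are routed through a plugin that converts them to an
# approved preservation format. The registry below is illustrative only.
from pathlib import Path
from PIL import Image  # pip install Pillow

def normalise_image(src: Path, dst_dir: Path) -> Path:
    """Convert a raster image to TIFF, the approved preservation format."""
    dst = dst_dir / (src.stem + ".tif")
    Image.open(src).save(dst, format="TIFF")
    return dst

# Map of non-approved extensions to the plugin that normalises them.
PLUGINS = {
    ".jpg": normalise_image,
    ".jpeg": normalise_image,
    ".bmp": normalise_image,
}

def normalise(src: Path, dst_dir: Path) -> Path:
    plugin = PLUGINS.get(src.suffix.lower())
    if plugin is None:
        raise ValueError(f"No normalisation plugin for {src.suffix}")
    return plugin(src, dst_dir)
```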

The AIP: if the archivist approves the SIP, and if it contains a normalised representation, then it becomes an AIP, and the customer can either search for it (simple search or a smart-looking advanced search) or browse the classification tree. The customer can view descriptive metadata, preservation metadata, previews of the data (depending on what sort of data it is) and the data itself.

Preservation metadata can be viewed as a timeline

An AIP. This is for a series of images; text documents, sound files etc all look different

Previews of specific images in the AIP

The photo and book-style previews are beautiful. I never knew Portugal looked like this🙂

Security: currently the demo is open, but when it’s finally in action all users will be authenticated prior to accessing the repository, and all user actions will be logged. No anonymous users will be allowed. All preservation actions, such as format conversions, are likewise recorded. Permissions can be fine-tuned so that they apply from repository level all the way down to individual data objects. If a user does not have permission to view a specific item then it will not show in their search results.
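As an illustration of how permissions might cascade like that (this is my own sketch, not RODA’s actual model), a check can simply walk up the classification tree to the nearest explicit access list, and search results can be filtered through the same check so that unauthorised items never appear:

```python
# Illustrative sketch of permissions cascading from repository level down
# to individual objects, with search results silently filtered.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    name: str
    parent: Optional["Node"] = None
    readers: Optional[set[str]] = None  # None = inherit from parent

def can_read(user: str, node: Node) -> bool:
    """Walk up the tree to the nearest explicit ACL; default to deny."""
    n = node
    while n is not None:
        if n.readers is not None:
            return user in n.readers
        n = n.parent
    return False

def search(user: str, hits: list[Node]) -> list[Node]:
    # Unauthorised items are simply dropped from the result list.
    return [n for n in hits if can_read(user, n)]

repo = Node("repository", readers={"archivist"})
fonds = Node("fonds A", parent=repo)  # no explicit ACL: inherits from repo
item = Node("item 1", parent=fonds, readers={"archivist", "reader1"})
print([n.name for n in search("reader1", [fonds, item])])  # ['item 1']
```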

And at the end of it all, the system can create stats!

The Administrator account can see stats

One thing which immediately strikes you is the clean finish to its user interface, the RODA WUI layer (RODA Web User Interface). Very, very cool.

The Portuguese team has clearly put in a great deal of time and skill here. The project team comprises the Portuguese National Archives, who carried out the archival consulting and development; the University of Minho, which did the software engineering consulting; Assymetric Studios, who did the design; the IDW, which supplied the hardware; and Keep Solutions, who provide maintenance and support.

My thanks to Miguel Ferreira of the University of Minho for answering my questions about RODA.

The WARC format for web archiving is now ISO 28500:2009. The format is used by the Internet Archive.

Here’s the release from the Library of Congress:

The International Internet Preservation Consortium is pleased to announce the publication of the WARC file format as an international standard: ISO 28500:2009, Information and documentation — WARC file format.  [http://www.iso.org/iso/iso_catalogue/catalogue_tc/catalogue_detail.htm?csnumber=44717]

For many years, heritage organizations have tried to find the most appropriate ways to collect and keep track of World Wide Web material using web-scale tools such as web crawlers. At the same time, these organizations were concerned with the requirement to archive very large numbers of born-digital and digitized files. A need was for a container format that permits one file simply and safely to carry a very large number of constituent data objects (of unrestricted type, including many binary types) for the purpose of storage, management, and exchange. Another requirement was that the container need only minimal knowledge of the nature of the objects.

The WARC format is expected to be a standard way to structure, manage and store billions of resources collected from the web and elsewhere. It is an extension of the ARC format [http://www.archive.org/web/researcher/ArcFileFormat.php ], which has been used since 1996 to store files harvested on the web. WARC format offers new possibilities, notably the recording of HTTP request headers, the recording of arbitrary metadata, the allocation of an identifier for every contained file, the management of duplicates and of migrated records, and the segmentation of the records. WARC files are intended to store every type of digital content, either retrieved by HTTP or another protocol.

The motivation to extend the ARC format arose from the discussion and experiences of the International Internet Preservation Consortium [http://netpreserve.org/ ], whose core mission is to acquire, preserve and make accessible knowledge and information from the Internet for future generations. IIPC Standards Working Group put forward to ISO TC46/SC4/WG12 a draft presenting the WARC file format. The draft was accepted as a new Work Item by ISO in May 2005.

Over a period of four years, the ISO working group, with the Bibliothèque nationale de France [http://www.bnf.fr/ ] as convener, collaborated closely with IIPC experts to improve the original draft. The WG12 will continue to maintain [http://bibnum.bnf.fr/WARC/ ] the standard and prepare its future revision.

Standardization offers a guarantee of durability and evolution for the WARC format. It will help web archiving entering into the mainstream activities of heritage institutions and other branches, by fostering the development of new tools and ensuring the interoperability of collections. Several applications are already WARC compliant, such as the Heritrix [http://crawler.archive.org/ ] crawler for harvesting, the WARC tools [http://code.google.com/p/warc-tools/ ] for data management and exchange, the Wayback Machine [http://archive-access.sourceforge.net/projects/wayback/ ], NutchWAX [http://archive-access.sourceforge.net/projects/nutch/ ] and other search tools [http://code.google.com/p/search-tools/ ] for access. The international recognition of the WARC format and its applicability to every kind of digital object will provide strong incentives to use it within and beyond the web archiving community.

———————–
Abbie Grotke
Library of Congress
IIPC Communications Officer
netpreserve.org
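To give a flavour of what the standard describes, here is a rough sketch of a single WARC record being written in Python. The field values are illustrative; consult ISO 28500 itself for the normative rules:

```python
# An illustration of WARC's record structure: a version line, named header
# fields, a blank line, the content block, then two CRLFs. This writes a
# minimal 'resource' record; real harvesting tools add many more fields.
import uuid
from datetime import datetime, timezone

def warc_record(uri: str, payload: bytes, content_type: str = "text/html") -> bytes:
    headers = [
        "WARC/1.0",
        "WARC-Type: resource",
        f"WARC-Record-ID: <urn:uuid:{uuid.uuid4()}>",
        f"WARC-Date: {datetime.now(timezone.utc).strftime('%Y-%m-%dT%H:%M:%SZ')}",
        f"WARC-Target-URI: {uri}",
        f"Content-Type: {content_type}",
        f"Content-Length: {len(payload)}",  # length of the content block only
    ]
    return "\r\n".join(headers).encode() + b"\r\n\r\n" + payload + b"\r\n\r\n"

with open("example.warc", "wb") as f:
    f.write(warc_record("http://example.org/", b"<html>hello</html>"))
```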

The reason why I’ve been away from digi preservation for so long is that I’ve been managing the move of our paper archives from one repository to another.  The move itself has gone more smoothly than I dared to hope: everything happened on schedule, the ICT didn’t let me down, all the boxes fitted their new locations so my sums must have been accurate enough… it’s taken three weeks to move a mile of paper and parchment archives.

Our understanding is that we’re the first people to control a UK local authority repository move with barcodes. It’s taken 2.5 years of preparation, mainly spent in getting all our barcode data onto CALM, but the result this week is that all we had to do was a massive zap of all the barcodes in the building (that’s taken 48 hours), upload the data into CALM’s locations module, and voila! – we now know where everything is.

Boxes: hundreds of ’em

Above: boxes in the new repository.  At the old record office we had boxes of different sizes and formats scattered throughout the building. In the new repository we have been very strict in storing boxes purely by format even if it means splitting collections up. We’re relying totally on the barcodes to find them.

Cantilevers

Rolled maps in linen bags. Every single individual package, whether it’s a roll, a box, a freestanding volume or a folder in a drawer, has its own barcode.

James Dear

Even the 19th century portrait which we have on deposit has its own barcode!

Zapping

Here one of my colleagues is zapping the boxes on their new shelves. First we zap the shelf (all the shelves have their own barcodes) and then we zap the items on it. This raw data gets imported into Excel where lookup tables replace the numbers with human readable information (eg replacing “L012345” with “Bay R6 shelf D”). Then it all goes into CALM’s locations module, so that it links automatically with the documents’ catalogue entries.
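For the curious, the shelf-then-items logic is simple enough to sketch in a few lines of Python (the barcode values and lookup table here are invented, and our real translation happened via lookup tables in Excel before the import into CALM):

```python
# Sketch of the scanning logic: a shelf barcode starts a new location, and
# every item barcode that follows is assigned to that shelf.
SHELF_LOOKUP = {"L012345": "Bay R6 shelf D", "L012346": "Bay R6 shelf E"}

def assign_locations(scans: list[str]) -> list[tuple[str, str]]:
    """Pair each item barcode with the human-readable shelf scanned before it."""
    rows, current_shelf = [], None
    for code in scans:
        if code in SHELF_LOOKUP:          # a shelf barcode starts a new location
            current_shelf = SHELF_LOOKUP[code]
        elif current_shelf is not None:   # an item barcode on the current shelf
            rows.append((code, current_shelf))
        else:
            raise ValueError(f"Item {code} scanned before any shelf")
    return rows

scans = ["L012345", "BOX0001", "BOX0002", "L012346", "BOX0003"]
print(assign_locations(scans))
# [('BOX0001', 'Bay R6 shelf D'), ('BOX0002', 'Bay R6 shelf D'),
#  ('BOX0003', 'Bay R6 shelf E')]
```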

The methodology took us months to work out, followed by two years of repackaging work and sticking barcodes on everything, just to result in 48 hours of zapping in the new repository.

Here are some details about our barcode methodology in the National Archives’ RecordKeeping magazine (we’re on page 36).

It’s been a long, long project but it’s all gone smoothly and I feel rather chuffed to have managed the move in a new way. Back to digi preservation soon!

Well, the short answer is 2037, plus or minus a few years.

That’s not a flippant answer, either. All forms of digital preservation really will stop around 2037, unless some kind of energy supply breakthrough happens.

How do we arrive at this date?

Let’s go through this step by step.

Technological collapse

The fundamental premise behind what follows is that digital preservation cannot survive the collapse of our technological civilisation. If you disagree with that premise, that’s fine, but you might as well stop reading now. It is always good to be clear about our premises before we begin. If you agree with the premise, then let’s carry on.

Paper records, if stored or hidden in a substantial box, can last centuries without any active preservation measures being undertaken. The civilisation which created those paper records might collapse, but the box could survive. A future civilisation can then discover the box, realise there is a message-bearing medium inside, and work out what it says. (It’s even better if the records are on stone. There’s a gap of nearly two and a half thousand years between the Persian royal inscription at Behistun and the 19th century European explorers who mapped and transcribed it, but that gap did not stop linguists from deciphering the cuneiform messages.)

Digital media have much shorter timespans. It is doubtful that a hard drive will be able to spin and deliver its data a few centuries after our society has collapsed. Technologically-dependent data storage therefore cannot survive massive societal collapse in the same way that non-dependent data can. 

We have arrived at the first possible answer to our question, which we will call answer A1:

A1: Digital preservation will come to an end when technological civilisation comes to an end.

Now let’s start to pin this down. When will technological civilisation come to an end?

Modern computing is wholly dependent upon hardware which in turn is wholly dependent upon fossil fuels for its creation and maintenance. The servers or CDs which preserve our data incorporate plastics which have been refined from crude oil supplied by OPEC. The dust-free clean rooms in which the chips are made are kept clean by energy derived from burning hydrocarbons. The finished computers are distributed globally by diesel-burning ships, which deliver them to ports from which the machines are then placed onto diesel-fuelled trucks for final distribution to warehouses and shops. The world’s ICT infrastructure is maintained by people who get to and from work in vehicles powered by petrol. Without crude oil none of this would happen.

The oil basis of modern ICT is an issue which gets raised from time to time. In December 2007 New Scientist reported that “computer servers are at least as great a threat to the climate as SUVs or the global aviation industry,” due to their carbon footprint. A 2004 United Nations study showed that the construction of an average 24-kilogram computer and 27-centimetre monitor requires at least 240 kilograms of fossil fuel, 22 kilograms of chemicals and 1,500 kilograms of water – or 1.8 tons in total, the equivalent of a sports utility vehicle.
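Those figures do indeed add up to the quoted total:

$$240\,\text{kg} + 22\,\text{kg} + 1500\,\text{kg} = 1762\,\text{kg} \approx 1.8\ \text{tonnes}$$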

Take away all this oil, plastic, petrol and diesel, and the world’s ICT structure becomes unsustainable. Motherboards become trickier to manufacture if you only have wood and brass. Gathering together the components and then distributing the finished machines becomes harder if you are dependent on sailing ships, horse-drawn carriages and barges for transport.

We can now refine our earlier answer. If we agree that digital preservation will come to an end when modern technological civilisation comes to an end, and if we then agree that modern technology is currently wholly dependent on oil and oil-derived plastics for its maintenance, then we arrive at the following statement:

A2: Digital preservation will end when the oil supply comes to an end.

But when will the oil supply come to an end? Never, in a sense, because at some point it will become too uneconomical for the world to drill out the last remaining drops. There will always be some oil left in the earth. Sadly that’s no help to us, because we will be back in the stone age by then.

A better question is: when will the oil supply start to run out? Because that’s the point at which civilisation crashes; that’s the point at which any particular country can only increase its own oil and plastic by taking away oil and plastic from another country. And from that date, year on year, there will be less oil and plastic than the year before.

I’m no geologist, so let’s go to the experts on this one. The EIA (Energy Information Administration) is the energy data arm of the US government. In 2004 the EIA published a report on Long Term Oil Prospects, which looked at exactly this question. The report’s authors considered a number of likely scenarios for both (a) the total amount of oil in the ground and (b) the increase in demand for oil as time progresses. Then they mapped out all these scenarios.

This graph shows the three main scenarios, with the central one being the likeliest, as it is based on a world total oil production figure of about 3 trillion barrels of oil, which is the US Geological Survey’s assessment. The overall curve has a sharkfin shape. World oil supply rises upwards with a 2% annual growth rate until it peaks and then suddenly falls, when the world’s oil wells cannot meet demand. The peak comes in 2037.
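As a back-of-the-envelope check (using my own rough inputs, not the EIA’s actual model), you can reproduce a date very close to 2037 by growing production at 2% a year from plausible 2000 values and calling the peak when the remaining-reserves-to-production ratio falls to about 10:

```python
# A rough reconstruction of the 2037 figure, not the EIA's actual model.
# Assumptions (all mine): ~3 trillion barrels ultimately recoverable,
# ~0.95 trillion already produced by 2000, production of ~28 billion
# barrels/year in 2000 growing at 2%/year, and a peak when the
# remaining-reserves-to-production (R/P) ratio falls to 10.
URR = 3000.0        # ultimately recoverable resource, billion barrels
produced = 950.0    # cumulative production to 2000, billion barrels
output = 28.0       # annual production in 2000, billion barrels/year

year = 2000
while (URR - produced) / output > 10:   # peak when R/P ratio reaches 10
    produced += output
    output *= 1.02                      # 2% annual demand growth
    year += 1

print(year)  # prints 2037 under these rough inputs
```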

So, for our purposes we can say that, as digital preservation will come to an end when the oil supply comes to an end, and as the oil supply will come to an end in 2037, we can then say that:

A3: Digital preservation will come to an end in 2037.

Certainly we will have bigger problems in 2037 than simply digital preservation. One problem which springs to my mind, as a UK citizen, is starvation. Much of our food in the UK is grown overseas (using fossil fuel-based fertiliser) and then shipped across. U-boat warfare almost starved Britain in the 1940s, yet back then we had more land under arable cultivation, a smaller population to feed, and a bigger proportion of our population was involved in agriculture. When the world’s shipping stops in 2037 digital preservation will be less of a priority than personal survival. As Peter Goodchild recently wrote, when the oil supply stops “our descendants will be smashing computers to get pieces of metal they can use as arrowheads.”

That doesn’t sound very optimistic.

No, but the oil might not run out in 2037. It might run out later (although the crash will be bigger).

Some people, such as the Peak Oil crowd who hang around at The Oil Drum, think the oil supply might be running out just about now, but I’m no energy expert so I’ll stick with the US government’s EIA on this one.

The longer we have, the better are our prospects at longer term digital archiving, because it gives the world more time to create and roll out a new ICT structure, one which doesn’t use oil-derived plastics, or depend on oil for distribution, power and maintenance. On the other hand, the shorter we have, the worse our prospects will be.

Local authority archive services in the UK tend to have very good collections of parish records and local government records, but have poorer collections of business archives. In my experience business records are only deposited when the business itself has been liquidated, or been taken over, or when it has simply vanished from a building and another company has moved into the premises to discover a heap of old paper and ledgers. I have taken part in at least one instance of “rescue” archiving of this nature, when our team waded through mud inside a soon-to-be-demolished warehouse, picking 19th century volumes out of the dirt.

This kind of rescue activity only happens because it’s easier for the new company, or for the official receivers, to persuade an archives service to sort out all this old paper than it would be for them to sort it out themselves.

But the reverse is true for digital archives. When a modern business goes bust, many of its records will exist only in electronic form. The receivers’ primary job is to identify and sell the assets of the business and dispose of the remainder. For most businesses, the only electronic record which would have value as a saleable asset would be the list of its customers’ contact details. Privacy policies now explicitly state that customers’ information may be sold in the event of bankruptcy. Here’s one I found at random:

Business sale
If [Name of company] Ltd or any of its lines of business is sold, pledged or disposed of as a going concern, whether by merger, sale of assets, bankruptcy or otherwise, then the user database of [the company] could be sold as part of that transaction and all User Information accessed by such successor or purchaser.

Customers’ details, therefore, are not going to be handed over free to the nearby archives office. They will instead be sold to the highest commercial bidder. What about all the other electronic files – the personnel records, publicity photos, advertising material, accounts?

I suspect that most official receivers would treat the ICT hardware as either another asset for resale (if the kit was recent or high spec), or they would dispose of it to professional ICT equipment salvagers, who would ship it over to the developing world. Either way the data on the computers would get wiped. It is unlikely that they would contact their local archivists to suggest we come over with a truck to remove all the old servers and PCs. It also seems unlikely that they would wait for us to turn up with portable hard drives, power up all the old equipment, and work out what data we want to transfer across from dozens of separate servers and desktop machines.

The inheriting organisation will always be under pressure to take the easiest and cheapest way to dispose of a predecessor’s assets, which in practice probably means that data will be wiped and the hardware sold on. We are therefore looking at potentially a very large loss of historical business data.

Bankruptcy has been recognised in the past as a threat to digital preservation: read this from “Requirements for Digital Preservation Systems: A Bottom-Up Approach” in D-Lib magazine, back in November 2005:

Organizational Failure. The system view of digital preservation must include not merely the technology but the organization in which it is embedded. These organizations may die out, perhaps through bankruptcy, or their missions may change. This may deprive the digital preservation technology of the support it needs to survive. System planning must envisage the possibility of the asset represented by the preserved content being transferred to a successor organization, or otherwise being properly disposed of. For each of these types of failure, it is necessary to trade off the cost of defense against the level of system degradation under the threat that is regarded as acceptable for that cost.

Paper records can easily survive organizational failure, but electronic records will have a much tougher time.

These thoughts have been prompted by the slowdown in Western economies and the possibility of a forthcoming global recession. Some businesses have already gone bust, and the omens are not good for many others. For 99% of these companies, their electronic records will evaporate with the death of the business.

It’s possible that I’m worrying unnecessarily here. I would love to hear if anyone knows of instances where local archive repositories have been offered electronic data from former businesses which have gone bankrupt, rather than from live, viable companies.

Yesterday I visited Gloucestershire Archives to have a look at their GAIP (Gloucestershire Archives Ingest Package) software.

GAIP is a little Perl app which is open source and nicely platform independent (yesterday we saw it in action on both XP and Fedora). Using GAIP, you can take a digital file, or a collection of files, and create a non-proprietary preservation version of it, which is then kept in a .tgz file containing the preservation version, the original, the metadata, and a log of alterations. Currently it works with image files, so that GAIP can create a .tgz containing the original bmp (for instance) as well as the png which it has created. GAIP can then also create a publication version of the image, usually a JPEG. Gloucestershire Archives are intending to expand GAIP to cover other sorts of files too: it depends on what sorts of converters they can track down.
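From what I saw, the package GAIP produces could be imitated along these lines. This is a sketch in Python rather than GAIP’s Perl, and the file layout inside the .tgz is my guess, not Gloucestershire’s actual format:

```python
# A rough imitation of a GAIP-style ingest package: a .tgz holding the
# original file, a non-proprietary preservation copy, some metadata, and
# a log of the alterations made.
import tarfile
from datetime import datetime
from pathlib import Path
from PIL import Image  # pip install Pillow

def make_package(original: Path, workdir: Path) -> Path:
    workdir.mkdir(exist_ok=True)
    preservation = workdir / (original.stem + ".png")
    Image.open(original).save(preservation, format="PNG")  # eg bmp -> png

    log = workdir / "log.txt"
    log.write_text(f"{datetime.now().isoformat()} converted {original.name} "
                   f"to {preservation.name}\n")
    metadata = workdir / "metadata.txt"
    metadata.write_text(f"original: {original.name}\n"
                        f"preservation: {preservation.name}\n")

    package = workdir / (original.stem + ".tgz")
    with tarfile.open(package, "w:gz") as tar:
        for f in (original, preservation, metadata, log):
            tar.add(f, arcname=f.name)
    return package
```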

At present GAIP uses a command line interface which isn’t terribly friendly, but this can easily be improved.

From my point of view, I was glad to have a play with GAIP as it has rekindled my optimism about low-level digital preservation. I have been in a sulk for a couple of months because the only likely solutions seemed to be big-budget applications set up by (and therefore controlled by) national-level organisations. GAIP however is a ray of local light, a sign that UK local authorities might be able to develop in-house and low budget solutions which are realistic to our own specific contexts.


MLA East of England has published the report on Phase 2 of its Digital Preservation Regional Pilot Project (DARP 2). The report is available as a PDF here. Phase 2 was carried out by Bedfordshire County Council over the period September 2007-June 2008.

The project is of great use to UK local authority record offices, such as the one I work for, because it assesses the real world situation where outside organisations create digital records and then deposit them with local archive services. This is a different situation from that experienced by national archive organisations, which by and large deal with fewer record-creating organisations, and which therefore have more say over the sorts of records created. A UK local authority archives service typically deals with thousands of separate organisations and individuals, and has little or no say over file formats.

The aim of the DARP 2 project was therefore to survey a sample of these “typical depositors” to establish the reality behind this concern. Are organisations creating large numbers of electronic records for long term preservation, or are they still reliant on paper? How are they using digital records?

Bedfordshire and Luton Archives Service surveyed a range of organisations, including Parochial Church Councils, magistrates courts, town councils, parish councils, state and independent schools, and some businesses and charities. The survey was carried out with a questionnaire and with a follow-up interview.

Summary of DARP 2’s interesting results

“The overall picture was one of all or nothing in terms of understanding.” This is probably just as true of colleagues within the archives sector… Digital preservation sadly is not a subject which people can pick up a working knowledge of in their day to day activities, nor does it crop up very often in the media. You are either interested in it (in which case you will read up lots) or you are not (in which case you will know nothing). It is not like (say) gardening, where there is a whole spectrum of levels of involvement, from just weeding right through to plant breeding.

Most organisations still use paper. Some bodies stated that this was due to issues concerning the admissibility of digital records in court. Other organisations depend entirely on volunteers using home computers, using unsophisticated filing systems on old equipment. At least one organisation stated that electronic records were kept purely as backup for paper. Certainly it seems that many organisations regard paper as the best long term solution: more than half of all respondents archived their emails by printing them out and filing the hard copies.

The report itself states that “paper is still the medium of choice for record keeping – 85% of the bodies surveyed are printing out digital files… although computers offer very creative means of generating ways of populating and decorating the blank page, they are tending to be seen as tools for manipulating and storing documents not as the final means of storing and managing records.”

Only ten of the 26 respondents answered the question concerning migration, and three respondents even stated that they did not understand the concept.

There is a problem with digital record keeping in state schools. It is remarkable that DARP found it difficult even to recruit state schools to the project, or even to work out who at a school was responsible for record keeping.

No organisation thought that the record office would fail to deal with digital records.

The most popular backup medium was CD-R, closely followed by memory stick.

The problems of domain crawls

  • The legal approach is usually a risk-based one, ie. harvest the content and then take it down if the creator objects. If the creator starts a court case then the costs of this approach could be very high.
  • The approach has no basis in law. Copying any site without the content creator’s permission is an illegal act.
  • There is no guarantee that a site discoverable today in the repository will still be there in the future, as the content creator may have requested its removal.
  • There is no 1:1 engagement with the content creators, frequently no engagement at all.
  • The crawl ignores lots of valuable content. Not everything relevant to UK web history actually has a “.uk” domain.
  • Domain crawls are slow, and miss much of the web’s fleeting, at-risk, or semantic content.

The problems of active selection

  • The selective approach is usually a permissions-based one, ie. approach the content creator first and ask for permission to archive. But this demands engagement with the creator, which is time-consuming, and which in turn drives the policy to become even more selective than the repository may originally have envisaged. So the result is usually small-scale harvesting.
  • Creators may not understand the purpose or urgency of archiving.
  • Creators may say No, in which case the efforts made to engage with them have been fruitless.
  • Many sites are not selected.
  • The repository may not have the resources to re-evaluate selection decisions. Therefore, once a site has been rejected, it may continue to be rejected, even though its content has changed.
  • The repository needs to implement a policy on whether to continue archiving a site whose content accruals stop being useful. But this constant oversight of the harvesting schedule requires resources.