You are currently browsing the monthly archive for June 2009.



RODA (Repository of Authentic Digital Objects) is a Portuguese initiative to preserve authentic government digital objects. It is based on Fedora Commons, and supports the preservation of text documents, raster images, relational databases, video and audio. It runs in Java and is used through a suitable browser. RODA’s core preservation strategy is migration, but it keeps the original representation too, so it should still be possible to open old files on emulated systems.

It’s in its final stages of preparation now; a demo is available online. I created the screengrabs below myself while exploring the demo.

My notes are very brief. If you go along to the demo you will discover that two of the PDF documents preserved there are papers explaining more about the principles, systems and strategy behind the RODA project.

RODA is OAIS-compliant, so let’s run through this in OAIS order.

The SIP: this comprises the digital original and its metadata, all inside a METS envelope which is then zipped. Preservation metadata is a PREMIS record and descriptive metadata is in a segment of EAD. Technical metadata would also be nice but RODA’s creators say it “is not mandatory as is seldom created by producers.”

Files included in the SIP are accompanied by checksums and are checked for viruses.  Neatly, there are a number of ways that producers can create SIPs, one of which is a dedicated app called RODA-in.
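To make the checksum step concrete, here is a minimal Python sketch of how a producer might record checksums for the files in a SIP, and how an ingest step might check them again on arrival. This is not RODA's own code: the choice of SHA-256 and the function names `build_manifest` and `verify_manifest` are my assumptions for illustration, and virus scanning is left out.

```python
import hashlib
from pathlib import Path
from typing import Dict, List


def sha256_of(path: Path, chunk_size: int = 65536) -> str:
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(folder: Path) -> Dict[str, str]:
    """Producer side: map each file's relative path to its checksum."""
    return {
        str(p.relative_to(folder)): sha256_of(p)
        for p in sorted(folder.rglob("*"))
        if p.is_file()
    }


def verify_manifest(folder: Path, manifest: Dict[str, str]) -> List[str]:
    """Ingest side: list files that are missing or whose checksum has changed."""
    return [
        name
        for name, expected in manifest.items()
        if not (folder / name).is_file() or sha256_of(folder / name) != expected
    ]
```

An empty list from `verify_manifest` means every file arrived unchanged; any names it returns would need to be re-sent by the producer.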

Ingest. The system logs all SIPs which are in progress

Files in non-approved preservation formats (eg JPGs) are then normalised into formats which are approved (eg TIFFs). At that point they become AIPs. Approved formats are PDF/A for text (and for PowerPoint presentations too, to judge from the examples on the demo), TIFF for images, MPEG-2 for video, WAV for audio, and DBML for databases, this last one being an XML schema devised by the RODA team themselves. Files in other formats are normalised by going through a normalisation plugin; “plugins can easily be created to allow ingestion of other formats not in the list.”
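The normalisation rule can be sketched in a few lines of Python. The approved formats are the ones listed above; the `CONTENT_CLASS` table, the DOC and MP3 entries in it, and the function name `plan_normalisation` are my own hypothetical stand-ins, not part of RODA.

```python
from typing import Optional

# Approved preservation formats as listed in the post, keyed by content class.
APPROVED = {
    "text": "PDF/A",
    "image": "TIFF",
    "video": "MPEG-2",
    "audio": "WAV",
    "database": "DBML",
}

# Hypothetical table of incoming formats and the content class they belong to.
CONTENT_CLASS = {
    "JPG": "image",
    "TIFF": "image",
    "DOC": "text",
    "PDF/A": "text",
    "MP3": "audio",
    "WAV": "audio",
}


def plan_normalisation(source_format: str) -> Optional[str]:
    """Return the target format, or None if the file is already approved.

    Raises KeyError for a format with no entry, which stands in for
    "no normalisation plugin has been written for this format yet".
    """
    fmt = source_format.upper()
    target = APPROVED[CONTENT_CLASS[fmt]]
    return None if fmt == target else target
```

So `plan_normalisation("jpg")` asks for a TIFF, while a file that is already a TIFF needs no conversion. Supporting a new format would mean adding a table entry and the plugin that does the actual conversion.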

The AIP: if the archivist approves the SIP, and if it contains a normalised representation, then it becomes an AIP, and the customer can either search for it (simple search or a smart-looking advanced search) or browse the classification tree. The customer can view descriptive metadata, preservation metadata, previews of the data (depending on what sort of data it is) and the data itself.

Preservation metadata can be viewed as a timeline

An AIP. This is for a series of images; text documents, sound files etc all look different

Previews of specific images in the AIP

The photo and book-style previews are beautiful. I never knew Portugal looked like this 🙂

Security: currently the demo is open, but when it’s finally in action all users will be authenticated prior to accessing the repository, and all user actions will be logged. No anonymous users will be allowed. All preservation actions, such as format conversions, are likewise recorded. Permissions can be fine-tuned so that they apply from repository level all the way down to individual data objects. If a user does not have permission to view a specific item then it will not show in their search results.
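The last point, that items a user cannot view never appear in their search results, can be illustrated with a small Python sketch. This is my own simplification and not how RODA implements it: permissions here are held only at item level, and the `Item` class and `search` function are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class Item:
    identifier: str
    title: str
    # Hypothetical per-item permission list: the users allowed to view it.
    allowed_users: Set[str] = field(default_factory=set)


def search(items: List[Item], query: str, user: Optional[str]) -> List[Item]:
    """Return matching items the user may view; others are silently omitted."""
    if user is None:
        # No anonymous users: every search must come from an authenticated user.
        raise PermissionError("anonymous access is not allowed")
    needle = query.lower()
    return [
        item
        for item in items
        if needle in item.title.lower() and user in item.allowed_users
    ]
```

The filtering happens inside the search itself, so a user without permission gets no hint that a restricted item exists.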

And at the end of it all, the system can create stats!

The Administrator account can see stats

One thing which strikes you immediately is the clean finish to its user interface, the RODA WUI layer (RODA Web User Interface). Very, very cool.

The Portuguese team has clearly put in a great deal of time and skill here. The project team comprises the Portuguese National Archives, which carried out archiving consulting and development, the University of Minho which did the software engineering consulting, Assymetric Studios with design, the IDW with hardware, and Keep Solutions with maintenance and support.

My thanks to Miguel Ferreira of the University of Minho for answering my questions about RODA.

The WARC format for web archiving is now ISO 28500:2009. The format is used by the Internet Archive.

Here’s the release from the Library of Congress:

The International Internet Preservation Consortium is pleased to announce the publication of the WARC file format as an international standard: ISO 28500:2009, Information and documentation — WARC file format.

For many years, heritage organizations have tried to find the most appropriate ways to collect and keep track of World Wide Web material using web-scale tools such as web crawlers. At the same time, these organizations were concerned with the requirement to archive very large numbers of born-digital and digitized files. A need was for a container format that permits one file simply and safely to carry a very large number of constituent data objects (of unrestricted type, including many binary types) for the purpose of storage, management, and exchange. Another requirement was that the container need only minimal knowledge of the nature of the objects.

The WARC format is expected to be a standard way to structure, manage and store billions of resources collected from the web and elsewhere. It is an extension of the ARC format, which has been used since 1996 to store files harvested on the web. WARC format offers new possibilities, notably the recording of HTTP request headers, the recording of arbitrary metadata, the allocation of an identifier for every contained file, the management of duplicates and of migrated records, and the segmentation of the records. WARC files are intended to store every type of digital content, either retrieved by HTTP or another protocol.

The motivation to extend the ARC format arose from the discussion and experiences of the International Internet Preservation Consortium, whose core mission is to acquire, preserve and make accessible knowledge and information from the Internet for future generations. IIPC Standards Working Group put forward to ISO TC46/SC4/WG12 a draft presenting the WARC file format. The draft was accepted as a new Work Item by ISO in May 2005.

Over a period of four years, the ISO working group, with the Bibliothèque nationale de France as convener, collaborated closely with IIPC experts to improve the original draft. The WG12 will continue to maintain the standard and prepare its future revision.

Standardization offers a guarantee of durability and evolution for the WARC format. It will help web archiving entering into the mainstream activities of heritage institutions and other branches, by fostering the development of new tools and ensuring the interoperability of collections. Several applications are already WARC compliant, such as the Heritrix crawler for harvesting, the WARC tools for data management and exchange, the Wayback Machine, NutchWAX and other search tools for access. The international recognition of the WARC format and its applicability to every kind of digital object will provide strong incentives to use it within and beyond the web archiving community.

Abbie Grotke
Library of Congress
IIPC Communications Officer
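To give a feel for the container idea the release describes, here is a minimal Python sketch that builds a single WARC/1.0 `resource` record by hand: a version line, a handful of named header fields, a blank line, the content block, and two closing line breaks. It is only an illustration under my own assumptions (the function name `warc_record` and the example URI are made up) and not a replacement for a proper WARC library; the standard itself is the authority on which fields each record type needs.

```python
import uuid
from datetime import datetime, timezone


def warc_record(target_uri: str, payload: bytes, content_type: str) -> bytes:
    """Build one WARC/1.0 'resource' record around an arbitrary payload."""
    now = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    header_lines = [
        "WARC/1.0",
        "WARC-Type: resource",
        # Every record gets its own identifier, here a random UUID.
        f"WARC-Record-ID: <urn:uuid:{uuid.uuid4()}>",
        f"WARC-Date: {now}",
        f"WARC-Target-URI: {target_uri}",
        f"Content-Type: {content_type}",
        # The container only needs the length, not the nature, of the payload.
        f"Content-Length: {len(payload)}",
    ]
    header = "\r\n".join(header_lines).encode("utf-8") + b"\r\n\r\n"
    return header + payload + b"\r\n\r\n"


# One file can carry many objects: records are simply written one after another.
archive = b"".join(
    [
        warc_record("http://example.org/hello.txt", b"hello", "text/plain"),
        warc_record("http://example.org/pixel.bin", b"\x00\x01\x02", "application/octet-stream"),
    ]
)
```

Because each record states its own length, a reader can skip from record to record without understanding what is inside, which is the "minimal knowledge of the nature of the objects" requirement mentioned above.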

The reason why I’ve been away from digi preservation for so long is that I’ve been managing the move of our paper archives from one repository to another.  The move itself has gone more smoothly than I dared to hope: everything happened on schedule, the ICT didn’t let me down, all the boxes fitted their new locations so my sums must have been accurate enough… it’s taken three weeks to move a mile of paper and parchment archives.

Our understanding is that we’re the first people to control a UK local authority repository move with barcodes. It’s taken 2.5 years of preparation, mainly spent in getting all our barcode data onto CALM, but the result this week is that all we had to do was a massive zap of all the barcodes in the building (that’s taken 48 hours), upload the data into CALM’s locations module, and voila! – we now know where everything is.

Boxes hundreds of em

Above: boxes in the new repository.  At the old record office we had boxes of different sizes and formats scattered throughout the building. In the new repository we have been very strict in storing boxes purely by format even if it means splitting collections up. We’re relying totally on the barcodes to find them.


Rolled maps in linen bags. Every single individual package, whether it’s a roll, a box, a freestanding volume or a folder in a drawer, has its own barcode.

James Dear

Even the 19th century portrait which we have on deposit has its own barcode!


Here one of my colleagues is zapping the boxes on their new shelves. First we zap the shelf (all the shelves have their own barcodes) and then we zap the items on it. This raw data gets imported into Excel where lookup tables replace the numbers with human readable information (eg replacing “L012345” with “Bay R6 shelf D”). Then it all goes into CALM’s locations module, so that it links automatically with the documents’ catalogue entries.
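For anyone who wants to try the same approach, the shelf-then-items logic can be sketched in a few lines of Python. We did this step in Excel, so this is an equivalent sketch and not our actual workbook; the item barcodes and the second shelf entry are made-up examples, and the function name `assign_locations` is hypothetical.

```python
from typing import Dict, Iterable, List, Tuple

# Lookup table: shelf barcode -> human readable location.
# "L012345" is the example from the post; the second entry is invented.
SHELF_NAMES = {
    "L012345": "Bay R6 shelf D",
    "L012346": "Bay R6 shelf E",
}


def assign_locations(
    scans: Iterable[str], shelf_names: Dict[str, str]
) -> List[Tuple[str, str]]:
    """Pair each item barcode with the most recently zapped shelf."""
    current_shelf = None
    rows = []
    for code in scans:
        if code in shelf_names:
            # A shelf barcode: everything zapped next sits on this shelf.
            current_shelf = shelf_names[code]
        elif current_shelf is None:
            raise ValueError(f"item {code} was zapped before any shelf")
        else:
            rows.append((code, current_shelf))
    return rows


# Raw scanner output in the order it was zapped (item codes are invented).
raw = ["L012345", "B000001", "B000002", "L012346", "B000003"]
locations = assign_locations(raw, SHELF_NAMES)
```

The resulting rows of item barcode and location are what would then be uploaded into the locations module.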

The methodology took us months to work out, followed by two years of repackaging work and sticking barcodes on everything, just to result in 48 hours of zapping in the new repository.

Here are some details about our barcode methodology in the National Archives’ RecordKeeping magazine (we’re on page 36).

It’s been a long, long project but it’s all gone smoothly and I feel rather chuffed to have managed the move in a new way. Back to digi preservation soon!