RODA

RODA (Repository of Authentic Digital Objects) is a Portuguese initiative to preserve government authentic digital objects. It is based on Fedora Commons, and supports the preservation of text documents, raster images, relational databases, video and audio. It runs in Java and is used through a web browser. RODA’s core preservation strategy is migration, but it keeps the original representation too, so it should still be possible to open the old files on emulated systems.

It’s in its final stages of preparation now; a demo is available at http://roda.di.uminho.pt/?locale=en#home. I’ve created the screengrabs below myself while exploring the demo.

My notes are very brief. If you go along to the demo you will discover that two of the PDF documents preserved there are papers explaining more about the principles, systems and strategy behind the RODA project.

RODA is OAIS-compliant, so let’s run through this in OAIS order.

The SIP: this comprises the digital original and its metadata, all inside a METS envelope which is then zipped. Preservation metadata is a PREMIS record and descriptive metadata is in a segment of EAD. Technical metadata would also be nice but RODA’s creators say it “is not mandatory as is seldom created by producers.”

Files included in the SIP are accompanied by checksums and are checked for viruses.  Neatly, there are a number of ways that producers can create SIPs, one of which is a dedicated app called RODA-in.
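Just to make the packaging idea concrete, here’s a rough sketch in Python of a SIP-style zip with a checksum manifest. The file names and layout are my own invention for illustration; a real RODA SIP follows the METS, PREMIS and EAD schemas rather than this simplified structure.

```python
import hashlib
import zipfile

def build_sip(sip_path, files, mets_xml):
    """Package payload files plus a METS envelope into a zipped SIP,
    recording a SHA-1 checksum for each payload file.
    (Simplified sketch: layout and manifest format are invented.)"""
    checksums = {}
    with zipfile.ZipFile(sip_path, "w") as sip:
        for name, data in files.items():
            checksums[name] = hashlib.sha1(data).hexdigest()
            sip.writestr("representation/" + name, data)
        # The METS envelope would reference PREMIS and EAD sections.
        sip.writestr("METS.xml", mets_xml)
        manifest = "\n".join(f"{h}  {n}" for n, h in checksums.items())
        sip.writestr("checksums.sha1", manifest)
    return checksums
```

The returned checksums can then be verified again after transfer, which is the point of shipping them inside the package.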

Ingest. The system logs all SIPs which are in progress

Files in non-approved preservation formats (eg JPGs) are then normalised into formats which are approved (eg TIFFs). At that point they become AIPs. Approved formats are PDF/A for text (and for powerpoint presentations too, to judge from the examples on the demo), TIFF for images, MPEG-2 for video, WAV for audio, and DBML, this last one being an XML schema devised by the RODA team themselves for databases. Files in other formats are normalised by going through a normalisation plugin; “plugins can easily be created to allow ingestion of other formats not in the list.”
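The plugin idea is easy to picture. Here’s a toy registry in Python: the conversions are stubs and the interface is invented, not RODA’s actual API, but it shows how a format-to-normaliser mapping can be extended just by registering another plugin.

```python
# Registry mapping source file extensions to normalisation functions.
NORMALISERS = {}

def normaliser(*extensions):
    """Decorator: register a conversion function for the given extensions."""
    def register(func):
        for ext in extensions:
            NORMALISERS[ext.lower()] = func
        return func
    return register

@normaliser(".jpg", ".jpeg", ".bmp")
def to_tiff(path):
    # Stand-in for a real image conversion to TIFF.
    return path.rsplit(".", 1)[0] + ".tif"

@normaliser(".doc", ".ppt")
def to_pdfa(path):
    # Stand-in for a real conversion to PDF/A.
    return path.rsplit(".", 1)[0] + ".pdf"

def normalise(path):
    """Dispatch a file to whichever plugin claims its extension."""
    ext = "." + path.rsplit(".", 1)[1].lower()
    if ext not in NORMALISERS:
        raise ValueError(f"no normalisation plugin for {ext}")
    return NORMALISERS[ext](path)
```

Adding support for a new format is then just one more decorated function, which matches the “plugins can easily be created” claim.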

The AIP: if the archivist approves the SIP, and if it contains a normalised representation, then it becomes an AIP, and the customer can either search for it (simple search or a smart-looking advanced search) or browse the classification tree. The customer can view descriptive metadata, preservation metadata, previews of the data (depending on what sort of data it is) and the data itself.

Preservation metadata can be viewed as a timeline

An AIP. This is for a series of images; text documents, sound files etc all look different

Previews of specific images in the AIP

The photo and book-style previews are beautiful. I never knew Portugal looked like this 🙂

Security: currently the demo is open, but when it’s finally in action all users will be authenticated prior to accessing the repository, and all user actions will be logged. No anonymous users will be allowed. All preservation actions, such as format conversions, are likewise recorded. Permissions can be fine-tuned so that they apply from repository level all the way down to individual data objects. If a user does not have permission to view a specific item then it will not show in their search results.
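To illustrate what that sort of fine-tuning might look like (the permission levels, paths and user names below are invented for illustration, not RODA’s actual model):

```python
# An object inherits its permission from the nearest ancestor that sets one.
PERMISSIONS = {
    "/": {"archivist"},                  # repository level
    "/fonds-A": {"archivist", "alice"},  # a whole collection opened up
    "/fonds-A/item-7": {"archivist"},    # one object locked back down
}

def can_view(user, path):
    """Walk from the object up towards the repository root and apply
    the first explicit rule found."""
    while True:
        if path in PERMISSIONS:
            return user in PERMISSIONS[path]
        if path == "/":
            return False
        path = path.rsplit("/", 1)[0] or "/"

def filter_results(user, paths):
    """Search results only ever include items the user may view."""
    return [p for p in paths if can_view(user, p)]
```

The last function is the behaviour described above: an item the user cannot see simply never appears in their search results.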

And at the end of it all, the system can create stats!

The Administrator account can see stats

One thing which immediately strikes you is the clean finish of its user interface, the RODA WUI (Web User Interface) layer. Very, very cool.

The Portuguese team has clearly put in a great deal of time and skill here. The project team comprises the Portuguese National Archives, who carried out the archiving consulting and development; the University of Minho, which did the software engineering consulting; Assymetric Studios, who did the design; the IDW, who supplied the hardware; and Keep Solutions, who provide maintenance and support.

My thanks to Miguel Ferreira of the University of Minho for answering my questions about RODA.

The reason why I’ve been away from digi preservation for so long is that I’ve been managing the move of our paper archives from one repository to another.  The move itself has gone more smoothly than I dared to hope: everything happened on schedule, the ICT didn’t let me down, all the boxes fitted their new locations so my sums must have been accurate enough… it’s taken three weeks to move a mile of paper and parchment archives.

Our understanding is that we’re the first people to control a UK local authority repository move with barcodes. It’s taken 2.5 years of preparation, mainly spent in getting all our barcode data onto CALM, but the result this week is that all we had to do was a massive zap of all the barcodes in the building (that’s taken 48 hours), upload the data into CALM’s locations module, and voila! – we now know where everything is.

Boxes, hundreds of ’em

Above: boxes in the new repository.  At the old record office we had boxes of different sizes and formats scattered throughout the building. In the new repository we have been very strict in storing boxes purely by format even if it means splitting collections up. We’re relying totally on the barcodes to find them.

Cantilevers

Rolled maps in linen bags. Every single individual package, whether it’s a roll, a box, a freestanding volume or a folder in a drawer, has its own barcode.

James Dear

Even the 19th century portrait which we have on deposit has its own barcode!

Zapping

Here one of my colleagues is zapping the boxes on their new shelves. First we zap the shelf (all the shelves have their own barcodes) and then we zap the items on it. This raw data gets imported into Excel where lookup tables replace the numbers with human readable information (eg replacing “L012345” with “Bay R6 shelf D”). Then it all goes into CALM’s locations module, so that it links automatically with the documents’ catalogue entries.
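The lookup step is simple enough to sketch in a few lines of Python. The barcodes and shelf names below are invented, and we actually did this with Excel lookup tables before loading CALM, but the shelf-then-items pairing logic is the same:

```python
# Lookup table from shelf barcode to human-readable location
# (codes and names invented for illustration).
SHELF_NAMES = {
    "L012345": "Bay R6 shelf D",
    "L012346": "Bay R6 shelf E",
}

def resolve_scans(scans):
    """Pair each scanned item barcode with the most recent shelf barcode.
    The scanner output is a flat list: a shelf code, then the codes of
    the items sitting on that shelf, then the next shelf, and so on."""
    locations = {}
    current_shelf = None
    for code in scans:
        if code in SHELF_NAMES:
            current_shelf = SHELF_NAMES[code]
        elif current_shelf is not None:
            locations[code] = current_shelf
    return locations
```

The resulting item-to-location table is what goes on into the locations module, where it links up with the catalogue entries.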

The methodology took us months to work out, followed by two years of repackaging work and sticking barcodes on everything, just to result in 48 hours of zapping in the new repository.

Here are some details about our barcode methodology in the National Archives’ RecordKeeping magazine (we’re on page 36).

It’s been a long, long project but it’s all gone smoothly and I feel rather chuffed to have managed the move in a new way. Back to digi preservation soon!

Yesterday I visited Gloucestershire Archives to have a look at their GAIP (Gloucestershire Archives Ingest Package) software.

GAIP is a little Perl app which is open source and nicely platform independent (yesterday we saw it in action on both XP and Fedora). Using GAIP, you can take a digital file, or a collection of files, and create a non-proprietary preservation version of it, which is then kept in a .tgz file containing the preservation version, the original, the metadata, and a log of alterations. Currently it works with image files, so that GAIP can create a .tgz containing the original bmp (for instance) as well as the png which it has created. GAIP can then also create a publication version of the image, usually a JPEG. Gloucestershire Archives are intending to expand GAIP to cover other sorts of files too: it depends on what sorts of converters they can track down.
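To picture what GAIP produces, here’s a rough Python equivalent of the packaging step. GAIP itself is a Perl app, and the member names and metadata layout here are my own guesses for illustration rather than its real structure:

```python
import json
import tarfile
import time
from io import BytesIO

def make_package(out_path, original_name, original_bytes,
                 preserved_name, preserved_bytes):
    """Bundle the original file, its preservation version, metadata and
    an alterations log into one .tgz (layout invented for illustration)."""
    log = (time.strftime("%Y-%m-%d")
           + f" converted {original_name} -> {preserved_name}\n")
    meta = json.dumps({"original": original_name,
                       "preserved": preserved_name})
    with tarfile.open(out_path, "w:gz") as tgz:
        for name, data in [(original_name, original_bytes),
                           (preserved_name, preserved_bytes),
                           ("metadata.json", meta.encode()),
                           ("log.txt", log.encode())]:
            info = tarfile.TarInfo(name)
            info.size = len(data)
            tgz.addfile(info, BytesIO(data))
```

So a bmp ingested this way would come out as one .tgz holding the bmp, the png made from it, the metadata, and the log.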

At present GAIP uses a command line interface which isn’t terribly friendly, but this can easily be improved.

From my point of view, I was glad to have a play with GAIP as it has rekindled my optimism about low-level digital preservation. I have been in a sulk for a couple of months because the only likely solutions seemed to be big-budget applications set up by (and therefore controlled by) national-level organisations. GAIP however is a ray of local light, a sign that UK local authorities might be able to develop in-house and low budget solutions which are realistic to our own specific contexts.

This is the Paradigm (Personal Archives Accessible in DIGital Media) project workbook, published by the Bodleian Library in 2007. It’s available on the web, free, from the Paradigm project website, but you can now get a printed version too, which is much easier to read over a nice cup of coffee, especially as it is nearly 300 pages long.

Paradigm was a project exploring the issues involved in the long term preservation of personal digital archives, by examining in particular the archives of contemporary UK politicians. Politicians and their offices produce a chaotic welter of digital media in various formats and in a variety of states of semi-organisation, so the Paradigm project is extraordinarily useful to repositories having to deal with electronic media which they accession from outside bodies (like where I work). The workbook is enriched by a section on legal issues surrounding digital preservation, and the appendices contain paperwork templates for digital repositories, such as a model gift agreement. Top stuff.


There are a handful of different strategies for archiving websites, of which a web-served archive is just one. The best example of a web-served archive is the Internet Archive.

Tech

The IA stores files of websites in WARC container files. A WARC file keeps a sequence of web pages and headers, the headers describing the content and length of each harvested page. A WARC also contains secondary content, such as assigned metadata and transformations of original files.

Each record has an offset which is stored in an index ordered by URI. This means that it should be possible to rapidly extract individual files based on their URI. The selected files then get sent to a web server which forwards them to the client.

Doing it this way allows the naming of individual web pages to be preserved. It also scales up pretty well (the IA has a colossal amount of information).
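The principle is easy to sketch. Here’s a toy Python version of offset-based retrieval; a real IA index (a CDX file) carries more fields and real WARC records have their own headers, but the seek-and-read idea is the same:

```python
from bisect import bisect_left

def lookup(index, uri):
    """Binary-search a URI-sorted list of (uri, offset, length) tuples.
    Sorting by URI is what makes fast lookup possible at scale."""
    uris = [entry[0] for entry in index]
    i = bisect_left(uris, uri)
    if i < len(index) and index[i][0] == uri:
        return index[i][1], index[i][2]
    return None

def fetch_record(container, index, uri):
    """Seek into the container file and read exactly one record,
    without scanning anything else in the archive."""
    hit = lookup(index, uri)
    if hit is None:
        return None
    offset, length = hit
    container.seek(offset)
    return container.read(length)
```

Because each request touches only one record, the approach scales to the colossal holdings mentioned above.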

Geoffrey Brown of the Indiana University Department of Computer Science has a nice presentation available online which talks about the CIC Floppy Disk Project, and which along the way argues the case for emulation. The CIC FDP is intended to make publications deposited with federal US libraries available via FTP over the Web. In many cases this means saving not just the contents of the floppy, but also the applications needed to make the contents readable. One of his diagrams makes the point that emulation results in two separate repositories, one for documents and the other for software.

The project doesn’t appear to be strict emulation, in that some leeway is allowed. For instance, slide no. 16 bullets the software necessary for the project, one of which is Windows 98, even though “most disks were for msdos, Win 3.1”. I take that to mean that while most floppies were created on Win 3.1, they work just as well in Win 98, so let’s use Win 98 instead. Strict emulation theory probably isn’t too happy with that.

Slide 21 is the most interesting as it contains a handy summary of the problems of migration:

  • Loss of information (e.g. word edits)
  • Loss of fidelity (e.g. “WordPerfect to Word isn’t very good”). WordPerfect is one of the apps listed earlier as necessary for their emulation.
  • Loss of authenticity: users of a migrated document need access to the original to verify authenticity [AA: but this depends on how you define authenticity, surely?]
  • Not always technically possible (e.g. closed proprietary formats)
  • Not always practically feasible (e.g. costs may be too high)
  • Emulation may be necessary anyway to enable migration.

Dioscuri is (as far as I am aware) the first ever emulator created specifically with long term digital preservation in mind. It is available for download from Sourceforge, and the project’s own website is here.

This awesomely ambitious project began in 2005 as a co-operative venture between the Dutch National Archives and the Koninklijke Bibliotheek. The first working example came out in late 2007. The project has now been subsumed within the European PLANETS project.

Available online here. The Review includes a page or so on digital preservation at the PA. They seem to have been pretty busy.

The PA team has:

  • placed digi pres “at the heart of our objectives for the next three years” which must make them almost unique among UK institutions, I would have thought
  • recruited a TNA digital records specialist and joined the DPC
  • started working with Parliament’s EDRM and ICT people
  • begun work on a Digital Preservation Strategy and an audit of known digital assets.

The Review says that a scoping report on PA digi pres should be out “June 2007”.

Brief article about the project in TNA’s RecordKeeping for Autumn 2004.

Original project

The original project was only possible because of a Government programme which had put a BBC Micro into every school in the country by 1980-81, creating a user base of compatible computers. School children in 1986 entered their own data onto their school computers, which was copied onto floppy disks or tapes sent to the BBC. All this text and these images, together with analogue photographs of OS maps, were transferred to analogue videotape. The community data finally totalled 29,000 photographs and 27,000 maps. The whole database was then assembled on master videotapes from which the final videodiscs were produced. The monitor was usually a TV, which imposed a limit on the level of detail visible at once: users needed to switch between maps, pictures and text.

Restoration project

There were a number of parallel rescue projects but the one which actually worked was a collaboration between TNA, BBC and others. It did not rescue data from the videodiscs, but from the master tapes.

Independently, LongLife Data Ltd had developed a new PC interface to the community data. It works in the same way as the real one but because a modern monitor has higher resolution than a 1980s TV screen, pictures and text can be shown simultaneously. This is the version now available on the web.

Alan’s thoughts

  • the data was restored from analogue videotapes, not from the videodiscs or from the submitted floppy disks. After 15 years the tapes were still readable. So in a sense it’s a straightforward media refreshing thing.
  • the new interface is not an exact emulation of the old interface. It is a wholly new app. The current browsing experience has therefore lost authenticity. (Though the data is the same.)
  • can we find out anything about the authenticity of the data itself?