RODA

RODA (Repository of Authentic Digital Objects) is a Portuguese initiative to preserve authentic government digital objects. It is based on Fedora Commons, and supports the preservation of text documents, raster images, relational databases, video and audio. It runs in Java on a suitable browser. RODA’s core preservation strategy is migration, but it keeps the original representation too, so it should still be possible to open old files on emulated systems.

It’s in its final stages of preparation now; a demo is available at http://roda.di.uminho.pt/?locale=en#home. I’ve created the screengrabs below myself while exploring the demo.

My notes are very brief. If you go along to the demo you will discover that two of the PDF documents preserved there are papers explaining more about the principles, systems and strategy behind the RODA project.

RODA is OAIS-compliant, so let’s run through this in OAIS order.

The SIP: this comprises the digital original and its metadata, all inside a METS envelope which is then zipped. Preservation metadata is a PREMIS record and descriptive metadata is in a segment of EAD. Technical metadata would also be nice but RODA’s creators say it “is not mandatory as is seldom created by producers.”

Files included in the SIP are accompanied by checksums and are checked for viruses.  Neatly, there are a number of ways that producers can create SIPs, one of which is a dedicated app called RODA-in.
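
To make the checksum step concrete, here is a minimal sketch in Python of how a producer might record checksums for the files going into a SIP and check them again later. This is not RODA's or RODA-in's code: the function names are my own, and the choice of SHA-256 is an assumption, as the RODA papers I have seen do not say which algorithm is used.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 65536) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(folder: Path) -> dict[str, str]:
    """Map each file's path (relative to the folder) to its checksum."""
    return {
        p.relative_to(folder).as_posix(): sha256_of(p)
        for p in sorted(folder.rglob("*"))
        if p.is_file()
    }


def verify_manifest(folder: Path, manifest: dict[str, str]) -> list[str]:
    """Return the files that are missing or whose checksum has changed."""
    return [
        name
        for name, expected in manifest.items()
        if not (folder / name).is_file() or sha256_of(folder / name) != expected
    ]
```

The producer would run build_manifest() before sending, and the repository would run verify_manifest() on receipt; an empty list means nothing was damaged in transit.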

Ingest. The system logs all SIPs which are in progress

Files in non-approved preservation formats (eg JPGs) are then normalised into formats which are approved (eg TIFFs). At that point they become AIPs. Approved formats are PDF/A for text (and for PowerPoint presentations too, to judge from the examples on the demo), TIFF for images, MPEG-2 for video, WAV for audio, and DBML, this last one being an XML schema devised by the RODA team themselves for databases. Files in other formats are normalised by going through a normalisation plugin; “plugins can easily be created to allow ingestion of other formats not in the list.”
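
The normalisation rule can be pictured as a simple lookup. The sketch below is only my illustration in Python, not RODA's plugin API: the approved formats are the ones listed above, but the extension table, the function names and the idea of registering a new extension are all hypothetical.

```python
import os

# Approved preservation formats, as listed above, keyed by broad content class.
APPROVED_FORMATS = {
    "text": "PDF/A",
    "image": "TIFF",
    "video": "MPEG-2",
    "audio": "WAV",
    "database": "DBML",
}

# Hypothetical table of incoming file extensions and the class they belong to.
CONTENT_CLASS_BY_EXTENSION = {
    ".doc": "text",
    ".ppt": "text",
    ".jpg": "image",
    ".jpeg": "image",
    ".mp3": "audio",
    ".avi": "video",
}


def register_format(extension: str, content_class: str) -> None:
    """Stand-in for adding a plugin: teach the table about a new extension."""
    if content_class not in APPROVED_FORMATS:
        raise ValueError(f"unknown content class: {content_class}")
    CONTENT_CLASS_BY_EXTENSION[extension.lower()] = content_class


def target_format(filename: str) -> str:
    """Return the approved format a file should be normalised into."""
    extension = os.path.splitext(filename)[1].lower()
    content_class = CONTENT_CLASS_BY_EXTENSION.get(extension)
    if content_class is None:
        raise LookupError(f"no normalisation rule for {extension!r}")
    return APPROVED_FORMATS[content_class]
```

A file with an extension the table does not know raises an error here, which is roughly the point at which a real system would need a new plugin.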

The AIP: if the archivist approves the SIP, and if it contains a normalised representation, then it becomes an AIP, and the customer can either search for it (simple search or a smart-looking advanced search) or browse the classification tree. The customer can view descriptive metadata, preservation metadata, previews of the data (depending on what sort of data it is) and the data itself.

Preservation metadata can be viewed as a timeline

An AIP. This is for a series of images; text documents, sound files etc all look different

Previews of specific images in the AIP

The photo and book-style previews are beautiful. I never knew Portugal looked like this 🙂

Security: currently the demo is open, but when it’s finally in action all users will be authenticated prior to accessing the repository, and all user actions will be logged. No anonymous users will be allowed. All preservation actions, such as format conversions, are likewise recorded. Permissions can be fine-tuned so that they apply from repository level all the way down to individual data objects. If a user does not have permission to view a specific item then it will not show in their search results.

And at the end of it all, the system can create stats!

The Administrator account can see stats

One thing that strikes you immediately is the clean finish of its user interface, the RODA WUI layer (RODA Web User Interface). Very, very cool.

The Portuguese team has clearly put in a great deal of time and skill here. The project team comprises the Portuguese National Archives, who carried out archiving consulting and development; the University of Minho, which did the software engineering consulting; Assymetric Studios, with design; the IDW, with hardware; and Keep Solutions, with maintenance and support.

My thanks to Miguel Ferreira of the University of Minho for answering my questions about RODA.

Yesterday I visited Gloucestershire Archives to have a look at their GAIP (Gloucestershire Archives Ingest Package) software.

GAIP is a little Perl app which is open source and nicely platform independent (yesterday we saw it in action on both XP and Fedora). Using GAIP, you can take a digital file, or a collection of files, and create a non-proprietary preservation version of it, which is then kept in a .tgz file containing the preservation version, the original, the metadata, and a log of alterations. Currently it works with image files, so that GAIP can create a .tgz containing the original bmp (for instance) as well as the png which it has created. GAIP can then also create a publication version of the image, usually a JPEG. Gloucestershire Archives are intending to expand GAIP to cover other sorts of files too: it depends on what sorts of converters they can track down.
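
For anyone who wants to picture what such a package looks like, here is a rough sketch of the same idea. It is written in Python rather than Perl, and it is not GAIP's code: the folder names inside the archive, the JSON metadata file and the function names are my own assumptions. It also assumes the conversion to PNG has already been done by some other tool.

```python
import io
import json
import tarfile
from pathlib import Path


def _add_bytes(tar: tarfile.TarFile, name: str, data: bytes) -> None:
    """Add an in-memory file (metadata, log) to the open archive."""
    info = tarfile.TarInfo(name=name)
    info.size = len(data)
    tar.addfile(info, io.BytesIO(data))


def build_package(package_path: Path, original: Path, preservation: Path,
                  metadata: dict, log_lines: list[str]) -> Path:
    """Bundle original, preservation copy, metadata and log into one .tgz."""
    with tarfile.open(package_path, "w:gz") as tar:
        tar.add(original, arcname=f"original/{original.name}")
        tar.add(preservation, arcname=f"preservation/{preservation.name}")
        _add_bytes(tar, "metadata.json",
                   json.dumps(metadata, indent=2).encode("utf-8"))
        _add_bytes(tar, "alterations.log",
                   "\n".join(log_lines).encode("utf-8"))
    return package_path
```

Called with a bmp, its png and a one-line log, this produces a single .tgz holding all four pieces, which is the essence of what we saw GAIP do.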

At present GAIP uses a command line interface which isn’t terribly friendly, but this can easily be improved.

From my point of view, I was glad to have a play with GAIP as it has rekindled my optimism about low-level digital preservation. I have been in a sulk for a couple of months because the only likely solutions seemed to be big-budget applications set up by (and therefore controlled by) national-level organisations. GAIP however is a ray of local light, a sign that UK local authorities might be able to develop in-house and low budget solutions which are realistic to our own specific contexts.

Unlike Microsoft’s suite, which creates files in proprietary formats, OpenOffice’s files are in formats which are open. ODF (Open Document Format) is an ISO standard and a European Union recommendation. OpenOffice Writer can itself convert a file from DOC to ODT. Unfortunately the conversion doesn’t always work. (Give examples?)

“The other disadvantage of Open Document Format is that even for simple documents it is extremely complex. For example, unzipping a one-page document of about 120 words results in a collection of files totalling 300K in size. This makes it relatively difficult to locate the meaningful content and structure and transform it into other formats for viewing or other uses. Instead of leaving documents in this complex format and having a hard job writing converters (XSLT stylesheets) for all possible future uses, it would be better to store documents in a simple, clear, well-structured format that makes converters easier to write.” (Ian Barnes of the Australian National University, Preservation of word processing documents (2006), available here, accessed 29.11.07.)

Issues with the ZIP format too. ZIP is ok now, as ZIP files can be opened by any major platform, and that doesn’t look as if it is going to change. On the other hand, a corruption in the file can result in the loss of the entire file.
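
One partial defence is to test archives regularly, so that corruption is at least spotted while a good copy still exists elsewhere. A minimal sketch using Python’s standard zipfile module follows; the function name and the wording of its return values are my own. It works on an ODT file too, since that is a ZIP container.

```python
import zipfile
import zlib


def check_zip(path: str) -> str:
    """Report whether a ZIP archive opens and every member passes its CRC check."""
    try:
        with zipfile.ZipFile(path) as archive:
            # testzip() reads every member and returns the first bad name, or None.
            bad_member = archive.testzip()
    except (zipfile.BadZipFile, zlib.error):
        # The archive structure or compressed data is too damaged to read.
        return "unreadable archive"
    if bad_member is not None:
        return f"corrupt member: {bad_member}"
    return "ok"
```

This only detects damage; it does not repair anything, so it is no substitute for keeping more than one copy.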

RTF is Rich Text Format. Microsoft have published the format, so in that sense it’s open, but it has some ‘quirks.’ Better check all this.

The National Library of Australia has saved much of its stuff in RTF.

Firstly because the software company can change the licensing rules. At present Word is not the only application capable of opening DOC file types; there are others, but Microsoft could change that tomorrow if they wished. Microsoft might decide to insist that only Word or another Microsoft product could open a DOC file. They would lose a lot of friends that way, but there is no legal reason why they could not do so if they wished. After all, Microsoft have spent a great deal of money investing in DOC. Microsoft could then hike up the licence fees to make opening a DOC file in the future a very expensive thing to do. Now this won’t really concern you if you are an individual or a small business, but it does worry governments (who are paid to worry about this sort of thing on our behalf). Governments create a lot of records that need to be kept forever, and they do not want to be in a situation where they have to pay a commercial company an ever-increasing amount of money just to be able to read their own documents.

Secondly, Microsoft could change the format. And in fact this is exactly what they have done over the years.

By Stuart D. Lee, 2002. Reviewed by Richard M. Davis in JSA vol 23 no 2, 2002.

Aimed at librarians and information science students, so it deals mainly with electronic format published material within a library context. Recommends using published, open standards for data storage and exchange, to best preserve data beyond the life of the host system. ‘But of course publishers have much the same reservations about giving us those sorts of freedoms as record companies do about us ripping and burning our own CDs!’

Summary of responses from archives-nra email list:

-----Original Message-----
From: Archivists, conservators and records managers.
Sent: 11 December 2007 15:54

Subject: open source repository software for digital archives – summary of responses

Many thanks to those of you who responded to my enquiry about using institutional repository software to manage born-digital archives for long-term preservation. A summary of responses follows:

I was recommended to consult DCC (Digital Curation Centre) and AHDS (Arts and Humanities Data Service) for advice. DCC have produced technology watch papers on the various IR software available (http://www.dcc.ac.uk/resource/technology-watch/) and AHDS have a useful webpage about the development of their repository (http://ahds.ac.uk/preservation/repository.htm).

West Yorkshire Archive Service are currently testing Fedora for the purpose of managing digital archives. It was suggested to contact Wellcome Institute where Fedora has been implemented as a digital preservation testbed, and I was advised that the University of Hull’s RepoMMan Project documentation – particularly D-D4 – is much clearer and more comprehensive for beginners than the user documents on Fedora’s own websites. Finally, the University of London Computer Centre got in touch to share their experience in developing digital repositories. Staff in the Digital Preservation team developed Fedora for the recently launched Linnean Society online archive of digitised images. (http://www.linnean-online.org/)

ULCC also brought to my attention that as part of the PARADIGM project (http://www.paradigm.ac.uk/index.html) there was a test comparison between Fedora and Dspace software, ie for use within a Digital Archive context. It was found that Fedora seemed the better choice for digital archives, as it was more flexible and customisable. On the Digital Preservation Training Programme course at ULCC, attendees have been warned that you do need considerable IT support in order to customise software like Fedora. PARADIGM project officers, and the National Library of Wales (who have implemented Fedora) are two organisations worth contacting about specifications for a digital archive.

Some more links c/o ULCC:

http://www.jisc.ac.uk/whatwedo/programmes/programme_digital_repositories.aspx

Dspace at Cambridge example – http://www.dspace.cam.ac.uk/bitstream/1810/104791/1/Rosetta_Stone_paper.pdf

See blog posting about comparisons of software: http://forum.dcc.ac.uk/viewtopic.php?p=212

Thanks to all who responded – I hope these links are helpful for list members. As I’m about to leave this post, my colleagues here at the Red Cross will be following this up in the new year.

Hi again,

I have been asked to correct a mistake in my previous summary about ULCC’s development of the Linnean Society digital archive. This was built on Eprints, not Fedora as I stated. My apologies for this mistake! More info here:

http://www.linnean-online.org/information.html

MF also pointed out the UK-centric nature of the resources I listed, and brought to my attention RODA in Portugal:

http://dlmforum.typepad.com/DLM2007-RODA.pdf

Thanks all!

[names etc removed]