RODA

RODA (Repository of Authentic Digital Objects) is a Portuguese initiative to preserve authentic government digital objects. It is based on Fedora Commons, and supports the preservation of text documents, raster images, relational databases, video and audio. It runs in Java on a suitable browser. RODA’s core preservation strategy is migration, but it keeps the original representation too, so it should be OK to open old files on emulated systems.

It’s in its final stages of preparation now; a demo is available at http://roda.di.uminho.pt/?locale=en#home. I’ve created the screengrabs below myself while exploring the demo.

My notes are very brief. If you go along to the demo you will discover that two of the PDF documents preserved there are papers explaining more about the principles, systems and strategy behind the RODA project.

RODA is OAIS-compliant, so let’s run through this in OAIS order.

The SIP: this comprises the digital original and its metadata, all inside a METS envelope which is then zipped. Preservation metadata is a PREMIS record and descriptive metadata is in a segment of EAD. Technical metadata would also be nice but RODA’s creators say it “is not mandatory as is seldom created by producers.”

Files included in the SIP are accompanied by checksums and are checked for viruses.  Neatly, there are a number of ways that producers can create SIPs, one of which is a dedicated app called RODA-in.
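To make the checksum idea concrete, here is a minimal sketch of packaging files into a zip alongside a checksum list, then re-checking them on receipt. It is not RODA's actual SIP layout: the `manifest.json` file is an invented stand-in for the METS envelope, and the function names, file names and the choice of SHA-256 are my own assumptions.

```python
import hashlib
import io
import json
import zipfile


def build_sip(files):
    """Zip the payload files together with a checksum manifest (hypothetical layout)."""
    manifest = {name: hashlib.sha256(data).hexdigest() for name, data in files.items()}
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as sip:
        for name, data in files.items():
            sip.writestr(name, data)
        # "manifest.json" is an invented stand-in for the METS envelope.
        sip.writestr("manifest.json", json.dumps(manifest, indent=2))
    return buffer.getvalue()


def verify_sip(sip_bytes):
    """Return the names of payload files whose checksum does not match the manifest."""
    with zipfile.ZipFile(io.BytesIO(sip_bytes)) as sip:
        manifest = json.loads(sip.read("manifest.json"))
        return [
            name
            for name, expected in manifest.items()
            if hashlib.sha256(sip.read(name)).hexdigest() != expected
        ]


# Example payload: one document plus a fragment of descriptive metadata.
sip = build_sip({"report.pdf": b"%PDF-1.4 example", "ead.xml": b"<ead/>"})
failures = verify_sip(sip)
```

An empty `failures` list means every file arrived with the checksum the producer recorded for it.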

Ingest. The system logs all SIPs which are in progress

Files in non-approved preservation formats (eg JPGs) are then normalised into formats which are approved (eg TIFFs). At that point they become AIPs. Approved formats are PDF/A for text (and for powerpoint presentations too, to judge from the examples on the demo), TIFF for images, MPEG-2 for video, WAV for audio, and DBML, this last one being an XML schema devised by the RODA team themselves for databases. Files in other formats are normalised by going through a normalisation plugin; “plugins can easily be created to allow ingestion of other formats not in the list.”
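The normalisation step can be pictured as a lookup plus a plugin registry. The sketch below is only my illustration of that idea: the approved formats come from the list above, but the registry, the `register` decorator and the placeholder converter are hypothetical and say nothing about how RODA's own plugins are written.

```python
# Content class -> approved preservation format, as listed in the post.
APPROVED = {
    "text": "PDF/A",
    "image": "TIFF",
    "video": "MPEG-2",
    "audio": "WAV",
    "database": "DBML",
}

# Hypothetical registry: source format -> (content class, converter function).
PLUGINS = {}


def register(source_format, content_class):
    """Decorator that registers a converter for one source format."""
    def wrap(func):
        PLUGINS[source_format.upper()] = (content_class, func)
        return func
    return wrap


@register("JPG", "image")
def jpg_to_tiff(data):
    # Placeholder only: a real plugin would call an image library here.
    return data


def plan_normalisation(file_format):
    """Describe what ingest would do with a file of the given format."""
    fmt = file_format.upper()
    if fmt in APPROVED.values():
        return "keep as " + fmt
    if fmt in PLUGINS:
        content_class, _converter = PLUGINS[fmt]
        return "normalise to " + APPROVED[content_class]
    return "reject: no plugin registered"
```

Supporting a new source format then means registering one more converter, which is roughly what the “plugins can easily be created” remark suggests.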

The AIP: if the archivist approves the SIP, and if it contains a normalised representation, then it becomes an AIP, and the customer can either search for it (simple search or a smart-looking advanced search) or browse the classification tree. The customer can view descriptive metadata, preservation metadata, previews of the data (depending on what sort of data it is) and the data itself.

Preservation metadata can be viewed as a timeline

An AIP. This is for a series of images; text documents, sound files etc all look different

Previews of specific images in the AIP

The photo and book-style previews are beautiful. I never knew Portugal looked like this 🙂

Security: currently the demo is open, but when it’s finally in action all users will be authenticated prior to accessing the repository, and all user actions will be logged. No anonymous users will be allowed. All preservation actions, such as format conversions, are likewise recorded. Permissions can be fine-tuned so that they apply from repository level all the way down to individual data objects. If a user does not have permission to view a specific item then it will not show in their search results.

And at the end of it all, the system can create stats!

The Administrator account can see stats

One thing which immediately strikes is the clean finish to its user interface, the RODA WUI layer (RODA Web User Interface). Very, very cool.

The Portuguese team has clearly put in a great deal of time and skill here. The project team comprises the Portuguese National Archives, who carried out archiving consulting and development, the University of Minho, which did the software engineering consulting, Assymetric Studios with design, the IDW with hardware, and Keep Solutions with maintenance and support.

My thanks to Miguel Ferreira of the University of Minho for answering my questions about RODA.

Geoffrey Brown of the Indiana University Department of Computer Science has a nice presentation available online which talks about the CIC Floppy Disk Project, and which along the way argues the case for emulation. The CIC FDP is intended to make publications deposited with federal US libraries available via FTP over the Web. In many cases this means saving not just the contents of the floppy, but also the applications needed to make the contents readable. One of his diagrams makes the point that emulation results in two separate repositories, one for documents and the other for software.

The project doesn’t appear to be strict emulation, in that some leeway is allowed. For instance, slide no. 16 bullets the software necessary for the project, one of which is Windows 98, even though “most disks were for msdos, Win 3.1”. I take that to mean that while most floppies were created on Win 3.1, they work just as well in Win 98, so let’s use Win 98 instead. Strict emulation theory probably isn’t too happy with that.

Slide 21 is the most interesting as it contains a handy summary of the problems of migration:

  • Loss of information (e.g. word edits)
  • Loss of fidelity (e.g. “WordPerfect to Word isn’t very good”). WordPerfect is one of the apps listed earlier as necessary for their emulation.
  • Loss of authenticity: users of a migrated document need access to the original to verify authenticity [AA: but this depends on how you define authenticity, surely?]
  • Not always technically possible (e.g. closed proprietary formats)
  • Not always practically feasible (e.g. costs may be too high)
  • Emulation may be necessary anyway to enable migration.

Planning and Implementing Electronic Records Management: a practical guide (Hardcover) by Kelvin Smith (Author), Publisher: Facet Publishing (Oct 2007), ISBN-10: 185604615X. Available from Amazon. Chapter 8 concerns Preservation, especially ‘long-term’, which is defined (p.130) as being ‘greater than one generation of technology.’ Unlike other books I have read so far, Smith’s approach is largely standards-based.

Smith begins by making the interesting point that there is still “a certain amount of distrust” of electronic records (p.129), and that people still seem to be happier with paper for preservation. This is no longer acceptable.

Smith then looks at four core challenges (authenticity, reliability, integrity and useability) in the light of ISO 15489. Authenticity is not an either/or thing: there is a sliding scale of authenticity, and the higher the number of requirements which have been met, the stronger the presumption of authenticity. Likewise, integrity does not mean that a record is unchanged: it means that only authorized and appropriate changes have been made.

Other standards relevant to digital preservation are

  • ISO 17799 Information security management (a revision of BS 7799)
  • BIP 0008 Code of practice for legal admissibility etc of electronic information
  • e-GIF the UK e-Government Interoperability Framework
  • OAIS Open Archival Information System
  • BS 4783 Recommended environmental storage for digital media
  • BS 25999 Business continuity best practice

File formats

Smith says there is a case for creating the records properly in a sustainable format to begin with. [See I have a cup of coffee. AA] It’s more cost-effective for an organisation to take preservation factors into account at the beginning of the life cycle than halfway along. TNA have guidance on selecting good file formats, and e-GIF is useful here too.

But if you decide to create records in a short term or proprietary format then you need to mull over migration vs. emulation. Smith summarises the usual pros and cons. The only interesting additional points he makes are that (a) migration should always support business needs as well as preserve record content, ie. you don’t want to migrate to a format you cannot directly search or copy from, and (b) any migration strategy should integrate with existing corporate policies and procedures (especially BIP 0008). His RM policies mindset is coming through clearly here.

Databases

Smith’s book is the only one I have read so far to include a section on database preservation, and it’s short (less than a page). Preservation depends really on what sort of database it is: in some DBs old data is overwritten by new data, while in others data is never removed or overwritten. Similarly, some DBs are time or project-limited (such as surveys) while others carry on indefinitely. The usual approach is a simple all-or-nothing snapshot of the data which is then converted to some standard form rather than its native one. In addition some systems preserve an audit trail alongside, capturing every alteration made to records.

Implementing the preservation strategy

Smith then finishes the chapter with an excellent three page summary of the key steps you need to undertake, practically, to implement a strategy. A 6-point summary of his 11-point summary:

  • work with records creators and archivists to appraise and select records for permanent preservation
  • identify the right people within your own organisation to carry out preservation
  • decide on a technical preservation approach, and work with ICT people to see that it is carried out and properly tested
  • verify that the approach has worked ok. And keep a temporary backup of everything until you know it has worked
  • keep metadata and documentation on everything
  • keep all the stakeholders in the loop.

He also recommends getting authority to destroy the original e-records when the preservation has been carried out successfully, ie that the records are usable, authentic and reliable.

Noted from OAIS.

Section 5 of the OAIS model explicitly addresses practical approaches to preserve digital information. The model immediately ties its colours to the migration mast. “No matter how well an OAIS manages its current holdings, it will eventually need to migrate much of its holdings to different media and/or to a different hardware or software environment to keep them accessible” (5.1). Emulation is mentioned later on, but always with some sort of proviso or concern attached.

Migration 

OAIS identifies three main motivators behind migration (5.1.1). These are:

  • keeping the repository cost-effective by taking advantage of new technologies and storage capabilities
  • staying relevant to changing consumer expectations
  • simple media decay.

OAIS then models four primary digital migration types (5.1.3). In order of increasing risk of information loss, they are:

  • refreshment of the bitstream from old media to newer media of the exact same type, in such a way that no metadata needs to be updated. Example: copying from one CD to a replacement CD.
  • replication of the bitstream from old media to newer media, and for which some metadata does need updating. The only metadata change would be the link between the AIP’s own unique ID and the location on the storage of the AIP itself (the “Archival Storage mapping infrastructure”). Example: moving a file from one directory on the storage to another directory.       
  • repackaging the bitstream in some new form, requiring a change to the Packaging Information. Example: moving files off a CD to new media of different type.
  • transformation of the Content Information or PDI, while attempting to preserve the full information content. This last one is the one we traditionally term “migration,” and is the one which poses the most risk of information loss.
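One way I find it helpful to keep the four types apart is to ask which piece of information has to change. The sketch below encodes that reading of section 5.1.3; the enum, the function name and the three boolean flags are my own simplification, not anything defined by OAIS.

```python
from enum import IntEnum


class Migration(IntEnum):
    """OAIS migration types, ordered by increasing risk of information loss."""
    REFRESHMENT = 1
    REPLICATION = 2
    REPACKAGING = 3
    TRANSFORMATION = 4


def classify(content_changed, packaging_changed, mapping_changed):
    """Pick the riskiest migration type implied by what had to change.

    content_changed:   Content Information or PDI was altered
    packaging_changed: Packaging Information was altered
    mapping_changed:   only the AIP-ID-to-storage-location mapping was updated
    """
    if content_changed:
        return Migration.TRANSFORMATION
    if packaging_changed:
        return Migration.REPACKAGING
    if mapping_changed:
        return Migration.REPLICATION
    return Migration.REFRESHMENT
```

Because the enum is ordered, a mixed job can be labelled with `max()` over its parts, so a repackaging that also transforms content counts as a transformation.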

In practice there might be mixtures of all these. Transformation is the biggie, and section 5.1.3.4 goes into it in some detail. Within transformation you can get reversible transformation, such as replacing ASCII codes with UNICODE codes, or using a lossless compression algorithm; and non-reversible transformation, where the two representations are not semantically equivalent. Whether NRT has preserved enough information content may be difficult to establish.

Because the Content Information has changed in a transformation, the new AIP qualifies as a new version of the previous AIP. The PDI should be updated to identify the source AIP and its version, and to describe what was done and why (5.1.4). The first version of the AIP is referred to as the original AIP and can be retained for verification of information preservation.
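As a rough sketch of that versioning rule, the record below bumps the version on each transformation and notes where the new version came from and why. The `AIP` class, its field names and the `#v1` identifier style are all hypothetical; OAIS says what the PDI should capture, not how to store it.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class AIP:
    aip_id: str
    version: int
    content_format: str
    source: Optional[str] = None      # which AIP version this one was made from
    provenance: Tuple[str, ...] = ()  # what was done and why, oldest first


def transform(aip, new_format, reason):
    """Return a new AIP version; the input AIP is left untouched."""
    event = f"v{aip.version} ({aip.content_format}) -> {new_format}: {reason}"
    return AIP(
        aip_id=aip.aip_id,
        version=aip.version + 1,
        content_format=new_format,
        source=f"{aip.aip_id}#v{aip.version}",
        provenance=aip.provenance + (event,),
    )


original = AIP(aip_id="aip-0001", version=1, content_format="DOC")
migrated = transform(original, "PDF/A", "DOC at risk of obsolescence")
```

Keeping `original` around after the call mirrors the point that the original AIP can be retained to verify that information was preserved.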

The OAIS Model also looks at the possibility of improving or upgrading the AIP over time. Strictly speaking, this isn’t a transformation, but is instead creating a new Edition of an AIP, with all its own associated metadata. This can be viewed as a replacement for a previous edition, but it may be useful to retain the previous edition anyway.

There’s also a Derived AIP, which could be a handy extraction of information aggregated from multiple AIPs. But this does not replace the earlier AIPs.

Emulation 

All that is fine for pure data. But what if the look and feel needs preserving too?

The easy thing to do in the short to medium term is simply to pay techies to port the original software to the new environment. But OAIS points out that there are hidden problems. It may not be obvious when the app runs that it is functioning incorrectly. Testing all possible output values is unlikely to be cost effective for any particular OAIS. Commercial bridges, which are commercially provided conversion SW packages transforming data to other forms with similar look and feel, suffer from the same problems, and in addition give rise to potential copyright issues.

“If source code or commercial bridges are not available and there is an absolute requirement for the OAIS to preserve the Access look and feel, the OAIS would have to experiment with “emulation” [sic] technology” (5.2.2).

Emulation of apps has even more problems than porting. If the output isn’t visible data but is something like (eg) sound, then it becomes nearly impossible to know whether the current output is exactly the same as the sound made 20 years ago on a different combination of app and environment. We would need to also record the sound in some other (non-digital!) form, to use as validation information.

A different approach would be to emulate the hardware instead. But the OAIS model has an excellent paragraph summarising the problems here, too, which I’ll quote in full (in 5.2.2.2):

 “One advantage of hardware emulation is the claim that once a hardware platform is emulated successfully all operating systems and applications that ran on the original platform can be run without modification on the new platform. However, this does not take into account dependencies on input/output devices. Emulation has been used successfully when a very popular operating system is to be run on a hardware system for which it was not designed, such as running a version of Windows™ on an Apple™ machine. However even in this case, when strong market forces encourage this approach, not all applications will necessarily run correctly or perform adequately under the emulated environment. For example, it may not be possible to fully simulate all of the old hardware dependencies and timings, because of the constraints of the new hardware environment. Further, when the application presents information to a human interface, determining that some new device is still presenting the information correctly is problematical and suggests the need to have made a separate recording of the information presentation to use for validation. Once emulation has been adopted, the resulting system is particularly vulnerable to previously unknown software errors that may seriously jeopardize continued information access. Given these constraints, the technical and economic hurdles to hardware emulation appear substantial.” 

Top stuff.

Heidi is an analyst at the Enterprise Strategy Group and her thoughts on digital archiving in 2008 are available here.

The main points which interest me in Heidi’s article are:

  • too many companies get archiving mixed up with backups. But these are two wholly distinct concepts
  • archiving to tape is too expensive in terms of staff time taken to retrieve an item, while archiving to primary storage also has cost implications in that you are probably making too many unnecessary backups

Her suggested solution is setting up some sort of automated migration. Manual migration (even manual checking of automated migration, presumably?) will be simply unable to cope with the enormous increase of data expected over the next few years.

There is a cool graph in the article showing how much data is expected to exist by 2010 – 27000 petabytes, probably.

Here’s what got approved last year:

  • Records are always accepted for preservation if they (a) meet the terms of the normal collecting policy and (b) are in a format openable on the current IT platform. If the records meet condition (a) but not (b), the accession will be discussed first by the Technical Services manager.
  • Records are preserved only in popular or well-supported file formats, whether proprietary or not (eg .doc, .jpeg). The full list appears below.
  • Accessions are revisited every 12 months to check that the file format is not in danger of becoming obsolete. If the format is in danger, then the records are migrated to a replacement format.

What could be added, perhaps:

  • The end user customer will not be able to consult the original record, only a copy of that record.
  • Records will not necessarily be made available in the same format that they were created in.
  • The original bitstream of the record will be kept alongside any migrated versions, to enable a future emulation to be carried out, if that is deemed necessary.

Approved file formats:

bmp image file
csv comma separated value
doc document file
dot document template
gif image file
htm web page
html web page
jpeg image file
jpg image file
mdb database file
pdf portable document format
ppt presentation file
prn space delimited spreadsheet
psd image file
pub desktop publishing file
rtf rich text format
tab tab delimited spreadsheet
tif image file
tsv tab separated value
txt text file
wav windows audio file
wma windows media file
xls spreadsheet
xlt spreadsheet template
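A check like this could be run at accession and again at the 12-month review. The sketch below simply compares file extensions against the list above; the function name and sample file names are made up, and an extension is of course only a hint about the real format.

```python
from pathlib import PurePath

# Extensions from the approved list above.
APPROVED_EXTENSIONS = {
    "bmp", "csv", "doc", "dot", "gif", "htm", "html", "jpeg", "jpg", "mdb",
    "pdf", "ppt", "prn", "psd", "pub", "rtf", "tab", "tif", "tsv", "txt",
    "wav", "wma", "xls", "xlt",
}


def needs_review(filenames):
    """Return the files whose extension is not on the approved list."""
    return [
        name
        for name in filenames
        if PurePath(name).suffix.lstrip(".").lower() not in APPROVED_EXTENSIONS
    ]


flagged = needs_review(["minutes.DOC", "plan.dwg", "photo.jpg", "README"])
```

Anything returned in `flagged` would be the sort of accession that goes to the Technical Services manager for discussion.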

Unlike Microsoft’s suite, which creates files in proprietary formats, Open Office’s files are in formats which are open. ODF (Open Document Format) is an ISO standard and a European Union recommendation. OpenOffice Writer can itself convert a file from DOC to ODT. Unfortunately the conversion doesn’t always work. (Give examples?)

“The other disadvantage of Open Document Format is that even for simple documents it is extremely complex. For example, unzipping a one-page document of about 120 words results in a collection of files totalling 300K in size. This makes it relatively difficult to locate the meaningful content and structure and transform it into other formats for viewing or other uses. Instead of leaving documents in this complex format and having a hard job writing converters (XSLT stylesheets) for all possible future uses, it would be better to store documents in a simple, clear, well-structured format that makes converters easier to write.” (Ian Barnes of the Australian National University, Preservation of word processing documents (2006), available here, accessed 29.11.07.)

Issues with the ZIP format too. ZIP is ok now, as ZIP files can be opened by any major platform, and that doesn’t look as if it is going to change. On the other hand, a corruption in the file can result in the loss of the entire file.
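Corruption can at least be detected early. Here is a small sketch using Python's standard `zipfile` module, whose `testzip()` method re-reads every member and checks its CRC. The helper name, the sample file and the deliberately flipped byte are my own illustration.

```python
import io
import zipfile


def first_corrupt_member(zip_bytes):
    """Return the first damaged member name, or None if every CRC checks out."""
    try:
        with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
            return archive.testzip()
    except zipfile.BadZipFile:
        # The archive's own directory is damaged, so nothing can be listed.
        return "<archive unreadable>"


# Build a small archive in memory (stored uncompressed by default).
payload = b"minutes of the meeting"
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("minutes.txt", payload)
good = buffer.getvalue()

# Simulate bit rot by flipping one byte inside the stored file data.
damaged = bytearray(good)
damaged[good.index(payload)] ^= 0xFF

clean_result = first_corrupt_member(good)
damaged_result = first_corrupt_member(bytes(damaged))
```

A check like this tells you that something is wrong, not how to get the content back, so it argues for keeping a second copy rather than for trusting ZIP.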