RODA (Repository of Authentic Digital Objects) is a Portuguese initiative to preserve authentic government digital objects. It is based on Fedora Commons, and supports the preservation of text documents, raster images, relational databases, video and audio. It runs on Java and is accessed through a suitable browser. RODA’s core preservation strategy is migration, but it keeps the original representation too, so it should still be possible to open old files on emulated systems.
It’s in its final stages of preparation now; a demo is available at http://roda.di.uminho.pt/?locale=en#home. I’ve created the screengrabs below myself while exploring the demo.
My notes are very brief. If you go along to the demo you will discover that two of the PDF documents preserved there are papers explaining more about the principles, systems and strategy behind the RODA project.
RODA is OAIS-compliant, so let’s run through this in OAIS order.
The SIP: this comprises the digital original and its metadata, all inside a METS envelope which is then zipped. Preservation metadata is a PREMIS record and descriptive metadata is in a segment of EAD. Technical metadata would also be nice but RODA’s creators say it “is not mandatory as is seldom created by producers.”
Files included in the SIP are accompanied by checksums and are checked for viruses. Neatly, there are a number of ways that producers can create SIPs, one of which is a dedicated app called RODA-in.
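To make the checksum step concrete, here is a minimal sketch in Python of a fixity check on the files in a submission. It is not RODA’s code: the choice of SHA-256, the in-memory `files` dictionary and both function names are my own assumptions for illustration.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the hex SHA-256 digest of a byte string."""
    return hashlib.sha256(data).hexdigest()


def failed_fixity(files: dict[str, bytes], manifest: dict[str, str]) -> list[str]:
    """Return the names in the manifest that are missing or do not match.

    `files` maps file names to their content (hypothetical stand-in for the
    unzipped SIP); `manifest` maps file names to the checksum the producer
    declared for them.
    """
    failures = []
    for name in sorted(manifest):
        if name not in files or sha256_hex(files[name]) != manifest[name]:
            failures.append(name)
    return failures
```

An empty list means every declared file arrived intact; anything else would be grounds for asking the producer to resubmit.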
Files in non-approved preservation formats (eg JPGs) are then normalised into formats which are approved (eg TIFFs). At that point they become AIPs. Approved formats are PDF/A for text (and for powerpoint presentations too, to judge from the examples on the demo), TIFF for images, MPEG-2 for video, WAV for audio, and DBML, this last one being an XML schema devised by the RODA team themselves for databases. Files in other formats are normalised by going through a normalisation plugin; “plugins can easily be created to allow ingestion of other formats not in the list.”
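The normalisation rule amounts to a lookup from content type to approved format. The sketch below uses the approved formats listed above; the extension table and the function name `normalisation_target` are hypothetical, and a real system would identify formats far more carefully than by file extension.

```python
from pathlib import PurePath
from typing import Optional

# Approved preservation formats, as listed in the post.
PRESERVATION_FORMAT = {
    "text": "PDF/A",
    "presentation": "PDF/A",
    "image": "TIFF",
    "video": "MPEG-2",
    "audio": "WAV",
    "database": "DBML",
}

# Hypothetical mapping from file extension to content class (illustration only).
CONTENT_CLASS = {
    ".doc": "text",
    ".pdf": "text",
    ".ppt": "presentation",
    ".jpg": "image",
    ".tif": "image",
    ".avi": "video",
    ".mp3": "audio",
    ".wav": "audio",
}


def normalisation_target(filename: str) -> Optional[str]:
    """Return the approved format for a file, or None if no rule covers it."""
    content_class = CONTENT_CLASS.get(PurePath(filename).suffix.lower())
    if content_class is None:
        return None  # this is where a new normalisation plugin would be needed
    return PRESERVATION_FORMAT[content_class]
```

A `None` result corresponds to the case the RODA team mention, where a new plugin has to be written before the format can be ingested.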
The AIP: if the archivist approves the SIP, and if it contains a normalised representation, then it becomes an AIP, and the customer can either search for it (simple search or a smart-looking advanced search) or browse the classification tree. The customer can view descriptive metadata, preservation metadata, previews of the data (depending on what sort of data it is) and the data itself.
The photo and book-style previews are beautiful. I never knew Portugal looked like this.
Security: currently the demo is open, but when it’s finally in action all users will be authenticated prior to accessing the repository, and all user actions will be logged. No anonymous users will be allowed. All preservation actions, such as format conversions, are likewise recorded. Permissions can be fine-tuned so that they apply from repository level all the way down to individual data objects. If a user does not have permission to view a specific item then it will not show in their search results.
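The last point, that items a user cannot view never appear in their search results, can be sketched as a filter applied before results are returned. This is only an illustration under my own assumptions: a group-based permission model and the names `Item` and `visible_results` are invented here, and I do not know how RODA actually implements it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    identifier: str
    title: str
    allowed_groups: frozenset  # groups permitted to view this item (assumed model)


def visible_results(results: list, user_groups: set) -> list:
    """Keep only the items that share at least one group with the user."""
    return [
        item for item in results
        if not item.allowed_groups.isdisjoint(user_groups)
    ]
```

Filtering at this stage means a user without permission cannot even infer that a restricted item exists.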
And at the end of it all, the system can create stats!
One thing which strikes you immediately is the clean finish of its user interface, the RODA WUI layer (RODA Web User Interface). Very, very cool.
The Portuguese team has clearly put in a great deal of time and skill here. The project team comprises the Portuguese National Archives, which carried out archiving consulting and development; the University of Minho, which did the software engineering consulting; Assymetric Studios with design; the IDW with hardware; and Keep Solutions with maintenance and support.
My thanks to Miguel Ferreira of the University of Minho for answering my questions about RODA.
Back in January 2004 Brian Lavoie of the OCLC produced a Technology Watch Report for the Digital Preservation Coalition which gives an introduction to and commentary on the OAIS model. It’s a useful paper to read, as it is always interesting to discover what someone else thinks of OAIS and what they see as its main aspects and implications. The report is available on the DPC website.
The points made by Brian which I found most illuminating are:
Intellectual Property Rights. It is not enough to acquire the objects themselves. If the OAIS has to migrate the object to a new format, it must have the explicit right to do so.
Designated Community. The DC lies at the core of the whole digital preservation exercise. “It is the scope of the Designated Community that determines both the contents of the OAIS and the forms in which the contents are preserved, such that they remain available to, and independently understandable by, the Designated Community.” In other words, the DC is not determined by the archive’s holdings, but vice versa.
Brian also points out the difference between Consumers and the DC, a distinction which had passed me by. The contents of an OAIS might be freely available to anyone, but the DC is only that group of individuals possessing sufficient specialised knowledge to use the archived objects without expert assistance. For instance, the DC for a financial organisation might be financial professionals. The Representation Information needs to be geared towards them, even if the same info can be viewed by the general public. If you actually define the DC as the general public then your RI needs to become much larger and more comprehensive.
Archival storage. Neither Producers nor Consumers directly access the AIPs, or have any form of direct contact with the Archival Storage entity.
The meaning of “OAIS-compliant.” Brian says this is actually a very vague term, and might only mean that a digital repository’s architecture and data model can be mapped across to OAIS in some way. Organisations sometimes claim OAIS compliance without clarifying what they actually mean.
Noted from OAIS.
Section 5 of the OAIS model explicitly addresses practical approaches to preserve digital information. The model immediately ties its colours to the migration mast. “No matter how well an OAIS manages its current holdings, it will eventually need to migrate much of its holdings to different media and/or to a different hardware or software environment to keep them accessible” (5.1). Emulation is mentioned later on, but always with some sort of proviso or concern attached.
OAIS identifies three main motivators behind migration (5.1.1). These are:
- keeping the repository cost-effective by taking advantage of new technologies and storage capabilities
- staying relevant to changing consumer expectations
- simple media decay.
OAIS then models four primary digital migration types (5.1.3). In order of increasing risk of information loss, they are:
- refreshment of the bitstream from old media to newer media of the exact same type, in such a way that no metadata needs to be updated. Example: copying from one CD to a replacement CD.
- replication of the bitstream from old media to newer media, and for which some metadata does need updating. The only metadata change would be the link between the AIP’s own unique ID and the location on the storage of the AIP itself (the “Archival Storage mapping infrastructure”). Example: moving a file from one directory on the storage to another directory.
- repackaging the bitstream in some new form, requiring a change to the Packaging Information. Example: moving files off a CD to new media of different type.
- transformation of the Content Information or PDI, while attempting to preserve the full information content. This last one is the one we traditionally term “migration,” and is the one which poses the most risk of information loss.
In practice there might be mixtures of all these. Transformation is the biggie, and section 22.214.171.124 goes into it in some detail. Within transformation you can get reversible transformation, such as replacing ASCII codes with UNICODE codes, or using a lossless compression algorithm; and non-reversible transformation, where the two representations are not semantically equivalent. Whether NRT has preserved enough information content may be difficult to establish.
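One way to keep the four types straight is to ask what has to change in each case. The sketch below encodes that reading of the list above; the enum, the function name and its three flags are my own invention, not anything defined in the OAIS model. Where a migration is a mixture, it reports the highest-risk type involved.

```python
from enum import Enum


class MigrationType(Enum):
    # Ordered by increasing risk of information loss.
    REFRESHMENT = 1
    REPLICATION = 2
    REPACKAGING = 3
    TRANSFORMATION = 4


def classify_migration(
    *,
    content_or_pdi_changed: bool,
    packaging_changed: bool,
    storage_mapping_changed: bool,
) -> MigrationType:
    """Return the highest-risk migration type implied by what has changed."""
    if content_or_pdi_changed:
        return MigrationType.TRANSFORMATION
    if packaging_changed:
        return MigrationType.REPACKAGING
    if storage_mapping_changed:
        return MigrationType.REPLICATION
    # Nothing but the bits moved: same media type, no metadata updates.
    return MigrationType.REFRESHMENT
```

So a CD-to-CD copy comes out as refreshment, while anything that touches the Content Information comes out as transformation regardless of what else happened along the way.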
Because the Content Information has changed in a transformation, the new AIP qualifies as a new version of the previous AIP. The PDI should be updated to identify the source AIP and its version, and to describe what was done and why (5.1.4). The first version of the AIP is referred to as the original AIP and can be retained for verification of information preservation.
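In data terms, a transformation never overwrites anything: it produces a new version that points back to its source and records what was done and why. A minimal sketch follows, with the caveat that the `AIPVersion` class, its fields and the idea of keeping the same identifier across versions are assumptions of mine rather than anything the model prescribes.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AIPVersion:
    aip_id: str
    version: int
    content: bytes
    source_version: Optional[int] = None  # None marks the original AIP
    change_note: str = ""                 # what was done and why


def transform(aip: AIPVersion, new_content: bytes, what_and_why: str) -> AIPVersion:
    """Create the next version of an AIP, leaving the source version untouched."""
    return AIPVersion(
        aip_id=aip.aip_id,
        version=aip.version + 1,
        content=new_content,
        source_version=aip.version,
        change_note=what_and_why,
    )
```

Because the dataclass is frozen, the original AIP stays exactly as it was and can be kept for later verification.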
The OAIS Model also looks at the possibility of improving or upgrading the AIP over time. Strictly speaking, this isn’t a transformation, but is instead creating a new Edition of an AIP, with all its own associated metadata. This can be viewed as a replacement for a previous edition, but it may be useful to retain the previous edition anyway.
There’s also a Derived AIP, which could be a handy extraction of information aggregated from multiple AIPs. But this does not replace the earlier AIPs.
All that is fine for pure data. But what if the look and feel needs preserving too?
The easy thing to do in the short to medium term is simply to pay techies to port the original software to the new environment. But OAIS points out that there are hidden problems. It may not be obvious when the app runs that it is functioning incorrectly. Testing all possible output values is unlikely to be cost-effective for any particular OAIS. Commercial bridges, which are commercially provided conversion software packages transforming data to other forms with similar look and feel, suffer from the same problems, and in addition give rise to potential copyright issues.
“If source code or commercial bridges are not available and there is an absolute requirement for the OAIS to preserve the Access look and feel, the OAIS would have to experiment with “emulation” [sic] technology” (5.2.2).
Emulation of apps has even more problems than porting. If the output isn’t visible data but is something like (eg) sound, then it becomes nearly impossible to know whether the current output is exactly the same as the sound made 20 years ago on a different combination of app and environment. We would need to also record the sound in some other (non-digital!) form, to use as validation information.
A different approach would be to emulate the hardware instead. But the OAIS model has an excellent paragraph summarising the problems here, too, which I’ll quote in full (in 126.96.36.199):
“One advantage of hardware emulation is the claim that once a hardware platform is emulated successfully all operating systems and applications that ran on the original platform can be run without modification on the new platform. However, this does not take into account dependencies on input/output devices. Emulation has been used successfully when a very popular operating system is to be run on a hardware system for which it was not designed, such as running a version of Windows™ on an Apple™ machine. However even in this case, when strong market forces encourage this approach, not all applications will necessarily run correctly or perform adequately under the emulated environment. For example, it may not be possible to fully simulate all of the old hardware dependencies and timings, because of the constraints of the new hardware environment. Further, when the application presents information to a human interface, determining that some new device is still presenting the information correctly is problematical and suggests the need to have made a separate recording of the information presentation to use for validation. Once emulation has been adopted, the resulting system is particularly vulnerable to previously unknown software errors that may seriously jeopardize continued information access. Given these constraints, the technical and economic hurdles to hardware emulation appear substantial.”
Noted from OAIS.
The OAIS reference model groups all the various processes happening within an archive into six basic entities.
The Ingest entity receives the SIP and turns it into an AIP for storage within the OAIS. This is the point at which a record may migrate from one file format to another. The Ingest people do detailed technical negotiating with Producers, create the Descriptive Information, check the record’s authenticity and so on.
Noted from OAIS. It strikes me that the concept of the Designated Community is central to how an OAIS even begins to think about its digital preservation. No one is saving records just for fun. They save records so that someone else will consult them at a later date. How we define ‘someone else,’ together with their interests and concerns, determines what features we need to preserve.
The atom unit here is the Consumer, which is defined in the Model (1.7.2) as “those persons or client systems who interact with OAIS services to find preserved information of interest and to access that information in detail. This can include other OAISs as well as internal OAIS persons or systems.” The Consumer is the entity which receives a DIP.
Noted from OAIS. Representation Information is a crucial concept, as it is only through our understanding of the Representation Information that a Data Object can be opened and viewed. The Representation Information itself can only be interpreted with respect to a suitable Knowledge Base.
The Representation Information concept is also inextricably tied in with the concept of the Designated Community, because how we define the Designated Community (and its associated Knowledge Base) determines how much Representation Information we need. “The OAIS must understand the Knowledge Base of its Designated Community to understand the minimum Representation Information that must be maintained… Over time, evolution of the Designated Community’s Knowledge Base may require updates to the Representation Information to ensure continued understanding” (2.2.1).
Noted from the OAIS model.
The OAIS model generally is not prescriptive, but it contains one section (3.1) where it lays out the responsibilities that an organisation must discharge in order to operate as an OAIS. These are:
1. Negotiate with Information Producers and accept appropriate information from them. This is simply the digital equivalent of what any record office does, though an OAIS in practice needs to gather much more information about a given accession, for PDI purposes.
2. Obtain sufficient control of the information to the level needed to ensure long term preservation. In a paper archive this is largely (a) keeping the stuff in a box and (b) capturing any access, copyright and legal restrictions as necessary. In a digital repository there is (c) the need to capture all the technical metadata for PDI purposes too. There may be additional legal issues as well, concerning authenticity, software copyright etc. “It is important for the OAIS to recognize the separation that can exist between physical ownership or possession of Content Information and ownership of intellectual property rights in this information” (3.2.2). The OAIS in practice may need to obtain authority to migrate Content Information to new representation forms.
3. Determine which groups should become the Designated Community able to understand the information. This is a more important task in a digital archive than a paper one, because how we define the DC determines what sort and level of Representation Information we need to keep alongside the Content data. The DC may change over time. OAIS suggests (3.2.3) that selecting a broader rather than a narrower definition helps long term preservation, as it means that more detailed RI is captured at an early stage, rather than leaving it until later.
4. Ensure that the preserved information is independently understandable to the DC, so that no further expert assistance is needed. [AA: this is an interesting point as paper repositories often work in the opposite way: the DC is so large ("the general public") that a searchroom has to employ professional archivists and well-trained archive assistants to be on hand to explain the documents to the visitor.] The quality of being “independently understandable” will change over time. This means that RI will have to be updated as the years go by, even if the DC itself does not change.
5. Follow documented policies and procedures to ensure that (a) the information can be preserved against all reasonable contingencies, and (b) the information can be disseminated as authenticated copies of the original or as traceable back to the original. Section 3.2.5 suggests that these policies should be available to producers, consumers and any related repositories, and that the DC should be monitored so that the Content Information is still understandable to them. An OAIS should also have a long term technology usage plan.
6. Make the preserved data available to the DC. An OAIS should have published policies on access and restrictions, so that the rights of all parties are protected.
Noted from the OAIS model.
In response to a request from a Consumer, the OAIS provides all or part of an AIP, or many AIPs, in the form of a DIP. The DIP doesn’t have to have complete PDI. DIPs are supplied by the Access entity within an OAIS, and can be supplied either on- or off-line.
“The Consumer uses an OAIS supplied Ordering Aid to develop an order request to acquire the data. The Consumer produces a logical view of the desired AIPs and associated Package Descriptions to be included in the Dissemination Information package and specifies the physical details of the Data Dissemination session such as media type and object format. This process may involve no visible interaction if adequate defaults exist. This order can also specify any transformations the Consumer wishes applied to the AIPs in creating the DIP” (4.3.4).
Noted from the OAIS model.
SIPs get transformed into one of these for preservation. The AIP “is defined to provide a concise way of referring to a set of information that has, in principle, all the qualities needed for permanent, or indefinite, Long Term Preservation of a designated Information Object …. though the implementation of the AIP may vary from archive to archive, the specification of the AIP as a container that contains all the needed information to allow Long Term Preservation and access to archive holdings remains valid” (188.8.131.52 and 3).
The AIP has a complete set of PDI for the associated Content information. The Packaging Information of the AIP will conform to OAIS internal data formatting and documentation standards, and may vary over time as the OAIS changes its practices. Transforming a SIP into an AIP “may involve file format conversions, data representation conversions or reorganisation of the content information” (184.108.40.206).
AIPs are managed within the OAIS by the Archival storage entity (4.1). Functions include managing the storage, refreshing the media, performing routine and special error checking, and providing disaster recovery capabilities: see 220.127.116.11 for details of all these.
Some AIPs may only exist as the output of algorithms operating on other AIPs (3.2.6).
Section 18.104.22.168 of the model refers to two AIP subtypes. The Archival Information Unit is the “atom” which the archive is asked to store. A single AIU contains exactly one Content Information object (which in turn may be multiple files, however) and exactly one set of PDI (22.214.171.124). The example they give is a digital movie. This AIU would contain three objects:
- the digital encoding of the movie in a proprietary format
- the Representation Information needed to understand this format (these two form the Content Info)
- PDI: date of creation, featured actors, movie studio, etc, and a checksum for integrity.
The second subtype is the Archival Information Collection. There might be millions of AIUs, you see, so the answer is to aggregate them into AICs using criteria determined by the archivist (126.96.36.199). A single AIP can belong to multiple AICs. The AIC itself is a complete AIP which contains PDI. The PDI provides further info such as when and why it was created, context to related AICs, desired levels of security etc.
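The relationship between the two subtypes can be pictured as a small data structure. This is only my sketch of the description above: the class and field names are invented, and holding members by identifier rather than by value is an assumption made so that one AIU can sit in several AICs.

```python
from dataclasses import dataclass, field


@dataclass
class ContentInformation:
    data_files: list          # one Content Information object may span several files
    representation_info: str  # what is needed to understand the format


@dataclass
class AIU:
    identifier: str
    content: ContentInformation  # exactly one Content Information object
    pdi: dict                    # exactly one set of PDI


@dataclass
class AIC:
    identifier: str
    pdi: dict  # the collection's own PDI: when and why it was created, etc.
    member_ids: list = field(default_factory=list)

    def add(self, aiu: AIU) -> None:
        """Record an AIU as a member; the same AIU may join other AICs too."""
        if aiu.identifier not in self.member_ids:
            self.member_ids.append(aiu.identifier)
```

For the movie example, one AIU could be added both to a collection grouped by studio and to another grouped by decade.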
Borghoff et al point out that OAIS does not allow for changes in stored AIPs. Instead, the AIP must be extracted from the archive as a DIP, modified, and then resubmitted as a SIP. “We hope that for trivial changes the archiving systems will provide more pragmatic and simpler solutions” (p. 52).
Noted from the OAIS model.
SIPs are sent to the OAIS archive by Producers. Producers are authors, organisations or even programs which deliver documents to the OAIS. Some submissions will have insufficient Representation Information or Preservation Description Information to meet stringent AIP requirements, which is why they cannot necessarily be AIPs.
The form of the SIP will typically be negotiated between the Producer and the OAIS (2.2.3). Most SIPs will have some Content Information and some PDI, but it may require several submissions to form an AIP. If there are multiple SIPs which use the same Representation Information it is likely that this RI will only be provided once to the OAIS (188.8.131.52).
Ideally there should be a submission agreement between the Producer and the OAIS, specifying criteria like file formats, subject matter, ingest schedule, access restrictions, verification protocols, etc (2.3.2). “Considerable iteration may be required to agree on the right information to be submitted, and to get it into forms acceptable to the OAIS” (3.2.1). You also need to negotiate legal aspects, such as authority to migrate the Content Information to new representation forms (3.2.2). Data submission formats, procedures and deliverables must be documented in the OAIS’s data submission policies (184.108.40.206).
The Ingest entity (220.127.116.11) in an OAIS accepts SIPs, performs QA on them, and generates an AIP. QA might involve checksums or cyclic redundancy checks. If there are errors in the SIP submission then Ingest will request a resubmission. Ingest then transforms the SIPs into AIPs, which might include file format conversion, reorganisation, transfer to different media etc. “An OAIS is not always required to retain the information submitted to it in precisely the same format as in the SIP” (4.3.2). At the very least it will add a unique identifier.
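The Ingest steps just described (QA, resubmission on error, then generating an AIP with a unique identifier) can be sketched in a few lines. The function name, the exception and the dictionary shapes are hypothetical, and CRC-32 is used only because cyclic redundancy checks are one of the QA options mentioned; this is not how any particular repository does it.

```python
import uuid
import zlib


class ResubmissionRequired(Exception):
    """Raised when a SIP fails QA and the Producer must send it again."""


def ingest(sip_files: dict, declared_crc32: dict) -> dict:
    """Check each file against its declared CRC-32, then build a minimal AIP.

    `sip_files` maps names to bytes; `declared_crc32` maps names to the
    CRC-32 values supplied by the Producer (both shapes are assumptions).
    """
    names = set(sip_files) | set(declared_crc32)
    bad = sorted(
        name for name in names
        if name not in sip_files
        or name not in declared_crc32
        or zlib.crc32(sip_files[name]) != declared_crc32[name]
    )
    if bad:
        raise ResubmissionRequired(", ".join(bad))
    # At the very least, the AIP gains a unique identifier.
    return {"aip_id": str(uuid.uuid4()), "files": dict(sip_files)}
```

A real Ingest function would also convert formats and build PDI at this point; the sketch stops at the minimum the paragraph describes.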
Section 4.3.2 has some examples of SIP to AIP data transformations, such as one-to-one, one-to-many or many-to-one.