RODA

RODA (Repository of Authentic Digital Objects) is a Portuguese initiative to preserve authentic government digital objects. It is based on Fedora Commons, and supports the preservation of text documents, raster images, relational databases, video and audio. It runs in Java on a suitable browser. RODA’s core preservation strategy is migration, but it keeps the original representation too, so it should be OK to open old files on emulated systems.

It’s in its final stages of preparation now; a demo is available at http://roda.di.uminho.pt/?locale=en#home. I’ve created the screengrabs below myself while exploring the demo.

My notes are very brief. If you go along to the demo you will discover that two of the PDF documents preserved there are papers explaining more about the principles, systems and strategy behind the RODA project.

RODA is OAIS-compliant, so let’s run through this in OAIS order.

The SIP: this comprises the digital original and its metadata, all inside a METS envelope which is then zipped. Preservation metadata is a PREMIS record and descriptive metadata is in a segment of EAD. Technical metadata would also be nice but RODA’s creators say it “is not mandatory as is seldom created by producers.”

Files included in the SIP are accompanied by checksums and are checked for viruses. Neatly, there are a number of ways that producers can create SIPs, one of which is a dedicated app called RODA-in.
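To make the checksum idea concrete, here is a minimal sketch of a zipped package whose files can be verified on arrival. It is not RODA's actual SIP layout: a real SIP wraps everything in a METS envelope, whereas this uses a plain-text manifest as a stand-in, and the names build_sip, verify_sip and manifest-sha256.txt are my own inventions.

```python
import hashlib
import io
import zipfile

MANIFEST_NAME = "manifest-sha256.txt"  # hypothetical name, not part of RODA


def sha256_hex(data):
    """Return the SHA-256 digest of some bytes as a hex string."""
    return hashlib.sha256(data).hexdigest()


def build_sip(files):
    """Zip the given {name: bytes} mapping together with a checksum manifest."""
    manifest_lines = [
        f"{sha256_hex(data)}  {name}" for name, data in sorted(files.items())
    ]
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as zf:
        for name, data in files.items():
            zf.writestr(name, data)
        zf.writestr(MANIFEST_NAME, "\n".join(manifest_lines) + "\n")
    return buffer.getvalue()


def verify_sip(sip_bytes):
    """Return the names of files whose checksum no longer matches the manifest."""
    mismatched = []
    with zipfile.ZipFile(io.BytesIO(sip_bytes)) as zf:
        manifest = zf.read(MANIFEST_NAME).decode("utf-8")
        for line in manifest.splitlines():
            digest, name = line.split("  ", 1)
            if sha256_hex(zf.read(name)) != digest:
                mismatched.append(name)
    return mismatched
```

An empty list from verify_sip means every file still matches the checksum recorded by the producer.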

Ingest. The system logs all SIPs which are in progress

Files in non-approved preservation formats (eg JPGs) are then normalised into formats which are approved (eg TIFFs). At that point they become AIPs. Approved formats are PDF/A for text (and for PowerPoint presentations too, to judge from the examples on the demo), TIFF for images, MPEG-2 for video, WAV for audio, and DBML, this last one being an XML schema devised by the RODA team themselves for databases. Files in other formats are normalised by going through a normalisation plugin; “plugins can easily be created to allow ingestion of other formats not in the list.”
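The normalisation decision can be pictured as a simple lookup. This is only a sketch of the idea, not RODA's plugin mechanism: the format labels, the table contents beyond the JPG and PowerPoint cases mentioned above, and the function name plan_normalisation are all assumptions of mine.

```python
# Approved preservation formats as listed in the post (labels are my own).
PRESERVATION_FORMATS = {"pdfa", "tiff", "mpeg2", "wav", "dbml"}

# Hypothetical plugin table: source format -> approved target format.
# jpg -> tiff and ppt -> pdfa come from the post; doc -> pdfa is an assumption.
NORMALISATION_TARGETS = {
    "jpg": "tiff",
    "ppt": "pdfa",
    "doc": "pdfa",
}


def plan_normalisation(fmt):
    """Return the target format, or None if the file is already in an approved one."""
    fmt = fmt.lower()
    if fmt in PRESERVATION_FORMATS:
        return None
    if fmt in NORMALISATION_TARGETS:
        return NORMALISATION_TARGETS[fmt]
    # No plugin registered: someone would have to write one first.
    raise LookupError(f"no normalisation plugin for format: {fmt}")
```

Adding support for a new format then means adding one entry to the table, which is roughly what the quoted remark about plugins suggests.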

The AIP: if the archivist approves the SIP, and if it contains a normalised representation, then it becomes an AIP, and the customer can either search for it (simple search or a smart-looking advanced search) or browse the classification tree. The customer can view descriptive metadata, preservation metadata, previews of the data (depending on what sort of data it is) and the data itself.

Preservation metadata can be viewed as a timeline

An AIP. This is for a series of images; text documents, sound files etc all look different

Previews of specific images in the AIP

The photo and book-style previews are beautiful. I never knew Portugal looked like this 🙂

Security: currently the demo is open, but when it’s finally in action all users will be authenticated prior to accessing the repository, and all user actions will be logged. No anonymous users will be allowed. All preservation actions, such as format conversions, are likewise recorded. Permissions can be fine-tuned so that they apply from repository level all the way down to individual data objects. If a user does not have permission to view a specific item then it will not show in their search results.

And at the end of it all, the system can create stats!

The Administrator account can see stats

One thing which immediately strikes you is the clean finish of its user interface, the RODA WUI layer (RODA Web User Interface). Very, very cool.

The Portuguese team has clearly put in a great deal of time and skill here. The project team comprises the Portuguese National Archives, who carried out archiving consulting and development, the University of Minho, which did the software engineering consulting, Assymetric Studios with design, the IDW with hardware, and Keep Solutions with maintenance and support.

My thanks to Miguel Ferreira of the University of Minho for answering my questions about RODA.

The Koninklijke Bibliotheek in the Netherlands has produced a report Evaluating File Formats for Long-Term Preservation, available here, which introduces an evaluative scheme for assessing the fitness of a file format for preservation, and which then applies this scheme to two example formats, specifically MS Word 97-2003 doc format and PDF/A. Of course, identifying the winner of these two particular formats is easy (it might have been more interesting to see a closer contest such as ODF vs PDF/A) but it’s still an interesting exercise. The report was written by Judith Rog and Caroline van Wijk.

The scheme

Each file format is awarded a score on a particular criterion, such as “adoption: world wide usage” or “robustness: support for file corruption detection” and so on. The scores are weighted and then added together to give a total score. This total score then provides a quantifiable evaluation of how useful the format is as a way to preserve digital information for the long term.
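The arithmetic is just a weighted sum. Here is a minimal sketch; the criterion names echo the examples above, but the scores and weights in the usage note are numbers I made up for illustration, not figures from the report.

```python
def weighted_score(scores, weights):
    """Multiply each criterion's score by its weight and add up the results."""
    if scores.keys() != weights.keys():
        raise ValueError("scores and weights must cover the same criteria")
    return sum(scores[criterion] * weights[criterion] for criterion in scores)
```

For instance, with hypothetical weights of 3 for adoption and 2 for robustness, a format scoring 2 and 1 on those criteria totals 2 × 3 + 1 × 2 = 8. Comparing two formats means running both sets of scores through the same weights.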

e-GIF is the UK e-Government Interoperability Framework. One of its guiding principles is that UK public sector institutions should not become dependent on non-interoperable software products, because that could lead to monopoly or market failure.

In practice e-GIF explicitly states that (a) UK bodies should adopt XML as the primary standard for data integration and presentation, and (b) they should adopt the eGovernment Metadata Standard (eGMS) for metadata. Adherence to e-GIF is mandatory for UK public institutions. The e-GIF Accreditation Authority checks compliance.

Key e-GIF documents are available here.

Approved file formats

The useful info for digital preservation is contained within Section 7 of the Technical Standards Catalogue, version 6.2 (September 2005), available here. This section contains a list of approved file formats for various purposes. The approved e-GIF formats are:

  • text and word processing: rtf, txt, htm, doc, pdf, nsf, mht
  • spreadsheets: csv, xls
  • presentation: ppt, pps
  • images: jpg, gif, png, tif, ecw
  • vector: svg, vml
  • moving: mpg
  • audio: mp3, wav, avi, mov, qt, asf, wma, wmv, swf, ra, ram, rmm, Ogg Vorbis and a few others
  • compression: zip, gz, tgz, tar
  • character sets: UNICODE and ISO 10646.

 

Planning and Implementing Electronic Records Management: a practical guide (Hardcover) by Kelvin Smith (Author), Publisher: Facet Publishing (Oct 2007), ISBN-10: 185604615X. Available from Amazon. Chapter 8 concerns Preservation, especially ‘long-term’, which is defined (p.130) as being ‘greater than one generation of technology.’ Unlike other books I have read so far, Smith’s approach is largely standards-based.

Smith begins by making the interesting point that there is still “a certain amount of distrust” of electronic records (p.129), and that people still seem to be happier with paper for preservation. This is no longer acceptable.

Smith then looks at four core challenges (authenticity, reliability, integrity and usability) in the light of ISO 15489. Authenticity is not an either/or thing: there is a sliding scale of authenticity, and the higher the number of requirements which have been met, the stronger the presumption of authenticity. Likewise, integrity does not mean that a record is unchanged: it means that only authorized and appropriate changes have been made.

Other standards relevant to digital preservation are

  • ISO 17799 Information security management (a revision of BS 7799)
  • BIP 0008 Code of practice for legal admissibility etc of electronic information
  • e-GIF the UK e-Government Interoperability Framework
  • OAIS Open Archival Information System
  • BS 4783 Recommended environmental storage for digital media
  • BS 25999 Business continuity best practice

File formats

Smith says there is a case for creating the records properly in a sustainable format to begin with. [See I have a cup of coffee. AA] It’s more cost-effective for an organisation to take preservation factors into account at the beginning of the life cycle than halfway along. TNA have guidance on selecting good file formats, and e-GIF is useful here too.

But if you decide to create records in a short term or proprietary format then you need to mull over migration vs. emulation. Smith summarises the usual pros and cons. The only interesting additional points he makes are that (a) migration should always support business needs as well as preserve record content, ie. you don’t want to migrate to a format you cannot directly search or copy from, and (b) any migration strategy should integrate with existing corporate policies and procedures (especially BIP 0008). His RM policies mindset is coming through clearly here.

Databases

Smith’s book is the only one I have read so far to include a section on database preservation, and it’s short (less than a page). Preservation depends really on what sort of database it is: in some DBs old data is overwritten by new data, while in others data is never removed or overwritten. Similarly, some DBs are time or project-limited (such as surveys) while others carry on indefinitely. The usual approach is a simple all-or-nothing snapshot of the data which is then converted to some standard form rather than its native one. In addition some systems preserve an audit trail alongside, capturing every alteration made to records.

Implementing the preservation strategy

Smith then finishes the chapter with an excellent three page summary of the key steps you need to undertake, practically, to implement a strategy. A 6-point summary of his 11-point summary:

  • work with records creators and archivists to appraise and select records for permanent preservation
  • identify the right people within your own organisation to carry out preservation
  • decide on a technical preservation approach, and work with ICT people to see that it is carried out and properly tested
  • verify that the approach has worked ok. And keep a temporary backup of everything until you know it has worked
  • keep metadata and documentation on everything
  • keep all the stakeholders in the loop.

He also recommends getting authority to destroy the original e-records once the preservation has been carried out successfully, ie once you know that the preserved records are usable, authentic and reliable.

Here’s what got approved last year:

  • Records are always accepted for preservation if they (a) meet the terms of the normal collecting policy and (b) are in a format openable on the current IT platform. If the records meet condition (a) but not (b), the accession will be discussed first by the Technical Services manager.
  • Records are preserved only in popular or well-supported file formats, whether proprietary or not (eg .doc, .jpeg). The full list appears below.
  • Accessions are revisited every 12 months to check that the file format is not in danger of becoming obsolete. If the format is in danger, then the records are migrated to a replacement format.

What could be added, perhaps:

  • The end user customer will not be able to consult the original record, only a copy of that record.
  • Records will not necessarily be made available in the same format that they were created in.
  • The original bitstream of the record will be kept alongside any migrated versions, to enable a future emulation to be carried out, if that is deemed necessary.

Approved file formats:

bmp image file
csv comma separated value
doc document file
dot document template
gif image file
htm web page
html web page
jpeg image file
jpg image file
mdb database file
pdf portable document format
ppt presentation file
prn space delimited spreadsheet
psd image file
pub desktop publishing file
rtf rich text format
tab tab delimited spreadsheet
tif image file
tsv tab separated value
txt text file
wav windows audio file
wma windows media file
xls spreadsheet
xlt spreadsheet template
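Checking an accession against this list is easy to automate. A minimal sketch follows: the extensions are copied from the list above, while the function name and the decision to flag files with no extension at all are my own assumptions.

```python
from pathlib import PurePath

# Extensions copied from the approved list above.
APPROVED_EXTENSIONS = {
    "bmp", "csv", "doc", "dot", "gif", "htm", "html", "jpeg", "jpg", "mdb",
    "pdf", "ppt", "prn", "psd", "pub", "rtf", "tab", "tif", "tsv", "txt",
    "wav", "wma", "xls", "xlt",
}


def unapproved_files(filenames):
    """Return the filenames whose extension is not on the approved list."""
    flagged = []
    for name in filenames:
        # Compare case-insensitively and without the leading dot.
        extension = PurePath(name).suffix.lower().lstrip(".")
        if extension not in APPROVED_EXTENSIONS:
            flagged.append(name)
    return flagged
```

An extension is only a hint about the real format, of course, so anything flagged (or anything suspicious that passes) would still need a human look.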

Unlike Microsoft’s suite, which creates files in proprietary formats, OpenOffice’s files are in formats which are open. ODF (Open Document Format) is an ISO standard and a European Union recommendation. OpenOffice Writer can itself convert a file from DOC to ODT. Unfortunately the conversion doesn’t always work. (Give examples?)

“The other disadvantage of Open Document Format is that even for simple documents it is extremely complex. For example, unzipping a one-page document of about 120 words results in a collection of files totalling 300K in size. This makes it relatively difficult to locate the meaningful content and structure and transform it into other formats for viewing or other uses. Instead of leaving documents in this complex format and having a hard job writing converters (XSLT stylesheets) for all possible future uses, it would be better to store documents in a simple, clear, well-structured format that makes converters easier to write.” (Ian Barnes of the Australian National University, Preservation of word processing documents (2006), available here, accessed 29.11.07.)

Issues with the ZIP format too. ZIP is ok now, as ZIP files can be opened by any major platform, and that doesn’t look as if it is going to change. On the other hand, a corruption in the file can result in the loss of the entire file.
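One partial defence is to test archives routinely. Here is a minimal sketch using Python's standard zipfile module, whose testzip() method checks each member's CRC; the wrapper function zip_problem and its return strings are my own.

```python
import io
import zipfile


def zip_problem(zip_bytes):
    """Return a description of the first problem found, or None if the archive reads cleanly."""
    try:
        with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
            # testzip() reads every member and returns the first one with a bad CRC.
            bad_member = zf.testzip()
    except zipfile.BadZipFile as exc:
        # The archive itself cannot be opened, so every member is out of reach.
        return f"archive unreadable: {exc}"
    if bad_member is not None:
        return f"corrupt member: {bad_member}"
    return None
```

The "archive unreadable" branch is the worrying case described above: damage in the wrong place and nothing inside can be opened by ordinary means.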

RTF is Rich Text Format. Microsoft have published the format, so in that sense it’s open, but it has some ‘quirks.’ Better check all this.

The National Library of Australia has saved much of its stuff in RTF.

You might think that word processed documents would be simple to preserve, but in fact they are not. Even the simplest documents contain tabs, bullets, indents, images, URL links, font changes, quotes, section headings, endnotes and embedded active content (such as spreadsheet cells). You could save the whole thing as a simple plain text file, but you would lose all of that.

Yet at their core, word processed documents are too simple. They are flat, by which I mean non-hierarchical. Sections follow each other in sequence, heading and text and heading and text, but ultimately the word processing application only sees these sections in terms of their appearance. Word itself doesn’t understand that your fourth paragraph is a sub-division of the first paragraph. It is not a database. But 75 years from now, it is the structure of the document that people will be interested in, not the appearance. A word processing application takes your content and makes it look pretty.
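To show what recovering structure from a flat sequence might involve, here is a minimal sketch that turns a flat list of headings into a nested outline. It assumes each heading already carries a numeric level (1 for top level, 2 for a sub-section, and so on), which is exactly the information a purely appearance-based document may not record; the function name and data layout are mine.

```python
def build_outline(headings):
    """Turn a flat list of (level, title) pairs into a nested list of nodes.

    Levels are assumed to be integers of 1 or more.
    """
    root = {"title": None, "level": 0, "children": []}
    stack = [root]  # path from the root down to the most recent heading
    for level, title in headings:
        node = {"title": title, "level": level, "children": []}
        # Climb back up until we reach a heading shallower than this one.
        while stack[-1]["level"] >= level:
            stack.pop()
        stack[-1]["children"].append(node)
        stack.append(node)
    return root["children"]
```

Given Intro (1), Background (2), Scope (2), Method (1), the result has two top-level nodes, with Background and Scope nested under Intro.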