This site contains some detailed technical analyses of file formats suitable for long-term preservation, as viewed through the lens of the LOC’s collecting policy. The site was compiled by Caroline R. Arms and Carl Fleischhauer, and is clearly the result of a great deal of thought and research. Site accessed 15 May 2008.
The purpose of the LOC’s Digital Formats website is to support the long-term preservation of digital objects by (a) identifying formats promising for sustainability, (b) identifying other formats which are not promising and which therefore need alternative strategies for content preservation, and thereby (c) recommending which formats to use when building up a collection. The site concentrates on technical aspects of file formats. It is concerned with the formats associated with media-independent (intangible) digital content, in other words content that is managed as files and which is independent of a particular physical medium; this rules out formats associated with media-dependent (tangible) digital content, such as DVDs, audio CDs and videotape formats.
The site defines formats as “packages of information that can be stored as data files or sent via network as data streams (aka bitstreams, byte streams).” This is actually a fairly broad and vague definition which encompasses many different ways of identifying specific formats, such as file formats (MIME type, extension etc), bitstream encodings (the code or wavestream which underlies many formats, such as H.264 video underlying QuickTime and MPEG-4), wrappers and bundling formats (TIFF, ZIP, METS and so on), and even overall classes of related formats, such as the RIFF family.
This is why discussions about formats rapidly become technical. The site gives the example of PDF. “PDF format” is not just a format in its own right but also acts as a wrapper for other formats (a PDF file can contain embedded JPEG images, for instance) and has gone through numerous versions over time.
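To illustrate why "PDF format" on its own is not a precise identification, here is a minimal sketch that reads the version from the header comment at the start of a PDF file (for example "%PDF-1.4"). The function name pdf_header_version is my own invention, and the sketch assumes the header appears within the first 1024 bytes. It only reports the declared version of the wrapper; it says nothing about what is embedded inside.

```python
import re

# A PDF file begins with a header comment such as "%PDF-1.4".
_PDF_HEADER = re.compile(rb"%PDF-(\d+)\.(\d+)")


def pdf_header_version(data: bytes):
    """Return (major, minor) declared in the PDF header, or None if not found.

    Hypothetical helper for illustration only: it looks at the first
    1024 bytes and does not validate the rest of the file.
    """
    match = _PDF_HEADER.search(data[:1024])
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


if __name__ == "__main__":
    sample = b"%PDF-1.4\n% rest of the file would follow here\n"
    print(pdf_header_version(sample))
    print(pdf_header_version(b"GIF89a"))
```

Two files that both end in .pdf can therefore report different versions, which is one reason the extension or MIME type alone is too coarse for preservation decisions.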
The LOC site identifies seven factors which need consideration when assessing the sustainability of a given format. These factors apply across all digital formats and all information types:
1. Disclosure of the format’s documentation. LOC consider this more important than simple approval by a standards organisation.
2. Adoption, i.e. the degree to which the format is already used by the primary creators, disseminators, or users of information resources. If a format is widely adopted, it is less likely to become obsolete rapidly, and tools for migration and emulation are more likely to emerge from industry without specific investment by repositories.
3. Transparency, by which they mean the degree to which the digital object is open to direct analysis, such as human readability using a text-only editor. Transparency certainly helps any digital archaeology which may need doing in the future. Transparency is enhanced if textual content (including metadata embedded in files for non-text content) is encoded in standard character encodings (such as UTF-8 Unicode) and stored in natural reading order. When preserving applications, source code is more transparent than compiled code. For non-textual information, standard or basic representations are more transparent than those optimized for more efficient processing, storage or bandwidth.
Transparency is inhibited by compression, although it is recognised that some compression is inevitable, especially for formats created in compressed form (movie files, say). The LOC recommend using only content compressed using publicly disclosed and widely adopted algorithms that are either lossless or have an acceptable degree of lossy compression.
Transparency is destroyed by encryption.
4. Self-documentation of its own metadata. “Digital objects that are self-documenting are likely to be easier to sustain over the long term and less vulnerable to catastrophe than data objects that are stored separately from all the metadata needed to render the data as usable information or understand its context… The ability of a digital format to hold (in a transparent form) metadata beyond that needed for basic rendering of the content in today’s technical environment is an advantage for purposes of preservation.” For operational efficiency and to help researchers, some of this metadata is likely to be copied across to a separate metadata store.
Metadata really is a problem. The LOC notes that “many of the metadata elements that will be required to sustain digital objects in the face of technological change are not typically recorded in library catalogs or records intended to support discovery.” From my own experience I know that many humans would rather saw their own legs off than manually type metadata for the records they create. The ability of a particular file format to capture technical metadata automatically is therefore a huge advantage. The LOC actively encourages the use of formats which hold metadata.
5. External dependencies, including hardware, software, device and OS dependencies.
6. Impact of patents or other legal issues. The LOC has an excellent paragraph stating that “although the costs for licenses to decode current formats are often low or nil, the existence of patents may slow the development of open source encoders and decoders and prices for commercial software for transcoding content in obsolescent formats may incorporate high license fees. When license terms include royalties based on use (e.g., a royalty fee when a file is encoded or each time it is used), costs could be high and unpredictable. It is not the existence of patents that is a potential problem, but the terms that patent-holders might choose to apply.”
7. Technical protection mechanisms which hinder preservation. Encryption is the obvious one. LOC mention also any digital format which is inextricably bound to its physical carrier.
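The transparency factor (number 3 above) can be approximated with a very rough test: does the content decode as UTF-8, and is it mostly free of control characters? The sketch below is only that, a sketch. The function name and the 1% threshold are my own assumptions, not anything the LOC prescribes, and a file that passes may still be hard to interpret.

```python
def looks_like_utf8_text(data: bytes, max_control_ratio: float = 0.01) -> bool:
    """Rough transparency check (hypothetical helper, not an LOC method).

    Returns True when the bytes decode as UTF-8 and at most
    max_control_ratio of the characters are control characters
    other than newline, carriage return and tab.
    """
    if not data:
        return False
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError:
        # Compressed, encrypted or binary content usually fails here.
        return False
    allowed = {"\n", "\r", "\t"}
    control = sum(1 for ch in text if ord(ch) < 32 and ch not in allowed)
    return control / len(text) <= max_control_ratio


if __name__ == "__main__":
    print(looks_like_utf8_text("caf\u00e9 menu\n".encode("utf-8")))
    print(looks_like_utf8_text(b"\xff\xd8\xff\xe0\x00\x10JFIF"))
```

The second call uses the opening bytes of a JPEG file, which are not valid UTF-8, so a check like this would flag it as not directly readable in a text editor.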
In addition to the sustainability factors above, there are also quality and functionality factors which vary depending on the genre of content. When assessing text formats (say) you need to consider support for the integrity of layout, font, structure, navigation, diagrams etc, but when assessing still image formats you have to consider support for colour management, resolution, graphic effects and so on.
Balancing the factors
How do we weight these factors? After all, a specific format may score well on adoption, but poorly on self-documentation. In the end this is a decision for each repository: the Library of Congress is not going to tell us what to do. However, the LOC has made its mind up as far as its own collections go. The results for the main content categories are:
Text with structural markup: OEBPS_1_2 or DTB
Text with page layout: PDF/A, other PDF subtypes but only those created from machine readable text
Text in word processor form: ODF, OOXML, and/or PDF/A (not .doc or anything else like that)
Bitmapped colour or greyscale still images, in order of preference, best first: uncompressed TIFF, TIFF/EP, JPEG 2000, PDF/A, PDF/X, JPEG, PNG, GIF, BMP. (Not Photoshop PSD, RAW, EPS etc.)
Vector-based still images: SVG_1_1 or SVG_1_2
For sound files, the LOC suggests the principle that fidelity characteristics (bitstream encoding) should be used as the primary consideration, and that the choice of file formats is actually secondary (though they give a huge long list of them here). In general:
- higher sampling rate (usually expressed as kHz, e.g., 96kHz) is preferred over lower sampling rate
- 24-bit sample word-length is preferred over shorter
- linear PCM (uncompressed) is preferred over compressed (lossy or lossless)
- higher data rate (e.g. 128 kilobits per second) is preferred over lower data rate for the same compression scheme and sampling rate
- AAC compression is preferred over MPEG-layer 3 (MP3) compression
- surround sound (5.1 or 7.1) encoding is only necessary if it’s essential to the creator’s intent, otherwise uncompressed encoding in stereo is preferred.
The same principle applies to moving image files, so that larger picture sizes are preferred over smaller ones, higher bit rates are preferred over lower, and so on. When it comes to actual file types the LOC prefers MPEG-2, MPEG-4-AVC, MPEG-4-V, MPEG-1.
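The sound preferences listed above can be sketched as a sort key for choosing between several copies of the same recording. Everything named here (AudioCopy, CODEC_RANK, preference_key, best_copy) is hypothetical, and the order in which the rules are applied (encoding first, then sampling rate, word length and data rate) is my own assumption; the LOC does not say how to rank one rule against another.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AudioCopy:
    label: str
    codec: str           # hypothetical labels: "pcm", "aac", "mp3"
    sample_rate_hz: int
    bit_depth: int       # 0 where word length is not meaningful
    bitrate_kbps: int


# Lower rank is better: linear PCM over compressed, AAC over MP3.
# Codecs not listed here rank last (an assumption for this sketch).
CODEC_RANK = {"pcm": 0, "aac": 1, "mp3": 2}


def preference_key(item: AudioCopy):
    """Sort key: the most preferred copy gets the smallest key."""
    return (
        CODEC_RANK.get(item.codec, len(CODEC_RANK)),
        -item.sample_rate_hz,   # higher sampling rate preferred
        -item.bit_depth,        # longer sample word preferred
        -item.bitrate_kbps,     # higher data rate preferred
    )


def best_copy(copies):
    """Return the most preferred copy from a non-empty list."""
    return min(copies, key=preference_key)


if __name__ == "__main__":
    copies = [
        AudioCopy("mp3-128", "mp3", 44100, 0, 128),
        AudioCopy("aac-128", "aac", 44100, 0, 128),
        AudioCopy("wav-96-24", "pcm", 96000, 24, 4608),
        AudioCopy("wav-44-16", "pcm", 44100, 16, 1411),
    ]
    print(best_copy(copies).label)
```

A moving image version would follow the same pattern, with picture size and bit rate taking the place of sampling rate and word length.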
The LOC site contains a great deal of analysis and commentary on specific file formats for differing content categories.
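To make the weighting question in this section concrete, here is a minimal sketch of how a repository might combine the seven sustainability factors into a single figure. The factor keys, the 0 to 1 scoring scale and the weights are all my own assumptions for illustration; the LOC publishes no such formula and leaves the balance to each repository.

```python
# Hypothetical keys for the seven sustainability factors.
FACTORS = (
    "disclosure",
    "adoption",
    "transparency",
    "self_documentation",
    "external_dependencies",
    "patents",
    "technical_protection",
)


def weighted_score(scores: dict, weights: dict) -> float:
    """Weighted average of factor scores.

    Each score is assumed to run from 0.0 (worst) to 1.0 (best), so for
    dependencies, patents and protection a high score means few problems.
    """
    missing = [f for f in FACTORS if f not in scores or f not in weights]
    if missing:
        raise ValueError(f"missing factors: {missing}")
    total_weight = sum(weights[f] for f in FACTORS)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(scores[f] * weights[f] for f in FACTORS) / total_weight


if __name__ == "__main__":
    # A format that is widely adopted but poorly self-documenting.
    scores = {f: 0.5 for f in FACTORS}
    scores["adoption"] = 1.0
    scores["self_documentation"] = 0.0

    equal = {f: 1.0 for f in FACTORS}
    adoption_heavy = dict(equal, adoption=3.0)

    print(round(weighted_score(scores, equal), 3))
    print(round(weighted_score(scores, adoption_heavy), 3))
```

The same format scores differently depending on which factors a repository chooses to emphasise, which is exactly why the LOC leaves the decision to each institution.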
Initial, middle, and final-state formats
Traditionally repositories collect and preserve final-state versions of documents, although they sometimes collect drafts (initial states) in order to show stages in the creative process. The LOC website makes the interesting point that the middle state is often better for long-term digital preservation. The middle state is the state of the record prior to publication: the example given is that of separate multi-track music recordings fresh from the studio, with all their metadata, prior to the final mix. The middle state actually contains the maximum amount of information. “These are likely to have higher quality than final-state formats, may be easier to manage for preservation, and may also be the focus of developing archiving approaches by industry.”