You are currently browsing the category archive for the ‘Books’ category.

Review of and notes from Julien Masanes’s chapter on web archiving in Deegan and Tanner’s Digital Preservation (2006). This book also contains a chapter by Elisa Mason looking at some web archiving case studies. Available at Amazon.

Masanes asserts that the web is a dynamic information space; web archiving consists of capturing a local, limited version of one small part of this space, and freezing it. This is well tricky, for various reasons. 

  • Much of the web’s content is dynamic: a URI can link not only to a simple flat page, but also to a virtual page, which is really a set of parameters that is then interpreted to generate the displayed content.
  • The actual dynamics of these dynamic pages evolve rapidly, because it is so easy for people to code up a new protocol and share it around.
  • The concept of a definitive or final version does not apply to the web: web documents are always prone to deletion or updating. Any preservation undertaken is therefore by its nature a sampling done at a specific time.
  • The navigation paths are embedded within the documents themselves, so a web archive has to be constructed in such a way that this mechanism still works for preserved pages (see the sketch after this list).
  • Automated crawlers or spiders often take a few days to navigate large websites. This can lead to temporal incoherence, in that a page linked from the home page may be different from how it was when the home page itself was crawled.
  • Size. Size. Size. An organisation’s website can easily run into hundreds of thousands of pages, comprising billions of individual objects such as graphics etc.
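Just to make that link-rewriting point concrete, here is a minimal sketch of my own (not Masanes’s) of how an archive might rewrite the links inside a captured page so that they point at preserved local copies rather than back out to the live web. The function name and the url_to_local mapping are purely illustrative:

```python
import re
from urllib.parse import urljoin

def rewrite_links(html, page_url, url_to_local):
    """Rewrite href/src attributes so they point at archived local copies.
    url_to_local maps absolute URLs to paths within the archive."""
    def replace(match):
        attr, quote, target = match.group(1), match.group(2), match.group(3)
        absolute = urljoin(page_url, target)  # resolve relative links first
        local = url_to_local.get(absolute)
        return f'{attr}={quote}{local or absolute}{quote}'
    return re.sub(r'(href|src)=(["\'])(.*?)\2', replace, html)

# Example: a link to /about.html inside a captured home page
html = '<a href="/about.html">About</a>'
mapping = {"http://example.org/about.html": "archive/example.org/about.html"}
print(rewrite_links(html, "http://example.org/", mapping))
# -> <a href="archive/example.org/about.html">About</a>
```

Anything the mapping doesn’t know about stays pointing at the live web, which is itself a preservation decision.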



Secrets and Lies: Digital Security in a Networked World, by Bruce Schneier. Publisher: John Wiley & Sons (23 Jan 2004); ISBN-10: 0471453803. Available from Amazon.

OK, so it’s a book about digital security, not about digital preservation. But if there was a book on digital preservation as well written as this then I doubt we would have any problems in getting our message across. Well worth reading.

There are two particular aspects which jumped out as being indirectly relevant to digital preservation concerns, both to do with the interaction of humans with computers:

There is no such thing as a computer system; there are only computer-with-human systems. Well I’m paraphrasing Schneier there, but it’s the sort of thing he would say, and he argues that it is the case. It is pointless to buy a digital security package and then leave the password on a Post-it note gummed to the monitor. It is pointless to invest in 128-bit encryption if the password you choose will be your cat’s name. It is pointless to set up a cutting-edge firewall if you pay your staff so little that they will be bribed by a guy in the pub to burn the data onto a CD anyway. Schneier is making the point that an ICT system, by itself, is meaningless: it exists in a world full of humans, and we need to make sure the human elements are as trustworthy as the technical ones. This strikes me as being indirectly relevant to digital preservation. We argue lots about technical aspects – emulation, migration, file formats, metadata, XML etc – but we need to train ourselves up in human psychology and understand exactly how people will interact with our proposed systems.

Humans don’t do work on data; only progams do. (Another paraphrase there.) Schneier’s explicit point is about encryption, such as PGP. Very often you read statements like “Alice encrypts a message with Bob’s public key, which Bob can then decrypt because he has his own private key.” But in reality, nothing of the sort ever happens. Instead Alice presses a key on her computer. An application then encrypts the message. Nor does Bob decrypt. Instead he presses a key on his own computer, and the computer does the decrypt. Alice is trusting her computer, her OS and the app to do their job, and trusting that the encryption software company haven’t rigged up a backdoor. Bob, too, is trusting a whole load of people that he has never met, purely because he has bought their software.  

There is an analogy here with digital preservation, as Schneier’s point can be extrapolated across to migration and emulation. When someone says “we can emulate X on Y” what they actually mean is “there is a company claiming that X can be emulated on Y, and I am trusting them.” Or: “there is a company claiming that their software can automatically migrate 1,000,000 files from file format X to file format Y with no loss of information content, and I am trusting them.” Or: “there is a company claiming that their checksum software proves fixity in refreshing data, and I am trusting them.” Ultimately we do not trust the technology; we have to trust the people behind the technology.
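For what it’s worth, here is roughly what that checksum claim boils down to in practice: a minimal sketch, with hypothetical file names, of hashing a file before and after refreshing it onto new media. Even this little script is something we choose to trust.

```python
import hashlib

def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files don't need to fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# Hypothetical paths: the copy on the old media and the refreshed copy.
before = sha256_of("old_media/report.pdf")
after = sha256_of("new_media/report.pdf")
if before != after:
    raise RuntimeError("Fixity check failed: the refreshed copy differs")
```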

Planning and Implementing Electronic Records Management: a practical guide (Hardcover) by Kelvin Smith (Author), Publisher: Facet Publishing (Oct 2007), ISBN-10: 185604615X. Available from Amazon. Chapter 8 concerns Preservation, especially ‘long-term’, which is defined (p.130) as being ‘greater than one generation of technology.’ Unlike other books I have read so far, Smith’s approach is largely standards-based.

Smith begins by making the interesting point that there is still “a certain amount of distrust” of electronic records (p.129), and that people still seem to be happier with paper for preservation. This is no longer acceptable.

Smith then looks at four core challenges (authenticity, reliability, integrity and useability) in the light of ISO 15489. Authenticity is not an either/or thing: there is a sliding scale of authenticity, and the more of the requirements that have been met, the stronger the presumption of authenticity. Likewise, integrity does not mean that a record is unchanged: it means that only authorized and appropriate changes have been made.

Other standards relevant to digital preservation are

  • ISO 17799 Information security management (a revision of BS 7799)
  • BIP 0008 Code of practice for legal admissibility etc of electronic information
  • e-GIF the UK e-Government Interoperability Framework
  • OAIS Open Archival Information System
  • BS 4783 Recommended environmental storage for digital media
  • BS 25999 Business continuity best practice

File formats

Smith says there is a case for creating the records properly in a sustainable format to begin with. [See I have a cup of coffee. AA] It’s more cost-effective for an organisation to take preservation factors into account at the beginning of the life cycle than halfway along. TNA have guidance on selecting good file formats, and e-GIF is useful here too.

But if you decide to create records in a short term or proprietary format then you need to mull over migration vs. emulation. Smith summarises the usual pros and cons. The only interesting additional points he makes are that (a) migration should always support business needs as well as preserve record content, ie. you don’t want to migrate to a format you cannot directly search or copy from, and (b) any migration strategy should integrate with existing corporate policies and procedures (especially BIP 0008). His RM policies mindset is coming through clearly here.

Databases

Smith’s book is the only one I have read so far to include a section on database preservation, and it’s short (less than a page). Preservation depends really on what sort of database it is: in some DBs old data is overwritten by new data, while in others data is never removed or overwritten. Similarly, some DBs are time or project-limited (such as surveys) while others carry on indefinitely. The usual approach is a simple all-or-nothing snapshot of the data which is then converted to some standard form rather than its native one. In addition some systems preserve an audit trail alongside, capturing every alteration made to records.

Implementing the preservation strategy

Smith then finishes the chapter with an excellent three-page summary of the key steps you need to undertake, practically, to implement a strategy. A 6-point summary of his 11-point summary:

  • work with records creators and archivists to appraise and select records for permanent preservation
  • identify the right people within your own organisation to carry out preservation
  • decide on a technical preservation approach, and work with ICT people to see that it is carried out and properly tested
  • verify that the approach has worked, and keep a temporary backup of everything until you know it has
  • keep metadata and documentation on everything
  • keep all the stakeholders in the loop.

He also recommends getting authority to destroy the original e-records once the preservation has been carried out successfully, i.e. once the records are usable, authentic and reliable.

Noted from OAIS.

Section 5 of the OAIS model explicitly addresses practical approaches to preserve digital information. The model immediately ties its colours to the migration mast. “No matter how well an OAIS manages its current holdings, it will eventually need to migrate much of its holdings to different media and/or to a different hardware or software environment to keep them accessible” (5.1). Emulation is mentioned later on, but always with some sort of proviso or concern attached.

Migration 

OAIS identifies three main motivators behind migration (5.1.1). These are:

  • keeping the repository cost-effective by taking advantage of new technologies and storage capabilities
  • staying relevant to changing consumer expectations
  • simple media decay.

OAIS then models four primary digital migration types (5.1.3). In order of increasing risk of information loss, they are:

  • refreshment of the bitstream from old media to newer media of the exact same type, in such a way that no metadata needs to be updated. Example: copying from one CD to a replacement CD.
  • replication of the bitstream from old media to newer media, and for which some metadata does need updating. The only metadata change would be the link between the AIP’s own unique ID and the location on the storage of the AIP itself (the “Archival Storage mapping infrastructure”). Example: moving a file from one directory on the storage to another directory.       
  • repackaging the bitstream in some new form, requiring a change to the Packaging Information. Example: moving files off a CD to new media of different type.
  • transformation of the Content Information or PDI, while attempting to preserve the full information content. This last one is the one we traditionally term “migration,” and is the one which poses the most risk of information loss.

In practice there might be mixtures of all these. Transformation is the biggie, and section 5.1.3.4 goes into it in some detail. Within transformation you can get reversible transformation, such as replacing ASCII codes with UNICODE codes, or using a lossless compression algorithm; and non-reversible transformation, where the two representations are not semantically equivalent. Whether a non-reversible transformation has preserved enough information content may be difficult to establish.
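Those two reversible examples are easy to sketch. The snippet below (my own illustration, not from the OAIS text) round-trips a bitstream through a character-encoding change and through lossless compression, checking that the original comes back bit for bit, which is exactly what makes these transformations reversible:

```python
import zlib

original = "Plain ASCII text".encode("ascii")

# Encoding change: ASCII -> UTF-16 and back again.
as_unicode = original.decode("ascii").encode("utf-16")
assert as_unicode.decode("utf-16").encode("ascii") == original

# Lossless compression and decompression.
compressed = zlib.compress(original)
assert zlib.decompress(compressed) == original
```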

Because the Content Information has changed in a transformation, the new AIP qualifies as a new version of the previous AIP. The PDI should be updated to identify the source AIP and its version, and to describe what was done and why (5.1.4). The first version of the AIP is referred to as the original AIP and can be retained for verification of information preservation.
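Purely as an illustration of what that bookkeeping might record (the field names below are my own, not anything mandated by the OAIS data model), the PDI of the new AIP could carry a provenance note along these lines:

```python
# Hypothetical provenance note added to the new AIP's PDI after a transformation.
# Field names and values are illustrative only.
provenance_note = {
    "source_aip": "AIP-000123",
    "source_version": 1,
    "new_version": 2,
    "action": "transformation",
    "description": "Content Information converted from format X to format Y",
    "reason": "Format X no longer understandable by the Designated Community",
    "performed": "2008-03-01",
}
```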

The OAIS Model also looks at the possibility of improving or upgrading the AIP over time. Strictly speaking, this isn’t a transformation, but is instead creating a new Edition of an AIP, with all its own associated metadata. This can be viewed as a replacement for a previous edition, but it may be useful to retain the previous edition anyway.

There’s also a Derived AIP, which could be a handy extraction of information aggregated from multiple AIPs. But this does not replace the earlier AIPs.

Emulation 

All that is fine for pure data. But what if the look and feel needs preserving too?

The easy thing to do in the short to medium term is simply to pay techies to port the original software to the new environment. But OAIS points out that there are hidden problems. It may not be obvious, when the app runs, that it is functioning incorrectly. Testing all possible output values is unlikely to be cost-effective for any particular OAIS. Commercial bridges (commercially provided conversion software packages that transform data into other forms with a similar look and feel) suffer from the same problems, and in addition give rise to potential copyright issues.

“If source code or commercial bridges are not available and there is an absolute requirement for the OAIS to preserve the Access look and feel, the OAIS would have to experiment with “emulation” [sic] technology” (5.2.2).

Emulation of apps has even more problems than porting. If the output isn’t visible data but is something like sound, then it becomes nearly impossible to know whether the current output is exactly the same as the sound made 20 years ago on a different combination of app and environment. We would also need to record the sound in some other (non-digital!) form, to use as validation information.

A different approach would be to emulate the hardware instead. But the OAIS model has an excellent paragraph summarising the problems here, too, which I’ll quote in full (in 5.2.2.2):

 “One advantage of hardware emulation is the claim that once a hardware platform is emulated successfully all operating systems and applications that ran on the original platform can be run without modification on the new platform. However, this does not take into account dependencies on input/output devices. Emulation has been used successfully when a very popular operating system is to be run on a hardware system for which it was not designed, such as running a version of Windows™ on an Apple™ machine. However even in this case, when strong market forces encourage this approach, not all applications will necessarily run correctly or perform adequately under the emulated environment. For example, it may not be possible to fully simulate all of the old hardware dependencies and timings, because of the constraints of the new hardware environment. Further, when the application presents information to a human interface, determining that some new device is still presenting the information correctly is problematical and suggests the need to have made a separate recording of the information presentation to use for validation. Once emulation has been adopted, the resulting system is particularly vulnerable to previously unknown software errors that may seriously jeopardize continued information access. Given these constraints, the technical and economic hurdles to hardware emulation appear substantial.” 

Top stuff.

Noted from OAIS.

The OAIS reference model groups all the various processes happening within an archive into six basic entities.

The Ingest entity receives the SIP and turns it into an AIP for storage within the OAIS. This is the point at which a record may migrate from one file format to another. The Ingest people do detailed technical negotiating with Producers, create the Descriptive Information, check the record’s authenticity and so on.


Noted from OAIS. It strikes me that the concept of the Designated Community is central to how an OAIS even begins to think about its digital preservation. No one is saving records just for fun. They save records so that someone else will consult them at a later date. How we define ‘someone else,’ together with their interests and concerns, determines what features we need to preserve.

The atomic unit here is the Consumer, which is defined in the Model (1.7.2) as “those persons or client systems who interact with OAIS services to find preserved information of interest and to access that information in detail. This can include other OAISs as well as internal OAIS persons or systems.” The Consumer is the entity which receives a DIP.

Digital Preservation (Digital Futures Series) (Hardcover), by Marilyn Deegan (Editor), Simon Tanner (Editor). Hardcover: 260 pages; Publisher: Facet Publishing (18 Sep 2006); ISBN-10: 1856044858. Available at Amazon.

This is the most recent book published in the UK on digital preservation, and if I can speak from a parochial viewpoint for a bit, it’s nice to have a UK slant on things, with details given about UK projects. This means that Digital Preservation contains some practical information which is not present in Borghoff et al.


Noted from OAIS. Representation Information is a crucial concept, as it is only through our understanding of the Representation Information that a Data Object can be opened and viewed. The Representation Information itself can only be interpreted with respect to a suitable Knowledge Base.

The Representation Information concept is also inextricably tied in with the concept of the Designated Community, because how we define the Designated Community (and its associated Knowledge Base) determines how much Representation Information we need. “The OAIS must understand the Knowledge Base of its Designated Community to understand the minimum Representation Information that must be maintained… Over time, evolution of the Designated Community’s Knowledge Base may require updates to the Representation Information to ensure continued understanding” (2.2.1).


Notes from Borghoff et al. Emulation has some notable advantages over migration, not least that it guarantees the greatest possible authenticity. The document’s original bitstream will always remain unchanged. All (!) we have to do is make sure that a working copy of the original app is available. As it’s impossible to keep the hardware running, we have to emulate the original system on new systems.

In theory there are no limitations on the format of the record – even dynamic behaviour should be preserved ok. But there are three massive worries with emulation: (a) can it be achieved at reasonable cost, (b) is it possible to resolve all the copyright and legal issues involved in running software programs over decades, and (c) will the human-computer interface of the long-term future be able to cope with the mouse-and-keyboard interface of today’s applications? The only realistic way to answer (c) would be to create a “vernacular copy” (p.78), but this strikes me as migration under a different name – just my own thought.


Noted from the OAIS model.

The OAIS model generally is not prescriptive, but it contains one section (3.1) where it lays out the responsibilities that an organisation must discharge in order to operate as an OAIS. These are:

1. Negotiate with Information Producers and accept appropriate information from them. This is simply the digital equivalent of what any record office does, though an OAIS in practice needs to gather much more information about a given accession, for PDI purposes.

2. Obtain sufficient control of the information to the level needed to ensure long term preservation. In a paper archive this is largely (a) keeping the stuff in a box and (b) capturing any access, copyright and legal restrictions as necessary. In a digital repository there is (c) the need to capture all the technical metadata for PDI purposes too. There may be additional legal issues as well, concerning authenticity, software copyright etc. “It is important for the OAIS to recognize the separation that can exist between physical ownership or possession of Content Information and ownership of intellectual property rights in this information” (3.2.2). The OAIS in practice may need to obtain authority to migrate Content Information to new representation forms.

3. Determine which groups should become the Designated Community able to understand the information. This is a more important task in a digital archive than a paper one, because how we define the DC determines what sort and level of Representation Information we need to keep alongside the Content data. The DC may change over time. OAIS suggests (3.2.3) that selecting a broader rather than a narrower definition helps long term preservation, as it means that more detailed RI is captured at an early stage, rather than leaving it until later.

4. Ensure that the preserved information is independently understandable to the DC, so that no further expert assistance is needed. [AA: this is an interesting point as paper repositories often work in the opposite way: the DC is so large (“the general public”) that a searchroom has to employ professional archivists and well-trained archive assistants to be on hand to explain the documents to the visitor.] The quality of being “independently understandable” will change over time. This means that RI will have to be updated as the years go by, even if the DC itself does not change.

5. Follow documented policies and procedures to ensure that (a) the information can be preserved against all reasonable contingencies, and (b) the information can be disseminated as authenticated copies of the original or as traceable back to the original. Section 3.2.5 suggests that these policies should be available to producers, consumers and any related repositories, and that the DC should be monitored so that the Content Information is still understandable to them. An OAIS should also have a long term technology usage plan.

6. Make the preserved data available to the DC. An OAIS should have published policies on access and restrictions, so that the rights of all parties are protected.