
The WARC format for web archiving is now ISO 28500:2009. The format is used by the Internet Archive.

Here’s the release from the Library of Congress:

The International Internet Preservation Consortium is pleased to announce the publication of the WARC file format as an international standard: ISO 28500:2009, Information and documentation — WARC file format.  [http://www.iso.org/iso/iso_catalogue/catalogue_tc/catalogue_detail.htm?csnumber=44717]

For many years, heritage organizations have tried to find the most appropriate ways to collect and keep track of World Wide Web material using web-scale tools such as web crawlers. At the same time, these organizations were concerned with the requirement to archive very large numbers of born-digital and digitized files. A need was for a container format that permits one file simply and safely to carry a very large number of constituent data objects (of unrestricted type, including many binary types) for the purpose of storage, management, and exchange. Another requirement was that the container need only minimal knowledge of the nature of the objects.

The WARC format is expected to be a standard way to structure, manage and store billions of resources collected from the web and elsewhere. It is an extension of the ARC format [http://www.archive.org/web/researcher/ArcFileFormat.php ], which has been used since 1996 to store files harvested on the web. WARC format offers new possibilities, notably the recording of HTTP request headers, the recording of arbitrary metadata, the allocation of an identifier for every contained file, the management of duplicates and of migrated records, and the segmentation of the records. WARC files are intended to store every type of digital content, either retrieved by HTTP or another protocol.

The motivation to extend the ARC format arose from the discussion and experiences of the International Internet Preservation Consortium [http://netpreserve.org/ ], whose core mission is to acquire, preserve and make accessible knowledge and information from the Internet for future generations. IIPC Standards Working Group put forward to ISO TC46/SC4/WG12 a draft presenting the WARC file format. The draft was accepted as a new Work Item by ISO in May 2005.

Over a period of four years, the ISO working group, with the Bibliothèque nationale de France [http://www.bnf.fr/ ] as convener, collaborated closely with IIPC experts to improve the original draft. The WG12 will continue to maintain [http://bibnum.bnf.fr/WARC/ ] the standard and prepare its future revision.

Standardization offers a guarantee of durability and evolution for the WARC format. It will help web archiving entering into the mainstream activities of heritage institutions and other branches, by fostering the development of new tools and ensuring the interoperability of collections. Several applications are already WARC compliant, such as the Heritrix [http://crawler.archive.org/ ] crawler for harvesting, the WARC tools [http://code.google.com/p/warc-tools/ ] for data management and exchange, the Wayback Machine [http://archive-access.sourceforge.net/projects/wayback/ ], NutchWAX [http://archive-access.sourceforge.net/projects/nutch/ ] and other search tools [http://code.google.com/p/search-tools/ ] for access. The international recognition of the WARC format and its applicability to every kind of digital object will provide strong incentives to use it within and beyond the web archiving community.

———————–
Abbie Grotke
Library of Congress
IIPC Communications Officer
netpreserve.org
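
To make the container idea in the release a bit more concrete, here is a minimal Python sketch of a single WARC record, as far as I understand the WARC 1.0 layout: a version line, some named header fields, a blank line, the payload, and two line breaks to close the record. The function name, the example URI and the payload are all invented for illustration, and a real crawler will record a good deal more than this.

```python
import uuid
from datetime import datetime, timezone


def build_warc_record(target_uri, payload, content_type="text/plain"):
    """Return one WARC 'resource' record as bytes (illustrative sketch only)."""
    now = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    headers = [
        "WARC/1.0",                                    # version line
        "WARC-Type: resource",                         # kind of record
        f"WARC-Record-ID: <urn:uuid:{uuid.uuid4()}>",  # identifier for this record
        f"WARC-Date: {now}",                           # when the record was created
        f"WARC-Target-URI: {target_uri}",              # where the content came from
        f"Content-Type: {content_type}",
        f"Content-Length: {len(payload)}",             # payload size in bytes
    ]
    head = "\r\n".join(headers).encode("utf-8")
    # Headers and payload are separated by a blank line; the record
    # is closed by two CRLF pairs.
    return head + b"\r\n\r\n" + payload + b"\r\n\r\n"


# Hypothetical example: wrap five bytes of text as one record.
record = build_warc_record("http://example.org/", b"hello")
print(record.decode("utf-8"))
```

A WARC file is then just one such record after another, which is why the container needs only minimal knowledge of what each payload actually is.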

The reason why I’ve been away from digi preservation for so long is that I’ve been managing the move of our paper archives from one repository to another.  The move itself has gone more smoothly than I dared to hope: everything happened on schedule, the ICT didn’t let me down, all the boxes fitted their new locations so my sums must have been accurate enough… it’s taken three weeks to move a mile of paper and parchment archives.

Our understanding is that we’re the first people to control a UK local authority repository move with barcodes. It’s taken 2.5 years of preparation, mainly spent in getting all our barcode data onto CALM, but the result this week is that all we had to do was a massive zap of all the barcodes in the building (that’s taken 48 hours), upload the data into CALM’s locations module, and voila! – we now know where everything is.

Boxes hundreds of em

Above: boxes in the new repository.  At the old record office we had boxes of different sizes and formats scattered throughout the building. In the new repository we have been very strict in storing boxes purely by format even if it means splitting collections up. We’re relying totally on the barcodes to find them.

Cantilevers

Rolled maps in linen bags. Every single individual package, whether it’s a roll, a box, a freestanding volume or a folder in a drawer, has its own barcode.

James Dear

Even the 19th century portrait which we have on deposit has its own barcode!

Zapping

Here one of my colleagues is zapping the boxes on their new shelves. First we zap the shelf (all the shelves have their own barcodes) and then we zap the items on it. This raw data gets imported into Excel, where lookup tables replace the numbers with human-readable information (e.g. replacing “L012345” with “Bay R6 shelf D”). Then it all goes into CALM’s locations module, so that it links automatically with the documents’ catalogue entries.
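
For anyone who wants to see the shelf-then-items logic spelt out, here is a rough Python sketch of it. We actually did this step with Excel lookup tables, not Python. Only the “L012345” to “Bay R6 shelf D” pairing comes from our real data; the second shelf, the item barcodes and the function name are made up for illustration.

```python
# Hypothetical lookup table: shelf barcode -> human-readable location.
SHELF_NAMES = {
    "L012345": "Bay R6 shelf D",
    "L012346": "Bay R6 shelf E",  # invented second shelf
}


def assign_locations(scans, shelf_names):
    """Pair every item barcode with the most recently scanned shelf."""
    locations = {}
    current_shelf = None
    for code in scans:
        if code in shelf_names:
            # A shelf barcode: everything scanned next sits on this shelf.
            current_shelf = shelf_names[code]
        elif current_shelf is None:
            raise ValueError(f"item {code} was scanned before any shelf")
        else:
            locations[code] = current_shelf
    return locations


# Raw scan order: shelf first, then the items on it (item codes invented).
scans = ["L012345", "B000101", "B000102", "L012346", "B000103"]
locations = assign_locations(scans, SHELF_NAMES)
for item, place in locations.items():
    print(item, "->", place)
```

The resulting item-to-location pairs are the sort of thing that then gets uploaded into the locations module.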

The methodology took us months to work out, followed by two years of repackaging work and sticking barcodes on everything, just to result in 48 hours of zapping in the new repository.

Here are some details about our barcode methodology in the National Archives’ RecordKeeping magazine (we’re on page 36).

It’s been a long, long project but it’s all gone smoothly and I feel rather chuffed to have managed the move in a new way. Back to digi preservation soon!


MLA East of England has published the report on Phase 2 of its Digital Preservation Regional Pilot Project (DARP 2). The report is available as a PDF here. Phase 2 was carried out by Bedfordshire County Council over the period September 2007-June 2008.

The project is of great use to UK local authority record offices, such as the one I work for, because it assesses the real world situation where outside organisations create digital records and then deposit them with local archive services. This is a different situation from that experienced by national archive organisations, which by and large deal with fewer record-creating organisations, and which therefore have more say over the sorts of records created. A UK local authority archives service typically deals with thousands of separate organisations and individuals, and has little or no say over file formats.

The aim of the DARP 2 project was therefore to survey a sample of these “typical depositors” to establish the reality behind this concern. Are organisations creating large numbers of electronic records for long term preservation, or are they still reliant on paper? How are they using digital records?

Bedfordshire and Luton Archives Service surveyed a range of organisations, including Parochial Church Councils, magistrates courts, town councils, parish councils, state and independent schools, and some businesses and charities. The survey was carried out with a questionnaire and with a follow-up interview.

Summary of DARP 2’s interesting results

“The overall picture was one of all or nothing in terms of understanding.” This is probably just as true of colleagues within the archives sector… Digital preservation sadly is not a subject which people can pick up a working knowledge of in their day to day activities, nor does it crop up very often in the media. You are either interested in it (in which case you will read up lots) or you are not (in which case you will know nothing). It is not like (say) gardening, where there is a whole spectrum of levels of involvement, from just weeding right through to plant breeding.

Most organisations still use paper. Some bodies stated that this was due to issues concerning the admissibility of digital records in court. Other organisations depend entirely on volunteers using home computers, using unsophisticated filing systems on old equipment. At least one organisation stated that electronic records were kept purely as backup for paper. Certainly it seems that many organisations regard paper as the best long term solution: more than half of all respondents archived their emails by printing them out and filing the hard copies.

The report itself states that “paper is still the medium of choice for record keeping – 85% of the bodies surveyed are printing out digital files… although computers offer very creative means of generating ways of populating and decorating the blank page, they are tending to be seen as tools for manipulating and storing documents not as the final means of storing and managing records.”

Only ten respondents (out of 26) answered the question concerning migration, and three even stated that they did not understand the concept.

There is a problem with digital record keeping in state schools. It is remarkable that DARP found it difficult even to include state schools in the project, or even to work out who at a school was responsible for record keeping.

No organisation thought that the record office would fail to deal with digital records.

The most popular backup medium was CD-R, closely followed by memory stick.

An excellent new blog.

“This blog is a place for ULCC’s Digital Archives staff to record information about the activities and projects they are involved with. ULCC’s Digital Archives department has been working for over a decade on digital archives, library and preservation projects and initiatives, including systems for the University of London, the National Archives, the British Library and the JISC. We hope that the blog will build an authentic journal of our work and a reliable reference and online memory for our own records – a less formal record than reports and newsletters. If any of the information in it is helpful to others working in the field, so much the better.”

Follow the link in the blogroll

Emulation means running a program on a future computer which makes it emulate the hardware of an older computer, enabling software written for that older computer to run, so nothing ever needs to be migrated. Writing such software is not trivial (although in theory it only needs to be done once). AA: I think this is the Macintosh Lisa project?? You can also emulate specific applications (AA: I’m thinking of Marathon here), but this involves a huge amount of reverse engineering if the format is proprietary. Strictly speaking, you are not actually running the original program at all, just something pretending to be the program (I’m not running Marathon, I’m running Aleph One).
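
As a toy illustration of the idea (and nothing like a real emulator), here is a short Python sketch that “emulates” a completely made-up machine with one register and four instructions. The old program is never rewritten: the new computer simply reads each old instruction and imitates what the old hardware would have done. The instruction names and the example program are invented for this sketch.

```python
def run(program):
    """Emulate a made-up accumulator machine by interpreting its instructions."""
    acc = 0      # the emulated machine's single register
    pc = 0       # program counter: which instruction comes next
    output = []  # whatever the emulated machine "prints"
    while pc < len(program):
        op, arg = program[pc]
        if op == "LOAD":
            acc = arg
        elif op == "ADD":
            acc += arg
        elif op == "PRINT":
            output.append(acc)
        elif op == "HALT":
            break
        else:
            raise ValueError(f"unknown instruction {op!r}")
        pc += 1
    return output


# A hypothetical "old program", kept exactly as it was written.
old_program = [("LOAD", 2), ("ADD", 3), ("PRINT", None), ("HALT", None)]
result = run(old_program)
print(result)
```

A real emulator has to imitate a whole processor, memory, screen and disk drives in the same spirit, which is where the non-trivial work comes in.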

Presumably the DOS command window in today’s Windows systems is an OS emulator. (Rich says it is.) And you can run the original 1981 VisiCalc spreadsheet software on it – as far as I’m aware the code used is still the original code, not rewritten code (http://www.bricklin.com/history/vcexecutable.htm accessed 28.11.2007).

Emulation is attractive, as in theory it captures all aspects of the original file – the content, the formulae, their relationships, the behaviour, the appearance. But it is very difficult, not least because it all has to be worked out while the original platform is still active. You then need to preserve the emulator, the OS, the application installation files and the records. (So you need to remember to keep them.) It’s probably too much work for individual files like spreadsheets. It’s worth noting that emulators have only really been done for games, not for spreadsheet programs.

Visited December 2007. Loads of hardware but with no real captions or contextualisation yet. Tum te tum.

The biggie in the photos below is an ICL 2966 mainframe. ICL (International Computers Ltd) was a British manufacturer which brought out its 2900 range in 1974. The 2966s could run dual processors. They ran the VME (Virtual Machine Environment) operating system, which is now called OpenVME. ICL is now owned by Fujitsu, and the ICL branding has been dropped. Wikipedia says “as a creation of the mid-1970s, with no constraints to be compatible with earlier operating systems, VME is in many ways more modern in its architecture than today’s Unix derivatives (Unix was designed in the 1960s) or Windows (which started as an operating system for single-user computers, and still betrays those origins)… The most recent incarnations of VME run as a hosted subsystem, called superNova, within Microsoft Windows or Red Hat Enterprise Linux on Intel-based hardware.”

icl2966mainframe-c.jpg icl2966mainframe-d.jpg

This is cool, it shows the printed manual for the 2900 series. Unbelievably cheap looking and full of biro annotations.

icl2966mainframe-e-manual.jpg

Here’s the storage, a mixture of fixed and removable media.

icl2966mainframe-a.jpg icl2966mainframe-b.jpg

Here’s a printer which was nearby, together with a punch card writer and a card sorter.

printer.jpg punchcard-writer.jpg punchcard-sorter.jpg

Here’s Colossus

colossus-a.jpg colossus-b.jpg

PC magazine 1988

This is a cool image – shows 55 word processors being reviewed in the late 1980s. Things have standardised a lot since then. Image from http://digitalarchive.oclc.org/da/ViewObject.jsp?objid=0000070519&reqid=4345

Apparently this is the real thing for records management. Applies to all records, paper or digital. Probably need to find out a bit more about this.

A test posting from the PDA