
The problems of domain crawls

  • The legal approach is usually a risk-based one, i.e. harvest the content and then take it down if the creator objects. If the creator takes the matter to court, this approach could prove very expensive.
  • The approach has no basis in law. Copying any site without the content creator’s permission is an illegal act.
  • There is no guarantee that a site discoverable today in the repository will still be there in the future, as the content creator may have requested its removal.
  • There is no 1:1 engagement with the content creators, and frequently no engagement at all.
  • The crawl ignores lots of valuable content. Not everything relevant to UK web history actually has a “.uk” domain.
  • Domain crawls are slow, and miss much of the web’s fleeting, at-risk, or semantic content.

The problems of active selection

  • The selective approach is usually a permissions-based one, i.e. approach the content creator first and ask for permission to archive. But this demands engagement with the creator, which is time-consuming, and which in turn drives the policy to become even more selective than the repository may originally have envisaged. So the result is usually small-scale harvesting.
  • Creators may not understand the purpose or urgency of archiving.
  • Creators may say No, in which case the efforts made to engage with them have been fruitless.
  • Many sites are not selected.
  • The repository may not have the resources to re-evaluate selection decisions. Therefore, once a site has been rejected, it may continue to be rejected, even though its content has changed.
  • The repository needs to implement a policy on whether to continue archiving a site whose content accruals have stopped being useful. But this constant oversight of the harvesting schedule requires resources.

Review of and notes from Julien Masanes’s chapter on web archiving in Deegan and Tanner’s Digital Preservation (2006). This book also contains a chapter by Elisa Mason looking at some web archiving case studies. Available at Amazon.

Masanes asserts that the web is a dynamic information space; web archiving consists of capturing a local, limited version of one small part of this space, and freezing it. This is well tricky, for various reasons. 

  • Much of the web’s content is dynamic: a URI can link not only to a simple flat page, but also to a virtual page, which is really a set of parameters which is then interpreted to generate the displayed content.
  • The actual dynamics of these dynamic pages evolve rapidly, because it is so easy for people to code up a new protocol and share it around.
  • The concept of a definitive or final version does not apply to the web: web documents are always prone to deletion or updating. Any preservation undertaken is therefore by its nature a sampling done at a specific time.
  • The navigation paths are embedded within the documents themselves, so a web archive has to be constructed in such a way that this mechanism still works for preserved pages.
  • Automated crawlers or spiders often take a few days to navigate large websites. This can lead to temporal incoherence, in that a page linked from the home page may be different from how it was when the home page itself was crawled.
  • Size. Size. Size. An organisation’s website can easily run into hundreds of thousands of pages, comprising billions of individual objects such as graphics etc.
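
The temporal incoherence point can be made concrete with a small sketch. It assumes we know when each page was captured and when the live page changed; the paths, timestamps and function names below are all hypothetical, invented purely for illustration. A link counts as incoherent if its target changed between the capture of the linking page and the capture of the target itself.

```python
from datetime import datetime

def changed_between(change_times, start, end):
    """True if any change falls between the two capture times (in either order)."""
    lo, hi = sorted((start, end))
    return any(lo < t <= hi for t in change_times)

def incoherent_links(captured_at, changed_at, links):
    """Return the (source, target) links whose target changed between the two captures."""
    result = []
    for source, target in links:
        changes = changed_at.get(target, [])
        if changed_between(changes, captured_at[source], captured_at[target]):
            result.append((source, target))
    return result

# Hypothetical crawl that took three days to work through a site.
captured_at = {
    "/": datetime(2007, 10, 1, 9, 0),
    "/about": datetime(2007, 10, 2, 8, 0),
    "/news": datetime(2007, 10, 3, 14, 0),
}
# Hypothetical record of when the live pages were edited.
changed_at = {
    "/about": [datetime(2007, 9, 1, 0, 0)],
    "/news": [datetime(2007, 10, 2, 12, 0)],
}
links = [("/", "/news"), ("/", "/about")]

suspect = incoherent_links(captured_at, changed_at, links)
```

Here the news page was edited after the home page was captured but before the crawler reached it, so the archived home page links to a version of the news page it never actually pointed at. In practice an archive rarely has a full change history, so this is a way of thinking about the problem rather than a working check.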


There are a handful of different strategies for archiving websites, of which a web-served archive is just one. The best example of a web-served archive is the Internet Archive.


The IA stores files of websites in WARC container files. A WARC file keeps a sequence of web pages and headers, the headers describing the content and length of each harvested page. A WARC also contains secondary content, such as assigned metadata and transformations of original files.

Each record has an offset which is stored in an index ordered by URI. This means that it should be possible to rapidly extract individual files based on their URI. The selected files then get sent to a web server which forwards them to the client.

Doing it this way allows the naming of individual web pages to be preserved. It also scales up pretty well (the IA has a colossal amount of information).
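
To see why an offset index makes retrieval fast, here is a minimal sketch of the idea. It is a toy container, not the real WARC format: the one-line header, the function names and the example URIs are all assumptions made for illustration. Records are written end to end, the index keeps (URI, offset) pairs sorted by URI, and a lookup is a binary search followed by a single seek.

```python
import bisect
import io

def build_container(pages):
    """Write records end to end and return (stream, index).

    Each record is a one-line header ("<uri> <length>") followed by the body.
    The index is a list of (uri, offset) pairs sorted by URI.
    """
    stream = io.BytesIO()
    index = []
    for uri, body in pages.items():
        offset = stream.tell()
        header = f"{uri} {len(body)}\n".encode("utf-8")
        stream.write(header + body)
        index.append((uri, offset))
    index.sort()
    return stream, index

def fetch(stream, index, uri):
    """Binary-search the index for the URI, then seek straight to its record."""
    pos = bisect.bisect_left(index, (uri,))
    if pos == len(index) or index[pos][0] != uri:
        return None  # URI was never harvested
    stream.seek(index[pos][1])
    header = stream.readline().decode("utf-8")
    _, length = header.rsplit(" ", 1)
    return stream.read(int(length))

# Hypothetical harvested pages.
pages = {
    "http://example.org/b": b"<html>page b</html>",
    "http://example.org/a": b"<html>page a</html>",
}
stream, index = build_container(pages)
page = fetch(stream, index, "http://example.org/a")
```

The point of the design is that nothing in the container has to be scanned: the cost of finding a page depends on the size of the index, not on the size of the container file.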


Review in JSA vol 28 no 2, October 2007, by Caroline Shenton.

Brown’s purpose is to provide a broad overview of the subject, aimed at policy makers and webmasters, although Shenton points out that this book would be useful to ICT professionals too. Brown avoids discussing the details of technical methodologies, in order to prevent his book from becoming quickly outdated. Brown covers aspects such as the models and processes for selection, the main methods of web archiving, QA and cataloguing issues, legal issues, and some speculations about the future. Shenton thinks the only real aspect which Brown has missed is the issue of cost. Overall she rates it highly, so I’ll need to add this one to my reading list.