The problems of domain crawls

  • The legal approach is usually a risk-based one, i.e. harvest the content and then take it down if the creator objects. If the creator brings a court case, this approach could prove very expensive.
  • The approach has no basis in law. Copying any site without the content creator’s permission is an illegal act.
  • There is no guarantee that a site discoverable today in the repository will still be there in the future, as the content creator may have requested its removal.
  • There is no one-to-one engagement with the content creators, and frequently no engagement at all.
  • The crawl ignores lots of valuable content. Not everything relevant to UK web history actually has a “.uk” domain.
  • Domain crawls are slow, and miss much of the web’s fleeting, at-risk, or semantic content.

The problems of active selection

  • The selective approach is usually a permissions-based one, i.e. approach the content creator first and ask for permission to archive. But this demands engagement with the creator, which is time-consuming and which in turn drives the policy to become even more selective than the repository originally envisaged. The result is usually small-scale harvesting.
  • Creators may not understand the purpose or urgency of archiving.
  • Creators may say no, in which case the effort made to engage with them has been fruitless.
  • Many sites are never selected, and so are never archived at all.
  • The repository may not have the resources to re-evaluate selection decisions. Therefore, once a site has been rejected, it may continue to be rejected, even though its content has changed.
  • The repository needs to implement a policy on whether to continue archiving a site whose accruing content stops being useful. But this constant oversight of the harvesting schedule requires resources.