The problems of domain crawls
- The legal approach is usually a risk-based one, ie. harvest the content and then take it down if the creator objects. If a creator brings a court case, the costs of this approach could be substantial.
- The approach has no basis in law. Copying any site without the content creator’s permission is an illegal act.
- There is no guarantee that a site discoverable today in the repository will still be there in the future, as the content creator may have requested its removal.
- There is no one-to-one engagement with the content creators; frequently there is no engagement at all.
- The crawl ignores lots of valuable content. Not everything relevant to UK web history actually has a “.uk” domain.
- Domain crawls are slow and miss much of the web’s fleeting, at-risk, or semantic content.
The problems of active selection
- The selective approach is usually a permissions-based one, ie. approach the content creator first and ask for permission to archive. But this demands engagement with the creator, which is time-consuming, and which in turn drives the policy to become even more selective than the repository may originally have envisaged. So the result is usually small-scale harvesting.
- Creators may not understand the purpose or urgency of archiving.
- Creators may say no, in which case the effort made to engage with them has been fruitless.
- Many sites are simply never selected, leaving large gaps in coverage.
- The repository may not have the resources to re-evaluate selection decisions. Therefore, once a site has been rejected, it may continue to be rejected, even though its content has changed.
- The repository needs to implement a policy on whether to continue archiving a site whose new content stops being useful. But this constant oversight of the harvesting schedule requires resources.