There are a handful of different strategies for archiving websites, of which a web-served archive is just one. The best example of a web-served archive is the Internet Archive.

Tech

The IA stores harvested website files in WARC container files. A WARC file holds a sequence of web pages, each preceded by headers that describe the content and length of the harvested page. A WARC file can also contain secondary content, such as assigned metadata and transformations of original files.
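To make the "headers describing content and length" idea concrete, here is a minimal sketch of reading records back out of a container. It assumes a deliberately simplified layout (plain `Name: value` header lines, a blank line, then exactly `Content-Length` bytes of payload) rather than the real WARC format, and the `Target-URI` header name and `read_records` function are made up for illustration.

```python
import io

def read_records(stream):
    """Yield (headers, payload) pairs from a simplified record stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of container
        if not line.strip():
            continue  # skip blank separator lines between records
        headers = {}
        # Header lines run until the first blank line.
        while line.strip():
            name, _, value = line.decode("utf-8").partition(":")
            headers[name.strip()] = value.strip()
            line = stream.readline()
        # The length header tells us exactly how many payload bytes to read.
        length = int(headers["Content-Length"])
        payload = stream.read(length)
        yield headers, payload

# Two hypothetical records in one in-memory container.
sample = (
    b"Target-URI: http://example.org/\n"
    b"Content-Length: 5\n"
    b"\n"
    b"hello"
    b"\n\n"
    b"Target-URI: http://example.org/about\n"
    b"Content-Length: 3\n"
    b"\n"
    b"abc"
)
records = list(read_records(io.BytesIO(sample)))
```

Because each record announces its own length, the reader never has to guess where one page ends and the next begins.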

Each record has an offset, which is stored in an index ordered by URI. This means that it should be possible to find a file's offset quickly from its URI and read that file directly, without scanning the whole container. The selected files then get sent to a web server which forwards them to the client.
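The lookup described above can be sketched roughly as follows. This is only an illustration under assumptions of my own: the index is a sorted list of `(uri, offset, length)` tuples held in memory, the container is a single byte stream, and the names `lookup` and `fetch` are hypothetical. The IA's real index format is not described here.

```python
import bisect
import io

def lookup(index, uri):
    """Binary-search a URI-sorted index; return (offset, length) or None."""
    # (uri,) sorts just before any (uri, offset, length) entry for that URI.
    i = bisect.bisect_left(index, (uri,))
    if i < len(index) and index[i][0] == uri:
        return index[i][1], index[i][2]
    return None

def fetch(container, index, uri):
    """Seek straight to a record's offset and read only its bytes."""
    hit = lookup(index, uri)
    if hit is None:
        return None
    offset, length = hit
    container.seek(offset)
    return container.read(length)

# Build a tiny container and its index from two hypothetical pages.
pages = {
    "http://example.org/about": b"<html>about</html>",
    "http://example.org/": b"<html>home</html>",
}
container = io.BytesIO()
entries = []
for uri, body in pages.items():
    entries.append((uri, container.tell(), len(body)))
    container.write(body)
index = sorted(entries)  # ordered by URI, as in the description above

home = fetch(container, index, "http://example.org/")
missing = fetch(container, index, "http://example.org/nope")
```

The point of the sorted index is that finding a page costs a binary search plus one seek, however large the container grows.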

Doing it this way allows the naming of individual web pages to be preserved. It also scales up pretty well (the IA has a colossal amount of information).

Harvesting

The IA collects web pages, moving images, texts, audio and software. Its crawls are carried out by Alexa Internet, which provides snapshots after a delay of about six months. The IA’s approach is extensive archiving rather than intensive, i.e. it goes for breadth rather than depth. Lots of pages get left out. The IA’s FAQ says “we do not archive pages that require a password to access, pages tagged for ‘robot exclusion’ by their owners, pages that are only accessible when a person types into and sends a form, or pages on secure servers. If a site owner properly requests removal of a Web site we will exclude that site from the Wayback Machine.”
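As a rough illustration of what "robot exclusion" means in practice, the sketch below uses Python's standard `urllib.robotparser` to decide whether a crawler may fetch a URL. The robots.txt content, the `example-archiver` user agent and the `allowed` helper are all invented for the example; this is not a description of how Alexa's crawler actually makes the decision.

```python
from urllib.robotparser import RobotFileParser

def allowed(robots_txt, user_agent, url):
    """Return True if the given robots.txt text permits fetching url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

# A hypothetical robots.txt that excludes one directory for all crawlers.
robots = """User-agent: *
Disallow: /private/
"""

public_ok = allowed(robots, "example-archiver", "http://example.org/index.html")
private_ok = allowed(robots, "example-archiver", "http://example.org/private/diary.html")
```

A crawler that respects exclusion would simply skip any URL for which the check comes back false, which is one reason pages go missing from the archive.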

Much of the deep web lies uncaptured. The IA also has difficulties with some dynamic content, such as JavaScript.

Storage

The IA used to use DLT tape. Currently they use hundreds of x86 servers running Linux, but they are moving towards dedicated petabyte-scale storage (a petabyte is one million gigabytes). There is a cool photo of a PetaBox here.

So, does it work?

I used to have an old personal homepage, which I finally deleted in 2005 on grounds of immaturity. It occurred to me to check whether Alexa Internet’s robots had harvested it. And they had!


I had originally set up my homepage in the spring of 1994. I think the last time I updated the site was in September 1998, after which I lost interest in it. The earliest copy in the Internet Archive’s index dates from December 2002, six years after the IA had itself been set up. There are then about a dozen captures of the unchanging site until April 2005 after which it disappears.

The IA had not captured all of the images, just some of them, and it seemed to be pretty haphazard as to which ones survived and which ones didn’t. It was also immensely slow (I had a few Gateway Timeouts). But it coped OK with the frames which I had used back in the 1990s to organise the site.

Games

On a more frivolous level I was pleased to see that their moving image collections currently include a film of a Marathon Infinity speedrun on the Total Carnage setting.


There are also speedruns of the original PlayStation Tomb Raider game, still the most atmospheric game of all time (boy, that dates me), and of the original PS Metal Gear Solid, the game which for me most imaginatively broke down the fourth wall.

Will it last forever?

No one knows, of course. One huge problem is that site owners can ask for their own content to be removed. The Scientologists have also (apparently) had other people’s content removed from the IA on the grounds that it was defamatory about them. This highlights that there is much more to digital preservation than just file format obsolescence and media decay.
