As I explained in the introduction of this earlier blog post, as part of our ongoing web archaeology project we are currently developing workflows for reading data from a variety of physical carrier formats. After the earlier work on data tapes and optical media, the next job was to image a small box of 3.5” floppy disks. Easy enough, I thought, and my first instinct was to fire up Guymager and be done with it. This turned out to be less straightforward than expected, which led to the development of yet another workflow tool: diskimgr. In the remainder of this post I will first show the issues I ran into with Guymager, and then demonstrate how these issues are remedied by diskimgr.
In 2015 I wrote a blog post on preserving optical media from the command-line. Among other things, it suggested a rudimentary workflow for imaging CD-ROMs and DVDs using the readom and ddrescue tools. Even though we now have a highly automated workflow in place for bulk processing optical media from our deposit collection, readom and ddrescue still prove to be useful for various special cases that don’t quite fit into this workflow. The materials that we are currently receiving as part of our web archaeology activities are a good example. These are typically small sets of recordable CD-ROMs that are often quite old, and such discs are highly likely to be in less than perfect condition. For these cases a highly automated, iromlab-like workflow is unnecessary, and to some degree even impractical. Nevertheless, it would be useful to have some degree of automation, especially for things like the addition and packaging of associated metadata. This prompted the development of the omimgr workflow tool. In the remainder of this blog post I will give an overview of omimgr.
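To illustrate the kind of rudimentary command-line workflow mentioned above, here is a hedged sketch of imaging a CD-ROM with readom, falling back to ddrescue for damaged discs. The device path (/dev/sr0), file names and retry counts are my own assumptions for illustration, not the exact commands from the 2015 post or from omimgr.

```shell
# First attempt: readom, which reads the disc sector by sector and
# retries on errors (here up to 4 times per sector)
readom retries=4 dev=/dev/sr0 f=disc.iso

# Fallback for discs in poor condition: ddrescue skips unreadable
# sectors on the first pass and revisits them on later passes.
# -b 2048 sets the CD-ROM sector size; the map file records which
# sectors were recovered, so interrupted runs can be resumed.
ddrescue -b 2048 -r 4 -v /dev/sr0 disc.iso disc.map

# Optional extra pass with direct disc access (-d), which bypasses
# the kernel cache and can recover a few more marginal sectors
ddrescue -b 2048 -r 4 -d -v /dev/sr0 disc.iso disc.map
```

Because ddrescue only rereads sectors the map file marks as bad, the two ddrescue invocations together cost little more than one, while giving damaged sectors a second chance.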
When the KB web archive was launched in 2007, many sites from the “early” Dutch web had already gone offline. As a result, the time period between (roughly) 1992 and 2000 is seriously under-represented in our web archive. To improve the coverage of web sites from this historically important era, we are now looking into Web Archaeology tools and methods. Over the last year our web archiving team has reached out to creators of “early” Dutch web sites that are no longer online. It’s not uncommon to find that these creators still have boxes of offline carriers with the original source data of those sites. Using these data, we would (in many cases) be able to reconstruct the sites, similarly to how we reconstructed the first Dutch web index last year. Once reconstructed, they could then be ingested into our web archive.
In a previous blog post I showed how we resurrected NL-menu, the first Dutch web index. It explains how we recovered the site’s data from an old CD-ROM, and how we subsequently created a local copy of the site by serving the CD-ROM’s contents on the Apache web server. This follow-up post covers the final step: crawling the resurrected site to a WARC file that can be ingested into our web archive.
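As a sketch of what that final crawling step can look like: wget has had native WARC output since version 1.14, so a locally served site can be mirrored straight into a WARC file. The hostname (nl-menu.local) and output prefix below are assumptions for illustration; the actual crawl described in the post may have used different tooling or options.

```shell
# Mirror the locally hosted, resurrected site into a WARC file.
# --mirror          : recursive crawl with timestamping
# --page-requisites : also fetch images, CSS and other embedded resources
# --warc-file       : write everything to nl-menu-00000.warc.gz
# --warc-cdx        : also write a CDX index of the WARC contents
wget --mirror \
     --page-requisites \
     --warc-file=nl-menu \
     --warc-cdx \
     -e robots=off \
     http://nl-menu.local/
```

Note that wget writes a conventional local mirror alongside the WARC; for ingest purposes only the .warc.gz (and the CDX index) are needed.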
NL-menu was the first Dutch web index. The site was originally founded by a consortium of SURFnet, Dutch universities and the KB. From the mid-nineties onwards it was maintained solely by the KB. NL-menu was discontinued in 2004, after which the site was taken offline. In 2006 the domain name was sold to a private company that used it for hosting a web index that was partially based on the original NL-menu site.