Restoring Liesbet's Virtual Home, a digital treasure from the early Dutch web

30 June 2020

Liesbet door — Original artwork copyright ©Liesbet Zikkenheimer.

In 2019, Dutch telecommunications company KPN announced its plans to phase out its subsidiary XS4ALL, which is one of the oldest internet service providers in the Netherlands. With this decision, thousands of homepages and personal web sites that are hosted under the XS4ALL domain are at risk of disappearing forever. The web archiving team of the National Library of the Netherlands (KB) has started an initiative to rescue a selection of these homepages, which includes some of the oldest born-digital publications of the Dutch web. This blog post describes an attempt to rescue and restore one of the oldest and most unique homepages from this collection: Liesbet’s Virtual Home (Liesbet’s Atelier), the personal web site of Dutch Internet pioneer Liesbet Zikkenheimer, which has a history that goes back to 1995. First I give some background information about XS4ALL, and the KB-led rescue initiative. Then I move on to the various (mostly technical) aspects of restoring Liesbet’s Virtual Home. Finally, I address the challenges of capturing the restored site to an ingest-ready WARC file.

ISO/IEC TS 22424 standard on EPUB3 preservation

30 April 2020

“The Scream”, undated drawing by Edvard Munch, Bergen Kunstmuseum, Public domain.

Earlier this week Library of Congress added a new entry on the standard “Digital publishing — EPUB3 preservation” (ISO/IEC TS 22424) to its excellent Digital Formats web site. This standard was developed by the ISO Technical Committee on Document description and processing languages, and was published in January this year (2020).

According to its authors, “the ISO/IEC TS 22424 series supports long-term preservation of EPUB publications via a dual strategy”. The standard is made up of 2 parts, which are sold as separate documents on the ISO website:

In this blog post I will take a closer look at both parts of the standard. What do they purport, what is their scope, and to what degree do they live up to their stated promises? Readers who are only interested in the most important findings may want to jump to the “Summary and discussion” section at the end of this post.

Does Microsoft OneDrive export large ZIP files that are corrupt?

11 March 2020

“Broken zip £12.50” by dichoecho, used under CC BY / Cropped from original.

We recently started using Microsoft OneDrive at work. The other day a colleague used OneDrive to share a folder with a large number of ISO images with me. Since I wanted to work with these files on my Linux machine at home, and no official OneDrive client for Linux exists a this point, I used OneDrive’s web client to download the contents of the folder. Doing so resulted in a 6 GB ZIP archive. When I tried to extract this ZIP file with my operating system’s (Linux Mint 19.3 MATE) archive manager, this resulted in an error dialog, saying that “An error occurred while loading the archive”:

The output from the underlying extraction tool (7-zip) reported a “Headers Error”, with an “Unconfirmed start of archive”. It also reported a warning that “There are data after the end of archive”. No actual data were extracted whatsoever. This all looked a bit worrying, so I decided to have a more in-depth look at this problem.

Offline digital data carriers in the KB deposit collection

20 February 2020

Following earlier work on the preservation of optical media and data tapes, I recently got a request to make an inventory of offline digital data carriers in the KB’s deposit collection. The goal was to obtain approximate figures on the various carrier types in the collection. This was partially prompted by a project on at-risk digital heritage on physical carriers by the Dutch Digital Heritage Network (NDE) that the KB is participating in. This blog post presents the results.

Web domain geolocation and spatial analysis with QGIS

11 February 2020

A few weeks ago one of my web archiving colleagues approached me with an interesting question. From a list of Dutch web domains, he wanted to identify the (Dutch) province in which each domain is hosted. He was particularly interested in domains hosted in the province of Friesland. After some experimentation I was able to answer this question using a two-step procedure:

Geo-locate the web domains using a custom Python script.
Combine the results of the geolocation exercise with openly available geographical data using QGIS, an open-source geographical information system (GIS).

Even though the outcome of the analysis is not particularly interesting, I imagine both the geolocation methodology and the GIS analysis steps might be useful to others. So, this blog post is primarily intended as a tutorial that gives a walkthrough of the steps I followed.

« Newer 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 Older »

Search

Tags

Archive

June

May

April

December

November

October

March

June

May

March

February

January

November

June

April

March

September

February

September

June

April

March

February

September

April

March

January

July

April

July

June

April

January

December

April

March

December

November

October

July

April

March

January

December

November

October

September

August

January

October

September

August

July

May

April

January

December

September

August

July

June

April

January

December

September

July

June

December

Issues

Social

Feeds