Identification of PDF preservation risks with VeraPDF and JHOVE

25 May 2023
Photo of a red toy robot and a similar looking blue toy robot in a boxing ring. Both robots face each other in a threatening stance.
"Rock 'em Sock 'em Robots Game" by Lorie Shaull, used under CC BY-SA 4.0, via Wikimedia Commons.

The PDF format has a number of features that don’t sit well with the aims of long-term preservation and accessibility. These include encryption and password protection, external dependencies (e.g. fonts that are not embedded in a document), and reliance on external software. In this post I’ll review to what extent such features can be detected using VeraPDF and JHOVE. The post builds on earlier work I did on this subject between 2012 and 2017.


Extracting text from EPUB files in Python

09 March 2023
Street scene showing crowd gathered around an open carriage in which a dentist performs a tooth extraction on a patient. Next to the patient a man is banging on a large drum.
Clockwork picture of an itinerant dentist performing an extraction in French rural scene, wood frame, metal workings, first half 19th century. Science Museum, London. Attribution 4.0 International (CC BY 4.0) (cropped from original).

This blog post provides a brief introduction to extracting unformatted text from EPUB files. The occasion for this work was a request by my Digital Humanities colleagues who are involved in the SANE (Secure ANalysis Environment) project. The work on this project includes a use case that will use the SANE environment to analyse text from novels in EPUB format. My colleagues were looking for some advice on how to implement the text extraction component, preferably using a Python-based solution.

So, I started by making a shortlist of potentially suitable tools. For each tool, I wrote a minimal code snippet for processing one file. Based on this I then created some simple demo scripts that show how each tool is used within a processing workflow. Next, I applied these scripts to two data sets, and used the results to obtain a first impression of the performance of each of the tools.
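To give an idea of what a minimal snippet for processing one file might look like, here is a rough sketch that uses only the Python standard library. It is not one of the tools from the shortlist, and the names `extract_epub_text` and `_TextCollector` are made up for this illustration. It relies on the fact that an EPUB is a ZIP container with (X)HTML content documents, and it makes some simplifying assumptions: all content is UTF-8 encoded, and files are read in ZIP order rather than in the reading order defined by the package document.

```python
import zipfile
from html.parser import HTMLParser

# Tags whose text content should not end up in the output
SKIP_TAGS = {"script", "style"}
# Tags after which a separator is inserted, so adjacent blocks don't run together
BLOCK_TAGS = {"p", "div", "br", "li", "tr", "h1", "h2", "h3", "h4", "h5", "h6"}


class _TextCollector(HTMLParser):
    """Collects text nodes from an (X)HTML document (hypothetical helper)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1
        if tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


def extract_epub_text(epub):
    """Return unformatted text from an EPUB (path or file-like object).

    Assumptions: content documents are UTF-8 and have an .xhtml, .html
    or .htm extension; ZIP order stands in for the real reading order.
    """
    chunks = []
    with zipfile.ZipFile(epub) as archive:
        for name in archive.namelist():
            if not name.lower().endswith((".xhtml", ".html", ".htm")):
                continue
            collector = _TextCollector()
            collector.feed(archive.read(name).decode("utf-8", errors="replace"))
            collector.close()
            # Collapse all runs of whitespace into single spaces
            text = " ".join("".join(collector.parts).split())
            if text:
                chunks.append(text)
    return "\n".join(chunks)
```

Calling `extract_epub_text("novel.epub")` (with a file name of your own) returns one line of text per content document. A dedicated EPUB library will usually handle reading order, encodings and malformed markup more robustly than this sketch does.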


Moving my Internet domains

20 February 2023
Icon of a donkey that is pulling a cart that has the words bitsgalore.org written on it. In the background the sun is shining.
Donkey, cart and sun icons licensed from the Noun Project.

I recently moved the two Internet domains I own away from the UK-based domain registrar I’d been using since 2004 to an EU-based registrar. While the actual domain transfer was fairly simple, finding a registrar that suited my specific situation turned out to be more difficult than expected. Leaving my old registrar also resulted in a surprise. It’s unlikely that my situation is unique, so I thought it would be useful to share my experiences in this blog post, and point to some useful online resources that I found along the way. The move also allowed me to bring my domains up to date with modern (mostly security-related) internet standards. I’ll briefly address this in the final sections of this post. This includes some suggestions on how to make these optimizations work with GitHub Pages-hosted sites like this one.


Writing yet another workflow tool for imaging portable media

23 January 2023
Photo of a laptop running the Ipmlab software. In the foreground is a removable USB floppy drive with some 3.5 inch floppies lying on top of it. To the right of the laptop is a vintage floppy storage box that contains more floppies.

In 2017 I wrote a blog post on Iromlab (an acronym for “Image and Rip Optical Media Like A Boss”), a custom-built software tool that streamlines imaging and ripping of optical media using an Acronova Nimbie disc robot. The KB has been using Iromlab since 2019 as part of an ongoing effort to preserve the information contained in its vast collection of legacy optical media. This project is expected to reach its completion later this year, but as demonstrated by this earlier inventory, our deposit collection also contains various other types of legacy media that are under threat of becoming inaccessible. Out of these, 3.5 inch floppy disks are the most common data carriers (after optical media), so it made sense to focus on these as a next step.

Using the existing Iromlab-based workflow as a starting point, I created a preliminary workflow tool that can be used for imaging our 3.5” floppies (and various other types of portable media). In this post I’ll explain how this tool came about, and highlight some of the challenges I encountered during its development.


How to preserve your personal Twitter archive

20 November 2022
Painting of a whale, which is lifted above the beach by a swarm of red Twitter birds. In the foreground, an elderly couple watches the scene.
"Fail whale" by Kuni (ka-92) (license unknown), based on "Lifting a Dreamer" by Yiying Lu.

As a total collapse of Twitter is becoming more likely every day, many Twitter users have started to archive their personal data from the platform while it still exists. Twitter allows you to request and download your personal archive. Even though this works well, and the quality of the archive is surprisingly good, it does have some shortcomings. As a result, some information in the archive (e.g. on followed accounts and followers) will be lost once Twitter ceases to exist. Some other information (in particular full, unshortened URLs) is included in the archive, but it is not easily accessible from the main HTML interface. The good news is that some excellent tools exist to fix these shortcomings.

In this post I outline the workflow I used to preserve my own Twitter archive, and while doing so I also provide some background information on the shortcomings of the Twitter archive. Since some of these steps may, at first sight, be a little daunting for less tech-savvy readers, I’ve tried to provide step-by-step instructions where possible.


