VeraPDF parse status as a proxy for PDF rendering: experiments with the Synthetic PDF Testset

29 June 2023
Vintage lithograph circus poster that shows a circus ring. In the front is a woman in a red dress, standing on horseback. Behind her there are more horses, with a variety of circus artists, including acrobats and jugglers, performing on horseback as well. In the background acrobats are walking on a tightrope.
"The Barnum & Bailey greatest show on earth". Used under CC BY 2.0, via Boston Public Library.

Last month I wrote this post, which addresses the use of JHOVE and VeraPDF for identifying preservation risks in PDF files. In the concluding section I suggested that VeraPDF’s parse status might be used as a rough “validity proxy” to identify malformed PDFs. But does VeraPDF’s parse status actually have any predictive value for rendering? And how does this compare to what JHOVE tells us? This post is a first attempt at answering these questions, using data from the Synthetic PDF Testset for File Format Validation by Lindlar, Tunnat and Wilson.


Identification of PDF preservation risks with VeraPDF and JHOVE

25 May 2023
Photo of a red toy robot and a similar looking blue toy robot in a boxing ring. Both robots face each other in a threatening stance.
"Rock 'em Sock 'em Robots Game" by Lorie Shaull, used under CC BY-SA 4.0, via Wikimedia Commons.

The PDF format has a number of features that don’t sit well with the aims of long-term preservation and accessibility. These include encryption and password protection, external dependencies (e.g. fonts that are not embedded in a document), and reliance on external software. In this post I’ll review to what extent such features can be detected using VeraPDF and JHOVE. It builds on earlier work I did on this subject between 2012 and 2017.


Extracting text from EPUB files in Python

09 March 2023
Street scene showing crowd gathered around an open carriage in which a dentist performs a tooth extraction on a patient. Next to the patient a man is banging on a large drum.
Clockwork picture of an itinerant dentist performing an extraction in French rural scene, wood frame, metal workings, first half 19th century. Science Museum, London. Attribution 4.0 International (CC BY 4.0) (cropped from original).

This blog post provides a brief introduction to extracting unformatted text from EPUB files. The occasion for this work was a request by my Digital Humanities colleagues who are involved in the SANE (Secure ANalysis Environment) project. The work on this project includes a use case that will use the SANE environment to analyse text from novels in EPUB format. My colleagues were looking for some advice on how to implement the text extraction component, preferably using a Python-based solution.

So, I started by making a shortlist of potentially suitable tools. For each tool, I wrote a minimal code snippet for processing one file. Based on this I then created some simple demo scripts that show how each tool is used within a processing workflow. Next, I applied these scripts to two data sets, and used the results to obtain a first impression of the performance of each of the tools.


Moving my Internet domains

20 February 2023
Icon of a donkey that is pulling a cart that has the words bitsgalore.org written on it. In the background the sun is shining.
Donkey, cart and sun icons licensed from the Noun Project.

I recently moved the two Internet domains I own away from the UK-based domain registrar I’d been using since 2004 to an EU-based registrar. While the actual domain transfer was fairly simple, finding a registrar that suited my specific situation turned out to be more difficult than expected. Leaving my old registrar also resulted in a surprise. It’s unlikely that my situation is unique, so I thought it would be useful to share my experiences in this blog post, and point to some useful online resources that I found along the way. The move also allowed me to bring my domains up to date with modern (mostly security-related) internet standards. I’ll briefly address this in the final sections of this post. This includes some suggestions on how to make these optimizations work with a GitHub Pages-hosted site like this one.


Writing yet another workflow tool for imaging portable media

23 January 2023
Photo of a laptop running the Ipmlab software. In the foreground is a removable USB floppy drive with some 3.5 inch floppies lying on top of it. To the right of the laptop is a vintage floppy storage box that contains more floppies.

In 2017 I wrote a blog post on Iromlab (an acronym for “Image and Rip Optical Media Like A Boss”), a custom-built software tool that streamlines imaging and ripping of optical media using an Acronova Nimbie disc robot. The KB has been using Iromlab since 2019 as part of an ongoing effort to preserve the information contained in its vast collection of legacy optical media. This project is expected to reach its completion later this year, but as demonstrated by this earlier inventory, our deposit collection also contains various other types of legacy media that are under threat of becoming inaccessible. Out of these, 3.5 inch floppy disks are the most common data carriers (after optical media), so it made sense to focus on these as a next step.

Using the existing Iromlab-based workflow as a starting point, I created a preliminary workflow tool that can be used for imaging our 3.5” floppies (and various other types of portable media). In this post I’ll explain how this tool came about, and highlight some of the challenges I encountered during its development.


