Multi-image TIFFs, subfiles and image file directories

11 March 2024
Photograph that shows a hammer that is used to smash a screw into a piece of wood. On the left is a nail that is partially pushed into the same piece of wood, with an adjustable wrench immediately next to it.
"Confused, muddled, illogical". Used under Pixabay License.

The KB has been using JP2 (JPEG 2000 Part 1) as the primary file format for its mass-digitisation activities for over 15 years now. Nevertheless, we still use uncompressed TIFF for a few collections. At the moment there’s an ongoing discussion about whether we should migrate those to JP2 as well at some point to save storage costs. Last week I ran a small test on a selection of TIFFs from those collections. I first converted them to JP2, and then verified whether no information got lost during the conversion. This resulted in some unexpected surprises, which turned out to be caused by the presence of thumbnail images in some of the source TIFFs. This post discusses the impact of having multiple images indide a TIFF on preservation workflows, and also provides some suggestions on how to identify such files.


VeraPDF parse status as a proxy for PDF rendering: experiments with the Synthetic PDF Testset

29 June 2023
Vintage lithograph circus poster that shows a circus ring. In the front is a woman in a red dress, standing on horseback. Behind her there are more horses, with a variety of circus artists, including acrobats and jugglers, performing on horseback as well. In the background acrobats are walking on a tightrope.
"The Barnum & Bailey greatest show on earth". Used under CC BY-BY 2.0, via Boston Public Library.

Last month I wrote this post, which addresses the use of JHOVE and VeraPDF for identifying preservation risks in PDF files. In the concluding section I suggested that VeraPDF’s parse status might be used as a rough “validity proxy” to identify malformed PDFs. But does VeraPDF’s parse status actually have any predictive value for rendering? And how does this compare to what JHOVE tells us? This post is a first attempt at answering these questions, using data from the Synthetic PDF Testset for File Format Validation by Lindlar, Tunnat and Wilson.


Identification of PDF preservation risks with VeraPDF and JHOVE

25 May 2023
Photo of a red toy robot and a similar looking blue toy robot in a boxing ring. Both robots face each other in a threatening stance.
"Rock 'em Sock 'em Robots Game" by Lorie Shaull, used under CC BY-SA 4.0, via Wikimedia Commons.

The PDF format has a number of features that don’t sit well with the aims of long-term preservation and accessibility. This includes encryption and password protection, external dependencies (e.g. fonts that are not embedded in a document), and reliance on external software. In this post I’ll review to what extent such features can be detected using VeraPDF and JHOVE. It further builds on earlier work I did on this subject between 2012 and 2017.


Extracting text from EPUB files in Python

09 March 2023
Street scene showing crowd gathered around an open carriage in which a dentist performs a tooth extraction on a patient. Next to the patient a man is banging on a large drum.
Clockwork picture of an itinerant dentist performing an extraction in French rural scene, wood frame, metal workings, first half 19th century. Science Museum, London. Attribution 4.0 International (CC BY 4.0) (cropped from original).

This blog post provides a brief introduction to extracting unformatted text from EPUB files. The occasion for this work was a request by my Digital Humanities colleagues who are involved in the SANE (Secure ANalysis Environment) project. The work on this project includes a use case that will use the SANE environment to analyse text from novels in EPUB format. My colleagues were looking for some advice on how to implement the text extraction component, preferably using a Python-based solution.

So, I started by making a shortlist of potentially suitable tools. For each tool, I wrote a minimal code snippet for processing one file. Based on this I then created some simple demo scripts that show how each tool is used within a processing workflow. Next, I applied these scripts to two data sets, and used the results to obtain a first impression of the performance of each of the tools.


Moving my Internet domains

20 February 2023
Icon of a donkey that is pulling a cart that has the words bitsgalore.org written on it. In the background the sun is shining.
Donkey, cart and sun icons licensed from the Noun Project.

I recently moved the two Internet domains I own away from the UK-based domain registrar I’d been using since 2004 to a EU-based registrar. While the actual domain transfer was fairly simple, finding a registrar that suited my specific situation turned out more difficult than expected. Leaving my old registrar also resulted in a surprise. It’s unlikely that my situation is unique, so I thought it would be useful to share my experiences in this blog post, and point to some useful online resources that I found along the way. The move also allowed me to make my domains up to date with (mostly security-related) modern internet standards. I’ll briefly address this in the final sections of this post. This includes some suggestions on how to make these optimizations work with a GitHub Pages-hosted sites like this one.



Search

Tags

Archive

2024

March

2023

June

May

March

February

January

2022

November

June

April

March

2021

September

February

2020

September

June

April

March

February

2019

September

April

March

January

2018

July

April

2017

July

June

April

January

2016

December

April

March

2015

December

November

October

July

April

March

January

2014

December

November

October

September

August

January

2013

October

September

August

July

May

April

January

2012

December

September

August

July

June

April

January

2011

December

September

July

June

2010

December

Feeds

RSS

ATOM