In this post I explore some of the challenges I ran into while trying to estimate the quality level of JPEG images. By quality level I mean the percentage (1-100) that expresses the lossiness applied by the encoder at the last “save” operation. A value of 1 results in very aggressive compression with a lot of information loss (and thus very low quality), whereas at 100 almost no information loss occurs.
More specifically, I focus on problems with ImageMagick’s JPEG quality heuristic, which become particularly apparent when it is applied to low-quality images. I also propose a simple, tentative solution that makes some small changes to ImageMagick’s heuristic.
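To make the quality scale concrete, here is a minimal Python sketch (the file names are hypothetical) that re-saves an image at two quality levels with Pillow, and then asks ImageMagick for its quality estimate via the `%Q` format specifier. Keep in mind that quality scales are encoder-specific, so Pillow’s setting and ImageMagick’s estimate won’t always agree:

```python
import subprocess
from PIL import Image

# Hypothetical file names; Pillow's "quality" setting illustrates the
# 1-100 lossiness scale discussed above.
im = Image.open("example.tif").convert("RGB")  # JPEG stores no alpha channel
im.save("q10.jpg", "JPEG", quality=10)   # aggressive compression, strong artefacts
im.save("q95.jpg", "JPEG", quality=95)   # mild compression, much larger file

# ImageMagick's quality estimate for an existing JPEG: the %Q format
# specifier exposes the heuristic this post examines.
q = subprocess.run(["identify", "-format", "%Q", "q10.jpg"],
                   capture_output=True, text=True, check=True)
print(q.stdout)
```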
The KB has been using JP2 (JPEG 2000 Part 1) as the primary file format for its mass-digitisation activities for over 15 years now. Nevertheless, we still use uncompressed TIFF for a few collections. At the moment there’s an ongoing discussion about whether we should migrate those to JP2 as well at some point to save storage costs. Last week I ran a small test on a selection of TIFFs from those collections: I first converted them to JP2, and then verified that no information was lost during the conversion. This turned up some surprises, which were caused by the presence of thumbnail images in some of the source TIFFs. This post discusses the impact of having multiple images inside a TIFF on preservation workflows, and provides some suggestions on how to identify such files.
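As one possible way to spot such files, here is a minimal Python sketch using Pillow (the file name is hypothetical); tools like ExifTool or tiffinfo could be used along similar lines:

```python
from PIL import Image

# Minimal sketch (hypothetical file name): count the images (IFDs) inside
# a TIFF with Pillow. More than one image often signals an embedded
# thumbnail like the ones that caused the surprises described above.
with Image.open("example.tif") as im:
    n = getattr(im, "n_frames", 1)
    print(f"{n} image(s) in file")
    for i in range(n):
        im.seek(i)
        width, height = im.size
        print(f"  image {i}: {width} x {height} pixels")
```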
Last month I wrote this post, which addresses the use of JHOVE and VeraPDF for identifying preservation risks in PDF files. In the concluding section I suggested that VeraPDF’s parse status might be used as a rough “validity proxy” to identify malformed PDFs. But does VeraPDF’s parse status actually have any predictive value for rendering? And how does this compare to what JHOVE tells us? This post is a first attempt at answering these questions, using data from the Synthetic PDF Testset for File Format Validation by Lindlar, Tunnat and Wilson.
The PDF format has a number of features that don’t sit well with the aims of long-term preservation and accessibility. These include encryption and password protection, external dependencies (e.g. fonts that are not embedded in a document), and reliance on external software. In this post I’ll review to what extent such features can be detected using VeraPDF and JHOVE, building on earlier work I did on this subject between 2012 and 2017.
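As an aside, some of these features can also be flagged with a quick first-pass check outside VeraPDF and JHOVE. The sketch below (hypothetical file name) uses pypdf’s `is_encrypted` property to detect encryption/password protection; it is illustrative only, and no substitute for the analysis in this post:

```python
from pypdf import PdfReader

# Minimal sketch (hypothetical file name), not the VeraPDF/JHOVE analysis
# from this post: pypdf offers a quick first-pass check for one of the
# features listed above, encryption/password protection.
reader = PdfReader("example.pdf")
print("Encrypted:", reader.is_encrypted)
```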
This blog post provides a brief introduction to extracting unformatted text from EPUB files. The occasion for this work was a request from my Digital Humanities colleagues, who are involved in the SANE (Secure ANalysis Environment) project. The project includes a use case that uses the SANE environment to analyse text from novels in EPUB format, and my colleagues were looking for advice on how to implement the text extraction component, preferably as a Python-based solution.
So, I started by making a shortlist of potentially suitable tools. For each tool, I wrote a minimal code snippet for processing a single file, and based on these snippets I created some simple demo scripts that show how each tool fits into a processing workflow. I then applied these scripts to two data sets, and used the results to get a first impression of each tool’s performance.
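For a flavour of what such a snippet can look like, here is a minimal sketch of the naive approach (the file name is hypothetical): since an EPUB is just a ZIP container of (X)HTML content documents, we can walk the archive with `zipfile` and strip the markup with BeautifulSoup. Note that this ignores the OPF spine, so the reading order isn’t guaranteed, which is one reason dedicated tools exist:

```python
import zipfile
from bs4 import BeautifulSoup

# Minimal sketch (hypothetical file name): an EPUB is a ZIP container of
# (X)HTML content documents, so a naive extractor can walk the archive
# and strip the markup. Unlike dedicated tools, this ignores the OPF
# spine, so reading order is not guaranteed.
parts = []
with zipfile.ZipFile("example.epub") as book:
    for name in book.namelist():
        if name.endswith((".xhtml", ".html", ".htm")):
            soup = BeautifulSoup(book.read(name), "html.parser")
            parts.append(soup.get_text(separator="\n", strip=True))
print("\n\n".join(parts))
```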