31 January 2014
One of my first blogs here covered an evaluation of a number of format identification tools. One of the more surprising results of that work was that out of the five tools that were tested, no fewer than four (FITS, DROID, Fido and JHOVE2) failed to even run when executed with their associated launcher script. In many cases the Windows launcher scripts (batch files) only worked when executed from the installation folder. Apart from making things unnecessarily difficult for the user, this also completely flies in the face of all existing conventions on command-line interface design. Around the time of this work (summer 2011) I had been in contact with the developers of all the evaluated tools, and until last week I thought those issues were a thing of the past. Well, was I wrong!
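To make the launcher problem concrete, here is a minimal sketch of the usual remedy: resolve the tool's files relative to the launcher script's own location instead of the current working directory. It is written in Python rather than as a batch file, and the function names and the `tool.jar` file name are hypothetical placeholders, not taken from any of the tools mentioned above.

```python
from pathlib import Path


def install_dir(script_path: str) -> Path:
    """Return the folder that contains the launcher script itself."""
    return Path(script_path).resolve().parent


def build_command(script_path: str, jar_name: str, args: list[str]) -> list[str]:
    """Build a java command whose jar path is anchored to the install folder.

    Because the jar path is absolute, the command works no matter which
    directory the user happens to be in when starting the launcher.
    """
    jar = install_dir(script_path) / jar_name
    return ["java", "-jar", str(jar), *args]


# Hypothetical install location and jar name, for illustration only.
print(build_command("/opt/sometool/launch.py", "tool.jar", ["--help"]))
```

In a real launcher the first argument would be the script's own path (`__file__` in Python), so that the tool can be started from any folder.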
- DROID
- Fido
- FITS
- JHOVE
- JHOVE2
- rant
27 January 2014
This blog follows up on three earlier posts about detecting preservation risks in PDF files. In part 1 I explored to what extent the Preflight component of the Apache PDFBox library can be used to detect specific preservation risks in PDF documents. This was followed up by some work during the SPRUCE Hackathon in Leeds, which is covered by this blog post by Peter Cliff. Then last summer I did a series of additional tests using files from the Adobe Acrobat Engineering website. The main outcome of this more recent work was that, although showing great promise, Preflight was struggling with many of the more complex PDFs. Fast-forward another six months and, thanks to the excellent response of the Preflight developers to our bug reports, the most serious of these problems are now largely solved. So, time to move on to the next step!
08 October 2013
My previous blog Assessing file format risks: searching for Bigfoot? resulted in some interesting feedback from a number of people. There was a particularly elaborate response from Ross Spencer, and I originally wanted to reply to that directly using the comment fields. However, my reply turned out to be a bit lengthier than I had intended, so I decided to turn it into a separate blog entry.
30 September 2013
Last week someone drew my attention to a recent iPres paper by Roman Graf and Sergiu Gordea titled “A Risk Analysis of File Formats for Preservation Planning”. The authors propose a methodology for assessing preservation risks for file formats using publicly available information sources. In short, their approach involves two stages:
- Collect and aggregate information on file formats from data sources such as PRONOM, Freebase and DBPedia
- Use this information to compute scores for a number of pre-defined risk factors (e.g. the number of software applications that support the format, the format’s complexity, its popularity, and so on). A weighted average of these individual scores then gives an overall risk score.
This has resulted in the “File Format Metadata Aggregator” (FFMA), which is an expert system aimed at establishing a “well structured knowledge base with defined rules and scored metrics that is intended to provide decision making support for preservation experts”.
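The weighted-average step in the second stage can be illustrated with a small sketch. This is not FFMA's actual implementation: the factor names, the 0 to 1 score scale and the weights below are all made up for illustration, and the paper may define them differently.

```python
def overall_risk(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Combine per-factor risk scores into one weighted average.

    Every factor in `scores` must have a matching entry in `weights`.
    """
    total_weight = sum(weights[name] for name in scores)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    weighted_sum = sum(scores[name] * weights[name] for name in scores)
    return weighted_sum / total_weight


# Hypothetical scores (0 = low risk, 1 = high risk) and weights.
scores = {"software_support": 0.2, "complexity": 0.8, "popularity": 0.4}
weights = {"software_support": 2.0, "complexity": 1.0, "popularity": 1.0}

print(round(overall_risk(scores, weights), 3))
```

Note how the outcome depends directly on the chosen weights: giving `complexity` a larger weight would pull the overall score up, even though none of the underlying scores have changed.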
19 August 2013
Like many other organisations that are using JPEG 2000, the KB produces two representations of most of its digitised content (newspapers, books, periodicals):
- a high-quality, losslessly compressed JP2 that is the archival master;
- a lesser-quality, lossily compressed JP2 that is used as an access image (this is used for e.g. our newspapers website).
The majority of our digitisation work is contracted out to external suppliers, and both master and access images are typically derived from a parent (TIFF) image, which is converted to JP2 using the settings for master and access images, respectively. This means that we’re not currently using the archival masters for producing derived images. However, there may be a need for this at some point in the future. For instance, we may need higher quality access images, or access images that give better performance in our access environment. Because of this, I was asked to take a further look into ways to derive access JP2s directly from our archival masters.
In this blog post I’ll be sharing some preliminary findings of this work, which may be of interest to other JPEG 2000 practitioners as well. All images and test results that I’ll be showing along the way are available from this Github repository, so you can have a go at these data yourself, if you’re so inclined.