Generating lossy access JP2s from lossless preservation masters
At the KB we’ve been using JP2 (JPEG 2000 Part 1) as our primary image format for digitised newspapers, books and periodicals since 2007. The digitisation work is contracted out to external vendors, who supply the digitised pages as losslessly compressed preservation masters, as well as lossily compressed access images that are used within the Delpher platform.
Right now the KB is in the process of migrating its digital collections to a new preservation system. This prompted the question of whether it would be feasible to generate access JP2s from the preservation masters in-house at some point in the future, using software that runs inside the preservation system[1]. As a first step towards answering that question, I created some simple proof of concept workflows, using three different JPEG 2000 codecs. I then tested these workflows with preservation master images from our collection. The main objective of this work was to find a workflow that meets our current digitisation requirements and is also sufficiently performant.
Master and access requirements
The following table lists the requirements of our preservation master and access JP2s:
Parameter | Value (master) | Value (access) |
---|---|---|
File format | JP2 (JPEG 2000 Part 1) | JP2 (JPEG 2000 Part 1) |
Compression type | Reversible 5-3 wavelet filter | Irreversible 9-7 wavelet filter |
Colour transform | Yes (only for colour images) | Yes (only for colour images) |
Number of decomposition levels | 5 | 5 |
Progression order | RPCL | RPCL |
Tile size | 1024 x 1024 | 1024 x 1024 |
Code block size | 64 x 64 (2^6 x 2^6) | 64 x 64 (2^6 x 2^6) |
Precinct size | 256 x 256 (2^8) for 2 highest resolution levels; 128 x 128 (2^7) for remaining resolution levels | 256 x 256 (2^8) for 2 highest resolution levels; 128 x 128 (2^7) for remaining resolution levels |
Number of quality layers | 11 | 8 |
Target compression ratio layers | 2560:1 [1] ; 1280:1 [2] ; 640:1 [3] ; 320:1 [4] ; 160:1 [5] ; 80:1 [6] ; 40:1 [7] ; 20:1 [8] ; 10:1 [9] ; 5:1 [10] ; - [11] | 2560:1 [1] ; 1280:1 [2] ; 640:1 [3] ; 320:1 [4] ; 160:1 [5] ; 80:1 [6] ; 40:1 [7] ; 20:1 [8] |
Error resilience | Start-of-packet headers; end-of-packet headers; segmentation symbols | Start-of-packet headers; end-of-packet headers; segmentation symbols |
Sampling rate | Stored in “Capture Resolution” fields | Stored in “Capture Resolution” fields |
Capture metadata | Embedded as XMP metadata in XML box | Embedded as XMP metadata in XML box |
As the table shows, most parameters are identical in both cases, except for the following (an illustrative encoder call is sketched after this list):
- The preservation masters are compressed losslessly, whereas for the access images irreversible (lossy) compression is used, using a fixed compression ratio of 20:1.
- The preservation masters contain 11 quality layers, whereas the access images only contain 8 quality layers.
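To give a concrete impression, the access profile above might translate into a Kakadu encoding call along the lines of the sketch below. This is illustrative rather than the exact invocation from our scripts: the file names are made up, and the -rate values assume 24-bit colour input, where the 20:1 top layer corresponds to 24/20 = 1.2 bits per pixel.

```bash
# Illustrative kdu_compress call for the KB access profile (not the exact
# invocation from the test scripts). Rates are bits per pixel for 24-bit
# colour input, highest quality layer (20:1 = 1.2 bpp) listed first.
kdu_compress -i page-001.tif -o page-001.jp2 \
  Creversible=no Clevels=5 Corder=RPCL \
  Stiles="{1024,1024}" Cblk="{64,64}" \
  Cprecincts="{256,256},{256,256},{128,128}" \
  Clayers=8 \
  -rate 1.2,0.6,0.3,0.15,0.075,0.0375,0.01875,0.009375 \
  Cuse_sop=yes Cuse_eph=yes Cmodes=SEGMARK
```

For the master profile, Creversible=yes would be used instead, with 11 layers and "-" as the first -rate value (i.e. no rate limit on the top, lossless layer).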
Deriving the access image
In order to derive an access image from a preservation master, two approaches are possible:
- Create a “subset” from the preservation master that only contains the lower 8 quality layers (i.e. discarding the highest 3 layers).
- Do a full decode of the preservation master (e.g. to TIFF), and then re-compress the result to lossy JP2.
The main advantage of the first (“subset”) approach is its computational efficiency: it only involves some simple reformatting of data from the source image’s codestream, without any need to decode or compress the image data. I explored this approach in a 2013 blog post, and at the time I was able to make it work with Kakadu’s “transcode” tool and Aware’s JPEG 2000 SDK[2]. However, its success depends largely on the correct implementation of the quality layers in the preservation masters. For example, if the 8th quality layer in a preservation master was accidentally compressed at some other compression ratio than the expected 20:1 value, the resulting access JP2 could be (much) smaller or larger than expected. Complicating things further, even though jpylyzer will tell you both the overall compression ratio of a JP2 and the number of quality layers, it does not provide any information about the compression ratios of the individual layers[3].
Because of this, I only explored the full decode + re-compress approach here. Although computationally less efficient than the “subset” approach, it has the advantage that the result is independent of the implementation of the quality layers in the preservation masters.
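Conceptually, this approach boils down to just two codec calls per image. Using Grok’s command-line tools (which inherit OpenJPEG’s interface), a single image would be processed along these lines; the file names are illustrative and most encoding options are omitted here (the full parameter sets are in the test scripts linked below):

```bash
# Conceptual decode + re-compress cycle (most encoding options omitted)
grk_decompress -i master.jp2 -o intermediate.tif    # full decode to TIFF
grk_compress -i intermediate.tif -o access.jp2 \
    -I -r 2560,1280,640,320,160,80,40,20            # irreversible, 8 quality layers
rm intermediate.tif
```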
Test environment
For my tests I used an ordinary desktop PC with an Intel i5-6500 processor (4 cores, 3.20 GHz) and 12 GB RAM. The operating system was Linux Mint 20.1 Ulyssa (which is based on Ubuntu Focal Fossa 20.04).
Codecs
I initially planned to create a small proof of concept workflow based on Kakadu, as I already had some old test scripts for compressing TIFF images to JP2s that follow the KB’s master and access requirements. Then my colleague Sam Alloing suggested having a look at the Grok codec. Although I had been aware of Grok for some time, I had never got around to taking it for a spin, mainly because I haven’t been working much on anything related to JPEG 2000 for the past few years. Since Grok is a fork of OpenJPEG, which the KB already uses to decode JP2 images on the Delpher platform, it made sense to include OpenJPEG as well. So, in the end I used:
- OpenJPEG 2.4.2
- Grok 9.7.3
- Kakadu 7.9[4]
I compiled both Grok and OpenJPEG from source. For Kakadu I used the pre-compiled demonstration binaries.
Test procedure
For each of the three codecs, I created a simple Bash script that takes an input and an output directory as its arguments. For each JP2 image in the input directory, the script goes through the following steps (a simplified sketch follows the list):
- Decode (uncompress) the JP2 to uncompressed TIFF.
- Compress the TIFF to lossy JP2, using (to the maximum extent possible) the KB’s access JP2 requirements.
- Delete the TIFF file.
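As an indication of what these steps look like, here is a simplified sketch of the OpenJPEG variant. The real scripts differ in details (error handling, metadata extraction, threading options), and the directory handling here is illustrative.

```bash
#!/bin/bash
# Simplified sketch of the per-image loop (OpenJPEG variant)
inDir="$1"
outDir="$2"

for jp2 in "$inDir"/*.jp2; do
    name=$(basename "$jp2" .jp2)
    tif="$outDir/$name.tif"
    # 1. Decode the preservation master to uncompressed TIFF
    opj_decompress -i "$jp2" -o "$tif"
    # 2. Re-compress to lossy JP2 following the KB access profile:
    #    irreversible 9-7 filter (-I), 5 decomposition levels (-n 6 resolutions),
    #    RPCL progression, 1024x1024 tiles, 64x64 code blocks, precinct sizes,
    #    8 quality layers (-r, ratios in decreasing order), SOP/EPH markers,
    #    segmentation symbols (-M 32)
    opj_compress -i "$tif" -o "$outDir/$name.jp2" \
        -I -n 6 -p RPCL -t 1024,1024 -b 64,64 \
        -c "[256,256],[256,256],[128,128]" \
        -r 2560,1280,640,320,160,80,40,20 \
        -SOP -EPH -M 32
    # 3. Delete the intermediate TIFF
    rm "$tif"
done
```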
Once all input JP2s have been processed, the script runs the jprofile tool on the output directory. Jprofile (which uses jpylyzer under the hood) uses Schematron rules to verify to what extent the generated JP2s conform to the KB access requirements.
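Since jprofile builds on jpylyzer, individual files can also be spot-checked by hand. For instance, something like the command below prints the number of quality layers and the wavelet transformation for one output image; the file name is made up, and the element names in the grep pattern follow jpylyzer’s output format.

```bash
# Print the quality layers and wavelet filter reported by jpylyzer
# (jpylyzer writes its report as XML to stdout; file name is illustrative)
jpylyzer access-1-grok/page-001.jp2 | grep -oE "<(layers|transformation)>[^<]*"
```

For a conformant access JP2 this should report 8 layers and the 9-7 irreversible transformation.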
The test scripts (which also contain the encoding parameter values for each codec) can be found in the Git repository listed under Further resources below.
Performance
I ran each of the scripts on a directory with 26 preservation master JP2s (144 MB) from the KB’s collection of digitised books. Before running any of the scripts, I used the following command to empty my machine’s cache memory:
sudo sysctl vm.drop_caches=3
I then used the operating system’s built-in “time” tool to measure the processing time needed by each of the scripts:
(time ~/kb/jp2totiff/mastertoaccess-grok.sh ./master-1 ./access-1-grok) 2> time-grok.txt
The main metrics provided by this command are:
- “real” - the actual wall-clock time that passed between starting the script and its termination.
- “user” - the total CPU time, summed over all processor cores. When “user” exceeds “real”, the workload was spread across multiple cores.
The table below shows the performance statistics for the three scripts:
Codec | time (real) | time (user) |
---|---|---|
OpenJPEG | 0m50.715s | 1m20.497s |
Grok | 0m22.143s | 1m1.308s |
Kakadu | 0m25.507s | 0m48.990s |
It’s worth noting that each of these figures encompasses a full decode-encode cycle, with some additional overhead added by jprofile, and the system commands that remove the temporary TIFF files. I was surprised to see that at 22 seconds, the Grok-based script was even (marginally) faster than the Kakadu-based one, which clocks in at 26 seconds[5]. The script that uses OpenJPEG is considerably slower at 51 seconds.
Conformance to KB access requirements
The next table summarises the jprofile analysis, by listing the deviations from the KB access requirements for each codec:
Codec | Deviations from KB access requirements |
---|---|
OpenJPEG | XML box missing, resolution box missing, ICC profile missing |
Grok | XML box missing |
Kakadu | - |
The OpenJPEG JP2s fall short on three aspects: an XML box with XMP metadata, a resolution box, and an ICC profile are all missing. This is not surprising, as OpenJPEG simply doesn’t support these features at this stage. In the Grok JP2s, only the expected XML box is missing. This is because Grok wraps XMP metadata in a so-called “UUID box” instead. This behaviour is consistent with the ISO/IEC base media file format, and is supported by e.g. ExifTool and jpylyzer. Only the Kakadu JP2s are 100% compliant with the requirements. However, since the exact location of the XMP metadata doesn’t really matter for access, both the Kakadu and the Grok JP2s would be satisfactory for our purposes.
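As an aside, the practical difference is small: regardless of whether the XMP packet lives in an XML box or a UUID box, a tool like ExifTool can extract it with a call along these lines (file names illustrative):

```bash
# Extract the embedded XMP packet from an access JP2 as a standalone file
exiftool -xmp -b access.jp2 > access.xmp
```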
Conclusions
Although based on only a small sample dataset, this proof of concept demonstrates that both Grok and Kakadu would be suitable for generating lossy access JP2s from our preservation masters. The performance of both codecs turned out to be comparable for the test data used. This means that with Grok we now have an open-source codec that is both sufficiently feature-rich and performant to be a viable alternative to commercial codecs like Kakadu. One potential hurdle for some users might be Grok’s build process, which can be slightly involved because it requires very recent versions of CMake and gcc. However, using Grok’s documentation and these useful additional instructions by Harvard’s Bill Comstock I found the process easier than expected in the end. I’ve documented the full build and installation process that worked for me here.
Acknowledgements
Thanks are due to Grok developer Aaron Boxer for fixing two small issues I ran into while running my Grok tests, and to Sam Alloing for suggesting that I look into Grok.
Revision history
- 5 July 2022 - re-ran performance test with the added `-threads` option for OpenJPEG, as suggested by Aaron Boxer in the comments.
Further resources
- Git repository with test scripts
- My Grok build and installation instructions
- Bill Comstock, “Installing OpenJPEG (and Grok) on Windows 10, Linux, and MacOS”
- Optimising archival JP2s for the derivation of access copies
- Jprofile - Automated JP2 profiling for digitisation batches
Notes

1. To be completely clear, at this stage this work is just an exploration of something we might do at some time in the future (or possibly not at all); there are no plans to actually implement this yet.
2. As of 2022, Aware appears to have switched its focus to the development of biometric software, and its website no longer mentions the JPEG 2000 SDK.
3. Adding this functionality to jpylyzer would require much more in-depth parsing of the codestream data than is currently the case.
4. Note that this is a pretty old version.
5. These figures are not 100% comparable, because the Kakadu-based script includes an additional processing step to extract embedded metadata from the source file using ExifTool (Grok does this automatically at the codec level).