<?xml version="1.0" encoding="UTF-8" ?>
<rss version="2.0">
<channel>
        <title>bitsgalore.org</title>
        <description>bitsgalore.org - Johan van der Knijff</description>
        <link>https://bitsgalore.org</link>
        <lastBuildDate>2026-02-17T15:29:06+01:00</lastBuildDate>
        <pubDate>2026-02-17T15:29:06+01:00</pubDate>
        <ttl>1800</ttl>


        <item>
                <title>Emulating Inmagic DB/TextWorks databases with QEMU</title>
                <description>&lt;figure class=&quot;image&quot;&gt;
  &lt;img src=&quot;/images/2026/02/y2k-furnishings.jpg&quot; alt=&quot;Photograph of store named &apos;Y2K Furnishings&apos;. The storefront window shows signs that read &apos;SALE&apos;, &apos;GOING OUT OF BUSINESS&apos; and &apos;FOR LEASE&apos;.&quot; /&gt;
  &lt;figcaption&gt;&lt;a href=&quot;https://www.flickr.com/photos/walkingsf/3670113255&quot; title=&quot;1339 Mission, 2000&quot;&gt;1339 Mission, 2000&lt;/a&gt; by &lt;a href=&quot;https://www.flickr.com/photos/walkingsf/&quot;&gt;Erica Fischer&lt;/a&gt;, &lt;a href=&quot;https://creativecommons.org/licenses/by/2.0/deed.en&quot; rel=&quot;license noopener noreferrer&quot;&gt;CC BY 2.0&lt;/a&gt;.&lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;A while ago, colleagues from our imaging studio approached me about an old &lt;a href=&quot;https://lucidea.com/inmagic-dbtextworks/&quot;&gt;Inmagic DB/TextWorks&lt;/a&gt; database. The database was originally created around the year 2000, and contained information about photographed items from our collection. From an old machine of a retired former colleague, they had recovered both the database, and installer files of two versions of the DB/TextWorks software. However, they were unable to either read the database with modern software, or to get the original software running on modern hardware.&lt;/p&gt;

&lt;p&gt;The obvious solution would be to run the software in an emulated environment, and use the emulation to export the database to a more readable format. Oracle VirtualBox used to be my go-to software for such emulation jobs. As I’m trying to move away from (US-based) big tech software, this time I decided to give &lt;a href=&quot;https://www.qemu.org/&quot;&gt;QEMU&lt;/a&gt; a try instead. Although I’ve used QEMU before, it’s been quite a while, and I noticed that a lot of online QEMU resources are quite outdated, which can make things confusing if you don’t know where to look.&lt;/p&gt;

&lt;p&gt;The goal of this post is to document the main steps that make up a “migration through emulation” use case like this:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;The installation of QEMU.&lt;/li&gt;
  &lt;li&gt;The creation of a virtual machine (VM) in QEMU, and the installation of an operating system (in this case Windows XP).&lt;/li&gt;
  &lt;li&gt;The use of a “virtual thumb drive” to exchange data between the VM and the host machine.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Even though this specifically covers the Inmagic DB/TextWorks case, most of this will be applicable to other “migration through emulation” use cases as well.&lt;/p&gt;

&lt;p&gt;In the final section I also give an overview of the general file structure of a DB/TextWorks 7.0 database, and I provide two openly-licensed example databases.&lt;/p&gt;

&lt;!-- more --&gt;

&lt;h2 id=&quot;software&quot;&gt;Software&lt;/h2&gt;

&lt;p&gt;The data recovered by my colleagues included two versions of the DB/TextWorks software: version 3.0 from 1998, and version 7.0.1 from 2004. Since the time stamps of the database files indicated they were created in 2000, I went for the 7.0.1 version. Its documentation lists the following requirements:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Windows 2000 + Service Pack 3 or Windows XP.&lt;/li&gt;
  &lt;li&gt;Internet Explorer 5.01 or later, which is needed “for certain features to work properly”.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I happened to have an ISO image of an old installation CD of Windows XP (Home Edition) lying around, which made XP the obvious emulation target. This also comes with a compatible Internet Explorer version preinstalled.&lt;/p&gt;

&lt;h2 id=&quot;host-system&quot;&gt;Host system&lt;/h2&gt;

&lt;p&gt;As a host system I used a desktop PC running Linux Mint 22.3 (MATE 64-bit).&lt;/p&gt;

&lt;h2 id=&quot;install-qemu&quot;&gt;Install QEMU&lt;/h2&gt;

&lt;p&gt;Installation of QEMU is as simple as&lt;sup id=&quot;fnref:2&quot;&gt;&lt;a href=&quot;#fn:2&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;1&lt;/a&gt;&lt;/sup&gt;:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;sudo apt install qemu-system
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h2 id=&quot;create-virtual-machine&quot;&gt;Create virtual machine&lt;/h2&gt;

&lt;p&gt;In an empty directory, create a sufficiently large (here: 10 GB) image file&lt;sup id=&quot;fnref:3&quot;&gt;&lt;a href=&quot;#fn:3&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;2&lt;/a&gt;&lt;/sup&gt; (here called “winxp.img”):&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;qemu-img create -f qcow2 winxp.img 10G
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
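
&lt;p&gt;To check the result, the image can be inspected with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;qemu-img info&lt;/code&gt;, which reports both the virtual size of the disk and the space the file currently occupies on the host. A minimal check, assuming the image file name used above:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;qemu-img info winxp.img
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The reported virtual size should correspond to the 10G that was requested, whereas the disk size of a freshly created image is only a small fraction of that.&lt;/p&gt;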

&lt;p&gt;Then copy the ISO image of the Windows XP installation CD (“winxp.iso”) to the same directory. Now start a virtual machine (VM) using:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;qemu-system-x86_64 -m 1024 -hda winxp.img -cdrom winxp.iso -boot d
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This boots up a VM that runs the Windows XP setup program&lt;sup id=&quot;fnref:5&quot;&gt;&lt;a href=&quot;#fn:5&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;3&lt;/a&gt;&lt;/sup&gt;. Once setup has finished, test the installation by shutting down Windows, and then starting the VM without the installation CD:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;qemu-system-x86_64 -m 1024 -hda winxp.img
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;If all goes well, this should boot the VM running Windows XP.&lt;/p&gt;

&lt;h2 id=&quot;data-exchange-between-vm-and-host-machine&quot;&gt;Data exchange between VM and host machine&lt;/h2&gt;

&lt;p&gt;In its current state, the VM has no way to exchange data with the host machine. For our use case, we need this to:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Make the Inmagic DB/TextWorks installation files and the database available to the VM.&lt;/li&gt;
  &lt;li&gt;Make the exported database from the VM available to the host machine.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;There are a couple of ways to achieve this. In this case I opted for a “virtual thumb drive” approach, which works similarly to how you would use a physical thumb drive to exchange data between two machines.&lt;/p&gt;

&lt;h2 id=&quot;create-virtual-thumb-drive&quot;&gt;Create virtual thumb drive&lt;/h2&gt;

&lt;p&gt;Start by creating a 200 MB disk image (“tw.img”):&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;dd bs=512 count=390625 if=/dev/zero of=tw.img
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
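
&lt;p&gt;The size of the image follows from the block size multiplied by the block count. As a quick sanity check (this only uses shell arithmetic, nothing is written to disk):&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;echo $((512 * 390625))
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This prints 200000000, which is the image size in bytes (200 MB in decimal units).&lt;/p&gt;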

&lt;p&gt;Then format this disk image as a FAT32 filesystem&lt;sup id=&quot;fnref:4&quot;&gt;&lt;a href=&quot;#fn:4&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;4&lt;/a&gt;&lt;/sup&gt;:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;mformat -F -i tw.img
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h2 id=&quot;copy-data-from-host-to-virtual-thumb-drive&quot;&gt;Copy data from host to virtual thumb drive&lt;/h2&gt;

&lt;p&gt;I used the &lt;a href=&quot;https://linux.die.net/man/1/mcopy&quot;&gt;mcopy&lt;/a&gt; tool to copy the required data from my host machine to the virtual thumb drive. In my case the data are in a directory tree under &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;/home/johan/kb/beeldstudio-textworks/Dbtmsg/&lt;/code&gt;. The following command recursively copies everything in that directory to the virtual thumb drive:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;mcopy -i tw.img -s /home/johan/kb/beeldstudio-textworks/Dbtmsg/ ::
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;You can verify the result by mounting the image file on your host machine. In Linux Mint, right-click the file in the file manager, and then select “Open with Disk Image Mounter”.&lt;/p&gt;
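
&lt;p&gt;If you prefer the command line, mtools can also list the contents of the image directly, without mounting it. A minimal sketch, assuming the same mtools package that provides mcopy (the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;-/&lt;/code&gt; option makes the listing recursive):&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;mdir -i tw.img -/ ::
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;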

&lt;h2 id=&quot;start-vm-with-virtual-thumb-drive&quot;&gt;Start VM with virtual thumb drive&lt;/h2&gt;

&lt;p&gt;The following command starts up the Windows XP VM, and mounts the virtual thumb drive as a removable USB device:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;qemu-system-x86_64 -m 1024 -hda winxp.img \
-drive if=none,id=usbstick,format=raw,file=tw.img \
-usb \
-device usb-ehci,id=ehci \
-device usb-storage,bus=ehci.0,drive=usbstick
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;In my case the virtual thumb drive is available as the “E:” drive in Windows XP. Note that it can take some time before Windows detects the device, so be patient here.&lt;/p&gt;

&lt;h2 id=&quot;inmagic-dbtextworks-installation&quot;&gt;Inmagic DB/TextWorks installation&lt;/h2&gt;

&lt;p&gt;To install DB/TextWorks, open Windows Explorer, navigate to the folder with the installer files on the virtual thumb drive (in my case the “E:” drive on the WinXP VM), and run setup.exe.&lt;/p&gt;

&lt;h2 id=&quot;open-a-database&quot;&gt;Open a database&lt;/h2&gt;

&lt;p&gt;A DB/TextWorks database is made up of various separate files that share a common base name. The primary entry point is a file with a “.TBA” file extension. To open a database, select “Open” from the “File” menu, and navigate to the folder on the virtual thumb drive that contains the database:&lt;/p&gt;

&lt;figure class=&quot;image&quot;&gt;
  &lt;img src=&quot;/images/2026/02/textworks-open.png&quot; /&gt;
&lt;/figure&gt;

&lt;p&gt;After loading, DB/TextWorks shows a window with all fields in the database:&lt;/p&gt;

&lt;figure class=&quot;image&quot;&gt;
  &lt;img src=&quot;/images/2026/02/twDB.png&quot; /&gt;
&lt;/figure&gt;

&lt;p&gt;It’s a good idea to make a screenshot of this, as the CSV exports don’t contain the full field names!&lt;/p&gt;

&lt;h2 id=&quot;export-database-to-comma-delimited-text&quot;&gt;Export database to comma-delimited text&lt;/h2&gt;

&lt;p&gt;From the “File” menu, select “Export …”, navigate to a location on your virtual thumb drive where you want to store your export, and enter an output file name. In the Export Options dialog that appears, select “Delimited ASCII Format” as the output format&lt;sup id=&quot;fnref:1&quot;&gt;&lt;a href=&quot;#fn:1&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;5&lt;/a&gt;&lt;/sup&gt;, and make sure to check the “Store Field Names in First Row” checkbox.&lt;/p&gt;

&lt;figure class=&quot;image&quot;&gt;
  &lt;img src=&quot;/images/2026/02/twExportOptions.png&quot; /&gt;
&lt;/figure&gt;

&lt;p&gt;Click “OK” to start the export. When done, close DB/TextWorks, and shut down the virtual machine.&lt;/p&gt;

&lt;h2 id=&quot;access-the-exported-database-from-the-host-machine&quot;&gt;Access the exported database from the host machine&lt;/h2&gt;

&lt;p&gt;To access the exported database from the host machine, simply mount the virtual thumb drive image file. In Linux Mint, you can do this by right-clicking the file in the file manager, and then selecting “Open with Disk Image Mounter”. Now you can access the file system, and copy the exported file to the host machine’s file system.&lt;/p&gt;
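
&lt;p&gt;As an alternative to mounting, mcopy also works in the opposite direction. A minimal sketch, assuming the export was saved in the root of the virtual thumb drive under the (hypothetical) name “export.csv”; this copies the file to the current directory on the host:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;mcopy -i tw.img ::export.csv .
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;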

&lt;h2 id=&quot;some-observations-on-the-exported-csv-files&quot;&gt;Some observations on the exported CSV files&lt;/h2&gt;

&lt;p&gt;The column headings in the exported CSV files only show the first two characters of each field’s name. For instance, in my case the database contains the fields “Signatuur” and “Magazijn”, which show up as “SI” and “MA” in the exported file. I wasn’t able to change this behavior. Because of this, it’s a good idea to make a note (or a screenshot) of the full names from the DB/TextWorks interface.&lt;/p&gt;

&lt;p&gt;Another thing that caught my attention is that DB/TextWorks fields may contain multiple entries. In that case, the corresponding comma-separated item in the exported file is subdivided into entries that are separated by pipe (“|”) characters.&lt;/p&gt;

&lt;h2 id=&quot;database-structure&quot;&gt;Database structure&lt;/h2&gt;

&lt;p&gt;Each DB/TextWorks database is made up of multiple files. These files share a common base name, but each has a different file extension. The User Manual gives the following overview&lt;sup id=&quot;fnref:6&quot;&gt;&lt;a href=&quot;#fn:6&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;6&lt;/a&gt;&lt;/sup&gt;:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;table&gt;
    &lt;thead&gt;
      &lt;tr&gt;
        &lt;th style=&quot;text-align: left&quot;&gt;Extension&lt;/th&gt;
        &lt;th style=&quot;text-align: left&quot;&gt;Description&lt;/th&gt;
      &lt;/tr&gt;
    &lt;/thead&gt;
    &lt;tbody&gt;
      &lt;tr&gt;
        &lt;td style=&quot;text-align: left&quot;&gt;.TBA&lt;/td&gt;
        &lt;td style=&quot;text-align: left&quot;&gt;Primary textbase definition file, which also contains textbase elements (for example, forms, query screens, sets, record skeletons) stored in the textbase.&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td style=&quot;text-align: left&quot;&gt;.ACF&lt;/td&gt;
        &lt;td style=&quot;text-align: left&quot;&gt;Access control file; controls simultaneous access to the textbase by multiple users or software instances, or applications (for example, DB/Text PowerPack Lite).&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td style=&quot;text-align: left&quot;&gt;.DBS&lt;/td&gt;
        &lt;td style=&quot;text-align: left&quot;&gt;Textbase structure file; contains field definitions and other information about the structure of the textbase.&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td style=&quot;text-align: left&quot;&gt;.IXL&lt;/td&gt;
        &lt;td style=&quot;text-align: left&quot;&gt;Indexed list file; contains the validation and substitution lists, and the leading article and stop word lists.&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td style=&quot;text-align: left&quot;&gt;.DBR&lt;/td&gt;
        &lt;td style=&quot;text-align: left&quot;&gt;Contains the records (including deferred new, deleted, or changed records).&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td style=&quot;text-align: left&quot;&gt;.DBO&lt;/td&gt;
        &lt;td style=&quot;text-align: left&quot;&gt;Contains a directory to the records in the .DBR file.&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td style=&quot;text-align: left&quot;&gt;.SDO&lt;/td&gt;
        &lt;td style=&quot;text-align: left&quot;&gt;Contains a directory to records with deferred updates in the .DBR file.&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td style=&quot;text-align: left&quot;&gt;.BTX&lt;/td&gt;
        &lt;td style=&quot;text-align: left&quot;&gt;Contains the Term and Word indexes.&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td style=&quot;text-align: left&quot;&gt;.OCC&lt;/td&gt;
        &lt;td style=&quot;text-align: left&quot;&gt;Contains the lists of records indexed by the terms and words in the .BTX file.&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td style=&quot;text-align: left&quot;&gt;.LOG&lt;/td&gt;
        &lt;td style=&quot;text-align: left&quot;&gt;Optional textbase log file; lists changes to the textbase structure and records.&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td style=&quot;text-align: left&quot;&gt;.TML&lt;/td&gt;
        &lt;td style=&quot;text-align: left&quot;&gt;Thesaurus maintenance locking file, prevents more than one person at a time from modifying records in that thesaurus textbase. Note that .TML files do not have to be backed up. The software automatically creates them if they do not exist.&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td style=&quot;text-align: left&quot;&gt;.HLP&lt;/td&gt;
        &lt;td style=&quot;text-align: left&quot;&gt;Optional textbase-specific help file.&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td style=&quot;text-align: left&quot;&gt;.INI&lt;/td&gt;
        &lt;td style=&quot;text-align: left&quot;&gt;Optional file used with Copy Special applications, Textbase-Specific Help, the DB/Text ODBC Driver, and the Applications menu.&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td style=&quot;text-align: left&quot;&gt;.SLT&lt;/td&gt;
        &lt;td style=&quot;text-align: left&quot;&gt;Optional file that is created when EnableSlotlog=1 appears in the [Advanced] section of the DBTEXT.INI or textbase .INI file. This option can be set in DBTEXT.INI during Setup using the Track Textbase Access button on the Configuration dialog box. The machine name and login name of each user who has a textbase open is recorded in the .SLT file. The line is cleared when each closes the textbase.&lt;/td&gt;
      &lt;/tr&gt;
    &lt;/tbody&gt;
  &lt;/table&gt;
&lt;/blockquote&gt;

&lt;h2 id=&quot;example-databases&quot;&gt;Example databases&lt;/h2&gt;

&lt;p&gt;Since I had DB/TextWorks running already, I thought it would be useful to use it to make some sample files. I created two simple databases: one with DB/TextWorks 7.0.1, and another one with the older DB/TextWorks 3.0. I added both &lt;a href=&quot;https://github.com/openpreserve/format-corpus/tree/master/DBTextWorks&quot;&gt;to the OPF Format Corpus&lt;/a&gt;.&lt;/p&gt;

&lt;h2 id=&quot;further-resources&quot;&gt;Further resources&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;https://github.com/openpreserve/format-corpus/tree/master/DBTextWorks&quot;&gt;DB/TextWorks sample files in the OPF Format Corpus&lt;/a&gt;.&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://www.andornot.com/software/dbtextworks/dbtextworks-versions-and-features/&quot;&gt;DB/TextWorks Versions and Features&lt;/a&gt;.&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;http://support.inmagic.com/Web/DBTWandWPP900/DBTextWorksv9UsersManual.pdf&quot;&gt;DB/TextWorks User’s Manual&lt;/a&gt; (this documents version 9, which is slightly more recent than the version 7 covered in this post).&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;http://support.inmagic.com/Web/DBTWandWPP900/Readme.HTM&quot;&gt;Inmagic® DB/TextWorks® and DB/Text® WebPublisher PRO Version 9.00 README File&lt;/a&gt; (also covers version 9).&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://en.wikibooks.org/wiki/QEMU/Windows_XP&quot;&gt;QEMU/Windows XP&lt;/a&gt;.&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://www.wikihow.com/Activate-Windows-XP&quot;&gt;How to Activate Windows XP in 2024&lt;/a&gt; - method 3 (Disabling Activation) is a Windows registry hack that disables activation altogether, which is useful for VMs with no internet access.&lt;/li&gt;
&lt;/ul&gt;

&lt;div class=&quot;footnotes&quot; role=&quot;doc-endnotes&quot;&gt;
  &lt;ol&gt;
    &lt;li id=&quot;fn:2&quot;&gt;
      &lt;p&gt;I largely followed the instructions I found &lt;a href=&quot;https://en.wikibooks.org/wiki/QEMU/Windows_XP&quot;&gt;here&lt;/a&gt;. &lt;a href=&quot;#fnref:2&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:3&quot;&gt;
      &lt;p&gt;The actual file size is much smaller, because the qcow2 format only allocates space as data are written to the image. &lt;a href=&quot;#fnref:3&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:5&quot;&gt;
      &lt;p&gt;Note that the Windows XP setup process requires you to enter a 25-character product key. &lt;a href=&quot;#fnref:5&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:4&quot;&gt;
      &lt;p&gt;This requires the &lt;a href=&quot;https://www.gnu.org/software/mtools/&quot;&gt;mtools&lt;/a&gt; package. If I’m not mistaken this comes preinstalled with most Linux distributions, but if needed you can install it using &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;sudo apt install mtools&lt;/code&gt;. &lt;a href=&quot;#fnref:4&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:1&quot;&gt;
      &lt;p&gt;TextWorks 7 also supports XML as an export format. &lt;a href=&quot;#fnref:1&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:6&quot;&gt;
      &lt;p&gt;Taken from the User Manual of version 7 of the software, which is not available online. A copy of the more recent version 9 User Manual is &lt;a href=&quot;http://support.inmagic.com/Web/DBTWandWPP900/DBTextWorksv9UsersManual.pdf&quot;&gt;available here&lt;/a&gt;. &lt;a href=&quot;#fnref:6&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
  &lt;/ol&gt;
&lt;/div&gt;
</description>
                <link>https://bitsgalore.org/2026/02/16/emulating-inmagic-dbtextworks-databases-with-qemu</link>
                <guid>https://bitsgalore.org/2026/02/16/emulating-inmagic-dbtextworks-databases-with-qemu</guid>
                <pubDate>2026-02-16T00:00:00+01:00</pubDate>
        </item>

        <item>
                <title>Writerperfect conversion tools for legacy file formats</title>
                <description>&lt;figure class=&quot;image&quot;&gt;
  &lt;img src=&quot;/images/2025/06/1024px-Tools_66.jpeg&quot; alt=&quot;Photo that shows assortment of tools.&quot; /&gt;
  &lt;figcaption&gt;Tools by &lt;a href=&quot;https://commons.wikimedia.org/wiki/File:Tools_66.jpg&quot;&gt;Wilfredor&lt;/a&gt;, CC0, via Wikimedia Commons.&lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;My recent work on the &lt;a href=&quot;/2025/05/19/emulating-microsoft-multiplan-spreadsheets-in-dosbox-x&quot;&gt;Microsoft Multiplan&lt;/a&gt; and &lt;a href=&quot;/2025/05/28/quattro-pro-for-dos-revisited-an-obsolete-format-no-more&quot;&gt;Quattro Pro for DOS&lt;/a&gt; formats made me think how far LibreOffice has come over the past years in its support of legacy file formats. These formats are supported through software libraries that are developed within the &lt;a href=&quot;https://www.documentliberation.org/&quot;&gt;Document Liberation Project&lt;/a&gt; (DLP). This project was set up in 2014 by &lt;a href=&quot;https://www.documentfoundation.org/&quot;&gt;The Document Foundation&lt;/a&gt;, which is also the home of LibreOffice. Aside from their use in LibreOffice, these libraries are also the foundation of a set of stand-alone command-line tools that allow you to convert a wide range of legacy file formats to the &lt;a href=&quot;https://en.wikipedia.org/wiki/OpenDocument&quot;&gt;OpenDocument&lt;/a&gt; formats and EPUB.&lt;/p&gt;

&lt;p&gt;Much of the available information about these libraries and tools is scattered across different platforms. The command-line tools are also surprisingly hard to find, even though they have been around for a long time. This short post is an attempt at bringing the most important information I could find about them together.&lt;/p&gt;

&lt;!-- more --&gt;

&lt;h2 id=&quot;libraries&quot;&gt;Libraries&lt;/h2&gt;

&lt;p&gt;&lt;a href=&quot;https://www.documentliberation.org/projects/&quot;&gt;The projects page&lt;/a&gt; on the Document Liberation Project website lists all libraries and tools that are developed as part of the project. The “Import libraries” section is particularly interesting. Many of these libraries support multiple file formats, but this is not always obvious from their names. More detailed information, including the file formats they support, is available &lt;a href=&quot;https://wiki.documentfoundation.org/DLP/Libraries&quot;&gt;on their Wiki page&lt;/a&gt;. As an example, for the &lt;a href=&quot;https://sourceforge.net/projects/libwps&quot;&gt;libwps&lt;/a&gt; library, the Wiki page currently shows 10 different formats, including Quattro Pro and Lotus 123 spreadsheets&lt;sup id=&quot;fnref:1&quot;&gt;&lt;a href=&quot;#fn:1&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;1&lt;/a&gt;&lt;/sup&gt;.&lt;/p&gt;

&lt;h2 id=&quot;command-line-tools&quot;&gt;Command-line tools&lt;/h2&gt;

&lt;p&gt;The DLP import libraries are also used in a set of command-line conversion tools. These tools are quite difficult to find. For a start, their source code is hidden away in the &lt;a href=&quot;https://sourceforge.net/projects/libwpd/&quot;&gt;libwpd&lt;/a&gt; library, as a sub-project called &lt;a href=&quot;https://sourceforge.net/p/libwpd/writerperfect/&quot;&gt;writerperfect&lt;/a&gt;. I haven’t (yet) tried to compile the tools myself, but binaries are available for various Linux-based platforms. For Debian-based systems, each binary that is built from the writerperfect source is provided as a separate Debian package, which results in 27 distinct tools/packages&lt;sup id=&quot;fnref:2&quot;&gt;&lt;a href=&quot;#fn:2&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;2&lt;/a&gt;&lt;/sup&gt;. For convenience, this table lists all of these tools:&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th style=&quot;text-align: left&quot;&gt;Tool&lt;/th&gt;
      &lt;th style=&quot;text-align: left&quot;&gt;Description&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;abw2epub&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;AbiWord to EPUB format converter&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;abw2odt&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;AbiWord to OpenDocument converter&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;cdr2odg&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Corel Draw graphics to OpenDocument converter&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;ebook2epub&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;other E-Book formats to EPUB converter&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;ebook2odt&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;E-Book formats to OpenDocument converter&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;fh2odg&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Freehand to OpenDocument converter&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;key2odp&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Keynote to OpenDocument converter&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;mwaw2epub&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;old Mac formats to EPUB converter&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;mwaw2odf&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;old Mac formats to OpenDocument converter&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;numbers2ods&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Apple Numbers spreadsheet documents to OpenDocument converter&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;pages2epub&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Apple Pages to EPUB converter&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;pages2odt&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Apple Pages text documents to OpenDocument converter&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;pmd2odg&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Adobe PageMaker to OpenDocument converter&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;pub2odg&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Publisher documents to OpenDocument converter&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;qxp2epub&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;QuarkXPress to EPUB converter&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;qxp2odg&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;QuarkXPress to OpenDocument graphics converter&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;sd2epub&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;StarOffice to EPUB converter&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;sd2odf&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;StarOffice to OpenDocument converter&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;vsd2odg&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Visio to OpenDocument converter&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;wks2ods&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Works spreadsheet documents to OpenDocument converter&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;wpd2epub&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;WordPerfect document to EPUB converter&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;wpd2odt&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;WordPerfect to OpenDocument converter&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;wpg2odg&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;WordPerfect Graphics to OpenDocument converter&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;wps2epub&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Works text document to EPUB converter&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;wps2odt&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Works text documents to OpenDocument converter&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;zmf2epub&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Zoner Draw to EPUB converter&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;zmf2odg&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Zoner Draw to OpenDocument converter&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;h2 id=&quot;installation-debian&quot;&gt;Installation (Debian)&lt;/h2&gt;

&lt;p&gt;The binary package names are identical to the tool names, so in order to install the “wks2ods” tool from its Debian package, use this:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;sudo apt install wks2ods
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;All other tools follow the same pattern. As with the underlying import libraries, the tool names don’t always immediately give away what formats they support. The documentation is also quite minimal. As an example, despite its name, the “wks2ods” tool supports a range of legacy spreadsheet formats that goes far beyond “Works spreadsheet documents”. The &lt;a href=&quot;https://wiki.documentfoundation.org/DLP/Libraries&quot;&gt;information on the import libraries Wiki page&lt;/a&gt; and some experimentation should get you a long way though.&lt;/p&gt;

&lt;h2 id=&quot;using-the-command-line-tools&quot;&gt;Using the command-line tools&lt;/h2&gt;

&lt;p&gt;The basic usage of these tools is very simple. Taking the “wks2ods” tool as an example again:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;wks2ods [OPTIONS] INPUT [OUTPUT]
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Here, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;INPUT&lt;/code&gt; is the input file, and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;OUTPUT&lt;/code&gt; defines the output file. If &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;OUTPUT&lt;/code&gt; is omitted, the tool will print flat ODF to standard output. The tool has the following options:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;--encoding ENCODING&lt;/code&gt;: this sets the INPUT encoding.&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;--list-encodings&lt;/code&gt;: this shows the available encodings and then exits.&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;--password PASSWORD&lt;/code&gt;: this sets a password to open the input file.&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;--stdout&lt;/code&gt;: this prints the result as flat XML to standard output.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All tools share the same general command-line interface (note that I haven’t tried all of them, so there may be minor differences that I’m not aware of!).&lt;/p&gt;

&lt;h2 id=&quot;examples&quot;&gt;Examples&lt;/h2&gt;

&lt;p&gt;In the simplest case we can just run “wks2ods” with the name of the input and output files as command-line arguments. For example, the command below converts a Quattro Pro for DOS spreadsheet to OpenDocument Spreadsheet format:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;wks2ods WHATEVER.WQ2 WHATEVER.ods
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;You may need to use the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;--encoding&lt;/code&gt; option to define the encoding of the input file. It’s a good idea to always inspect the output file for any obvious encoding problems.&lt;/p&gt;

&lt;p&gt;The following command converts a WordPerfect file to OpenDocument Text format with the “wpd2odt” tool:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;wpd2odt WHATEVER.WPD WHATEVER.odt
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;And this converts a WordPerfect file to EPUB using the “wpd2epub” tool:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;wpd2epub WHATEVER.WPD WHATEVER.epub
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
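
&lt;p&gt;For larger batches, commands like the ones above can be generated by a small script. The following is only a minimal Python sketch under stated assumptions: the extension-to-tool mapping and the function names are hypothetical and chosen for illustration, the tools are assumed to be installed, and all of them are assumed to accept the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;--encoding&lt;/code&gt; option in the position shown in the usage line earlier (which may not hold for every tool).&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;
```python
import subprocess
from pathlib import Path

# Hypothetical mapping from file extension to Writerperfect tool; adjust to your files.
TOOL_BY_EXTENSION = {
    ".wq1": "wks2ods",
    ".wq2": "wks2ods",
    ".wpd": "wpd2odt",
}
OUTPUT_EXTENSION = {"wks2ods": ".ods", "wpd2odt": ".odt"}


def build_command(input_file, output_dir, encoding=None):
    """Return the command (a list of strings) for one file, or None if unsupported."""
    input_file = Path(input_file)
    tool = TOOL_BY_EXTENSION.get(input_file.suffix.lower())
    if tool is None:
        return None
    output_file = Path(output_dir) / (input_file.stem + OUTPUT_EXTENSION[tool])
    command = [tool]
    if encoding is not None:
        # Valid encoding names can be listed with the tool's --list-encodings option.
        command += ["--encoding", encoding]
    command += [str(input_file), str(output_file)]
    return command


def convert_directory(input_dir, output_dir, encoding=None):
    """Convert all supported files in input_dir; return (file name, exit code) pairs."""
    Path(output_dir).mkdir(parents=True, exist_ok=True)
    results = []
    for input_file in sorted(Path(input_dir).iterdir()):
        command = build_command(input_file, output_dir, encoding)
        if command is None:
            continue  # skip files without a known tool
        completed = subprocess.run(command, capture_output=True, text=True)
        results.append((input_file.name, completed.returncode))
    return results
```
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Calling &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;convert_directory&lt;/code&gt; with an input and an output directory returns one (file name, exit code) pair per processed file. Assuming the tools follow the usual convention of a non-zero exit code on failure, those pairs show which files need a closer look. The output files should still be inspected for encoding problems.&lt;/p&gt;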

&lt;h2 id=&quot;final-thoughts&quot;&gt;Final thoughts&lt;/h2&gt;

&lt;p&gt;Since the Writerperfect tools are based on the same software libraries that are also used by LibreOffice, the conversion results will be identical to a LibreOffice conversion in most cases. The main advantage of these tools is that they can be easily integrated in automated workflows, independent of the LibreOffice applications.&lt;/p&gt;

&lt;h2 id=&quot;further-resources&quot;&gt;Further resources&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;https://www.documentliberation.org/&quot;&gt;Document Liberation Project&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://wiki.documentfoundation.org/DLP/Libraries&quot;&gt;DLP Libraries Wiki page&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://sourceforge.net/p/libwpd/writerperfect/&quot;&gt;writerperfect source code&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;div class=&quot;footnotes&quot; role=&quot;doc-endnotes&quot;&gt;
  &lt;ol&gt;
    &lt;li id=&quot;fn:1&quot;&gt;
      &lt;p&gt;The information on the Wiki doesn’t appear to be entirely up to date, as this library also supports the Microsoft Multiplan formats, which aren’t listed here. &lt;a href=&quot;#fnref:1&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:2&quot;&gt;
      &lt;p&gt;See for example &lt;a href=&quot;https://packages.ubuntu.com/source/oracular/writerperfect&quot;&gt;here&lt;/a&gt;. &lt;a href=&quot;#fnref:2&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
  &lt;/ol&gt;
&lt;/div&gt;
</description>
                <link>https://bitsgalore.org/2025/06/10/writerperfect-conversion-tools-for-legacy-file-formats</link>
                <guid>https://bitsgalore.org/2025/06/10/writerperfect-conversion-tools-for-legacy-file-formats</guid>
                <pubDate>2025-06-10T00:00:00+02:00</pubDate>
        </item>

        <item>
                <title>Quattro Pro for DOS revisited&#58; an obsolete format no more?</title>
                <description>&lt;figure class=&quot;image&quot;&gt;
  &lt;img src=&quot;/images/2025/05/quattro-floppy.jpg&quot; alt=&quot;Photo of 3.5 inch installation floppy of Quattro Pro for DOS, version 5.0.&quot; /&gt;
  &lt;figcaption&gt;Image sourced from &lt;a href=&quot;https://archive.org/details/BorlandQuattroPro5.0ForDOSGerman&quot;&gt;Internet Archive&lt;/a&gt;, license unknown.&lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;In 2014 I wrote &lt;a href=&quot;/2014/10/29/quattro-pro-dos-obsolete-format-last&quot;&gt;a post&lt;/a&gt; on the &lt;a href=&quot;http://fileformats.archiveteam.org/wiki/Quattro_Pro&quot;&gt;Quattro Pro&lt;/a&gt; for DOS spreadsheet formats. That post documented my attempts at reading a few old Quattro Pro for DOS spreadsheets from my personal archives with modern (at the time) software. Back then, neither Microsoft Excel nor LibreOffice Calc supported these formats. Only the then-current Quattro Pro X7 was able to read the files, but there were several issues related to formatting, rendering of charts, and the handling of external references. Based on these tests, I argued that Quattro Pro for DOS could be a case of that rarest species in the world of digital preservation: a file format that had truly become obsolete!&lt;/p&gt;

&lt;p&gt;While working on my &lt;a href=&quot;/2025/05/19/emulating-microsoft-multiplan-spreadsheets-in-dosbox-x&quot;&gt;previous post about Microsoft Multiplan spreadsheets&lt;/a&gt;, I was surprised to see that &lt;a href=&quot;https://www.libreoffice.org/&quot;&gt;LibreOffice&lt;/a&gt; actually supported this ancient format. A quick glance at &lt;a href=&quot;https://wiki.documentfoundation.org/Feature_Comparison:_LibreOffice_-_Microsoft_Office&quot;&gt;LibreOffice’s feature matrix&lt;/a&gt; showed that LibreOffice has added support for many legacy formats over the past years, including Quattro Pro for DOS&lt;sup id=&quot;fnref:1&quot;&gt;&lt;a href=&quot;#fn:1&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;1&lt;/a&gt;&lt;/sup&gt;. Under the hood LibreOffice uses the &lt;a href=&quot;https://sourceforge.net/p/libwps/wiki/Home/&quot;&gt;Microsoft Works format import library (libwps)&lt;/a&gt; for this. The same library is also used to read the &lt;a href=&quot;https://en.wikipedia.org/wiki/Multiplan&quot;&gt;Microsoft Multiplan&lt;/a&gt; for DOS and &lt;a href=&quot;https://en.wikipedia.org/wiki/Lotus_1-2-3&quot;&gt;Lotus 123&lt;/a&gt; file formats. This made me wonder if the conclusions of my 2014 post would still hold up. This follow-up post puts this to the test.&lt;/p&gt;

&lt;!-- more --&gt;

&lt;h2 id=&quot;quattro-pro&quot;&gt;Quattro Pro&lt;/h2&gt;

&lt;p&gt;&lt;a href=&quot;https://en.wikipedia.org/wiki/Quattro_Pro&quot;&gt;Quattro Pro&lt;/a&gt; is a spreadsheet program that was first released in 1988. It’s still around today as part of the &lt;a href=&quot;https://www.wordperfect.com/en/product/office-suite/&quot;&gt;WordPerfect Office suite&lt;/a&gt;. &lt;a href=&quot;http://fileformats.archiveteam.org/wiki/Quattro_Pro&quot;&gt;A number of file formats&lt;/a&gt; are associated with the software. As with my 2014 post, the scope of this follow-up is restricted to the old Quattro Pro for DOS formats:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;http://fileformats.archiveteam.org/wiki/WQ1&quot;&gt;Quattro Pro for DOS, versions 1-4 (WQ1)&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;http://fileformats.archiveteam.org/wiki/WQ2&quot;&gt;Quattro Pro for DOS, versions 5.0 and 5.5 (WQ2)&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;test-files&quot;&gt;Test files&lt;/h2&gt;

&lt;p&gt;I used the same test files as in my 2014 post (which are all part of the &lt;a href=&quot;https://github.com/openpreserve/format-corpus/&quot;&gt;OPF Format Corpus&lt;/a&gt;):&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;One &lt;a href=&quot;https://github.com/openpreserve/format-corpus/tree/master/office/spreadsheet/wq1&quot;&gt;WQ1 file&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;Three &lt;a href=&quot;https://github.com/openpreserve/format-corpus/tree/master/office/spreadsheet/wq2&quot;&gt;WQ2 files&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These are all files from my personal archives, and I originally created them in 1996.&lt;/p&gt;

&lt;h2 id=&quot;emulation-of-original-software&quot;&gt;Emulation of original software&lt;/h2&gt;

&lt;p&gt;To evaluate LibreOffice Calc’s handling of my test files, I emulated the original software, and used the emulation as a reference. &lt;a href=&quot;https://winworldpc.com/home&quot;&gt;WinWorld&lt;/a&gt; has &lt;a href=&quot;https://winworldpc.com/product/quattro-pro/&quot;&gt;installers for most old Quattro Pro versions&lt;/a&gt;. From what I recall I created most of my files with Quattro Pro 5.0 for DOS, so I downloaded &lt;a href=&quot;https://winworldpc.com/download/5274c3aa-0825-c389-11c3-a6e280947e52&quot;&gt;the installers of that version&lt;/a&gt;. I then installed the software in &lt;a href=&quot;https://dosbox-x.com/&quot;&gt;DOSBox-X&lt;/a&gt;. My initial attempts at this failed, because the installer wouldn’t recognise the second installation disk. As a workaround, I simply copied the files from both installation disks to a temporary folder, and then ran the installer from there. This resulted in a successful installation.&lt;/p&gt;

&lt;h2 id=&quot;simple-numbers-and-text&quot;&gt;Simple numbers and text&lt;/h2&gt;

&lt;p&gt;I started out with &lt;a href=&quot;https://github.com/openpreserve/format-corpus/blob/master/office/spreadsheet/wq1/KSBASE.WQ1&quot;&gt;this WQ1 spreadsheet&lt;/a&gt;, which only contains simple numerical and text data. The DOSBox-X emulation rendered this file without any apparent problems:&lt;/p&gt;

&lt;figure class=&quot;image&quot;&gt;
  &lt;img src=&quot;/images/2025/05/qp5-wq1.png&quot; /&gt;
&lt;/figure&gt;

&lt;p&gt;You may notice that the first column appears as shaded. Apparently the file was originally saved with the “locked titles” option enabled, which can be used to lock the first row or column during scrolling. This can easily be disabled.&lt;/p&gt;

&lt;p&gt;Next I opened the same file in LibreOffice Calc&lt;sup id=&quot;fnref:4&quot;&gt;&lt;a href=&quot;#fn:4&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;2&lt;/a&gt;&lt;/sup&gt;:&lt;/p&gt;

&lt;figure class=&quot;image&quot;&gt;
  &lt;img src=&quot;/images/2025/05/lo-wq1.png&quot; /&gt;
&lt;/figure&gt;

&lt;p&gt;Calc reads the file without any problems. There are some minor differences compared to the emulation: Calc doesn’t “lock” the first column, the date format is slightly different, and Calc shows more decimal places for the numbers in the J, K and L columns. All of these are mostly cosmetic differences, and for practical purposes Calc’s rendering looks perfectly usable. The results were largely identical for &lt;a href=&quot;https://github.com/openpreserve/format-corpus/blob/master/office/spreadsheet/wq2/KSBASE.WQ2&quot;&gt;the WQ2 version of the same spreadsheet&lt;/a&gt;.&lt;/p&gt;

&lt;h2 id=&quot;simple-formulas-charts&quot;&gt;Simple formulas, charts&lt;/h2&gt;

&lt;p&gt;Next up was &lt;a href=&quot;https://github.com/openpreserve/format-corpus/blob/master/office/spreadsheet/wq2/KS4001.WQ2&quot;&gt;KS4001.WQ2&lt;/a&gt;, which contains some formulas and a chart. Here are two screenshots of the DOSBox-X emulation of this file:&lt;/p&gt;

&lt;figure class=&quot;image&quot;&gt;
  &lt;img src=&quot;/images/2025/05/qp-wq2-1.png&quot; /&gt;
&lt;/figure&gt;

&lt;figure class=&quot;image&quot;&gt;
  &lt;img src=&quot;/images/2025/05/qp-wq2-2.png&quot; /&gt;
&lt;/figure&gt;

&lt;p&gt;And here’s a screenshot that shows what LibreOffice Calc makes of this:&lt;/p&gt;

&lt;figure class=&quot;image&quot;&gt;
  &lt;img src=&quot;/images/2025/05/lo-wq2.png&quot; /&gt;
&lt;/figure&gt;

&lt;p&gt;Compared to the emulated version, the most notable difference is the formatting of the chart, but overall the Calc rendering gives a good representation of the data in the original spreadsheet. Interestingly, in &lt;a href=&quot;/2014/10/29/quattro-pro-dos-obsolete-format-last&quot;&gt;my 2014 analysis&lt;/a&gt; the then-current Quattro Pro X7 version failed to render the chart correctly:&lt;/p&gt;

&lt;figure class=&quot;image&quot;&gt;
  &lt;img src=&quot;/images/2014/10/ks4001_wq2.png&quot; /&gt;
&lt;/figure&gt;

&lt;p&gt;It’s reassuring to see that LibreOffice Calc now does a much better job!&lt;/p&gt;

&lt;h2 id=&quot;external-references&quot;&gt;External references&lt;/h2&gt;

&lt;p&gt;Finally there’s &lt;a href=&quot;https://github.com/openpreserve/format-corpus/blob/master/office/spreadsheet/wq2/KS4000.WQ2&quot;&gt;this WQ2 spreadsheet&lt;/a&gt;, which contains references to another spreadsheet (KSBASE.WQ2) that is located in the same directory. This file exposed some interesting issues, and the following sub-sections are an attempt to demonstrate these as best as I can. Be warned though that what follows is slightly convoluted, which is due to a combination of the nature of the spreadsheet, and the fact that both issues are interrelated.&lt;/p&gt;

&lt;h3 id=&quot;quattro-pro-emulation&quot;&gt;Quattro Pro emulation&lt;/h3&gt;

&lt;p&gt;First I opened this file in the emulator. On loading, Quattro Pro detects that it contains external references, and prompts the user what to do with them:&lt;/p&gt;

&lt;figure class=&quot;image&quot;&gt;
  &lt;img src=&quot;/images/2025/05/qp-wq2-ext-dialog.png&quot; /&gt;
&lt;/figure&gt;

&lt;p&gt;Here I chose “Load Supporting”, after which the file loaded like this:&lt;/p&gt;

&lt;figure class=&quot;image&quot;&gt;
  &lt;img src=&quot;/images/2025/05/qp-wq2-ext.png&quot; /&gt;
&lt;/figure&gt;

&lt;p&gt;The external references, which are in cells E4, G4, B5, E5 and H5, are all resolved correctly. To double-check this, I also tried to open the file after removing the referenced file. This resulted in “NA” values for all cells with references to this file, as well as for all cells that depend on it.&lt;/p&gt;

&lt;p&gt;I initially assumed that the shown “ERR” values were related to the external references, but on closer inspection this doesn’t seem to be the case at all. Instead, they simply result from the fact that the C column doesn’t contain any data.&lt;/p&gt;

&lt;h3 id=&quot;initial-behaviour-in-libreoffice-calc&quot;&gt;Initial behaviour in LibreOffice Calc&lt;/h3&gt;

&lt;p&gt;Opening the file in LibreOffice Calc also results in a notification message that “Automatic update of external links has been disabled”:&lt;/p&gt;

&lt;figure class=&quot;image&quot;&gt;
  &lt;img src=&quot;/images/2025/05/lo-wq2-ext-dialog.png&quot; /&gt;
&lt;/figure&gt;

&lt;p&gt;Note how the values in cells E4 and G4 (which both contain external references that return a text string in the emulation) are shown as “#REF”. Interestingly though, cells B5, E5 and H5 also contain external references, but the (correct!) numerical values are shown nevertheless. This could mean either of the following:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;The numbers shown are “cached” values that were saved alongside the formulas at the time of the file’s creation.&lt;/li&gt;
  &lt;li&gt;The numbers are imported from the linked spreadsheet (even though this shouldn’t really happen).&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;To test which of these is true, I created a copy of the spreadsheet in an empty directory. When I opened it in Calc, it showed the exact same behaviour. This means the numbers really &lt;em&gt;are&lt;/em&gt; cached values. This does raise the question why Calc only shows cached &lt;em&gt;numbers&lt;/em&gt;, and not cached &lt;em&gt;text strings&lt;/em&gt;. I will return to this in the final sections of this post.&lt;/p&gt;

&lt;h3 id=&quot;restoring-the-external-references&quot;&gt;Restoring the external references&lt;/h3&gt;

&lt;p&gt;First let’s go back to our original spreadsheet. After clicking on “Allow updating”, Calc showed a notification that the external file that is referenced could not be loaded:&lt;/p&gt;

&lt;figure class=&quot;image&quot;&gt;
  &lt;img src=&quot;/images/2025/05/lo-external-file.png&quot; /&gt;
&lt;/figure&gt;

&lt;p&gt;The reason for this is that all references to the external file are defined as a file name without a file extension (i.e. KSBASE instead of KSBASE.WQ2). A quick look at the &lt;a href=&quot;https://help.libreoffice.org/latest/en-US/text/shared/01/02180000.html&quot;&gt;LibreOffice documentation&lt;/a&gt; shows how to fix this:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;In Calc’s “Edit” menu, go to “Links to External Files…”. This brings up a link editor dialog.&lt;/li&gt;
  &lt;li&gt;Select the link you want to change, and click on “Modify …”&lt;/li&gt;
  &lt;li&gt;In the file dialog that appears, select the file you want to link to (in this case KSBASE.WQ2).&lt;/li&gt;
  &lt;li&gt;Click on the “Update” button, and then close the dialog.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;behaviour-after-recalculate&quot;&gt;Behaviour after recalculate&lt;/h3&gt;

&lt;p&gt;With the link restored, I opened the “Data” menu, and selected &lt;a href=&quot;https://help.libreoffice.org/latest/en-US/text/scalc/01/recalculate_hard.html&quot;&gt;“Calculate/Recalculate Hard”&lt;/a&gt;. This forces all cells to be re-calculated. The results of this operation were not quite what I expected though:&lt;/p&gt;

&lt;figure class=&quot;image&quot;&gt;
  &lt;img src=&quot;/images/2025/05/lo-wq2-ext-updated.png&quot; /&gt;
&lt;/figure&gt;

&lt;p&gt;Even though the “#REF” values are now replaced by actual values from the external spreadsheet, these values are different from those in the original Quattro Pro rendering. For example:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;The value of cell B5 is supposed to be a number (variable DEPTH, 30), but instead it is now a text string (“sal”).&lt;/li&gt;
  &lt;li&gt;The value of cell G4 is supposed to be a text string (variable SOIL, “c”), but instead it is now a numerical value (3133.03).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So what’s going on here?&lt;/p&gt;

&lt;h3 id=&quot;vlookup-function-behaviour-in-quattro-pro-and-calc&quot;&gt;VLOOKUP function behaviour in Quattro Pro and Calc&lt;/h3&gt;

&lt;p&gt;To understand why this happens, it is first important to know that the external references in rows 4 and 5 of the spreadsheet are all table lookup operations. As an example, the Quattro Pro formula in cell B5 (copied here from the emulated version) is:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;@VLOOKUP($B$4,[KSBASE]A:$A$2..$I$83,6)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This function has 3 arguments:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;A lookup value.&lt;/li&gt;
  &lt;li&gt;A data block in the external spreadsheet, where the first column is used as an index column.&lt;/li&gt;
  &lt;li&gt;The column number in the data block that contains the requested value.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The equivalent function in LibreOffice Calc is &lt;a href=&quot;https://help.libreoffice.org/25.2/en-US/text/scalc/01/04060109.html?&amp;amp;DbPAR=SHARED&amp;amp;System=UNIX#Section9&quot;&gt;VLOOKUP&lt;/a&gt;. Here’s a portion of the relevant data in the external spreadsheet (which is used as the data block):&lt;/p&gt;

&lt;figure class=&quot;image&quot;&gt;
  &lt;img src=&quot;/images/2025/05/ksbase-lookup.png&quot; /&gt;
&lt;/figure&gt;

&lt;p&gt;Note how column G contains the expected value (30) for the DEPTH variable (highlighted here in green), but this is the 7th column in the data block, not the 6th! The 6th column (variable SOIL) contains the text string “sal” (highlighted here in red), and this is also what is returned by Calc’s VLOOKUP function. Other variables that are based on this function are similarly offset by exactly one column.&lt;/p&gt;

&lt;p&gt;As it turns out, Quattro Pro’s @VLOOKUP function and Calc’s VLOOKUP function each treat the data block geometry slightly differently: in Quattro Pro, the index of the first column is defined as 0, whereas it is 1 in Calc.  This means that when you use Calc to open a Quattro Pro for DOS spreadsheet that uses the @VLOOKUP function, the values that are returned by Calc’s VLOOKUP will be offset by exactly one column&lt;sup id=&quot;fnref:2&quot;&gt;&lt;a href=&quot;#fn:2&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;3&lt;/a&gt;&lt;/sup&gt;.&lt;/p&gt;
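
&lt;p&gt;The off-by-one effect can be illustrated with a small Python model. This is a sketch, not the actual implementation of either program: the function names and the data block are made up, and the lookup is simplified to an exact match on the index column.&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;
```python
def lookup_row(block, key):
    """Return the first row whose first (index) column equals key."""
    for row in block:
        if row[0] == key:
            return row
    raise KeyError(key)


def quattro_vlookup(key, block, column):
    """Model of Quattro Pro's @VLOOKUP: the index column counts as column 0."""
    return lookup_row(block, key)[column]


def calc_vlookup(key, block, column):
    """Model of Calc's VLOOKUP: the index column counts as column 1."""
    return lookup_row(block, key)[column - 1]


# Made-up data block: an index column followed by six value columns.
# The 6th column holds a text string, the 7th a number.
block = [
    ["P1", 10, 11, 12, 13, "sal", 30],
    ["P2", 20, 21, 22, 23, "c", 45],
]

print(quattro_vlookup("P1", block, 6))   # 30: the 7th column of the block
print(calc_vlookup("P1", block, 6))      # sal: the 6th column, shifted by one
print(calc_vlookup("P1", block, 6 + 1))  # 30: adding 1 compensates for the shift
```
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;In this model the same column argument (6) returns the number in one convention and the neighbouring text string in the other, which matches the pattern seen in cell B5.&lt;/p&gt;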

&lt;p&gt;I created a small, self-contained test file that demonstrates this issue, and wrote some accompanying documentation. Both are &lt;a href=&quot;https://github.com/openpreserve/format-corpus/tree/master/office/spreadsheet/wq2/vlookup-compat-demo&quot;&gt;available here in the OPF Format Corpus&lt;/a&gt;.&lt;/p&gt;

&lt;h3 id=&quot;handling-of-cached-formula-values&quot;&gt;Handling of cached formula values&lt;/h3&gt;

&lt;p&gt;As I explained before, my initial attempt at opening the spreadsheet (before recalculation) showed that Calc loads “cached” values for externally referenced cells, but &lt;em&gt;only&lt;/em&gt; if these values are numbers. In case of strings, a “#REF” value was shown in my tests. There are two plausible explanations for this. Either:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Quattro Pro stores a cached value if the formula result is a number, but not if it is a text string, or,&lt;/li&gt;
  &lt;li&gt;Quattro Pro stores a cached value irrespective of the formula result’s data type, but Calc doesn’t correctly handle it if it is a string.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Again, this is something we can easily test by isolating a copy of the spreadsheet in an empty directory, and then opening that copy in the Quattro Pro emulation:&lt;/p&gt;

&lt;figure class=&quot;image&quot;&gt;
  &lt;img src=&quot;/images/2025/05/qp-ref-isolated.png&quot; /&gt;
&lt;/figure&gt;

&lt;p&gt;Here we see that Quattro Pro shows the (correct) cached value for each cell with an external reference, irrespective of whether it is a number or a text string. This leads to the conclusion that Quattro Pro &lt;em&gt;does&lt;/em&gt; store cached string values, but that Calc doesn’t correctly handle them. I was able to confirm this with some additional tests on another self-contained test file. For brevity I won’t go into details here, but the documentation of these tests and the test file &lt;a href=&quot;https://github.com/openpreserve/format-corpus/tree/master/office/spreadsheet/wq2/external-reference-demo&quot;&gt;can be found here&lt;/a&gt;.&lt;/p&gt;

&lt;h3 id=&quot;combined-effect-of-vlookup-and-cached-values-issues&quot;&gt;Combined effect of VLOOKUP and cached values issues&lt;/h3&gt;

&lt;p&gt;The cached values issue does not only affect cells with external references, but &lt;em&gt;any&lt;/em&gt; cell that contains a formula. We can see this in the previously created &lt;a href=&quot;https://github.com/openpreserve/format-corpus/tree/master/office/spreadsheet/wq2/vlookup-compat-demo&quot;&gt;VLOOKUP demo file&lt;/a&gt;, which renders like this in the Quattro Pro emulation:&lt;/p&gt;

&lt;figure class=&quot;image&quot;&gt;
  &lt;img src=&quot;/images/2025/05/qp-vlookup-demo.png&quot; /&gt;
&lt;/figure&gt;

&lt;p&gt;The top 11 rows here contain a 5-column block of data, which is queried with the @VLOOKUP function in row 15. When I opened this file in Calc, this initially produced:&lt;/p&gt;

&lt;figure class=&quot;image&quot;&gt;
  &lt;img src=&quot;/images/2025/05/lo-init.png&quot; /&gt;
&lt;/figure&gt;

&lt;p&gt;Note how the values in cells D15 and E15 (which both result in a text string in the emulation) are different from the original rendering, while the values in B15 and C15 (both numbers) are correct. After re-calculating (“Data/Calculate/Recalculate Hard”), I got this:&lt;/p&gt;

&lt;figure class=&quot;image&quot;&gt;
  &lt;img src=&quot;/images/2025/05/lo-recalc.png&quot; /&gt;
&lt;/figure&gt;

&lt;p&gt;This is the familiar column shift pattern that results from the @VLOOKUP compatibility issue. However, by itself this doesn’t explain why cells B15 and C15 showed the correct values &lt;em&gt;before&lt;/em&gt; the recalculation!&lt;/p&gt;

&lt;p&gt;My best guess is that on opening a Quattro Pro for DOS spreadsheet, Calc’s intended behaviour is to load the cached cell values, instead of recalculating the underlying formulas. This would be entirely sensible for legacy spreadsheet formats, since it is highly likely that not all functions in their original creation software are completely compatible with Calc. Since the tests in the previous section showed that Calc is unable to read cached values that are text strings, I suspect that in this particular case, Calc recalculates those cells as a fallback. This then results in a rendering that contains a mix of both cached and recalculated values. In most situations this would go unnoticed by the user, but not here, since the recalculated values are affected by the @VLOOKUP compatibility issue.&lt;/p&gt;
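
&lt;p&gt;This guess can be written down as a toy model in Python. To be clear, it only encodes the hypothesis described above and says nothing about how Calc is actually implemented; the row of data, the cell contents and all names are made up for illustration.&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;
```python
# Made-up lookup row: an index column followed by four value columns.
ROW = ["key", 1.5, 2.5, "sand", "clay"]


def calc_style_lookup(column):
    """Calc counts the index column as 1, so a 0-based argument lands one column to the left."""
    return ROW[column - 1]


def render_on_open(cells):
    """Model of the guessed behaviour: keep numeric cached values, recalculate the rest."""
    rendered = {}
    for name, (cached, column) in cells.items():
        if isinstance(cached, (int, float)):
            rendered[name] = cached  # cached number is shown as-is
        else:
            rendered[name] = calc_style_lookup(column)  # fallback recalculation
    return rendered


# Each cell: (value cached by Quattro Pro, 0-based column argument of its @VLOOKUP).
cells = {
    "B15": (ROW[1], 1),
    "C15": (ROW[2], 2),
    "D15": (ROW[3], 3),
    "E15": (ROW[4], 4),
}

print(render_on_open(cells))
```
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;In this model B15 and C15 keep their cached numbers, while D15 and E15 (whose cached values are text strings) are recalculated and end up one column off. This is the same mix of correct and shifted values that showed up before the forced recalculation.&lt;/p&gt;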

&lt;h2 id=&quot;conclusions&quot;&gt;Conclusions&lt;/h2&gt;

&lt;p&gt;In the concluding section of &lt;a href=&quot;/2014/10/29/quattro-pro-dos-obsolete-format-last&quot;&gt;my 2014 post&lt;/a&gt; I wrote:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;[N]o modern-day software is able to correctly handle the Quattro Pro for DOS formats. Add to this that the Quattro Pro for DOS formats are proprietary with (as far as I’m aware) no publicly available specifications, and I think we have a pretty strong candidate for a format that may be (nearly) obsolete.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;More than ten years onward, things are looking much better now. LibreOffice Calc, which is both free and open-source, was able to read all my old WQ1 and WQ2 files, and it did a better job at this than the (proprietary/closed) Quattro Pro X7 version I used in 2014. The tests with my external reference spreadsheet highlighted two issues though:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;An incompatibility between a legacy Quattro Pro for DOS function and its modern equivalent in Calc.&lt;/li&gt;
  &lt;li&gt;A lack of support for cached formula values that are text strings, which in turn triggers inconsistent recalculation behaviour.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Realistically, I doubt the compatibility issue itself can (or even should!) be fixed in Calc. Given the large number of legacy spreadsheet formats that Calc supports, and the wide array of functions within each of these formats, there may be many other, similar compatibility issues lurking beneath the surface. That being said, the effects of such issues could be mitigated by addressing the second issue, specifically:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;By adding read support for cached formula values that are not numbers (text strings, possibly other types as well).&lt;/li&gt;
  &lt;li&gt;By reconsidering the current recalculation behaviour on opening a file. If no cached value can be retrieved, I think Calc shouldn’t automatically recalculate the formula as a fallback, but instead just let the user know that it is not available. (Of course, after this it’s up to the user if they want to manually force a recalculation or not.)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;With these changes, Calc’s rendering of a Quattro Pro for DOS file would provide the best approximation of its state when it was last saved.&lt;/p&gt;

&lt;p&gt;I’ve submitted a &lt;a href=&quot;https://bugs.documentfoundation.org/show_bug.cgi?id=166706&quot;&gt;bug report&lt;/a&gt; about both issues, where I also mentioned the above suggestions.&lt;/p&gt;

&lt;p&gt;The results of my tests also underline, once again, the risks of migrating or “normalizing” legacy spreadsheets to some modern format. On their own, each of the reported issues can already lead to lost or altered data in a migration action. The combined effect of both issues can result in more complex changes that would be nearly impossible to trace back to the source data. At the very minimum, the original source file should always be kept, as well as a full audit trail of the migration process. Both should also be made available to the user, so they can use it to make an informed judgment of the accuracy of the migrated data.&lt;/p&gt;
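
&lt;p&gt;What such an audit trail contains will differ between organisations. As a minimal sketch of the idea, the Python snippet below records checksums of the source and migrated files, the tool used and a timestamp, and appends the record to a log file. The record fields and function names are assumptions made for illustration only, not an established standard.&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;
```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 checksum of a file as a hex string."""
    digest = hashlib.sha256()
    with open(path, "rb") as stream:
        for chunk in iter(lambda: stream.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def migration_record(source, target, tool):
    """Describe one migration action: which file became which, using what tool."""
    return {
        "source": str(source),
        "source_sha256": sha256_of(source),
        "target": str(target),
        "target_sha256": sha256_of(target),
        "tool": tool,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }


def append_to_log(record, log_file):
    """Append the record as one line of JSON to the audit log."""
    with open(log_file, "a", encoding="utf-8") as stream:
        stream.write(json.dumps(record) + "\n")
```
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;A record like this doesn’t say anything about the accuracy of the migrated data. It only makes it possible to trace a migrated file back to the exact source file it was derived from.&lt;/p&gt;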

&lt;p&gt;Just like &lt;a href=&quot;/2025/05/19/emulating-microsoft-multiplan-spreadsheets-in-dosbox-x&quot;&gt;Microsoft Multiplan&lt;/a&gt;, the Quattro Pro for DOS case confirms &lt;a href=&quot;https://blog.dshr.org/2009/01/are-format-specifications-important-for.html&quot;&gt;David Rosenthal’s 2009 assertion&lt;/a&gt;:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;Clearly, formats with open source renderers are, for all practical purposes, immune from format obsolescence.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;So would I still call the Quattro Pro for DOS format(s) obsolete? Probably not. This would make it the first example I’m aware of, of a file format that was effectively obsolete 10 years ago, but isn’t anymore!&lt;/p&gt;

&lt;h2 id=&quot;further-resources&quot;&gt;Further resources&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;/2014/10/29/quattro-pro-dos-obsolete-format-last&quot;&gt;Quattro Pro for DOS: an obsolete format at last?&lt;/a&gt; (original 2014 post)&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://openpreservation.org/blogs/opening-johans-quattro-pro-files-quattro-pro-6-win-311/&quot;&gt;Opening Johan’s Quattro Pro files in Quattro Pro 6 for Win 3.11&lt;/a&gt; by Euan Cochrane&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://github.com/openpreserve/format-corpus/tree/master/office/spreadsheet/&quot;&gt;Legacy spreadsheet sample files in the OPF Format Corpus&lt;/a&gt; (this includes the WQ1 and WQ2 files used here)&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://github.com/openpreserve/format-corpus/tree/master/office/spreadsheet/wq2/vlookup-compat-demo&quot;&gt;VLOOKUP compatibility demo&lt;/a&gt; (showcases VLOOKUP compatibility issue)&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://github.com/openpreserve/format-corpus/tree/master/office/spreadsheet/wq2/external-reference-demo&quot;&gt;External reference demo&lt;/a&gt; (showcases cached values issue)&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://bugs.documentfoundation.org/show_bug.cgi?id=166706&quot;&gt;LibreOffice bug report&lt;/a&gt; on inconsistent handling of cached values in Quattro Pro for DOS (WQ2) spreadsheets&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://sourceforge.net/p/libwps/wiki/Home/&quot;&gt;Microsoft Works format import library (libwps)&lt;/a&gt; (this is the library used by LibreOffice Calc to read the Quattro Pro formats, as well as the Microsoft Multiplan for DOS and Lotus 123 formats)&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;div class=&quot;footnotes&quot; role=&quot;doc-endnotes&quot;&gt;
  &lt;ol&gt;
    &lt;li id=&quot;fn:1&quot;&gt;
      &lt;p&gt;Here’s &lt;a href=&quot;https://web.archive.org/web/20150117070210/https://wiki.documentfoundation.org/Feature_Comparison:_LibreOffice_-_Microsoft_Office&quot;&gt;a snapshot of the same feature matrix&lt;/a&gt; from January 2015. &lt;a href=&quot;#fnref:1&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:4&quot;&gt;
      &lt;p&gt;The screenshots in this post are all from the (rather old) 6.4.7.2 version; I later re-did some of my tests in LibreOffice 24.2.7.2, which gave identical results. &lt;a href=&quot;#fnref:4&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:2&quot;&gt;
      &lt;p&gt;This most likely applies to Quattro Pro’s @HLOOKUP function as well. &lt;a href=&quot;#fnref:2&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
  &lt;/ol&gt;
&lt;/div&gt;
</description>
                <link>https://bitsgalore.org/2025/05/28/quattro-pro-for-dos-revisited-an-obsolete-format-no-more</link>
                <guid>https://bitsgalore.org/2025/05/28/quattro-pro-for-dos-revisited-an-obsolete-format-no-more</guid>
                <pubDate>2025-05-28T00:00:00+02:00</pubDate>
        </item>

        <item>
                <title>Emulating Microsoft Multiplan spreadsheets in DOSBox-X</title>
                <description>&lt;figure class=&quot;image&quot;&gt;
  &lt;img src=&quot;/images/2025/05/multiplan-box.jpg&quot; alt=&quot;Photo of original box in which Multiplan 2.0 was shipped.&quot; /&gt;
  &lt;figcaption&gt;Image sourced from &lt;a href=&quot;https://archive.org/details/microsoft-multiplan-2.00-5.25-2.-7z/&quot;&gt;Internet Archive&lt;/a&gt;, license unknown.&lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;Last week I was contacted by Tristan Zondag from the Netherlands Institute for Sound and Vision for advice on files he had recovered from some old floppy disks as part of an ongoing &lt;a href=&quot;https://www.avanet.nl/digitale-archeologie-graven-naar-oude-data/&quot;&gt;digital archaeology project&lt;/a&gt;. Based on the files’ byte structure, Tristan suspected that these were &lt;a href=&quot;https://en.wikipedia.org/wiki/Multiplan&quot;&gt;Multiplan&lt;/a&gt; files. Multiplan is a spreadsheet application that was developed by Microsoft between 1982 and 1990.&lt;/p&gt;

&lt;p&gt;However, the files weren’t recognised by &lt;a href=&quot;https://digital-preservation.github.io/droid/&quot;&gt;DROID&lt;/a&gt;, and Tristan also wasn’t able to open them in the original Multiplan software, which is why he asked me to have a look at them. In response, I did some tests in which I ran old MS-DOS versions of the Multiplan software in &lt;a href=&quot;https://dosbox-x.com/&quot;&gt;DOSBox-X&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The main purpose of this post is to document how I made this work. This is probably of interest to others who are working with old Multiplan files. It might also serve as a useful introduction to emulating old MS-DOS software in DOSBox-X.&lt;/p&gt;

&lt;p&gt;Since the files from Sound and Vision are subject to access restrictions, I’m not able to share them here. So, all examples in this post are based on publicly available Multiplan files. In the final sections I briefly explain how I used the emulation to make the data in those files accessible in modern software, and I also discuss some alternative options.&lt;/p&gt;

&lt;!-- more --&gt;

&lt;h2 id=&quot;about-the-multiplan-formats&quot;&gt;About the Multiplan formats&lt;/h2&gt;

&lt;p&gt;First of all, it’s important to note that the version history of Multiplan and its file formats is quite convoluted. In addition to the changes to the format between successive versions of the software, there are also platform-specific differences between the formats used by the DOS, Macintosh and Xenix versions of Multiplan. &lt;a href=&quot;https://preservation.tylerthorsted.com/2023/11/10/multiplan/&quot;&gt;This 2023 post by Tyler Thorsted&lt;/a&gt; gives a good overview, but most likely doesn’t even cover all variants.&lt;/p&gt;

&lt;h2 id=&quot;get-the-old-multiplan-releases&quot;&gt;Get the old Multiplan releases&lt;/h2&gt;

&lt;p&gt;The &lt;a href=&quot;https://winworldpc.com/&quot;&gt;WinWorld&lt;/a&gt; abandonware site hosts &lt;a href=&quot;https://winworldpc.com/product/multiplan/&quot;&gt;downloads for Multiplan versions 1, 2, 3 and 4&lt;/a&gt; for various platforms, as well as some scanned user manuals. A word of warning though: it seems that later versions of the software only offer limited support for earlier versions of the format.&lt;/p&gt;

&lt;p&gt;As I was dealing with one of the earlier Multiplan formats here, I used the &lt;a href=&quot;https://winworldpc.com/download/c396c2a6-0f5d-3bc3-b611-c3a4c2a83d70&quot;&gt;Microsoft Multiplan 2.01 (5.25) (alt)&lt;/a&gt; for MS-DOS download, and most of what follows will cover this version and its associated peculiarities. Your mileage may vary for other versions and platforms.&lt;/p&gt;

&lt;h2 id=&quot;emulation-platform-dosbox-x&quot;&gt;Emulation platform: DOSBox-X&lt;/h2&gt;

&lt;p&gt;In order to run old MS-DOS software on a modern PC, we need some sort of emulation or virtualisation platform. For my initial tests I used an old virtual machine that I had set up ages ago in VirtualBox, using disk images of original MS-DOS and Windows 3.11 installation floppies. Since Tristan mentioned he had been using &lt;a href=&quot;https://dosbox-x.com/&quot;&gt;DOSBox-X&lt;/a&gt;, I then gave that a spin as well (I had never even used it before myself!). As it turned out, this worked even better than my VirtualBox setup, and it is also much simpler to set up and operate. So, at this point I decided to abandon my VirtualBox setup, and continue with DOSBox-X.&lt;/p&gt;

&lt;p&gt;For my tests I used DOSBox-X v2025.05.03, Linux SDL2 64-bit on Linux Mint (Flatpak installer). After launching the software, it shows the following welcome screen:&lt;/p&gt;

&lt;figure class=&quot;image&quot;&gt;
  &lt;img src=&quot;/images/2025/05/dosbox-startup.png&quot; /&gt;
&lt;/figure&gt;

&lt;h2 id=&quot;create-work-directory-and-mount-as-drive&quot;&gt;Create work directory and mount as drive&lt;/h2&gt;

&lt;p&gt;At startup DOSBox-X only shows a Z: drive, which is an internal virtual device that contains the DOS environment. In order to install software and work with our own files, we first need to create an empty work directory on our host machine (in my case a Linux desktop machine), and mount that as a new drive in DOSBox-X.&lt;/p&gt;

&lt;p&gt;To mount this work directory to a drive, go to the “Drive” menu in DOSBox-X, pick a drive letter from the drop-down list (I used “X”), and select “Mount folder as hard drive”. Then navigate to the work directory you just created. In response, DOSBox-X shows this confirmation dialog:&lt;/p&gt;

&lt;figure class=&quot;image&quot;&gt;
  &lt;img src=&quot;/images/2025/05/dosbox-drive-mounted.png&quot; /&gt;
&lt;/figure&gt;
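
&lt;p&gt;As an alternative to the menu, it should also be possible to mount the work directory from the DOSBox-X prompt with the built-in “mount” command. A minimal sketch, assuming the work directory is a (hypothetical) folder “~/multiplan-work” on the host:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;Z:\&amp;gt;mount X ~/multiplan-work
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;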

&lt;p&gt;We can set this newly created drive as the work directory in DOSBox-X using:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;Z:\&amp;gt;X:
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Then generate a directory listing with the “dir” command:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;X:\&amp;gt;dir
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This should result in something like this:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt; Volume in drive X is X_DRIVE
 Volume Serial Number is 0000-1234
 Directory of X:\

.              &amp;lt;DIR&amp;gt;            05/15/2025  3:22p
..             &amp;lt;DIR&amp;gt;            05/15/2025  3:23p
    0 File(s)                 0 Bytes
    2 Dir(s)      2,013,265,920 Bytes free
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h2 id=&quot;create-multiplan-installation-folder&quot;&gt;Create Multiplan installation folder&lt;/h2&gt;

&lt;p&gt;Now create an installation folder for Multiplan using this command:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;X:\&amp;gt;md mp_2
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h2 id=&quot;mount-multiplan-floppy-image-files&quot;&gt;Mount Multiplan floppy image files&lt;/h2&gt;

&lt;p&gt;Unpack your downloaded 7z file with the Multiplan binaries. After unpacking, you’ll see a directory “Microsoft Multiplan 2.01 (5.25)”, which contains two image files:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Install.img&lt;/li&gt;
  &lt;li&gt;Program.img&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These can be mounted as virtual floppy disks in DOSBox-X. To do so, go to the “Drive” menu again, pick the “A” drive, and select “Mount a disk or CD image file”. Then navigate to the folder with the Multiplan disk images, and select the “Install.img” file. This again brings up a confirmation dialog.&lt;/p&gt;
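
&lt;p&gt;The command-line counterpart of this menu action is the “imgmount” command. A sketch, assuming the unpacked image files were moved to a (hypothetical) host folder “~/multiplan-images” whose path contains no spaces:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;X:\&amp;gt;imgmount A ~/multiplan-images/Install.img -t floppy
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;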

&lt;p&gt;If all goes well, you can now access the image file like a regular floppy. For example, you can show a directory listing using:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;X:\&amp;gt;dir A:
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Result:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt; Volume in drive A has no label
 Volume Serial Number is 0000-1234
 Directory of A:\

CONVERTD EXE             34,944 02/10/1986 12:00p
IBMSETUP COM             19,808 01/05/1987  4:24p
INSTALL  COM              6,934 02/10/1986 12:00p
INSTALL  DAT             48,145 02/10/1986 12:00p
INSTALL  MSG             14,876 02/10/1986 12:00p
INSTALL  OVL             37,498 02/10/1986 12:00p
INSTALL  SPC                128 02/10/1986 12:00p
PATCH21  EXE             12,040 02/10/1986 12:00p
    8 File(s)           174,373 Bytes
    0 Dir(s)            183,296 Bytes free
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h2 id=&quot;copy-files-from-virtual-floppies&quot;&gt;Copy files from virtual floppies&lt;/h2&gt;

&lt;p&gt;I was unable to install this version of Multiplan directly from the virtual floppies&lt;sup id=&quot;fnref:2&quot;&gt;&lt;a href=&quot;#fn:2&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;1&lt;/a&gt;&lt;/sup&gt;. Instead, I managed to make things work by first copying all files from the virtual floppies to the installation folder in my work directory. To do this, first set the work directory to the Multiplan installation folder using:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;X:\&amp;gt;cd mp_2
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Then copy all files from the virtual floppy to this folder:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;X:\MP_2&amp;gt;copy a:\*.* .
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Result:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt; CONVERTD.EXE
 IBMSETUP.COM
 INSTALL.COM
 INSTALL.DAT
 INSTALL.MSG
 INSTALL.OVL
 INSTALL.SPC
 PATCH21.EXE
   8 File(s) copied.
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Now unmount the floppy (“Drive”/“A”/“Unmount drive”), and mount the second image file (“Program.img”). Repeat the above command to copy all files from the second floppy to the “mp_2” folder. This adds another 17 files.&lt;/p&gt;
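
&lt;p&gt;For those who prefer the prompt over the menus, the floppy swap might look like the sketch below. It assumes that “imgmount -u” unmounts the A: drive in your DOSBox-X version, and that the images live in a (hypothetical) host folder “~/multiplan-images”:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;X:\MP_2&amp;gt;imgmount -u A
X:\MP_2&amp;gt;imgmount A ~/multiplan-images/Program.img -t floppy
X:\MP_2&amp;gt;copy a:\*.* .
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;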

&lt;h2 id=&quot;install-multiplan&quot;&gt;Install Multiplan&lt;/h2&gt;

&lt;p&gt;Install Multiplan by typing:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;X:\MP_2&amp;gt;install
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The installer will prompt you to enter a terminal type. Honestly I don’t know much about the differences between the myriad of choices here, but value “2” (Microsoft Windows) worked fine for me. After hitting the Return key the installation completes.&lt;/p&gt;

&lt;h2 id=&quot;installation-of-later-multiplan-versions&quot;&gt;Installation of later Multiplan versions&lt;/h2&gt;

&lt;p&gt;The installation procedure is more straightforward for later Multiplan versions. For example, to install &lt;a href=&quot;https://winworldpc.com/download/f3e515e9-39ec-11ee-8470-0200008a0da4&quot;&gt;the final 4.20 release&lt;/a&gt;, you can simply mount the image file of the first installation floppy to the A: drive, and then run:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;X:\&amp;gt;a:\setup
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This launches an interactive installer that will install Multiplan to a user-defined directory on your virtual hard drive. The installer prompts you to mount the other floppy images along the way.&lt;/p&gt;

&lt;h2 id=&quot;run-multiplan&quot;&gt;Run Multiplan&lt;/h2&gt;

&lt;p&gt;To start Multiplan, simply type “mp” in the terminal&lt;sup id=&quot;fnref:6&quot;&gt;&lt;a href=&quot;#fn:6&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;2&lt;/a&gt;&lt;/sup&gt;:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;X:\MP_2&amp;gt;mp
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Now Multiplan launches with this gloriously oldschool interface:&lt;/p&gt;

&lt;figure class=&quot;image&quot;&gt;
  &lt;img src=&quot;/images/2025/05/multiplan-startup.png&quot; /&gt;
&lt;/figure&gt;

&lt;p&gt;At the bottom of the screen you see two rows of commands. You can select a command by typing its first letter (e.g. “t” will activate the “Transfer” command), or alternatively you can use the Tab key to move between the commands.&lt;/p&gt;

&lt;h2 id=&quot;load-a-file&quot;&gt;Load a file&lt;/h2&gt;

&lt;p&gt;To load a file, type “t”, which activates the “Transfer” command:&lt;/p&gt;

&lt;figure class=&quot;image&quot;&gt;
  &lt;img src=&quot;/images/2025/05/multiplan-transfer.png&quot; /&gt;
&lt;/figure&gt;

&lt;p&gt;Then type “l” (“Load”), and press one of the arrow keys on your keyboard. This results in a list of all files in the installation directory:&lt;/p&gt;

&lt;figure class=&quot;image&quot;&gt;
  &lt;img src=&quot;/images/2025/05/multiplan-load.png&quot; /&gt;
&lt;/figure&gt;

&lt;p&gt;All files with a “.mod” file extension here are Multiplan spreadsheets, so let’s load the first one (“amor.mod”):&lt;/p&gt;

&lt;figure class=&quot;image&quot;&gt;
  &lt;img src=&quot;/images/2025/05/multiplan-amor.png&quot; /&gt;
&lt;/figure&gt;

&lt;h2 id=&quot;load-your-own-files&quot;&gt;Load your own files&lt;/h2&gt;

&lt;p&gt;Confusingly, it seems that this Multiplan version doesn’t allow you to load files that are not in the installation directory&lt;sup id=&quot;fnref:1&quot;&gt;&lt;a href=&quot;#fn:1&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;3&lt;/a&gt;&lt;/sup&gt;. So, to load your own files, you first need to copy them to the Multiplan installation directory. Then instruct DOSBox-X to rescan the drive (“Drive”/“X”/“Rescan drive”). Next time you launch the “Load” command, the new files will appear. As an example, below I loaded one of &lt;a href=&quot;https://github.com/thorsted/PRONOM_Research/tree/main/Submissions/Microsoft%20Excel%20v1/Samples/MultiPlan%20v1%20DOS&quot;&gt;Tyler’s Multiplan 1 example files&lt;/a&gt;:&lt;/p&gt;

&lt;figure class=&quot;image&quot;&gt;
  &lt;img src=&quot;/images/2025/05/multiplan-thorsted.png&quot; /&gt;
&lt;/figure&gt;

&lt;h2 id=&quot;using-multiplan-data-in-modern-software&quot;&gt;Using Multiplan data in modern software&lt;/h2&gt;

&lt;p&gt;When I started these tests, I expected that it would be quite difficult to make data in Multiplan files available in modern software. Most online resources I found on this suggested using the original software to read any old Multiplan files, exporting the contents to &lt;a href=&quot;https://en.wikipedia.org/wiki/Symbolic_Link_(SYLK)&quot;&gt;SYLK&lt;/a&gt; format, and then importing that SYLK file into a modern spreadsheet application. Since neither Microsoft Excel nor LibreOffice Calc was able to read the file Tristan sent me, I initially tried that route. However, as far as I can tell, Multiplan 2 cannot export spreadsheets to SYLK (or any other format for that matter!).&lt;/p&gt;

&lt;h2 id=&quot;libreoffice-can-read-re-saved-file&quot;&gt;LibreOffice can read re-saved file&lt;/h2&gt;

&lt;p&gt;Unexpectedly, re-saving the file (which I suspect was originally created by one of the Multiplan 1.x releases) in Multiplan resulted in a version 2 file that I could open without problems in &lt;a href=&quot;https://www.libreoffice.org/discover/calc/&quot;&gt;LibreOffice Calc&lt;/a&gt;. To my surprise, Calc has &lt;a href=&quot;https://wiki.documentfoundation.org/Feature_Comparison:_LibreOffice_-_Microsoft_Office&quot;&gt;import support&lt;/a&gt; for the v1-3 DOS versions of the Multiplan format. It’s not entirely clear to me why the v1 source file couldn’t be opened, but perhaps this was simply some unsupported variant of the format. For such cases, re-saving the file&lt;sup id=&quot;fnref:4&quot;&gt;&lt;a href=&quot;#fn:4&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;4&lt;/a&gt;&lt;/sup&gt; (while of course keeping a copy of the original!) in Multiplan provides a useful way to make the data available in LibreOffice Calc. In turn, Calc can then be used to migrate the data to some modern format&lt;sup id=&quot;fnref:3&quot;&gt;&lt;a href=&quot;#fn:3&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;5&lt;/a&gt;&lt;/sup&gt;.&lt;/p&gt;
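
&lt;p&gt;For bulk processing, this migration step can be sketched on the host’s command line as follows. This assumes that LibreOffice’s “soffice” binary is on the path, that its format detection recognises the re-saved file, and that the file is called “sheet.mod” (a hypothetical name). The “ods” target format and the “converted” output folder are just examples:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;soffice --headless --convert-to ods --outdir converted sheet.mod
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;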

&lt;h2 id=&quot;export-data-using-the-print-command&quot;&gt;Export data using the Print command&lt;/h2&gt;

&lt;p&gt;Multiplan’s “Print” command (“p”) provides another (but less sophisticated) way to export data. In the simplest case we can use the print to file (“f”) command, which creates a plain text dump:&lt;/p&gt;

&lt;figure class=&quot;image&quot;&gt;
  &lt;img src=&quot;/images/2025/05/multiplan-print-file.png&quot; /&gt;
&lt;/figure&gt;

&lt;p&gt;For best results you might want to experiment with the “Margins” settings (“m”), as the default settings often won’t preserve the column structure.&lt;/p&gt;

&lt;p&gt;It’s also possible to print to a Postscript file. In order to make this work, you first need to check that your DOSBox-X configuration file contains the following settings&lt;sup id=&quot;fnref:5&quot;&gt;&lt;a href=&quot;#fn:5&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;6&lt;/a&gt;&lt;/sup&gt;:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;[parallel]
parallel1=printer

[printer]
printer=true
printoutput=ps
multipage=true
docpath=~/kb/multiplan-dosbox/export
timeout=1000
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;On my Linux Mint machine (on which I installed the Flatpak version of DOSBox-X) the configuration file is located at&lt;sup id=&quot;fnref:7&quot;&gt;&lt;a href=&quot;#fn:7&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;7&lt;/a&gt;&lt;/sup&gt;:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;~/.var/app/com.dosbox_x.DOSBox-X/config/dosbox-x
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;You need to close and re-start DOSBox-X for any changes in the configuration file to take effect.&lt;/p&gt;

&lt;p&gt;Now, if you print to “Printer”, Multiplan creates a Postscript file in the directory that is defined by the “docpath” configuration variable. You can easily convert this to PDF using &lt;a href=&quot;https://ghostscript.com/&quot;&gt;Ghostscript&lt;/a&gt;’s “ps2pdf” tool:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;ps2pdf doc1.ps
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
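
&lt;p&gt;If several print jobs have piled up in the “docpath” directory, a small shell loop can convert them in one go. A sketch, assuming a Bash-like shell that is run from the directory with the Postscript files:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;for f in *.ps; do
  ps2pdf &quot;$f&quot; &quot;${f%.ps}.pdf&quot;
done
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;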

&lt;p&gt;In my tests I noticed that printing to Postscript resulted in Postscript files where each page is rendered as a bitmapped image, which makes them of limited use. It’s not clear to me if this is a limitation of DOSBox-X, or something that can be resolved with some further configuration tweaks.&lt;/p&gt;

&lt;h2 id=&quot;final-remarks&quot;&gt;Final remarks&lt;/h2&gt;

&lt;p&gt;The Multiplan file format provides another nice example of &lt;a href=&quot;https://blog.dshr.org/2009/01/are-format-specifications-important-for.html&quot;&gt;David Rosenthal’s 2009 assertion&lt;/a&gt;:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;Clearly, formats with open source renderers are, for all practical purposes, immune from format obsolescence.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;For the specific case of Tristan’s file, the open source renderer did need a little help from the original software though. It also highlights the importance of open source software such as LibreOffice for keeping data in such legacy formats usable.&lt;/p&gt;

&lt;h2 id=&quot;acknowledgements&quot;&gt;Acknowledgements&lt;/h2&gt;

&lt;p&gt;Thanks are due to Tristan Zondag (the Netherlands Institute for Sound and Vision) for the various useful suggestions in our e-mail exchange about the Multiplan files.&lt;/p&gt;

&lt;h2 id=&quot;further-resources&quot;&gt;Further resources&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://preservation.tylerthorsted.com/2023/11/10/multiplan/&quot;&gt;Tyler Thorsted’s blog post on Multiplan&lt;/a&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://github.com/thorsted/PRONOM_Research/tree/main/Submissions/Microsoft%20Excel%20v1/Samples&quot;&gt;Tyler Thorsted’s Multiplan sample files&lt;/a&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://discmaster.textfiles.com/search?format=multiplanSpreadsheet&quot;&gt;Multiplan sample files in DiscMaster&lt;/a&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://winworldpc.com/product/multiplan/&quot;&gt;Multiplan downloads on WinWorld&lt;/a&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://dosbox-x.com/&quot;&gt;DOSBox-X website&lt;/a&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://www.avanet.nl/digitale-archeologie-graven-naar-oude-data/&quot;&gt;Blog post by Tristan Zondag about Sound and Vision’s digital archaeology project&lt;/a&gt; (in Dutch)&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;div class=&quot;footnotes&quot; role=&quot;doc-endnotes&quot;&gt;
  &lt;ol&gt;
    &lt;li id=&quot;fn:2&quot;&gt;
      &lt;p&gt;The software was meant to be run directly from a set of floppies, without an actual hard disk installation. This was not uncommon for early to mid ’80s software. &lt;a href=&quot;#fnref:2&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:6&quot;&gt;
      &lt;p&gt;For some reason this &lt;em&gt;only&lt;/em&gt; seems to work if the command is launched from the installation directory (e.g. typing the full path from some other directory did not work for me). &lt;a href=&quot;#fnref:6&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:1&quot;&gt;
      &lt;p&gt;According to the &lt;a href=&quot;https://winworldpc.com/download/79c2a9c3-82c3-9a6e-1911-c3a6e280947e&quot;&gt;documentation&lt;/a&gt; it should be possible to configure some other data location, but I wasn’t able to make this work. &lt;a href=&quot;#fnref:1&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:4&quot;&gt;
      &lt;p&gt;To re-save a file, type “t” to activate the “Transfer” command, and then “s” to save. &lt;a href=&quot;#fnref:4&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:3&quot;&gt;
      &lt;p&gt;LibreOffice also allows you to do this from the command line, which is useful for bulk processing. See: &lt;a href=&quot;https://help.libreoffice.org/latest/he/text/shared/guide/convertfilters.html&quot;&gt;https://help.libreoffice.org/latest/he/text/shared/guide/convertfilters.html&lt;/a&gt;. &lt;a href=&quot;#fnref:3&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:5&quot;&gt;
      &lt;p&gt;See &lt;a href=&quot;https://dosbox-x.com/wiki/Guide:Setting-up-printing-in-DOSBox%E2%80%90X#_example_print_to_postscript&quot;&gt;this explanation on the DOSBox-X wiki&lt;/a&gt;. &lt;a href=&quot;#fnref:5&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:7&quot;&gt;
      &lt;p&gt;This folder also contains a “capture” directory, which is where DOSBox-X stores any screenshots made with “Take screenshot” (F12+P) from the “Capture” menu. &lt;a href=&quot;#fnref:7&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
  &lt;/ol&gt;
&lt;/div&gt;
</description>
                <link>https://bitsgalore.org/2025/05/19/emulating-microsoft-multiplan-spreadsheets-in-dosbox-x</link>
                <guid>https://bitsgalore.org/2025/05/19/emulating-microsoft-multiplan-spreadsheets-in-dosbox-x</guid>
                <pubDate>2025-05-19T00:00:00+02:00</pubDate>
        </item>

        <item>
                <title>Changes to the blog&#58; migration to Codeberg and ActivityPub-based comments</title>
                <description>&lt;figure class=&quot;image&quot;&gt;
  &lt;img src=&quot;/images/2025/04/donkey-cart-codeberg.png&quot; alt=&quot;Illustration of a stylized, icon-like donkey that is pulling a cart with the words bitsgalore.org written on it, walking from left to right. On its left is the Github Octicon logo, and to its right the Codeberg logo, which depicts a mountain.&quot; /&gt;
  &lt;figcaption&gt;Donkey and cart icons licensed from &lt;a href=&quot;https://thenounproject.com/&quot;&gt;the Noun Project&lt;/a&gt;. Github Octicon icon from &lt;a href=&quot;https://en.m.wikipedia.org/wiki/File:Octicons-mark-github.svg&quot;&gt;Wikimedia Commons&lt;/a&gt;, released under &lt;a href=&quot;https://github.com/github/octicons/blob/master/LICENSE&quot;&gt;MIT&lt;/a&gt; license. Codeberg logo from &lt;a href=&quot;https://commons.wikimedia.org/wiki/File:Codeberg_logo.svg&quot;&gt;Wikimedia Commons&lt;/a&gt;, released under &lt;a href=&quot;https://creativecommons.org/publicdomain/zero/1.0/deed.en&quot;&gt;CC0 1.0&lt;/a&gt;.&lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;Ever since its start in late 2013, this blog has been hosted on Github Pages, using the &lt;a href=&quot;https://jekyllrb.com/&quot;&gt;Jekyll&lt;/a&gt; static site generator. On a technical level this always worked flawlessly, but in the current geopolitical climate I no longer want my site to be hosted by a US-based tech giant. After reviewing some options, I decided to migrate the site to &lt;a href=&quot;https://codeberg.page/&quot;&gt;Codeberg Pages&lt;/a&gt;, which is operated by a non-profit organization that is based in Germany. I also implemented a new comments system that is based on ActivityPub. This allows readers to post comments with a Fediverse (e.g. Mastodon) account.&lt;/p&gt;

&lt;!-- more --&gt;

&lt;h2 id=&quot;overview&quot;&gt;Overview&lt;/h2&gt;

&lt;p&gt;This post is quite technical in nature, and is primarily aimed at readers who are also considering moving their website away from Github. I’m assuming some familiarity with building and configuring static websites and Jekyll. First I give an overview of the main differences between Github Pages and Codeberg Pages. Then I explain how I implemented the ActivityPub-based comments system, and how I created an archive of the legacy (Github-based) comments. In the final sections I briefly explain the configuration I used to make the site work on my custom web domain.&lt;/p&gt;

&lt;h2 id=&quot;github-pages-vs-codeberg-pages&quot;&gt;Github pages vs Codeberg pages&lt;/h2&gt;

&lt;p&gt;Unlike Github, Codeberg doesn’t have a built-in Jekyll integration. This means that for a Jekyll-based website (like mine), you explicitly need to build the static HTML locally from the source, and then upload the HTML to a dedicated repo. There are various ways to do this. In my case, I simply created 2 separate repos: &lt;a href=&quot;https://codeberg.org/bitsgalore/blog-src&quot;&gt;one with the source data&lt;/a&gt;, and &lt;a href=&quot;https://codeberg.org/bitsgalore/pages&quot;&gt;a “pages” repo with the generated HTML&lt;/a&gt;. To make this setup work, you need to add a “destination” variable to the source repo’s &lt;a href=&quot;https://codeberg.org/bitsgalore/blog-src/src/branch/main/_config.yml&quot;&gt;_config.yml&lt;/a&gt; file, and set the value to the path of the “pages” repo. In my case this is a sibling directory called “pages”:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;destination: ../pages/
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;So, the main difference with the Github-based workflow is that pushing changes to the source repo doesn’t affect the live site, and that in order to update the live site one has to push changes to the “pages” repo&lt;sup id=&quot;fnref:2&quot;&gt;&lt;a href=&quot;#fn:2&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;1&lt;/a&gt;&lt;/sup&gt;.&lt;/p&gt;

&lt;h2 id=&quot;comments-section&quot;&gt;Comments section&lt;/h2&gt;

&lt;p&gt;In 2020 I added a comments feature to this site, which allowed readers to post comments &lt;a href=&quot;https://aristath.github.io/blog/static-site-comments-using-github-issues-api&quot;&gt;using Github issues&lt;/a&gt;. This introduced two challenges to the current migration. First, I had to think of an alternative commenting system that is independent of Github. Second, since I didn’t want to lose any existing comments, I had to devise a way to archive them and incorporate them into the site.&lt;/p&gt;

&lt;h2 id=&quot;activitypub-based-comments&quot;&gt;ActivityPub-based comments&lt;/h2&gt;

&lt;p&gt;I briefly considered a comments system based on Codeberg issues (which has an API that is similar to Github’s). I ultimately decided against this, mostly because of Codeberg’s much smaller user base. After some thought, I realised that &lt;a href=&quot;https://en.wikipedia.org/wiki/ActivityPub&quot;&gt;ActivityPub&lt;/a&gt;-based comments would be a much better option, as this allows anyone with a Fediverse (e.g. Mastodon) account to post comments. This idea isn’t new, and several others have been doing this already&lt;sup id=&quot;fnref:1&quot;&gt;&lt;a href=&quot;#fn:1&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;2&lt;/a&gt;&lt;/sup&gt;.&lt;/p&gt;

&lt;h2 id=&quot;how-it-works&quot;&gt;How it works&lt;/h2&gt;

&lt;p&gt;The way this works is actually pretty simple. Fediverse services such as Mastodon have a public API, which allows you to get a Toot and all its replies as structured data. As an example, consider this Toot:&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://digipres.club/@bitsgalore/114387914694269341&quot;&gt;https://digipres.club/@bitsgalore/114387914694269341&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The following API call, which uses the Toot’s unique identifier, returns the corresponding data in JSON format:&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://digipres.club/api/v1/statuses/114387914694269341/context&quot;&gt;https://digipres.club/api/v1/statuses/114387914694269341/context&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;With some JavaScript it’s possible to parse these data files, and inject the parsed content back into a page’s HTML. To a visitor this is then visible as a comments thread. See &lt;a href=&quot;https://jan.wildeboer.net/2023/02/Jekyll-Mastodon-Comments/&quot;&gt;Jan Wildeboer’s post&lt;/a&gt; for a more detailed explanation.&lt;/p&gt;
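
&lt;p&gt;As a rough illustration of the response structure (in Python rather than JavaScript, and not the code used on this site), here is a minimal sketch. It assumes the layout of Mastodon’s “context” response, in which all replies are listed under a “descendants” key. The sample data and the “list_replies” function name are made up for this example. A real response contains many more fields per reply, and the “content” field holds HTML rather than plain text.&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;import json

# Hypothetical, trimmed-down sample of a "context" response
SAMPLE = '''
{
  "ancestors": [],
  "descendants": [
    {"id": "2", "in_reply_to_id": "1",
     "created_at": "2025-04-30T10:00:00.000Z",
     "account": {"acct": "alice@example.social"},
     "content": "First reply"},
    {"id": "3", "in_reply_to_id": "2",
     "created_at": "2025-04-30T11:00:00.000Z",
     "account": {"acct": "bob@example.social"},
     "content": "Reply to the reply"}
  ]
}
'''


def list_replies(context_json):
    """Return (author, timestamp, content) tuples for all replies."""
    context = json.loads(context_json)
    return [(status["account"]["acct"],
             status["created_at"],
             status["content"])
            for status in context["descendants"]]


# Print one line per reply, in the order given by the response
for author, created, content in list_replies(SAMPLE):
    print(author, created, content)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;In a live setting the JSON string would come from the API call shown above instead of a hard-coded sample. The “in_reply_to_id” field can then be used to reconstruct nested threads.&lt;/p&gt;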

&lt;h2 id=&quot;implementation&quot;&gt;Implementation&lt;/h2&gt;

&lt;p&gt;For readers who are interested in the nitty-gritty of things, my implementation can be found &lt;a href=&quot;https://codeberg.org/bitsgalore/blog-src/src/branch/main/_includes/comments.html&quot;&gt;here&lt;/a&gt;. The JavaScript is actually a straightforward adaptation of &lt;a href=&quot;https://codeberg.org/bitsgalore/blog-src/src/commit/d0a4cc139143d497b2dca27c21a484fd35486dd2/_includes/comments.html&quot;&gt;the code I previously used for the Github comments&lt;/a&gt;. This requires that each page (blog post) for which comments are enabled has an “ap_id” identifier. This is the aforementioned unique Toot identifier, and it is defined in the YAML frontmatter of each post’s Markdown source file. Here’s an example (full file &lt;a href=&quot;https://codeberg.org/bitsgalore/blog-src/src/branch/main/_posts/2024/2024-10-30-jpeg-quality-estimation-using-simple-least-squares-matching-of-quantization-tables.md&quot;&gt;here&lt;/a&gt;):&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;---
ap_id: 114416014116498692
comment_id: 92
description: This post describes a simple method for estimating JPEG compression quality.
  It is based on a straightforward comparison of a file&apos;s quantization tables against
  the quantization tables from the JPEG standard using least squares matching. It
  also proposes a measure to characterize the similarity of an image&apos;s quantization
  tables to these standard tables, which is useful for assessing the accuracy of the
  quality estimate.
headImage: /images/2024/10/quality-sign.jpg
headImageAltText: Photograph of faded sign on building front showing the word &apos;Quality&apos;.
layout: post
tags:
- JPEG
- ImageMagick
- ExifTool
title: JPEG quality estimation using simple least squares matching of quantization
  tables
---
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h2 id=&quot;comments-bot&quot;&gt;Comments bot&lt;/h2&gt;

&lt;p&gt;Although I could have used my &lt;a href=&quot;https://digipres.club/@bitsgalore&quot;&gt;regular Fediverse account on digipres.club&lt;/a&gt; for initiating comment threads, people might unknowingly end up in my blog’s comments section by replying to a Toot. To reduce the chances of this happening, I set up a &lt;a href=&quot;https://digipres.club/@bitsgaloreBlogComments&quot;&gt;dedicated blog comments account&lt;/a&gt; instead.&lt;/p&gt;

&lt;p&gt;From the outset I wanted to enable commenting for &lt;em&gt;all&lt;/em&gt; posts on my blog (currently 80). This means that for each post the following actions are needed:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Create a Toot that initiates the comments thread.&lt;/li&gt;
  &lt;li&gt;Add the identifier of this Toot to the post’s YAML frontmatter as the “ap_id” variable.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Doing this manually for 80 posts would be very tedious, so instead I wrote a &lt;a href=&quot;https://codeberg.org/bitsgalore/commentsBot&quot;&gt;commentsBot&lt;/a&gt; application that completely automates this. It uses &lt;a href=&quot;https://github.com/halcy/Mastodon.py&quot;&gt;Mastodon.py&lt;/a&gt; to post to the blog comments Fediverse account, and &lt;a href=&quot;https://github.com/eyeseast/python-frontmatter&quot;&gt;Python Frontmatter&lt;/a&gt; to add the corresponding identifiers to the YAML frontmatter. I primarily wrote the comments bot for this blog, but I imagine the code might be useful to others as well.&lt;/p&gt;

&lt;h2 id=&quot;archiving-the-legacy-github-comments&quot;&gt;Archiving the legacy Github comments&lt;/h2&gt;

&lt;p&gt;To preserve the legacy (Github-based) comments, I first downloaded the JSON files that contain the comments data using &lt;a href=&quot;https://codeberg.org/bitsgalore/blogResources/src/branch/main/fetch-github-comments.py&quot;&gt;this script&lt;/a&gt;&lt;sup id=&quot;fnref:3&quot;&gt;&lt;a href=&quot;#fn:3&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;3&lt;/a&gt;&lt;/sup&gt;. I then added all these files to a &lt;a href=&quot;https://codeberg.org/bitsgalore/blog-src/src/branch/main/_data/comments-gh&quot;&gt;dedicated subdirectory of the source repo’s &lt;em&gt;_data&lt;/em&gt; folder&lt;/a&gt;. The base name of each JSON file corresponds to the value of the “comment_id” variable in the YAML frontmatter of each post’s Markdown source file.&lt;/p&gt;

&lt;p&gt;As an example, for &lt;a href=&quot;https://codeberg.org/bitsgalore/blog-src/src/branch/main/_posts/2020/2020-03-11-does-microsoft-onedrive-export-large-ZIP-files-that-are-corrupt.md&quot;&gt;this post&lt;/a&gt; the “comment_id” is 70. So, the corresponding data file with the comments is:&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://codeberg.org/bitsgalore/blog-src/src/branch/main/_data/comments-gh/70.json&quot;&gt;https://codeberg.org/bitsgalore/blog-src/src/branch/main/_data/comments-gh/70.json&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I wrote some &lt;a href=&quot;https://shopify.github.io/liquid/&quot;&gt;Liquid&lt;/a&gt; code that &lt;a href=&quot;https://codeberg.org/bitsgalore/blog-src/src/commit/e98f61bf65ba046dd132d90ac8db84da8ff72396/_includes/comments.html#L87&quot;&gt;parses the JSON into styled HTML&lt;/a&gt;. It uses the exact same CSS classes that are also used for the ActivityPub comments, so both are rendered with identical formatting. Scroll down to the comments section of &lt;a href=&quot;/2021/09/06/pdf-processing-and-analysis-with-open-source-tools.html&quot;&gt;this post&lt;/a&gt; to see what this looks like.&lt;/p&gt;

&lt;h2 id=&quot;making-the-domains-file-work&quot;&gt;Making the .domains file work&lt;/h2&gt;

&lt;p&gt;Another thing to watch out for: if (like me) you serve your site on a &lt;a href=&quot;https://docs.codeberg.org/codeberg-pages/using-custom-domain/&quot;&gt;custom domain&lt;/a&gt;, you need to define this domain (and any subdomains) in a “.domains” file (&lt;a href=&quot;https://codeberg.org/bitsgalore/blog-src/src/branch/main/.domains&quot;&gt;here’s mine&lt;/a&gt;). However, by default Jekyll ignores “hidden” files that start with “.”, which means they won’t be included in the generated HTML! This in turn will break the domain configuration. The solution is to add any “hidden” files or directories to the “include” variable in &lt;a href=&quot;https://codeberg.org/bitsgalore/blog-src/src/branch/main/_config.yml&quot;&gt;_config.yml&lt;/a&gt;:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# By default Jekyll ignores files and dirs starting with &quot;.&quot;, so need to bypass
# for &quot;.well-known&quot; dir and &quot;.domains&quot; file
include: [&quot;.well-known&quot;, &quot;.domains&quot;]
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h2 id=&quot;making-dnssec-work-with-custom-domains&quot;&gt;Making DNSSEC work with custom domains&lt;/h2&gt;

&lt;p&gt;My bitsgalore.org domain is secured with &lt;a href=&quot;https://en.wikipedia.org/wiki/Domain_Name_System_Security_Extensions&quot;&gt;DNSSEC&lt;/a&gt;. I initially had some difficulty getting it to work on the www subdomain. I eventually made this work (&lt;a href=&quot;https://codeberg.org/Codeberg/Community/issues/1881&quot;&gt;thanks to a suggestion by one of the Codeberg admins&lt;/a&gt;) using the following configuration:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Create A, AAAA and TXT records for both the apex domain and the www subdomain.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Don’t&lt;/strong&gt; create a CNAME record.&lt;/li&gt;
  &lt;li&gt;Define both the apex domain and the www subdomain in the .domains file.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;In my case this looks like this:&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th style=&quot;text-align: left&quot;&gt;Domain&lt;/th&gt;
      &lt;th style=&quot;text-align: left&quot;&gt;Type&lt;/th&gt;
      &lt;th style=&quot;text-align: left&quot;&gt;Data&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt; &lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;A&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;217.197.84.141&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;www&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;A&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;217.197.84.141&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt; &lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;AAAA&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;2a0a:4580:103f:c0de::2&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;www&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;AAAA&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;2a0a:4580:103f:c0de::2&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt; &lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;TXT&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;bitsgalore.codeberg.page&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;www&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;TXT&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;bitsgalore.codeberg.page&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;And &lt;a href=&quot;https://codeberg.org/bitsgalore/blog-src/src/branch/main/.domains&quot;&gt;my .domains file&lt;/a&gt; contains the following lines:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;bitsgalore.org
www.bitsgalore.org
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h2 id=&quot;final-thoughts&quot;&gt;Final thoughts&lt;/h2&gt;

&lt;p&gt;Unlike Github, Codeberg is run by a non-profit organization that is funded through donations. Many Fediverse/Mastodon instances are run by volunteers as well. If this post inspired you to move your website there as well, consider &lt;a href=&quot;https://donate.codeberg.org/&quot;&gt;making a donation to Codeberg&lt;/a&gt; (or &lt;a href=&quot;https://join.codeberg.org/&quot;&gt;become a Codeberg member&lt;/a&gt;), and check if your Fediverse instance accepts donations.&lt;/p&gt;

&lt;p&gt;Thanks are due to Codeberg and the admins at &lt;a href=&quot;https://digipres.club/&quot;&gt;digipres.club&lt;/a&gt; for making this possible!&lt;/p&gt;

&lt;h2 id=&quot;further-resources&quot;&gt;Further resources&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;https://codeberg.org/bitsgalore/blog-src/&quot;&gt;Blog source repo&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://codeberg.org/bitsgalore/commentsBot&quot;&gt;commentsBot code&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://codeberg.org/bitsgalore/blogResources&quot;&gt;blogResources&lt;/a&gt; - various utility scripts I’ve written for managing this website&lt;/li&gt;
&lt;/ul&gt;

&lt;div class=&quot;footnotes&quot; role=&quot;doc-endnotes&quot;&gt;
  &lt;ol&gt;
    &lt;li id=&quot;fn:2&quot;&gt;
      &lt;p&gt;I suppose it would also be possible to implement this as two separate branches in one single repo, but I haven’t looked into this. &lt;a href=&quot;#fnref:2&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:1&quot;&gt;
      &lt;p&gt;See e.g. &lt;a href=&quot;https://jan.wildeboer.net/2023/02/Jekyll-Mastodon-Comments/&quot;&gt;Jan Wildeboer&lt;/a&gt; and &lt;a href=&quot;https://cassidyjames.com/blog/fediverse-blog-comments-mastodon/&quot;&gt;Cassidy James Blaede&lt;/a&gt;. &lt;a href=&quot;#fnref:1&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:3&quot;&gt;
      &lt;p&gt;It’s important to use a Github token for this: although it’s possible to use the Github API without a token, you’re likely to hit the API’s rate limit pretty quickly, with the result that the script won’t download all JSON files. &lt;a href=&quot;#fnref:3&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
  &lt;/ol&gt;
&lt;/div&gt;
</description>
                <link>https://bitsgalore.org/2025/04/30/changes-to-the-blog-migration-to-codeberg-and-activitypub-based-comments</link>
                <guid>https://bitsgalore.org/2025/04/30/changes-to-the-blog-migration-to-codeberg-and-activitypub-based-comments</guid>
                <pubDate>2025-04-30T00:00:00+02:00</pubDate>
        </item>

        <item>
                <title>Y2K</title>
                <description>&lt;figure class=&quot;image&quot;&gt;
  &lt;img src=&quot;/images/2024/12/ddac-logo-oscilloscope.jpg&quot; alt=&quot;Digital Dark Age logo, which depicts a human figure holding a large floppy disk, with to its left the words Digital Dark Age Crew. The background shows the screen of an oscilloscope.&quot; /&gt;
  &lt;figcaption&gt;Digital Dark Age Crew logo.&lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;One of the most elusive items in the &lt;a href=&quot;/2022/11/03/wheel-out-the-digital-dark-age-klaxon&quot;&gt;Digital Dark Age Crew&lt;/a&gt; back catalogue is “Y2K”, which deals with the &lt;a href=&quot;https://en.wikipedia.org/wiki/Year_2000_problem&quot;&gt;Year 2000 problem&lt;/a&gt;. Originally planned as a December 1999 release, the track was never finished due to a succession of technical problems. Some early demos of “Y2K” have surfaced as bootlegs, and many fans of the group rate these amongst the most sought-after Digital Dark Age Crew tracks.&lt;/p&gt;

&lt;!-- more --&gt;

&lt;h2 id=&quot;lost-and-found&quot;&gt;Lost and found&lt;/h2&gt;

&lt;p&gt;The original recordings were thought to have been lost in the fire that occurred while the group were shooting their infamous 2007 “Watch the Digital Dark Age Crew Burn a Million PDFs” video. Earlier this year, the original DAT tapes of the “Y2K” sessions were unexpectedly found in a derelict taxidermy shop. The finder, who prefers to remain anonymous, kindly returned the tapes to sole remaining Digital Dark Age Crew member Marinus Nullbyte.&lt;/p&gt;

&lt;h2 id=&quot;definitive-version&quot;&gt;Definitive version&lt;/h2&gt;

&lt;p&gt;Listening to these early recordings again after 25 years, Nullbyte was struck by how the central message of the track hadn’t lost any of its power. So, Nullbyte embarked on a journey to finally finish “Y2K”, using the rough demos as a guide. This has resulted in the “definitive” version of “Y2K”, which will no doubt delight Digital Dark Age Crew fans all around the globe.&lt;/p&gt;

&lt;h2 id=&quot;video&quot;&gt;Video&lt;/h2&gt;

&lt;p&gt;Nullbyte also produced a brand new video to accompany the release of “Y2K”.&lt;/p&gt;

&lt;figure class=&quot;image&quot;&gt;
  &lt;img src=&quot;/images/2024/12/composite-y2k.jpg&quot; /&gt;
  &lt;figcaption&gt;Composite of stills from the video of the track &quot;Y2K&quot; by the Digital Dark Age Crew.&lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;Enjoy, and remember: tomorrow, we could all be living in the dark ages!&lt;/p&gt;

&lt;figure class=&quot;image&quot;&gt;
&lt;iframe width=&quot;750&quot; height=&quot;422&quot; src=&quot;https://www.youtube-nocookie.com/embed/pJ86meMFqT8&quot; title=&quot;YouTube video player&quot; frameborder=&quot;0&quot; allow=&quot;accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture&quot; allowfullscreen=&quot;&quot;&gt;&lt;/iframe&gt;
&lt;figcaption&gt;Digital Dark Age Crew - Y2K.&lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;Direct link to video on YouTube, in case the embedded player doesn’t work:&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://youtu.be/pJ86meMFqT8&quot;&gt;https://youtu.be/pJ86meMFqT8&lt;/a&gt;&lt;/p&gt;
</description>
                <link>https://bitsgalore.org/2024/12/30/y2k</link>
                <guid>https://bitsgalore.org/2024/12/30/y2k</guid>
                <pubDate>2024-12-30T00:00:00+01:00</pubDate>
        </item>

        <item>
                <title>PDF Quality assessment for digitisation batches with Python, PyMuPDF and Pillow</title>
                <description>&lt;figure class=&quot;image&quot;&gt;
  &lt;img src=&quot;/images/2024/12/Control_room_pt_tupper.jpg&quot; alt=&quot;Photo of the interior of the control room in a fossil fuel power plant. The walls at the back are completely covered with large control panels. In front of it an operator sits at a desk in his chair.&quot; /&gt;
  &lt;figcaption&gt;&lt;a href=&quot;https://commons.wikimedia.org/wiki/File:Control_room_pt_tupper.jpg&quot;&gt;Control room in Fossil fuel power plant in Point Tupper, Nova Scotia&lt;/a&gt;. Achim Hering, Public domain, via Wikimedia Commons.&lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;This post introduces &lt;a href=&quot;https://github.com/KBNLresearch/pdfquad/tree/main&quot;&gt;Pdfquad&lt;/a&gt;, a software tool for automated quality assessment of large digitisation batches. The software was developed specifically for the Digital Library for Dutch Literature (DBNL), but it might be adaptable to other users and organisations as well.&lt;/p&gt;

&lt;!-- more --&gt;

&lt;h2 id=&quot;context-of-this-work&quot;&gt;Context of this work&lt;/h2&gt;

&lt;p&gt;The Digital Library for Dutch Literature (&lt;a href=&quot;https://www.dbnl.org/&quot;&gt;DBNL&lt;/a&gt;) is a collection of literary texts from the Dutch language area. It is a collaboration between the Dutch Language Union, the Flanders Heritage Libraries and the KB. As part of its mission, DBNL digitises books and periodicals, which are then made available on its website.&lt;/p&gt;

&lt;p&gt;Most of the digitisation work is contracted out to external suppliers. Unlike most of the KB’s digitised collections, the DBNL material is scanned to PDF format. For each publication, two versions are created:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;A production master PDF, which serves as the source material for several derived products (PDF and EPUB versions with high quality, manually transcribed text).&lt;/li&gt;
  &lt;li&gt;A (relatively) small access PDF with the scanned pages, which is made available on the DBNL website.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href=&quot;https://www.dbnl.org/tekst/eeml001duis02_01/&quot;&gt;Here’s an example publication&lt;/a&gt;. Note the “Downloads” section, which contains links to the access PDFs and EPUB.&lt;/p&gt;

&lt;h2 id=&quot;pdf-requirements&quot;&gt;PDF requirements&lt;/h2&gt;

&lt;p&gt;The scanned PDFs need to conform to requirements that are defined in a technical specifications document, which was updated last year. It covers both capture quality and the technical characteristics of the PDF. The latter category includes things like:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Scanned pages must be encoded using JPEG compression, using 85% quality for the production master, and 50% quality for the access file.&lt;/li&gt;
  &lt;li&gt;Pages must be scanned in full-colour, with a colour space that is defined by an embedded ICC profile.&lt;/li&gt;
  &lt;li&gt;Encryption and any other access restrictions are not allowed.&lt;/li&gt;
  &lt;li&gt;Digital signatures and watermarks are not allowed.&lt;/li&gt;
  &lt;li&gt;Any content besides the scans is not allowed (no layers or active content).&lt;/li&gt;
  &lt;li&gt;The PDF must not open with thumbnails.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Thus far, conformance to such technical aspects was mostly tested manually on a sample basis, using a variety of software tools. This situation is not ideal, especially given the fact that the PDFs are delivered as part of huge production batches. At the request of our Digitisation department, I looked into a way to do these quality checks in a more systematic and scalable manner. This has resulted in the &lt;a href=&quot;https://github.com/KBNLresearch/pdfquad&quot;&gt;Pdfquad&lt;/a&gt; software. The name is an acronym for “PDF QUality Assessment for Digitisation batches”. It can be used to automatically assess PDF documents in a digitisation batch against one or more sets of user-defined technical requirements.&lt;/p&gt;

&lt;h2 id=&quot;general-concept&quot;&gt;General concept&lt;/h2&gt;

&lt;p&gt;The general idea behind Pdfquad isn’t entirely new. In 2013 I wrote the &lt;a href=&quot;https://github.com/KBNLresearch/jprofile&quot;&gt;Jprofile&lt;/a&gt; tool, which does automated quality checks on batches of JP2 (JPEG 2000 Part 1) images. Jprofile uses &lt;a href=&quot;https://jpylyzer.openpreservation.org/&quot;&gt;Jpylyzer&lt;/a&gt; to extract technical characteristics of a JP2 file, and then evaluates these characteristics against a set of technical requirements that are encoded as &lt;a href=&quot;https://en.wikipedia.org/wiki/Schematron&quot;&gt;Schematron&lt;/a&gt; rules. This software was used for several years operationally by our digitisation department&lt;sup id=&quot;fnref:1&quot;&gt;&lt;a href=&quot;#fn:1&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;1&lt;/a&gt;&lt;/sup&gt;.&lt;/p&gt;

&lt;p&gt;Even though Jprofile was designed for a different file format, the overall concept is equally usable for batches with PDFs. So, I took the old Jprofile code as a starting point, removed all JPEG 2000 specific code, adapted it to our PDF case, and made it quite a bit more flexible in the process.&lt;/p&gt;

&lt;h2 id=&quot;pdfquad-overview&quot;&gt;Pdfquad overview&lt;/h2&gt;

&lt;p&gt;The following figure gives a general overview of how Pdfquad works:&lt;/p&gt;

&lt;figure class=&quot;image&quot;&gt;
  &lt;img src=&quot;/images/2024/12/pdfquad-flow.png&quot; alt=&quot;Flowchart that shows high-level overview of Pdfquad&apos;s functional components.&quot; /&gt;
  &lt;figcaption&gt;Overview of Pdfquad&apos;s functional components and workflow.&lt;/figcaption&gt; 
&lt;/figure&gt;

&lt;p&gt;For each PDF in a batch, it extracts the relevant technical characteristics, which are serialized as XML. This XML is then assessed against a set of Schematron rules, which represent the requirements. There may be different schemas (e.g. for production masters and access PDFs). A profile defines which schema must be used for a specific PDF, based on its file name or the name of its direct parent directory. Finally, the results of both the feature extraction and the Schematron assessment are written to a set of output files. In the following sections I will describe these components in more detail.&lt;/p&gt;

&lt;h2 id=&quot;pdf-characteristics-vs-image-characteristics&quot;&gt;PDF characteristics vs image characteristics&lt;/h2&gt;

&lt;p&gt;Some of the DBNL PDF requirements reflect technical characteristics at the PDF document level (e.g. the presence of encryption), whereas aspects like JPEG quality must be evaluated at the level of the image stream. In most cases, an image in PDF is implemented as an &lt;a href=&quot;https://github.com/pdf-association/arlington-pdf-model/blob/master/tsv/latest/XObjectImage.tsv&quot;&gt;Image XObject&lt;/a&gt;. This is a PDF dictionary with various key-value pairs that describe the image properties, followed by the image stream. In our case, we expect the image stream to contain data in JPEG format. This means we cannot just use any PDF feature extraction tool (e.g. &lt;a href=&quot;https://verapdf.org/&quot;&gt;VeraPDF&lt;/a&gt;), as such tools are unable to provide the necessary information at the image stream level.&lt;/p&gt;

&lt;h2 id=&quot;feature-extraction-with-pymupdf-and-pillow&quot;&gt;Feature extraction with PyMuPDF and Pillow&lt;/h2&gt;

&lt;p&gt;As a solution to this, Pdfquad uses the &lt;a href=&quot;https://pymupdf.readthedocs.io/&quot;&gt;PyMuPDF&lt;/a&gt; library to parse the PDF. PyMuPDF is also used to extract all relevant PDF-level characteristics. For each image, it then passes the image stream data to the &lt;a href=&quot;https://pillow.readthedocs.io/&quot;&gt;Python Imaging Library&lt;/a&gt; (Pillow), which is used to extract the image-level characteristics. With this combination of PyMuPDF and Pillow, I was able to extract all information that is needed to verify all but one of our PDF requirements. The exception here was JPEG quality, which doesn’t immediately follow from Pillow’s output. However, Pillow can report the quantization tables of a JPEG image, and these can be used to estimate JPEG quality. My &lt;a href=&quot;/2024/10/30/jpeg-quality-estimation-using-simple-least-squares-matching-of-quantization-tables&quot;&gt;earlier post on JPEG quality estimation&lt;/a&gt; covers this in detail, and presents a least-squares matching method that is also used in Pdfquad.&lt;/p&gt;

&lt;h2 id=&quot;feature-extraction-walkthrough&quot;&gt;Feature extraction walkthrough&lt;/h2&gt;

&lt;p&gt;It’s worth pointing out that Pdfquad’s feature extraction is currently limited to those characteristics that are needed to verify the DBNL requirements. However, extending the extraction to additional characteristics would be pretty straightforward. Readers who are interested in this (or in feature extraction with PyMuPDF and Pillow in general) might want to check out Pdfquad’s &lt;a href=&quot;https://github.com/KBNLresearch/pdfquad/blob/main/pdfquad/properties.py&quot;&gt;properties module&lt;/a&gt;. Here’s a brief walkthrough:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;The &lt;a href=&quot;https://github.com/KBNLresearch/pdfquad/blob/7f33e7820e61a07ffb1f5c2b6c3e6e9084a57c82/pdfquad/properties.py#L57C5-L57C18&quot;&gt;getProperties&lt;/a&gt; function parses a PDF document, extracts some document-level characteristics and then iterates over all pages.&lt;/li&gt;
  &lt;li&gt;The &lt;a href=&quot;https://github.com/KBNLresearch/pdfquad/blob/7f33e7820e61a07ffb1f5c2b6c3e6e9084a57c82/pdfquad/properties.py#L177&quot;&gt;getPageProperties&lt;/a&gt; function extracts all page-level characteristics, and iterates over all images on a page.&lt;/li&gt;
  &lt;li&gt;The &lt;a href=&quot;https://github.com/KBNLresearch/pdfquad/blob/7f33e7820e61a07ffb1f5c2b6c3e6e9084a57c82/pdfquad/properties.py#L209&quot;&gt;getImageProperties&lt;/a&gt; function processes an image object, and calls two functions that extract the dictionary-level and stream-level characteristics, respectively.&lt;/li&gt;
  &lt;li&gt;The &lt;a href=&quot;https://github.com/KBNLresearch/pdfquad/blob/7f33e7820e61a07ffb1f5c2b6c3e6e9084a57c82/pdfquad/properties.py#L232&quot;&gt;getImageDictProperties&lt;/a&gt; function extracts all characteristics at the image dictionary level.&lt;/li&gt;
  &lt;li&gt;The &lt;a href=&quot;https://github.com/KBNLresearch/pdfquad/blob/7f33e7820e61a07ffb1f5c2b6c3e6e9084a57c82/pdfquad/properties.py#L253&quot;&gt;getImageStreamProperties&lt;/a&gt; function extracts all characteristics at the image stream level, using Pillow.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The feature extraction code also keeps track of any exceptions that may occur during parsing. These usually indicate a problem with the PDF, and the DBNL Schematron includes rules that verify their absence.&lt;/p&gt;

&lt;h2 id=&quot;profiles-and-schemas&quot;&gt;Profiles and schemas&lt;/h2&gt;

&lt;p&gt;Pdfquad needs to understand the structure of a digitisation batch, and how the PDFs inside it must be evaluated. As an example, let’s assume we have the following batch structure:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;└── 20241105
    └── _boe012192401
        ├── 300dpi-50
        │   └── _boe012192401_01.pdf
        └── 300dpi-85
            └── _boe012192401_01.pdf
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Here we have a directory tree where each “300dpi-85” directory contains a high-quality production master PDF, and each “300dpi-50” directory an access PDF. In order to analyse this batch structure, Pdfquad needs to know two things:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;A file or directory name pattern that differentiates between production master and access PDFs.&lt;/li&gt;
  &lt;li&gt;A reference to the Schematron file that is used to evaluate both.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Both are defined in an XML-formatted profile, which Pdfquad expects as a command-line argument. Here’s an example:&lt;/p&gt;

&lt;div class=&quot;language-xml highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;cp&quot;&gt;&amp;lt;?xml version=&quot;1.0&quot;?&amp;gt;&lt;/span&gt;

&lt;span class=&quot;nt&quot;&gt;&amp;lt;profile&amp;gt;&lt;/span&gt;

&lt;span class=&quot;nt&quot;&gt;&amp;lt;schema&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;type=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;parentDirName&quot;&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;match=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;endswith&quot;&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;pattern=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;pi-85&quot;&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;&amp;gt;&lt;/span&gt;pdf-dbnl-85.sch&lt;span class=&quot;nt&quot;&gt;&amp;lt;/schema&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;schema&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;type=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;parentDirName&quot;&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;match=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;endswith&quot;&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;pattern=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;pi-50&quot;&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;&amp;gt;&lt;/span&gt;pdf-dbnl-50.sch&lt;span class=&quot;nt&quot;&gt;&amp;lt;/schema&amp;gt;&lt;/span&gt;

&lt;span class=&quot;nt&quot;&gt;&amp;lt;/profile&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Here, the profile contains two &lt;em&gt;schema&lt;/em&gt; elements, which represent the production master and access PDF, respectively. Each element refers to a corresponding Schematron file. In this case, the values of the “type”, “match” and “pattern” attributes in the first element tell Pdfquad that if the name of a PDF’s direct parent directory ends with “pi-85”, it must use Schematron file “pdf-dbnl-85.sch”.&lt;/p&gt;

&lt;p&gt;This file, which can be &lt;a href=&quot;https://github.com/KBNLresearch/pdfquad/blob/main/pdfquad/schemas/pdf-dbnl-85.sch&quot;&gt;viewed here&lt;/a&gt;, contains the Schematron rules that are used to evaluate the extracted characteristics.&lt;/p&gt;
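
&lt;p&gt;To make the matching logic concrete, here’s a minimal Python sketch of how such a profile could be applied to a PDF’s path. It is an illustration only, not Pdfquad’s actual code: the function name &lt;em&gt;select_schema&lt;/em&gt; is made up for this example, and it only handles the “parentDirName” / “endswith” combination shown in the example profile.&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;import os
import xml.etree.ElementTree as ET

# The example profile, embedded as a string to keep the sketch self-contained
PROFILE = &quot;&quot;&quot;&amp;lt;profile&amp;gt;
&amp;lt;schema type=&quot;parentDirName&quot; match=&quot;endswith&quot; pattern=&quot;pi-85&quot;&amp;gt;pdf-dbnl-85.sch&amp;lt;/schema&amp;gt;
&amp;lt;schema type=&quot;parentDirName&quot; match=&quot;endswith&quot; pattern=&quot;pi-50&quot;&amp;gt;pdf-dbnl-50.sch&amp;lt;/schema&amp;gt;
&amp;lt;/profile&amp;gt;&quot;&quot;&quot;


def select_schema(pdf_path, profile_xml):
    &quot;&quot;&quot;Return the Schematron file of the first matching rule, or None.&quot;&quot;&quot;
    # Name of the PDF&apos;s direct parent directory, e.g. &quot;300dpi-85&quot;
    parent = os.path.basename(os.path.dirname(pdf_path))
    for rule in ET.fromstring(profile_xml).iter(&quot;schema&quot;):
        # Hypothetical simplification: skip anything other than the
        # type/match combination used in the example profile
        if rule.get(&quot;type&quot;) != &quot;parentDirName&quot; or rule.get(&quot;match&quot;) != &quot;endswith&quot;:
            continue
        if parent.endswith(rule.get(&quot;pattern&quot;)):
            return rule.text.strip()
    return None


print(select_schema(&quot;20241105/_boe012192401/300dpi-85/_boe012192401_01.pdf&quot;, PROFILE))
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;For the production master in the example batch this prints “pdf-dbnl-85.sch”. A PDF whose parent directory matches none of the rules yields None, which a caller would have to handle separately.&lt;/p&gt;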

&lt;p&gt;Although the profile and schemas that are part of the current Pdfquad release are specific to the DBNL case, it is possible to use the software with custom profiles and schemas. This could extend its applicability to other PDF-based digitisation workflows and batch structures.&lt;/p&gt;

&lt;h2 id=&quot;output&quot;&gt;Output&lt;/h2&gt;

&lt;p&gt;For each batch, Pdfquad produces two types of output:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;A comma-delimited summary file. This lists, for each PDF, whether the Schematron assessment completed without errors, the outcome of the Schematron assessment, the number of pages, and a reference to the comprehensive output file (see below).&lt;/li&gt;
  &lt;li&gt;One or more comprehensive output files in XML format. These contain all extracted properties, as well as the Schematron report and the assessment status. &lt;a href=&quot;https://github.com/KBNLresearch/pdfquad/blob/main/examples/pq_batchtest_001.xml&quot;&gt;Here’s an example&lt;/a&gt;.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The comprehensive output can get really large. Because of this, Pdfquad splits the results across multiple output files, using the following naming convention (the actual file prefixes can be defined as a command-line option):&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;pq_mybatch_001.xml&lt;/li&gt;
  &lt;li&gt;pq_mybatch_002.xml&lt;/li&gt;
  &lt;li&gt;etcetera&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;By default, Pdfquad limits the number of reported PDFs for each output file to 10, after which it creates a new file. There’s a command-line option to change this behaviour.&lt;/p&gt;
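
&lt;p&gt;As a rough illustration of this splitting scheme, the sketch below maps a PDF’s position in the batch to an output file name. The function name, its parameters and the way the prefix is passed are assumptions made for this example; Pdfquad’s own implementation may well differ.&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;def output_file_name(prefix, pdf_index, max_pdfs=10):
    &quot;&quot;&quot;Return the output file name for the PDF at zero-based position pdf_index.&quot;&quot;&quot;
    # Integer division groups PDFs 0-9 into file 1, 10-19 into file 2, and so on
    file_number = pdf_index // max_pdfs + 1
    # Zero-padded, three-digit sequence number, as in pq_mybatch_001.xml
    return &quot;{}_{:03d}.xml&quot;.format(prefix, file_number)


for index in (0, 9, 10, 25):
    print(index, output_file_name(&quot;pq_mybatch&quot;, index))
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;With the default limit of 10, the first ten PDFs end up in “pq_mybatch_001.xml”, the eleventh starts “pq_mybatch_002.xml”, and the PDF at position 25 lands in “pq_mybatch_003.xml”.&lt;/p&gt;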

&lt;h2 id=&quot;final-thoughts&quot;&gt;Final thoughts&lt;/h2&gt;

&lt;p&gt;Pdfquad is still in early development, and the current feature extraction component in particular is limited to the specific needs of our DBNL digitisation workflow. However, I would expect that the software might be useful for others as well. The profile and schema definitions offer a lot of flexibility for different batch structures, without any changes to the code. This was a deliberate design choice, as we may use the software for PDFs from some other digitisation projects in the future. Meanwhile, the feature extraction module could be extended quite easily to include additional characteristics.&lt;/p&gt;

&lt;p&gt;On a different note, I was pleasantly surprised by PyMuPDF. I had never worked with it before (or any other Python PDF library for that matter), but using it turned out to be unexpectedly straightforward. I will definitely return to it for future PDF-related work, and other PDF/Python heads should probably check it out.&lt;/p&gt;

&lt;h2 id=&quot;link-to-pdfquad&quot;&gt;Link to Pdfquad&lt;/h2&gt;

&lt;p&gt;Pdfquad and its documentation can be found here:&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://github.com/KBNLresearch/pdfquad/tree/main&quot;&gt;PDF QUality Assessment for Digitisation batches&lt;/a&gt;&lt;/p&gt;

&lt;div class=&quot;footnotes&quot; role=&quot;doc-endnotes&quot;&gt;
  &lt;ol&gt;
    &lt;li id=&quot;fn:1&quot;&gt;
      &lt;p&gt;It has since been replaced by third-party software with the same functionality (which also uses Jpylyzer). &lt;a href=&quot;#fnref:1&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
  &lt;/ol&gt;
&lt;/div&gt;
</description>
                <link>https://bitsgalore.org/2024/12/13/pdf-quality-assessment-for-digitisation-batches-with-python-pymupdf-and-pillow</link>
                <guid>https://bitsgalore.org/2024/12/13/pdf-quality-assessment-for-digitisation-batches-with-python-pymupdf-and-pillow</guid>
                <pubDate>2024-12-13T00:00:00+01:00</pubDate>
        </item>

        <item>
                <title>Escape from the phantom of the PDF</title>
                <description>&lt;figure class=&quot;image&quot;&gt;
  &lt;img src=&quot;/images/2024/11/ghost.jpg&quot; alt=&quot;Aquatint showing a graveyard scene in front of a church. At the center is a man in armour with a distressed expression on his face. To his left is a skeleton, and to his right a ghost.&quot; /&gt;
  &lt;figcaption&gt;&lt;a href=&quot;https://wellcomecollection.org/works/sfqkzeu2&quot;&gt;A man in armour is confronted by a ghost and a skeleton. &lt;/a&gt;Aquatint. Wellcome Collection, Public Domain.&lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;In a &lt;a href=&quot;https://digitalpreservation.fi/en/2024-phantom-pdf-file&quot;&gt;recent blog post&lt;/a&gt;, colleagues at the National Digital Preservation Services in Finland addressed an issue with PDF files that contain strings with octal escape sequences. These are not parsed correctly by &lt;a href=&quot;https://jhove.openpreservation.org/&quot;&gt;JHOVE&lt;/a&gt;, and the resulting parse errors ultimately lead to (seemingly unrelated) validation errors. The authors argue that octal escape sequences present a preservation risk, as they may confuse other software besides JHOVE. Since this claim is not backed up by any evidence, here I put this to the test using 8 different PDF processing tools and libraries.&lt;/p&gt;

&lt;!-- more --&gt;

&lt;h2 id=&quot;the-phantom-of-a-pdf-file&quot;&gt;The Phantom of a PDF File&lt;/h2&gt;

&lt;p&gt;A few weeks ago I came across &lt;a href=&quot;https://digitalpreservation.fi/en/2024-phantom-pdf-file&quot;&gt;The Phantom of a PDF File&lt;/a&gt;, a blog post by Juha Lehtonen &amp;amp; Johan Kylander. In this post, they address an issue with a &lt;a href=&quot;https://digitalpreservation.fi/sites/default/files/2024/phantom_of_a_pdf_file_blog_post_2024.pdf&quot;&gt;PDF file&lt;/a&gt; that gives an unexpected validation error in &lt;a href=&quot;https://jhove.openpreservation.org/&quot;&gt;JHOVE&lt;/a&gt;. Digging into the PDF structure, they were able to track this down to the presence of &lt;a href=&quot;https://en.wikipedia.org/wiki/Escape_sequences_in_C&quot;&gt;octal escape sequences&lt;/a&gt; in the “Producer” field of the file’s Document Information Dictionary:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;9 0 obj
&amp;lt;&amp;lt;
/Title (Boo)
/CreationDate (D:20241029134330Z00&apos;00&apos;) 
/Producer (\376\377\000P\000D\000F\000 \000P\000h\000a\000n\000t\000o\000m\000\000)
&amp;gt;&amp;gt;
endobj
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;JHOVE cannot handle octal escape sequences properly, which leads to a parse error that ultimately results in JHOVE reporting a validation error that is completely unrelated to the “Producer” field. In their original post, the authors claimed that the way the octal escape sequences are used in this file is not allowed by the PDF specification. They also advised against the use of octal escape sequences, and to restrict the values in PDF metadata fields to plain ASCII, or otherwise UTF-16BE. Finally they argue that the presence of octal escape sequences&lt;sup id=&quot;fnref:1&quot;&gt;&lt;a href=&quot;#fn:1&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;1&lt;/a&gt;&lt;/sup&gt; “raises the risk of causing problems in the future”, and that “JHOVE probably is not the only software that will get confused” by this.&lt;/p&gt;
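
&lt;p&gt;To illustrate what a parser has to do with such a string, here’s a minimal Python sketch that decodes the “Producer” value shown above. It is not taken from any of the tools discussed in this post, and the function name is made up for this example. It only handles octal escapes (a backslash followed by one to three octal digits) and the FE FF byte order mark that marks a text string as UTF-16BE. Other escape sequences are ignored, and falling back to Latin-1 instead of PDFDocEncoding is a simplification.&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;import re

# Content of the Producer string, without the enclosing parentheses
RAW = r&quot;\376\377\000P\000D\000F\000 \000P\000h\000a\000n\000t\000o\000m\000\000&quot;


def decode_pdf_text_string(raw):
    &quot;&quot;&quot;Decode octal escapes, then interpret the resulting bytes as text.&quot;&quot;&quot;
    # Replace each backslash + 1-3 octal digits by the byte it represents
    data = re.sub(
        rb&quot;\\([0-7]{1,3})&quot;,
        lambda m: bytes([int(m.group(1), 8) % 256]),
        raw.encode(&quot;latin-1&quot;),
    )
    # A leading FE FF byte order mark signals UTF-16BE
    if data.startswith(b&quot;\xfe\xff&quot;):
        return data[2:].decode(&quot;utf-16-be&quot;)
    # Simplification: PDFDocEncoding is approximated by Latin-1 here
    return data.decode(&quot;latin-1&quot;)


print(repr(decode_pdf_text_string(RAW)))
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This prints &apos;PDF Phantom\x00&apos;: the expected text, followed by a null character that is part of the original string.&lt;/p&gt;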

&lt;h2 id=&quot;is-jhove-the-only-software-that-gets-confused-by-octal-escape-sequences&quot;&gt;Is JHOVE the only software that gets confused by octal escape sequences?&lt;/h2&gt;

&lt;p&gt;After reading this post, I posted a link to it in &lt;a href=&quot;https://github.com/openpreserve/jhove/issues/927&quot;&gt;this existing ticket on PDF parse issues&lt;/a&gt;. This immediately resulted in &lt;a href=&quot;https://github.com/openpreserve/jhove/issues/927#issuecomment-2454914910&quot;&gt;a response by Peter Wyatt&lt;/a&gt; of the &lt;a href=&quot;https://pdfa.org/&quot;&gt;PDF Association&lt;/a&gt;, who pointed out that, contrary to the claims made in the initial version of the post, the use of octal escape sequences in the “problematic” file is actually perfectly valid according to &lt;a href=&quot;https://opensource.adobe.com/dc-acrobat-sdk-docs/pdfstandards/PDF32000_2008.pdf&quot;&gt;ISO 32000-1&lt;/a&gt;. In response to this comment, the authors corrected this in an update to their post. However, they seem to stand by their claim that “JHOVE probably is not the only software that will get confused” by octal escape sequences. Out of curiosity, I decided to put this to the test.&lt;/p&gt;

&lt;h2 id=&quot;enter-the-opf-phantom&quot;&gt;Enter the OPF Phantom&lt;/h2&gt;

&lt;p&gt;The &lt;a href=&quot;https://digitalpreservation.fi/en/2024-phantom-pdf-file&quot;&gt;original file&lt;/a&gt; contains two “Producer” fields: one in the Document Information Dictionary (which is the one we’re interested in here), and another one that is part of the XMP metadata. Since both have the same value (“PDF Phantom”), I changed the XMP value to “OPF Phantom” in a hex editor. This way, we can easily distinguish between both fields. The modified file &lt;a href=&quot;https://github.com/user-attachments/files/17685339/phantom_modified_xmp.pdf&quot;&gt;is available here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Next, I ran the modified file through 8 different PDF tools and libraries, and inspected the result to see how the “Producer” field is handled. Below are the commands and results.&lt;/p&gt;

&lt;h2 id=&quot;exiftool&quot;&gt;&lt;a href=&quot;https://exiftool.org/&quot;&gt;ExifTool&lt;/a&gt;&lt;/h2&gt;

&lt;p&gt;Command:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;exiftool -X phantom_modified_xmp.pdf
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;Result:&lt;/p&gt;

&lt;div class=&quot;language-xml highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;cp&quot;&gt;&amp;lt;?xml version=&apos;1.0&apos; encoding=&apos;UTF-8&apos;?&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;rdf:RDF&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;xmlns:rdf=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&apos;http://www.w3.org/1999/02/22-rdf-syntax-ns#&apos;&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;&amp;gt;&lt;/span&gt;

&lt;span class=&quot;nt&quot;&gt;&amp;lt;rdf:Description&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;rdf:about=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&apos;phantom_modified_xmp.pdf&apos;&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;xmlns:et=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&apos;http://ns.exiftool.org/1.0/&apos;&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;et:toolkit=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&apos;Image::ExifTool 12.60&apos;&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;xmlns:ExifTool=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&apos;http://ns.exiftool.org/ExifTool/1.0/&apos;&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;xmlns:System=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&apos;http://ns.exiftool.org/File/System/1.0/&apos;&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;xmlns:File=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&apos;http://ns.exiftool.org/File/1.0/&apos;&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;xmlns:PDF=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&apos;http://ns.exiftool.org/PDF/PDF/1.0/&apos;&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;xmlns:XMP-x=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&apos;http://ns.exiftool.org/XMP/XMP-x/1.0/&apos;&lt;/span&gt;
  &lt;span class=&quot;na&quot;&gt;xmlns:XMP-pdf=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&apos;http://ns.exiftool.org/XMP/XMP-pdf/1.0/&apos;&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;&amp;gt;&lt;/span&gt;
 &lt;span class=&quot;nt&quot;&gt;&amp;lt;ExifTool:ExifToolVersion&amp;gt;&lt;/span&gt;12.60&lt;span class=&quot;nt&quot;&gt;&amp;lt;/ExifTool:ExifToolVersion&amp;gt;&lt;/span&gt;
 &lt;span class=&quot;nt&quot;&gt;&amp;lt;System:FileName&amp;gt;&lt;/span&gt;phantom_modified_xmp.pdf&lt;span class=&quot;nt&quot;&gt;&amp;lt;/System:FileName&amp;gt;&lt;/span&gt;
 &lt;span class=&quot;nt&quot;&gt;&amp;lt;System:Directory&amp;gt;&lt;/span&gt;.&lt;span class=&quot;nt&quot;&gt;&amp;lt;/System:Directory&amp;gt;&lt;/span&gt;
 &lt;span class=&quot;nt&quot;&gt;&amp;lt;System:FileSize&amp;gt;&lt;/span&gt;5.9 kB&lt;span class=&quot;nt&quot;&gt;&amp;lt;/System:FileSize&amp;gt;&lt;/span&gt;
 &lt;span class=&quot;nt&quot;&gt;&amp;lt;System:FileModifyDate&amp;gt;&lt;/span&gt;2024:11:09 00:22:05+00:00&lt;span class=&quot;nt&quot;&gt;&amp;lt;/System:FileModifyDate&amp;gt;&lt;/span&gt;
 &lt;span class=&quot;nt&quot;&gt;&amp;lt;System:FileAccessDate&amp;gt;&lt;/span&gt;2024:11:09 00:22:46+00:00&lt;span class=&quot;nt&quot;&gt;&amp;lt;/System:FileAccessDate&amp;gt;&lt;/span&gt;
 &lt;span class=&quot;nt&quot;&gt;&amp;lt;System:FileInodeChangeDate&amp;gt;&lt;/span&gt;2024:11:09 00:22:05+00:00&lt;span class=&quot;nt&quot;&gt;&amp;lt;/System:FileInodeChangeDate&amp;gt;&lt;/span&gt;
 &lt;span class=&quot;nt&quot;&gt;&amp;lt;System:FilePermissions&amp;gt;&lt;/span&gt;-rw-rw-r--&lt;span class=&quot;nt&quot;&gt;&amp;lt;/System:FilePermissions&amp;gt;&lt;/span&gt;
 &lt;span class=&quot;nt&quot;&gt;&amp;lt;File:FileType&amp;gt;&lt;/span&gt;PDF&lt;span class=&quot;nt&quot;&gt;&amp;lt;/File:FileType&amp;gt;&lt;/span&gt;
 &lt;span class=&quot;nt&quot;&gt;&amp;lt;File:FileTypeExtension&amp;gt;&lt;/span&gt;pdf&lt;span class=&quot;nt&quot;&gt;&amp;lt;/File:FileTypeExtension&amp;gt;&lt;/span&gt;
 &lt;span class=&quot;nt&quot;&gt;&amp;lt;File:MIMEType&amp;gt;&lt;/span&gt;application/pdf&lt;span class=&quot;nt&quot;&gt;&amp;lt;/File:MIMEType&amp;gt;&lt;/span&gt;
 &lt;span class=&quot;nt&quot;&gt;&amp;lt;PDF:PDFVersion&amp;gt;&lt;/span&gt;1.4&lt;span class=&quot;nt&quot;&gt;&amp;lt;/PDF:PDFVersion&amp;gt;&lt;/span&gt;
 &lt;span class=&quot;nt&quot;&gt;&amp;lt;PDF:Linearized&amp;gt;&lt;/span&gt;No&lt;span class=&quot;nt&quot;&gt;&amp;lt;/PDF:Linearized&amp;gt;&lt;/span&gt;
 &lt;span class=&quot;nt&quot;&gt;&amp;lt;PDF:PageCount&amp;gt;&lt;/span&gt;1&lt;span class=&quot;nt&quot;&gt;&amp;lt;/PDF:PageCount&amp;gt;&lt;/span&gt;
 &lt;span class=&quot;nt&quot;&gt;&amp;lt;PDF:Title&amp;gt;&lt;/span&gt;Boo&lt;span class=&quot;nt&quot;&gt;&amp;lt;/PDF:Title&amp;gt;&lt;/span&gt;
 &lt;span class=&quot;nt&quot;&gt;&amp;lt;PDF:CreateDate&amp;gt;&lt;/span&gt;2024:10:29 13:43:30Z&lt;span class=&quot;nt&quot;&gt;&amp;lt;/PDF:CreateDate&amp;gt;&lt;/span&gt;
 &lt;span class=&quot;nt&quot;&gt;&amp;lt;PDF:Producer&amp;gt;&lt;/span&gt;PDF Phantom&lt;span class=&quot;nt&quot;&gt;&amp;lt;/PDF:Producer&amp;gt;&lt;/span&gt;
 &lt;span class=&quot;nt&quot;&gt;&amp;lt;XMP-x:XMPToolkit&amp;gt;&lt;/span&gt;Image::ExifTool 12.71&lt;span class=&quot;nt&quot;&gt;&amp;lt;/XMP-x:XMPToolkit&amp;gt;&lt;/span&gt;
 &lt;span class=&quot;nt&quot;&gt;&amp;lt;XMP-pdf:Producer&amp;gt;&lt;/span&gt;OPF Phantom&lt;span class=&quot;nt&quot;&gt;&amp;lt;/XMP-pdf:Producer&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;/rdf:Description&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;/rdf:RDF&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;ExifTool correctly decodes the octal escape sequences (PDF:Producer), and also extracts the XMP value (XMP-pdf:Producer).&lt;/p&gt;

&lt;h2 id=&quot;pdfcpu&quot;&gt;&lt;a href=&quot;https://pdfcpu.io/&quot;&gt;Pdfcpu&lt;/a&gt;&lt;/h2&gt;

&lt;p&gt;Command:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;pdfcpu info phantom_modified_xmp.pdf
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Result:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;         PDF version: 1.4
          Page count: 1
           Page size: 595.28 x 841.89 points
............................................
               Title: Boo
              Author: 
             Subject: 
        PDF Producer: PDF Phantom
     Content creator: 
       Creation date: D:20241029134330Z00&apos;00&apos;
   Modification date: 
............................................
              Tagged: No
              Hybrid: No
          Linearized: No
  Using XRef streams: No
Using object streams: No
         Watermarked: No
............................................
           Encrypted: No
         Permissions: Full access
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Pdfcpu correctly decodes the octal escape sequences (PDF Producer).&lt;/p&gt;

&lt;h2 id=&quot;pdfinfo-poppler&quot;&gt;pdfinfo (&lt;a href=&quot;https://poppler.freedesktop.org/&quot;&gt;Poppler&lt;/a&gt;)&lt;/h2&gt;

&lt;p&gt;Command:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;pdfinfo phantom_modified_xmp.pdf
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Result:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;Title:          Boo
Producer:       PDF Phantom
CreationDate:   Tue Oct 29 14:43:30 2024 CET
Tagged:         no
UserProperties: no
Suspects:       no
Form:           none
JavaScript:     no
Pages:          1
Encrypted:      no
Page size:      595.276 x 841.89 pts (A4)
Page rot:       0
File size:      5906 bytes
Optimized:      no
PDF version:    1.4
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Poppler correctly decodes the octal escape sequences (Producer).&lt;/p&gt;

&lt;h2 id=&quot;verapdf&quot;&gt;&lt;a href=&quot;https://verapdf.org/&quot;&gt;VeraPDF&lt;/a&gt;&lt;/h2&gt;

&lt;p&gt;Command:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;verapdf --off --extract phantom_modified_xmp.pdf
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The result includes:&lt;/p&gt;

&lt;div class=&quot;language-xml highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nt&quot;&gt;&amp;lt;informationDict&amp;gt;&lt;/span&gt;
  &lt;span class=&quot;nt&quot;&gt;&amp;lt;entry&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;key=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;Title&quot;&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;&amp;gt;&lt;/span&gt;Boo&lt;span class=&quot;nt&quot;&gt;&amp;lt;/entry&amp;gt;&lt;/span&gt;
  &lt;span class=&quot;nt&quot;&gt;&amp;lt;entry&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;key=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;Producer&quot;&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;&amp;gt;&lt;/span&gt;PDF Phantom#x000000&lt;span class=&quot;nt&quot;&gt;&amp;lt;/entry&amp;gt;&lt;/span&gt;
  &lt;span class=&quot;nt&quot;&gt;&amp;lt;entry&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;key=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;CreationDate&quot;&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;&amp;gt;&lt;/span&gt;2024-10-29T13:43:30.000Z&lt;span class=&quot;nt&quot;&gt;&amp;lt;/entry&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;/informationDict&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;VeraPDF correctly decodes the octal escape sequences. Note the null character at the end (which is actually part of the object string).&lt;/p&gt;

&lt;h2 id=&quot;apache-tika&quot;&gt;&lt;a href=&quot;https://tika.apache.org/&quot;&gt;Apache Tika&lt;/a&gt;&lt;/h2&gt;

&lt;p&gt;Command:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;java -jar ~/tika/tika-app-2.9.2.jar phantom_modified_xmp.pdf
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Result:&lt;/p&gt;

&lt;div class=&quot;language-html highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;cp&quot;&gt;&amp;lt;?xml version=&quot;1.0&quot; encoding=&quot;UTF-8&quot;?&amp;gt;&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;&amp;lt;html&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;xmlns=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;http://www.w3.org/1999/xhtml&quot;&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;head&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;meta&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;name=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;pdf:PDFVersion&quot;&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;content=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;1.4&quot;&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;/&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;meta&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;name=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;pdf:docinfo:title&quot;&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;content=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;Boo&quot;&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;/&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;meta&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;name=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;pdf:hasXFA&quot;&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;content=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;false&quot;&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;/&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;meta&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;name=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;access_permission:modify_annotations&quot;&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;content=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;true&quot;&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;/&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;meta&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;name=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;access_permission:can_print_degraded&quot;&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;content=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;true&quot;&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;/&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;meta&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;name=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;dcterms:created&quot;&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;content=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;2024-10-29T13:43:30Z&quot;&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;/&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;meta&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;name=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;dc:format&quot;&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;content=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;application/pdf; version=1.4&quot;&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;/&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;meta&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;name=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;access_permission:fill_in_form&quot;&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;content=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;true&quot;&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;/&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;meta&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;name=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;pdf:hasCollection&quot;&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;content=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;false&quot;&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;/&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;meta&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;name=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;pdf:encrypted&quot;&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;content=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;false&quot;&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;/&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;meta&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;name=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;dc:title&quot;&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;content=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;Boo&quot;&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;/&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;meta&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;name=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;Content-Length&quot;&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;content=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;5906&quot;&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;/&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;meta&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;name=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;pdf:hasMarkedContent&quot;&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;content=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;false&quot;&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;/&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;meta&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;name=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;Content-Type&quot;&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;content=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;application/pdf&quot;&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;/&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;meta&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;name=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;pdf:producer&quot;&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;content=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;OPF Phantom&quot;&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;/&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;meta&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;name=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;access_permission:extract_for_accessibility&quot;&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;content=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;true&quot;&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;/&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;meta&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;name=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;access_permission:assemble_document&quot;&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;content=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;true&quot;&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;/&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;meta&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;name=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;xmpTPg:NPages&quot;&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;content=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;1&quot;&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;/&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;meta&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;name=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;resourceName&quot;&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;content=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;phantom_modified_xmp.pdf&quot;&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;/&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;meta&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;name=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;pdf:hasXMP&quot;&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;content=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;true&quot;&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;/&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;meta&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;name=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;access_permission:extract_content&quot;&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;content=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;true&quot;&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;/&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;meta&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;name=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;access_permission:can_print&quot;&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;content=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;true&quot;&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;/&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;meta&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;name=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;X-TIKA:Parsed-By&quot;&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;content=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;org.apache.tika.parser.DefaultParser&quot;&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;/&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;meta&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;name=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;X-TIKA:Parsed-By&quot;&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;content=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;org.apache.tika.parser.pdf.PDFParser&quot;&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;/&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;meta&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;name=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;access_permission:can_modify&quot;&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;content=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;true&quot;&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;/&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;meta&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;name=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;pdf:docinfo:producer&quot;&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;content=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;PDF Phantom�&quot;&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;/&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;meta&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;name=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;pdf:docinfo:created&quot;&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;content=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;2024-10-29T13:43:30Z&quot;&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;/&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;title&amp;gt;&lt;/span&gt;Boo&lt;span class=&quot;nt&quot;&gt;&amp;lt;/title&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;/head&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;body&amp;gt;&amp;lt;div&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;class=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;page&quot;&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;&amp;gt;&amp;lt;p/&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;/div&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;/body&amp;gt;&amp;lt;/html&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Tika correctly decodes the octal escape sequences (field pdf:docinfo:producer); like VeraPDF it shows the null character at the end of the string. Note that Tika also reports the XMP Producer value (field pdf:producer).&lt;/p&gt;

&lt;h2 id=&quot;qpdf&quot;&gt;&lt;a href=&quot;https://qpdf.sourceforge.io/&quot;&gt;Qpdf&lt;/a&gt;&lt;/h2&gt;

&lt;p&gt;Command:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;qpdf --json phantom_modified_xmp.pdf
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Output contains:&lt;/p&gt;

&lt;div class=&quot;language-json highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nl&quot;&gt;&quot;9 0 R&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
  &lt;/span&gt;&lt;span class=&quot;nl&quot;&gt;&quot;/CreationDate&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;D:20241029134330Z00&apos;00&apos;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
  &lt;/span&gt;&lt;span class=&quot;nl&quot;&gt;&quot;/Producer&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;PDF Phantom&lt;/span&gt;&lt;span class=&quot;se&quot;&gt;\u&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;0000&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
  &lt;/span&gt;&lt;span class=&quot;nl&quot;&gt;&quot;/Title&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;Boo&quot;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Qpdf correctly decodes the octal escape sequences, including the trailing null character.&lt;/p&gt;

&lt;h2 id=&quot;pdftk&quot;&gt;&lt;a href=&quot;https://www.pdflabs.com/tools/pdftk-server/&quot;&gt;Pdftk&lt;/a&gt;&lt;/h2&gt;

&lt;p&gt;Command:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;pdftk phantom_modified_xmp.pdf dump_data
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;Result:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;InfoBegin
InfoKey: CreationDate
InfoValue: D:20241029134330Z00&amp;amp;apos;00&amp;amp;apos;
InfoBegin
InfoKey: Producer
InfoValue: PDF Phantom
InfoBegin
InfoKey: Title
InfoValue: Boo
PdfID0: 71a810587639eb130aefddee35e3c49d
PdfID1: 71a810587639eb130aefddee35e3c49d
NumberOfPages: 1
PageMediaBegin
PageMediaNumber: 1
PageMediaRotation: 0
PageMediaRect: 0 0 595.276 841.89
PageMediaDimensions: 595.276 841.89
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Pdftk correctly decodes the octal escape sequences.&lt;/p&gt;

&lt;h2 id=&quot;pymupdf&quot;&gt;&lt;a href=&quot;https://pymupdf.readthedocs.io/&quot;&gt;PyMuPDF&lt;/a&gt;&lt;/h2&gt;

&lt;p&gt;I tested PyMuPDF with this simple test script:&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pprint&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pymupdf&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;myPDF&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;phantom_modified_xmp.pdf&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;doc&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pymupdf&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;open&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;myPDF&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;metadata&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;doc&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;metadata&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;pprint&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;pp&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;metadata&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Result:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;{&apos;format&apos;: &apos;PDF 1.4&apos;,
 &apos;title&apos;: &apos;Boo&apos;,
 &apos;author&apos;: &apos;&apos;,
 &apos;subject&apos;: &apos;&apos;,
 &apos;keywords&apos;: &apos;&apos;,
 &apos;creator&apos;: &apos;&apos;,
 &apos;producer&apos;: &apos;PDF Phantom&apos;,
 &apos;creationDate&apos;: &quot;D:20241029134330Z00&apos;00&apos;&quot;,
 &apos;modDate&apos;: &apos;&apos;,
 &apos;trapped&apos;: &apos;&apos;,
 &apos;encryption&apos;: None}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;PyMuPDF correctly decodes the octal escape sequences.&lt;/p&gt;

&lt;h2 id=&quot;conclusion&quot;&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;All the above tools and libraries were able to decode the octal escape sequences without any problems. So JHOVE’s behaviour really seems to be the exception here, rather than the rule. Octal escape sequences are both allowed by the standard and widely supported by PDF processing tools. In conclusion, I would argue that they pose little, if any, preservation risk.&lt;/p&gt;

&lt;h2 id=&quot;acknowledgments&quot;&gt;Acknowledgments&lt;/h2&gt;

&lt;p&gt;Thanks are due to Peter Wyatt (PDF Association) for his helpful comments in &lt;a href=&quot;https://github.com/openpreserve/jhove/issues/927&quot;&gt;this GitHub thread&lt;/a&gt;.&lt;/p&gt;

&lt;div class=&quot;footnotes&quot; role=&quot;doc-endnotes&quot;&gt;
  &lt;ol&gt;
    &lt;li id=&quot;fn:1&quot;&gt;
      &lt;p&gt;Actually they refer to this as a “dual encoding”, but this term is confusing because the octal escape sequences aren’t really an “encoding” at all, but rather a part of the lexical definition of PDF literal string objects. See &lt;a href=&quot;https://github.com/openpreserve/jhove/issues/927#issuecomment-2466229030&quot;&gt;this follow-up comment by Peter Wyatt&lt;/a&gt;. &lt;a href=&quot;#fnref:1&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
  &lt;/ol&gt;
&lt;/div&gt;
</description>
                <link>https://bitsgalore.org/2024/11/14/escape-from-the-phantom-of-the-pdf</link>
                <guid>https://bitsgalore.org/2024/11/14/escape-from-the-phantom-of-the-pdf</guid>
                <pubDate>2024-11-14T00:00:00+01:00</pubDate>
        </item>

        <item>
                <title>JPEG quality estimation using simple least squares matching of quantization tables</title>
                <description>&lt;figure class=&quot;image&quot;&gt;
  &lt;img src=&quot;/images/2024/10/quality-sign.jpg&quot; alt=&quot;Photograph of faded sign on building front showing the word &apos;Quality&apos;.&quot; /&gt;
  &lt;figcaption&gt;Adapted from &lt;a href=&quot;https://www.flickr.com/photos/120143184@N05/47939868992/&quot;&gt;Quality Coal&lt;/a&gt; by &lt;a href=&quot;https://www.flickr.com/photos/120143184@N05/&quot;&gt;Greenville Daily Photo&lt;/a&gt;. Used under &lt;a href=&quot;https://creativecommons.org/publicdomain/zero/1.0/&quot;&gt;CC0 1.0 license&lt;/a&gt;.&lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;In my &lt;a href=&quot;/2024/10/23/jpeg-quality-estimation-experiments-with-a-modified-imagemagick-heuristic&quot;&gt;previous post&lt;/a&gt; I addressed several problems I ran into when I tried to estimate the “last saved” quality level of JPEG images. It described some experiments based on &lt;a href=&quot;https://imagemagick.org/&quot;&gt;ImageMagick&lt;/a&gt;’s quality heuristic, which led to a &lt;a href=&quot;https://github.com/KBNLresearch/jpeg-quality-demo/blob/main/jpegquality-im-modified.py&quot;&gt;Python implementation of a modified version of the heuristic&lt;/a&gt; that improves the behaviour for images with a quality of 50% or less.&lt;/p&gt;

&lt;p&gt;I still wasn’t entirely happy with this solution. This was partially because ImageMagick’s heuristic uses &lt;em&gt;aggregated&lt;/em&gt; coefficients of the image’s quantization tables, which makes it potentially vulnerable to collisions. Another concern was that the reasoning behind certain details of ImageMagick’s heuristic seems rather opaque (at least to me!).&lt;/p&gt;

&lt;p&gt;In this post I explore a different approach to JPEG quality estimation, which is based on a straightforward comparison with “standard” JPEG quantization tables using least squares matching. I also propose a measure that characterizes how similar an image’s quantization tables are to its closest “standard” tables. This could be useful as a measure of confidence in the quality estimate. I present some tests where I compare the results of the least squares matching method with those of the ImageMagick heuristics. I also discuss the results of a simple sensitivity analysis.&lt;/p&gt;

&lt;!-- more --&gt;

&lt;h2 id=&quot;jpeg-quality-and-standard-quantization-tables&quot;&gt;JPEG quality and standard quantization tables&lt;/h2&gt;

&lt;p&gt;ImageMagick’s JPEG quality heuristic is based on the “standard” quantization tables that are defined in Annex K of &lt;a href=&quot;http://www.w3.org/Graphics/JPEG/itu-t81.pdf&quot;&gt;the JPEG standard&lt;/a&gt;. Its overall objective seems to be to match an image’s quantization tables with the most similar “standard” quantization tables. Since the quality level of each of these “standard” tables is known, this then provides the quality estimate.&lt;/p&gt;

&lt;p&gt;ImageMagick’s heuristic does this in an indirect (and to me somewhat opaque) way, possibly to avoid computational cost. This post explores a more straightforward approach, which simply compares the coefficients in an image’s quantization tables against the corresponding coefficients in the “standard” tables, and then returns the quality level of the best match&lt;sup id=&quot;fnref:3&quot;&gt;&lt;a href=&quot;#fn:3&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;1&lt;/a&gt;&lt;/sup&gt;.&lt;/p&gt;

&lt;h2 id=&quot;scaling-of-standard-tables-to-quality-levels&quot;&gt;Scaling of standard tables to quality levels&lt;/h2&gt;

&lt;p&gt;To understand how this works, it’s first important to know that Annex K in the &lt;a href=&quot;http://www.w3.org/Graphics/JPEG/itu-t81.pdf&quot;&gt;JPEG standard&lt;/a&gt; describes two “standard” quantization tables, one for luminance and one for chrominance. Here’s the luminance table:&lt;/p&gt;

&lt;table&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;16&lt;/td&gt;
      &lt;td&gt;11&lt;/td&gt;
      &lt;td&gt;10&lt;/td&gt;
      &lt;td&gt;16&lt;/td&gt;
      &lt;td&gt;24&lt;/td&gt;
      &lt;td&gt;40&lt;/td&gt;
      &lt;td&gt;51&lt;/td&gt;
      &lt;td&gt;61&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;12&lt;/td&gt;
      &lt;td&gt;12&lt;/td&gt;
      &lt;td&gt;14&lt;/td&gt;
      &lt;td&gt;19&lt;/td&gt;
      &lt;td&gt;26&lt;/td&gt;
      &lt;td&gt;58&lt;/td&gt;
      &lt;td&gt;60&lt;/td&gt;
      &lt;td&gt;55&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;14&lt;/td&gt;
      &lt;td&gt;13&lt;/td&gt;
      &lt;td&gt;16&lt;/td&gt;
      &lt;td&gt;24&lt;/td&gt;
      &lt;td&gt;40&lt;/td&gt;
      &lt;td&gt;57&lt;/td&gt;
      &lt;td&gt;69&lt;/td&gt;
      &lt;td&gt;56&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;14&lt;/td&gt;
      &lt;td&gt;17&lt;/td&gt;
      &lt;td&gt;22&lt;/td&gt;
      &lt;td&gt;29&lt;/td&gt;
      &lt;td&gt;51&lt;/td&gt;
      &lt;td&gt;87&lt;/td&gt;
      &lt;td&gt;80&lt;/td&gt;
      &lt;td&gt;62&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;18&lt;/td&gt;
      &lt;td&gt;22&lt;/td&gt;
      &lt;td&gt;37&lt;/td&gt;
      &lt;td&gt;56&lt;/td&gt;
      &lt;td&gt;68&lt;/td&gt;
      &lt;td&gt;109&lt;/td&gt;
      &lt;td&gt;103&lt;/td&gt;
      &lt;td&gt;77&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;24&lt;/td&gt;
      &lt;td&gt;35&lt;/td&gt;
      &lt;td&gt;55&lt;/td&gt;
      &lt;td&gt;64&lt;/td&gt;
      &lt;td&gt;81&lt;/td&gt;
      &lt;td&gt;104&lt;/td&gt;
      &lt;td&gt;113&lt;/td&gt;
      &lt;td&gt;92&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;49&lt;/td&gt;
      &lt;td&gt;64&lt;/td&gt;
      &lt;td&gt;78&lt;/td&gt;
      &lt;td&gt;87&lt;/td&gt;
      &lt;td&gt;103&lt;/td&gt;
      &lt;td&gt;121&lt;/td&gt;
      &lt;td&gt;120&lt;/td&gt;
      &lt;td&gt;101&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;72&lt;/td&gt;
      &lt;td&gt;92&lt;/td&gt;
      &lt;td&gt;95&lt;/td&gt;
      &lt;td&gt;98&lt;/td&gt;
      &lt;td&gt;112&lt;/td&gt;
      &lt;td&gt;100&lt;/td&gt;
      &lt;td&gt;103&lt;/td&gt;
      &lt;td&gt;99&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;This is the base table with coefficients that are valid for quality level 50. The tables for all other quality levels can be derived from this base table using Equations 1 and 2 from &lt;a href=&quot;https://www.sciencedirect.com/science/article/pii/S1742287608000285&quot;&gt;this paper by Kornblum (2008)&lt;/a&gt;&lt;sup id=&quot;fnref:12&quot;&gt;&lt;a href=&quot;#fn:12&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;2&lt;/a&gt;&lt;/sup&gt;. First, for each quality level &lt;em&gt;Q&lt;/em&gt; we can calculate a corresponding scaling factor &lt;em&gt;S&lt;/em&gt;:&lt;/p&gt;

&lt;math xmlns=&quot;http://www.w3.org/1998/Math/MathML&quot;&gt;
  &lt;mrow&gt;
    &lt;mi&gt;S&lt;/mi&gt;
    &lt;mo&gt;=&lt;/mo&gt;
    &lt;mfrac&gt;
      &lt;mrow&gt;
        &lt;mn&gt;5000&lt;/mn&gt;
      &lt;/mrow&gt;
      &lt;mrow&gt;
        &lt;mi&gt;Q&lt;/mi&gt;
      &lt;/mrow&gt;
    &lt;/mfrac&gt;
    &lt;mspace width=&quot;6em&quot; /&gt;
    &lt;mo&gt;(&lt;/mo&gt;
    &lt;mi&gt;Q&lt;/mi&gt;
    &lt;mo&gt;&amp;lt;&lt;/mo&gt;
    &lt;mn&gt;50&lt;/mn&gt;
    &lt;mo&gt;)&lt;/mo&gt;
  &lt;/mrow&gt;
&lt;/math&gt;

&lt;math xmlns=&quot;http://www.w3.org/1998/Math/MathML&quot;&gt;
  &lt;mrow&gt;
    &lt;mi&gt;S&lt;/mi&gt;
    &lt;mo&gt;=&lt;/mo&gt;
    &lt;mn&gt;200&lt;/mn&gt;
    &lt;mo&gt;-&lt;/mo&gt;
    &lt;mn&gt;2&lt;/mn&gt;
    &lt;mo&gt;&amp;middot;&lt;/mo&gt;
    &lt;mi&gt;Q&lt;/mi&gt;
    &lt;mspace width=&quot;5em&quot; /&gt;
    &lt;mo&gt;(&lt;/mo&gt;
    &lt;mi&gt;Q&lt;/mi&gt;
    &lt;mo&gt;&amp;ge;&lt;/mo&gt;
    &lt;mn&gt;50&lt;/mn&gt;
    &lt;mo&gt;)&lt;/mo&gt;
  &lt;/mrow&gt;
&lt;/math&gt;

&lt;p&gt;&lt;em&gt;S&lt;/em&gt; is then used to calculate scaled quantization coefficients &lt;em&gt;T&lt;sup&gt;i&lt;/sup&gt;&lt;sub&gt;s&lt;/sub&gt;&lt;/em&gt; from the base coefficients &lt;em&gt;T&lt;sup&gt;i&lt;/sup&gt;&lt;sub&gt;b&lt;/sub&gt;&lt;/em&gt; using the following equation:&lt;/p&gt;

&lt;math xmlns=&quot;http://www.w3.org/1998/Math/MathML&quot;&gt;
  &lt;mrow&gt;
    &lt;msubsup&gt;
      &lt;mi&gt;T&lt;/mi&gt;
      &lt;mi&gt;s&lt;/mi&gt;
      &lt;mi&gt;i&lt;/mi&gt;
    &lt;/msubsup&gt;
    &lt;mo&gt;=&lt;/mo&gt;
    &lt;mo&gt;max&lt;/mo&gt;
    &lt;mo&gt;(&lt;/mo&gt;
    &lt;mo&gt;&amp;#x230A;&lt;/mo&gt;
    &lt;mfrac&gt;
      &lt;mrow&gt;
        &lt;mi&gt;S&lt;/mi&gt;
        &lt;mo&gt;&amp;middot;&lt;/mo&gt;
        &lt;msubsup&gt;
          &lt;mi&gt;T&lt;/mi&gt;
          &lt;mi&gt;b&lt;/mi&gt;
          &lt;mi&gt;i&lt;/mi&gt;
        &lt;/msubsup&gt;
        &lt;mo&gt;+&lt;/mo&gt;
        &lt;mn&gt;50&lt;/mn&gt;
      &lt;/mrow&gt;
      &lt;mrow&gt;
        &lt;mn&gt;100&lt;/mn&gt;
      &lt;/mrow&gt;
    &lt;/mfrac&gt;
    &lt;mo&gt;&amp;#x230B;&lt;/mo&gt;
    &lt;mo&gt;,&lt;/mo&gt;
    &lt;mn&gt;1&lt;/mn&gt;
    &lt;mo&gt;)&lt;/mo&gt;
  &lt;/mrow&gt;
&lt;/math&gt;

&lt;p&gt;Here &lt;em&gt;i&lt;/em&gt; is the &lt;em&gt;i&lt;/em&gt;th element in the table. Note the &lt;a href=&quot;https://en.wikipedia.org/wiki/Floor_and_ceiling_functions&quot;&gt;floor brackets&lt;/a&gt;, which mean the expression inside them is rounded down to the nearest integer number. For 8-bit quantization tables (which is the most common situation) the scaled coefficients also need to be capped at a maximum of 255.&lt;/p&gt;

&lt;p&gt;As an example, applying these equations to &lt;em&gt;Q=75&lt;/em&gt; results in a scaling factor &lt;em&gt;S&lt;/em&gt; of 50, and the quantization coefficients become:&lt;/p&gt;

&lt;table&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;8&lt;/td&gt;
      &lt;td&gt;6&lt;/td&gt;
      &lt;td&gt;5&lt;/td&gt;
      &lt;td&gt;8&lt;/td&gt;
      &lt;td&gt;12&lt;/td&gt;
      &lt;td&gt;20&lt;/td&gt;
      &lt;td&gt;26&lt;/td&gt;
      &lt;td&gt;31&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;6&lt;/td&gt;
      &lt;td&gt;6&lt;/td&gt;
      &lt;td&gt;7&lt;/td&gt;
      &lt;td&gt;10&lt;/td&gt;
      &lt;td&gt;13&lt;/td&gt;
      &lt;td&gt;29&lt;/td&gt;
      &lt;td&gt;30&lt;/td&gt;
      &lt;td&gt;28&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;7&lt;/td&gt;
      &lt;td&gt;7&lt;/td&gt;
      &lt;td&gt;8&lt;/td&gt;
      &lt;td&gt;12&lt;/td&gt;
      &lt;td&gt;20&lt;/td&gt;
      &lt;td&gt;29&lt;/td&gt;
      &lt;td&gt;35&lt;/td&gt;
      &lt;td&gt;28&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;7&lt;/td&gt;
      &lt;td&gt;9&lt;/td&gt;
      &lt;td&gt;11&lt;/td&gt;
      &lt;td&gt;15&lt;/td&gt;
      &lt;td&gt;26&lt;/td&gt;
      &lt;td&gt;44&lt;/td&gt;
      &lt;td&gt;40&lt;/td&gt;
      &lt;td&gt;31&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;9&lt;/td&gt;
      &lt;td&gt;11&lt;/td&gt;
      &lt;td&gt;19&lt;/td&gt;
      &lt;td&gt;28&lt;/td&gt;
      &lt;td&gt;34&lt;/td&gt;
      &lt;td&gt;55&lt;/td&gt;
      &lt;td&gt;52&lt;/td&gt;
      &lt;td&gt;39&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;12&lt;/td&gt;
      &lt;td&gt;18&lt;/td&gt;
      &lt;td&gt;28&lt;/td&gt;
      &lt;td&gt;32&lt;/td&gt;
      &lt;td&gt;41&lt;/td&gt;
      &lt;td&gt;52&lt;/td&gt;
      &lt;td&gt;57&lt;/td&gt;
      &lt;td&gt;46&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;25&lt;/td&gt;
      &lt;td&gt;32&lt;/td&gt;
      &lt;td&gt;39&lt;/td&gt;
      &lt;td&gt;44&lt;/td&gt;
      &lt;td&gt;52&lt;/td&gt;
      &lt;td&gt;61&lt;/td&gt;
      &lt;td&gt;60&lt;/td&gt;
      &lt;td&gt;51&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;36&lt;/td&gt;
      &lt;td&gt;46&lt;/td&gt;
      &lt;td&gt;48&lt;/td&gt;
      &lt;td&gt;49&lt;/td&gt;
      &lt;td&gt;56&lt;/td&gt;
      &lt;td&gt;50&lt;/td&gt;
      &lt;td&gt;52&lt;/td&gt;
      &lt;td&gt;50&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;This way it is possible to calculate the quantization tables for all 100 quality levels. For the chrominance tables the same procedure can be used.&lt;/p&gt;
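
&lt;p&gt;As a rough illustration, the scaling procedure could be sketched in Python as shown here. This is a minimal sketch and not the code behind this post: the function names are made up for the example, and the use of integer (floor) division for 5000/&lt;em&gt;Q&lt;/em&gt; is an assumption.&lt;/p&gt;

```python
# Annex K luminance base table (quality level 50), copied from the table above
BASE_LUMINANCE = [
    16, 11, 10, 16, 24, 40, 51, 61,
    12, 12, 14, 19, 26, 58, 60, 55,
    14, 13, 16, 24, 40, 57, 69, 56,
    14, 17, 22, 29, 51, 87, 80, 62,
    18, 22, 37, 56, 68, 109, 103, 77,
    24, 35, 55, 64, 81, 104, 113, 92,
    49, 64, 78, 87, 103, 121, 120, 101,
    72, 92, 95, 98, 112, 100, 103, 99,
]


def scaling_factor(quality):
    """Return scaling factor S for a quality level Q in the range 1-100."""
    if quality >= 50:
        return 200 - 2 * quality
    # Assumption: integer (floor) division for the Q below 50 branch
    return 5000 // quality


def scale_table(base_table, quality, max_value=255):
    """Scale a base table to a quality level.

    Each coefficient is floor((S * Tb + 50) / 100), with a lower bound
    of 1 and an upper bound of max_value (255 for 8-bit tables).
    """
    s = scaling_factor(quality)
    return [min(max((s * t + 50) // 100, 1), max_value) for t in base_table]


table_75 = scale_table(BASE_LUMINANCE, 75)
print(table_75[:8])  # first row at Q=75: [8, 6, 5, 8, 12, 20, 26, 31]
```

&lt;p&gt;At quality level 50 the scaling factor is 100, so the function returns the base table unchanged. At quality level 100 the scaling factor is 0, and the lower bound of 1 applies to every coefficient.&lt;/p&gt;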

&lt;h2 id=&quot;estimating-quality-from-scaled-tables&quot;&gt;Estimating quality from scaled tables&lt;/h2&gt;

&lt;p&gt;For a given JPEG file, the quality can then be estimated by comparing its quantization tables against each of the scaled tables that are derived from the standard tables. As a basis for this comparison we can calculate, for each quality level, the sum of squared errors between the coefficients in the image’s quantization table and the corresponding coefficients in the scaled standard table. For an image with 2 quantization tables (one for luminance and another one for chrominance) this is given by:&lt;/p&gt;

&lt;math xmlns=&quot;http://www.w3.org/1998/Math/MathML&quot;&gt;
  &lt;mrow&gt;
    &lt;mi&gt;SSE&lt;/mi&gt;
    &lt;mo&gt;=&lt;/mo&gt;
    &lt;munderover&gt;
      &lt;mo&gt;&amp;sum;&lt;/mo&gt;
      &lt;mrow&gt;
        &lt;mi&gt;i&lt;/mi&gt;
        &lt;mo&gt;=&lt;/mo&gt;
        &lt;mn&gt;1&lt;/mn&gt;
      &lt;/mrow&gt;
      &lt;mn&gt;64&lt;/mn&gt;
    &lt;/munderover&gt;
    &lt;msup&gt;
      &lt;mrow&gt;
        &lt;mo&gt;(&lt;/mo&gt;
        &lt;msubsup&gt;
          &lt;mi&gt;T&lt;/mi&gt;
          &lt;mi&gt;lum&lt;/mi&gt;
          &lt;mi&gt;i&lt;/mi&gt;
        &lt;/msubsup&gt;
        &lt;mo&gt;-&lt;/mo&gt;
        &lt;msubsup&gt;
          &lt;mi&gt;T&lt;/mi&gt;
          &lt;mrow&gt;
            &lt;mi&gt;s&lt;/mi&gt;
            &lt;mo&gt;,&lt;/mo&gt;
            &lt;mi&gt;lum&lt;/mi&gt;
          &lt;/mrow&gt;
          &lt;mi&gt;i&lt;/mi&gt;
        &lt;/msubsup&gt;
        &lt;mo&gt;)&lt;/mo&gt;
      &lt;/mrow&gt;
      &lt;mn&gt;2&lt;/mn&gt;
    &lt;/msup&gt;
    &lt;mo&gt;+&lt;/mo&gt;
    &lt;msup&gt;
      &lt;mrow&gt;
        &lt;mo&gt;(&lt;/mo&gt;
        &lt;msubsup&gt;
          &lt;mi&gt;T&lt;/mi&gt;
          &lt;mi&gt;chrom&lt;/mi&gt;
          &lt;mi&gt;i&lt;/mi&gt;
        &lt;/msubsup&gt;
        &lt;mo&gt;-&lt;/mo&gt;
        &lt;msubsup&gt;
          &lt;mi&gt;T&lt;/mi&gt;
          &lt;mrow&gt;
            &lt;mi&gt;s&lt;/mi&gt;
            &lt;mo&gt;,&lt;/mo&gt;
            &lt;mi&gt;chrom&lt;/mi&gt;
          &lt;/mrow&gt;
          &lt;mi&gt;i&lt;/mi&gt;
        &lt;/msubsup&gt;
        &lt;mo&gt;)&lt;/mo&gt;
      &lt;/mrow&gt;
      &lt;mn&gt;2&lt;/mn&gt;
    &lt;/msup&gt;
  &lt;/mrow&gt;
&lt;/math&gt;

&lt;p&gt;Here, &lt;em&gt;T&lt;sup&gt;i&lt;/sup&gt;&lt;sub&gt;lum&lt;/sub&gt;&lt;/em&gt; and &lt;em&gt;T&lt;sup&gt;i&lt;/sup&gt;&lt;sub&gt;chrom&lt;/sub&gt;&lt;/em&gt; represent the coefficients for luminance and chrominance from the image’s quantization table, and &lt;em&gt;T&lt;sup&gt;i&lt;/sup&gt;&lt;sub&gt;s,lum&lt;/sub&gt;&lt;/em&gt; and &lt;em&gt;T&lt;sup&gt;i&lt;/sup&gt;&lt;sub&gt;s, chrom&lt;/sub&gt;&lt;/em&gt; are the corresponding coefficients from the (scaled) standard tables.&lt;/p&gt;

&lt;p&gt;Repeating this for all quality levels results in 100 &lt;em&gt;SSE&lt;/em&gt; values. The quality level with the smallest &lt;em&gt;SSE&lt;/em&gt; value is then our best estimate for the quality of the image. An &lt;em&gt;SSE&lt;/em&gt; value of exactly 0 means the image uses the standard JPEG quantization tables. In that case we can be confident that our estimate is the exact quality level at which the image was compressed. Larger values indicate the use of non-standard quantization tables, in which case the quality estimate may be less accurate.&lt;/p&gt;
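
&lt;p&gt;The matching step could be sketched as follows. Again this is only an illustration under stated assumptions: &lt;code&gt;estimate_quality&lt;/code&gt; and &lt;code&gt;scale_table&lt;/code&gt; are hypothetical names, the example only uses the luminance table, and the “image” table is simply generated from the base table rather than read from a real JPEG.&lt;/p&gt;

```python
# Annex K luminance base table (quality level 50), as listed earlier in this post
BASE_LUMINANCE = [
    16, 11, 10, 16, 24, 40, 51, 61,
    12, 12, 14, 19, 26, 58, 60, 55,
    14, 13, 16, 24, 40, 57, 69, 56,
    14, 17, 22, 29, 51, 87, 80, 62,
    18, 22, 37, 56, 68, 109, 103, 77,
    24, 35, 55, 64, 81, 104, 113, 92,
    49, 64, 78, 87, 103, 121, 120, 101,
    72, 92, 95, 98, 112, 100, 103, 99,
]


def scale_table(base_table, quality):
    """Scale a base table to quality level Q (1-100), 8-bit coefficients."""
    # Assumption: integer (floor) division for the Q below 50 branch
    s = 200 - 2 * quality if quality >= 50 else 5000 // quality
    return [min(max((s * t + 50) // 100, 1), 255) for t in base_table]


def sum_squared_errors(image_tables, base_tables, quality):
    """Return SSE between the image tables and the scaled standard tables."""
    sse = 0
    for image_table, base_table in zip(image_tables, base_tables):
        scaled = scale_table(base_table, quality)
        sse += sum((t - ts) ** 2 for t, ts in zip(image_table, scaled))
    return sse


def estimate_quality(image_tables, base_tables):
    """Return (quality, sse) for the quality level with the smallest SSE.

    Both arguments are equally long lists of 64-element tables (one
    entry for luminance only, two for luminance plus chrominance).
    If several levels share the smallest SSE, the lowest level wins.
    """
    candidates = [
        (sum_squared_errors(image_tables, base_tables, q), q)
        for q in range(1, 101)
    ]
    best_sse, best_quality = min(candidates)
    return best_quality, best_sse


# Stand-in for a table read from a JPEG: the standard table at Q=75
image_table = scale_table(BASE_LUMINANCE, 75)
print(estimate_quality([image_table], [BASE_LUMINANCE]))  # (75, 0)
```

&lt;p&gt;For an image with separate luminance and chrominance tables, both lists would simply contain two entries, with the Annex K chrominance base table as the second base table.&lt;/p&gt;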

&lt;h2 id=&quot;characterizing-similarity-to-standard-tables&quot;&gt;Characterizing similarity to standard tables&lt;/h2&gt;

&lt;p&gt;For images that &lt;em&gt;don’t&lt;/em&gt; use the standard quantization tables, it would be useful to have some measure that expresses &lt;em&gt;how much&lt;/em&gt; the quantization tables deviate from the best matching standard table. This could be used as a measure of confidence in the quality estimate. By itself, &lt;em&gt;SSE&lt;/em&gt; is not a good measure of this. Firstly, it is influenced by the number of quantization tables in the image. We could remove this influence by transforming the &lt;em&gt;SSE&lt;/em&gt; value to a root mean squared error (&lt;em&gt;RMSE&lt;/em&gt;):&lt;/p&gt;

&lt;math xmlns=&quot;http://www.w3.org/1998/Math/MathML&quot;&gt;
  &lt;mrow&gt;
    &lt;mi&gt;RMSE&lt;/mi&gt;
    &lt;mo&gt;=&lt;/mo&gt;
    &lt;msqrt&gt;
      &lt;mfrac&gt;
        &lt;mrow&gt;
          &lt;mi&gt;SSE&lt;/mi&gt;
        &lt;/mrow&gt;
        &lt;mrow&gt;
          &lt;mi&gt;tables&lt;/mi&gt;
          &lt;mo&gt;&amp;middot;&lt;/mo&gt;
          &lt;mn&gt;64&lt;/mn&gt;
        &lt;/mrow&gt;
      &lt;/mfrac&gt;
    &lt;/msqrt&gt;
  &lt;/mrow&gt;
&lt;/math&gt;

&lt;p&gt;Here, &lt;em&gt;tables&lt;/em&gt; is the number of quantization tables (which is either 1 or 2). The interpretation of these &lt;em&gt;RMSE&lt;/em&gt; values is still complicated by the fact that the quantization coefficients are significantly larger at lower quality levels. In practice this has the effect that the &lt;em&gt;RMSE&lt;/em&gt; values are generally much larger at low quality levels relative to higher quality levels, even for images with a similar overall “fit” to a standard quantization table.&lt;/p&gt;

&lt;p&gt;One “goodness of fit” measure that does not have this drawback is the &lt;a href=&quot;https://en.wikipedia.org/wiki/Nash%E2%80%93Sutcliffe_model_efficiency_coefficient&quot;&gt;Nash–Sutcliffe efficiency coefficient (&lt;em&gt;NSE&lt;/em&gt;)&lt;/a&gt;. It is mostly used in the field of hydrology to characterize how well the output of hydrological simulation models agrees with observations. Here, we will use it to characterize how well the coefficients in the standard table agree with those in the image’s quantization table. Rewritten for our quantization coefficients, &lt;em&gt;NSE&lt;/em&gt; is given by:&lt;/p&gt;

&lt;math xmlns=&quot;http://www.w3.org/1998/Math/MathML&quot;&gt;
  &lt;mrow&gt;
    &lt;mtext&gt;NSE&lt;/mtext&gt;
    &lt;mo&gt;=&lt;/mo&gt;
  &lt;/mrow&gt;
  &lt;mrow&gt;
    &lt;mn&gt;1&lt;/mn&gt;
    &lt;mo&gt;−&lt;/mo&gt;
  &lt;/mrow&gt;
  &lt;mrow&gt;
    &lt;mfrac&gt;
      &lt;mrow&gt;
        &lt;msubsup&gt;
          &lt;mo stretchy=&quot;true&quot;&gt;∑&lt;/mo&gt;
          &lt;mrow&gt;
            &lt;mi&gt;i&lt;/mi&gt;
            &lt;mo&gt;=&lt;/mo&gt;
            &lt;mn&gt;1&lt;/mn&gt;
          &lt;/mrow&gt;
          &lt;mi&gt;N&lt;/mi&gt;
        &lt;/msubsup&gt;
        &lt;msup&gt;
          &lt;mrow&gt;
            &lt;mo fence=&quot;true&quot; form=&quot;prefix&quot;&gt;(&lt;/mo&gt;
            &lt;msup&gt;
              &lt;mi&gt;T&lt;/mi&gt;
              &lt;mi&gt;i&lt;/mi&gt;
            &lt;/msup&gt;
            &lt;mo&gt;−&lt;/mo&gt;
            &lt;msubsup&gt;
              &lt;mi&gt;T&lt;/mi&gt;
              &lt;mi&gt;s&lt;/mi&gt;
              &lt;mi&gt;i&lt;/mi&gt;
            &lt;/msubsup&gt;
            &lt;mo fence=&quot;true&quot; form=&quot;postfix&quot;&gt;)&lt;/mo&gt;
          &lt;/mrow&gt;
          &lt;mn&gt;2&lt;/mn&gt;
        &lt;/msup&gt;
      &lt;/mrow&gt;
      &lt;mrow&gt;
        &lt;msubsup&gt;
          &lt;mo stretchy=&quot;true&quot;&gt;∑&lt;/mo&gt;
          &lt;mrow&gt;
            &lt;mi&gt;i&lt;/mi&gt;
            &lt;mo&gt;=&lt;/mo&gt;
            &lt;mn&gt;1&lt;/mn&gt;
          &lt;/mrow&gt;
          &lt;mi&gt;N&lt;/mi&gt;
        &lt;/msubsup&gt;
        &lt;msup&gt;
          &lt;mrow&gt;
            &lt;mo fence=&quot;true&quot; form=&quot;prefix&quot;&gt;(&lt;/mo&gt;
            &lt;msup&gt;
              &lt;mi&gt;T&lt;/mi&gt;
              &lt;mi&gt;i&lt;/mi&gt;
            &lt;/msup&gt;
            &lt;mo&gt;−&lt;/mo&gt;
            &lt;menclose notation=&quot;top&quot; class=&quot;tml-overline&quot;&gt;
              &lt;mi&gt;T&lt;/mi&gt;
            &lt;/menclose&gt;
            &lt;mo fence=&quot;true&quot; form=&quot;postfix&quot;&gt;)&lt;/mo&gt;
          &lt;/mrow&gt;
          &lt;mn&gt;2&lt;/mn&gt;
        &lt;/msup&gt;
      &lt;/mrow&gt;
    &lt;/mfrac&gt;
  &lt;/mrow&gt;
&lt;/math&gt;

&lt;p&gt;Here &lt;em&gt;T&lt;sup&gt;i&lt;/sup&gt;&lt;/em&gt; represents the &lt;em&gt;i&lt;/em&gt;th coefficient from the image’s quantization tables, and &lt;em&gt;T&lt;sup&gt;i&lt;/sup&gt;&lt;sub&gt;s&lt;/sub&gt;&lt;/em&gt; is the corresponding coefficient from the (scaled) standard tables. &lt;em&gt;N&lt;/em&gt; is the total number of coefficients in the image’s quantization tables. Note that, unlike in the &lt;em&gt;SSE&lt;/em&gt; equation, the luminance and chrominance coefficients are lumped together here for simplicity. Finally, &lt;span style=&quot;border-top: 1px solid #000000;&quot;&gt;&lt;em&gt;T&lt;/em&gt;&lt;/span&gt; is the mean of all coefficients &lt;em&gt;T&lt;sup&gt;i&lt;/sup&gt;&lt;/em&gt; in the image’s quantization tables. The interpretation of &lt;em&gt;NSE&lt;/em&gt; is quite straightforward:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;A value of 1 indicates a perfect agreement between the image quantization tables and the corresponding standard tables.&lt;/li&gt;
  &lt;li&gt;For a value of 0, the standard tables are as good (or rather, bad) an approximation of the image’s quantization tables as &lt;span style=&quot;border-top: 1px solid #000000;&quot;&gt;&lt;em&gt;T&lt;/em&gt;&lt;/span&gt;.&lt;/li&gt;
  &lt;li&gt;Negative values indicate an extremely poor agreement.&lt;/li&gt;
&lt;/ul&gt;
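
&lt;p&gt;To make the difference between the two measures concrete, here is a small Python sketch of &lt;em&gt;RMSE&lt;/em&gt; and &lt;em&gt;NSE&lt;/em&gt;. The function names are hypothetical, and the four-element lists are made-up toy values rather than real quantization coefficients. Both functions expect the luminance and chrominance coefficients lumped into one flat list.&lt;/p&gt;

```python
import math


def rmse(image_coeffs, standard_coeffs):
    """Root mean squared error over all (lumped) coefficients."""
    sse = sum((t - ts) ** 2 for t, ts in zip(image_coeffs, standard_coeffs))
    return math.sqrt(sse / len(image_coeffs))


def nse(image_coeffs, standard_coeffs):
    """Nash-Sutcliffe efficiency of the standard coefficients.

    Undefined (division by zero) if all image coefficients are equal.
    """
    mean_t = sum(image_coeffs) / len(image_coeffs)
    sse = sum((t - ts) ** 2 for t, ts in zip(image_coeffs, standard_coeffs))
    spread = sum((t - mean_t) ** 2 for t in image_coeffs)
    return 1 - sse / spread


# Toy values (not real quantization coefficients)
image = [2, 4, 6, 8]
standard = [2, 4, 6, 9]

# The same data with every coefficient multiplied by 10, mimicking the
# larger coefficients found at lower quality levels
image_x10 = [10 * t for t in image]
standard_x10 = [10 * t for t in standard]

print(rmse(image, standard), nse(image, standard))
print(rmse(image_x10, standard_x10), nse(image_x10, standard_x10))
```

&lt;p&gt;In this toy example, multiplying all coefficients by 10 makes &lt;em&gt;RMSE&lt;/em&gt; ten times larger, while &lt;em&gt;NSE&lt;/em&gt; stays the same. Replacing the standard coefficients with the mean of the image coefficients (5 for every element) gives an &lt;em&gt;NSE&lt;/em&gt; of 0.&lt;/p&gt;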

&lt;p&gt;As an example, the scatter plot below shows the quantization coefficients (&lt;em&gt;T&lt;/em&gt;) from &lt;a href=&quot;https://github.com/KBNLresearch/jpeg-quality-demo/blob/main/images/dbnl/mul-master.jpg&quot;&gt;one of our dbnl master images&lt;/a&gt;, plotted against the corresponding coefficients (&lt;em&gt;T&lt;sub&gt;s&lt;/sub&gt;&lt;/em&gt;) from the best matching standard table. It also shows the line of perfect agreement (green, dashed), the quality estimate, the root mean squared error, and the Nash–Sutcliffe efficiency:&lt;/p&gt;

&lt;figure class=&quot;image&quot;&gt;
  &lt;img src=&quot;/images/2024/10/mul-master-scatter.png&quot; alt=&quot;scatter plot of T against Ts for file mul-master.jpg, with Quality = 84%, RMSE = 1.057 and NSE = 0.997.&quot; /&gt;
&lt;/figure&gt;

&lt;p&gt;The plot shows that, aside from one outlier in the chrominance table, the image’s quantization coefficients are closely approximated by those in the standard quantization tables. This is reflected by the &lt;em&gt;NSE&lt;/em&gt; value, which is close to 1. Now compare this with the plot I made for &lt;a href=&quot;https://github.com/KBNLresearch/jpeg-quality-demo/blob/main/images/dbnl/mul-access.jpg&quot;&gt;the corresponding access image&lt;/a&gt;:&lt;/p&gt;

&lt;figure class=&quot;image&quot;&gt;
  &lt;img src=&quot;/images/2024/10/mul-access-scatter.png&quot; alt=&quot;scatter plot of T against Ts for file mul-access.jpg, with Quality = 18%, RMSE = 6.244 and NSE = 0.999.&quot; /&gt;
&lt;/figure&gt;

&lt;p&gt;Visually, the agreement between the standard quantization table coefficients and those of the image looks very similar to the previous plot. But note how the &lt;em&gt;RMSE&lt;/em&gt; value is much larger here, which is caused by the much larger overall coefficients in the quantization tables. However, this doesn’t have any effect on &lt;em&gt;NSE&lt;/em&gt;, which is even marginally closer to 1 than for the master image.&lt;/p&gt;

&lt;p&gt;By contrast, things are very different for &lt;a href=&quot;https://github.com/KBNLresearch/jpeg-quality-demo/blob/main/images/misc/image-177.jpg&quot;&gt;this image&lt;/a&gt;:&lt;/p&gt;

&lt;figure class=&quot;image&quot;&gt;
  &lt;img src=&quot;/images/2024/10/image-177-scatter.png&quot; alt=&quot;scatter plot of T against Ts for file image-177.jpg, with Quality = 81%, RMSE = 20.509 and NSE = 0.732.&quot; /&gt;
&lt;/figure&gt;

&lt;p&gt;The plot shows that the standard JPEG tables are a relatively poor approximation here, and this is reflected by the low &lt;em&gt;NSE&lt;/em&gt; value. A similar case is &lt;a href=&quot;https://github.com/KBNLresearch/jpeg-quality-demo/blob/main/images/misc/sample-jpg-files-sample-4.jpg&quot;&gt;this image&lt;/a&gt;:&lt;/p&gt;

&lt;figure class=&quot;image&quot;&gt;
  &lt;img src=&quot;/images/2024/10/sample-jpg-files-sample-4-scatter.png&quot; alt=&quot;scatter plot of T against Ts for file sample-jpg-files-sample-4.jpg, with Quality = 89%, RMSE = 12.294 and NSE = 0.734.&quot; /&gt;
&lt;/figure&gt;

&lt;p&gt;Despite the different quality estimate and &lt;em&gt;RMSE&lt;/em&gt; value, this visually looks like it’s in the same ballpark as the previous image, and the similar &lt;em&gt;NSE&lt;/em&gt; value confirms this.&lt;/p&gt;

&lt;p&gt;Based on these examples, &lt;em&gt;NSE&lt;/em&gt; appears to give a good indication of the fit between the image and standard quantization coefficients. This makes it useful as a measure to assess the confidence in the method’s quality estimates.&lt;/p&gt;

&lt;h2 id=&quot;python-implementation-of-least-squares-matching-method&quot;&gt;Python implementation of least squares matching method&lt;/h2&gt;

&lt;p&gt;I created &lt;a href=&quot;https://github.com/KBNLresearch/jpeg-quality-demo/blob/main/jpegquality-lsm.py&quot;&gt;a test script with a Python implementation&lt;/a&gt; of the method. Apart from the quality estimate, it also reports the corresponding &lt;em&gt;RMSE&lt;/em&gt; and &lt;em&gt;NSE&lt;/em&gt; values.&lt;/p&gt;
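
&lt;p&gt;To give an idea of the general approach, below is a stripped-down sketch. It is not the code from the script linked above. It assumes the quality scaling convention used by libjpeg, it only looks at a single luminance table (the standard table from Annex K of the JPEG standard), it clamps coefficients to the 8-bit range, and it assumes that the image table and the standard table list their coefficients in the same order. The function names are made up for this example:&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# Standard luminance quantization table (JPEG standard, Annex K).
STD_LUMINANCE = [
    16, 11, 10, 16, 24, 40, 51, 61,
    12, 12, 14, 19, 26, 58, 60, 55,
    14, 13, 16, 24, 40, 57, 69, 56,
    14, 17, 22, 29, 51, 87, 80, 62,
    18, 22, 37, 56, 68, 109, 103, 77,
    24, 35, 55, 64, 81, 104, 113, 92,
    49, 64, 78, 87, 103, 121, 120, 101,
    72, 92, 95, 98, 112, 100, 103, 99,
]

def scale_table(base, quality, limit=255):
    # Convert a quality level (1-100) to a percentage scale factor,
    # following the convention used by libjpeg.
    if quality in range(1, 50):
        scale = 5000 // quality
    else:
        scale = 200 - 2 * quality
    # Scale and round each coefficient, then clamp it to 1..limit.
    return [min(max((c * scale + 50) // 100, 1), limit) for c in base]

def estimate_quality(table, base=STD_LUMINANCE):
    # Sum of squared errors between the image table and the
    # standard table scaled to a given quality level.
    def sse(quality):
        scaled = scale_table(base, quality)
        return sum((a - b) ** 2 for a, b in zip(table, scaled))
    # Try all quality levels and keep the one with the smallest error.
    best = min(range(1, 101), key=sse)
    return best, sse(best)

# Round trip: a standard table scaled to quality 75 is matched exactly.
q75 = scale_table(STD_LUMINANCE, 75)
print(estimate_quality(q75))
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The round trip at the end prints a quality of 75 with a sum of squared errors of 0. A table that deviates from the standard ones ends up at the quality level with the smallest, non-zero error.&lt;/p&gt;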

&lt;h2 id=&quot;tests-with-pillow-and-imagemagick-jpegs&quot;&gt;Tests with Pillow and ImageMagick JPEGs&lt;/h2&gt;

&lt;p&gt;As a first test I ran the script on the &lt;a href=&quot;https://github.com/KBNLresearch/jpeg-quality-demo/tree/main/images/im_pil&quot;&gt;Pillow and ImageMagick JPEGs&lt;/a&gt; I discussed in my previous post. This gave the following result:&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th style=&quot;text-align: left&quot;&gt;Q&lt;sub&gt;enc&lt;/sub&gt;&lt;/th&gt;
      &lt;th style=&quot;text-align: left&quot;&gt;Q&lt;sub&gt;est&lt;/sub&gt;(PIL)&lt;/th&gt;
      &lt;th style=&quot;text-align: left&quot;&gt;RMSE(PIL)&lt;/th&gt;
      &lt;th style=&quot;text-align: left&quot;&gt;NSE(PIL)&lt;/th&gt;
      &lt;th style=&quot;text-align: left&quot;&gt;Q&lt;sub&gt;est&lt;/sub&gt;(IM)&lt;/th&gt;
      &lt;th style=&quot;text-align: left&quot;&gt;RMSE(IM)&lt;/th&gt;
      &lt;th style=&quot;text-align: left&quot;&gt;NSE(IM)&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;5&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;5&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;0.0&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;1.0&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;5&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;0.0&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;1.0&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;10&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;10&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;0.0&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;1.0&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;10&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;0.0&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;1.0&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;25&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;25&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;0.0&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;1.0&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;25&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;0.0&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;1.0&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;50&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;50&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;0.0&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;1.0&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;50&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;0.0&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;1.0&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;75&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;75&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;0.0&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;1.0&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;75&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;0.0&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;1.0&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;100&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;100&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;0.0&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;1.0&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;100&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;0.0&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;1.0&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;Here &lt;em&gt;Q&lt;sub&gt;enc&lt;/sub&gt;&lt;/em&gt; is the encoding quality, and &lt;em&gt;Q&lt;sub&gt;est&lt;/sub&gt;(PIL)&lt;/em&gt; and &lt;em&gt;Q&lt;sub&gt;est&lt;/sub&gt;(IM)&lt;/em&gt; are the script’s estimates for the Pillow and ImageMagick images, respectively. The script correctly reproduced the encoding quality for all test images. The values &lt;em&gt;RMSE&lt;/em&gt;=0 and &lt;em&gt;NSE&lt;/em&gt;=1 also indicate that all images use the standard JPEG quantization tables.&lt;/p&gt;

&lt;h2 id=&quot;comparison-of-quality-estimation-methods&quot;&gt;Comparison of quality estimation methods&lt;/h2&gt;

&lt;p&gt;Far more interesting is the behaviour for images that &lt;em&gt;don’t&lt;/em&gt; use the standard tables. The previous section already showed the results for the &lt;a href=&quot;https://github.com/KBNLresearch/jpeg-quality-demo/tree/main/images/dbnl&quot;&gt;dbnl master and access JPEGs&lt;/a&gt; that started this work. To test the method on a more diverse selection of images, I downloaded JPEGs from a variety of sources&lt;sup id=&quot;fnref:9&quot;&gt;&lt;a href=&quot;#fn:9&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;3&lt;/a&gt;&lt;/sup&gt;. I then ran them through &lt;a href=&quot;https://github.com/KBNLresearch/jpeg-quality-demo/blob/main/jpegquality-compare.py&quot;&gt;this test script&lt;/a&gt; that estimates the JPEG quality using my Python port of the original ImageMagick heuristic, the modified ImageMagick heuristic from my &lt;a href=&quot;/2024/10/23/jpeg-quality-estimation-experiments-with-a-modified-imagemagick-heuristic&quot;&gt;previous post&lt;/a&gt;, and the least squares matching method. I then identified images for which one or more of these methods came up with different results, and added a selection of them to &lt;a href=&quot;https://github.com/KBNLresearch/jpeg-quality-demo/tree/main/images/misc&quot;&gt;this dataset&lt;/a&gt;. The following table shows the results of the script for this dataset:&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th style=&quot;text-align: left&quot;&gt;File&lt;/th&gt;
      &lt;th style=&quot;text-align: left&quot;&gt;Q&lt;br /&gt;(im)&lt;/th&gt;
      &lt;th style=&quot;text-align: left&quot;&gt;Q&lt;br /&gt;(im, mod)&lt;/th&gt;
      &lt;th style=&quot;text-align: left&quot;&gt;Exact&lt;br /&gt;(im, mod)&lt;/th&gt;
      &lt;th style=&quot;text-align: left&quot;&gt;Q&lt;br /&gt;(lsm)&lt;/th&gt;
      &lt;th style=&quot;text-align: left&quot;&gt;RMSE&lt;br /&gt;(lsm)&lt;/th&gt;
      &lt;th style=&quot;text-align: left&quot;&gt;NSE&lt;br /&gt;(lsm)&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;&lt;a href=&quot;https://github.com/KBNLresearch/jpeg-quality-demo/blob/main/images/misc/psgradient.jpg&quot;&gt;psgradient.jpg&lt;/a&gt;&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;92&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;92&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;False&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;93&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;4.438&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;0.747&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;&lt;a href=&quot;https://github.com/KBNLresearch/jpeg-quality-demo/blob/main/images/misc/hopper_16bit_qtables.jpg&quot;&gt;hopper_16bit_qtables.jpg&lt;/a&gt;&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;na&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;1&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;False&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;13&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;0.795&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;1.0&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;&lt;a href=&quot;https://github.com/KBNLresearch/jpeg-quality-demo/blob/main/images/misc/image-177.jpg&quot;&gt;image-177.jpg&lt;/a&gt;&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;63&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;63&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;True&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;81&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;20.509&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;0.732&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;&lt;a href=&quot;https://github.com/KBNLresearch/jpeg-quality-demo/blob/main/images/misc/image-98.jpg&quot;&gt;image-98.jpg&lt;/a&gt;&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;63&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;63&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;True&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;81&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;20.509&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;0.732&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;&lt;a href=&quot;https://github.com/KBNLresearch/jpeg-quality-demo/blob/main/images/misc/jpeg420exif.jpg&quot;&gt;jpeg420exif.jpg&lt;/a&gt;&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;90&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;90&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;False&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;89&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;0.424&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;0.999&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;&lt;a href=&quot;https://github.com/KBNLresearch/jpeg-quality-demo/blob/main/images/misc/jpeg422jfif.jpg&quot;&gt;jpeg422jfif.jpg&lt;/a&gt;&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;96&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;96&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;False&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;97&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;0.0&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;1.0&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;&lt;a href=&quot;https://github.com/KBNLresearch/jpeg-quality-demo/blob/main/images/misc/jpeg444.jpg&quot;&gt;jpeg444.jpg&lt;/a&gt;&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;60&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;60&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;False&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;75&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;0.0&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;1.0&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;&lt;a href=&quot;https://github.com/KBNLresearch/jpeg-quality-demo/blob/main/images/misc/sample-birch-400x300.jpg&quot;&gt;sample-birch-400x300.jpg&lt;/a&gt;&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;95&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;95&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;False&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;94&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;3.188&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;0.838&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;&lt;a href=&quot;https://github.com/KBNLresearch/jpeg-quality-demo/blob/main/images/misc/sample-jpg-files-sample-4.jpg&quot;&gt;sample-jpg-files-sample-4.jpg&lt;/a&gt;&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;78&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;78&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;True&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;89&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;12.294&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;0.734&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;&lt;a href=&quot;https://github.com/KBNLresearch/jpeg-quality-demo/blob/main/images/misc/tapedeck1.jpg&quot;&gt;tapedeck1.jpg&lt;/a&gt;&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;90&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;90&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;False&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;93&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;0.753&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;0.992&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;&lt;a href=&quot;https://github.com/KBNLresearch/jpeg-quality-demo/blob/main/images/misc/tapedeck2.jpg&quot;&gt;tapedeck2.jpg&lt;/a&gt;&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;92&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;92&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;False&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;95&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;0.421&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;0.996&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;Here &lt;em&gt;Q(im)&lt;/em&gt; is the quality estimate from the original ImageMagick heuristic, &lt;em&gt;Q(im, mod)&lt;/em&gt; is the quality estimate from the modified ImageMagick heuristic, and &lt;em&gt;Q(lsm)&lt;/em&gt; is the quality estimate from the least squares matching method. In addition, &lt;em&gt;Exact(im, mod)&lt;/em&gt; is the “exactness” indicator of the modified ImageMagick heuristic, and &lt;em&gt;RMSE(lsm)&lt;/em&gt; and &lt;em&gt;NSE(lsm)&lt;/em&gt; are the root mean squared error and Nash-Sutcliffe Efficiency values reported by the least squares matching method, respectively.&lt;/p&gt;

&lt;p&gt;Below I highlight some of the more interesting results.&lt;/p&gt;

&lt;h3 id=&quot;jpeg444jpg&quot;&gt;&lt;a href=&quot;https://github.com/KBNLresearch/jpeg-quality-demo/blob/main/images/misc/jpeg444.jpg&quot;&gt;jpeg444.jpg&lt;/a&gt;&lt;/h3&gt;

&lt;p&gt;For this image, both the original and modified ImageMagick heuristics estimate the quality at 60%, with no “exact” match. By contrast, the least squares matching method came up with a quality of 75%, with &lt;em&gt;RMSE&lt;/em&gt; and &lt;em&gt;NSE&lt;/em&gt; indicating a perfect match with the standard JPEG quantization tables. This is confirmed by plotting the coefficients from the quantization tables against the standard coefficients:&lt;/p&gt;

&lt;figure class=&quot;image&quot;&gt;
  &lt;img src=&quot;/images/2024/10/jpeg444-scatter.png&quot; alt=&quot;scatter plot of T against Ts for file jpeg444.jpg, with Quality = 75%, RMSE = 0.0 and NSE = 1.0.&quot; /&gt;
&lt;/figure&gt;

&lt;p&gt;I double-checked the result by uploading the image to FotoForensics, which &lt;a href=&quot;https://fotoforensics.com/analysis.php?id=2e8c6fc55fefbdf2e3b96e9c531d3d24b7b5ea16.5667&quot;&gt;also came up with 75% quality, and an exact match with the standard tables&lt;/a&gt;.&lt;/p&gt;

&lt;h3 id=&quot;image-98jpg&quot;&gt;&lt;a href=&quot;https://github.com/KBNLresearch/jpeg-quality-demo/blob/main/images/misc/image-98.jpg&quot;&gt;image-98.jpg&lt;/a&gt;&lt;/h3&gt;

&lt;p&gt;Here, the quality is estimated at 63% by both the original and modified ImageMagick heuristics, with an “exact” match. The least squares matching method results in a much higher quality (81%), but the relatively low &lt;em&gt;NSE&lt;/em&gt; value of 0.732 indicates a poor fit to the standard JPEG tables. This is confirmed by the scatter plot of &lt;em&gt;T&lt;/em&gt; against &lt;em&gt;T&lt;sub&gt;s&lt;/sub&gt;&lt;/em&gt;:&lt;/p&gt;

&lt;figure class=&quot;image&quot;&gt;
  &lt;img src=&quot;/images/2024/10/image-98-scatter.png&quot; alt=&quot;scatter plot of T against Ts for file image-98.jpg, with Quality = 81%, RMSE = 20.509 and NSE = 0.732.&quot; /&gt;
&lt;/figure&gt;

&lt;p&gt;The &lt;a href=&quot;https://fotoforensics.com/analysis.php?id=3f271d3383ea2984b461620f2d54075dc5ec26da.37769&quot;&gt;FotoForensics result&lt;/a&gt; also indicates a quality of 81%.&lt;/p&gt;

&lt;h3 id=&quot;hopper_16bit_qtablesjpg&quot;&gt;&lt;a href=&quot;https://github.com/KBNLresearch/jpeg-quality-demo/blob/main/images/misc/hopper_16bit_qtables.jpg&quot;&gt;hopper_16bit_qtables.jpg&lt;/a&gt;&lt;/h3&gt;

&lt;p&gt;This image is interesting for a number of reasons. ImageMagick’s original heuristic fails to come up with a quality estimate, while the modified ImageMagick heuristic returns a 1% estimate. Meanwhile, the least squares matching method estimates the quality at 13%. Here’s the corresponding scatter plot:&lt;/p&gt;

&lt;figure class=&quot;image&quot;&gt;
  &lt;img src=&quot;/images/2024/10/hopper_16bit_qtables-scatter.png&quot; alt=&quot;scatter plot of T against Ts for file hopper_16bit_qtables.jpg, with Quality = 13%, RMSE = 0.795 and NSE = 1.0.&quot; /&gt;
&lt;/figure&gt;

&lt;p&gt;The coefficients in the JPEG quantization tables are usually stored as 8-bit unsigned integers, which means the highest possible value is 255. This particular image uses 16-bit values instead, which we can see from the range of &lt;em&gt;T&lt;/em&gt; values in the plot, which goes all the way up to 380! ImageMagick’s heuristic is unable to deal with this&lt;sup id=&quot;fnref:10&quot;&gt;&lt;a href=&quot;#fn:10&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;4&lt;/a&gt;&lt;/sup&gt;, which results (for the modified version) in an unrealistically low value. The least squares matching method explicitly checks for coefficients outside the 8-bit range, and adjusts its calculations accordingly. Its quality estimate corresponds to &lt;a href=&quot;https://fotoforensics.com/analysis.php?id=079bf03e3187859bf2bdf0b28c341d3b75ad8442.2044&quot;&gt;the assessment by FotoForensics&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;One detail that caught my attention is the non-zero &lt;em&gt;RMSE&lt;/em&gt; value, which at first sight seems at odds with the reported &lt;em&gt;NSE&lt;/em&gt; value of 1.0. On closer inspection, it turned out that &lt;em&gt;NSE&lt;/em&gt; is actually marginally smaller than 1 here, but this is obscured by rounding the reported values to 3 decimals&lt;sup id=&quot;fnref:11&quot;&gt;&lt;a href=&quot;#fn:11&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;5&lt;/a&gt;&lt;/sup&gt;.&lt;/p&gt;
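
&lt;p&gt;The two numbers are not contradictory, because &lt;em&gt;NSE&lt;/em&gt; expresses the squared error relative to the spread of the coefficients, and that spread is large for this image. The sketch below illustrates the effect. Only the &lt;em&gt;RMSE&lt;/em&gt; value is taken from the results. The number of coefficients (128, i.e. two 64-element tables) and the spread are hypothetical values that I picked purely for illustration, and the function name is made up as well:&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;def nse_from_rmse(rmse, n, spread):
    # The sum of squared errors equals n * RMSE squared.
    # spread is the sum of squared deviations of T from its mean.
    return 1 - n * rmse ** 2 / spread

# RMSE as reported for hopper_16bit_qtables.jpg; n and spread are
# hypothetical values, not measured from the image.
value = nse_from_rmse(0.795, n=128, spread=500000)
print(value)            # marginally smaller than 1
print(round(value, 3))  # 1.0 after rounding to 3 decimals
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;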

&lt;h3 id=&quot;sample-jpg-files-sample-4jpg&quot;&gt;&lt;a href=&quot;https://github.com/KBNLresearch/jpeg-quality-demo/blob/main/images/misc/sample-jpg-files-sample-4.jpg&quot;&gt;sample-jpg-files-sample-4.jpg&lt;/a&gt;&lt;/h3&gt;

&lt;p&gt;Both ImageMagick heuristics estimate the quality of this image at 78% with an “exact” match, whereas the least squares matching method gives a much higher estimate of 89% (with quite a poor fit with the standard tables). The corresponding &lt;a href=&quot;https://fotoforensics.com/analysis.php?id=9271e2a81a4105d7bc326c76bac043c6265e4d8e.63379&quot;&gt;FotoForensics estimate&lt;/a&gt; is marginally different from this at 88%, but still very close. For completeness here’s its scatter plot (again):&lt;/p&gt;

&lt;figure class=&quot;image&quot;&gt;
  &lt;img src=&quot;/images/2024/10/sample-jpg-files-sample-4-scatter.png&quot; alt=&quot;scatter plot of T against Ts for file sample-jpg-files-sample-4.jpg, with Quality = 89%, RMSE = 12.294 and NSE = 0.734.&quot; /&gt;
&lt;/figure&gt;

&lt;h2 id=&quot;conclusions-from-this-comparison&quot;&gt;Conclusions from this comparison&lt;/h2&gt;

&lt;p&gt;Even though the tests presented here are quite limited, the differences that can occur between the quality estimates by the least squares matching method and the ImageMagick heuristics are quite striking. What surprised me in particular was that even for JPEGs that use the standard quantization tables, ImageMagick’s heuristic may still provide quality estimates that are quite inaccurate. Of course, a major limitation here is the lack of reliable “ground truth” in the form of known quality settings at the time the test images were created. However, the good agreement between the quality estimates of the least squares matching method and the FotoForensics service does inspire some confidence in the methodology.&lt;/p&gt;

&lt;p&gt;Another surprise was that ImageMagick’s “exactness” flag isn’t actually indicative of an exact match with the standard JPEG tables. None of the test images for which it returned a “True” value actually contains standard quantization tables, whereas the two images that &lt;em&gt;do&lt;/em&gt; contain standard tables resulted in a “False” value!&lt;/p&gt;

&lt;h2 id=&quot;sensitivity-analysis&quot;&gt;Sensitivity analysis&lt;/h2&gt;

&lt;p&gt;To get a better impression of how the method behaves with non-standard quantization tables, I did a simple sensitivity analysis using the cjpeg utility that is part of &lt;a href=&quot;https://en.wikipedia.org/wiki/Libjpeg&quot;&gt;libjpeg&lt;/a&gt;. This tool supports compression with separate quality levels for the luminance and chrominance tables. I used this to &lt;a href=&quot;https://github.com/KBNLresearch/jpeg-quality-demo/blob/main/generate-testimages-cjpeg.sh&quot;&gt;compress one source image to all possible quality level combinations&lt;/a&gt;. This resulted in 10,000 images, which I then ran through the least squares matching method.&lt;/p&gt;

&lt;p&gt;The plot below shows the distribution of estimated quality values &lt;em&gt;Q&lt;sub&gt;lsm&lt;/sub&gt;&lt;/em&gt; against &lt;em&gt;Q&lt;sub&gt;av&lt;/sub&gt;&lt;/em&gt;, which are the corresponding averages of encoding qualities &lt;em&gt;Q&lt;sub&gt;lum&lt;/sub&gt;&lt;/em&gt; and &lt;em&gt;Q&lt;sub&gt;chrom&lt;/sub&gt;&lt;/em&gt;&lt;sup id=&quot;fnref:15&quot;&gt;&lt;a href=&quot;#fn:15&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;6&lt;/a&gt;&lt;/sup&gt;:&lt;/p&gt;

&lt;figure class=&quot;image&quot;&gt;
  &lt;img src=&quot;/images/2024/10/qav-qlsm.png&quot; alt=&quot;scatter plot of average encoding quality versus estimated encoding quality.&quot; /&gt;
&lt;/figure&gt;

&lt;p&gt;At low average encoding qualities, the bandwidth of corresponding estimates by the least squares matching method is very wide, but this narrows considerably towards higher quality levels.&lt;/p&gt;

&lt;p&gt;Another, probably more useful view on the results appears if we plot the quality estimation errors, here expressed as absolute differences between &lt;em&gt;Q&lt;sub&gt;av&lt;/sub&gt;&lt;/em&gt; and &lt;em&gt;Q&lt;sub&gt;lsm&lt;/sub&gt;&lt;/em&gt;, against the Nash-Sutcliffe Efficiency coefficients:&lt;/p&gt;

&lt;figure class=&quot;image&quot;&gt;
  &lt;img src=&quot;/images/2024/10/deltaq-nse.png&quot; alt=&quot;scatter plot of prediction errors (absolute differences between estimated and average encoding), versus Nash-Sutcliffe Efficiency.&quot; /&gt;
&lt;/figure&gt;

&lt;p&gt;This shows a strong association between &lt;em&gt;NSE&lt;/em&gt; values in the upper range and small prediction errors. At lower &lt;em&gt;NSE&lt;/em&gt; values, the range of prediction errors becomes progressively wider. This demonstrates how &lt;em&gt;NSE&lt;/em&gt; can be useful as a measure of confidence in the quality estimate.&lt;/p&gt;

&lt;h2 id=&quot;performance&quot;&gt;Performance&lt;/h2&gt;

&lt;p&gt;One potential concern about the least squares matching method might be that it is computationally not very efficient: for each image, the analysis typically involves 200 comparisons of 64-element tables. Out of interest I did a little performance test using &lt;a href=&quot;https://github.com/yavuzceliker/sample-images&quot;&gt;this collection of 700 JPEGs&lt;/a&gt;. I analyzed these files with my Python ports of the original and modified ImageMagick heuristics, and with the least squares matching script.&lt;/p&gt;

&lt;p&gt;Before each script run, I ran this command to clear the Linux filesystem caches (page cache, dentries and inodes):&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;sudo sysctl vm.drop_caches=3
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;I then ran each script like this:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;(time python3 ./jpeg-quality-demo/jpegquality-lsm.py ./sample-images/images/*.jpg &amp;gt; sample-images.txt) 2&amp;gt; time-lsm.txt
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The “time” command reports 3 performance metrics, the most important of which are:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;“real” - the actual amount of time passed between starting the script and its termination.&lt;/li&gt;
  &lt;li&gt;“user” - actual CPU time used in executing the process&lt;sup id=&quot;fnref:8&quot;&gt;&lt;a href=&quot;#fn:8&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;7&lt;/a&gt;&lt;/sup&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The table below shows these metrics for the three scripts&lt;sup id=&quot;fnref:5&quot;&gt;&lt;a href=&quot;#fn:5&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;8&lt;/a&gt;&lt;/sup&gt;:&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th style=&quot;text-align: left&quot;&gt;Method&lt;/th&gt;
      &lt;th style=&quot;text-align: left&quot;&gt;time (real)&lt;/th&gt;
      &lt;th style=&quot;text-align: left&quot;&gt;time (user)&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;im original&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;0m7,624s&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;0m3,321s&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;im modified&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;0m7,215s&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;0m2,968s&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;lsm&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;0m13,134s&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;0m8,673s&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;This shows the least squares matching method is almost 3 times slower than the ImageMagick heuristics in terms of “user” time, and about 2 times slower in terms of “real” time. This translates to an average processing time of 0.01 to 0.02 s per file. Therefore, the reduced performance shouldn’t be any problem in practical terms.&lt;/p&gt;
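
&lt;p&gt;These ratios and per-file times follow directly from the table. For anyone who wants to redo the arithmetic, here is a small sketch. The timings are copied from the table (with decimal commas written as points), and the helper names are made up for this example:&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# Timings from the table above, in seconds.
N_FILES = 700  # size of the sample collection, as stated in the text
REAL = {'im original': 7.624, 'im modified': 7.215, 'lsm': 13.134}
USER = {'im original': 3.321, 'im modified': 2.968, 'lsm': 8.673}

def slowdown(times, baseline, method='lsm'):
    # How many times slower method is than baseline.
    return times[method] / times[baseline]

def seconds_per_file(times, method, n_files=N_FILES):
    # Average time per file for one method.
    return times[method] / n_files

for baseline in ('im original', 'im modified'):
    print(baseline,
          round(slowdown(USER, baseline), 2),
          round(slowdown(REAL, baseline), 2))

print(round(seconds_per_file(REAL, 'im modified'), 3),
      round(seconds_per_file(REAL, 'lsm'), 3))
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This gives slowdown factors of about 2.6 to 2.9 for “user” time and 1.7 to 1.8 for “real” time, and “real” times per file of roughly 0.010 s (modified ImageMagick heuristic) and 0.019 s (least squares matching).&lt;/p&gt;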

&lt;h2 id=&quot;final-thoughts&quot;&gt;Final thoughts&lt;/h2&gt;

&lt;p&gt;I originally wrote the least squares matching method code in an attempt to better understand JPEG quality estimation, and to make a more informed assessment of how ImageMagick’s heuristic works. Based on the tests described here, I think I ended up with something that might actually be quite useful, and preferable to either the original or modified ImageMagick heuristic.&lt;/p&gt;

&lt;p&gt;By itself the method isn’t in any way novel, as it’s basically just another implementation of the “Approximate Quantization Tables” quality estimation method as described by Neal Krawetz&lt;sup id=&quot;fnref:14&quot;&gt;&lt;a href=&quot;#fn:14&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;9&lt;/a&gt;&lt;/sup&gt;. I expect many other, very similar implementations exist that I’m simply not aware of, particularly in the digital forensics domain. This makes it all the more surprising that these apparently haven’t made it to popular image processing and analysis software like ImageMagick. The use of the Nash-Sutcliffe Efficiency as a measure of confidence in the quality estimate &lt;em&gt;may&lt;/em&gt; be somewhat novel, but I wouldn’t be surprised if other (and possibly better) methods for this exist.&lt;/p&gt;

&lt;p&gt;Finally, it’s important to be aware that the characterization of JPEG quality using the 1 - 100 scale that follows from the “standard” quantization tables is by itself pretty arbitrary&lt;sup id=&quot;fnref:13&quot;&gt;&lt;a href=&quot;#fn:13&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;10&lt;/a&gt;&lt;/sup&gt;. Essentially, the corresponding “quality” values are just pointers to sets of quantization tables that only have ordinal significance (i.e. higher values mean better quality), but not much more. Due to its wide use, and the lack of any better alternative, it’s still a useful benchmark. This is also why I think it’s important to provide some information on the similarity of an image’s quantization tables to the “standard” ones, as this helps to assess the confidence in the quality estimate.&lt;/p&gt;

&lt;p&gt;As always, any feedback and suggestions in response to this post are very welcome!&lt;/p&gt;

&lt;h2 id=&quot;scripts-and-test-data&quot;&gt;Scripts and test data&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;https://github.com/KBNLresearch/jpeg-quality-demo&quot;&gt;jpeg-quality-demo GitHub repository&lt;/a&gt; - all scripts and test data that were used in this analysis.&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://github.com/KBNLresearch/jpeg-quality-demo/blob/main/jpegquality-lsm.py&quot;&gt;jpegquality-lsm.py&lt;/a&gt; - Python implementation of the least squares matching method.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;important-note-on-python-pillow-version&quot;&gt;Important note on Python Pillow version&lt;/h3&gt;

&lt;p&gt;The Python implementation of the least squares matching method (and most of the other scripts as well) requires a recent version of the &lt;a href=&quot;https://python-pillow.org/&quot;&gt;Pillow Imaging Library&lt;/a&gt;. This is because around the release of version 8.3 (I think) Pillow changed the order in which it returns the values inside JPEG quantization tables (&lt;a href=&quot;https://github.com/python-pillow/Pillow/pull/4989&quot;&gt;details here&lt;/a&gt;). All scripts in the repo expect the current/new behaviour, and they will give &lt;em&gt;very&lt;/em&gt; wrong results when used with older Pillow versions!&lt;/p&gt;
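
&lt;p&gt;A script could guard against this with a simple version check before doing any work. The sketch below is only an illustration: the helper name &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;is_recent_enough&lt;/code&gt; is hypothetical, and the default minimum of 8.3 simply repeats the tentative cut-off mentioned above, so it should be verified before relying on it.&lt;/p&gt;

```python
import re


def is_recent_enough(version, minimum=(8, 3)):
    """Return True if a dotted version string is at least `minimum`.

    The (8, 3) default repeats the tentative cut-off from the text;
    it is an assumption, not a verified value.
    """
    parts = []
    for token in version.split(".")[:len(minimum)]:
        # Only look at the leading digits, so "3rc1" is read as 3
        match = re.match(r"\d+", token)
        parts.append(int(match.group()) if match else 0)
    # Pad short version strings such as "9" with zeros
    parts += [0] * (len(minimum) - len(parts))
    return tuple(parts) >= tuple(minimum)


# In a real script the version string would come from PIL.__version__
for candidate in ("7.2.0", "8.3.1", "10.4.0"):
    print(candidate, is_recent_enough(candidate))
```

&lt;p&gt;Comparing tuples of integers avoids the pitfall of comparing version strings alphabetically, where “10.4.0” would sort before “8.3.1”.&lt;/p&gt;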

&lt;h2 id=&quot;revision-history&quot;&gt;Revision history&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;1 November 2024: added paragraph on significance of JPEG quality scale; added sensitivity analysis.&lt;/li&gt;
&lt;/ul&gt;

&lt;div class=&quot;footnotes&quot; role=&quot;doc-endnotes&quot;&gt;
  &lt;ol&gt;
    &lt;li id=&quot;fn:3&quot;&gt;
      &lt;p&gt;This corresponds to the “Approximate Quantization Tables” method that is mentioned on (and used by) &lt;a href=&quot;https://fotoforensics.com/tutorial.php?tt=estq&quot;&gt;Neal Krawetz’s FotoForensics site&lt;/a&gt; (but the site doesn’t provide any details about the implementation). &lt;a href=&quot;#fnref:3&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:12&quot;&gt;
      &lt;p&gt;See also &lt;a href=&quot;https://stackoverflow.com/a/29216609/1209004&quot;&gt;this post on StackOverflow&lt;/a&gt;. &lt;a href=&quot;#fnref:12&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:9&quot;&gt;
      &lt;p&gt;Most notably &lt;a href=&quot;https://github.com/yavuzceliker/sample-images&quot;&gt;sample-images&lt;/a&gt;, &lt;a href=&quot;https://samplelib.com/sample-jpeg.html&quot;&gt;samplelib.com&lt;/a&gt;, &lt;a href=&quot;https://toolsfairy.com/tools/image-test/sample-jpg-files&quot;&gt;toolsfairy.com&lt;/a&gt;, &lt;a href=&quot;https://www.w3.org/MarkUp/Test/xhtml-print/20050519/tests/A_2_1-BF-01.htm&quot;&gt;w3.org&lt;/a&gt; and &lt;a href=&quot;https://github.com/python-pillow/Pillow/tree/main/Tests/images&quot;&gt;the Pillow source repository&lt;/a&gt;. &lt;a href=&quot;#fnref:9&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:10&quot;&gt;
      &lt;p&gt;This is because its hard-coded &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;sums&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;hashes&lt;/code&gt; lists (see my &lt;a href=&quot;/2024/10/23/jpeg-quality-estimation-experiments-with-a-modified-imagemagick-heuristic&quot;&gt;previous post&lt;/a&gt;) are based on 8-bit values. &lt;a href=&quot;#fnref:10&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:11&quot;&gt;
      &lt;p&gt;In this case the numerator of the &lt;em&gt;NSE&lt;/em&gt; equation (the &lt;em&gt;SSE&lt;/em&gt; value) was 81, and the denominator (the variance of the quantization coefficients) 7743557. This results in &lt;em&gt;NSE&lt;/em&gt; = 1 - (81/7743557) = 0.99998954, which is reported as 1.0 when rounded to 3 decimals. Meanwhile &lt;em&gt;RMSE&lt;/em&gt; = √(81/128) = 0.795. &lt;a href=&quot;#fnref:11&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:15&quot;&gt;
      &lt;p&gt;The actual significance of &lt;em&gt;Q&lt;sub&gt;av&lt;/sub&gt;&lt;/em&gt; is somewhat questionable, especially for extreme high/low combinations of &lt;em&gt;Q&lt;sub&gt;lum&lt;/sub&gt;&lt;/em&gt; and &lt;em&gt;Q&lt;sub&gt;chrom&lt;/sub&gt;&lt;/em&gt;. &lt;a href=&quot;#fnref:15&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:8&quot;&gt;
      &lt;p&gt;For a more in-depth explanation of these metrics see: &lt;a href=&quot;https://stackoverflow.com/a/556411/1209004&quot;&gt;https://stackoverflow.com/a/556411/1209004&lt;/a&gt; &lt;a href=&quot;#fnref:8&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:5&quot;&gt;
      &lt;p&gt;I ran the test on a pretty low spec machine with an Intel(R) Core(TM) i5-6500 CPU @ 3.20GHz with 4 cores. &lt;a href=&quot;#fnref:5&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:14&quot;&gt;
      &lt;p&gt;In response to this post, Krawetz &lt;a href=&quot;https://noc.social/@hackerfactor/113397249743521959&quot;&gt;let me know&lt;/a&gt; that FotoForensics uses a different algorithm that isn’t based on least squares. &lt;a href=&quot;#fnref:14&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:13&quot;&gt;
      &lt;p&gt;See e.g. &lt;a href=&quot;http://www.faqs.org/faqs/jpeg-faq/part1/section-5.html&quot;&gt;the JPEG FAQ&lt;/a&gt; for an explanation. &lt;a href=&quot;#fnref:13&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
  &lt;/ol&gt;
&lt;/div&gt;
</description>
                <link>https://bitsgalore.org/2024/10/30/jpeg-quality-estimation-using-simple-least-squares-matching-of-quantization-tables</link>
                <guid>https://bitsgalore.org/2024/10/30/jpeg-quality-estimation-using-simple-least-squares-matching-of-quantization-tables</guid>
                <pubDate>2024-10-30T00:00:00+01:00</pubDate>
        </item>

        <item>
                <title>JPEG quality estimation&#58; experiments with a modified ImageMagick heuristic</title>
                <description>&lt;figure class=&quot;image&quot;&gt;
  &lt;img src=&quot;/images/2024/10/bailey-1024.jpg&quot; alt=&quot;Photograph of golden retriever dog Bailey sitting at a desk in front of a laptop, bashing her paws away at the laptop&apos;s keyboard while wearing a necktie.&quot; /&gt;
  &lt;figcaption&gt;&lt;a href=&quot;https://imgur.com/a/golden-baileys-story-pictures-XGli7&quot;&gt;Bailey AKA the &quot;I have no idea what I&apos;m doing&quot; dog&lt;/a&gt;. License unknown.&lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;In this post I explore some of the challenges I ran into while trying to estimate the quality level of JPEG images. By quality level I mean the percentage (1-100) that expresses the &lt;a href=&quot;https://en.wikipedia.org/wiki/Lossy_compression&quot;&gt;lossiness&lt;/a&gt; that was applied by the encoder at the last “save” operation. Here, a value of 1 results in very aggressive compression with a lot of information loss (and thus a very low quality), whereas at 100 almost no information loss occurs at all&lt;sup id=&quot;fnref:4&quot;&gt;&lt;a href=&quot;#fn:4&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;1&lt;/a&gt;&lt;/sup&gt;.&lt;/p&gt;

&lt;p&gt;More specifically, I focus on problems with &lt;a href=&quot;https://imagemagick.org/&quot;&gt;ImageMagick&lt;/a&gt;’s JPEG quality heuristic, which become particularly apparent when applied to low quality images. I also propose a simple tentative solution that applies some small changes to ImageMagick’s heuristic.&lt;/p&gt;

&lt;!-- more --&gt;

&lt;h2 id=&quot;context-of-this-work&quot;&gt;Context of this work&lt;/h2&gt;

&lt;p&gt;I’m currently working on an automated workflow for quality-checking scanned books and periodicals in PDF format. These PDFs are created by external suppliers for The Digital Library for Dutch Literature (&lt;a href=&quot;https://www.dbnl.org/&quot;&gt;dbnl&lt;/a&gt;), which has been managed by the KB since 2015.&lt;/p&gt;

&lt;p&gt;Each book or periodical volume is scanned to PDF. For each publication, 2 versions are made:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;A “production master” PDF with scans that are encoded at 85% JPEG quality.&lt;/li&gt;
  &lt;li&gt;A (relatively) small access PDF with scans encoded at 50% JPEG quality.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;It’s important to have some way of verifying the approximate quality of both versions. The production master serves as input for derived EPUB, XML and unformatted text versions. If the scans are compressed too heavily, this adversely affects these derived products. On the other hand, too little compression on the access PDFs will result in files that are impractically large for access.&lt;/p&gt;

&lt;h2 id=&quot;estimating-jpeg-quality&quot;&gt;Estimating JPEG quality&lt;/h2&gt;

&lt;p&gt;Probably the best explainer on JPEG quality and its estimation is &lt;a href=&quot;https://fotoforensics.com/tutorial.php?tt=estq&quot;&gt;this tutorial on Neal Krawetz’s Fotoforensics site&lt;/a&gt;. The information under the “Estimating Quality” tab is particularly useful. I will return to this on various occasions later in this post.&lt;/p&gt;

&lt;h2 id=&quot;estimating-jpeg-quality-with-imagemagick&quot;&gt;Estimating JPEG quality with ImageMagick&lt;/h2&gt;

&lt;p&gt;&lt;a href=&quot;https://imagemagick.org/&quot;&gt;ImageMagick&lt;/a&gt; is able to estimate the quality of a JPEG image. For example, let’s create a test image at 70% quality:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;convert -quality 70 wizard: wizard-70.jpg
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;We can then get an estimate of the JPEG quality using:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;identify -format &apos;%Q\n&apos; wizard-70.jpg
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Which results in:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;70
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;However, when I ran this command on some of our access scans, the results were not what I expected. An example is &lt;a href=&quot;https://github.com/KBNLresearch/jpeg-quality-demo/blob/main/images/dbnl/260761857-bdf17e14-c697-4f4a-b6d5-e3067e0afc08.jpg&quot;&gt;this image&lt;/a&gt;, which I extracted from one of our PDFs&lt;sup id=&quot;fnref:1&quot;&gt;&lt;a href=&quot;#fn:1&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;2&lt;/a&gt;&lt;/sup&gt;. According to ImageMagick&lt;sup id=&quot;fnref:2&quot;&gt;&lt;a href=&quot;#fn:2&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;3&lt;/a&gt;&lt;/sup&gt;, this image is compressed at 92% quality. This JPEG was extracted from a small access PDF, for which I would expect a quality of 50% or less. Its file size is also much smaller than I would expect for a 92% quality image.&lt;/p&gt;

&lt;h2 id=&quot;check-with-fotoforensics&quot;&gt;Check with Fotoforensics&lt;/h2&gt;

&lt;p&gt;To verify this unexpected result I uploaded the image to Neal Krawetz’s  &lt;a href=&quot;https://fotoforensics.com/&quot;&gt;Fotoforensics&lt;/a&gt; service. The &lt;a href=&quot;https://fotoforensics.com/analysis.php?id=2e0a9f3203e35ece9a23c68c9e6dc7c908891372.353235&amp;amp;show=estq&quot;&gt;result is available here&lt;/a&gt;. Fotoforensics estimates the JPEG quality at a paltry 18%. So why does ImageMagick report a value of 92% here?&lt;/p&gt;

&lt;h2 id=&quot;imagemagick-uses-92-as-a-fallback-value&quot;&gt;ImageMagick uses “92” as a fallback value&lt;/h2&gt;

&lt;p&gt;From a cursory look at its source code, it seems that ImageMagick &lt;a href=&quot;https://github.com/ImageMagick/ImageMagick/blob/f5bdfdd62af7109ad105f8af4e28111e353edecd/MagickCore/property.c#L2725&quot;&gt;uses 92 as a fallback value&lt;/a&gt; if it cannot come up with a quality estimate. This makes the interpretation of its quality output needlessly difficult, since it’s impossible to differentiate between images that have a true 92% quality, and images for which the quality cannot be established. I &lt;a href=&quot;https://github.com/ImageMagick/ImageMagick6/issues/260&quot;&gt;created a ticket&lt;/a&gt; on GitHub when I first came across this issue over a year ago, but thus far it hasn’t been fixed.&lt;/p&gt;

&lt;p&gt;Even if it were fixed, this would still leave the question &lt;em&gt;why&lt;/em&gt; ImageMagick’s quality heuristic is failing here in the first place. As I’m not proficient in &lt;em&gt;C&lt;/em&gt;, I didn’t pursue things any further when I first ran into this issue.&lt;/p&gt;

&lt;h2 id=&quot;imagemagicks-jpeg-quality-heuristic&quot;&gt;ImageMagick’s JPEG quality heuristic&lt;/h2&gt;

&lt;p&gt;This changed when I recently came across &lt;a href=&quot;https://stackoverflow.com/a/75204019/1209004&quot;&gt;this StackOverflow post&lt;/a&gt;, which points to &lt;a href=&quot;https://gist.github.com/eddy-geek/c0f01dc5401dc50a49a0a821cdc9b3e8#file-jpg_quality_pil_magick-py&quot;&gt;a Python port of ImageMagick’s JPEG quality heuristic&lt;/a&gt; by one “Edward O”. It is based &lt;a href=&quot;https://github.com/ImageMagick/ImageMagick/blob/c6410959676151a94bb1efc32667571dadadd5df/coders/jpeg.c#L866&quot;&gt;on ImageMagick’s original code&lt;/a&gt;, and estimates the JPEG compression quality from the quantization tables. This immediately grabbed my attention, since my quality-checking workflow is also implemented in Python. The ability to estimate JPEG quality natively in Python would remove the need to wrap any external tools for this.&lt;/p&gt;

&lt;p&gt;A quick test showed that it produced results that were identical to ImageMagick in most cases, but like ImageMagick, it failed to come up with a quality estimate for my problematic JPEG. To find out why, I incorporated Edward’s code into &lt;a href=&quot;https://github.com/KBNLresearch/jpeg-quality-demo/blob/main/jpegquality-im-original.py&quot;&gt;this test script&lt;/a&gt;.&lt;/p&gt;

&lt;h2 id=&quot;how-it-works&quot;&gt;How it works&lt;/h2&gt;

&lt;p&gt;The algorithm reads the image’s quantization tables (usually 2), each of which is a list of 64 quantization coefficients. It first adds up all of these numbers, resulting in variable &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;qsum&lt;/code&gt;. It then calculates &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;qvalue&lt;/code&gt;, which is the sum of the quantization coefficients at 2 specific positions in each table (these are then summed for all quantization tables). The main “meat and potatoes” of the heuristic is this loop at the very end of the function:&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;range&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;100&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
    &lt;span class=&quot;nf&quot;&gt;if &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;((&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;qvalue&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;hashes&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;])&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;and&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;qsum&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;sums&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;])):&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;continue&lt;/span&gt;
    &lt;span class=&quot;nf&quot;&gt;if &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(((&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;qvalue&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;hashes&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;])&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;and&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;qsum&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;sums&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]))&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;or&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;50&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)):&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;+&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;For each iteration, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;qvalue&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;qsum&lt;/code&gt; are evaluated against the corresponding values in two hard-coded numerical lists (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;hashes[i]&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;sums[i]&lt;/code&gt;). Although I couldn’t find any documentation, a little digging showed that the values in the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;sums&lt;/code&gt; and  &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;hashes&lt;/code&gt; lists are derived from the “standard” quantization tables defined in Annex K of &lt;a href=&quot;http://www.w3.org/Graphics/JPEG/itu-t81.pdf&quot;&gt;the JPEG standard&lt;/a&gt;&lt;sup id=&quot;fnref:7&quot;&gt;&lt;a href=&quot;#fn:7&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;4&lt;/a&gt;&lt;/sup&gt;. The first &lt;em&gt;if&lt;/em&gt; block makes sure that as long as &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;qvalue&lt;/code&gt; is smaller than &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;hashes[i]&lt;/code&gt; &lt;em&gt;and&lt;/em&gt; &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;qsum&lt;/code&gt; is smaller than &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;sums[i]&lt;/code&gt;, the code will immediately jump to the next iteration, skipping the second &lt;em&gt;if&lt;/em&gt; block. The second &lt;em&gt;if&lt;/em&gt; block (which reports the quality estimate as &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;i+1&lt;/code&gt;) is only evaluated if the test condition in the first block fails.&lt;/p&gt;
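
&lt;p&gt;The two aggregate values that feed this loop can be sketched in a few lines of Python. This is a simplified illustration, not the code from the linked scripts: the function name &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;aggregate_tables&lt;/code&gt; is hypothetical, and the sampled positions (2 and 53 in the first table, 0 and 63 in the second) are an assumption that should be checked against ImageMagick’s source.&lt;/p&gt;

```python
def aggregate_tables(tables, positions=((2, 53), (0, 63))):
    """Return (qsum, qvalue) for a list of 64-value quantization tables.

    `positions` holds, per table, the indices whose coefficients are
    added to qvalue. The defaults are an assumption and should be
    verified against ImageMagick's source.
    """
    # qsum: every coefficient of every table added together
    qsum = sum(sum(table) for table in tables)
    # qvalue: only the coefficients at the sampled positions
    qvalue = sum(
        table[index]
        for table, indices in zip(tables, positions)
        for index in indices
    )
    return qsum, qvalue


# Two made-up tables, purely for illustration
luminance = list(range(1, 65))
chrominance = [10] * 64
print(aggregate_tables([luminance, chrominance]))
```

&lt;p&gt;For these made-up tables the result is &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;(2720, 77)&lt;/code&gt;. Both numbers are then compared against the hard-coded reference lists.&lt;/p&gt;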

&lt;h2 id=&quot;tracing-all-loop-variables&quot;&gt;Tracing all loop variables&lt;/h2&gt;

&lt;p&gt;To get a better impression of why the heuristic cannot come up with a meaningful quality estimate for my problematic JPEG, I added a line of code that prints out all variables at the start of each iteration. This gave the following output:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;i: 0, qvalue: 586, hashes[i]:1020, qsum:24028, sums[i]32640
i: 1, qvalue: 586, hashes[i]:1015, qsum:24028, sums[i]32635
i: 2, qvalue: 586, hashes[i]:932, qsum:24028, sums[i]32266
i: 3, qvalue: 586, hashes[i]:848, qsum:24028, sums[i]31495
i: 4, qvalue: 586, hashes[i]:780, qsum:24028, sums[i]30665
i: 5, qvalue: 586, hashes[i]:735, qsum:24028, sums[i]29804
i: 6, qvalue: 586, hashes[i]:702, qsum:24028, sums[i]29146
i: 7, qvalue: 586, hashes[i]:679, qsum:24028, sums[i]28599
i: 8, qvalue: 586, hashes[i]:660, qsum:24028, sums[i]28104
i: 9, qvalue: 586, hashes[i]:645, qsum:24028, sums[i]27670
i: 10, qvalue: 586, hashes[i]:632, qsum:24028, sums[i]27225
i: 11, qvalue: 586, hashes[i]:623, qsum:24028, sums[i]26725
i: 12, qvalue: 586, hashes[i]:613, qsum:24028, sums[i]26210
i: 13, qvalue: 586, hashes[i]:607, qsum:24028, sums[i]25716
i: 14, qvalue: 586, hashes[i]:600, qsum:24028, sums[i]25240
i: 15, qvalue: 586, hashes[i]:594, qsum:24028, sums[i]24789
i: 16, qvalue: 586, hashes[i]:589, qsum:24028, sums[i]24373
i: 17, qvalue: 586, hashes[i]:585, qsum:24028, sums[i]23946
quality: -1
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Here we see that at &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;i=17&lt;/code&gt;, the value of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;sums[i]&lt;/code&gt; becomes smaller than &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;qsum&lt;/code&gt;. As a result, we end up in the second &lt;em&gt;if&lt;/em&gt; block. This reports the quality factor as &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;i+1&lt;/code&gt;, but &lt;em&gt;only&lt;/em&gt; if either of the following conditions is met:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;both &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;qvalue&lt;/code&gt; is smaller than or equal to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;hashes[i]&lt;/code&gt;, &lt;em&gt;and&lt;/em&gt; &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;qsum&lt;/code&gt; is smaller than or equal to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;sums[i]&lt;/code&gt;, &lt;em&gt;or&lt;/em&gt;:&lt;/li&gt;
  &lt;li&gt;the value of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;i&lt;/code&gt; is larger than or equal to 50.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;In this case, the trace shows that at &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;i=17&lt;/code&gt; &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;qvalue&lt;/code&gt; (586) is &lt;em&gt;not&lt;/em&gt; smaller than (or equal to) &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;hashes[i]&lt;/code&gt; (585), and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;qsum&lt;/code&gt; (24028) is &lt;em&gt;not&lt;/em&gt; smaller than (or equal to) &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;sums[i]&lt;/code&gt; (23946) either. So the first condition is not met. Since &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;i&lt;/code&gt; equals 17, the second condition is not met either, meaning that the code doesn’t come up with a meaningful quality estimate, and reports the fallback value (-1) instead.&lt;/p&gt;

&lt;h2 id=&quot;effect-of-quality-threshold&quot;&gt;Effect of quality threshold&lt;/h2&gt;

&lt;p&gt;It’s not clear why the heuristic uses the quality threshold value of 50, although &lt;a href=&quot;https://fotoforensics.com/tutorial.php?tt=estq&quot;&gt;Neal Krawetz points out&lt;/a&gt; that “the JPEG Standard changes algorithms at quality values below 50%”&lt;sup id=&quot;fnref:6&quot;&gt;&lt;a href=&quot;#fn:6&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;5&lt;/a&gt;&lt;/sup&gt;. As a test, I tried  changing the threshold to 0:&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nf&quot;&gt;if &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(((&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;qvalue&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;hashes&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;])&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;and&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;qsum&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;sums&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]))&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;or&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)):&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;+&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;With this change, the code now reports a quality value of 18. Incidentally this is identical to the &lt;em&gt;Fotoforensics&lt;/em&gt; estimate.&lt;/p&gt;
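
&lt;p&gt;The effect of the threshold can be reproduced with a small self-contained sketch. It uses only the first 18 entries of the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;hashes&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;sums&lt;/code&gt; lists, copied from the trace output above. The function name &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;estimate_quality&lt;/code&gt;, the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;threshold&lt;/code&gt; parameter and the explicit &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;break&lt;/code&gt; are assumptions made for this illustration; the break is inferred from the fact that the trace stops at &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;i=17&lt;/code&gt;.&lt;/p&gt;

```python
# hashes[i] and sums[i] for i = 0..17, copied from the trace output
HASHES = [1020, 1015, 932, 848, 780, 735, 702, 679, 660,
          645, 632, 623, 613, 607, 600, 594, 589, 585]
SUMS = [32640, 32635, 32266, 31495, 30665, 29804, 29146, 28599, 28104,
        27670, 27225, 26725, 26210, 25716, 25240, 24789, 24373, 23946]


def estimate_quality(qvalue, qsum, hashes, sums, threshold=50):
    """Return the quality estimate, or -1 if the heuristic gives up."""
    for i in range(len(hashes)):
        # Keep going while both aggregates are below the reference values
        if hashes[i] > qvalue and sums[i] > qsum:
            continue
        # Report i+1 on a (near) match, or whenever i has reached the threshold
        if (hashes[i] >= qvalue and sums[i] >= qsum) or i >= threshold:
            return i + 1
        # Assumption: the search stops here, as the trace suggests
        break
    return -1


print(estimate_quality(586, 24028, HASHES, SUMS))               # threshold 50
print(estimate_quality(586, 24028, HASHES, SUMS, threshold=0))  # threshold 0
```

&lt;p&gt;With the default threshold of 50 this prints -1, and with a threshold of 0 it prints 18, which matches the behaviour described above.&lt;/p&gt;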

&lt;h2 id=&quot;modified-imagemagick-heuristic&quot;&gt;Modified ImageMagick heuristic&lt;/h2&gt;

&lt;p&gt;Setting the threshold to 0 effectively makes the first condition in the second &lt;em&gt;if&lt;/em&gt; block superfluous, which means we could simplify things further to:&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;range&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;100&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
    &lt;span class=&quot;nf&quot;&gt;if &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;((&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;qvalue&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;hashes&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;])&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;and&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;qsum&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;sums&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;])):&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;continue&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;else&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;+&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;I also noticed that the original ImageMagick code includes a variable that indicates whether the quality estimate is “exact” or “approximate” (&lt;a href=&quot;https://github.com/ImageMagick/ImageMagick6/blob/bf9bc7fee9f3cea9ab8557ad1573a57258eab95b/coders/jpeg.c#L1030&quot;&gt;here&lt;/a&gt;). I initially assumed that an “exact” match implies a perfect agreement with the standard JPEG quantization tables. This would be useful information to assess the accuracy of the quality estimate. I created &lt;a href=&quot;https://github.com/KBNLresearch/jpeg-quality-demo/blob/main/jpegquality-im-modified.py&quot;&gt;a test script with a modified version of the ImageMagick heuristic&lt;/a&gt; that incorporates the following changes:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Removal of the quality threshold.&lt;/li&gt;
  &lt;li&gt;Added reporting of the “exactness” flag, based on the original ImageMagick code.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2 id=&quot;tests-with-pillow-and-imagemagick-jpegs&quot;&gt;Tests with Pillow and ImageMagick JPEGs&lt;/h2&gt;

&lt;p&gt;As a first test I created a set of small test images with known quality values. I did this using Python’s &lt;a href=&quot;https://python-pillow.org/&quot;&gt;Pillow library&lt;/a&gt; and &lt;a href=&quot;https://imagemagick.org/&quot;&gt;ImageMagick&lt;/a&gt;. In both cases I generated test images with quality values 5, 10, 25, 50, 75 and 100, respectively. The images are available &lt;a href=&quot;https://github.com/KBNLresearch/jpeg-quality-demo/tree/main/images/im_pil&quot;&gt;here&lt;/a&gt;. Running the script with my modified ImageMagick heuristic gave the following result:&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th style=&quot;text-align: left&quot;&gt;Q&lt;sub&gt;enc&lt;/sub&gt;&lt;/th&gt;
      &lt;th style=&quot;text-align: left&quot;&gt;Q&lt;sub&gt;est&lt;/sub&gt;(Pillow)&lt;/th&gt;
      &lt;th style=&quot;text-align: left&quot;&gt;Exact(Pillow)&lt;/th&gt;
      &lt;th style=&quot;text-align: left&quot;&gt;Q&lt;sub&gt;est&lt;/sub&gt;(IM)&lt;/th&gt;
      &lt;th style=&quot;text-align: left&quot;&gt;Exact(IM)&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;5&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;5&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;True&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;5&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;True&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;10&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;10&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;True&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;10&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;True&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;25&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;25&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;True&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;25&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;True&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;50&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;50&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;True&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;50&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;True&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;75&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;75&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;True&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;75&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;True&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;100&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;100&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;True&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;100&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;True&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;Here &lt;em&gt;Q&lt;sub&gt;enc&lt;/sub&gt;&lt;/em&gt; is the encoding quality, and &lt;em&gt;Q&lt;sub&gt;est&lt;/sub&gt;(Pillow)&lt;/em&gt; and &lt;em&gt;Q&lt;sub&gt;est&lt;/sub&gt;(IM)&lt;/em&gt; represent the script’s estimates for the Pillow and ImageMagick images, respectively. &lt;em&gt;Exact(Pillow)&lt;/em&gt; and &lt;em&gt;Exact(IM)&lt;/em&gt; represent the reported values of the “exactness” flag. The table shows that the script was able to reproduce the encoding quality with an “exact” match for all test images.&lt;/p&gt;

&lt;p&gt;I also ran the script on some of the problematic JPEGs from our dbnl access PDFs. Here, it estimated the JPEG quality at 18%, but without an “exact” match. The JPEGs from the corresponding master PDFs resulted in an 84% quality estimate (which is slightly less than the expected quality of 85%), but again without an “exact” result.&lt;/p&gt;

&lt;p&gt;Although the results of these (very limited!) tests look encouraging at first sight, this exercise left me with some doubts and reservations.&lt;/p&gt;

&lt;h2 id=&quot;limitations-of-imagemagicks-heuristic&quot;&gt;Limitations of ImageMagick’s heuristic&lt;/h2&gt;

&lt;p&gt;Most importantly, ImageMagick’s heuristic is based on a comparison of &lt;em&gt;aggregated&lt;/em&gt; coefficients of the image’s quantization tables, which makes it potentially vulnerable to collisions. As an example, for any value of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;qsum&lt;/code&gt; (the sum of all coefficients in the quantization table), many possible combinations of quantization coefficients exist that will add up to the same value. The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;qvalue&lt;/code&gt; check (which is based on coefficients at specific positions in the quantization table) partially overcomes this, but it does this in a pretty crude way using yet another aggregate measure.&lt;/p&gt;
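
&lt;p&gt;To make the collision problem concrete, here’s a minimal Python sketch. The two tables in it are made-up illustrations (they are not taken from a real JPEG or from ImageMagick’s code), and the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;qsum&lt;/code&gt; function is a simplified stand-in for the aggregate that the heuristic computes. The second table is derived from the first by moving one unit between two coefficients, so the tables differ while their sums are identical:&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# Hypothetical 8x8 quantization table, flattened to 64 coefficients.
# The values are illustrative only, not taken from a real encoder.
table_a = [16] * 64

# Second table: shift one unit from the first coefficient to the last.
table_b = list(table_a)
table_b[0] -= 1
table_b[63] += 1


def qsum(table):
    """Sum of all coefficients in a quantization table."""
    return sum(table)


# The tables are different, yet their aggregate sums are identical,
# so a check that only looks at the sum cannot tell them apart.
tables_differ = table_a != table_b
sums_match = qsum(table_a) == qsum(table_b)
print(tables_differ, sums_match)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Both values print as &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;True&lt;/code&gt;: a comparison that is based on the sum alone would treat these two tables as equivalent.&lt;/p&gt;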

&lt;p&gt;Another thing that bothered me is that the reasoning behind certain aspects of ImageMagick’s heuristic isn’t entirely clear to me. Examples are the 50% threshold, the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;qvalue&lt;/code&gt; check (again) and the precise meaning of its “exact” vs “approximate” designation.&lt;/p&gt;

&lt;h2 id=&quot;further-down-the-jpeg-rabbit-hole&quot;&gt;Further down the JPEG rabbit hole&lt;/h2&gt;

&lt;p&gt;When I wrote the first draft of this post, I hoped it would provoke some response from people who are better versed than me in the inner workings of JPEG compression. As happens so often with these things, only a few days after publishing it I came across &lt;a href=&quot;https://www.sciencedirect.com/science/article/pii/S1742287608000285&quot;&gt;this 2008 paper&lt;/a&gt; which explains how quantization tables in JPEG work. This inspired me to do some more in-depth testing, and ultimately this resulted in an alternative, more straightforward quality estimation method. This is the subject of a &lt;a href=&quot;/2024/10/30/jpeg-quality-estimation-using-simple-least-squares-matching-of-quantization-tables&quot;&gt;follow-up blog post&lt;/a&gt;, which is out now!&lt;/p&gt;

&lt;h2 id=&quot;acknowledgment&quot;&gt;Acknowledgment&lt;/h2&gt;

&lt;p&gt;Thanks are due to &lt;a href=&quot;https://github.com/eddy-geek&quot;&gt;Eddy O (AKA “eddygeek”)&lt;/a&gt; for creating the &lt;a href=&quot;https://gist.github.com/eddy-geek/c0f01dc5401dc50a49a0a821cdc9b3e8&quot;&gt;Python port of ImageMagick’s JPEG quality heuristic&lt;/a&gt; from which most of the results in this post are derived.&lt;/p&gt;

&lt;h2 id=&quot;annex-other-jpeg-quality-estimation-tools-and-methods&quot;&gt;Annex: other JPEG quality estimation tools and methods&lt;/h2&gt;

&lt;p&gt;While working on this, I came across a few alternative tools and methods for JPEG quality estimation. Here’s a brief overview, which is largely for my own reference (but I imagine others may find it useful as well).&lt;/p&gt;

&lt;h3 id=&quot;exiftool&quot;&gt;ExifTool&lt;/h3&gt;

&lt;p&gt;&lt;a href=&quot;https://exiftool.org/&quot;&gt;ExifTool&lt;/a&gt; also reports a JPEG quality estimate if its &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;-JPEGQualityEstimate&lt;/code&gt; option is invoked. For example:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;exiftool -JPEGQualityEstimate test_im_050.jpg
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;Results in:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;JPEG Quality Estimate           : 50
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;A peek at the &lt;a href=&quot;https://github.com/exiftool/exiftool/blob/4981552ec9bf94a0b5a64a06919b5e4f797c208e/lib/Image/ExifTool/JPEGDigest.pm#L2447&quot;&gt;source code&lt;/a&gt; shows it uses a ported version of ImageMagick’s heuristic, so it largely has the same limitations. As an example, running it on my problematic JPEG results in:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;JPEG Quality Estimate           : &amp;lt;unknown&amp;gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h3 id=&quot;approximate-ratios-method&quot;&gt;Approximate Ratios method&lt;/h3&gt;

&lt;p&gt;This method is outlined on &lt;a href=&quot;https://fotoforensics.com/tutorial.php?tt=estq&quot;&gt;Neal Krawetz’s Fotoforensics site&lt;/a&gt;, which also links to &lt;a href=&quot;https://www.hackerfactor.com/src/jpegquality.c&quot;&gt;an implementation in C&lt;/a&gt;. A more detailed explanation can be found in Section 3.3.3 of his &lt;a href=&quot;https://blackhat.com/presentations/bh-dc-08/Krawetz/Whitepaper/bh-dc-08-krawetz-WP.pdf&quot;&gt;Digital Image Analysis and Forensics whitepaper&lt;/a&gt;. As Krawetz explains, this method can become unreliable at quality values below 50%, so I didn’t consider it suitable for my use case.&lt;/p&gt;

&lt;h3 id=&quot;cogranne-method&quot;&gt;Cogranne method&lt;/h3&gt;

&lt;p&gt;A &lt;a href=&quot;https://arxiv.org/abs/1802.00992&quot;&gt;2018 paper by Rémi Cogranne&lt;/a&gt; describes an alternative method for estimating JPEG quality. It claims to overcome some of the limitations of established methods, such as the ImageMagick heuristic. However, as it is only valid for a quality factor greater than 49, this also isn’t ideally suited to my use case.&lt;/p&gt;

&lt;h2 id=&quot;scripts-and-test-data&quot;&gt;Scripts and test data&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;https://github.com/KBNLresearch/jpeg-quality-demo&quot;&gt;jpeg-quality-demo GitHub repository&lt;/a&gt; - GitHub repo with all scripts and test data that were used in this analysis.&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://github.com/KBNLresearch/jpeg-quality-demo/blob/main/jpegquality-im-modified.py&quot;&gt;jpegquality-im-modified.py&lt;/a&gt; - Python implementation of modified ImageMagick heuristic.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;revision-history&quot;&gt;Revision history&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;24 October 2024: re-arranged introductory section, and added an explanation on the difference between quality level and image quality.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;30 October 2024: revised earlier draft in preparation for the follow-up post.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;div class=&quot;footnotes&quot; role=&quot;doc-endnotes&quot;&gt;
  &lt;ol&gt;
    &lt;li id=&quot;fn:4&quot;&gt;
      &lt;p&gt;Note that the quality level does not necessarily reflect the &lt;em&gt;image quality&lt;/em&gt;! As an example, if an image was first compressed at 20% quality and subsequently re-saved at 90%, the image quality will be very low (relative to the source image), despite the high quality level at the last save operation. So the quality level only says something about the &lt;em&gt;compression process&lt;/em&gt; that was used on the last save. &lt;a href=&quot;#fnref:4&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:1&quot;&gt;
      &lt;p&gt;I used &lt;a href=&quot;https://poppler.freedesktop.org/&quot;&gt;Poppler&lt;/a&gt;’s &lt;em&gt;pdfimages&lt;/em&gt; tool to extract the images from the PDF, using the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;-all&lt;/code&gt; switch which ensures images are kept in their original format. &lt;a href=&quot;#fnref:1&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:2&quot;&gt;
      &lt;p&gt;I used ImageMagick 6.9.10-23. &lt;a href=&quot;#fnref:2&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:7&quot;&gt;
      &lt;p&gt;More precisely, each value in the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;sums&lt;/code&gt; lists represents the sum of all “standard” quantization coefficients for a particular quality level. Similarly, each value in the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;hashes&lt;/code&gt; lists represents a value of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;qvalue&lt;/code&gt; for a particular quality level in the “standard” tables. &lt;a href=&quot;#fnref:7&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:6&quot;&gt;
      &lt;p&gt;He mentions this in the context of the “Approximate Ratios” quality estimation method (which indeed becomes unreliable for low qualities). It’s not clear to me if other methods such as the one used by ImageMagick are also affected by this. &lt;a href=&quot;#fnref:6&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
  &lt;/ol&gt;
&lt;/div&gt;
</description>
                <link>https://bitsgalore.org/2024/10/23/jpeg-quality-estimation-experiments-with-a-modified-imagemagick-heuristic</link>
                <guid>https://bitsgalore.org/2024/10/23/jpeg-quality-estimation-experiments-with-a-modified-imagemagick-heuristic</guid>
                <pubDate>2024-10-23T00:00:00+02:00</pubDate>
        </item>

        <item>
                <title>Multi-image TIFFs, subfiles and image file directories</title>
                <description>&lt;figure class=&quot;image&quot;&gt;
  &lt;img src=&quot;/images/2024/03/confused-muddled-illogical-957696-1024.jpg&quot; alt=&quot;Photograph that shows a hammer that is used to smash a screw into a piece of wood. On the left is a nail that is partially pushed into the same piece of wood, with an adjustable wrench immediately next to it.&quot; /&gt;
  &lt;figcaption&gt;&lt;a href=&quot;https://jenikirbyhistory.getarchive.net/media/confused-muddled-illogical-957696&quot;&gt;&quot;Confused, muddled, illogical&quot;&lt;/a&gt;. Used under &lt;a href=&quot;https://web.archive.org/web/20170727004823/https://pixabay.com/en/service/license/&quot;&gt;Pixabay License&lt;/a&gt;.&lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;The KB has been using JP2 (JPEG 2000 Part 1) as the primary file format for its mass-digitisation activities for over 15 years now. Nevertheless, we still use uncompressed TIFF for a few collections. At the moment there’s an ongoing discussion about whether we should migrate those to JP2 as well at some point to save storage costs. Last week I ran a small test on a selection of TIFFs from those collections. I first converted them to JP2, and then verified that no information was lost during the conversion. This resulted in some surprises, which turned out to be caused by the presence of thumbnail images in some of the source TIFFs. This post discusses the impact of having multiple images inside a TIFF on preservation workflows, and also provides some suggestions on how to identify such files.&lt;/p&gt;

&lt;!-- more --&gt;

&lt;h2 id=&quot;tiff-to-jp2-workflow&quot;&gt;TIFF to JP2 workflow&lt;/h2&gt;

&lt;p&gt;For my tests, I took a selection of 20 test targets in uncompressed TIFF format from five different digitisation batches. For each of these targets, I then:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;converted the source TIFF to lossless JP2 with Kakadu;&lt;/li&gt;
  &lt;li&gt;converted the JP2 from step 1 back to uncompressed TIFF;&lt;/li&gt;
  &lt;li&gt;compared the pixel values of the TIFF from step 2 against those of the source TIFF&lt;sup id=&quot;fnref:1&quot;&gt;&lt;a href=&quot;#fn:1&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;1&lt;/a&gt;&lt;/sup&gt; with &lt;a href=&quot;https://imagemagick.org/script/compare.php&quot;&gt;ImageMagick’s &lt;em&gt;compare&lt;/em&gt; tool&lt;/a&gt;.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;For the pixel comparison in step 3, I used the following general command (using ImageMagick version 6.9.10-23):&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;compare &lt;span class=&quot;nt&quot;&gt;-quiet&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-metric&lt;/span&gt; AE source.tif fromjp2.tif null:
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This command computes the absolute error count (i.e. the number of different pixels) between both images&lt;sup id=&quot;fnref:2&quot;&gt;&lt;a href=&quot;#fn:2&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;2&lt;/a&gt;&lt;/sup&gt;, which is printed to standard error (&lt;a href=&quot;https://en.wikipedia.org/wiki/Standard_streams&quot;&gt;stderr&lt;/a&gt;). The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;-quiet&lt;/code&gt; switch is included to suppress warning messages (these are also printed to stderr, and can muddle up the error count output).&lt;/p&gt;

&lt;h2 id=&quot;imagemagick-reports-changed-pixel-values&quot;&gt;ImageMagick reports changed pixel values&lt;/h2&gt;

&lt;p&gt;Since we used lossless compression in our JP2 conversion step, the expected outcome here is that all pixel values are unchanged, and ImageMagick’s AE output metric is exactly zero for all images. For 12 images this was indeed the case. However, for 8 images the AE metric indicated that nearly all pixels had changed. An additional check showed that the computed &lt;a href=&quot;https://en.wikipedia.org/wiki/Peak_signal-to-noise_ratio&quot;&gt;peak signal-to-noise ratio&lt;/a&gt; (PSNR) values for these images were also extremely poor, ranging between 3 and 13. This was definitely unexpected!&lt;/p&gt;

&lt;p&gt;A subsequent visual inspection of the affected images in &lt;a href=&quot;https://www.gimp.org/&quot;&gt;GIMP&lt;/a&gt; did not show any obvious degradation. As an additional check, I also loaded one pair of images in GIMP as separate layers, and then &lt;a href=&quot;https://www.reddit.com/r/GIMP/comments/32dgfq/comment/cqa95dl/&quot;&gt;subtracted one layer from the other one&lt;/a&gt;. This resulted in an image where all RGB values were exactly 0, which is only possible if both images are identical. So what’s going on here?&lt;/p&gt;

&lt;h2 id=&quot;subfiles-and-image-file-directories&quot;&gt;Subfiles and image file directories&lt;/h2&gt;

&lt;p&gt;My first idea was to analyze all source TIFFs with &lt;a href=&quot;https://exiftool.org/&quot;&gt;ExifTool&lt;/a&gt;. For each source TIFF, I used the following general command (using ExifTool version 12.60):&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;exiftool &lt;span class=&quot;nt&quot;&gt;-X&lt;/span&gt; source.tif &lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; source.xml
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;An inspection of the resulting output files showed that all problematic TIFFs contain two separate “image file directories”. Within TIFF, it is possible to bundle multiple images in one single file. This is most commonly done by defining each image as a “subfile”, whose properties are described by a corresponding “image file directory” (&lt;em&gt;IFD&lt;/em&gt;). The individual images can be pages in a multi-page document, or different representations of the same image (e.g. a full size image and a low resolution thumbnail). In our particular case, we have this:&lt;/p&gt;

&lt;div class=&quot;language-xml highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt; &lt;span class=&quot;nt&quot;&gt;&amp;lt;IFD1:SubfileType&amp;gt;&lt;/span&gt;Reduced-resolution image&lt;span class=&quot;nt&quot;&gt;&amp;lt;/IFD1:SubfileType&amp;gt;&lt;/span&gt;
 &lt;span class=&quot;nt&quot;&gt;&amp;lt;IFD1:ImageWidth&amp;gt;&lt;/span&gt;160&lt;span class=&quot;nt&quot;&gt;&amp;lt;/IFD1:ImageWidth&amp;gt;&lt;/span&gt;
 &lt;span class=&quot;nt&quot;&gt;&amp;lt;IFD1:ImageHeight&amp;gt;&lt;/span&gt;126&lt;span class=&quot;nt&quot;&gt;&amp;lt;/IFD1:ImageHeight&amp;gt;&lt;/span&gt;
 &lt;span class=&quot;nt&quot;&gt;&amp;lt;IFD1:BitsPerSample&amp;gt;&lt;/span&gt;8 8 8&lt;span class=&quot;nt&quot;&gt;&amp;lt;/IFD1:BitsPerSample&amp;gt;&lt;/span&gt;
 &lt;span class=&quot;nt&quot;&gt;&amp;lt;IFD1:Compression&amp;gt;&lt;/span&gt;Uncompressed&lt;span class=&quot;nt&quot;&gt;&amp;lt;/IFD1:Compression&amp;gt;&lt;/span&gt;
 &lt;span class=&quot;nt&quot;&gt;&amp;lt;IFD1:PhotometricInterpretation&amp;gt;&lt;/span&gt;RGB&lt;span class=&quot;nt&quot;&gt;&amp;lt;/IFD1:PhotometricInterpretation&amp;gt;&lt;/span&gt;
 &lt;span class=&quot;nt&quot;&gt;&amp;lt;IFD1:StripOffsets&amp;gt;&lt;/span&gt;6510&lt;span class=&quot;nt&quot;&gt;&amp;lt;/IFD1:StripOffsets&amp;gt;&lt;/span&gt;
 &lt;span class=&quot;nt&quot;&gt;&amp;lt;IFD1:SamplesPerPixel&amp;gt;&lt;/span&gt;3&lt;span class=&quot;nt&quot;&gt;&amp;lt;/IFD1:SamplesPerPixel&amp;gt;&lt;/span&gt;
 &lt;span class=&quot;nt&quot;&gt;&amp;lt;IFD1:RowsPerStrip&amp;gt;&lt;/span&gt;126&lt;span class=&quot;nt&quot;&gt;&amp;lt;/IFD1:RowsPerStrip&amp;gt;&lt;/span&gt;
 &lt;span class=&quot;nt&quot;&gt;&amp;lt;IFD1:StripByteCounts&amp;gt;&lt;/span&gt;60480&lt;span class=&quot;nt&quot;&gt;&amp;lt;/IFD1:StripByteCounts&amp;gt;&lt;/span&gt;
 &lt;span class=&quot;nt&quot;&gt;&amp;lt;IFD1:PlanarConfiguration&amp;gt;&lt;/span&gt;Chunky&lt;span class=&quot;nt&quot;&gt;&amp;lt;/IFD1:PlanarConfiguration&amp;gt;&lt;/span&gt;
 &lt;span class=&quot;nt&quot;&gt;&amp;lt;IFD1:ThumbnailTIFF&amp;gt;&lt;/span&gt;(Binary data 60696 bytes, use -b option to extract)&lt;span class=&quot;nt&quot;&gt;&amp;lt;/IFD1:ThumbnailTIFF&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Here, all “IFD1” tags describe a subfile that is a reduced resolution thumbnail image (this is indicated by the &lt;em&gt;SubfileType&lt;/em&gt; tag and its value). The full-resolution image is described by “IFD0”, which is the first image file directory:&lt;/p&gt;

&lt;div class=&quot;language-xml highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt; &lt;span class=&quot;nt&quot;&gt;&amp;lt;IFD0:ImageWidth&amp;gt;&lt;/span&gt;9458&lt;span class=&quot;nt&quot;&gt;&amp;lt;/IFD0:ImageWidth&amp;gt;&lt;/span&gt;
 &lt;span class=&quot;nt&quot;&gt;&amp;lt;IFD0:ImageHeight&amp;gt;&lt;/span&gt;7429&lt;span class=&quot;nt&quot;&gt;&amp;lt;/IFD0:ImageHeight&amp;gt;&lt;/span&gt;
 &lt;span class=&quot;nt&quot;&gt;&amp;lt;IFD0:BitsPerSample&amp;gt;&lt;/span&gt;8 8 8&lt;span class=&quot;nt&quot;&gt;&amp;lt;/IFD0:BitsPerSample&amp;gt;&lt;/span&gt;
 &lt;span class=&quot;nt&quot;&gt;&amp;lt;IFD0:Compression&amp;gt;&lt;/span&gt;Uncompressed&lt;span class=&quot;nt&quot;&gt;&amp;lt;/IFD0:Compression&amp;gt;&lt;/span&gt;
 &lt;span class=&quot;nt&quot;&gt;&amp;lt;IFD0:PhotometricInterpretation&amp;gt;&lt;/span&gt;RGB&lt;span class=&quot;nt&quot;&gt;&amp;lt;/IFD0:PhotometricInterpretation&amp;gt;&lt;/span&gt;
 &lt;span class=&quot;nt&quot;&gt;&amp;lt;IFD0:Make&amp;gt;&lt;/span&gt;Leaf&lt;span class=&quot;nt&quot;&gt;&amp;lt;/IFD0:Make&amp;gt;&lt;/span&gt;
 &lt;span class=&quot;nt&quot;&gt;&amp;lt;IFD0:Model&amp;gt;&lt;/span&gt;Leaf Aptus-II 12R(LI201033   )/Other&lt;span class=&quot;nt&quot;&gt;&amp;lt;/IFD0:Model&amp;gt;&lt;/span&gt;
 &lt;span class=&quot;nt&quot;&gt;&amp;lt;IFD0:StripOffsets&amp;gt;&lt;/span&gt;(Binary data 72 bytes, use -b option to extract)&lt;span class=&quot;nt&quot;&gt;&amp;lt;/IFD0:StripOffsets&amp;gt;&lt;/span&gt;
 &lt;span class=&quot;nt&quot;&gt;&amp;lt;IFD0:Orientation&amp;gt;&lt;/span&gt;Horizontal (normal)&lt;span class=&quot;nt&quot;&gt;&amp;lt;/IFD0:Orientation&amp;gt;&lt;/span&gt;
 &lt;span class=&quot;nt&quot;&gt;&amp;lt;IFD0:SamplesPerPixel&amp;gt;&lt;/span&gt;3&lt;span class=&quot;nt&quot;&gt;&amp;lt;/IFD0:SamplesPerPixel&amp;gt;&lt;/span&gt;
 &lt;span class=&quot;nt&quot;&gt;&amp;lt;IFD0:RowsPerStrip&amp;gt;&lt;/span&gt;1024&lt;span class=&quot;nt&quot;&gt;&amp;lt;/IFD0:RowsPerStrip&amp;gt;&lt;/span&gt;
 &lt;span class=&quot;nt&quot;&gt;&amp;lt;IFD0:StripByteCounts&amp;gt;&lt;/span&gt;(Binary data 70 bytes, use -b option to extract)&lt;span class=&quot;nt&quot;&gt;&amp;lt;/IFD0:StripByteCounts&amp;gt;&lt;/span&gt;
 &lt;span class=&quot;nt&quot;&gt;&amp;lt;IFD0:XResolution&amp;gt;&lt;/span&gt;300&lt;span class=&quot;nt&quot;&gt;&amp;lt;/IFD0:XResolution&amp;gt;&lt;/span&gt;
 &lt;span class=&quot;nt&quot;&gt;&amp;lt;IFD0:YResolution&amp;gt;&lt;/span&gt;300&lt;span class=&quot;nt&quot;&gt;&amp;lt;/IFD0:YResolution&amp;gt;&lt;/span&gt;
 &lt;span class=&quot;nt&quot;&gt;&amp;lt;IFD0:PlanarConfiguration&amp;gt;&lt;/span&gt;Chunky&lt;span class=&quot;nt&quot;&gt;&amp;lt;/IFD0:PlanarConfiguration&amp;gt;&lt;/span&gt;
 &lt;span class=&quot;nt&quot;&gt;&amp;lt;IFD0:ResolutionUnit&amp;gt;&lt;/span&gt;inches&lt;span class=&quot;nt&quot;&gt;&amp;lt;/IFD0:ResolutionUnit&amp;gt;&lt;/span&gt;
 &lt;span class=&quot;nt&quot;&gt;&amp;lt;IFD0:Software&amp;gt;&lt;/span&gt;Capture One 8 Macintosh&lt;span class=&quot;nt&quot;&gt;&amp;lt;/IFD0:Software&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
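
&lt;p&gt;The chain of image file directories can also be followed directly in the file’s bytes. Below is a minimal Python sketch of this, which is not part of my original workflow. It assumes a classic (non-BigTIFF) TIFF, it only follows the top-level IFD chain (so it ignores any &lt;em&gt;SubIFD&lt;/em&gt; tags), and the function name &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;count_ifds&lt;/code&gt; is made up for this example:&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;def count_ifds(data):
    """Count the top-level IFDs in the bytes of a classic TIFF file."""
    # Bytes 0-1 give the byte order: 'II' is little-endian, 'MM' big-endian
    if data[0:2] == b'II':
        order = 'little'
    elif data[0:2] == b'MM':
        order = 'big'
    else:
        raise ValueError('not a TIFF file')
    # Bytes 2-3 hold the magic number 42 (BigTIFF uses 43; not handled here)
    if int.from_bytes(data[2:4], order) != 42:
        raise ValueError('not a classic TIFF file')
    # Bytes 4-7 hold the offset of the first IFD
    offset = int.from_bytes(data[4:8], order)
    seen = set()
    while offset != 0:
        if offset in seen:
            raise ValueError('circular IFD chain')
        seen.add(offset)
        # An IFD starts with a 2-byte entry count, followed by 12 bytes
        # per entry and a 4-byte offset of the next IFD (0 means none)
        n_entries = int.from_bytes(data[offset:offset + 2], order)
        next_pos = offset + 2 + 12 * n_entries
        offset = int.from_bytes(data[next_pos:next_pos + 4], order)
    return len(seen)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Passing the full contents of a file (read in binary mode) to this function returns the number of IFDs. For the problematic TIFFs described here, which contain IFD0 and IFD1, it should return 2. Note that this sketch reads the whole file into memory, which is wasteful for large images; a more careful version would seek to each offset instead.&lt;/p&gt;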

&lt;h2 id=&quot;subifds&quot;&gt;SubIFDs&lt;/h2&gt;

&lt;p&gt;For completeness, I should mention here that an alternative way to store multiple images in a TIFF is to combine them in one single subfile. The corresponding image file directory then contains &lt;em&gt;SubIFD&lt;/em&gt; child tags, each of which describes an individual image. The &lt;em&gt;SubIFD&lt;/em&gt; tag is defined in the &lt;a href=&quot;https://www.awaresystems.be/imaging/tiff/specification/TIFFPM6.pdf&quot;&gt;Adobe PageMaker 6.0 TIFF Technical Notes&lt;/a&gt;. According to the documentation of the &lt;a href=&quot;https://libtiff.gitlab.io/libtiff/&quot;&gt;LibTIFF&lt;/a&gt; library, &lt;a href=&quot;https://libtiff.gitlab.io/libtiff/multi_page.html&quot;&gt;“SubIFD chains are rarely supported”&lt;/a&gt;. I also couldn’t find any example files that use them.&lt;/p&gt;

&lt;h2 id=&quot;remove-thumbnail-with-exiftool&quot;&gt;Remove thumbnail with ExifTool&lt;/h2&gt;

&lt;p&gt;To test whether the presence of the thumbnail is indeed the cause of my weird pixel check results, I removed it from (a copy of) one of the problematic source TIFFs with the following ExifTool command:&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;exiftool &lt;span class=&quot;nt&quot;&gt;-ifd1&lt;/span&gt;:all&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-m&lt;/span&gt; source.tif
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Note that this removes &lt;em&gt;IFD1&lt;/em&gt; (which is the second &lt;em&gt;IFD&lt;/em&gt;) and its associated data.&lt;/p&gt;

&lt;p&gt;When I re-ran ImageMagick’s &lt;em&gt;compare&lt;/em&gt; tool with the modified file, the output was as expected: the absolute error count was 0, and the PSNR value was “inf”&lt;sup id=&quot;fnref:3&quot;&gt;&lt;a href=&quot;#fn:3&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;3&lt;/a&gt;&lt;/sup&gt;.&lt;/p&gt;

&lt;h2 id=&quot;pixel-compare-with-unmodified-source-tiffs&quot;&gt;Pixel compare with unmodified source TIFFs&lt;/h2&gt;

&lt;p&gt;Even though removing the thumbnail gets the job done, it adds another processing step to our workflow. Fortunately, a &lt;a href=&quot;https://github.com/ImageMagick/ImageMagick/discussions/3279#discussioncomment-387226&quot;&gt;little digging&lt;/a&gt; revealed that it’s possible to explicitly address individual images in a multi-image file in ImageMagick. If we (only) want to use the first image in both our input images, we can call the compare tool like this:&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;compare &lt;span class=&quot;nt&quot;&gt;-quiet&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-metric&lt;/span&gt; AE source.tif[0] fromjp2.tif[0] null:
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Note that the number between square brackets sets the index of the subfile that is used for the comparison. Applying this to our unmodified source TIFFs indeed results in an absolute error count of 0 for all image pairs. I implemented the above command in a simple &lt;a href=&quot;https://github.com/KBNLresearch/jp2totiff/blob/master/pixelCheck-im.sh&quot;&gt;test script&lt;/a&gt; that compares two directory trees with TIFF images, and reports the results to a comma-delimited file.&lt;/p&gt;
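
&lt;p&gt;For those who prefer Python over a shell script, the same call can be wrapped roughly as follows. This is a hedged sketch and not a port of the script linked above: the function names are made up, it assumes ImageMagick’s &lt;em&gt;compare&lt;/em&gt; tool is available on the system path under that name, and it assumes that the first token written to stderr is the error count (the exact layout of this output may differ between ImageMagick versions):&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;import subprocess


def build_compare_command(source, derived, index=0):
    """Build a 'compare' call that addresses one subfile in each image."""
    return ['compare', '-quiet', '-metric', 'AE',
            '{}[{}]'.format(source, index),
            '{}[{}]'.format(derived, index),
            'null:']


def absolute_error(source, derived, index=0):
    """Run compare and return the AE count that it prints to stderr."""
    result = subprocess.run(build_compare_command(source, derived, index),
                            capture_output=True, text=True)
    # Assumption: the first whitespace-separated token is the count.
    # This raises IndexError if compare wrote nothing to stderr.
    return float(result.stderr.split()[0])
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Calling &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;absolute_error('source.tif', 'fromjp2.tif')&lt;/code&gt; then compares the first subfile of both images. The command is passed as a list of arguments (not as a single shell string), so the square brackets are not interpreted by a shell.&lt;/p&gt;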

&lt;h2 id=&quot;imagemagick-reads-last-subfile-by-default&quot;&gt;ImageMagick reads last subfile by default&lt;/h2&gt;

&lt;p&gt;As a final test, I deliberately instructed ImageMagick to use the second subfile (i.e. the thumbnail) of the source TIFF as a basis for the comparison:&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;compare &lt;span class=&quot;nt&quot;&gt;-quiet&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-metric&lt;/span&gt; AE source.tif[1] fromjp2.tif[0] null:
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The resulting (large) AE value is identical to what is reported if the subfiles are not set by the user at all. So, it seems that by default ImageMagick uses the &lt;em&gt;last&lt;/em&gt; subfile of each input image, and that this is the root cause of the unexpected behaviour!&lt;/p&gt;

&lt;h2 id=&quot;are-multi-image-tiffs-a-preservation-risk&quot;&gt;Are multi-image TIFFs a preservation risk?&lt;/h2&gt;

&lt;p&gt;ImageMagick’s behaviour made me wonder to what extent the presence of multiple images poses a preservation risk. In our specific example, the second IFD only represents a thumbnail, and losing this in a format migration isn’t such a big deal.&lt;/p&gt;

&lt;p&gt;Section 7 of the &lt;a href=&quot;https://web.archive.org/web/20180810205359/https://www.adobe.io/content/udp/en/open/standards/TIFF/_jcr_content/contentbody/download/file.res/TIFF6.pdf&quot;&gt;TIFF 6.0 specification&lt;/a&gt; describes how baseline TIFF readers should handle files with multiple images:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;TIFF readers must be prepared for multiple images (subfiles) per TIFF file,
although they are not required to do anything with images after the first one.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;For the thumbnail case, I can think of two potential problems:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;If the writer application mistakenly wrote the thumbnail as the first subfile (and the full resolution image as the last one), a migration tool that is based on a conforming reader would then migrate the thumbnail, and ignore the full resolution image data.&lt;/li&gt;
  &lt;li&gt;If a migration tool mistakenly reads the last subfile instead of the first one, this would also result in the loss of data.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;I expect the above problems are largely hypothetical, especially since TIFF has been around for such a long time. But I wouldn’t completely rule them out either. Importantly, such errors could pass unnoticed by image comparison tools like ImageMagick’s &lt;em&gt;compare&lt;/em&gt; if they use the wrong subfile as a comparison basis. Fortunately, the extremely small size of the resulting image files would be a pretty obvious clue that something is amiss.&lt;/p&gt;

&lt;p&gt;TIFFs with multiple images that represent pages in a multi-page document could pose a larger risk. I don’t really expect any of these to show up in our own collections&lt;sup id=&quot;fnref:4&quot;&gt;&lt;a href=&quot;#fn:4&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;4&lt;/a&gt;&lt;/sup&gt;, but the situation might be different for other collections.&lt;/p&gt;

&lt;p&gt;Finally, TIFFs with subfiles that in turn contain multiple images through the use of &lt;em&gt;SubIFD&lt;/em&gt; tags add another layer of complexity.&lt;/p&gt;

&lt;p&gt;So, depending on your specific situation, it might be prudent to check your TIFFs for the presence of multiple images before doing any preservation actions on them, like a migration to some other format.&lt;/p&gt;

&lt;h2 id=&quot;detection-of-multi-image-tiffs&quot;&gt;Detection of multi-image TIFFs&lt;/h2&gt;

&lt;p&gt;In order to detect the presence of multiple images, we have a couple of options.&lt;/p&gt;

&lt;h3 id=&quot;exiftool&quot;&gt;ExifTool&lt;/h3&gt;

&lt;p&gt;With &lt;a href=&quot;https://exiftool.org/&quot;&gt;ExifTool&lt;/a&gt;, we can use the aforementioned command:&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;exiftool &lt;span class=&quot;nt&quot;&gt;-X&lt;/span&gt; source.tif &lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; source.xml
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;As an example, here’s some output for a 6-page TIFF (&lt;a href=&quot;https://www.leadtools.com/support/forum/resource.ashx?a=544&amp;amp;b=1&quot;&gt;file available here&lt;/a&gt;):&lt;/p&gt;

&lt;div class=&quot;language-xml highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt; &lt;span class=&quot;nt&quot;&gt;&amp;lt;IFD0:SubfileType&amp;gt;&lt;/span&gt;Full-resolution image&lt;span class=&quot;nt&quot;&gt;&amp;lt;/IFD0:SubfileType&amp;gt;&lt;/span&gt;
 &lt;span class=&quot;nt&quot;&gt;&amp;lt;IFD0:ImageWidth&amp;gt;&lt;/span&gt;363&lt;span class=&quot;nt&quot;&gt;&amp;lt;/IFD0:ImageWidth&amp;gt;&lt;/span&gt;
 &lt;span class=&quot;nt&quot;&gt;&amp;lt;IFD0:ImageHeight&amp;gt;&lt;/span&gt;7429&lt;span class=&quot;nt&quot;&gt;&amp;lt;/IFD0:ImageHeight&amp;gt;&lt;/span&gt;
  ::
 &lt;span class=&quot;nt&quot;&gt;&amp;lt;IFD1:SubfileType&amp;gt;&lt;/span&gt;Full-resolution image&lt;span class=&quot;nt&quot;&gt;&amp;lt;/IFD1:SubfileType&amp;gt;&lt;/span&gt;
 &lt;span class=&quot;nt&quot;&gt;&amp;lt;IFD1:ImageWidth&amp;gt;&lt;/span&gt;363&lt;span class=&quot;nt&quot;&gt;&amp;lt;/IFD1:ImageWidth&amp;gt;&lt;/span&gt;
 &lt;span class=&quot;nt&quot;&gt;&amp;lt;IFD1:ImageHeight&amp;gt;&lt;/span&gt;382&lt;span class=&quot;nt&quot;&gt;&amp;lt;/IFD1:ImageHeight&amp;gt;&lt;/span&gt;
  ::
  ::
 &lt;span class=&quot;nt&quot;&gt;&amp;lt;IFD5:SubfileType&amp;gt;&lt;/span&gt;Full-resolution image&lt;span class=&quot;nt&quot;&gt;&amp;lt;/IFD5:SubfileType&amp;gt;&lt;/span&gt;
 &lt;span class=&quot;nt&quot;&gt;&amp;lt;IFD5:ImageWidth&amp;gt;&lt;/span&gt;363&lt;span class=&quot;nt&quot;&gt;&amp;lt;/IFD5:ImageWidth&amp;gt;&lt;/span&gt;
 &lt;span class=&quot;nt&quot;&gt;&amp;lt;IFD5:ImageHeight&amp;gt;&lt;/span&gt;382&lt;span class=&quot;nt&quot;&gt;&amp;lt;/IFD5:ImageHeight&amp;gt;&lt;/span&gt;
  ::
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;If the output contains one or more &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;&amp;lt;IFD#:SubfileType&amp;gt;&lt;/code&gt; tags (where &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;#&lt;/code&gt; represents an integer number), this indicates the file contains subfiles. It’s worth noting that page 41 of the &lt;a href=&quot;https://web.archive.org/web/20180810205359/https://www.adobe.io/content/udp/en/open/standards/TIFF/_jcr_content/contentbody/download/file.res/TIFF6.pdf&quot;&gt;TIFF 6.0 specification&lt;/a&gt; mentions that the &lt;em&gt;SubfileType&lt;/em&gt; tag is deprecated, and the &lt;em&gt;NewSubfileType&lt;/em&gt; tag should be used instead. So why does ExifTool report &lt;em&gt;SubfileType&lt;/em&gt;? A quick look at &lt;a href=&quot;https://exiftool.org/TagNames/EXIF.html&quot;&gt;ExifTool’s documentation&lt;/a&gt; reveals that ExifTool’s &lt;em&gt;SubfileType&lt;/em&gt; output actually corresponds to the TIFF spec’s &lt;em&gt;NewSubfileType&lt;/em&gt; tag. If a TIFF contains the deprecated tag (&lt;em&gt;SubfileType&lt;/em&gt; in the TIFF 6.0 spec), ExifTool reports this as &lt;em&gt;OldSubfileType&lt;/em&gt; (yes, this is a bit confusing!).&lt;/p&gt;

&lt;p&gt;I also ran ExifTool on &lt;a href=&quot;https://www.leadtools.com/support/forum/resource.ashx?a=545&amp;amp;b=1&quot;&gt;this 11-page TIFF&lt;/a&gt;, which uses a different compression type for each page. The compression types can be inferred from the &lt;em&gt;Compression&lt;/em&gt; fields in ExifTool’s output:&lt;/p&gt;

&lt;div class=&quot;language-xml highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt; &lt;span class=&quot;nt&quot;&gt;&amp;lt;IFD0:SubfileType&amp;gt;&lt;/span&gt;Single page of multi-page image&lt;span class=&quot;nt&quot;&gt;&amp;lt;/IFD0:SubfileType&amp;gt;&lt;/span&gt;
    ::
 &lt;span class=&quot;nt&quot;&gt;&amp;lt;IFD0:Compression&amp;gt;&lt;/span&gt;Uncompressed&lt;span class=&quot;nt&quot;&gt;&amp;lt;/IFD0:Compression&amp;gt;&lt;/span&gt;
    ::
 &lt;span class=&quot;nt&quot;&gt;&amp;lt;IFD1:SubfileType&amp;gt;&lt;/span&gt;Single page of multi-page image&lt;span class=&quot;nt&quot;&gt;&amp;lt;/IFD1:SubfileType&amp;gt;&lt;/span&gt;
    ::
 &lt;span class=&quot;nt&quot;&gt;&amp;lt;IFD2:SubfileType&amp;gt;&lt;/span&gt;Single page of multi-page image&lt;span class=&quot;nt&quot;&gt;&amp;lt;/IFD2:SubfileType&amp;gt;&lt;/span&gt;
    ::
 &lt;span class=&quot;nt&quot;&gt;&amp;lt;IFD2:Compression&amp;gt;&lt;/span&gt;JPEG 2000&lt;span class=&quot;nt&quot;&gt;&amp;lt;/IFD2:Compression&amp;gt;&lt;/span&gt;
    ::
 &lt;span class=&quot;nt&quot;&gt;&amp;lt;IFD3:SubfileType&amp;gt;&lt;/span&gt;Single page of multi-page image&lt;span class=&quot;nt&quot;&gt;&amp;lt;/IFD3:SubfileType&amp;gt;&lt;/span&gt;
    ::
 &lt;span class=&quot;nt&quot;&gt;&amp;lt;IFD3:Compression&amp;gt;&lt;/span&gt;JPEG&lt;span class=&quot;nt&quot;&gt;&amp;lt;/IFD3:Compression&amp;gt;&lt;/span&gt;
    ::
    ::
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Also, as explained above, a single subfile can in turn contain multiple images, each of which is described by a &lt;em&gt;SubIFD&lt;/em&gt; tag. I couldn’t find any example files that use &lt;em&gt;SubIFDs&lt;/em&gt;, but ExifTool reports them according to &lt;a href=&quot;https://exiftool.org/TagNames/EXIF.html&quot;&gt;the documentation&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;So, if you just want to cover all of the aforementioned cases, you might want to check your TIFFs for the presence of:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;&lt;em&gt;IFD&lt;/em&gt; with &lt;em&gt;SubfileType&lt;/em&gt; tag (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;&amp;lt;IFD#:SubfileType&amp;gt;&lt;/code&gt;)&lt;/li&gt;
  &lt;li&gt;&lt;em&gt;IFD&lt;/em&gt; with &lt;em&gt;OldSubfileType&lt;/em&gt; tag (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;&amp;lt;IFD#:OldSubfileType&amp;gt;&lt;/code&gt;)&lt;/li&gt;
  &lt;li&gt;&lt;em&gt;SubIFD&lt;/em&gt; tag.&lt;/li&gt;
&lt;/ol&gt;
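
&lt;p&gt;The three checks above could be automated along the following lines. This is a minimal sketch rather than a tested tool. It assumes that ExifTool’s &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;-X&lt;/code&gt; output puts each group in its own XML namespace, with URIs of the form &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;http://ns.exiftool.org/EXIF/IFD0/1.0/&lt;/code&gt;, and that tags from a &lt;em&gt;SubIFD&lt;/em&gt; end up in a group whose name starts with &lt;em&gt;SubIFD&lt;/em&gt;. The function name &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;find_image_indicators&lt;/code&gt; is made up for this illustration.&lt;/p&gt;

```python
import re
import xml.etree.ElementTree as ET

# Assumed namespace layout, e.g. http://ns.exiftool.org/EXIF/IFD1/1.0/
GROUP_PATTERN = re.compile(r"/EXIF/((?:Sub)?IFD\d*)/")


def find_image_indicators(exiftool_xml):
    """Collect the groups that carry SubfileType or OldSubfileType tags,
    plus any SubIFD groups, from ExifTool -X output (str or bytes)."""
    root = ET.fromstring(exiftool_xml)
    hits = {"SubfileType": set(), "OldSubfileType": set(), "SubIFD": set()}
    for element in root.iter():
        if not element.tag.startswith("{"):
            continue  # element without a namespace, not an ExifTool tag
        uri, _, name = element.tag[1:].partition("}")
        match = GROUP_PATTERN.search(uri)
        if match is None:
            continue  # some other group, such as ExifIFD or System
        group = match.group(1)
        if group.startswith("SubIFD"):
            hits["SubIFD"].add(group)
        if name in hits and name != "SubIFD":
            hits[name].add(group)
    return hits
```

&lt;p&gt;Reading &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;source.xml&lt;/code&gt; in binary mode and passing its contents to this function returns one set of group names per check. A file would then be flagged for closer inspection if any of the three sets is non-empty.&lt;/p&gt;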

&lt;h3 id=&quot;imagemagick-identify&quot;&gt;ImageMagick identify&lt;/h3&gt;

&lt;p&gt;You can also identify the presence of multiple images with ImageMagick’s &lt;a href=&quot;https://imagemagick.org/script/identify.php&quot;&gt;&lt;em&gt;identify&lt;/em&gt;&lt;/a&gt; tool (via &lt;a href=&quot;https://preservation.tylerthorsted.com/2023/09/15/tiff/&quot;&gt;Tyler Thorsted&lt;/a&gt;):&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;identify &lt;span class=&quot;nt&quot;&gt;-quiet&lt;/span&gt; source.tif
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Result:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;source.tif[0] TIFF 9458x7429 9458x7429+0+0 8-bit sRGB 201.089MiB 0.000u 0:00.010
source.tif[1] TIFF 160x126 160x126+0+0 8-bit sRGB 0.000u 0:00.010
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;However, running this on the 11-page multi-format TIFF gave me the following error (with no meaningful information on the number of images):&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;identify-im6.q16: compression not supported `MultipleFormats.tif&apos; @ error/tiff.c/ReadTIFFImage/1433.
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;It’s also not clear to me how ImageMagick handles &lt;em&gt;SubIFD&lt;/em&gt; tags.&lt;/p&gt;

&lt;h3 id=&quot;jhove&quot;&gt;JHOVE&lt;/h3&gt;

&lt;p&gt;The output of &lt;a href=&quot;https://jhove.openpreservation.org/&quot;&gt;JHOVE&lt;/a&gt;’s TIFF module also provides information on subfiles. You can use the following command:&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;jhove &lt;span class=&quot;nt&quot;&gt;-m&lt;/span&gt; TIFF-hul &lt;span class=&quot;nt&quot;&gt;-i&lt;/span&gt; source.tif &lt;span class=&quot;nt&quot;&gt;-h&lt;/span&gt; xml &lt;span class=&quot;nt&quot;&gt;-o&lt;/span&gt; source-jhove.xml
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;JHOVE (version 1.28.0) reports each &lt;em&gt;NewSubfileType&lt;/em&gt; tag as:&lt;/p&gt;

&lt;div class=&quot;language-xml highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nt&quot;&gt;&amp;lt;property&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;name&amp;gt;&lt;/span&gt;NewSubfileType&lt;span class=&quot;nt&quot;&gt;&amp;lt;/name&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;values&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;arity=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;Scalar&quot;&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;type=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;Long&quot;&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;&amp;gt;&lt;/span&gt;
  &lt;span class=&quot;nt&quot;&gt;&amp;lt;value&amp;gt;&lt;/span&gt;0&lt;span class=&quot;nt&quot;&gt;&amp;lt;/value&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;/values&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;/property&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Unlike ExifTool, JHOVE also reports a &lt;em&gt;NewSubfileType&lt;/em&gt; property for the main image, which means the output always contains at least one &lt;em&gt;NewSubfileType&lt;/em&gt; property. So, TIFFs with subfiles can be singled out by the presence of more than one &lt;em&gt;NewSubfileType&lt;/em&gt; property in JHOVE’s output.&lt;/p&gt;
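
&lt;p&gt;As a rough illustration, this “more than one property” rule can be checked with a few lines of Python. The sketch below ignores XML namespaces on purpose, so it makes no assumption about the schema URI that a particular JHOVE version uses. The function names are made up for this example.&lt;/p&gt;

```python
import xml.etree.ElementTree as ET


def count_jhove_properties(jhove_xml, property_name="NewSubfileType"):
    """Count the property name elements in JHOVE XML output (str or
    bytes) whose text equals property_name."""
    root = ET.fromstring(jhove_xml)
    count = 0
    for element in root.iter():
        # Strip the namespace part, leaving only the local element name
        local_name = element.tag.rpartition("}")[2]
        if local_name == "name" and (element.text or "").strip() == property_name:
            count += 1
    return count


def has_subfiles(jhove_xml):
    """True if the output holds more than one NewSubfileType property."""
    return count_jhove_properties(jhove_xml) not in (0, 1)
```

&lt;p&gt;Passing &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;property_name=&quot;SubfileType&quot;&lt;/code&gt; to the first function counts occurrences of the deprecated tag instead.&lt;/p&gt;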

&lt;p&gt;I noticed that JHOVE’s &lt;em&gt;NewSubfileType&lt;/em&gt; output doesn’t always describe the role of each subfile. For example, for a TIFF with a subfile that is a thumbnail, JHOVE reports this (which is as expected):&lt;/p&gt;

&lt;div class=&quot;language-xml highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nt&quot;&gt;&amp;lt;property&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;name&amp;gt;&lt;/span&gt;NewSubfileType&lt;span class=&quot;nt&quot;&gt;&amp;lt;/name&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;values&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;arity=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;List&quot;&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;type=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;String&quot;&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;&amp;gt;&lt;/span&gt;
  &lt;span class=&quot;nt&quot;&gt;&amp;lt;value&amp;gt;&lt;/span&gt;reduced-resolution image of another image in this file&lt;span class=&quot;nt&quot;&gt;&amp;lt;/value&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;/values&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;/property&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;But for &lt;a href=&quot;https://www.leadtools.com/support/forum/resource.ashx?a=544&amp;amp;b=1&quot;&gt;this 6-page TIFF&lt;/a&gt;, the &lt;em&gt;NewSubfileType&lt;/em&gt; tag of each page is reported as:&lt;/p&gt;

&lt;div class=&quot;language-xml highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nt&quot;&gt;&amp;lt;property&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;name&amp;gt;&lt;/span&gt;NewSubfileType&lt;span class=&quot;nt&quot;&gt;&amp;lt;/name&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;values&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;arity=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;Scalar&quot;&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;type=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;Long&quot;&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;&amp;gt;&lt;/span&gt;
  &lt;span class=&quot;nt&quot;&gt;&amp;lt;value&amp;gt;&lt;/span&gt;0&lt;span class=&quot;nt&quot;&gt;&amp;lt;/value&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;/values&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;/property&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Unlike ExifTool, JHOVE’s output doesn’t provide any direct clue here that each subfile is a full-resolution image. Oddly, for &lt;a href=&quot;https://www.leadtools.com/support/forum/resource.ashx?a=545&amp;amp;b=1&quot;&gt;this 11-page multi-format TIFF&lt;/a&gt;, JHOVE &lt;em&gt;does&lt;/em&gt; report that each subfile is a single page:&lt;/p&gt;

&lt;div class=&quot;language-xml highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nt&quot;&gt;&amp;lt;property&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;name&amp;gt;&lt;/span&gt;NewSubfileType&lt;span class=&quot;nt&quot;&gt;&amp;lt;/name&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;values&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;arity=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;List&quot;&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;type=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;String&quot;&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;&amp;gt;&lt;/span&gt;
  &lt;span class=&quot;nt&quot;&gt;&amp;lt;value&amp;gt;&lt;/span&gt;single page of multi-page image&lt;span class=&quot;nt&quot;&gt;&amp;lt;/value&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;/values&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;/property&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Further perusal reveals that JHOVE’s output on compression is also incomplete for this file: for 3 pages, it reports an “unknown” compression scheme:&lt;/p&gt;

&lt;div class=&quot;language-xml highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nt&quot;&gt;&amp;lt;mix:Compression&amp;gt;&lt;/span&gt;
 &lt;span class=&quot;nt&quot;&gt;&amp;lt;mix:compressionScheme&amp;gt;&lt;/span&gt;Unknown&lt;span class=&quot;nt&quot;&gt;&amp;lt;/mix:compressionScheme&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;/mix:Compression&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This happens for pages with JBIG, JPEG and JPEG 2000 compression.&lt;/p&gt;

&lt;p&gt;Again, if you’re working with very old TIFFs, you might also want to check JHOVE’s output for the presence of &lt;em&gt;SubfileType&lt;/em&gt; properties, which represent the deprecated TIFF tag discussed in the ExifTool section.&lt;/p&gt;

&lt;p&gt;JHOVE’s &lt;a href=&quot;https://jhove.openpreservation.org/modules/tiff/&quot;&gt;documentation&lt;/a&gt; doesn’t mention the &lt;em&gt;SubIFD&lt;/em&gt; tag, so my best guess is this isn’t supported.&lt;/p&gt;

&lt;h2 id=&quot;conclusion&quot;&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;The presence of multiple images in TIFF files, either as thumbnails or pages in a multi-page document, can result in unexpected software behaviour. Because of this, it is important to identify them in any TIFF-based preservation workflow. Based on these tests with ExifTool, ImageMagick and JHOVE, it appears that ExifTool is the most reliable, robust and complete tool for identifying multi-image TIFFs. ExifTool was also the only tool that was able to correctly identify the compression scheme of each image in a particularly challenging 11-page TIFF with multiple compression schemes.&lt;/p&gt;

&lt;h2 id=&quot;further-resources&quot;&gt;Further resources&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;https://preservation.tylerthorsted.com/2023/09/15/tiff/&quot;&gt;Blog post on TIFF by Tyler Thorsted&lt;/a&gt; (discusses TIFF preservation risks, including multi-image files)&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://www.leadtools.com/support/forum/posts/t10960-&quot;&gt;Multipage TIFF sample files from Leadtools&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://web.archive.org/web/20180810205359/https://www.adobe.io/content/udp/en/open/standards/TIFF/_jcr_content/contentbody/download/file.res/TIFF6.pdf&quot;&gt;TIFF 6.0 specification&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://www.awaresystems.be/imaging/tiff/specification/TIFFPM6.pdf&quot;&gt;Adobe PageMaker 6.0 TIFF Technical Notes&lt;/a&gt; (defines &lt;em&gt;SubIFDs&lt;/em&gt;)&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://github.com/KBNLresearch/jp2totiff/blob/master/pixelCheck-im.sh&quot;&gt;Test script (bash)&lt;/a&gt; that performs a pixel comparison on two directory trees with TIFF images.&lt;/li&gt;
&lt;/ul&gt;

&lt;div class=&quot;footnotes&quot; role=&quot;doc-endnotes&quot;&gt;
  &lt;ol&gt;
    &lt;li id=&quot;fn:1&quot;&gt;
      &lt;p&gt;You might wonder why I didn’t do the comparison directly between the source TIFF and the JP2, which would eliminate step 2. The reason is that ImageMagick builds don’t include JPEG 2000 support by default. The only way to make this work is to build ImageMagick from source. This requires that all the necessary development libraries are installed, and that ImageMagick’s build configuration is set up to include JPEG 2000 support. Even though I’ve managed to make this work in the past, the procedure is quite time-consuming, and I didn’t want to go through it right now. So, for this test I used the round-trip conversion to TIFF as a workaround. &lt;a href=&quot;#fnref:1&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:2&quot;&gt;
      &lt;p&gt;An overview of all available &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;-metric&lt;/code&gt; values can be found &lt;a href=&quot;https://imagemagick.org/script/command-line-options.php#metric&quot;&gt;here&lt;/a&gt;. &lt;a href=&quot;#fnref:2&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:3&quot;&gt;
      &lt;p&gt;PSNR approaches infinity if both inputs are identical, which is exactly what you would expect for lossless compression. &lt;a href=&quot;#fnref:3&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:4&quot;&gt;
      &lt;p&gt;This is mainly because these were created according to very specific guidelines. &lt;a href=&quot;#fnref:4&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
  &lt;/ol&gt;
&lt;/div&gt;
</description>
                <link>https://bitsgalore.org/2024/03/11/multi-image-tiffs-subfiles-and-image-file-directories</link>
                <guid>https://bitsgalore.org/2024/03/11/multi-image-tiffs-subfiles-and-image-file-directories</guid>
                <pubDate>2024-03-11T00:00:00+01:00</pubDate>
        </item>

        <item>
                <title>VeraPDF parse status as a proxy for PDF rendering&#58; experiments with the Synthetic PDF Testset</title>
                <description>&lt;figure class=&quot;image&quot;&gt;
  &lt;img src=&quot;/images/2023/06/barnum-bailey.jpg&quot; alt=&quot;Vintage lithograph circus poster that shows a circus ring. In the front is a woman in a red dress, standing on horseback. Behind her there are more horses, with a variety of circus artists, including acrobats and jugglers, performing on horseback as well. In the background acrobats are walking on a tightrope.&quot; /&gt;
  &lt;figcaption&gt;&lt;a href=&quot;https://www.flickr.com/photos/boston_public_library/6554392117/in/album-72157629549177588/&quot;&gt;&quot;The Barnum &amp;amp; Bailey greatest show on earth&quot;&lt;/a&gt;. Used under &lt;a href=&quot;https://creativecommons.org/licenses/by/2.0/&quot;&gt;CC BY 2.0&lt;/a&gt;, via Boston Public Library.&lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;Last month I &lt;a href=&quot;/2023/05/25/identification-of-pdf-preservation-risks-with-verapdf-and-jhove&quot;&gt;wrote this post&lt;/a&gt;, which addresses the use of &lt;a href=&quot;https://jhove.openpreservation.org/&quot;&gt;JHOVE&lt;/a&gt; and &lt;a href=&quot;https://verapdf.org/&quot;&gt;VeraPDF&lt;/a&gt; for identifying preservation risks in PDF files. In the concluding section I suggested that VeraPDF’s parse status might be used as a rough “validity proxy” to identify malformed PDFs. But does VeraPDF’s parse status actually have any predictive value for rendering? And how does this compare to what JHOVE tells us? This post is a first attempt at answering these questions, using data from the &lt;a href=&quot;https://www.radar-service.eu/radar/en/dataset/JtlOdwQquZWDqQdq&quot;&gt;Synthetic PDF Testset for File Format Validation&lt;/a&gt; by Lindlar, Tunnat and Wilson.&lt;/p&gt;

&lt;!-- more --&gt;

&lt;h2 id=&quot;goal-and-objectives-of-this-post&quot;&gt;Goal and objectives of this post&lt;/h2&gt;

&lt;p&gt;The main goal of this post is to get more insight into the associations between VeraPDF parse errors and rendering behaviour, and to see if the occurrence of these parse errors can be used as a rough “validity proxy” to identify malformed PDFs. For this I propose a tentative methodology, and apply this to a previously published annotated dataset of synthetic PDF files. I then try to use the results to answer the following questions:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;To what extent is VeraPDF’s reported parse status associated with rendering behaviour in Adobe Acrobat?&lt;/li&gt;
  &lt;li&gt;How does this compare against the occurrence of JHOVE validation errors?&lt;/li&gt;
  &lt;li&gt;What are the implications of the answers to 1. and 2. for preservation workflows?&lt;/li&gt;
&lt;/ol&gt;

&lt;h2 id=&quot;synthetic-pdf-testset-for-file-format-validation&quot;&gt;Synthetic PDF Testset for File Format Validation&lt;/h2&gt;

&lt;p&gt;In 2017, Lindlar, Tunnat and Wilson published the &lt;a href=&quot;https://www.radar-service.eu/radar/en/dataset/JtlOdwQquZWDqQdq&quot;&gt;Synthetic PDF Testset for File Format Validation&lt;/a&gt;. It is a corpus of 88 small, hand-crafted PDFs that each violate the PDF format specification in a different way. The focus of the dataset is on “basic structure violations at the header, trailer, document catalog, page tree node and cross-reference levels”, and it also includes “violations at the page node, page resource and stream object level”. The dataset comes with a descriptive spreadsheet, which includes a column with rendering behaviour of each file in Adobe Acrobat Professional XI Pro (11.0.15). The dataset was specifically created for testing the validation functionality of JHOVE’s PDF module. A detailed discussion of this dataset can be found in the authors’ &lt;a href=&quot;https://phaidra.univie.ac.at/detail/o:931074&quot;&gt;2017 iPRES paper&lt;/a&gt;.&lt;/p&gt;

&lt;h2 id=&quot;running-jhove-and-verapdf&quot;&gt;Running JHOVE and VeraPDF&lt;/h2&gt;

&lt;p&gt;As a first test, I ran all PDFs in the dataset through JHOVE (v 1.28.0) and VeraPDF (v 1.22.3). For JHOVE I used the following generic command line arguments:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;jhove -m PDF-hul -h XML -i whatever.pdf -o whatever-jhove.xml
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;And for VeraPDF:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;verapdf --off --addlogs --extract whatever.pdf &amp;gt; whatever-vera.xml
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Here, the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;--off&lt;/code&gt; switch disables (PDF/A or PDF/UA) validation, and the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;--addlogs&lt;/code&gt; switch ensures that information about any run-time warnings is included in the output. The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;--extract&lt;/code&gt; switch enables feature extraction&lt;sup id=&quot;fnref:1&quot;&gt;&lt;a href=&quot;#fn:1&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;1&lt;/a&gt;&lt;/sup&gt;. The raw JHOVE and VeraPDF output files are available &lt;a href=&quot;https://github.com/KBNLresearch/pdf-characterisation/tree/main/output/lindlar-tunnat-wilson&quot;&gt;here&lt;/a&gt;. I created &lt;a href=&quot;https://github.com/KBNLresearch/pdf-characterisation/blob/main/scripts/jhove-verapdf-validation-run.py&quot;&gt;this script&lt;/a&gt;, which runs JHOVE and VeraPDF for all files in the dataset, and creates a &lt;a href=&quot;https://github.com/KBNLresearch/pdf-characterisation/blob/main/output/lindlar-tunnat-wilson/data.csv&quot;&gt;comma-delimited file&lt;/a&gt; with the following output metrics for each PDF:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;The JHOVE validation status&lt;/li&gt;
  &lt;li&gt;A Boolean flag that is &lt;em&gt;True&lt;/em&gt; if VeraPDF reported one or more parse errors, and &lt;em&gt;False&lt;/em&gt; otherwise&lt;/li&gt;
  &lt;li&gt;A Boolean flag that is &lt;em&gt;True&lt;/em&gt; if VeraPDF reported one or more warnings, and &lt;em&gt;False&lt;/em&gt; otherwise&lt;/li&gt;
&lt;/ol&gt;

&lt;h2 id=&quot;jhove-validation-status-vs-verapdf-parse-errors-and-warnings&quot;&gt;JHOVE validation status vs VeraPDF parse errors and warnings&lt;/h2&gt;

&lt;p&gt;First, let’s look at the degree to which the JHOVE and VeraPDF results are interrelated. For this, I created two contingency tables that cross-tabulate the JHOVE and VeraPDF results. Here are the JHOVE validation status outcomes&lt;sup id=&quot;fnref:2&quot;&gt;&lt;a href=&quot;#fn:2&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;2&lt;/a&gt;&lt;/sup&gt; versus the VeraPDF parse error flags:&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th style=&quot;text-align: left&quot;&gt;JHOVE status&lt;/th&gt;
      &lt;th style=&quot;text-align: right&quot;&gt;VeraPDF parse errors&lt;/th&gt;
      &lt;th style=&quot;text-align: right&quot;&gt;No VeraPDF parse errors&lt;/th&gt;
      &lt;th style=&quot;text-align: right&quot;&gt;All&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Not well-formed&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;62&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;7&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;69&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Well-Formed, but not valid&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;2&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;4&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;6&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Well-Formed and valid&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;4&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;9&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;13&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;All&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;68&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;20&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;88&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;Most PDFs that JHOVE considers “Not well-formed” also result in VeraPDF parse errors, which isn’t unexpected. Interestingly, out of the 13 PDFs that JHOVE considers “Well-Formed and valid”, 4 nevertheless result in VeraPDF parse errors. Most of the PDFs that JHOVE considers “Well-Formed, but not valid” don’t result in VeraPDF parse errors.&lt;/p&gt;

&lt;p&gt;Here’s the contingency table of JHOVE validation status outcomes versus the VeraPDF warning flags:&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th style=&quot;text-align: left&quot;&gt;JHOVE status&lt;/th&gt;
      &lt;th style=&quot;text-align: right&quot;&gt;VeraPDF warnings&lt;/th&gt;
      &lt;th style=&quot;text-align: right&quot;&gt;No VeraPDF warnings&lt;/th&gt;
      &lt;th style=&quot;text-align: right&quot;&gt;All&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Not well-formed&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;62&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;7&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;69&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Well-Formed, but not valid&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;2&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;4&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;6&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Well-Formed and valid&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;5&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;8&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;13&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;All&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;69&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;19&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;88&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;The results are almost identical to those of the VeraPDF parse errors. The only difference is that out of the PDFs that are “Well-Formed and valid” according to JHOVE, 5 resulted in VeraPDF warnings, whereas only 4 resulted in VeraPDF parse errors.&lt;/p&gt;

&lt;p&gt;We can also use statistical measures to characterise the association between the JHOVE and VeraPDF metrics. A suitable measure for nominal-scale variables is &lt;a href=&quot;https://en.wikipedia.org/wiki/Cram%C3%A9r%27s_V&quot;&gt;Cramér’s &lt;em&gt;V&lt;/em&gt;&lt;/a&gt;. Cramér’s &lt;em&gt;V&lt;/em&gt; can have any value from 0 to 1, where 0 indicates no association between two variables, and 1 a complete association. The following table shows the calculated Cramér’s V&lt;sup id=&quot;fnref:3&quot;&gt;&lt;a href=&quot;#fn:3&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;3&lt;/a&gt;&lt;/sup&gt; values between JHOVE status and VeraPDF parse errors, and JHOVE status and VeraPDF warnings, respectively:&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th style=&quot;text-align: left&quot;&gt;Metrics&lt;/th&gt;
      &lt;th style=&quot;text-align: right&quot;&gt;Cramér’s V&lt;/th&gt;
      &lt;th style=&quot;text-align: right&quot;&gt;p&lt;/th&gt;
      &lt;th style=&quot;text-align: right&quot;&gt;df&lt;/th&gt;
      &lt;th style=&quot;text-align: right&quot;&gt;N&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;JHOVE status vs VeraPDF parse errors&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;0.56&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;&amp;lt;.001&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;2&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;88&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;JHOVE status vs VeraPDF warnings&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;0.51&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;&amp;lt;.001&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;2&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;88&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;The &lt;em&gt;p&lt;/em&gt;-values were calculated from a &lt;a href=&quot;https://en.wikipedia.org/wiki/Chi-squared_test&quot;&gt;Chi-square test&lt;/a&gt; of independence between the JHOVE and VeraPDF metrics. Put simply, for each metric pair, they denote the probability of observing metric combinations at least as extreme as these under the null hypothesis that the metrics are independent. Since both &lt;em&gt;p&lt;/em&gt;-values are very small, this supports rejecting the null hypothesis, and suggests that both metrics are in fact related. The Cramér’s &lt;em&gt;V&lt;/em&gt; values express the strength of this association, confirming the relatively strong association between JHOVE’s validation status and the occurrence of VeraPDF parse errors and warnings. The &lt;em&gt;df&lt;/em&gt; column shows the degrees of freedom of the Chi-square distribution.&lt;/p&gt;
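
&lt;p&gt;For readers who want to retrace the numbers, the following sketch computes Cramér’s &lt;em&gt;V&lt;/em&gt; directly from the two contingency tables shown earlier, using only the Python standard library. It is not the code that produced the table above. The optional bias correction follows the small-sample correction proposed by Bergsma (2013); whether that correction was used for the reported values is an assumption.&lt;/p&gt;

```python
import math


def cramers_v(table, bias_correction=False):
    """Cramér's V for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    # Pearson chi-square: sum of (observed - expected)^2 / expected
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    phi2 = chi2 / n
    r, k = len(row_totals), len(col_totals)
    if bias_correction:
        # Shrink phi2 and the table dimensions for small samples
        phi2 = max(0.0, phi2 - (k - 1) * (r - 1) / (n - 1))
        r = r - (r - 1) ** 2 / (n - 1)
        k = k - (k - 1) ** 2 / (n - 1)
    return math.sqrt(phi2 / min(r - 1, k - 1))


# Rows: Not well-formed / Well-Formed, but not valid / Well-Formed and valid
jhove_vs_parse_errors = [[62, 7], [2, 4], [4, 9]]
jhove_vs_warnings = [[62, 7], [2, 4], [5, 8]]

for table in (jhove_vs_parse_errors, jhove_vs_warnings):
    print(round(cramers_v(table), 2),
          round(cramers_v(table, bias_correction=True), 2))
```

&lt;p&gt;With the bias correction, this gives 0.56 and 0.51, which agrees with the reported values. Without it, the results are slightly higher (0.57 and 0.53).&lt;/p&gt;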

&lt;h2 id=&quot;simplifying-the-ground-truth&quot;&gt;Simplifying the ground truth&lt;/h2&gt;

&lt;p&gt;Since the spreadsheet that is part of the PDF Testset includes information on rendering behaviour in Acrobat, we can use this as “ground truth” to test to what degree the JHOVE and VeraPDF results are indicative of rendering problems. The rendering outcomes in the spreadsheet&lt;sup id=&quot;fnref:4&quot;&gt;&lt;a href=&quot;#fn:4&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;4&lt;/a&gt;&lt;/sup&gt; are not based on any controlled vocabulary. For about half of the files, rendering behaviour is described as a simple “Yes” or “No”. For the remaining files, it is described in more elaborate terms, such as:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;“Yes, but missing content”&lt;/li&gt;
  &lt;li&gt;“Yes. Wrong font”&lt;/li&gt;
  &lt;li&gt;“Yes. Adobe tries to change something when opening”.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This multitude of very specific (and often unique) values makes it hard to do any meaningful (quantitative) analysis. Because of this, I simplified the rendering outcomes into 3 discrete classes:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Does not render&lt;/li&gt;
  &lt;li&gt;Renders with issues&lt;/li&gt;
  &lt;li&gt;Renders normally&lt;/li&gt;
&lt;/ul&gt;
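
&lt;p&gt;To give an idea of what such a simplification could look like in code, here is a minimal Python sketch. The function name and the matching rules are assumptions made purely for illustration, and they are not necessarily the rules that were used to produce the simplified outcomes in this post. In practice, some descriptions will most likely need a manual check.&lt;/p&gt;

```python
def simplify_rendering_outcome(description):
    """Map a free-text rendering description to one of three classes.

    The rules below are illustrative assumptions, not the mapping
    that was actually used for the analysis.
    """
    text = description.strip().lower()
    if text.startswith("no"):
        return "Does not render"
    if text == "yes":
        return "Renders normally"
    if text.startswith("yes"):
        # "Yes" followed by a qualification, e.g. "Yes, but missing content"
        return "Renders with issues"
    raise ValueError("Unrecognised rendering description: " + repr(description))


if __name__ == "__main__":
    for value in ["Yes", "No", "Yes, but missing content", "Yes. Wrong font"]:
        print(value, "=", simplify_rendering_outcome(value))
```

&lt;p&gt;Raising an error for anything that doesn’t match forces a decision on ambiguous descriptions, instead of silently assigning them to a class.&lt;/p&gt;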

&lt;p&gt;&lt;a href=&quot;https://github.com/KBNLresearch/pdf-characterisation/blob/main/misc/lindlar-tunnat-wilson/lindlar-tunnat-wilson-jhove-vera-rendering.csv&quot;&gt;This file&lt;/a&gt; contains, for all PDFs in the dataset, the simplified rendering outcomes, along with the JHOVE and VeraPDF results.&lt;/p&gt;

&lt;h2 id=&quot;jhove-and-verapdf-results-vs-rendering-behaviour&quot;&gt;JHOVE and VeraPDF results vs rendering behaviour&lt;/h2&gt;

&lt;p&gt;As a first step towards exploring any associations between the JHOVE and VeraPDF metrics and rendering behaviour, I used this file to create some more contingency tables. The first one shows the joint frequencies of rendering outcomes against JHOVE validation status:&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th style=&quot;text-align: left&quot;&gt;Rendering \ JHOVE status&lt;/th&gt;
      &lt;th style=&quot;text-align: right&quot;&gt;Not well-formed&lt;/th&gt;
      &lt;th style=&quot;text-align: right&quot;&gt;Well-Formed, but not valid&lt;/th&gt;
      &lt;th style=&quot;text-align: right&quot;&gt;Well-Formed and valid&lt;/th&gt;
      &lt;th style=&quot;text-align: right&quot;&gt;All&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Does not render&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;31&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;1&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;0&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;32&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Renders with issues&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;27&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;2&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;9&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;38&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Renders normally&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;11&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;3&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;4&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;18&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;All&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;69&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;6&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;13&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;88&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;Out of the 32 files in the “Does not render” category, JHOVE considers 31 as “Not well-formed”, and 1 as “Well-Formed, but not valid”. JHOVE was less successful at identifying files in the “Renders with issues” category. Out of the 38 files, 9 (24%) are “Well-Formed and valid”. One striking result is that from the 18 files in the “Renders normally” category, the majority (11, or 61%) are “Not well-formed” according to JHOVE.&lt;/p&gt;
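
&lt;p&gt;The percentages mentioned here are simply row percentages of the contingency table. As a small sketch (the variable and function names are hypothetical, and the counts are copied by hand from the table above), they can be derived like this:&lt;/p&gt;

```python
# Joint frequencies copied from the table above (rendering outcome vs JHOVE status)
JHOVE_STATUSES = [
    "Not well-formed",
    "Well-Formed, but not valid",
    "Well-Formed and valid",
]
COUNTS = {
    "Does not render": [31, 1, 0],
    "Renders with issues": [27, 2, 9],
    "Renders normally": [11, 3, 4],
}


def row_percentages(counts):
    """Return, for each row, every cell as a rounded percentage of the row total."""
    result = {}
    for label, row in counts.items():
        total = sum(row)
        result[label] = [round(100 * value / total) for value in row]
    return result


if __name__ == "__main__":
    for label, percentages in row_percentages(COUNTS).items():
        print(label, dict(zip(JHOVE_STATUSES, percentages)))
```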

&lt;p&gt;Here’s the table for the VeraPDF parse errors:&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th style=&quot;text-align: left&quot;&gt;Rendering \ VeraPDF status&lt;/th&gt;
      &lt;th style=&quot;text-align: right&quot;&gt;Parse errors&lt;/th&gt;
      &lt;th style=&quot;text-align: right&quot;&gt;No parse errors&lt;/th&gt;
      &lt;th style=&quot;text-align: right&quot;&gt;All&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Does not render&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;30&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;2&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;32&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Renders with issues&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;31&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;7&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;38&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Renders normally&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;7&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;11&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;18&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;All&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;68&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;20&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;88&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;VeraPDF’s parse errors output is slightly less effective at identifying PDFs in the “Does not render” category: 30 of these result in parse errors, and 2 in no parse errors. From the 38 files in the “Renders with issues” category, 7 (18%) don’t result in parse errors. Interestingly, 61% of the files in the “Renders normally” category (11 out of 18) do not result in parse errors. This is a marked contrast with JHOVE, which shows the opposite behaviour by considering 61% of these files “Not well-formed”.&lt;/p&gt;

&lt;p&gt;And finally here’s the table for VeraPDF’s warnings:&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th style=&quot;text-align: left&quot;&gt;Rendering \ VeraPDF status&lt;/th&gt;
      &lt;th style=&quot;text-align: right&quot;&gt;Warnings&lt;/th&gt;
      &lt;th style=&quot;text-align: right&quot;&gt;No warnings&lt;/th&gt;
      &lt;th style=&quot;text-align: right&quot;&gt;All&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Does not render&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;30&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;2&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;32&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Renders with issues&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;31&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;7&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;38&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Renders normally&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;8&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;10&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;18&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;All&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;69&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;19&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;88&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;The pattern here is nearly identical to the parse errors&lt;sup id=&quot;fnref:6&quot;&gt;&lt;a href=&quot;#fn:6&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;5&lt;/a&gt;&lt;/sup&gt;.&lt;/p&gt;

&lt;p&gt;As before, we can use the Cramér’s &lt;em&gt;V&lt;/em&gt; statistic to express the strength of the associations between the tool metrics and the rendering results:&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th style=&quot;text-align: left&quot;&gt;Tool metric&lt;/th&gt;
      &lt;th style=&quot;text-align: right&quot;&gt;Cramér’s V&lt;/th&gt;
      &lt;th style=&quot;text-align: right&quot;&gt;p&lt;/th&gt;
      &lt;th style=&quot;text-align: right&quot;&gt;df&lt;/th&gt;
      &lt;th style=&quot;text-align: right&quot;&gt;N&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;JHOVE status&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;0.23&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;.011&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;4&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;88&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;VeraPDF parse errors&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;0.46&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;&amp;lt;.001&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;2&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;88&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;VeraPDF warnings&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;0.41&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;&amp;lt;.001&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;2&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;88&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;The low &lt;em&gt;p&lt;/em&gt;-values (again from a Chi-square test) indicate that all tool metrics are related to the rendering results. The strength of these associations is expressed through Cramér’s &lt;em&gt;V&lt;/em&gt;. From the table we see that both VeraPDF metrics show a stronger association with rendering behaviour, compared to JHOVE. This pattern persists (although to a slightly lesser degree) if we adjust for the fact that JHOVE’s validation status can have 3 possible values, whereas the VeraPDF metrics are simply binary flags&lt;sup id=&quot;fnref:5&quot;&gt;&lt;a href=&quot;#fn:5&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;6&lt;/a&gt;&lt;/sup&gt;. The large &lt;em&gt;V&lt;/em&gt; value for the VeraPDF parse errors metric is mostly caused by the large overlap between the files in the “No parse errors” and “Renders normally” categories.&lt;/p&gt;
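
&lt;p&gt;For readers who want to see how these numbers come about, the calculation can be sketched in plain Python. This is a minimal sketch, and not the actual analysis script: the function names are made up for illustration, and the counts are copied by hand from the three contingency tables above. It applies Bergsma’s bias-corrected version of Cramér’s &lt;em&gt;V&lt;/em&gt;, and for the &lt;em&gt;p&lt;/em&gt;-value it uses a closed-form expression that only holds for an even number of degrees of freedom (which happens to be the case for all three tables).&lt;/p&gt;

```python
import math


def chi_square(table):
    """Return (chi-square statistic, degrees of freedom) for a table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if rows and columns were independent
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return chi2, df


def p_value_even_df(chi2, df):
    """Upper-tail probability of the chi-square distribution (even df only)."""
    if df % 2 != 0:
        raise ValueError("the closed form used here only holds for even df")
    half = chi2 / 2
    terms = sum(half ** i / math.factorial(i) for i in range(df // 2))
    return math.exp(-half) * terms


def cramers_v_corrected(table):
    """Bias-corrected Cramér's V (Bergsma) for a table of counts."""
    chi2, _ = chi_square(table)
    n = sum(sum(row) for row in table)
    r, k = len(table), len(table[0])
    phi2_corr = max(0.0, chi2 / n - (k - 1) * (r - 1) / (n - 1))
    r_corr = r - (r - 1) ** 2 / (n - 1)
    k_corr = k - (k - 1) ** 2 / (n - 1)
    return math.sqrt(phi2_corr / min(k_corr - 1, r_corr - 1))


# Rows: does not render, renders with issues, renders normally
TABLES = {
    "JHOVE status": [[31, 1, 0], [27, 2, 9], [11, 3, 4]],
    "VeraPDF parse errors": [[30, 2], [31, 7], [7, 11]],
    "VeraPDF warnings": [[30, 2], [31, 7], [8, 10]],
}

if __name__ == "__main__":
    for name, table in TABLES.items():
        chi2, df = chi_square(table)
        v = cramers_v_corrected(table)
        p = p_value_even_df(chi2, df)
        print(name, "V =", round(v, 2), "p =", round(p, 4), "df =", df)
```

&lt;p&gt;Rounded to two decimals, the &lt;em&gt;V&lt;/em&gt; values from this sketch come out at 0.23, 0.46 and 0.41, in line with the table above.&lt;/p&gt;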

&lt;h2 id=&quot;interpretation-and-conclusions&quot;&gt;Interpretation and conclusions&lt;/h2&gt;

&lt;h3 id=&quot;verapdf-vs-jhove&quot;&gt;VeraPDF vs JHOVE&lt;/h3&gt;

&lt;p&gt;Given the higher Cramér’s &lt;em&gt;V&lt;/em&gt; values, it would be tempting to conclude that VeraPDF is the “better” tool here compared to JHOVE. However, such a conclusion would be overly simplistic. The higher &lt;em&gt;V&lt;/em&gt; values are primarily the result of the association between the “Renders normally” rendering class and VeraPDF’s “No parse errors” class&lt;sup id=&quot;fnref:7&quot;&gt;&lt;a href=&quot;#fn:7&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;7&lt;/a&gt;&lt;/sup&gt;. By contrast, the majority of these “Renders normally” PDFs fall into JHOVE’s “Not well-formed” class. This does not indicate any shortcoming on JHOVE’s part. Instead, it simply suggests that JHOVE is more sensitive to (minor) deviations from the PDF specification that don’t have a direct impact on rendering. This is exactly what we would expect if the rendering (viewer) application follows &lt;a href=&quot;https://blog.dshr.org/2009/01/postels-law.html&quot;&gt;Postel’s Law&lt;/a&gt; (also known as the robustness principle), which states:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;be conservative in what you do, be liberal in what you accept from others&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Unlike most rendering applications (which are designed to be forgiving when it comes to small deviations from the specifications), JHOVE is meant to be picky. The primary purpose of VeraPDF’s parser is to provide the information needed to check conformance against PDF profiles (such as PDF/A and PDF/UA), and to extract technical features. It was &lt;em&gt;not&lt;/em&gt; designed with JHOVE-style low-level (PDF 1.x … PDF 2.0) format validation in mind. Therefore, its behaviour may be more similar to the parsers that are used by rendering applications.&lt;/p&gt;

&lt;h3 id=&quot;implications-for-digital-preservation-workflows&quot;&gt;Implications for digital preservation workflows&lt;/h3&gt;

&lt;p&gt;Whether this behaviour is desirable within a digital preservation workflow highly depends on the specific purpose for which we want to “validate”. If the objective is to identify PDFs that are seriously malformed, VeraPDF’s parse status provides a simple and most likely sufficient indicator. JHOVE’s validation output gives more detail, but its interpretation can be difficult. Also, since JHOVE hasn’t yet been brought up to date with either &lt;a href=&quot;https://jhove.openpreservation.org/modules/pdf/&quot;&gt;PDF 1.7&lt;/a&gt; or PDF 2.0, JHOVE’s results for those PDF versions may be misleading.&lt;/p&gt;

&lt;p&gt;With that said, Postel’s law also implies that more subtle rendering issues may go unnoticed. Since the PDF standard is so feature-rich, PDF rendering software is typically designed to ignore any unknown features in a file. This can result in incomplete rendering, without any errors being reported in the process. Since VeraPDF’s parser appears to behave similarly to the parsers used by rendering applications, we should expect that subtle deviations from the standards may not result in any VeraPDF parse errors.&lt;/p&gt;

&lt;h2 id=&quot;limitations&quot;&gt;Limitations&lt;/h2&gt;

&lt;p&gt;The above analysis is a first attempt at exploring the association between VeraPDF’s parse errors and rendering behaviour. It’s important to keep in mind the following limitations:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;It is based exclusively on small, synthetic test files. It’s unclear how these results translate to ordinary PDFs “in the wild”, which are typically much more complex.&lt;/li&gt;
  &lt;li&gt;The ground truth (rendering results) was taken at face value from the 2017 data set, which was based on a version of Adobe Acrobat that is now outdated. Results may be different for more recent versions, or other rendering applications.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;jhove-verapdf-and-the-future-of-pdf-validation&quot;&gt;JHOVE, VeraPDF and the future of PDF validation&lt;/h2&gt;

&lt;p&gt;While working on this write-up, the Open Preservation Foundation and the PDF Association &lt;a href=&quot;https://openpreservation.org/news/development-preview-pdf-file-checker-based-on-the-arlington-pdf-model/&quot;&gt;released&lt;/a&gt; a first development preview of a new veraPDF-powered PDF-checker that is based on the &lt;a href=&quot;https://github.com/pdf-association/arlington-pdf-model&quot;&gt;Arlington PDF model&lt;/a&gt;. This software is capable of analyzing PDF files against the full PDF 2.0 (and earlier) specifications. An earlier draft of this post included an overview of the current PDF validation software landscape, with some reflections on where things may be heading. However, I ultimately left this out, as it would make this post too unwieldy. I will most likely address this in another follow-up post.&lt;/p&gt;

&lt;h2 id=&quot;feedback-welcome&quot;&gt;Feedback welcome&lt;/h2&gt;

&lt;p&gt;As always, feedback to this post is highly appreciated. This includes (but is not limited to):&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;The idea of using VeraPDF parse status as a rough proxy for PDF validity.&lt;/li&gt;
  &lt;li&gt;Comments, suggestions or criticism related to the methodology I used for analyzing the Synthetic PDF Testset.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;further-resources&quot;&gt;Further resources&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://github.com/KBNLresearch/pdf-characterisation&quot;&gt;Github repo with analysis scripts and raw data&lt;/a&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://github.com/KBNLresearch/pdf-characterisation/blob/main/scripts/jhove-verapdf-validation-run.py&quot;&gt;Script for running JHOVE and VeraPDF&lt;/a&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://github.com/KBNLresearch/pdf-characterisation/blob/main/scripts/jhove-verapdf-validation-analyze.py&quot;&gt;Analysis script (contingency tables + Cramér’s V calculation)&lt;/a&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://www.radar-service.eu/radar/en/dataset/JtlOdwQquZWDqQdq&quot;&gt;Synthetic PDF Testset for File Format Validation&lt;/a&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://phaidra.univie.ac.at/detail/o:931074&quot;&gt;A PDF Test-Set for Well-Formedness Validation in JHOVE - The Good, the Bad and the Ugly&lt;/a&gt; - 2017 iPRES paper by Lindlar, Tunnat &amp;amp; Wilson&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://doi.org/10.2218/ijdc.v12i2.578&quot;&gt;How Valid is your Validation? A Closer Look Behind the Curtain of JHOVE&lt;/a&gt; - 2017 International Journal of Digital Curation paper by Lindlar &amp;amp; Tunnat.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;div class=&quot;footnotes&quot; role=&quot;doc-endnotes&quot;&gt;
  &lt;ol&gt;
    &lt;li id=&quot;fn:1&quot;&gt;
      &lt;p&gt;Actually, feature extraction is not really needed for the purpose of this analysis. &lt;a href=&quot;#fnref:1&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:2&quot;&gt;
      &lt;p&gt;One PDF caused an exception in JHOVE, which resulted in the “unknown” validation status. For convenience I recoded this to “Not well-formed” for all analyses here. &lt;a href=&quot;#fnref:2&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:3&quot;&gt;
      &lt;p&gt;Using &lt;a href=&quot;https://doi.org/10.1016%2Fj.jkss.2012.10.002&quot;&gt;Bergsma’s bias-corrected version&lt;/a&gt; (&lt;a href=&quot;https://stats.lse.ac.uk/bergsma/pdf/cramerV3.pdf&quot;&gt;non-paywalled copy&lt;/a&gt;), following the implementation given &lt;a href=&quot;https://stackoverflow.com/questions/20892799/using-pandas-calculate-cram%c3%a9rs-coefficient-matrix/39266194#39266194&quot;&gt;here&lt;/a&gt;. &lt;a href=&quot;#fnref:3&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:4&quot;&gt;
      &lt;p&gt;Column “Adobe Professional XI Pro (11.0.15) - can file be opened?”. &lt;a href=&quot;#fnref:4&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:6&quot;&gt;
      &lt;p&gt;The only difference is one file in the “Renders normally” category that resulted in a warning, without resulting in any parse errors. &lt;a href=&quot;#fnref:6&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:5&quot;&gt;
      &lt;p&gt;As a test I re-calculated the statistics after lumping JHOVE’s “Well-Formed, but not valid” and “Not well-formed” classes. This resulted in a slightly higher value of &lt;em&gt;V&lt;/em&gt; (0.28). Lumping JHOVE’s “Well-Formed, but not valid” and “Well-Formed and valid” classes raised &lt;em&gt;V&lt;/em&gt; even further (0.32), but this is still less than the values for the VeraPDF metrics. &lt;a href=&quot;#fnref:5&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:7&quot;&gt;
      &lt;p&gt;And to a lesser extent VeraPDF’s “No warnings” class. &lt;a href=&quot;#fnref:7&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
  &lt;/ol&gt;
&lt;/div&gt;
</description>
                <link>https://bitsgalore.org/2023/06/29/verapdf-parse-status-as-a-proxy-for-rendering</link>
                <guid>https://bitsgalore.org/2023/06/29/verapdf-parse-status-as-a-proxy-for-rendering</guid>
                <pubDate>2023-06-29T00:00:00+02:00</pubDate>
        </item>

        <item>
                <title>Identification of PDF preservation risks with VeraPDF and JHOVE</title>
                <description>&lt;figure class=&quot;image&quot;&gt;
  &lt;img src=&quot;/images/2023/05/Rock_em_Sock_em_Robots_Game.jpg&quot; alt=&quot;Photo of a red toy robot and a similar looking blue toy robot in a boxing ring. Both robots face each other in a threatening stance.&quot; /&gt;
  &lt;figcaption&gt;&lt;a href=&quot;https://commons.wikimedia.org/wiki/File:Rock_%27em_Sock_%27em_Robots_Game.jpg&quot;&gt;&quot;Rock &apos;em Sock &apos;em Robots Game&quot;&lt;/a&gt; by  Lorie Shaull, used under &lt;a href=&quot;https://creativecommons.org/licenses/by-sa/4.0/deed.en&quot;&gt;CC BY-SA 4.0&lt;/a&gt;, via Wikimedia Commons.&lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;The PDF format has a number of features that don’t sit well with the aims of long-term preservation and accessibility. This includes encryption and password protection, external dependencies (e.g. fonts that are not embedded in a document), and reliance on external software. In this post I’ll review to what extent such features can be detected using &lt;a href=&quot;https://verapdf.org/&quot;&gt;VeraPDF&lt;/a&gt; and &lt;a href=&quot;https://jhove.openpreservation.org/&quot;&gt;JHOVE&lt;/a&gt;. It further builds on earlier work I did on this subject between 2012 and 2017.&lt;/p&gt;

&lt;!-- more --&gt;

&lt;h2 id=&quot;context-of-this-work&quot;&gt;Context of this work&lt;/h2&gt;

&lt;p&gt;In 2019 the KB published its current &lt;a href=&quot;https://www.kb.nl/en/file-download/download/public/845&quot;&gt;preservation policy&lt;/a&gt; for its digital collections. The policy expresses a number of preservation-related goals for the upcoming years. One of these is a (gradual) move from pure &lt;a href=&quot;https://blogs.loc.gov/thesignal/2011/09/b-is-for-bit-preservation/&quot;&gt;bit preservation&lt;/a&gt; towards full functional preservation. To this end, the KB has defined three “knowledge levels”, which are linked to file formats&lt;sup id=&quot;fnref:3&quot;&gt;&lt;a href=&quot;#fn:3&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;1&lt;/a&gt;&lt;/sup&gt;:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;For &lt;strong&gt;stored&lt;/strong&gt; formats, only bit preservation is done, without any formal format identification (no PRONOM identifier).&lt;/li&gt;
  &lt;li&gt;For &lt;strong&gt;identified&lt;/strong&gt; formats, formal format identification has resulted in a PRONOM identifier.&lt;/li&gt;
  &lt;li&gt;For &lt;strong&gt;known&lt;/strong&gt; formats, format validation has been performed, and technical metadata have been extracted. This information can then be used to identify preservation-related risks at the level of individual files, and actions to mitigate these risks can be taken if necessary.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The move from bit preservation to functional preservation involves raising the “knowledge level” status of the formats in our digital collections from “stored” (which is the current situation) to “identified” or, ideally, “known”. This development coincides with the ongoing migration of these collections to our new &lt;a href=&quot;https://www.kb.nl/en/actueel/nieuws/dutch-national-library-steps-new-future-digital-archiving&quot;&gt;Rosetta archiving system&lt;/a&gt;.&lt;/p&gt;

&lt;h2 id=&quot;pdf-preservation-risks&quot;&gt;PDF preservation risks&lt;/h2&gt;

&lt;p&gt;Being one of the &lt;a href=&quot;/2015/04/29/top-50-file-formats-in-the-kb-e-depot&quot;&gt;most prevalent formats&lt;/a&gt; in our collections, it comes as no surprise that PDF is one of the first formats for which we want to raise the current knowledge level. This is why my colleagues at our digital preservation department asked me to take a new dive into PDF features that are potential preservation risks, and, more importantly, how to detect such features with existing software tools.&lt;/p&gt;

&lt;p&gt;As long-term readers may remember, risk detection in PDF isn’t an unfamiliar subject to me. Between 2012 and 2014 I experimented with detecting “risky” PDF features using the open-source &lt;a href=&quot;https://pdfbox.apache.org/&quot;&gt;Apache PDFBox&lt;/a&gt; software&lt;sup id=&quot;fnref:4&quot;&gt;&lt;a href=&quot;#fn:4&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;2&lt;/a&gt;&lt;/sup&gt;. This work was done as part of the &lt;a href=&quot;https://scape-project.eu/&quot;&gt;SCAPE project&lt;/a&gt;. In 2017 I &lt;a href=&quot;/2017/06/01/policy-based-assessment-with-verapdf-a-first-impression&quot;&gt;repeated some of the SCAPE-era experiments&lt;/a&gt;, this time using &lt;a href=&quot;https://verapdf.org/&quot;&gt;VeraPDF&lt;/a&gt;. For various reasons that work never saw a proper follow-up, but six years onwards it’s finally on the agenda once again! In the meantime others have produced relevant related work as well. The &lt;a href=&quot;https://docs.google.com/spreadsheets/d/1eW7R8yACBciNimr16Z2ptC7fs1FlmZMnzdtG_DHBuD4/edit?usp=sharing&quot;&gt;PDF Significant Properties Spreadsheet&lt;/a&gt; by Tyler Thorsted (currently at Brigham Young University) deserves a special mention here. It gives an overview of metadata fields and technical properties that are reported by 12 different software tools.&lt;/p&gt;

&lt;h2 id=&quot;pdf-features-and-risks&quot;&gt;PDF features and risks&lt;/h2&gt;

&lt;p&gt;Before going any further, it’s important to be clear about our definition of preservation risks, and how they are related to PDF features. For this analysis I largely followed the earlier work that I mentioned above as a starting point. This means the focus is mostly on the following features and their associated risks:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Encryption and security-related features. The associated risks are restricted functionality (e.g. text access, copying, printing), or files that are inaccessible altogether (because they can only be opened using a password).&lt;/li&gt;
  &lt;li&gt;External dependencies, such as fonts that are not embedded, or dependencies on external documents. The associated risk is that documents may not render as originally intended if the external resources are not available.&lt;/li&gt;
  &lt;li&gt;Multimedia features, including 3-D content. The associated risk is that the multimedia content may not render without dedicated software.&lt;/li&gt;
  &lt;li&gt;File attachments. The associated risk is that attached files may require external software to render.&lt;/li&gt;
  &lt;li&gt;JavaScript. This presents various risks, which are mostly security-related.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This list is not exhaustive, but it provides a useful starting point. Deviations from the format specification (PDF validity) can also pose potential preservation risks. &lt;a href=&quot;https://blog.dshr.org/2009/01/postels-law.html&quot;&gt;Opinions are divided&lt;/a&gt; about the importance of format validity in actual practice. PDF format validation is a huge subject in its own right&lt;sup id=&quot;fnref:7&quot;&gt;&lt;a href=&quot;#fn:7&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;3&lt;/a&gt;&lt;/sup&gt;, which is out of the scope of the current analysis.&lt;/p&gt;

&lt;h2 id=&quot;verapdf-vs-jhove&quot;&gt;VeraPDF vs JHOVE&lt;/h2&gt;

&lt;p&gt;The &lt;a href=&quot;/2017/06/01/policy-based-assessment-with-verapdf-a-first-impression&quot;&gt;2017 analysis&lt;/a&gt; already showed that VeraPDF was able to detect most risk-associated PDF features, which made the inclusion of VeraPDF an obvious choice for this follow-up. VeraPDF is currently not included in Rosetta, and my colleagues wondered to what extent &lt;a href=&quot;https://jhove.openpreservation.org/&quot;&gt;JHOVE&lt;/a&gt; (which is part of Rosetta’s default setup) would be up to the job. So, in the remainder of this blog post I will present a comparison of VeraPDF and JHOVE.&lt;/p&gt;

&lt;p&gt;It’s important to mention that VeraPDF and JHOVE have different (but partially overlapping) scopes and functionalities. VeraPDF was primarily written to test for conformance against the various &lt;a href=&quot;https://en.wikipedia.org/wiki/PDF/A&quot;&gt;PDF/A&lt;/a&gt; profiles, with recent versions also supporting (partial) conformance testing for &lt;a href=&quot;https://en.wikipedia.org/wiki/PDF/UA&quot;&gt;PDF/UA&lt;/a&gt;. It does not (or at best only to a limited degree) validate the lower-level data structures that make up a PDF file. The latter is the primary domain of JHOVE’s PDF module, which was designed for “full on” PDF validation&lt;sup id=&quot;fnref:5&quot;&gt;&lt;a href=&quot;#fn:5&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;4&lt;/a&gt;&lt;/sup&gt;. The main overlap between VeraPDF and JHOVE lies in their ability to extract metadata and technical features, and for the current analysis I have only looked at this particular functionality of both tools.&lt;/p&gt;

&lt;p&gt;The following table shows the evaluated VeraPDF and JHOVE versions:&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th style=&quot;text-align: left&quot;&gt;Software&lt;/th&gt;
      &lt;th style=&quot;text-align: left&quot;&gt;Version&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;VeraPDF&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;1.22.3&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;JHOVE&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;1.28.0&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;By default, VeraPDF’s feature extraction mode only extracts metadata from a PDF’s document information dictionary. This is not sufficient for the purpose of this analysis, so I updated VeraPDF’s configuration and enabled &lt;em&gt;all&lt;/em&gt; feature extraction types&lt;sup id=&quot;fnref:8&quot;&gt;&lt;a href=&quot;#fn:8&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;5&lt;/a&gt;&lt;/sup&gt;.&lt;/p&gt;

&lt;h2 id=&quot;methods-and-test-data&quot;&gt;Methods and test data&lt;/h2&gt;

&lt;p&gt;I used 3 different data sets for this analysis. First, I used the &lt;a href=&quot;https://github.com/openpreserve/format-corpus/tree/master/pdfCabinetOfHorrors&quot;&gt;PDF Cabinet of Horrors corpus&lt;/a&gt;. This is a small, hand-curated corpus of PDFs that is part of the &lt;a href=&quot;https://github.com/openpreserve/format-corpus/tree/master&quot;&gt;Open Preservation Foundation’s file format corpus&lt;/a&gt;. Since this is an annotated dataset with files that have known features (such as security-related features, non-embedded fonts, and multimedia), it provides excellent ground truth. I ran VeraPDF and JHOVE on all files in this data set, and scrutinised the resulting output files in detail. The main goal of this part of the analysis (which is presented in detail in the next section) was twofold:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Establish to what extent VeraPDF and JHOVE are able to detect the known features in these files.&lt;/li&gt;
  &lt;li&gt;Identify broad patterns in VeraPDF’s and JHOVE’s output that are associated with potential preservation risks.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The results then served as input for a second test, which is based on a subset of files from the (now defunct) &lt;a href=&quot;https://web.archive.org/web/20130503115947/http://acroeng.adobe.com/wp/&quot;&gt;Adobe Acrobat Engineering website&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Finally, I tested how well VeraPDF and JHOVE are able to handle PDF 2.0 documents by running them on the PDF Association’s &lt;a href=&quot;https://github.com/pdf-association/pdf20examples&quot;&gt;PDF 2.0 examples&lt;/a&gt; dataset.&lt;/p&gt;

&lt;h2 id=&quot;advance-warning&quot;&gt;Advance warning&lt;/h2&gt;

&lt;p&gt;The level of detail in the following sections may be too much for some (most?) readers to digest, except perhaps for the most hardcore PDF freaks (you know who you are!). This applies in particular to the “Horror Corpus” section. If that sounds like too much detail, you may want to skip right to the “Discussion” section from here. Otherwise, now is the time to fasten your seatbelts!&lt;/p&gt;

&lt;h2 id=&quot;analysis-of-horror-corpus&quot;&gt;Analysis of Horror Corpus&lt;/h2&gt;

&lt;p&gt;I will start with a rather detailed analysis of the &lt;a href=&quot;https://github.com/openpreserve/format-corpus/tree/master/pdfCabinetOfHorrors&quot;&gt;Horror Corpus&lt;/a&gt;, as this is an annotated corpus of files with known features. Each of the following sub-sections covers one “risky” feature, which is demonstrated by one or more files in the Horror Corpus. For each feature, I show how it is represented in VeraPDF’s and JHOVE’s output (if it is represented at all). Even though this only covers a limited selection of “risky” features, this analysis is useful to get a first impression of the differences between VeraPDF and JHOVE. It also enables us to identify which parts of the output of both tools are of potential interest for further perusal.&lt;/p&gt;

&lt;h3 id=&quot;open-password&quot;&gt;Open password&lt;/h3&gt;

&lt;h4 id=&quot;test-files&quot;&gt;Test files&lt;/h4&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th style=&quot;text-align: left&quot;&gt;File&lt;/th&gt;
      &lt;th style=&quot;text-align: left&quot;&gt;Description&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;&lt;a href=&quot;https://github.com/openpreserve/format-corpus/blob/master/pdfCabinetOfHorrors/encryption_openpassword.pdf?raw=true&quot;&gt;encryption_openpassword.pdf&lt;/a&gt;&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Opening the document requires password.&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;figure class=&quot;image&quot;&gt;
  &lt;img src=&quot;/images/2015/07/openpassword.png&quot; alt=&quot;Screenshot of dialog window that asks to enter a Document Open password.&quot; /&gt;
  &lt;figcaption&gt;Adobe Acrobat dialog on opening a PDF with an open password.&lt;/figcaption&gt; 
&lt;/figure&gt;

&lt;h4 id=&quot;verapdf&quot;&gt;VeraPDF&lt;/h4&gt;

&lt;p&gt;For this file, VeraPDF reports an exception (full output &lt;a href=&quot;https://github.com/KBNLresearch/pdf-characterisation/blob/main/output/horror/encryption_openpassword-vera.xml&quot;&gt;here&lt;/a&gt;):&lt;/p&gt;

&lt;div class=&quot;language-xml highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nt&quot;&gt;&amp;lt;taskResult&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;type=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;PARSE&quot;&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;isExecuted=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;true&quot;&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;isSuccess=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;false&quot;&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;&amp;lt;duration&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;start=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;1683818755295&quot;&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;finish=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;1683818755343&quot;&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;&amp;gt;&lt;/span&gt;00:00:00.048&lt;span class=&quot;nt&quot;&gt;&amp;lt;/duration&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;&amp;lt;exceptionMessage&amp;gt;&lt;/span&gt;Exception: The PDF stream appears to be encrypted. caused by exception: Reader::init(...)encrypted pdf is not supported&lt;span class=&quot;nt&quot;&gt;&amp;lt;/exceptionMessage&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;/taskResult&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
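
&lt;p&gt;As a minimal sketch of how such a failure could be picked up automatically, the Python snippet below collects the exception messages of unsuccessful &lt;em&gt;PARSE&lt;/em&gt; tasks and applies a simple keyword check. The function names are hypothetical, the check on the word “encrypted” is an assumption based solely on the message shown above, and matching on local element names (ignoring any XML namespace) is a simplification.&lt;/p&gt;

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Return an element's tag without any '{namespace}' prefix."""
    return tag.rsplit("}", 1)[-1]


def failed_parse_messages(root):
    """Collect exception messages of PARSE tasks that did not succeed.

    'root' is any parsed element that contains taskResult elements.
    """
    messages = []
    for elem in root.iter():
        if local_name(elem.tag) != "taskResult":
            continue
        if elem.get("type") != "PARSE" or elem.get("isSuccess") != "false":
            continue
        for child in elem:
            if local_name(child.tag) == "exceptionMessage":
                messages.append((child.text or "").strip())
    return messages


def looks_encrypted(messages):
    """Heuristic (assumption): a message that mentions 'encrypted'
    points to a file that VeraPDF could not open because of encryption."""
    return any("encrypted" in message.lower() for message in messages)
```

&lt;p&gt;Assuming the output was saved to a file, something like &lt;em&gt;failed_parse_messages(ET.parse(path).getroot())&lt;/em&gt; would return the message texts, which can then be passed to &lt;em&gt;looks_encrypted&lt;/em&gt;. Note that this only tells us that the file is encrypted in a way VeraPDF cannot handle, not that an open password is the cause.&lt;/p&gt;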

&lt;h4 id=&quot;jhove&quot;&gt;JHOVE&lt;/h4&gt;

&lt;p&gt;Unlike VeraPDF, JHOVE appears to be able to parse the file in spite of the open password. Its &lt;a href=&quot;https://github.com/KBNLresearch/pdf-characterisation/blob/main/output/horror/encryption_openpassword-jhove.xml&quot;&gt;output&lt;/a&gt; contains an “Encryption” element with several child elements:&lt;/p&gt;

&lt;div class=&quot;language-xml highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nt&quot;&gt;&amp;lt;property&amp;gt;&lt;/span&gt;
 &lt;span class=&quot;nt&quot;&gt;&amp;lt;name&amp;gt;&lt;/span&gt;Encryption&lt;span class=&quot;nt&quot;&gt;&amp;lt;/name&amp;gt;&lt;/span&gt;
 &lt;span class=&quot;nt&quot;&gt;&amp;lt;values&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;arity=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;List&quot;&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;type=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;Property&quot;&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;&amp;gt;&lt;/span&gt;
 &lt;span class=&quot;nt&quot;&gt;&amp;lt;property&amp;gt;&lt;/span&gt;
  &lt;span class=&quot;nt&quot;&gt;&amp;lt;name&amp;gt;&lt;/span&gt;SecurityHandler&lt;span class=&quot;nt&quot;&gt;&amp;lt;/name&amp;gt;&lt;/span&gt;
  &lt;span class=&quot;nt&quot;&gt;&amp;lt;values&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;arity=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;Scalar&quot;&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;type=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;String&quot;&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;&amp;gt;&lt;/span&gt;
   &lt;span class=&quot;nt&quot;&gt;&amp;lt;value&amp;gt;&lt;/span&gt;Standard&lt;span class=&quot;nt&quot;&gt;&amp;lt;/value&amp;gt;&lt;/span&gt;
  &lt;span class=&quot;nt&quot;&gt;&amp;lt;/values&amp;gt;&lt;/span&gt;
 &lt;span class=&quot;nt&quot;&gt;&amp;lt;/property&amp;gt;&lt;/span&gt;
 &lt;span class=&quot;nt&quot;&gt;&amp;lt;property&amp;gt;&lt;/span&gt;
  &lt;span class=&quot;nt&quot;&gt;&amp;lt;name&amp;gt;&lt;/span&gt;EFF&lt;span class=&quot;nt&quot;&gt;&amp;lt;/name&amp;gt;&lt;/span&gt;
  &lt;span class=&quot;nt&quot;&gt;&amp;lt;values&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;arity=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;Scalar&quot;&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;type=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;String&quot;&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;&amp;gt;&lt;/span&gt;
   &lt;span class=&quot;nt&quot;&gt;&amp;lt;value&amp;gt;&lt;/span&gt;Standard&lt;span class=&quot;nt&quot;&gt;&amp;lt;/value&amp;gt;&lt;/span&gt;
  &lt;span class=&quot;nt&quot;&gt;&amp;lt;/values&amp;gt;&lt;/span&gt;
 &lt;span class=&quot;nt&quot;&gt;&amp;lt;/property&amp;gt;&lt;/span&gt;
 &lt;span class=&quot;nt&quot;&gt;&amp;lt;property&amp;gt;&lt;/span&gt;
  &lt;span class=&quot;nt&quot;&gt;&amp;lt;name&amp;gt;&lt;/span&gt;Algorithm&lt;span class=&quot;nt&quot;&gt;&amp;lt;/name&amp;gt;&lt;/span&gt;
  &lt;span class=&quot;nt&quot;&gt;&amp;lt;values&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;arity=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;Scalar&quot;&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;type=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;String&quot;&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;&amp;gt;&lt;/span&gt;
   &lt;span class=&quot;nt&quot;&gt;&amp;lt;value&amp;gt;&lt;/span&gt;Document-defined&lt;span class=&quot;nt&quot;&gt;&amp;lt;/value&amp;gt;&lt;/span&gt;
  &lt;span class=&quot;nt&quot;&gt;&amp;lt;/values&amp;gt;&lt;/span&gt;
 &lt;span class=&quot;nt&quot;&gt;&amp;lt;/property&amp;gt;&lt;/span&gt;
 &lt;span class=&quot;nt&quot;&gt;&amp;lt;property&amp;gt;&lt;/span&gt;
  &lt;span class=&quot;nt&quot;&gt;&amp;lt;name&amp;gt;&lt;/span&gt;KeyLength&lt;span class=&quot;nt&quot;&gt;&amp;lt;/name&amp;gt;&lt;/span&gt;
  &lt;span class=&quot;nt&quot;&gt;&amp;lt;values&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;arity=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;Scalar&quot;&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;type=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;Integer&quot;&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;&amp;gt;&lt;/span&gt;
   &lt;span class=&quot;nt&quot;&gt;&amp;lt;value&amp;gt;&lt;/span&gt;128&lt;span class=&quot;nt&quot;&gt;&amp;lt;/value&amp;gt;&lt;/span&gt;
  &lt;span class=&quot;nt&quot;&gt;&amp;lt;/values&amp;gt;&lt;/span&gt;
 &lt;span class=&quot;nt&quot;&gt;&amp;lt;/property&amp;gt;&lt;/span&gt;
 &lt;span class=&quot;nt&quot;&gt;&amp;lt;property&amp;gt;&lt;/span&gt;
  &lt;span class=&quot;nt&quot;&gt;&amp;lt;name&amp;gt;&lt;/span&gt;StandardSecurityHandler&lt;span class=&quot;nt&quot;&gt;&amp;lt;/name&amp;gt;&lt;/span&gt;
  &lt;span class=&quot;nt&quot;&gt;&amp;lt;values&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;arity=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;List&quot;&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;type=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;Property&quot;&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;&amp;gt;&lt;/span&gt;
  &lt;span class=&quot;nt&quot;&gt;&amp;lt;property&amp;gt;&lt;/span&gt;
   &lt;span class=&quot;nt&quot;&gt;&amp;lt;name&amp;gt;&lt;/span&gt;UserAccess&lt;span class=&quot;nt&quot;&gt;&amp;lt;/name&amp;gt;&lt;/span&gt;
   &lt;span class=&quot;nt&quot;&gt;&amp;lt;values&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;arity=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;List&quot;&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;type=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;String&quot;&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;&amp;lt;value&amp;gt;&lt;/span&gt;Print&lt;span class=&quot;nt&quot;&gt;&amp;lt;/value&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;&amp;lt;value&amp;gt;&lt;/span&gt;Modify&lt;span class=&quot;nt&quot;&gt;&amp;lt;/value&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;&amp;lt;value&amp;gt;&lt;/span&gt;Extract&lt;span class=&quot;nt&quot;&gt;&amp;lt;/value&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;&amp;lt;value&amp;gt;&lt;/span&gt;Add/modify annotations/forms&lt;span class=&quot;nt&quot;&gt;&amp;lt;/value&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;&amp;lt;value&amp;gt;&lt;/span&gt;Fill interactive form fields&lt;span class=&quot;nt&quot;&gt;&amp;lt;/value&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;&amp;lt;value&amp;gt;&lt;/span&gt;Extract for accessibility&lt;span class=&quot;nt&quot;&gt;&amp;lt;/value&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;&amp;lt;value&amp;gt;&lt;/span&gt;Print high quality&lt;span class=&quot;nt&quot;&gt;&amp;lt;/value&amp;gt;&lt;/span&gt;
   &lt;span class=&quot;nt&quot;&gt;&amp;lt;/values&amp;gt;&lt;/span&gt;
  &lt;span class=&quot;nt&quot;&gt;&amp;lt;/property&amp;gt;&lt;/span&gt;
  &lt;span class=&quot;nt&quot;&gt;&amp;lt;property&amp;gt;&lt;/span&gt;
   &lt;span class=&quot;nt&quot;&gt;&amp;lt;name&amp;gt;&lt;/span&gt;Revision&lt;span class=&quot;nt&quot;&gt;&amp;lt;/name&amp;gt;&lt;/span&gt;
   &lt;span class=&quot;nt&quot;&gt;&amp;lt;values&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;arity=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;Scalar&quot;&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;type=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;Integer&quot;&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;&amp;lt;value&amp;gt;&lt;/span&gt;4&lt;span class=&quot;nt&quot;&gt;&amp;lt;/value&amp;gt;&lt;/span&gt;
   &lt;span class=&quot;nt&quot;&gt;&amp;lt;/values&amp;gt;&lt;/span&gt;
  &lt;span class=&quot;nt&quot;&gt;&amp;lt;/property&amp;gt;&lt;/span&gt;
  &lt;span class=&quot;nt&quot;&gt;&amp;lt;property&amp;gt;&lt;/span&gt;
   &lt;span class=&quot;nt&quot;&gt;&amp;lt;name&amp;gt;&lt;/span&gt;OwnerString&lt;span class=&quot;nt&quot;&gt;&amp;lt;/name&amp;gt;&lt;/span&gt;
   &lt;span class=&quot;nt&quot;&gt;&amp;lt;values&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;arity=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;Scalar&quot;&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;type=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;String&quot;&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;&amp;lt;value&amp;gt;&lt;/span&gt;0x03a59d10aae3b50f1a30c34fbb1be09ababfc1fb19f0d3491ee84c1752671918&lt;span class=&quot;nt&quot;&gt;&amp;lt;/value&amp;gt;&lt;/span&gt;
   &lt;span class=&quot;nt&quot;&gt;&amp;lt;/values&amp;gt;&lt;/span&gt;
  &lt;span class=&quot;nt&quot;&gt;&amp;lt;/property&amp;gt;&lt;/span&gt;
  &lt;span class=&quot;nt&quot;&gt;&amp;lt;property&amp;gt;&lt;/span&gt;
   &lt;span class=&quot;nt&quot;&gt;&amp;lt;name&amp;gt;&lt;/span&gt;UserString&lt;span class=&quot;nt&quot;&gt;&amp;lt;/name&amp;gt;&lt;/span&gt;
   &lt;span class=&quot;nt&quot;&gt;&amp;lt;values&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;arity=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;Scalar&quot;&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;type=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;String&quot;&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;&amp;lt;value&amp;gt;&lt;/span&gt;0xf240da93aa83bb9f336cf249812c4db700000000000000000000000000000000&lt;span class=&quot;nt&quot;&gt;&amp;lt;/value&amp;gt;&lt;/span&gt;
   &lt;span class=&quot;nt&quot;&gt;&amp;lt;/values&amp;gt;&lt;/span&gt;
  &lt;span class=&quot;nt&quot;&gt;&amp;lt;/property&amp;gt;&lt;/span&gt;
  &lt;span class=&quot;nt&quot;&gt;&amp;lt;/values&amp;gt;&lt;/span&gt;
 &lt;span class=&quot;nt&quot;&gt;&amp;lt;/property&amp;gt;&lt;/span&gt;
 &lt;span class=&quot;nt&quot;&gt;&amp;lt;/values&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;/property&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Even though it’s helpful that JHOVE shows that this file contains security-related features, the above output doesn’t provide any direct clue of the more specific open password feature&lt;sup id=&quot;fnref:9&quot;&gt;&lt;a href=&quot;#fn:9&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;6&lt;/a&gt;&lt;/sup&gt;. This is unfortunate, given that open passwords are one of the most serious PDF preservation risks.&lt;/p&gt;

&lt;h3 id=&quot;copy-printing-and-text-access-passwords&quot;&gt;Copy, printing and text access passwords&lt;/h3&gt;

&lt;h4 id=&quot;test-files-1&quot;&gt;Test files&lt;/h4&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th style=&quot;text-align: left&quot;&gt;File&lt;/th&gt;
      &lt;th style=&quot;text-align: left&quot;&gt;Description&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;&lt;a href=&quot;https://github.com/openpreserve/format-corpus/blob/master/pdfCabinetOfHorrors/encryption_nocopy.pdf?raw=true&quot;&gt;encryption_nocopy.pdf&lt;/a&gt;&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Copying document contents requires password.&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;&lt;a href=&quot;https://github.com/openpreserve/format-corpus/blob/master/pdfCabinetOfHorrors/encryption_noprinting.pdf?raw=true&quot;&gt;encryption_noprinting.pdf&lt;/a&gt;&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Printing requires password.&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;&lt;a href=&quot;https://github.com/openpreserve/format-corpus/blob/master/pdfCabinetOfHorrors/encryption_notextaccess.pdf?raw=true&quot;&gt;encryption_notextaccess.pdf&lt;/a&gt;&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Text access (e.g. by a screen reader) requires password.&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;figure class=&quot;image&quot;&gt;
  &lt;img src=&quot;/images/2015/07/printpassword.png&quot; alt=&quot;Screenshot of Adobe Acrobat showing a &apos;you cannot print this document&apos; message.&quot; /&gt;
  &lt;figcaption&gt;Adobe Acrobat&apos;s rendering of a PDF that requires a password for printing.&lt;/figcaption&gt;
&lt;/figure&gt;

&lt;h4 id=&quot;verapdf-1&quot;&gt;VeraPDF&lt;/h4&gt;

&lt;p&gt;Information on usage restrictions can be found in the &lt;em&gt;documentSecurity&lt;/em&gt; element of VeraPDF’s output. Below is an example (taken from the &lt;a href=&quot;https://github.com/KBNLresearch/pdf-characterisation/blob/main/output/horror/encryption_nocopy-vera.xml&quot;&gt;output&lt;/a&gt; of the copy-protected file):&lt;/p&gt;

&lt;div class=&quot;language-xml highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nt&quot;&gt;&amp;lt;documentSecurity&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;&amp;lt;filter&amp;gt;&lt;/span&gt;Standard&lt;span class=&quot;nt&quot;&gt;&amp;lt;/filter&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;&amp;lt;version&amp;gt;&lt;/span&gt;4&lt;span class=&quot;nt&quot;&gt;&amp;lt;/version&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;&amp;lt;length&amp;gt;&lt;/span&gt;128&lt;span class=&quot;nt&quot;&gt;&amp;lt;/length&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;&amp;lt;ownerKey&amp;gt;&lt;/span&gt;5F7FA03AD5A40C66...&lt;span class=&quot;nt&quot;&gt;&amp;lt;/ownerKey&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;&amp;lt;userKey&amp;gt;&lt;/span&gt;A572ACDFF8B3DC74...&lt;span class=&quot;nt&quot;&gt;&amp;lt;/userKey&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;&amp;lt;encryptMetadata&amp;gt;&lt;/span&gt;true&lt;span class=&quot;nt&quot;&gt;&amp;lt;/encryptMetadata&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;&amp;lt;printAllowed&amp;gt;&lt;/span&gt;true&lt;span class=&quot;nt&quot;&gt;&amp;lt;/printAllowed&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;&amp;lt;printDegradedAllowed&amp;gt;&lt;/span&gt;true&lt;span class=&quot;nt&quot;&gt;&amp;lt;/printDegradedAllowed&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;&amp;lt;changesAllowed&amp;gt;&lt;/span&gt;true&lt;span class=&quot;nt&quot;&gt;&amp;lt;/changesAllowed&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;&amp;lt;modifyAnnotationsAllowed&amp;gt;&lt;/span&gt;true&lt;span class=&quot;nt&quot;&gt;&amp;lt;/modifyAnnotationsAllowed&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;&amp;lt;fillingSigningAllowed&amp;gt;&lt;/span&gt;true&lt;span class=&quot;nt&quot;&gt;&amp;lt;/fillingSigningAllowed&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;&amp;lt;documentAssemblyAllowed&amp;gt;&lt;/span&gt;false&lt;span class=&quot;nt&quot;&gt;&amp;lt;/documentAssemblyAllowed&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;&amp;lt;extractContentAllowed&amp;gt;&lt;/span&gt;false&lt;span class=&quot;nt&quot;&gt;&amp;lt;/extractContentAllowed&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;&amp;lt;extractAccessibilityAllowed&amp;gt;&lt;/span&gt;true&lt;span class=&quot;nt&quot;&gt;&amp;lt;/extractAccessibilityAllowed&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;/documentSecurity&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Here, the “false” value of the &lt;em&gt;extractContentAllowed&lt;/em&gt; element indicates this file is copy-protected. The &lt;em&gt;printAllowed&lt;/em&gt; and &lt;em&gt;printDegradedAllowed&lt;/em&gt; elements indicate whether printing is allowed (example &lt;a href=&quot;https://github.com/KBNLresearch/pdf-characterisation/blob/main/output/horror/encryption_noprinting-vera.xml&quot;&gt;here&lt;/a&gt;), and “false” values of &lt;em&gt;extractContentAllowed&lt;/em&gt; and &lt;em&gt;extractAccessibilityAllowed&lt;/em&gt; indicate a text access password (example &lt;a href=&quot;https://github.com/KBNLresearch/pdf-characterisation/blob/main/output/horror/encryption_notextaccess-vera.xml&quot;&gt;here&lt;/a&gt;).&lt;/p&gt;
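
&lt;p&gt;The rules described here can be sketched in a few lines of Python. This is an illustration only: the function names and the keys of the summary dictionary are hypothetical, the list of permission elements is simply copied from the sample output above, and treating a “false” value in &lt;em&gt;either&lt;/em&gt; print element as a print restriction is my own assumption.&lt;/p&gt;

```python
import xml.etree.ElementTree as ET

# Permission elements as they appear in the documentSecurity sample above.
PERMISSION_ELEMENTS = (
    "printAllowed",
    "printDegradedAllowed",
    "changesAllowed",
    "modifyAnnotationsAllowed",
    "fillingSigningAllowed",
    "documentAssemblyAllowed",
    "extractContentAllowed",
    "extractAccessibilityAllowed",
)


def local_name(tag):
    """Return an element's tag without any '{namespace}' prefix."""
    return tag.rsplit("}", 1)[-1]


def denied_permissions(root):
    """Return the names of permission elements whose value is 'false'.

    Only the first documentSecurity element under 'root' is inspected;
    an empty list is returned if there is none.
    """
    for elem in root.iter():
        if local_name(elem.tag) != "documentSecurity":
            continue
        return [
            local_name(child.tag)
            for child in elem
            if local_name(child.tag) in PERMISSION_ELEMENTS
            and (child.text or "").strip().lower() == "false"
        ]
    return []


def summarise(denied):
    """Map denied permissions onto the three restrictions discussed above."""
    return {
        "copy_restricted": "extractContentAllowed" in denied,
        # Assumption: either print flag being 'false' counts as a restriction.
        "print_restricted": "printAllowed" in denied
        or "printDegradedAllowed" in denied,
        "text_access_restricted": "extractContentAllowed" in denied
        and "extractAccessibilityAllowed" in denied,
    }
```

&lt;p&gt;Because VeraPDF always reports every permission with an explicit value, the check boils down to looking for “false” values, without having to reason about anything that is missing from the output.&lt;/p&gt;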

&lt;h4 id=&quot;jhove-1&quot;&gt;JHOVE&lt;/h4&gt;

&lt;p&gt;For JHOVE, the &lt;em&gt;UserAccess&lt;/em&gt; property (which is a child element of the &lt;em&gt;Encryption&lt;/em&gt; property) provides information about usage restrictions. Below is this property for the copy-restricted document (full output &lt;a href=&quot;https://github.com/KBNLresearch/pdf-characterisation/blob/main/output/horror/encryption_nocopy-jhove.xml&quot;&gt;here&lt;/a&gt;):&lt;/p&gt;

&lt;div class=&quot;language-xml highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nt&quot;&gt;&amp;lt;property&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;&amp;lt;name&amp;gt;&lt;/span&gt;UserAccess&lt;span class=&quot;nt&quot;&gt;&amp;lt;/name&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;&amp;lt;values&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;arity=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;List&quot;&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;type=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;String&quot;&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;&amp;gt;&lt;/span&gt;
        &lt;span class=&quot;nt&quot;&gt;&amp;lt;value&amp;gt;&lt;/span&gt;Print&lt;span class=&quot;nt&quot;&gt;&amp;lt;/value&amp;gt;&lt;/span&gt;
        &lt;span class=&quot;nt&quot;&gt;&amp;lt;value&amp;gt;&lt;/span&gt;Modify&lt;span class=&quot;nt&quot;&gt;&amp;lt;/value&amp;gt;&lt;/span&gt;
        &lt;span class=&quot;nt&quot;&gt;&amp;lt;value&amp;gt;&lt;/span&gt;Add/modify annotations/forms&lt;span class=&quot;nt&quot;&gt;&amp;lt;/value&amp;gt;&lt;/span&gt;
        &lt;span class=&quot;nt&quot;&gt;&amp;lt;value&amp;gt;&lt;/span&gt;Fill interactive form fields&lt;span class=&quot;nt&quot;&gt;&amp;lt;/value&amp;gt;&lt;/span&gt;
        &lt;span class=&quot;nt&quot;&gt;&amp;lt;value&amp;gt;&lt;/span&gt;Extract for accessibility&lt;span class=&quot;nt&quot;&gt;&amp;lt;/value&amp;gt;&lt;/span&gt;
        &lt;span class=&quot;nt&quot;&gt;&amp;lt;value&amp;gt;&lt;/span&gt;Print high quality&lt;span class=&quot;nt&quot;&gt;&amp;lt;/value&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;&amp;lt;/values&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;/property&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;And here for the print-restricted document (full output &lt;a href=&quot;https://github.com/KBNLresearch/pdf-characterisation/blob/main/output/horror/encryption_noprinting-jhove.xml&quot;&gt;here&lt;/a&gt;):&lt;/p&gt;

&lt;div class=&quot;language-xml highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nt&quot;&gt;&amp;lt;property&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;&amp;lt;name&amp;gt;&lt;/span&gt;UserAccess&lt;span class=&quot;nt&quot;&gt;&amp;lt;/name&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;&amp;lt;values&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;arity=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;List&quot;&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;type=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;String&quot;&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;&amp;gt;&lt;/span&gt;
        &lt;span class=&quot;nt&quot;&gt;&amp;lt;value&amp;gt;&lt;/span&gt;Modify&lt;span class=&quot;nt&quot;&gt;&amp;lt;/value&amp;gt;&lt;/span&gt;
        &lt;span class=&quot;nt&quot;&gt;&amp;lt;value&amp;gt;&lt;/span&gt;Extract&lt;span class=&quot;nt&quot;&gt;&amp;lt;/value&amp;gt;&lt;/span&gt;
        &lt;span class=&quot;nt&quot;&gt;&amp;lt;value&amp;gt;&lt;/span&gt;Add/modify annotations/forms&lt;span class=&quot;nt&quot;&gt;&amp;lt;/value&amp;gt;&lt;/span&gt;
        &lt;span class=&quot;nt&quot;&gt;&amp;lt;value&amp;gt;&lt;/span&gt;Fill interactive form fields&lt;span class=&quot;nt&quot;&gt;&amp;lt;/value&amp;gt;&lt;/span&gt;
        &lt;span class=&quot;nt&quot;&gt;&amp;lt;value&amp;gt;&lt;/span&gt;Extract for accessibility&lt;span class=&quot;nt&quot;&gt;&amp;lt;/value&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;&amp;lt;/values&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;/property&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The above examples show that JHOVE only reports user access types that are &lt;em&gt;permitted&lt;/em&gt;. This means that if we want to verify whether a document is print-restricted, we need to check for the &lt;em&gt;absence&lt;/em&gt; of the values &lt;em&gt;Print&lt;/em&gt; and &lt;em&gt;Print high quality&lt;/em&gt;. This is impractical, because it assumes the user has &lt;em&gt;a priori&lt;/em&gt; knowledge of all possible values that JHOVE is capable of reporting (which are also undocumented). VeraPDF makes this considerably easier by always reporting &lt;em&gt;all&lt;/em&gt; user access types and their respective values.&lt;/p&gt;
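
&lt;p&gt;To illustrate what such an absence check involves, here is a rough Python sketch. It relies on a hard-coded list of access values, which is nothing more than the set of values that happen to occur in the JHOVE samples above. It is an assumption rather than an authoritative list, and the function names are hypothetical.&lt;/p&gt;

```python
import xml.etree.ElementTree as ET

# Values seen in the JHOVE samples above. This is NOT an authoritative
# list of everything JHOVE may report; it is an assumption for this sketch.
OBSERVED_ACCESS_VALUES = (
    "Print",
    "Modify",
    "Extract",
    "Add/modify annotations/forms",
    "Fill interactive form fields",
    "Extract for accessibility",
    "Print high quality",
)


def local_name(tag):
    """Return an element's tag without any '{namespace}' prefix."""
    return tag.rsplit("}", 1)[-1]


def child_text(elem, name):
    """Return the text of the first direct child called 'name', or None."""
    for child in elem:
        if local_name(child.tag) == name:
            return (child.text or "").strip()
    return None


def permitted_access(root):
    """Return the set of values under the first UserAccess property.

    Returns None if no UserAccess property is found under 'root'.
    """
    for prop in root.iter():
        if local_name(prop.tag) != "property":
            continue
        if child_text(prop, "name") != "UserAccess":
            continue
        return {
            (value.text or "").strip()
            for value in prop.iter()
            if local_name(value.tag) == "value"
        }
    return None


def missing_access(permitted):
    """List the observed access values that are absent from 'permitted'."""
    return [v for v in OBSERVED_ACCESS_VALUES if v not in permitted]
```

&lt;p&gt;For the print-restricted sample above, &lt;em&gt;missing_access&lt;/em&gt; would yield &lt;em&gt;Print&lt;/em&gt; and &lt;em&gt;Print high quality&lt;/em&gt;. The weak spot is the hard-coded list: any value that JHOVE can report but that is not in the list would silently go unnoticed.&lt;/p&gt;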

&lt;h3 id=&quot;multimedia&quot;&gt;Multimedia&lt;/h3&gt;

&lt;h4 id=&quot;test-files-2&quot;&gt;Test files&lt;/h4&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th style=&quot;text-align: left&quot;&gt;File&lt;/th&gt;
      &lt;th style=&quot;text-align: left&quot;&gt;Description&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;&lt;a href=&quot;https://github.com/openpreserve/format-corpus/blob/master/pdfCabinetOfHorrors/embedded_video_quicktime.pdf?raw=true&quot;&gt;embedded_video_quicktime.pdf&lt;/a&gt;&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Contains embedded Quicktime movie.&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;&lt;a href=&quot;https://github.com/openpreserve/format-corpus/blob/master/pdfCabinetOfHorrors/embedded_video_avi.pdf?raw=true&quot;&gt;embedded_video_avi.pdf&lt;/a&gt;&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Contains embedded AVI movie.&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;figure class=&quot;image&quot;&gt;
  &lt;img src=&quot;/images/2015/07/embeddquicktime.png&quot; alt=&quot;Screenshot of Adobe Acrobat&apos;s rendering of a PDF with an embedded Quicktime movie. As Acrobat cannot play multimedia content natively, it shows a dialog saying &apos;The media requires an additional player. Please click Get Media Player to download the correct media player&apos;.&quot; /&gt;
  &lt;figcaption&gt;Adobe Acrobat&apos;s rendering of a PDF with an embedded Quicktime movie.&lt;/figcaption&gt; 
&lt;/figure&gt;

&lt;h4 id=&quot;verapdf-2&quot;&gt;VeraPDF&lt;/h4&gt;

&lt;p&gt;For both files the presence of the multimedia content is indicated by two elements in &lt;a href=&quot;https://github.com/KBNLresearch/pdf-characterisation/blob/main/output/horror/embedded_video_quicktime-vera.xml&quot;&gt;VeraPDF’s output&lt;/a&gt;. First, the &lt;em&gt;actions&lt;/em&gt; element contains a reference to a Rendition action:&lt;/p&gt;

&lt;div class=&quot;language-xml highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nt&quot;&gt;&amp;lt;action&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;type=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;Rendition&quot;&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;&amp;lt;location&amp;gt;&lt;/span&gt;Annotation&lt;span class=&quot;nt&quot;&gt;&amp;lt;/location&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;/action&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Furthermore, the &lt;em&gt;annotations&lt;/em&gt; element contains a reference to a Screen annotation:&lt;/p&gt;

&lt;div class=&quot;language-xml highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nt&quot;&gt;&amp;lt;annotation&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;id=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;annotIndir35&quot;&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;&amp;lt;subType&amp;gt;&lt;/span&gt;Screen&lt;span class=&quot;nt&quot;&gt;&amp;lt;/subType&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;&amp;lt;rectangle&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;lly=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;360.647&quot;&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;llx=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;73.180&quot;&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;urx=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;393.180&quot;&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;ury=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;600.647&quot;&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;&amp;gt;&amp;lt;/rectangle&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;&amp;lt;width&amp;gt;&lt;/span&gt;320.000&lt;span class=&quot;nt&quot;&gt;&amp;lt;/width&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;&amp;lt;height&amp;gt;&lt;/span&gt;240.000&lt;span class=&quot;nt&quot;&gt;&amp;lt;/height&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;&amp;lt;resources&amp;gt;&lt;/span&gt;
        &lt;span class=&quot;nt&quot;&gt;&amp;lt;xobject&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;id=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;xobjIndir39&quot;&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;&amp;gt;&amp;lt;/xobject&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;&amp;lt;/resources&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;&amp;lt;invisible&amp;gt;&lt;/span&gt;false&lt;span class=&quot;nt&quot;&gt;&amp;lt;/invisible&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;&amp;lt;hidden&amp;gt;&lt;/span&gt;false&lt;span class=&quot;nt&quot;&gt;&amp;lt;/hidden&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;&amp;lt;print&amp;gt;&lt;/span&gt;true&lt;span class=&quot;nt&quot;&gt;&amp;lt;/print&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;&amp;lt;noZoom&amp;gt;&lt;/span&gt;false&lt;span class=&quot;nt&quot;&gt;&amp;lt;/noZoom&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;&amp;lt;noRotate&amp;gt;&lt;/span&gt;false&lt;span class=&quot;nt&quot;&gt;&amp;lt;/noRotate&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;&amp;lt;noView&amp;gt;&lt;/span&gt;false&lt;span class=&quot;nt&quot;&gt;&amp;lt;/noView&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;&amp;lt;readOnly&amp;gt;&lt;/span&gt;false&lt;span class=&quot;nt&quot;&gt;&amp;lt;/readOnly&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;&amp;lt;locked&amp;gt;&lt;/span&gt;false&lt;span class=&quot;nt&quot;&gt;&amp;lt;/locked&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;&amp;lt;toggleNoView&amp;gt;&lt;/span&gt;false&lt;span class=&quot;nt&quot;&gt;&amp;lt;/toggleNoView&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;&amp;lt;lockedContents&amp;gt;&lt;/span&gt;false&lt;span class=&quot;nt&quot;&gt;&amp;lt;/lockedContents&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;/annotation&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h4 id=&quot;jhove-2&quot;&gt;JHOVE&lt;/h4&gt;

&lt;p&gt;The &lt;a href=&quot;https://github.com/KBNLresearch/pdf-characterisation/blob/main/output/horror/embedded_video_quicktime-jhove.xml&quot;&gt;JHOVE output&lt;/a&gt; only refers to the Screen annotation for these files (JHOVE doesn’t report on actions):&lt;/p&gt;

&lt;div class=&quot;language-xml highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nt&quot;&gt;&amp;lt;property&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;&amp;lt;name&amp;gt;&lt;/span&gt;Annotation&lt;span class=&quot;nt&quot;&gt;&amp;lt;/name&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;&amp;lt;values&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;arity=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;List&quot;&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;type=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;Property&quot;&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;&amp;lt;property&amp;gt;&lt;/span&gt;
        &lt;span class=&quot;nt&quot;&gt;&amp;lt;name&amp;gt;&lt;/span&gt;Subtype&lt;span class=&quot;nt&quot;&gt;&amp;lt;/name&amp;gt;&lt;/span&gt;
        &lt;span class=&quot;nt&quot;&gt;&amp;lt;values&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;arity=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;Scalar&quot;&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;type=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;String&quot;&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;&amp;gt;&lt;/span&gt;
        &lt;span class=&quot;nt&quot;&gt;&amp;lt;value&amp;gt;&lt;/span&gt;Screen&lt;span class=&quot;nt&quot;&gt;&amp;lt;/value&amp;gt;&lt;/span&gt;
        &lt;span class=&quot;nt&quot;&gt;&amp;lt;/values&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;&amp;lt;/property&amp;gt;&lt;/span&gt;
        ::
&lt;span class=&quot;nt&quot;&gt;&amp;lt;/property&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Besides Rendition actions and Screen annotations, there are several other action and annotation types that indicate multimedia content, and we’ll see some examples later on in the analysis of the Adobe Acrobat Engineering files.&lt;/p&gt;

&lt;h3 id=&quot;javascript&quot;&gt;JavaScript&lt;/h3&gt;

&lt;h4 id=&quot;test-files-3&quot;&gt;Test files&lt;/h4&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th style=&quot;text-align: left&quot;&gt;File&lt;/th&gt;
      &lt;th style=&quot;text-align: left&quot;&gt;Description&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;&lt;a href=&quot;https://github.com/openpreserve/format-corpus/blob/master/pdfCabinetOfHorrors/javascript.pdf?raw=true&quot;&gt;javascript.pdf&lt;/a&gt;&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Contains JavaScript.&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;h4 id=&quot;verapdf-3&quot;&gt;VeraPDF&lt;/h4&gt;

&lt;p&gt;The &lt;em&gt;actions&lt;/em&gt; element of &lt;a href=&quot;https://github.com/KBNLresearch/pdf-characterisation/blob/main/output/horror/javascript-vera.xml&quot;&gt;VeraPDF’s output&lt;/a&gt; contains the following JavaScript action reference&lt;sup id=&quot;fnref:1&quot;&gt;&lt;a href=&quot;#fn:1&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;7&lt;/a&gt;&lt;/sup&gt;:&lt;/p&gt;


&lt;div class=&quot;language-xml highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nt&quot;&gt;&amp;lt;action&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;type=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;JavaScript&quot;&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;location&amp;gt;&lt;/span&gt;Document&lt;span class=&quot;nt&quot;&gt;&amp;lt;/location&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;/action&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
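
&lt;p&gt;A check like this is easy to script. The following is only a minimal sketch: it assumes that the report has been parsed with Python’s ElementTree, and that the &lt;em&gt;action&lt;/em&gt; elements carry no XML namespace, as in the excerpt above. The function name &lt;em&gt;action_types&lt;/em&gt; and the small stand-in tree are made up for illustration.&lt;/p&gt;

```python
import xml.etree.ElementTree as ET


def action_types(root):
    """Collect the type attribute of every action element below root."""
    return {a.get("type") for a in root.iter("action") if a.get("type")}


# Stand-in for the actions element of a VeraPDF report (illustration only).
actions = ET.Element("actions")
js_action = ET.SubElement(actions, "action", type="JavaScript")
ET.SubElement(js_action, "location").text = "Document"

print("JavaScript" in action_types(actions))  # prints True
```

&lt;p&gt;For a real report, the tree would come from something like &lt;code&gt;ET.parse("javascript-vera.xml").getroot()&lt;/code&gt;. Returning the full set of action types means the same helper could also be used to look for other action types.&lt;/p&gt;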

&lt;h4 id=&quot;jhove-3&quot;&gt;JHOVE&lt;/h4&gt;

&lt;p&gt;I couldn’t find any JavaScript reference in the &lt;a href=&quot;https://github.com/KBNLresearch/pdf-characterisation/blob/main/output/horror/javascript-jhove.xml&quot;&gt;JHOVE output&lt;/a&gt; for the same file.&lt;/p&gt;

&lt;h3 id=&quot;font-embedding&quot;&gt;Font embedding&lt;/h3&gt;

&lt;h4 id=&quot;test-files-4&quot;&gt;Test files&lt;/h4&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th style=&quot;text-align: left&quot;&gt;File&lt;/th&gt;
      &lt;th style=&quot;text-align: left&quot;&gt;Description&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;&lt;a href=&quot;https://github.com/openpreserve/format-corpus/blob/master/pdfCabinetOfHorrors/text_only_fontsNotEmbedded.pdf?raw=true&quot;&gt;text_only_fontsNotEmbedded.pdf&lt;/a&gt;&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Only uses fonts that are not embedded.&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;&lt;a href=&quot;https://github.com/openpreserve/format-corpus/blob/master/pdfCabinetOfHorrors/text_only_fontsEmbeddedAll.pdf?raw=true&quot;&gt;text_only_fontsEmbeddedAll.pdf&lt;/a&gt;&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Only uses fonts that are embedded.&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;figure class=&quot;image&quot;&gt;
  &lt;img src=&quot;/images/2015/07/fontslinux.png&quot; alt=&quot;Screenshot of Adobe Acrobat rendering a PDF with a font whose glyphs are placed at odd positions.&quot; /&gt;
  &lt;figcaption&gt;Adobe Acrobat&apos;s rendering of a PDF that uses a non-embedded font that is not available on the target machine.&lt;/figcaption&gt;
&lt;/figure&gt;

&lt;h4 id=&quot;verapdf-4&quot;&gt;VeraPDF&lt;/h4&gt;

&lt;p&gt;The &lt;em&gt;fonts&lt;/em&gt; element in VeraPDF’s output contains child elements for each font that is used in a document. For the PDF that only uses fonts that are not embedded, this results in the following output (full output file &lt;a href=&quot;https://github.com/KBNLresearch/pdf-characterisation/blob/main/output/horror/text_only_fontsNotEmbedded-vera.xml&quot;&gt;here&lt;/a&gt;):&lt;/p&gt;

&lt;div class=&quot;language-xml highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nt&quot;&gt;&amp;lt;font&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;id=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;fntIndir30&quot;&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;type&amp;gt;&lt;/span&gt;TrueType&lt;span class=&quot;nt&quot;&gt;&amp;lt;/type&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;baseFont&amp;gt;&lt;/span&gt;TimesNewRomanPSMT&lt;span class=&quot;nt&quot;&gt;&amp;lt;/baseFont&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;firstChar&amp;gt;&lt;/span&gt;32&lt;span class=&quot;nt&quot;&gt;&amp;lt;/firstChar&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;lastChar&amp;gt;&lt;/span&gt;255&lt;span class=&quot;nt&quot;&gt;&amp;lt;/lastChar&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;encoding&amp;gt;&lt;/span&gt;WinAnsiEncoding&lt;span class=&quot;nt&quot;&gt;&amp;lt;/encoding&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;fontDescriptor&amp;gt;&lt;/span&gt;
     &lt;span class=&quot;nt&quot;&gt;&amp;lt;subset&amp;gt;&lt;/span&gt;false&lt;span class=&quot;nt&quot;&gt;&amp;lt;/subset&amp;gt;&lt;/span&gt;
     &lt;span class=&quot;nt&quot;&gt;&amp;lt;fontName&amp;gt;&lt;/span&gt;TimesNewRomanPSMT&lt;span class=&quot;nt&quot;&gt;&amp;lt;/fontName&amp;gt;&lt;/span&gt;
     &lt;span class=&quot;nt&quot;&gt;&amp;lt;fontFamily&amp;gt;&lt;/span&gt;Times New Roman&lt;span class=&quot;nt&quot;&gt;&amp;lt;/fontFamily&amp;gt;&lt;/span&gt;
     &lt;span class=&quot;nt&quot;&gt;&amp;lt;fontStretch&amp;gt;&lt;/span&gt;Normal&lt;span class=&quot;nt&quot;&gt;&amp;lt;/fontStretch&amp;gt;&lt;/span&gt;
     &lt;span class=&quot;nt&quot;&gt;&amp;lt;fontWeight&amp;gt;&lt;/span&gt;400.000&lt;span class=&quot;nt&quot;&gt;&amp;lt;/fontWeight&amp;gt;&lt;/span&gt;
     &lt;span class=&quot;nt&quot;&gt;&amp;lt;fixedPitch&amp;gt;&lt;/span&gt;false&lt;span class=&quot;nt&quot;&gt;&amp;lt;/fixedPitch&amp;gt;&lt;/span&gt;
     &lt;span class=&quot;nt&quot;&gt;&amp;lt;serif&amp;gt;&lt;/span&gt;true&lt;span class=&quot;nt&quot;&gt;&amp;lt;/serif&amp;gt;&lt;/span&gt;
     &lt;span class=&quot;nt&quot;&gt;&amp;lt;symbolic&amp;gt;&lt;/span&gt;false&lt;span class=&quot;nt&quot;&gt;&amp;lt;/symbolic&amp;gt;&lt;/span&gt;
     &lt;span class=&quot;nt&quot;&gt;&amp;lt;script&amp;gt;&lt;/span&gt;false&lt;span class=&quot;nt&quot;&gt;&amp;lt;/script&amp;gt;&lt;/span&gt;
     &lt;span class=&quot;nt&quot;&gt;&amp;lt;nonsymbolic&amp;gt;&lt;/span&gt;true&lt;span class=&quot;nt&quot;&gt;&amp;lt;/nonsymbolic&amp;gt;&lt;/span&gt;
     &lt;span class=&quot;nt&quot;&gt;&amp;lt;italic&amp;gt;&lt;/span&gt;false&lt;span class=&quot;nt&quot;&gt;&amp;lt;/italic&amp;gt;&lt;/span&gt;
     &lt;span class=&quot;nt&quot;&gt;&amp;lt;allCap&amp;gt;&lt;/span&gt;false&lt;span class=&quot;nt&quot;&gt;&amp;lt;/allCap&amp;gt;&lt;/span&gt;
     &lt;span class=&quot;nt&quot;&gt;&amp;lt;smallCap&amp;gt;&lt;/span&gt;false&lt;span class=&quot;nt&quot;&gt;&amp;lt;/smallCap&amp;gt;&lt;/span&gt;
     &lt;span class=&quot;nt&quot;&gt;&amp;lt;forceBold&amp;gt;&lt;/span&gt;false&lt;span class=&quot;nt&quot;&gt;&amp;lt;/forceBold&amp;gt;&lt;/span&gt;
     &lt;span class=&quot;nt&quot;&gt;&amp;lt;fontBBox&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;lly=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;-307.000&quot;&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;llx=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;-568.000&quot;&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;urx=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;2000.000&quot;&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;ury=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;1007.000&quot;&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;&amp;gt;&amp;lt;/fontBBox&amp;gt;&lt;/span&gt;
     &lt;span class=&quot;nt&quot;&gt;&amp;lt;italicAngle&amp;gt;&lt;/span&gt;0.000&lt;span class=&quot;nt&quot;&gt;&amp;lt;/italicAngle&amp;gt;&lt;/span&gt;
     &lt;span class=&quot;nt&quot;&gt;&amp;lt;ascent&amp;gt;&lt;/span&gt;891.000&lt;span class=&quot;nt&quot;&gt;&amp;lt;/ascent&amp;gt;&lt;/span&gt;
     &lt;span class=&quot;nt&quot;&gt;&amp;lt;descent&amp;gt;&lt;/span&gt;-216.000&lt;span class=&quot;nt&quot;&gt;&amp;lt;/descent&amp;gt;&lt;/span&gt;
     &lt;span class=&quot;nt&quot;&gt;&amp;lt;leading&amp;gt;&lt;/span&gt;0.000&lt;span class=&quot;nt&quot;&gt;&amp;lt;/leading&amp;gt;&lt;/span&gt;
     &lt;span class=&quot;nt&quot;&gt;&amp;lt;capHeight&amp;gt;&lt;/span&gt;1000.000&lt;span class=&quot;nt&quot;&gt;&amp;lt;/capHeight&amp;gt;&lt;/span&gt;
     &lt;span class=&quot;nt&quot;&gt;&amp;lt;xHeight&amp;gt;&lt;/span&gt;1000.000&lt;span class=&quot;nt&quot;&gt;&amp;lt;/xHeight&amp;gt;&lt;/span&gt;
     &lt;span class=&quot;nt&quot;&gt;&amp;lt;stemV&amp;gt;&lt;/span&gt;82.000&lt;span class=&quot;nt&quot;&gt;&amp;lt;/stemV&amp;gt;&lt;/span&gt;
     &lt;span class=&quot;nt&quot;&gt;&amp;lt;stemH&amp;gt;&lt;/span&gt;0.000&lt;span class=&quot;nt&quot;&gt;&amp;lt;/stemH&amp;gt;&lt;/span&gt;
     &lt;span class=&quot;nt&quot;&gt;&amp;lt;averageWidth&amp;gt;&lt;/span&gt;0.000&lt;span class=&quot;nt&quot;&gt;&amp;lt;/averageWidth&amp;gt;&lt;/span&gt;
     &lt;span class=&quot;nt&quot;&gt;&amp;lt;maxWidth&amp;gt;&lt;/span&gt;0.000&lt;span class=&quot;nt&quot;&gt;&amp;lt;/maxWidth&amp;gt;&lt;/span&gt;
     &lt;span class=&quot;nt&quot;&gt;&amp;lt;missingWidth&amp;gt;&lt;/span&gt;0.000&lt;span class=&quot;nt&quot;&gt;&amp;lt;/missingWidth&amp;gt;&lt;/span&gt;
     &lt;span class=&quot;nt&quot;&gt;&amp;lt;embedded&amp;gt;&lt;/span&gt;false&lt;span class=&quot;nt&quot;&gt;&amp;lt;/embedded&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;/fontDescriptor&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;/font&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Here, the value of &lt;em&gt;embedded&lt;/em&gt; in the &lt;em&gt;fontDescriptor&lt;/em&gt; sub-element indicates whether a font is embedded or not (in this example it isn’t). So, to check if a PDF uses fonts that are not embedded, one can simply iterate over all &lt;em&gt;font&lt;/em&gt; elements and check the value of &lt;em&gt;fontDescriptor/embedded&lt;/em&gt; for each of these.&lt;/p&gt;
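
&lt;p&gt;As a minimal sketch of that iteration, assuming the report is parsed with Python’s ElementTree and that the &lt;em&gt;font&lt;/em&gt; elements carry no XML namespace (as in the excerpt above), the check could look like this. The function name and the stand-in tree are hypothetical. Fonts without a &lt;em&gt;fontDescriptor&lt;/em&gt; are skipped here, which may or may not be what you want.&lt;/p&gt;

```python
import xml.etree.ElementTree as ET


def fonts_not_embedded(root):
    """Return the baseFont names of fonts reported with embedded set to false."""
    missing = []
    for font in root.iter("font"):
        # The path is relative to each font element.
        embedded = font.findtext("fontDescriptor/embedded")
        if embedded is not None and embedded.strip().lower() == "false":
            missing.append(font.findtext("baseFont", default="(unknown)"))
    return missing


# Stand-in for the fonts element of a VeraPDF report (illustration only).
report = ET.Element("fonts")
font = ET.SubElement(report, "font", id="fntIndir30")
ET.SubElement(font, "baseFont").text = "TimesNewRomanPSMT"
descriptor = ET.SubElement(font, "fontDescriptor")
ET.SubElement(descriptor, "embedded").text = "false"

print(fonts_not_embedded(report))  # prints ['TimesNewRomanPSMT']
```

&lt;p&gt;An empty list would then mean that no font in the report was flagged as not embedded.&lt;/p&gt;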

&lt;h4 id=&quot;jhove-4&quot;&gt;JHOVE&lt;/h4&gt;

&lt;p&gt;The &lt;a href=&quot;https://github.com/KBNLresearch/pdf-characterisation/blob/main/output/horror/text_only_fontsNotEmbedded-jhove.xml&quot;&gt;JHOVE output for the same PDF&lt;/a&gt; does contain several font-related properties, but none of them are related to embedding. So, I initially assumed that JHOVE simply didn’t report this. However, when I looked at &lt;a href=&quot;https://github.com/KBNLresearch/pdf-characterisation/blob/main/output/horror/text_only_fontsEmbeddedAll-jhove.xml&quot;&gt;JHOVE’s output&lt;/a&gt; for the PDF with only embedded fonts, I noticed this property (part of a &lt;em&gt;FontDescriptor&lt;/em&gt; element):&lt;/p&gt;

&lt;div class=&quot;language-xml highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nt&quot;&gt;&amp;lt;property&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;&amp;lt;name&amp;gt;&lt;/span&gt;FontFile2&lt;span class=&quot;nt&quot;&gt;&amp;lt;/name&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;&amp;lt;values&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;arity=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;Scalar&quot;&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;type=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;Boolean&quot;&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;&amp;lt;value&amp;gt;&lt;/span&gt;true&lt;span class=&quot;nt&quot;&gt;&amp;lt;/value&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;&amp;lt;/values&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;/property&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Since JHOVE doesn’t report the &lt;em&gt;FontFile2&lt;/em&gt; property for the PDF without embedded fonts, I wondered if this was a coincidence. As the meaning of this property (or any of JHOVE’s properties for that matter) is not documented, I took a peek at &lt;a href=&quot;https://github.com/openpreserve/jhove/blob/94da570caa55759354fa6fcd50e4ea7edbba1e7d/jhove-modules/pdf-hul/src/main/java/edu/harvard/hul/ois/jhove/module/PdfModule.java#L3830&quot;&gt;JHOVE’s source code&lt;/a&gt;. This revealed that the &lt;em&gt;FontFile2&lt;/em&gt; property originates from the “font descriptor” dictionary, which is documented in section 9.8 (Font Descriptors) of the &lt;a href=&quot;https://opensource.adobe.com/dc-acrobat-sdk-docs/pdfstandards/PDF32000_2008.pdf&quot;&gt;ISO 32000-1 PDF specification&lt;/a&gt;. Here we can see (Table 122) that the keys &lt;em&gt;FontFile&lt;/em&gt;, &lt;em&gt;FontFile2&lt;/em&gt; and &lt;em&gt;FontFile3&lt;/em&gt; (which also exist as separate JHOVE properties) all indicate embedded fonts.&lt;/p&gt;

&lt;p&gt;This means we can actually use JHOVE’s output to check for font embedding, but to identify a PDF with one or more fonts that are not embedded, one needs to iterate over all font property groups, and then check for the &lt;em&gt;absence&lt;/em&gt; of all three of the following properties: &lt;em&gt;FontFile&lt;/em&gt;, &lt;em&gt;FontFile2&lt;/em&gt; and &lt;em&gt;FontFile3&lt;/em&gt;. None of these properties are documented. This is not ideal to begin with, and made worse by JHOVE’s rather labyrinthine output format.&lt;/p&gt;
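
&lt;p&gt;To give an idea of what such a check involves, here is a rough sketch in Python. It rests on several assumptions: that each font descriptor shows up as a &lt;em&gt;property&lt;/em&gt; whose &lt;em&gt;name&lt;/em&gt; is “FontDescriptor”, that the &lt;em&gt;FontFile&lt;/em&gt; properties are nested somewhere below it, and that matching on local tag names is good enough to deal with any XML namespace. All function names and the stand-in tree are hypothetical, and fonts that have no font descriptor at all would go unnoticed.&lt;/p&gt;

```python
import xml.etree.ElementTree as ET

FONT_FILE_KEYS = {"FontFile", "FontFile2", "FontFile3"}


def local(tag):
    """Strip a namespace prefix of the form {uri} from a tag name."""
    return tag.rsplit("}", 1)[-1]


def property_name(prop):
    """Return the text of a property's direct name child, or None."""
    for child in prop:
        if local(child.tag) == "name":
            return (child.text or "").strip()
    return None


def descriptors_without_font_file(root):
    """Count FontDescriptor groups that lack all three FontFile properties."""
    count = 0
    for prop in root.iter():
        if local(prop.tag) != "property" or property_name(prop) != "FontDescriptor":
            continue
        nested = {
            property_name(p)
            for p in prop.iter()
            if p is not prop and local(p.tag) == "property"
        }
        if not nested.intersection(FONT_FILE_KEYS):
            count += 1
    return count


def add_property(parent, name):
    """Append a JHOVE-style property and return its values element."""
    prop = ET.SubElement(parent, "property")
    ET.SubElement(prop, "name").text = name
    return ET.SubElement(prop, "values")


# Stand-in tree: one descriptor with FontFile2, one without any FontFile key.
root = ET.Element("repInfo")
add_property(add_property(root, "FontDescriptor"), "FontFile2")
add_property(add_property(root, "FontDescriptor"), "FontName")

print(descriptors_without_font_file(root))  # prints 1
```

&lt;p&gt;A result greater than zero would suggest at least one font that is not embedded, but given the undocumented properties this should be verified against known test files before relying on it.&lt;/p&gt;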

&lt;h3 id=&quot;file-attachments&quot;&gt;File attachments&lt;/h3&gt;

&lt;h4 id=&quot;test-files-5&quot;&gt;Test files&lt;/h4&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th style=&quot;text-align: left&quot;&gt;File&lt;/th&gt;
      &lt;th style=&quot;text-align: left&quot;&gt;Description&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;&lt;a href=&quot;https://github.com/openpreserve/format-corpus/blob/master/pdfCabinetOfHorrors/fileAttachment.pdf?raw=true&quot;&gt;fileAttachment.pdf&lt;/a&gt;&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Contains file attachment that uses EmbeddedFiles entry.&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;&lt;a href=&quot;https://github.com/openpreserve/format-corpus/blob/master/pdfCabinetOfHorrors/fileAttachment_fileAttachmentAnnotation.pdf?raw=true&quot;&gt;fileAttachment_fileAttachmentAnnotation.pdf&lt;/a&gt;&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Contains file attachment that uses File Attachment Annotation.&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;File attachments in PDF can be implemented in two different ways, using either an &lt;em&gt;EmbeddedFiles&lt;/em&gt; entry in the document’s name dictionary, or a File Attachment Annotation.&lt;/p&gt;

&lt;h4 id=&quot;verapdf-5&quot;&gt;VeraPDF&lt;/h4&gt;

&lt;p&gt;For the sample file with the &lt;em&gt;EmbeddedFiles&lt;/em&gt; entry, &lt;a href=&quot;https://github.com/KBNLresearch/pdf-characterisation/blob/main/output/horror/fileAttachment-vera.xml&quot;&gt;VeraPDF’s output&lt;/a&gt; contains an &lt;em&gt;embeddedFiles&lt;/em&gt; element with one or more child elements:&lt;/p&gt;

&lt;div class=&quot;language-xml highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nt&quot;&gt;&amp;lt;embeddedFiles&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;&amp;lt;embeddedFile&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;id=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;file1&quot;&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;&amp;lt;fileName&amp;gt;&lt;/span&gt;KSBASE.WQ2&lt;span class=&quot;nt&quot;&gt;&amp;lt;/fileName&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;&amp;lt;description&amp;gt;&amp;lt;/description&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;&amp;lt;filter&amp;gt;&lt;/span&gt;FlateDecode&lt;span class=&quot;nt&quot;&gt;&amp;lt;/filter&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;&amp;lt;creationDate&amp;gt;&lt;/span&gt;2012-11-23T15:40:38.000+01:00&lt;span class=&quot;nt&quot;&gt;&amp;lt;/creationDate&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;&amp;lt;modDate&amp;gt;&lt;/span&gt;2012-11-19T12:35:10.000Z&lt;span class=&quot;nt&quot;&gt;&amp;lt;/modDate&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;&amp;lt;checkSum&amp;gt;&lt;/span&gt;)#x000002Aﬁ˛½�ﬂô#x000004‘SèŠ-¡&lt;span class=&quot;nt&quot;&gt;&amp;lt;/checkSum&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;&amp;lt;size&amp;gt;&lt;/span&gt;20668&lt;span class=&quot;nt&quot;&gt;&amp;lt;/size&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;&amp;lt;/embeddedFile&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;/embeddedFiles&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;For the sample file with the File Attachment Annotation, the &lt;em&gt;annotations&lt;/em&gt; element in VeraPDF’s &lt;a href=&quot;https://github.com/KBNLresearch/pdf-characterisation/blob/main/output/horror/fileAttachment_fileAttachmentAnnotation-vera.xml&quot;&gt;output&lt;/a&gt; contains an &lt;em&gt;annotation&lt;/em&gt; child element that has “FileAttachment” as its subtype:&lt;/p&gt;

&lt;div class=&quot;language-xml highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nt&quot;&gt;&amp;lt;annotation&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;id=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;annotIndir26&quot;&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;&amp;lt;subType&amp;gt;&lt;/span&gt;FileAttachment&lt;span class=&quot;nt&quot;&gt;&amp;lt;/subType&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;&amp;lt;rectangle&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;lly=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;562.840&quot;&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;llx=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;185.281&quot;&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;urx=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;199.281&quot;&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;ury=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;582.840&quot;&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;&amp;gt;&amp;lt;/rectangle&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;&amp;lt;width&amp;gt;&lt;/span&gt;14.000&lt;span class=&quot;nt&quot;&gt;&amp;lt;/width&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;&amp;lt;height&amp;gt;&lt;/span&gt;20.000&lt;span class=&quot;nt&quot;&gt;&amp;lt;/height&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;&amp;lt;contents&amp;gt;&lt;/span&gt;PF.WK1&lt;span class=&quot;nt&quot;&gt;&amp;lt;/contents&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;&amp;lt;annotationName&amp;gt;&lt;/span&gt;9a716f25-da75-4309-b918-483cfa9f6473&lt;span class=&quot;nt&quot;&gt;&amp;lt;/annotationName&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;&amp;lt;modifiedDate&amp;gt;&lt;/span&gt;D:20131024143706+02&apos;00&apos;&lt;span class=&quot;nt&quot;&gt;&amp;lt;/modifiedDate&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;&amp;lt;resources&amp;gt;&lt;/span&gt;
        &lt;span class=&quot;nt&quot;&gt;&amp;lt;xobject&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;id=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;xobjIndir29&quot;&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;&amp;gt;&amp;lt;/xobject&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;&amp;lt;/resources&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;&amp;lt;color&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;red=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;0.250000&quot;&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;green=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;0.333328&quot;&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;blue=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;1.000000&quot;&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;&amp;gt;&amp;lt;/color&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;&amp;lt;invisible&amp;gt;&lt;/span&gt;false&lt;span class=&quot;nt&quot;&gt;&amp;lt;/invisible&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;&amp;lt;hidden&amp;gt;&lt;/span&gt;false&lt;span class=&quot;nt&quot;&gt;&amp;lt;/hidden&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;&amp;lt;print&amp;gt;&lt;/span&gt;true&lt;span class=&quot;nt&quot;&gt;&amp;lt;/print&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;&amp;lt;noZoom&amp;gt;&lt;/span&gt;true&lt;span class=&quot;nt&quot;&gt;&amp;lt;/noZoom&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;&amp;lt;noRotate&amp;gt;&lt;/span&gt;true&lt;span class=&quot;nt&quot;&gt;&amp;lt;/noRotate&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;&amp;lt;noView&amp;gt;&lt;/span&gt;false&lt;span class=&quot;nt&quot;&gt;&amp;lt;/noView&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;&amp;lt;readOnly&amp;gt;&lt;/span&gt;false&lt;span class=&quot;nt&quot;&gt;&amp;lt;/readOnly&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;&amp;lt;locked&amp;gt;&lt;/span&gt;false&lt;span class=&quot;nt&quot;&gt;&amp;lt;/locked&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;&amp;lt;toggleNoView&amp;gt;&lt;/span&gt;false&lt;span class=&quot;nt&quot;&gt;&amp;lt;/toggleNoView&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;&amp;lt;lockedContents&amp;gt;&lt;/span&gt;false&lt;span class=&quot;nt&quot;&gt;&amp;lt;/lockedContents&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;/annotation&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
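
&lt;p&gt;Since the two mechanisms end up in different parts of the report, a script needs to look in both places. The sketch below shows one way to do this, again assuming an ElementTree parse of a report whose elements carry no XML namespace, as in the excerpts above. The function name, the labels it returns and the stand-in tree are made up for illustration.&lt;/p&gt;

```python
import xml.etree.ElementTree as ET


def attachment_evidence(root):
    """Return which of the two attachment mechanisms appear in a report."""
    found = []
    # Mechanism 1: EmbeddedFiles entry, reported as embeddedFile elements.
    if root.find(".//embeddedFile") is not None:
        found.append("EmbeddedFiles")
    # Mechanism 2: an annotation whose subType is FileAttachment.
    for annotation in root.iter("annotation"):
        if (annotation.findtext("subType") or "").strip() == "FileAttachment":
            found.append("FileAttachment annotation")
            break
    return found


# Stand-in report, built in code for illustration only.
report = ET.Element("report")
annotations = ET.SubElement(report, "annotations")
annotation = ET.SubElement(annotations, "annotation", id="annotIndir26")
ET.SubElement(annotation, "subType").text = "FileAttachment"

print(attachment_evidence(report))  # prints ['FileAttachment annotation']
```

&lt;p&gt;Note the explicit &lt;code&gt;is not None&lt;/code&gt; test: an ElementTree element without children evaluates as false, so a plain truth test on the result of &lt;code&gt;find&lt;/code&gt; could miss a match.&lt;/p&gt;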

&lt;h4 id=&quot;jhove-5&quot;&gt;JHOVE&lt;/h4&gt;

&lt;p&gt;For the sample file with the &lt;em&gt;EmbeddedFiles&lt;/em&gt; entry, &lt;a href=&quot;https://github.com/KBNLresearch/pdf-characterisation/blob/main/output/horror/fileAttachment-jhove.xml&quot;&gt;JHOVE’s output&lt;/a&gt; contains a &lt;em&gt;PageMode&lt;/em&gt; element with &lt;em&gt;UseAttachments&lt;/em&gt; as one of its values:&lt;/p&gt;

&lt;div class=&quot;language-xml highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nt&quot;&gt;&amp;lt;property&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;&amp;lt;name&amp;gt;&lt;/span&gt;PageMode&lt;span class=&quot;nt&quot;&gt;&amp;lt;/name&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;&amp;lt;values&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;arity=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;Scalar&quot;&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;type=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;String&quot;&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;&amp;gt;&lt;/span&gt;
        &lt;span class=&quot;nt&quot;&gt;&amp;lt;value&amp;gt;&lt;/span&gt;UseAttachments&lt;span class=&quot;nt&quot;&gt;&amp;lt;/value&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;&amp;lt;/values&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;/property&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;For the sample file with the File Attachment Annotation, JHOVE’s &lt;a href=&quot;https://github.com/KBNLresearch/pdf-characterisation/blob/main/output/horror/fileAttachment_fileAttachmentAnnotation-jhove.xml&quot;&gt;output&lt;/a&gt; reports a “FileAttachment” annotation:&lt;/p&gt;

&lt;div class=&quot;language-xml highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nt&quot;&gt;&amp;lt;property&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;name&amp;gt;&lt;/span&gt;Annotation&lt;span class=&quot;nt&quot;&gt;&amp;lt;/name&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;values&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;arity=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;List&quot;&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;type=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;Property&quot;&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;property&amp;gt;&lt;/span&gt;
     &lt;span class=&quot;nt&quot;&gt;&amp;lt;name&amp;gt;&lt;/span&gt;Subtype&lt;span class=&quot;nt&quot;&gt;&amp;lt;/name&amp;gt;&lt;/span&gt;
     &lt;span class=&quot;nt&quot;&gt;&amp;lt;values&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;arity=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;Scalar&quot;&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;type=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;String&quot;&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;&amp;gt;&lt;/span&gt;
     &lt;span class=&quot;nt&quot;&gt;&amp;lt;value&amp;gt;&lt;/span&gt;FileAttachment&lt;span class=&quot;nt&quot;&gt;&amp;lt;/value&amp;gt;&lt;/span&gt;
     &lt;span class=&quot;nt&quot;&gt;&amp;lt;/values&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;/property&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;property&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h3 id=&quot;link-to-external-file&quot;&gt;Link to external file&lt;/h3&gt;

&lt;h4 id=&quot;test-files-6&quot;&gt;Test files&lt;/h4&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th style=&quot;text-align: left&quot;&gt;File&lt;/th&gt;
      &lt;th style=&quot;text-align: left&quot;&gt;Description&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;&lt;a href=&quot;https://github.com/openpreserve/format-corpus/blob/master/pdfCabinetOfHorrors/externalLink.pdf?raw=true&quot;&gt;externalLink.pdf&lt;/a&gt;&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Contains link to an external document.&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;h4 id=&quot;verapdf-6&quot;&gt;VeraPDF&lt;/h4&gt;

&lt;p&gt;For the test file, the &lt;em&gt;actions&lt;/em&gt; element of &lt;a href=&quot;https://github.com/KBNLresearch/pdf-characterisation/blob/main/output/horror/externalLink-vera.xml&quot;&gt;VeraPDF’s output&lt;/a&gt; contains a “Launch” action:&lt;/p&gt;

&lt;div class=&quot;language-xml highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nt&quot;&gt;&amp;lt;action&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;type=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;Launch&quot;&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;&amp;gt;&lt;/span&gt;
     &lt;span class=&quot;nt&quot;&gt;&amp;lt;location&amp;gt;&lt;/span&gt;Annotation&lt;span class=&quot;nt&quot;&gt;&amp;lt;/location&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;/action&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Further to that, the &lt;em&gt;annotations&lt;/em&gt; element contains a child element for a “Link” annotation:&lt;/p&gt;

&lt;div class=&quot;language-xml highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nt&quot;&gt;&amp;lt;annotation&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;id=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;annotIndir35&quot;&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;&amp;gt;&lt;/span&gt;
     &lt;span class=&quot;nt&quot;&gt;&amp;lt;subType&amp;gt;&lt;/span&gt;Link&lt;span class=&quot;nt&quot;&gt;&amp;lt;/subType&amp;gt;&lt;/span&gt;
     &lt;span class=&quot;nt&quot;&gt;&amp;lt;rectangle&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;lly=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;678.816&quot;&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;llx=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;70.860&quot;&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;urx=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;200.820&quot;&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;ury=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;694.584&quot;&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;&amp;gt;&amp;lt;/rectangle&amp;gt;&lt;/span&gt;
     &lt;span class=&quot;nt&quot;&gt;&amp;lt;width&amp;gt;&lt;/span&gt;129.960&lt;span class=&quot;nt&quot;&gt;&amp;lt;/width&amp;gt;&lt;/span&gt;
     &lt;span class=&quot;nt&quot;&gt;&amp;lt;height&amp;gt;&lt;/span&gt;15.768&lt;span class=&quot;nt&quot;&gt;&amp;lt;/height&amp;gt;&lt;/span&gt;
     &lt;span class=&quot;nt&quot;&gt;&amp;lt;invisible&amp;gt;&lt;/span&gt;false&lt;span class=&quot;nt&quot;&gt;&amp;lt;/invisible&amp;gt;&lt;/span&gt;
     &lt;span class=&quot;nt&quot;&gt;&amp;lt;hidden&amp;gt;&lt;/span&gt;false&lt;span class=&quot;nt&quot;&gt;&amp;lt;/hidden&amp;gt;&lt;/span&gt;
     &lt;span class=&quot;nt&quot;&gt;&amp;lt;print&amp;gt;&lt;/span&gt;true&lt;span class=&quot;nt&quot;&gt;&amp;lt;/print&amp;gt;&lt;/span&gt;
     &lt;span class=&quot;nt&quot;&gt;&amp;lt;noZoom&amp;gt;&lt;/span&gt;false&lt;span class=&quot;nt&quot;&gt;&amp;lt;/noZoom&amp;gt;&lt;/span&gt;
     &lt;span class=&quot;nt&quot;&gt;&amp;lt;noRotate&amp;gt;&lt;/span&gt;false&lt;span class=&quot;nt&quot;&gt;&amp;lt;/noRotate&amp;gt;&lt;/span&gt;
     &lt;span class=&quot;nt&quot;&gt;&amp;lt;noView&amp;gt;&lt;/span&gt;false&lt;span class=&quot;nt&quot;&gt;&amp;lt;/noView&amp;gt;&lt;/span&gt;
     &lt;span class=&quot;nt&quot;&gt;&amp;lt;readOnly&amp;gt;&lt;/span&gt;false&lt;span class=&quot;nt&quot;&gt;&amp;lt;/readOnly&amp;gt;&lt;/span&gt;
     &lt;span class=&quot;nt&quot;&gt;&amp;lt;locked&amp;gt;&lt;/span&gt;false&lt;span class=&quot;nt&quot;&gt;&amp;lt;/locked&amp;gt;&lt;/span&gt;
     &lt;span class=&quot;nt&quot;&gt;&amp;lt;toggleNoView&amp;gt;&lt;/span&gt;false&lt;span class=&quot;nt&quot;&gt;&amp;lt;/toggleNoView&amp;gt;&lt;/span&gt;
     &lt;span class=&quot;nt&quot;&gt;&amp;lt;lockedContents&amp;gt;&lt;/span&gt;false&lt;span class=&quot;nt&quot;&gt;&amp;lt;/lockedContents&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;/annotation&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;As explained in section 12.6.4.5 (Launch actions) of &lt;a href=&quot;https://opensource.adobe.com/dc-acrobat-sdk-docs/pdfstandards/PDF32000_2008.pdf&quot;&gt;ISO 32000-1&lt;/a&gt;:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;A launch action launches an application or opens or prints a document.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;So, it is not exclusively associated with external &lt;em&gt;documents&lt;/em&gt;, but it might also indicate a dependency on external &lt;em&gt;software&lt;/em&gt;, which makes it all the more relevant for preservation. Link annotations can also serve several purposes. Section 12.5.6.5 (Link annotations) of ISO 32000-1:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;A link annotation represents either a hypertext link to a destination elsewhere in the document (…) or an action to be performed.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;If I’m reading this correctly, the presence of a link annotation alone does not always imply an external dependency.&lt;/p&gt;

&lt;h4 id=&quot;jhove-6&quot;&gt;JHOVE&lt;/h4&gt;

&lt;p&gt;&lt;a href=&quot;https://github.com/KBNLresearch/pdf-characterisation/blob/main/output/horror/externalLink-jhove.xml&quot;&gt;JHOVE’s output&lt;/a&gt; only contains this reference to the Link annotation:&lt;/p&gt;

&lt;div class=&quot;language-xml highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nt&quot;&gt;&amp;lt;property&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;&amp;lt;name&amp;gt;&lt;/span&gt;Annotation&lt;span class=&quot;nt&quot;&gt;&amp;lt;/name&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;&amp;lt;values&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;arity=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;List&quot;&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;type=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;Property&quot;&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;&amp;lt;property&amp;gt;&lt;/span&gt;
        &lt;span class=&quot;nt&quot;&gt;&amp;lt;name&amp;gt;&lt;/span&gt;Subtype&lt;span class=&quot;nt&quot;&gt;&amp;lt;/name&amp;gt;&lt;/span&gt;
        &lt;span class=&quot;nt&quot;&gt;&amp;lt;values&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;arity=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;Scalar&quot;&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;type=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;String&quot;&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;&amp;gt;&lt;/span&gt;
            &lt;span class=&quot;nt&quot;&gt;&amp;lt;value&amp;gt;&lt;/span&gt;Link&lt;span class=&quot;nt&quot;&gt;&amp;lt;/value&amp;gt;&lt;/span&gt;
        &lt;span class=&quot;nt&quot;&gt;&amp;lt;/values&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;&amp;lt;/property&amp;gt;&lt;/span&gt;
        ::
&lt;span class=&quot;nt&quot;&gt;&amp;lt;property&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;JHOVE doesn’t report any actions (including the Launch action in this file).&lt;/p&gt;

&lt;h3 id=&quot;web-capture-content&quot;&gt;Web Capture content&lt;/h3&gt;

&lt;h4 id=&quot;test-files-7&quot;&gt;Test files&lt;/h4&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th style=&quot;text-align: left&quot;&gt;File&lt;/th&gt;
      &lt;th style=&quot;text-align: left&quot;&gt;Description&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;&lt;a href=&quot;https://github.com/openpreserve/format-corpus/blob/master/pdfCabinetOfHorrors/webCapture.pdf?raw=true&quot;&gt;webCapture.pdf&lt;/a&gt;&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Contains Web Capture content.&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;When I created this test file back in 2012, I was under the impression that all Web Capture content by its nature had a dependency on the live web. Reading section 14.10 (Web Capture) of &lt;a href=&quot;https://opensource.adobe.com/dc-acrobat-sdk-docs/pdfstandards/PDF32000_2008.pdf&quot;&gt;ISO 32000-1&lt;/a&gt; once more, it turns out that this is not the case:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;The information in the Web Capture data structures enables conforming products to perform the following
operations:&lt;/p&gt;
  &lt;ul&gt;
    &lt;li&gt;Save locally and preserve the visual appearance of material from the Web&lt;/li&gt;
    &lt;li&gt;Retrieve additional material from the Web and add it to an existing PDF file&lt;/li&gt;
    &lt;li&gt;Update or modify existing material previously captured from the Web&lt;/li&gt;
  &lt;/ul&gt;
&lt;/blockquote&gt;

&lt;p&gt;So, Web Capture content may simply be a local copy of material that was captured from the web at the time of the PDF’s creation. This seems to be the case for this test file. When I open it on my machine (using &lt;a href=&quot;https://github.com/linuxmint/xreader/&quot;&gt;Xreader&lt;/a&gt;), it shows a snapshot of the Open Planets Foundation website from the time of the document’s creation (2012):&lt;/p&gt;

&lt;figure class=&quot;image&quot;&gt;
  &lt;img src=&quot;/images/2023/05/webcapture-xreader.png&quot; alt=&quot;Screenshot that shows how the test file with web capture content is rendered in Xreader.&quot; /&gt;
&lt;/figure&gt;

&lt;p&gt;It’s not entirely clear to me how and under what circumstances additional material is retrieved from the web, or existing material is updated. This makes it hard to judge how much of a preservation risk Web Capture content poses in actual practice.&lt;/p&gt;

&lt;h4 id=&quot;verapdf-7&quot;&gt;VeraPDF&lt;/h4&gt;

&lt;p&gt;The &lt;a href=&quot;https://github.com/KBNLresearch/pdf-characterisation/blob/main/output/horror/webCapture-vera.xml&quot;&gt;output&lt;/a&gt; of VeraPDF contains several interesting bits. First, there are these references to a GoTo action, a URI action and a SubmitForm action:&lt;/p&gt;

&lt;div class=&quot;language-xml highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt; &lt;span class=&quot;nt&quot;&gt;&amp;lt;action&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;type=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;GoTo&quot;&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;location&amp;gt;&lt;/span&gt;Annotation&lt;span class=&quot;nt&quot;&gt;&amp;lt;/location&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;/action&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;action&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;type=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;URI&quot;&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;location&amp;gt;&lt;/span&gt;Annotation&lt;span class=&quot;nt&quot;&gt;&amp;lt;/location&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;/action&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;action&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;type=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;SubmitForm&quot;&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;location&amp;gt;&lt;/span&gt;Annotation&lt;span class=&quot;nt&quot;&gt;&amp;lt;/location&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;/action&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Moreover, the &lt;em&gt;annotations&lt;/em&gt; element contains a large number of Link annotations:&lt;/p&gt;

&lt;div class=&quot;language-xml highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nt&quot;&gt;&amp;lt;annotation&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;id=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;annotIndir163&quot;&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;subType&amp;gt;&lt;/span&gt;Link&lt;span class=&quot;nt&quot;&gt;&amp;lt;/subType&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;rectangle&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;lly=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;720.000&quot;&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;llx=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;11.000&quot;&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;urx=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;88.000&quot;&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;ury=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;760.000&quot;&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;&amp;gt;&amp;lt;/rectangle&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;width&amp;gt;&lt;/span&gt;77.000&lt;span class=&quot;nt&quot;&gt;&amp;lt;/width&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;height&amp;gt;&lt;/span&gt;40.000&lt;span class=&quot;nt&quot;&gt;&amp;lt;/height&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;invisible&amp;gt;&lt;/span&gt;false&lt;span class=&quot;nt&quot;&gt;&amp;lt;/invisible&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;hidden&amp;gt;&lt;/span&gt;false&lt;span class=&quot;nt&quot;&gt;&amp;lt;/hidden&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;print&amp;gt;&lt;/span&gt;false&lt;span class=&quot;nt&quot;&gt;&amp;lt;/print&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;noZoom&amp;gt;&lt;/span&gt;false&lt;span class=&quot;nt&quot;&gt;&amp;lt;/noZoom&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;noRotate&amp;gt;&lt;/span&gt;false&lt;span class=&quot;nt&quot;&gt;&amp;lt;/noRotate&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;noView&amp;gt;&lt;/span&gt;false&lt;span class=&quot;nt&quot;&gt;&amp;lt;/noView&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;readOnly&amp;gt;&lt;/span&gt;false&lt;span class=&quot;nt&quot;&gt;&amp;lt;/readOnly&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;locked&amp;gt;&lt;/span&gt;false&lt;span class=&quot;nt&quot;&gt;&amp;lt;/locked&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;toggleNoView&amp;gt;&lt;/span&gt;false&lt;span class=&quot;nt&quot;&gt;&amp;lt;/toggleNoView&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;lockedContents&amp;gt;&lt;/span&gt;false&lt;span class=&quot;nt&quot;&gt;&amp;lt;/lockedContents&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;/annotation&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h4 id=&quot;jhove-7&quot;&gt;JHOVE&lt;/h4&gt;

&lt;p&gt;Looking at &lt;a href=&quot;https://github.com/KBNLresearch/pdf-characterisation/blob/main/output/horror/webCapture-jhove.xml&quot;&gt;JHOVE’s output&lt;/a&gt;, I couldn’t find any reference to the Web Capture content at all. The missing references to the GoTo action, URI action and SubmitForm action are understandable, since JHOVE doesn’t report any actions. The absence of any reference to the numerous Link annotations is nevertheless surprising.&lt;/p&gt;

&lt;h3 id=&quot;byte-corruption&quot;&gt;Byte corruption&lt;/h3&gt;

&lt;h4 id=&quot;test-files-8&quot;&gt;Test files&lt;/h4&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th style=&quot;text-align: left&quot;&gt;File&lt;/th&gt;
      &lt;th style=&quot;text-align: left&quot;&gt;Description&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;&lt;a href=&quot;https://github.com/openpreserve/format-corpus/blob/master/pdfCabinetOfHorrors/corruptionOneByteMissing.pdf?raw=true&quot;&gt;corruptionOneByteMissing.pdf&lt;/a&gt;&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Has one byte of missing data, immediately following the file header.&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;h4 id=&quot;verapdf-8&quot;&gt;VeraPDF&lt;/h4&gt;

&lt;p&gt;The byte corruption in this file causes an exception in VeraPDF. Details can be found in the &lt;em&gt;taskResult&lt;/em&gt; element of the &lt;a href=&quot;https://github.com/KBNLresearch/pdf-characterisation/blob/main/output/horror/corruptionOneByteMissing-vera.xml&quot;&gt;output&lt;/a&gt; file:&lt;/p&gt;

&lt;div class=&quot;language-xml highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nt&quot;&gt;&amp;lt;taskResult&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;type=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;PARSE&quot;&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;isExecuted=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;true&quot;&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;isSuccess=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;false&quot;&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;&amp;lt;duration&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;start=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;1683818745078&quot;&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;finish=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;1683818745121&quot;&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;&amp;gt;&lt;/span&gt;00:00:00.043&lt;span class=&quot;nt&quot;&gt;&amp;lt;/duration&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;&amp;lt;exceptionMessage&amp;gt;&lt;/span&gt;Exception: Couldn&apos;t parse stream caused by exception: Pages not found&lt;span class=&quot;nt&quot;&gt;&amp;lt;/exceptionMessage&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;/taskResult&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;It looks like the &lt;em&gt;taskResult&lt;/em&gt; element is currently &lt;em&gt;only&lt;/em&gt; included if an exception occurs. It would be more consistent to include it for &lt;em&gt;every&lt;/em&gt; processed file. This would make it more straightforward to use the output to identify PDFs that caused exceptions or parse errors in VeraPDF. Such errors usually indicate that something is seriously wrong with a file, so the value of the &lt;em&gt;isSuccess&lt;/em&gt; attribute could serve as a rough validation proxy. I’ve created &lt;a href=&quot;https://github.com/veraPDF/veraPDF-library/issues/1336&quot;&gt;an issue&lt;/a&gt; for this.&lt;/p&gt;
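
&lt;p&gt;As a rough illustration of the validation-proxy idea, the sketch below collects every &lt;em&gt;taskResult&lt;/em&gt; element whose &lt;em&gt;isSuccess&lt;/em&gt; attribute is false. It only assumes the element and attribute names shown in the excerpt above. The function names, the wrapper element and the in-memory sample are hypothetical, and element names are matched on their local part because I have not checked how the real reports use namespaces.&lt;/p&gt;

```python
import xml.etree.ElementTree as ET


def local_name(elem):
    """Reduce a qualified tag such as '{uri}taskResult' to 'taskResult'."""
    return elem.tag.rsplit("}", 1)[-1]


def failed_tasks(root):
    """Return (type, exception message) for each taskResult with isSuccess='false'."""
    failures = []
    for elem in root.iter():
        if local_name(elem) != "taskResult" or elem.get("isSuccess") != "false":
            continue
        message = ""
        for child in elem:
            if local_name(child) == "exceptionMessage":
                message = (child.text or "").strip()
        failures.append((elem.get("type"), message))
    return failures


# Stand-in for a parsed report; the "report" wrapper element is hypothetical.
sample = ET.Element("report")
task = ET.SubElement(
    sample,
    "taskResult",
    {"type": "PARSE", "isExecuted": "true", "isSuccess": "false"},
)
ET.SubElement(task, "exceptionMessage").text = (
    "Exception: Couldn't parse stream caused by exception: Pages not found"
)

print(failed_tasks(sample))
```

&lt;p&gt;For a file on disk, the root element could be obtained with &lt;em&gt;ET.parse(path).getroot()&lt;/em&gt;. Note that, as long as the element is only written when an exception occurs, an empty result does not distinguish between a successful parse and a missing &lt;em&gt;taskResult&lt;/em&gt; element.&lt;/p&gt;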

&lt;h4 id=&quot;jhove-8&quot;&gt;JHOVE&lt;/h4&gt;

&lt;p&gt;&lt;a href=&quot;https://github.com/KBNLresearch/pdf-characterisation/blob/main/output/horror/corruptionOneByteMissing-jhove.xml&quot;&gt;JHOVE’s output&lt;/a&gt; contains the following error:&lt;/p&gt;

&lt;div class=&quot;language-xml highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nt&quot;&gt;&amp;lt;message&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;offset=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;0&quot;&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;severity=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;error&quot;&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;id=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;PDF-HUL-140&quot;&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;&amp;gt;&lt;/span&gt;Document catalog dictionary object number and trailer root ref number are inconsistent.&lt;span class=&quot;nt&quot;&gt;&amp;lt;/message&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Since this test file only represents one very specific case of byte corruption, I expect that other types might result in quite different JHOVE errors (but this is outside the scope of this investigation).&lt;/p&gt;
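
&lt;p&gt;To compare JHOVE errors across many files, it can help to tally the message identifiers. The following is a minimal sketch that only relies on the &lt;em&gt;message&lt;/em&gt; element and its &lt;em&gt;severity&lt;/em&gt; and &lt;em&gt;id&lt;/em&gt; attributes as shown above. The function name and the in-memory sample are hypothetical, and tags are matched on their local name so that any namespace on the real output is ignored.&lt;/p&gt;

```python
import xml.etree.ElementTree as ET
from collections import Counter


def jhove_error_ids(root):
    """Count the id attribute of every message element with severity='error'."""
    counts = Counter()
    for elem in root.iter():
        is_message = elem.tag.rsplit("}", 1)[-1] == "message"
        if is_message and elem.get("severity") == "error":
            counts[elem.get("id")] += 1
    return counts


# Stand-in for parsed JHOVE output, holding only the message quoted above.
sample = ET.Element("jhove")
ET.SubElement(
    sample,
    "message",
    {"offset": "0", "severity": "error", "id": "PDF-HUL-140"},
).text = (
    "Document catalog dictionary object number and "
    "trailer root ref number are inconsistent."
)

print(dict(jhove_error_ids(sample)))
```

&lt;p&gt;Summing the counters of several output files would give a quick overview of which error identifiers dominate a collection.&lt;/p&gt;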

&lt;h2 id=&quot;analysis-adobe-acrobat-engineering-dataset&quot;&gt;Analysis of the Adobe Acrobat Engineering dataset&lt;/h2&gt;

&lt;p&gt;The results of the Horror Corpus analysis highlight the significance of actions and annotations, which are often indicative of features that are associated with preservation risks. That analysis already showed that, for one test file, annotations that VeraPDF reported correctly were not picked up by JHOVE. I did some further tests to determine whether this was an isolated case, or perhaps an indication of a more structural problem. For these tests I used the files in the &lt;a href=&quot;https://web.archive.org/web/20130726144923/http://acroeng.adobe.com/wp/?page_id=61&quot;&gt;“Classic Multimedia”&lt;/a&gt; category of the Adobe Acrobat Engineering dataset. These files make heavy use of actions and annotations, which makes them well suited for this purpose.&lt;/p&gt;

&lt;h3 id=&quot;results&quot;&gt;Results&lt;/h3&gt;

&lt;p&gt;The full VeraPDF and JHOVE output for these files can be found &lt;a href=&quot;https://github.com/KBNLresearch/pdf-characterisation/tree/main/output/ae-multimedia&quot;&gt;here&lt;/a&gt;. The following table summarizes the actions and annotations that are reported by VeraPDF and JHOVE:&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th style=&quot;text-align: left&quot;&gt;File&lt;/th&gt;
      &lt;th style=&quot;text-align: left&quot;&gt;Actions (VeraPDF)&lt;/th&gt;
      &lt;th style=&quot;text-align: left&quot;&gt;Annotations (VeraPDF)&lt;/th&gt;
      &lt;th style=&quot;text-align: left&quot;&gt;Annotations (JHOVE)&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;AdobeChassisDemo-commented_Review.pdf&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;JavaScript&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;3D&lt;br /&gt;Ink&lt;br /&gt;Text&lt;br /&gt;Popup&lt;br /&gt;Polygon&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Popup&lt;br /&gt;Ink&lt;br /&gt;Text&lt;br /&gt;3D&lt;br /&gt;Polygon&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;movie.pdf&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt; &lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Movie&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt; &lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;MusicalScore.pdf&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;GoTo&lt;br /&gt;JavaScript&lt;br /&gt;URI&lt;br /&gt;Rendition&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Screen&lt;br /&gt;Widget&lt;br /&gt;Link&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Link&lt;br /&gt;Widget&lt;br /&gt;Screen&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;LabelExample.pdf&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;GoTo3DView&lt;br /&gt;JavaScript&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;3D&lt;br /&gt;Widget&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;3D&lt;br /&gt;Widget&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;MultiMedia_Acro6.pdf&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Rendition&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Screen&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Screen&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;AVI+Transitions Demo.pdf&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt; &lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt; &lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt; &lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;20020402_CALOS.pdf&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Movie&lt;br /&gt;Hide&lt;br /&gt;Named&lt;br /&gt;GoTo&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Link&lt;br /&gt;Movie&lt;br /&gt;Widget&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt; &lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Disney-Flash.pdf&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;URI&lt;br /&gt;Rendition&lt;br /&gt;SubmitForm&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Widget&lt;br /&gt;Link&lt;br /&gt;Screen&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Widget&lt;br /&gt;Link&lt;br /&gt;Screen&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;gXsummer2004-stream.pdf&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt; &lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt; &lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt; &lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;SVG-AnnotAnim.pdf&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt; &lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;SVG&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt; &lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Service Form_media.pdf&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;SubmitForm&lt;br /&gt;Named&lt;br /&gt;ResetForm&lt;br /&gt;JavaScript&lt;br /&gt;Rendition&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Widget&lt;br /&gt;Link&lt;br /&gt;Screen&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Screen&lt;br /&gt;Widget&lt;br /&gt;Link&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Binder_6-3DPages.pdf&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;GoTo&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;3D&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;3D&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;us_population.pdf&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;GoTo&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;SVG&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;SVG&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;SVG.pdf&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;JavaScript&lt;br /&gt;URI&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Widget&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Widget&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;phlmapbeta7.pdf&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Rendition&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Screen&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Screen&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;ScriptEvents.pdf&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;JavaScript&lt;br /&gt;GoTo&lt;br /&gt;Rendition&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Screen&lt;br /&gt;Widget&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Widget&lt;br /&gt;Screen&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Trophy.pdf&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;JavaScript&lt;br /&gt;URI&lt;br /&gt;GoTo&lt;br /&gt;Rendition&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Screen&lt;br /&gt;Widget&lt;br /&gt;Link&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Widget&lt;br /&gt;Link&lt;br /&gt;Screen&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;VolvoS40V50-Full.pdf&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Named&lt;br /&gt;JavaScript&lt;br /&gt;GoTo&lt;br /&gt;URI&lt;br /&gt;Rendition&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Widget&lt;br /&gt;Link&lt;br /&gt;Screen&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Screen&lt;br /&gt;Widget&lt;br /&gt;Link&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Jpeg_linked.pdf&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Named&lt;br /&gt;GoTo&lt;br /&gt;Rendition&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Link&lt;br /&gt;Screen&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Link&lt;br /&gt;Screen&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;AdobeChassisDemo-commented.pdf&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt; &lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;3D&lt;br /&gt;Popup&lt;br /&gt;Polygon&lt;br /&gt;Text&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Popup&lt;br /&gt;Polygon&lt;br /&gt;3D&lt;br /&gt;Text&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;drape_raster_contour_sample.pdf&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;URI&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Link&lt;br /&gt;3D&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;3D&lt;br /&gt;Link&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;remotemovieurl.pdf&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt; &lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Movie&lt;br /&gt;FreeText&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt; &lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;3-D_PDF.pdf&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt; &lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;3D&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;3D&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;movie_down1.pdf&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt; &lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Movie&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt; &lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;Note that the table doesn’t have an “Actions” column for JHOVE. This is because, unlike VeraPDF, JHOVE’s output doesn’t include any information on actions at all.&lt;/p&gt;

&lt;p&gt;Both VeraPDF and JHOVE provide information about annotations, and even though the results are broadly similar for both tools, there are some noteworthy differences.&lt;/p&gt;

&lt;p&gt;For the files &lt;a href=&quot;https://web.archive.org/web/20100714002808/http://acroeng.adobe.com:80/Test_Files/movie/movie.pdf&quot;&gt;“movie.pdf”&lt;/a&gt;, &lt;a href=&quot;https://web.archive.org/web/20140519130059/http://acroeng.adobe.com/Test_Files/classic_multimedia//20020402_CALOS.pdf&quot;&gt;“20020402_CALOS.pdf”&lt;/a&gt;, &lt;a href=&quot;https://web.archive.org/web/20100714002816/http://acroeng.adobe.com:80/Test_Files/movie/remotemovieurl.pdf&quot;&gt;“remotemovieurl.pdf”&lt;/a&gt; and &lt;a href=&quot;https://web.archive.org/web/20100714002811/http://acroeng.adobe.com:80/Test_Files/movie/movie_down1.pdf&quot;&gt;“movie_down1.pdf”&lt;/a&gt;, JHOVE doesn’t report any annotations at all. In all four cases the &lt;a href=&quot;https://github.com/KBNLresearch/pdf-characterisation/blob/main/output/ae-multimedia/20020402_CALOS-jhove.xml&quot;&gt;output&lt;/a&gt; contains the following annotation-related validation error:&lt;/p&gt;

&lt;div class=&quot;language-xml highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nt&quot;&gt;&amp;lt;message&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;offset=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;17342&quot;&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;severity=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;error&quot;&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;id=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;PDF-HUL-120&quot;&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;&amp;gt;&lt;/span&gt;Annotation dictionary missing required type (S) entry&lt;span class=&quot;nt&quot;&gt;&amp;lt;/message&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;VeraPDF’s output shows that all of these files contain a Movie annotation. I’m not sure if this is a coincidence.&lt;/p&gt;

&lt;p&gt;For the file “SVG-AnnotAnim.pdf”, JHOVE doesn’t report the SVG annotation. A look at the &lt;a href=&quot;https://github.com/KBNLresearch/pdf-characterisation/blob/main/output/ae-multimedia/SVG-AnnotAnim-jhove.xml&quot;&gt;output&lt;/a&gt; shows that JHOVE wasn’t able to parse this file at all, with the following validation error:&lt;/p&gt;

&lt;div class=&quot;language-xml highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nt&quot;&gt;&amp;lt;message&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;offset=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;0&quot;&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;severity=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;error&quot;&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;id=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;PDF-HUL-144&quot;&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;&amp;gt;&lt;/span&gt;Pages dictionary has no Type key or it has a null value.&lt;span class=&quot;nt&quot;&gt;&amp;lt;/message&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
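
&lt;p&gt;Because the two tools list annotation types in a different order, comparing them as sets is less error-prone than comparing them by eye. The sketch below does this for a handful of rows copied from the table above. The function name and the data layout are hypothetical, and the annotation types are assumed to have been extracted from the tool output beforehand.&lt;/p&gt;

```python
def missing_in_jhove(reported):
    """Map each file to the sorted annotation types VeraPDF lists but JHOVE does not."""
    gaps = {}
    for name, (vera, jhove) in reported.items():
        missing = set(vera) - set(jhove)
        if missing:
            gaps[name] = sorted(missing)
    return gaps


# (VeraPDF annotations, JHOVE annotations) for a few rows of the table.
reported = {
    "movie.pdf": (["Movie"], []),
    "20020402_CALOS.pdf": (["Link", "Movie", "Widget"], []),
    "remotemovieurl.pdf": (["Movie", "FreeText"], []),
    "SVG-AnnotAnim.pdf": (["SVG"], []),
    "Trophy.pdf": (["Screen", "Widget", "Link"], ["Widget", "Link", "Screen"]),
}

for name, types in missing_in_jhove(reported).items():
    print(name, types)
```

&lt;p&gt;“Trophy.pdf” drops out of the result because both tools report the same three types, only in a different order.&lt;/p&gt;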

&lt;h3 id=&quot;behaviour-with-corrupted-files&quot;&gt;Behaviour with corrupted files&lt;/h3&gt;

&lt;p&gt;Two PDFs in this dataset are corrupted: “gXsummer2004-stream.pdf” and “AVI+Transitions Demo.pdf”. For both files, &lt;a href=&quot;https://github.com/KBNLresearch/pdf-characterisation/blob/main/output/ae-multimedia/gXsummer2004-stream-vera.xml&quot;&gt;VeraPDF’s output&lt;/a&gt; reports a parse exception:&lt;/p&gt;

&lt;div class=&quot;language-xml highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nt&quot;&gt;&amp;lt;taskResult&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;type=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;PARSE&quot;&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;isExecuted=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;true&quot;&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;isSuccess=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;false&quot;&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Meanwhile &lt;a href=&quot;https://github.com/KBNLresearch/pdf-characterisation/blob/main/output/ae-multimedia/gXsummer2004-stream-jhove.xml&quot;&gt;JHOVE&lt;/a&gt; throws a “No PDF header” validation error:&lt;/p&gt;

&lt;div class=&quot;language-xml highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nt&quot;&gt;&amp;lt;message&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;offset=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;0&quot;&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;severity=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;error&quot;&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;id=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;PDF-HUL-137&quot;&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;&amp;gt;&lt;/span&gt;No PDF header&lt;span class=&quot;nt&quot;&gt;&amp;lt;/message&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h2 id=&quot;support-for-pdf-20&quot;&gt;Support for PDF 2.0&lt;/h2&gt;

&lt;p&gt;As it’s been almost six years since the &lt;a href=&quot;https://www.iso.org/news/ref2199.html&quot;&gt;launch&lt;/a&gt; of PDF 2.0, I was curious to see to what extent it is supported by VeraPDF and JHOVE. To find out, I ran both tools on &lt;a href=&quot;https://github.com/pdf-association/pdf20examples&quot;&gt;this set of PDF 2.0 example files&lt;/a&gt; by the PDF Association.&lt;/p&gt;

&lt;p&gt;Out of the 7 files in this dataset, 6 resulted in the following JHOVE error (full output &lt;a href=&quot;https://github.com/KBNLresearch/pdf-characterisation/blob/main/output/pdf20examples/Simple%20PDF%202.0%20file-jhove.xml&quot;&gt;here&lt;/a&gt;):&lt;/p&gt;

&lt;div class=&quot;language-xml highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nt&quot;&gt;&amp;lt;message&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;offset=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;0&quot;&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;severity=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;error&quot;&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;id=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;PDF-HUL-137&quot;&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;&amp;gt;&lt;/span&gt;No PDF header&lt;span class=&quot;nt&quot;&gt;&amp;lt;/message&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;These files were subsequently not parsed at all. The only exception here is this &lt;a href=&quot;https://github.com/pdf-association/pdf20examples/blob/master/PDF%202.0%20via%20incremental%20save.pdf&quot;&gt;“PDF 2.0 via incremental save.pdf”&lt;/a&gt; file, which has “1.7” as its designated version number in the PDF header.&lt;/p&gt;

&lt;p&gt;By contrast, VeraPDF was able to read all PDF 2.0 files in this dataset without any problems.&lt;/p&gt;
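
&lt;p&gt;Since JHOVE’s behaviour here depends on the version number in the PDF header, it can be useful to check that header up front. The following is a minimal sketch, not part of the analysis scripts: the function name and the choice to inspect only the first 1024 bytes are assumptions of mine. Note that a header check alone does not identify every PDF 2.0 file, as the “incremental save” example above shows.&lt;/p&gt;

```python
import re
import tempfile
from pathlib import Path

# Matches a header such as %PDF-1.7 or %PDF-2.0 and captures the version number
HEADER_PATTERN = re.compile(rb"%PDF-(\d\.\d)")


def pdf_header_version(path):
    """Return the version string from the PDF header, or None if no header is found."""
    with open(path, "rb") as f:
        # Assumption: only the first 1024 bytes are inspected
        start = f.read(1024)
    match = HEADER_PATTERN.search(start)
    return match.group(1).decode("ascii") if match else None


# Tiny stand-in files that contain only a header line (these are not valid PDFs)
with tempfile.TemporaryDirectory() as tmp:
    samples = [("a.pdf", b"%PDF-1.7\n"), ("b.pdf", b"%PDF-2.0\n"), ("c.pdf", b"garbage")]
    for name, header in samples:
        Path(tmp, name).write_bytes(header)
    versions = {p.name: pdf_header_version(p) for p in sorted(Path(tmp).iterdir())}

print(versions)
```

&lt;p&gt;Files for which this returns “2.0” could then be routed to VeraPDF only, while a return value of &lt;em&gt;None&lt;/em&gt; points to a missing or damaged header.&lt;/p&gt;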

&lt;h2 id=&quot;discussion&quot;&gt;Discussion&lt;/h2&gt;

&lt;p&gt;The combined results of the above analyses of the Horror Corpus, the Adobe Acrobat Engineering dataset and the PDF 2.0 example files allow us to draw a number of conclusions.&lt;/p&gt;

&lt;h3 id=&quot;jhove-9&quot;&gt;JHOVE&lt;/h3&gt;

&lt;p&gt;First of all, even though both VeraPDF and JHOVE are able to detect many PDF features that are associated with known preservation risks, JHOVE has several limitations that make it less appealing for this purpose:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;JHOVE’s output doesn’t include any information about &lt;strong&gt;actions&lt;/strong&gt;, which means it cannot be used for detecting e.g. JavaScript or Launch actions (which are often indicative of external dependencies).&lt;/li&gt;
  &lt;li&gt;Many PDF features that are preservation risks are associated with &lt;strong&gt;annotations&lt;/strong&gt;. Even though JHOVE is able to report most of these, the comparison against VeraPDF’s output shows that JHOVE’s reporting of annotations is often incomplete.&lt;/li&gt;
  &lt;li&gt;JHOVE’s reporting on &lt;strong&gt;encryption and security-related restrictions&lt;/strong&gt; could be more informative. Most importantly, it doesn’t allow one to single out files that require an &lt;strong&gt;open password&lt;/strong&gt; (which present one of the most serious PDF preservation risks).&lt;/li&gt;
  &lt;li&gt;Moreover, JHOVE makes the detection of &lt;strong&gt;user access restrictions&lt;/strong&gt; (print, copy and text access) unnecessarily difficult by only reporting these features if they are &lt;em&gt;not&lt;/em&gt; restricted. This is impractical, because it means we have to check for the &lt;em&gt;absence&lt;/em&gt; of these features in JHOVE’s output. It also assumes prior knowledge on the user’s behalf of properties that are completely undocumented.&lt;/li&gt;
  &lt;li&gt;Even though JHOVE does report whether &lt;strong&gt;fonts are embedded&lt;/strong&gt;, this information is encoded in a way that is needlessly complicated, because it involves checking the output for the &lt;em&gt;absence&lt;/em&gt; (again!) of three properties that are undocumented.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;No documentation exists of any of JHOVE’s reported properties&lt;/strong&gt;. This can make their interpretation difficult. The aforementioned font embedding issue is a good example of this: if we want to know if a particular font is embedded or not, we need to check three separate properties (&lt;em&gt;FontFile&lt;/em&gt;, &lt;em&gt;FontFile2&lt;/em&gt; and &lt;em&gt;FontFile3&lt;/em&gt;), which are all undocumented. I was only able to figure this out by digging into JHOVE’s source code, and consulting the ISO 32000-1 PDF specification.&lt;/li&gt;
  &lt;li&gt;Another (but related) problem is JHOVE’s rather &lt;strong&gt;clumsy and convoluted XML output format&lt;/strong&gt;, which is made up of a labyrinthine assortment of nested “property” elements. This makes the format difficult to read, either by a human or a machine&lt;sup id=&quot;fnref:2&quot;&gt;&lt;a href=&quot;#fn:2&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;8&lt;/a&gt;&lt;/sup&gt;. The reported properties cannot be addressed directly by element name in, for example, standard &lt;a href=&quot;https://en.wikipedia.org/wiki/XPath&quot;&gt;XPath&lt;/a&gt; expressions. This is because all properties are encoded as identically-named &lt;em&gt;property&lt;/em&gt; elements, where the name of each individual property is the text value of its &lt;em&gt;name&lt;/em&gt; child element. Of course this doesn’t preclude parsing the format altogether, but it does make working with JHOVE’s output considerably harder than with most modern, well-designed XML formats.&lt;/li&gt;
  &lt;li&gt;Finally, the tests showed that JHOVE &lt;strong&gt;lacks support for PDF 2.0&lt;/strong&gt;. On encountering a document with a PDF 2.0 header, JHOVE simply reports an error and stops parsing the file. As a result, JHOVE’s output on PDF 2.0 documents does not contain any meaningful information.&lt;/li&gt;
&lt;/ul&gt;
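
&lt;p&gt;To illustrate the last few points, here is a minimal sketch of the font embedding check against JHOVE-style output. The tree is a simplified, hypothetical stand-in that I build in code: real JHOVE output has additional wrapper elements and an XML namespace, both of which are left out here. All function names are my own.&lt;/p&gt;

```python
import xml.etree.ElementTree as ET

# The three properties that hold an embedded font program
FONT_FILE_KEYS = ("FontFile", "FontFile2", "FontFile3")


def add_property(parent, name):
    """Append a JHOVE-style property element whose name is stored in a child element."""
    prop = ET.SubElement(parent, "property")
    ET.SubElement(prop, "name").text = name
    return prop


def has_property(element, name):
    """True if a descendant property element has a name child with the given text."""
    return element.find(f".//property[name='{name}']") is not None


def is_embedded(font):
    """A font counts as embedded if any of the three FontFile properties is present."""
    return any(has_property(font, key) for key in FONT_FILE_KEYS)


# Simplified, hypothetical stand-in for two font entries in JHOVE's output
root = ET.Element("jhove")
font_a = add_property(root, "Font")
add_property(add_property(font_a, "FontDescriptor"), "FontFile2")
font_b = add_property(root, "Font")
add_property(font_b, "FontDescriptor")

print(is_embedded(font_a), is_embedded(font_b))
```

&lt;p&gt;Every lookup needs a predicate on the &lt;em&gt;name&lt;/em&gt; child, and the embedding status only follows from the absence of all three properties.&lt;/p&gt;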

&lt;h3 id=&quot;verapdf-9&quot;&gt;VeraPDF&lt;/h3&gt;

&lt;p&gt;VeraPDF was able to detect all features that are associated with known preservation risks in the “Horror Corpus” files. Unlike JHOVE, VeraPDF also reports on actions. VeraPDF’s coverage of annotations is more comprehensive, and it also provides detailed information about security-related restrictions. I was unable to find any detailed documentation of VeraPDF’s reported properties, but the documentation does provide a &lt;a href=&quot;https://docs.verapdf.org/cli/config/#features.xml&quot;&gt;general overview of the reported feature categories&lt;/a&gt;. Also, VeraPDF’s XML output format is much simpler, better structured and easier to navigate than the one used by JHOVE. Each reported property can be addressed directly by the name of its corresponding XML element, and the property names are mostly self-explanatory.&lt;/p&gt;

&lt;p&gt;For convenience, the following table summarises the elements in VeraPDF’s output that turned out to be the most interesting ones for the current analysis (asterisks indicate repeatable elements):&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th style=&quot;text-align: left&quot;&gt;Element (relative to &lt;em&gt;job&lt;/em&gt; element)&lt;/th&gt;
      &lt;th style=&quot;text-align: left&quot;&gt;Significance&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;taskResult&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Parse status, rough validity proxy, may be used to identify malformed files and files that require an open password&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;featuresReport/documentSecurity&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Child elements indicate security-related restrictions (e.g. copy, printing and text access passwords)&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;featuresReport/actions/action*&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Some actions imply a preservation risk&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;featuresReport/annotations/annotation*&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Some annotations imply a preservation risk&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;featuresReport/documentResources/fonts/font/fontDescriptor/embedded*&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Value indicates whether font is embedded or not&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;featuresReport/embeddedFiles/embeddedFile*&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Indicates presence of file attachment&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;In order to report these, at minimum the following feature types must be enabled in VeraPDF’s configuration file:&lt;/p&gt;

&lt;div class=&quot;language-xml highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;cp&quot;&gt;&amp;lt;?xml version=&quot;1.0&quot; encoding=&quot;UTF-8&quot; standalone=&quot;yes&quot;?&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;featuresConfig&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;&amp;lt;enabledFeatures&amp;gt;&lt;/span&gt;
        &lt;span class=&quot;nt&quot;&gt;&amp;lt;feature&amp;gt;&lt;/span&gt;ACTION&lt;span class=&quot;nt&quot;&gt;&amp;lt;/feature&amp;gt;&lt;/span&gt;
        &lt;span class=&quot;nt&quot;&gt;&amp;lt;feature&amp;gt;&lt;/span&gt;ANNOTATION&lt;span class=&quot;nt&quot;&gt;&amp;lt;/feature&amp;gt;&lt;/span&gt;
        &lt;span class=&quot;nt&quot;&gt;&amp;lt;feature&amp;gt;&lt;/span&gt;DOCUMENT_SECURITY&lt;span class=&quot;nt&quot;&gt;&amp;lt;/feature&amp;gt;&lt;/span&gt;
        &lt;span class=&quot;nt&quot;&gt;&amp;lt;feature&amp;gt;&lt;/span&gt;EMBEDDED_FILE&lt;span class=&quot;nt&quot;&gt;&amp;lt;/feature&amp;gt;&lt;/span&gt;
        &lt;span class=&quot;nt&quot;&gt;&amp;lt;feature&amp;gt;&lt;/span&gt;FONT&lt;span class=&quot;nt&quot;&gt;&amp;lt;/feature&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;&amp;lt;/enabledFeatures&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;/featuresConfig&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
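
&lt;p&gt;As a minimal sketch of how the element paths from the table can be used, the code below summarises one &lt;em&gt;job&lt;/em&gt; element. The sample tree is a hypothetical stand-in built in code rather than real VeraPDF output, and I assume here that the &lt;em&gt;embedded&lt;/em&gt; element holds the text “true” or “false”. The function names are my own.&lt;/p&gt;

```python
import xml.etree.ElementTree as ET

FONT_PATH = "featuresReport/documentResources/fonts/font/fontDescriptor/embedded"


def add_path(parent, path, text=None):
    """Create the elements in a slash-separated path, reusing existing containers."""
    *containers, leaf = path.split("/")
    node = parent
    for tag in containers:
        found = node.find(tag)
        node = found if found is not None else ET.SubElement(node, tag)
    leaf_node = ET.SubElement(node, leaf)
    leaf_node.text = text
    return leaf_node


def summarise(job):
    """Collect the risk-related items from one job element."""
    task = job.find("taskResult")
    fonts = job.findall(FONT_PATH)
    return {
        "parsed": task is not None and task.get("isSuccess") == "true",
        "actions": len(job.findall("featuresReport/actions/action")),
        "annotations": len(job.findall("featuresReport/annotations/annotation")),
        # Assumption: the element text is "false" for a font that is not embedded
        "fonts_not_embedded": sum(1 for e in fonts if e.text == "false"),
        "attachments": len(job.findall("featuresReport/embeddedFiles/embeddedFile")),
    }


# Hypothetical job element with one action, one non-embedded font and one attachment
job = ET.Element("job")
ET.SubElement(job, "taskResult", type="PARSE", isExecuted="true", isSuccess="true")
add_path(job, "featuresReport/actions/action")
add_path(job, FONT_PATH, "false")
add_path(job, "featuresReport/embeddedFiles/embeddedFile")

summary = summarise(job)
print(summary)
```

&lt;p&gt;Unlike with JHOVE’s output, each item is reached by a plain element path, without predicates on child element text.&lt;/p&gt;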

&lt;h2 id=&quot;conclusion&quot;&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;The tests with the Horror Corpus, the Acrobat Engineering files and the PDF 2.0 example files show that between VeraPDF and JHOVE, VeraPDF is the better choice for detecting specific PDF features that are associated with preservation risks. VeraPDF is able to report on a wider range of features, its output is far easier to interpret and process, and it is less likely to “give up” on files. A limitation of VeraPDF is that, unlike JHOVE, it is not capable of validating PDF’s lower-level data structures (full PDF validation). Such deviations from the published specifications can result in a whole separate class of preservation risks, but these are outside the scope of this work. VeraPDF’s parse status may be used as a rough validity proxy to identify malformed files. Depending on what preservation risks are considered important (which in turn depends on institutional contexts and preservation policies), VeraPDF and JHOVE should be seen as complementary tools.&lt;/p&gt;

&lt;h2 id=&quot;acknowledgements&quot;&gt;Acknowledgements&lt;/h2&gt;

&lt;p&gt;Thanks are due to Sam Alloing (KB) and Tyler Thorsted (Brigham Young University) for their feedback on an earlier draft of this post, and for various (online) discussions related to this work.&lt;/p&gt;

&lt;h2 id=&quot;further-resources&quot;&gt;Further resources&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://github.com/KBNLresearch/pdf-characterisation&quot;&gt;GitHub repo with analysis scripts and raw tool output&lt;/a&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://github.com/openpreserve/format-corpus&quot;&gt;Open Preservation Foundation format corpus&lt;/a&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://github.com/pdf-association/pdf20examples&quot;&gt;PDF 2.0 examples&lt;/a&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://docs.google.com/spreadsheets/d/1eW7R8yACBciNimr16Z2ptC7fs1FlmZMnzdtG_DHBuD4/edit?usp=sharing&quot;&gt;PDF Significant Properties Spreadsheet by Tyler Thorsted&lt;/a&gt;&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;div class=&quot;footnotes&quot; role=&quot;doc-endnotes&quot;&gt;
  &lt;ol&gt;
    &lt;li id=&quot;fn:3&quot;&gt;
      &lt;p&gt;See the KB &lt;a href=&quot;https://www.kb.nl/en/file-download/download/public/842&quot;&gt;File Format Guidelines&lt;/a&gt; for more details. &lt;a href=&quot;#fnref:3&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:4&quot;&gt;
      &lt;p&gt;See &lt;a href=&quot;/2012/12/19/identification-pdf-preservation-risks-apache-preflight-first-impression&quot;&gt;“Identification of PDF preservation risks with Apache Preflight: a first impression”&lt;/a&gt;, &lt;a href=&quot;/2013/07/25/identification-pdf-preservation-risks-sequel&quot;&gt;“Identification of PDF preservation risks with Apache Preflight: the sequel”&lt;/a&gt; and &lt;a href=&quot;/2014/01/27/identification-pdf-preservation-risks-analysis-govdocs-selected-corpus&quot;&gt;“Identification of PDF preservation risks: analysis of Govdocs selected corpus”&lt;/a&gt;. &lt;a href=&quot;#fnref:4&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:7&quot;&gt;
      &lt;p&gt;See e.g. &lt;a href=&quot;https://zenodo.org/record/1228650&quot;&gt;“A PDF Test-Set for Well-Formedness Validation in JHOVE - The Good, the Bad and the Ugly”&lt;/a&gt; by Lindlar, Tunnat and Wilson. &lt;a href=&quot;#fnref:7&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:5&quot;&gt;
      &lt;p&gt;Several other PDF validators exist, but none of these, including JHOVE, have &lt;a href=&quot;https://www.pdfa.org/wp-content/until2016_uploads/2015/12/iPres2014-CanonicalPDF-submission_20140827.pdf&quot;&gt;gained widespread acceptance by industry and other stakeholders&lt;/a&gt;. &lt;a href=&quot;#fnref:5&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:8&quot;&gt;
      &lt;p&gt;This is explained in VeraPDF’s &lt;a href=&quot;https://docs.verapdf.org/cli/config/#features.xml&quot;&gt;documentation&lt;/a&gt;. &lt;a href=&quot;#fnref:8&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:9&quot;&gt;
      &lt;p&gt;I initially assumed the &lt;em&gt;OwnerString&lt;/em&gt; and &lt;em&gt;UserString&lt;/em&gt; properties might be related to the presence of an owner password, but these are also reported for PDFs with other security-related restrictions, like print and copy passwords. &lt;a href=&quot;#fnref:9&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:1&quot;&gt;
      &lt;p&gt;JavaScript detection wasn’t possible in &lt;a href=&quot;/2017/06/01/policy-based-assessment-with-verapdf-a-first-impression&quot;&gt;an earlier analysis I did in 2017&lt;/a&gt;, because VeraPDF didn’t support actions at that time. &lt;a href=&quot;#fnref:1&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:2&quot;&gt;
      &lt;p&gt;For my analyses I had to re-run JHOVE with output set to TEXT on various occasions to make sense of the output. &lt;a href=&quot;#fnref:2&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
  &lt;/ol&gt;
&lt;/div&gt;
</description>
                <link>https://bitsgalore.org/2023/05/25/identification-of-pdf-preservation-risks-with-verapdf-and-jhove</link>
                <guid>https://bitsgalore.org/2023/05/25/identification-of-pdf-preservation-risks-with-verapdf-and-jhove</guid>
                <pubDate>2023-05-25T00:00:00+02:00</pubDate>
        </item>

        <item>
                <title>Extracting text from EPUB files in Python</title>
                <description>&lt;figure class=&quot;image&quot;&gt;
  &lt;img src=&quot;/images/2023/03/clockwork-extraction.jpg&quot; alt=&quot;Street scene showing crowd gathered around an open carriage in which a dentist performs a tooth extraction on a patient. Next to the patient a man is banging on a large drum.&quot; /&gt;
  &lt;figcaption&gt;Clockwork picture of an itinerant dentist performing an extraction in French rural scene, wood frame, metal workings, first half 19th century. &lt;a href=&quot;https://wellcomecollection.org/works/hwpe3cxp&quot;&gt;Science Museum, London&lt;/a&gt;. &lt;a href=&quot;https://creativecommons.org/licenses/by/4.0/&quot;&gt;Attribution 4.0 International (CC BY 4.0)&lt;/a&gt; (cropped from original).&lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;This blog post provides a brief introduction to extracting unformatted text from EPUB files. The occasion for this work was a request by my Digital Humanities colleagues who are involved in the &lt;a href=&quot;https://www.surf.nl/en/news/sane-secure-data-environment-for-social-sciences-and-humanities&quot;&gt;SANE (Secure ANalysis Environment) project&lt;/a&gt;. The work on this project includes a use case that will use the SANE environment to analyse text from novels in EPUB format. My colleagues were looking for some advice on how to implement the text extraction component, preferably using a Python-based solution.&lt;/p&gt;

&lt;p&gt;So, I started by making a shortlist of potentially suitable tools. For each tool, I wrote a minimal code snippet for processing one file. Based on this I then created some simple demo scripts that show how each tool is used within a processing workflow. Next, I applied these scripts to two data sets, and used the results to obtain a first impression of the performance of each of the tools.&lt;/p&gt;

&lt;!-- more --&gt;

&lt;h2 id=&quot;evaluated-tools&quot;&gt;Evaluated tools&lt;/h2&gt;

&lt;p&gt;I evaluated the following tools:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://github.com/chrismattmann/tika-python&quot;&gt;&lt;strong&gt;Tika-python&lt;/strong&gt;&lt;/a&gt;. This is a Python wrapper for &lt;a href=&quot;https://tika.apache.org/&quot;&gt;Apache Tika&lt;/a&gt; (which itself is a Java application). Apache Tika is a toolkit for text and metadata extraction from a wide range of file formats, including EPUB.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://github.com/deanmalmgren/textract&quot;&gt;&lt;strong&gt;Textract&lt;/strong&gt;&lt;/a&gt;. This offers text extraction functionality that is similar to Tika, but unlike Tika, Textract is natively written in Python.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://github.com/aerkalov/ebooklib&quot;&gt;&lt;strong&gt;EbookLib&lt;/strong&gt;&lt;/a&gt;. This is a Python library for reading and writing E-books in various formats, including EPUB (both EPUB 2 and EPUB 3). EbookLib is also the E-book library that is used by Textract.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://github.com/pymupdf/PyMuPDF&quot;&gt;&lt;strong&gt;PyMuPDF&lt;/strong&gt;&lt;/a&gt;. This is a Python binding for &lt;a href=&quot;https://mupdf.com/&quot;&gt;MuPDF&lt;/a&gt;. MuPDF is primarily a PDF library, but it also supports EPUB.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The following table shows the versions of these tools that I used in my tests:&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th style=&quot;text-align: left&quot;&gt;Software&lt;/th&gt;
      &lt;th style=&quot;text-align: left&quot;&gt;Version&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Tika-python&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;2.6.0&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Textract&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;1.6.5&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;EbookLib&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;0.18&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;PyMuPDF&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;1.24.11&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;h2 id=&quot;test-environment-and-data&quot;&gt;Test environment and data&lt;/h2&gt;

&lt;p&gt;For all of my tests I used a simple desktop PC running Linux Mint 20.1 (Ulyssa), MATE edition, with Python 3.8.10.&lt;/p&gt;

&lt;p&gt;I used two data sets:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;A selection of 15 files in &lt;a href=&quot;https://idpf.org/epub/201&quot;&gt;EPUB 2.0.1&lt;/a&gt; format from the KB’s &lt;a href=&quot;https://www.dbnl.org&quot;&gt;DBNL&lt;/a&gt; (Digital Library for Dutch Literature) collection.&lt;/li&gt;
  &lt;li&gt;A selection of 10 files in &lt;a href=&quot;https://www.w3.org/publishing/epub3/epub-spec.html&quot;&gt;EPUB 3.2&lt;/a&gt; format from &lt;a href=&quot;https://standardebooks.org/&quot;&gt;Standard Ebooks&lt;/a&gt;.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;All files in both data sets are structurally valid EPUB (2.0.1 / 3.2): validation with &lt;a href=&quot;https://github.com/w3c/epubcheck&quot;&gt;EPUBCheck&lt;/a&gt; 4.2.6 didn’t result in any reported errors or warnings&lt;sup id=&quot;fnref:1&quot;&gt;&lt;a href=&quot;#fn:1&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;1&lt;/a&gt;&lt;/sup&gt;.&lt;/p&gt;

&lt;h2 id=&quot;tika-python&quot;&gt;Tika-python&lt;/h2&gt;

&lt;p&gt;After installing Tika-python, as a first test I tried to write a minimal code snippet that extracts the text from one single EPUB, and then writes the result as UTF-8 encoded text to a file. Following Tika-python’s &lt;a href=&quot;https://github.com/chrismattmann/tika-python/blob/master/README.md&quot;&gt;README&lt;/a&gt; (the example under “Parser Interface”), I started out with this:&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c1&quot;&gt;#! /usr/bin/env python3
&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tika&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tika&lt;/span&gt; &lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;parser&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;fileIn&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;berk011veel01_01.epub&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;fileOut&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;berk011veel01_01.txt&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;parsed&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;parser&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;from_file&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;fileIn&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;content&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;parsed&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;content&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;

&lt;span class=&quot;k&quot;&gt;with&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;open&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;fileOut&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;w&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;encoding&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;utf-8&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;as&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;fout&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;fout&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;write&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;content&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h3 id=&quot;metadata-strings-in-text-output&quot;&gt;Metadata strings in text output&lt;/h3&gt;

&lt;p&gt;Inspection of the resulting output file showed a succession of text strings with the names of embedded fonts towards the end of the file. As an example:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;Charis SIL Bold Italic

::
::

Charis SIL Small Caps
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;When I ran Tika (the Java application) directly without using the Tika-python wrapper, results were as expected. A closer inspection of the Tika-python source code showed that Tika-python’s parsing of the Tika output doesn’t quite work the way it should, with the result that extracted metadata is erroneously included in the text output.&lt;/p&gt;

&lt;h3 id=&quot;workaround-set-service-to-text&quot;&gt;Workaround: set service to text&lt;/h3&gt;

&lt;p&gt;Fortunately there’s a simple workaround for this. In the parser function call, just add the “service” parameter and set its value to “text”, as shown here:&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c1&quot;&gt;#! /usr/bin/env python3
&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tika&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tika&lt;/span&gt; &lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;parser&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;fileIn&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;berk011veel01_01.epub&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;fileOut&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;berk011veel01_01.txt&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;parsed&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;parser&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;from_file&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;fileIn&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;service&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;text&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;content&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;parsed&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;content&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;

&lt;span class=&quot;k&quot;&gt;with&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;open&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;fileOut&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;w&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;encoding&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;utf-8&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;as&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;fout&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;fout&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;write&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;content&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;With this change, the font-related text strings were no longer reported.&lt;/p&gt;

&lt;h3 id=&quot;image-tags-and-alt-text-strings&quot;&gt;Image tags and alt-text strings&lt;/h3&gt;

&lt;p&gt;Unfortunately, setting the “service” parameter in this way has the unexpected side-effect that the text output now includes tags with alt-text descriptions for any images in the file. For example:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;[image: cover]


Aster Berkhof

Veel geluk, professor!


[image: DBNL]
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h3 id=&quot;different-behaviour-between-tika-app-and-tikaserver&quot;&gt;Different behaviour between Tika app and TikaServer&lt;/h3&gt;

&lt;p&gt;I initially thought this was also a bug in Tika-python, but it turns out this isn’t the case. To check, I ran the Tika Java application directly:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;java -jar ~/tika/tika-app-2.6.0.jar -t berk011veel01_01.epub  &amp;gt; berk011veel01_01-app.txt
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This resulted in an output file with no alt-text strings. However, Tika-python doesn’t wrap around Tika-app, but instead around &lt;a href=&quot;https://cwiki.apache.org/confluence/display/TIKA/TikaServer&quot;&gt;TikaServer&lt;/a&gt;. After starting TikaServer, I used the command below to process the same EPUB:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;curl -T berk011veel01_01.epub  http://localhost:9998/tika  --header &quot;Accept: text/plain&quot; &amp;gt; berk011veel01_01-server.txt
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The resulting file also included the offending image tags and alt-text strings. So, the Tika application and TikaServer behave differently. After reporting &lt;a href=&quot;https://issues.apache.org/jira/browse/TIKA-3969?filter=12326160&quot;&gt;an issue&lt;/a&gt; for this, I received a confirmation from Tika’s lead developer:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;There’s a subtle difference in the handlers used in tika-app and tika-server. We’re using the “RichTextContentHandler” in server but not in app. I think I’ve known about this for a while, but we’ll be breaking behaviour for whichever one we fix.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I also created a &lt;a href=&quot;https://github.com/chrismattmann/tika-python/issues/389&quot;&gt;separate issue&lt;/a&gt; at Tika-python for the inclusion of metadata in the text output. Unfortunately this issue is closely related to (and partly the result of) the upstream issue in TikaServer. So until that upstream issue is fixed, the current (slightly confusing) situation will most likely persist.&lt;/p&gt;
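&lt;p&gt;Until then, the image tags can be removed in a post-processing step. The following is only a minimal sketch: it assumes that every tag sits on a line of its own and has the form shown in the example above, and the function name strip_image_tags is hypothetical (it is not part of Tika-python). Note that it would also drop any genuine text line that happens to match this pattern.&lt;/p&gt;

```python
import re

# Assumption: each tag occupies a whole line and looks like "[image: some alt text]"
IMAGE_TAG_LINE = re.compile(r"^\[image: [^\]\n]*\][ \t]*\n?", re.MULTILINE)


def strip_image_tags(text):
    """Remove lines that consist only of an image tag (hypothetical helper)."""
    return IMAGE_TAG_LINE.sub("", text)
```

&lt;p&gt;The function could be applied to the extracted text just before it is written to the output file.&lt;/p&gt;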

&lt;h3 id=&quot;ocr-if-tesseract-is-installed&quot;&gt;OCR if Tesseract is installed&lt;/h3&gt;

&lt;p&gt;By default, Tika applies optical character recognition (OCR) to any images in an EPUB if the &lt;a href=&quot;https://github.com/tesseract-ocr/tesseract&quot;&gt;Tesseract&lt;/a&gt; software is installed, and includes the OCR output in the extracted text. In many cases (at least for ours!) this might not be the desired behaviour. I only found out about this weeks after doing the original tests that are described in this post. Re-running some of the tests suddenly resulted in slightly larger output files, with text output that wasn’t originally there. It turned out that I had unknowingly installed some software that pulls in Tesseract as a dependency. It’s possible to &lt;a href=&quot;https://cwiki.apache.org/confluence/display/TIKA/TikaOCR#TikaOCR-disable-ocr&quot;&gt;disable OCR&lt;/a&gt; in the Java application and TikaServer using a &lt;a href=&quot;https://tika.apache.org/1.9/configuring.html#Using_a_Tika_Configuration_XML_file&quot;&gt;command-line option&lt;/a&gt; that points to a configuration file. I haven’t found a way to do this in Tika-python. The safest option might be to make sure that Tesseract is not installed, or to rename Tesseract’s installation folder.&lt;/p&gt;
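&lt;p&gt;A script can at least check up front whether Tesseract is present. The sketch below assumes that the executable is called “tesseract” and that Tika locates it through the system path; the function name find_tesseract is hypothetical. It does not disable OCR, it only warns about it.&lt;/p&gt;

```python
import shutil


def find_tesseract():
    """Return the path of the Tesseract executable, or None if it is not on the PATH."""
    # Assumption: the executable is named "tesseract"
    return shutil.which("tesseract")


if __name__ == "__main__":
    path = find_tesseract()
    if path is None:
        print("Tesseract not found on the PATH")
    else:
        print("Warning: Tesseract found at " + path + ", extracted text may include OCR output")
```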

&lt;h2 id=&quot;textract&quot;&gt;Textract&lt;/h2&gt;

&lt;p&gt;As with Tika-python, as a first test I again created a minimal code snippet for processing one EPUB file:&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c1&quot;&gt;#! /usr/bin/env python3
&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;textract&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;fileIn&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;berk011veel01_01.epub&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;fileOut&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;berk011veel01_01.txt&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;content&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;textract&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;process&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;fileIn&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;encoding&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;utf-8&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;).&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;decode&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;

&lt;span class=&quot;k&quot;&gt;with&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;open&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;fileOut&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;w&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;encoding&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;utf-8&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;as&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;fout&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;fout&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;write&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;content&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;For the very first EPUB file (from the DBNL collection) this resulted in an empty output file. Results were similar for most other DBNL EPUBs, and Textract only managed to extract a handful of words at most. Results were considerably better for the “Standard Ebooks” files, with output that was similar to Tika-python in most cases. I &lt;a href=&quot;https://github.com/deanmalmgren/textract/issues/455&quot;&gt;reported&lt;/a&gt; this issue to the developers.&lt;/p&gt;

&lt;h2 id=&quot;ebooklib&quot;&gt;EbookLib&lt;/h2&gt;

&lt;p&gt;I mainly included EbookLib because Textract uses it “under the hood” for EPUB, and I was curious whether using it directly would give me results similar to Textract’s. Based on its &lt;a href=&quot;https://docs.sourcefabric.org/projects/ebooklib/en/latest/tutorial.html#reading-epub&quot;&gt;documentation&lt;/a&gt; I created the following minimal code snippet:&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c1&quot;&gt;#! /usr/bin/env python3
&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;html.parser&lt;/span&gt; &lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;HTMLParser&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ebooklib&lt;/span&gt;
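&lt;span class=&quot;kn&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ebooklib&lt;/span&gt; &lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;epub&lt;/span&gt;

&lt;span class=&quot;k&quot;&gt;class&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;HTMLFilter&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;HTMLParser&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
    &lt;span class=&quot;c1&quot;&gt;# Minimal parser that keeps the text nodes and drops the markup
&lt;/span&gt;    &lt;span class=&quot;n&quot;&gt;text&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;sh&quot;&gt;&quot;&quot;&lt;/span&gt;

    &lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;handle_data&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;text&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;data&lt;/span&gt;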

&lt;span class=&quot;n&quot;&gt;fileIn&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;berk011veel01_01.epub&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;fileOut&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;berk011veel01_01.txt&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;book&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;epub&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;read_epub&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;fileIn&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;content&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;sh&quot;&gt;&quot;&quot;&lt;/span&gt;

&lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;item&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;book&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;get_items&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;():&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;item&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;get_type&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;==&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ebooklib&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ITEM_DOCUMENT&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;bodyContent&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;item&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;get_body_content&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;().&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;decode&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;f&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;HTMLFilter&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;feed&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;bodyContent&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;content&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;text&lt;/span&gt;

&lt;span class=&quot;k&quot;&gt;with&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;open&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;fileOut&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;w&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;encoding&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;utf-8&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;as&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;fout&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;fout&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;write&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;content&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Compared to Tika-python and Textract, the EbookLib script is a bit more involved, as EbookLib doesn’t provide any high-level text extraction functions. Instead, the user must iterate over all document items, extract the (X)HTML, and then convert that to unformatted text. At first glance, tests with the DBNL and Standard Ebooks EPUBs didn’t result in any issues, and the results were similar to Tika-python.&lt;/p&gt;
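&lt;p&gt;The conversion from (X)HTML to unformatted text can be done in several ways. As an illustration, here is a minimal sketch that uses only Python’s built-in html.parser module and adds a line break after block-level elements, so that adjacent paragraphs don’t run together when the source has no whitespace between them. The class name, the helper function and the selection of tags are assumptions made for this example; none of them are part of EbookLib.&lt;/p&gt;

```python
from html.parser import HTMLParser

# Hypothetical selection of tags that should end with a line break
BLOCK_TAGS = {"p", "div", "h1", "h2", "h3", "h4", "h5", "h6", "li", "tr", "br"}


class BlockAwareFilter(HTMLParser):
    """Collect text nodes and add a line break after block-level elements."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Called for every text node between tags
        self.parts.append(data)

    def handle_endtag(self, tag):
        # Also reached for self-closing XHTML tags, via handle_startendtag()
        if tag in BLOCK_TAGS:
            self.parts.append("\n")


def html_to_text(markup):
    """Convert an (X)HTML string to unformatted text (hypothetical helper)."""
    parser = BlockAwareFilter()
    parser.feed(markup)
    parser.close()
    return "".join(parser.parts)
```

&lt;p&gt;In a loop over the document items, html_to_text could be called on the decoded body content of each item.&lt;/p&gt;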

&lt;h2 id=&quot;pymupdf&quot;&gt;PyMuPDF&lt;/h2&gt;

&lt;p&gt;For PyMuPDF, I created the following minimal code snippet:&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c1&quot;&gt;#! /usr/bin/env python3
&lt;/span&gt;&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pymupdf&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;fileIn&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;berk011veel01_01.epub&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;fileOut&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;berk011veel01_01.txt&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;

&lt;span class=&quot;k&quot;&gt;with&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pymupdf&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;open&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;fileIn&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;as&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;doc&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;content&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;sh&quot;&gt;&quot;&quot;&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;noChapters&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;doc&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;chapter_count&lt;/span&gt;
    &lt;span class=&quot;c1&quot;&gt;# Iterate over chapters
&lt;/span&gt;    &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;range&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;noChapters&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;chapter_page_count&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;doc&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;chapter_page_count&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;chapter_text&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;sh&quot;&gt;&quot;&quot;&lt;/span&gt;
        &lt;span class=&quot;c1&quot;&gt;# Iterate over pages in chapter
&lt;/span&gt;        &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;j&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;range&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;chapter_page_count&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;page&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;doc&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;j&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)]&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;chapter_text&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;page&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;get_text&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;content&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;chapter_text&lt;/span&gt;
        &lt;span class=&quot;c1&quot;&gt;# Add linebreak to mark end of chapter
&lt;/span&gt;        &lt;span class=&quot;n&quot;&gt;content&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+=&lt;/span&gt; &lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;se&quot;&gt;\n&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;

&lt;span class=&quot;k&quot;&gt;with&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;open&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;fileOut&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;w&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;encoding&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;utf-8&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;as&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;fout&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;fout&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;write&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;content&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;There are a couple of things to note here.&lt;/p&gt;

&lt;h3 id=&quot;chapters-and-pages&quot;&gt;Chapters and pages&lt;/h3&gt;

&lt;p&gt;First, as with EbookLib, we need to explicitly iterate over all chapters in the EPUB. Second, PyMuPDF’s document model is built on &lt;em&gt;pages&lt;/em&gt;, which probably reflects its origins as a PDF library. However, the EPUB format doesn’t really have any notion of “pages” at all. Nevertheless, since PyMuPDF’s text extraction function only works at the page level, we still need to iterate over the “pages” of each chapter, even though it’s not entirely clear to me how PyMuPDF defines them. Since by default the chapter texts are simply concatenated, I explicitly added a linebreak to more clearly delineate the end of each chapter.&lt;/p&gt;

&lt;h3 id=&quot;wrapping-and-linebreaks&quot;&gt;Wrapping and linebreaks&lt;/h3&gt;

&lt;p&gt;Unlike the other tools tested here, PyMuPDF wraps the extracted text to a fixed page width, and inserts linebreaks at the wrapping boundaries. As an example, look at the following sentence in the source XHTML, which is encoded as one single line:&lt;/p&gt;

&lt;div class=&quot;language-html highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nt&quot;&gt;&amp;lt;div&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;class=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;plat&quot;&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;&amp;gt;&lt;/span&gt;Het was wonderbaar. Pierre ademde diep, en de lucht was zo dun en zo ijl, dat zijn hoofd er duizelig van werd.&lt;span class=&quot;nt&quot;&gt;&amp;lt;/div&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;In the corresponding PyMuPDF output, the text is split across two separate lines:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;Het was wonderbaar. Pierre ademde diep, en de lucht was zo dun
en zo ijl, dat zijn hoofd er duizelig van werd.
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Whether this is actually a problem will probably depend on the use case, but it’s good to be aware that this happens.&lt;/p&gt;

&lt;p&gt;Finally, for the EPUBs in the KB’s DBNL dataset, PyMuPDF reported multiple instances of the following error:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;MuPDF error: syntax error: css syntax error: unexpected token (OEBPS/template.css:1) (   &amp;gt;@&amp;lt;font-face {font-family: &quot;sc...)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The error is related to a style sheet, and doesn’t appear to affect text extraction.&lt;/p&gt;

&lt;h2 id=&quot;demonstration-scripts&quot;&gt;Demonstration scripts&lt;/h2&gt;

&lt;p&gt;Based on the above minimal code snippets, I created &lt;a href=&quot;https://github.com/KBNLresearch/textExtractDemo&quot;&gt;four simple demonstration scripts&lt;/a&gt; for Tika-python, Textract, EbookLib and PyMuPDF. Each of these scripts extracts the text of each EPUB file in a user-defined input directory. The extracted text is then written to a user-defined output directory. Each script also writes a file with word counts for the extraction results, which is useful for a rough comparison of the different tools.&lt;/p&gt;
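&lt;p&gt;The general structure of such a script can be sketched as follows. This is not the code from the repository: the function names, the CSV layout and the whitespace-based word count are assumptions made for illustration, and the extract argument stands for any function that takes a file path and returns the extracted text.&lt;/p&gt;

```python
import csv
import os


def count_words(text):
    """Count whitespace-separated tokens (a rough measure only)."""
    return len(text.split())


def process_directory(dir_in, dir_out, extract):
    """Run extract() on each EPUB in dir_in and write the text to dir_out.

    Returns a list of (file name, word count) tuples.
    """
    os.makedirs(dir_out, exist_ok=True)
    results = []
    for name in sorted(os.listdir(dir_in)):
        if not name.lower().endswith(".epub"):
            continue
        file_in = os.path.join(dir_in, name)
        file_out = os.path.join(dir_out, os.path.splitext(name)[0] + ".txt")
        text = extract(file_in)
        with open(file_out, "w", encoding="utf-8") as fout:
            fout.write(text)
        results.append((name, count_words(text)))
    return results


def write_word_counts(results, file_csv):
    """Write the (file name, word count) tuples to a CSV file."""
    with open(file_csv, "w", encoding="utf-8", newline="") as fcsv:
        writer = csv.writer(fcsv)
        writer.writerow(["fileName", "words"])
        writer.writerows(results)
```

&lt;p&gt;Passing the extraction step in as a function keeps the directory handling and word counting identical for all four tools, so that differences in the counts can only come from the extraction itself.&lt;/p&gt;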

&lt;p&gt;I ran each script twice, using the DBNL and Standard Ebooks data sets as input, respectively.&lt;/p&gt;

&lt;h2 id=&quot;word-counts&quot;&gt;Word counts&lt;/h2&gt;

&lt;p&gt;The table below shows the resulting word counts for the books in the DBNL data set:&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th style=&quot;text-align: left&quot;&gt;File name&lt;/th&gt;
      &lt;th style=&quot;text-align: left&quot;&gt;Words (Tika)&lt;/th&gt;
      &lt;th style=&quot;text-align: left&quot;&gt;Words (Textract)&lt;/th&gt;
      &lt;th style=&quot;text-align: left&quot;&gt;Words (EbookLib)&lt;/th&gt;
      &lt;th style=&quot;text-align: left&quot;&gt;Words (PyMuPDF)&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;eern001lief01_01.epub&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;25450&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;1&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;25446&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;25451&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;spro002mure01_01.epub&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;50553&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;0&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;50549&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;50554&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;berk011veel01_01.epub&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;67978&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;0&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;67974&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;67978&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;sche034drie01_01.epub&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;203853&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;3&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;203352&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;203864&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;jous010supe01_01.epub&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;202495&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;0&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;202491&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;202494&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;dele035wegv01_01.epub&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;76536&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;0&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;76530&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;76530&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;verv017eerl01_01.epub&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;33844&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;0&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;33840&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;33855&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;dhae007euro01_01.epub&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;394455&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;2&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;394400&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;394879&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;gomm002uurw01_01.epub&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;43754&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;0&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;43731&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;43748&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;gang009lalb01_01.epub&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;28453&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;4&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;28381&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;28390&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;geel005bloe01_01.epub&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;76316&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;0&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;76312&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;76313&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;hart008droo02_01.epub&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;77283&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;0&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;77279&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;77282&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;eede003vand04_01.epub&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;120481&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;6&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;120310&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;120553&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;meij031tuss02_01.epub&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;145678&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;4&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;145665&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;145692&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;maas013blau01_01.epub&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;55099&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;0&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;55093&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;55108&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;Note the extremely low (near-zero) word counts for Textract. The results for Tika, EbookLib and PyMuPDF are roughly the same.&lt;/p&gt;
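&lt;p&gt;To put a number on “roughly the same”, the counts can be expressed as a percentage of one reference tool. The sketch below does this for three rows copied from the table above; the choice of Tika as the reference and the function name percent_of_reference are arbitrary assumptions.&lt;/p&gt;

```python
# Word counts copied from three rows of the DBNL table
counts = {
    "eern001lief01_01.epub": {"Tika": 25450, "Textract": 1, "EbookLib": 25446, "PyMuPDF": 25451},
    "sche034drie01_01.epub": {"Tika": 203853, "Textract": 3, "EbookLib": 203352, "PyMuPDF": 203864},
    "gang009lalb01_01.epub": {"Tika": 28453, "Textract": 4, "EbookLib": 28381, "PyMuPDF": 28390},
}


def percent_of_reference(row, reference="Tika"):
    """Express each tool's word count as a percentage of the reference tool's count."""
    ref = row[reference]
    return {tool: 100 * n / ref for tool, n in row.items()}


for name, row in counts.items():
    relative = percent_of_reference(row)
    print(name, {tool: round(pct, 2) for tool, pct in relative.items()})
```

&lt;p&gt;For these three rows, EbookLib and PyMuPDF stay within one percent of the Tika count, whereas Textract ends up far below one percent of it.&lt;/p&gt;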

&lt;p&gt;Running the scripts on the Standard Ebooks EPUBs gave the following result:&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th style=&quot;text-align: left&quot;&gt;File name&lt;/th&gt;
      &lt;th style=&quot;text-align: left&quot;&gt;Words (Tika)&lt;/th&gt;
      &lt;th style=&quot;text-align: left&quot;&gt;Words (Textract)&lt;/th&gt;
      &lt;th style=&quot;text-align: left&quot;&gt;Words (EbookLib)&lt;/th&gt;
      &lt;th style=&quot;text-align: left&quot;&gt;Words (PyMuPDF)&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;william-shakespeare_king-lear.epub&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;28442&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;18621&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;28430&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;28357&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;david-garnett_lady-into-fox.epub&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;25240&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;25223&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;25228&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;25208&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;joseph-conrad_heart-of-darkness.epub&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;38717&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;38698&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;38705&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;38735&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;anthony-trollope_the-dukes-children.epub&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;223014&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;222995&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;223002&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;222712&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;agatha-christie_the-mysterious-affair-at-styles.epub&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;57401&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;57229&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;57271&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;57227&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;edgar-allan-poe_the-narrative-of-arthur-gordon-pym-of-nantucket.epub&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;71931&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;71837&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;71863&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;71844&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;p-g-wodehouse_short-fiction.epub&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;212224&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;212182&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;212212&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;212186&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;robert-louis-stevenson_the-strange-case-of-dr-jekyll-and-mr-hyde.epub&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;26370&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;26345&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;26358&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;26291&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;h-g-wells_the-time-machine.epub&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;33044&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;33024&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;33032&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;33018&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;thorstein-veblen_the-theory-of-the-leisure-class.epub&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;106537&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;106515&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;106525&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;106512&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;In this case, all four tools produced similar word counts. The exception is the “King Lear” EPUB, for which Textract gave a word count that was about 10 thousand lower than the other tools. I haven’t looked in detail at where this difference comes from, but it confirms that in its current state, Textract isn’t a suitable tool for our purposes.&lt;/p&gt;
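
&lt;p&gt;For a larger table, this kind of comparison is easy to automate. The snippet below is a minimal sketch and not part of the demo scripts: it assumes the four count columns are ordered Tika, Textract, EbookLib, PyMuPDF, and flags any tool whose count deviates by more than 5% from the median of its row. The function name and the 5% threshold are arbitrary choices made for illustration.&lt;/p&gt;

```python
from statistics import median

# Assumed column order of the word count table
TOOLS = ("Tika", "Textract", "EbookLib", "PyMuPDF")


def flag_outliers(counts, threshold=0.05):
    """Return the tools whose word count deviates from the row median
    by more than the given fraction (hypothetical helper)."""
    mid = median(counts)
    flagged = []
    for tool, count in zip(TOOLS, counts):
        deviation = abs(count - mid) / mid
        if deviation > threshold:
            flagged.append(tool)
    return flagged


# Counts taken from the "King Lear" row of the table
king_lear = [28442, 18621, 28430, 28357]
flagged_tools = flag_outliers(king_lear)
```

&lt;p&gt;For the “King Lear” row this flags only Textract; for the other rows in the table nothing is flagged.&lt;/p&gt;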

&lt;h2 id=&quot;table-of-contents&quot;&gt;Table of Contents&lt;/h2&gt;

&lt;p&gt;Depending on the structure of the source EPUB, the extraction result may or may not contain a table of contents. In &lt;a href=&quot;https://idpf.org/epub/20/spec/OPF_2.0.1_draft.htm&quot;&gt;EPUB 2&lt;/a&gt;, the table of contents is implemented as an XML-formatted “Navigation Control File” (NCX). In EPUB 3, the NCX was replaced by the &lt;a href=&quot;https://www.w3.org/publishing/epub3/epub-overview.html#sec-nav-nav-doc&quot;&gt;“Navigation Document”&lt;/a&gt; (which is an XHTML file). Neither Tika nor EbookLib extracts NCX resources, but both do extract Navigation Documents. Consequently, in most cases the extraction result only includes a table of contents for EPUB 3 files. Textract and PyMuPDF extract neither the NCX nor the Navigation Document.&lt;/p&gt;
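
&lt;p&gt;To check up front which of the two table of contents flavours an EPUB contains, it is enough to inspect the manifest of its package document. The sketch below is an illustration that uses only Python’s standard library (it doesn’t reflect how Tika or EbookLib work internally): it looks for a manifest item with the NCX media type &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;application/x-dtbncx+xml&lt;/code&gt;, and for an item that carries the EPUB 3 &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;nav&lt;/code&gt; property. The function name &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;find_toc_resources&lt;/code&gt; is hypothetical, and the code assumes a structurally valid file.&lt;/p&gt;

```python
import zipfile
import xml.etree.ElementTree as ET

# XML namespaces of the container file and the package (OPF) document
CONTAINER_NS = "{urn:oasis:names:tc:opendocument:xmlns:container}"
OPF_NS = "{http://www.idpf.org/2007/opf}"


def find_toc_resources(epub_path):
    """Return the manifest hrefs of the NCX and the Navigation Document
    (None where absent). Hypothetical helper, no error handling."""
    with zipfile.ZipFile(epub_path) as epub:
        # META-INF/container.xml points to the package document
        container = ET.fromstring(epub.read("META-INF/container.xml"))
        rootfile = container.find(f".//{CONTAINER_NS}rootfile")
        package = ET.fromstring(epub.read(rootfile.get("full-path")))

    result = {"ncx": None, "nav": None}
    for item in package.iter(f"{OPF_NS}item"):
        # EPUB 2 style table of contents
        if item.get("media-type") == "application/x-dtbncx+xml":
            result["ncx"] = item.get("href")
        # EPUB 3 marks the Navigation Document with the "nav" property
        if "nav" in item.get("properties", "").split():
            result["nav"] = item.get("href")
    return result
```

&lt;p&gt;The returned paths are relative to the location of the package document inside the EPUB container.&lt;/p&gt;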

&lt;h2 id=&quot;conclusions&quot;&gt;Conclusions&lt;/h2&gt;

&lt;p&gt;Based on these tests, Tika-python, EbookLib and PyMuPDF all look like potentially suitable Python-based tools for extracting unformatted text from EPUB files. Out of these, Tika-python provides the most straightforward interface. Tika also supports a wide range of other file formats, so any code based on Tika’s text extraction can be easily extended to other formats later.&lt;/p&gt;

&lt;p&gt;The inclusion of tags and alt-text descriptions for images in Tika’s output may be a problem though. As an example, imagine a researcher who uses Tika-python to analyse the emergence of certain words or phrases through time using EPUB versions of 19th century books. Any alt-text descriptions in such materials would most likely be contemporary, and as such they would “pollute” the original “signal” (19th century text) with modern language. So, prospective users of Tika-python should carefully review whether this behaviour is acceptable for their use case. The inclusion of optical character recognition output from embedded images in the extraction result can also lead to surprises, so it’s important that users are aware of Tika’s default behaviour in this regard.&lt;/p&gt;

&lt;p&gt;EbookLib doesn’t have these drawbacks, but the absence of a high-level text extraction interface does require some more work on the user’s side. Also, since EbookLib only supports a limited number of ebook formats, extending any code based on it to other file formats will be less straightforward.&lt;/p&gt;

&lt;p&gt;Although PyMuPDF generally looks useful for EPUB text extraction, its built-in text wrapping with the addition of linebreaks might be unwanted for some use cases. PyMuPDF’s page-based document model also makes this library somewhat more involved to use, compared against the other tested tools.&lt;/p&gt;

&lt;p&gt;In its current form, Textract is not suitable for our use case.&lt;/p&gt;

&lt;h2 id=&quot;limitations&quot;&gt;Limitations&lt;/h2&gt;

&lt;p&gt;It’s important to highlight the limitations of this analysis. First, it is based on only two small, homogeneous data sets, both of which only contain structurally valid EPUB files. It’s unclear how well these results translate to more heterogeneous collections (which often contain files that violate the format specifications in various ways). Second, the main objective here was to obtain a broad impression of the behaviour of the tested tools. The scope didn’t include an in-depth analysis of the accuracy and completeness of the extraction results. Finally, I didn’t look into the computational performance of the tested tools. As the SANE use case will only involve processing a limited number of files, performance isn’t important here.&lt;/p&gt;

&lt;h2 id=&quot;link-to-demo-scripts&quot;&gt;Link to demo scripts&lt;/h2&gt;

&lt;p&gt;EPUB text extraction demo:&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://github.com/KBNLresearch/textExtractDemo&quot;&gt;https://github.com/KBNLresearch/textExtractDemo&lt;/a&gt;&lt;/p&gt;

&lt;h2 id=&quot;revision-history&quot;&gt;Revision history&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;21 January 2025: added PyMuPDF analysis.&lt;/li&gt;
&lt;/ul&gt;

&lt;div class=&quot;footnotes&quot; role=&quot;doc-endnotes&quot;&gt;
  &lt;ol&gt;
    &lt;li id=&quot;fn:1&quot;&gt;
      &lt;p&gt;For convenience I actually used the EPUBCheck Python wrapper: &lt;a href=&quot;https://github.com/titusz/epubcheck/&quot;&gt;https://github.com/titusz/epubcheck/&lt;/a&gt;. &lt;a href=&quot;#fnref:1&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
  &lt;/ol&gt;
&lt;/div&gt;
</description>
                <link>https://bitsgalore.org/2023/03/09/extracting-text-from-epub-files-in-python</link>
                <guid>https://bitsgalore.org/2023/03/09/extracting-text-from-epub-files-in-python</guid>
                <pubDate>2023-03-09T00:00:00+01:00</pubDate>
        </item>

        <item>
                <title>Moving my Internet domains</title>
                <description>&lt;figure class=&quot;image&quot;&gt;
  &lt;img src=&quot;/images/2023/02/donkey-cart.png&quot; alt=&quot;Icon of a donkey that is pulling a cart that has the words bitsgalore.org written on it. In the background the sun is shining.&quot; /&gt;
  &lt;figcaption&gt;Donkey, cart and sun icons licensed from &lt;a href=&quot;https://thenounproject.com/&quot;&gt;the Noun Project&lt;/a&gt;.&lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;I recently moved the two Internet domains I own away from the UK-based domain registrar I’d been using since 2004 to an EU-based registrar. While the actual domain transfer was fairly simple, finding a registrar that suited my specific situation turned out to be more difficult than expected. Leaving my old registrar also resulted in a surprise. It’s unlikely that my situation is unique, so I thought it would be useful to share my experiences in this blog post, and point to some useful online resources that I found along the way. The move also allowed me to bring my domains up to date with (mostly security-related) modern Internet standards. I’ll briefly address this in the final sections of this post. This includes some suggestions on how to make these optimizations work with GitHub Pages-hosted sites like this one.&lt;/p&gt;

&lt;!-- more --&gt;

&lt;h2 id=&quot;a-brief-history-of-my-domains&quot;&gt;A brief history of my domains&lt;/h2&gt;

&lt;p&gt;I registered my first (&lt;em&gt;.net&lt;/em&gt;) domain in early 2004. At the time, I was only looking for a “stable” e-mail address that was independent of any proprietary service or internet service provider. Having my own domain seemed the best way to realise this. I’ve been using this domain ever since, and my private e-mail address has remained unchanged throughout that 19-year period. At the end of 2013 I registered a second (&lt;em&gt;.org&lt;/em&gt;) domain, which is linked to this very site&lt;sup id=&quot;fnref:1&quot;&gt;&lt;a href=&quot;#fn:1&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;1&lt;/a&gt;&lt;/sup&gt;. Both domains were registered with a UK-based registrar and domain hosting company, and I was also using their name servers. Both domains are linked to servers that are run by other providers for e-mail and web hosting. For example, this site uses &lt;a href=&quot;https://pages.github.com/&quot;&gt;GitHub Pages&lt;/a&gt; for web hosting.&lt;/p&gt;

&lt;h2 id=&quot;why-move&quot;&gt;Why move?&lt;/h2&gt;

&lt;p&gt;After my registrar was bought out by another company in 2018, it seems the new owners weren’t all that interested in keeping up with the latest Internet standards. The Dutch Internet Standards Platform has an &lt;a href=&quot;https://internet.nl/&quot;&gt;online tool&lt;/a&gt; that checks a domain for the use of modern (mostly security-related) Internet standards, and I used this to analyse this site some time ago. &lt;a href=&quot;https://web.archive.org/web/20230210222235/https://internet.nl/site/www.bitsgalore.org/1921771/&quot;&gt;This result&lt;/a&gt;, which is from a re-run I did just a few days before the domain transfer, shows that the site’s performance in this regard was very poor indeed:&lt;/p&gt;

&lt;figure class=&quot;image&quot;&gt;
  &lt;img src=&quot;/images/2023/02/internetnl-10022023.png&quot; alt=&quot;Screenshot of website test summary info for domain www.bitsgalore.org. It shows the site got an overall score of 55%, reporting problems related to IPv6 reachability, DNSSEC and HTTPS.&quot; /&gt;
&lt;/figure&gt;

&lt;p&gt;Without going too much into detail here, the main culprits were:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;My registrar’s domain servers weren’t reachable via &lt;a href=&quot;https://en.wikipedia.org/wiki/IPv6&quot;&gt;IPv6&lt;/a&gt; addresses.&lt;/li&gt;
  &lt;li&gt;My registrar didn’t offer &lt;a href=&quot;https://en.wikipedia.org/wiki/Domain_Name_System_Security_Extensions&quot;&gt;DNSSEC&lt;/a&gt; (Domain Name System Security Extensions).&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;These technical shortcomings aside, in this post-Brexit era I’m not keen on routing my e-mail traffic through domain servers that are located in a non-EU country. On top of that, after a series of price hikes, my registrar had become significantly more expensive than most of its competitors. This was made even worse by the introduction of substantial bank transfer fees after Brexit. In the end, I felt the premium pricing I was paying was out of balance with the comparatively poor service provided.&lt;/p&gt;

&lt;h2 id=&quot;2030-a-domain-odyssey&quot;&gt;2030: A Domain Odyssey&lt;/h2&gt;

&lt;p&gt;With the renewal date of my email domain approaching again, I started the search for a new registrar. This was complicated somewhat by something that happened in November 2019. At that time, the &lt;a href=&quot;https://en.wikipedia.org/wiki/Internet_Society&quot;&gt;Internet Society&lt;/a&gt; made an &lt;a href=&quot;https://www.internetsociety.org/news/press-releases/2019/ethos-capital-to-acquire-public-interest-registry-from-the-internet-society/&quot;&gt;announcement&lt;/a&gt; that they were about to sell the &lt;em&gt;.org&lt;/em&gt; top-level domain to a private equity firm. Just a few months earlier, &lt;a href=&quot;https://en.wikipedia.org/wiki/ICANN&quot;&gt;ICANN&lt;/a&gt; had lifted the price caps on &lt;em&gt;.org&lt;/em&gt; domains. This would give the new owner total freedom to raise prices on &lt;em&gt;.org&lt;/em&gt; domains without any limitations&lt;sup id=&quot;fnref:3&quot;&gt;&lt;a href=&quot;#fn:3&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;2&lt;/a&gt;&lt;/sup&gt;. As this would include renewals of existing domains, I immediately renewed the &lt;em&gt;bitsgalore.org&lt;/em&gt; domain for the maximum period possible (ten years), while it was still relatively cheap. In the end my ten-year renewal proved to be unnecessary, as the sale of the &lt;em&gt;.org&lt;/em&gt; top-level domain was &lt;a href=&quot;https://www.theregister.com/2020/05/01/icann_stops_dot_org_sale/&quot;&gt;vetoed by ICANN&lt;/a&gt; several months later. But by that time I was already the proud owner of a domain that wouldn’t expire until late 2030&lt;sup id=&quot;fnref:4&quot;&gt;&lt;a href=&quot;#fn:4&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;3&lt;/a&gt;&lt;/sup&gt;.&lt;/p&gt;

&lt;h2 id=&quot;what-happens-to-the-remaining-registration-period&quot;&gt;What happens to the remaining registration period?&lt;/h2&gt;

&lt;p&gt;So what are the implications of transferring a domain that has already been paid for in advance? Does the new registrar take over the remaining contract period (until 2030) in that case? Or would the transfer reset the registration period, meaning I would lose the remaining 8 years at my previous registrar, and I’d have to pay for them again? My online research on this didn’t come up with any definitive answers, but most sources I found suggested that in most cases, any remaining years at the “old” registrar are automatically transferred to the “new” registrar. Additional payments are usually limited to either a transfer fee, or the renewal of the domain by one more year. See for example &lt;a href=&quot;https://webmasters.stackexchange.com/questions/44258/domain-name-transfer-do-you-lose-years-left-at-current-registrar&quot;&gt;this thread&lt;/a&gt; on Webmasters Stack Exchange, and &lt;a href=&quot;https://www.thesitewizard.com/domain/renewing-domain-name.shtml&quot;&gt;this article&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;I also came across &lt;a href=&quot;https://www.icann.org/en/registry-agreements/multiple/revised-verisign-registry-agreements-appendix-c-16-4-2001-en&quot;&gt;this document that is part of ICANN’s registry agreements&lt;/a&gt;, which states (section 2.4 Transfer Grace Period):&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;Transfer (other than ICANN-approved bulk transfer). If a domain is transferred within the Transfer Grace Period, there is no credit. The expiration date of the domain is extended by one year up to a maximum term of ten years.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;If I’m reading this correctly, this also suggests that transferring a domain to another registrar doesn’t result in the loss of the remaining registration period.&lt;/p&gt;
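
&lt;p&gt;As a worked example of this rule, here is a small Python sketch. It rests on my reading of the clause, namely that a transfer adds one year to the existing expiry date, and that the ten-year maximum is counted from the transfer date; that last part in particular is an assumption. The function names and the dates in the example are made up for illustration.&lt;/p&gt;

```python
from datetime import date


def add_years(d, years):
    """Add whole years to a date; 29 February becomes 28 February
    if the target year is not a leap year."""
    try:
        return d.replace(year=d.year + years)
    except ValueError:
        return d.replace(year=d.year + years, day=28)


def expiry_after_transfer(current_expiry, transfer_date, max_term_years=10):
    """Expiry date after a transfer: one extra year, capped at the
    maximum term counted from the transfer date (assumption)."""
    extended = add_years(current_expiry, 1)
    cap = add_years(transfer_date, max_term_years)
    return min(extended, cap)


# Illustrative dates only: a domain that expires in late 2030,
# transferred in early 2023
new_expiry = expiry_after_transfer(date(2030, 12, 1), date(2023, 2, 15))
print(new_expiry)  # 2031-12-01
```

&lt;p&gt;Under these assumptions the remaining years are kept, and the cap only comes into play for a domain that already has more than nine years left at the time of transfer.&lt;/p&gt;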

&lt;h2 id=&quot;dutch-treat&quot;&gt;Dutch treat&lt;/h2&gt;

&lt;p&gt;However, it seems not all registrars work that way. TransIP, which is a major Dutch domain registrar and hosting provider, &lt;a href=&quot;https://www.transip.nl/knowledgebase/artikel/24-een-domeinnaam-naar-transip-verhuizen/&quot;&gt;explicitly mention on their website&lt;/a&gt; that transferring an existing domain to them will start a new contract period, and that they will not take over any contract periods with the existing registrar. Two other Dutch registrars I contacted by e-mail told me this as well, adding that the transfer would reset the existing renewal date. Both advised me to stick with my current registrar. I don’t know if this behaviour is unique to Dutch registrars. In any case, the responses I got did prompt me to widen the search net to registrars in other EU countries.&lt;/p&gt;

&lt;h2 id=&quot;enter-the-franco-german-axis&quot;&gt;Enter the Franco-German axis&lt;/h2&gt;

&lt;p&gt;&lt;a href=&quot;https://webmasters.stackexchange.com/a/44347&quot;&gt;One of the replies&lt;/a&gt; in the aforementioned Webmasters Stack Exchange thread mentions &lt;a href=&quot;https://en.wikipedia.org/wiki/Gandi&quot;&gt;Gandi&lt;/a&gt;, a French registration and hosting company. The link in that post doesn’t work anymore, but a quick search on Gandi’s documentation site turned up &lt;a href=&quot;https://docs.gandi.net/en/domain_names/transfer/transfer_table.html#ot&quot;&gt;this table&lt;/a&gt;. For a &lt;em&gt;.org&lt;/em&gt; domain, this states that, after transfer, the “new expiration date will be 1 year after current expiration date”. While searching for some more information about Gandi, I stumbled across &lt;a href=&quot;https://european-alternatives.eu/category/domain-name-registrar&quot;&gt;this website&lt;/a&gt;, which lists several EU-based domain name registrars. I’m not sure how independent the site is, and what its selection of registrars is exactly based on. With that said, I looked up the mentioned registrars, checked some online reviews, and this directed my attention to German domain and hosting company &lt;a href=&quot;https://www.inwx.com/&quot;&gt;INWX&lt;/a&gt; as another potential candidate.&lt;/p&gt;

&lt;p&gt;I e-mailed both Gandi and INWX, and asked them what the status of my &lt;em&gt;bitsgalore.org&lt;/em&gt; domain would be if I transferred it to them. Both companies confirmed they would take over the remaining 8 years of the contract with my current registrar, and would only charge for renewing the domain by one more year (extending the registration until 2031). This was exactly the answer I was hoping for, as this would save me 8 years’ worth of renewal fees. I ultimately settled on INWX, although both companies look like good choices to me.&lt;/p&gt;

&lt;h2 id=&quot;dealing-with-transfer-out-fees&quot;&gt;Dealing with transfer-out fees&lt;/h2&gt;

&lt;p&gt;I then started the transfer out with my old registrar&lt;sup id=&quot;fnref:7&quot;&gt;&lt;a href=&quot;#fn:7&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;4&lt;/a&gt;&lt;/sup&gt;, who turned out to charge a £10 + VAT administration fee for this on each domain. From what I understand, this practice is quite unusual, and in my case it would set me back some £24. I don’t consider myself overly stingy, but paying money like that just to leave a company that’s not providing a great service to begin with seemed a bit much. A quick search turned up &lt;a href=&quot;https://forums.digitalspy.com/discussion/comment/91500345/#Comment_91500345&quot;&gt;this forum thread&lt;/a&gt;. It links to ICANN’s &lt;a href=&quot;https://www.icann.org/resources/pages/name-holder-faqs-2017-10-10-en&quot;&gt;FAQs for Registrants: Transferring Your Domain Name&lt;/a&gt;, which states (under “&lt;em&gt;My registrar is charging me a fee to transfer to a new registrar. Is this allowed?&lt;/em&gt;”):&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;[…] Registrars are allowed to set their own prices for this service so some may choose to charge a fee. However, a transfer cannot be denied due to non-payment of this transfer fee. There are other reasons your registrar can deny transfer request. See FAQ #8 above for more information.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;So, I contacted my old registrar, and asked them to waive the transfer fee, referring to the ICANN document. To their credit, they granted my request without any fuss, and promptly sent me the authorization codes!&lt;/p&gt;

&lt;h2 id=&quot;starting-the-incoming-transfer&quot;&gt;Starting the incoming transfer&lt;/h2&gt;

&lt;p&gt;With these authorization codes, I could now start the incoming end of the transfer at INWX. It’s worth pointing out it took five days for the transfer to take effect (I received a notification e-mail about this); from what I understand this is usually the case for &lt;em&gt;.org&lt;/em&gt; and &lt;em&gt;.net&lt;/em&gt; domains. However, I was able to edit the DNS records at INWX well before the actual transfer date. When the transfer happened, it didn’t result in any noticeable downtime on either domain.&lt;/p&gt;

&lt;h2 id=&quot;modern-internet-standards&quot;&gt;Modern Internet standards&lt;/h2&gt;

&lt;p&gt;Once I had verified that everything was working OK after the transfer, I activated &lt;a href=&quot;https://en.wikipedia.org/wiki/Domain_Name_System_Security_Extensions&quot;&gt;DNSSEC&lt;/a&gt; on both of my domains. I then did a couple of tests with the &lt;a href=&quot;https://internet.nl/&quot;&gt;online tool&lt;/a&gt; of the Dutch Internet Standards Platform&lt;sup id=&quot;fnref:5&quot;&gt;&lt;a href=&quot;#fn:5&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;5&lt;/a&gt;&lt;/sup&gt;. A test on the &lt;em&gt;bitsgalore.org&lt;/em&gt; apex (root) domain now &lt;a href=&quot;https://web.archive.org/web/20230218140111/https://internet.nl/site/bitsgalore.org/1922728/&quot;&gt;resulted&lt;/a&gt; in a respectable 94% score, but repeating the test for the &lt;em&gt;www&lt;/em&gt; subdomain &lt;a href=&quot;https://web.archive.org/web/20230218140314/https://internet.nl/site/www.bitsgalore.org/1925412/&quot;&gt;only yielded a score of 70%&lt;/a&gt;. A closer look showed that the &lt;em&gt;www&lt;/em&gt; subdomain was failing some of the DNSSEC tests.&lt;/p&gt;

&lt;h2 id=&quot;making-dnssec-work-on-the-subdomain&quot;&gt;Making DNSSEC work on the subdomain&lt;/h2&gt;

&lt;p&gt;The &lt;a href=&quot;https://docs.github.com/en/pages/configuring-a-custom-domain-for-your-github-pages-site/managing-a-custom-domain-for-your-github-pages-site#configuring-an-apex-domain-and-the-www-subdomain-variant&quot;&gt;documentation of GitHub Pages&lt;/a&gt; on custom domains recommends adding a &lt;a href=&quot;https://en.wikipedia.org/wiki/CNAME_record&quot;&gt;CNAME&lt;/a&gt; record for &lt;em&gt;www&lt;/em&gt; subdomains, and this is also how I originally set up this site. After checking with INWX’s support (which, by the way, is both excellent and fast) about my subdomain issue, they advised me to replace this CNAME record with a series of &lt;a href=&quot;https://en.wikipedia.org/wiki/List_of_DNS_record_types#A&quot;&gt;A&lt;/a&gt; records for the &lt;em&gt;www&lt;/em&gt; subdomain. In the end I went one step further, and also added corresponding &lt;a href=&quot;https://en.wikipedia.org/wiki/List_of_DNS_record_types#AAAA&quot;&gt;AAAA&lt;/a&gt; records&lt;sup id=&quot;fnref:6&quot;&gt;&lt;a href=&quot;#fn:6&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;6&lt;/a&gt;&lt;/sup&gt;.&lt;/p&gt;

&lt;p&gt;To illustrate this further, I removed this record:&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th style=&quot;text-align: left&quot;&gt;Name&lt;/th&gt;
      &lt;th style=&quot;text-align: left&quot;&gt;Type&lt;/th&gt;
      &lt;th style=&quot;text-align: left&quot;&gt;TTL&lt;/th&gt;
      &lt;th style=&quot;text-align: left&quot;&gt;Value&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;www&lt;/code&gt;&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;CNAME&lt;/code&gt;&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;86400&lt;/code&gt;&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;bitsgalore.github.io&lt;/code&gt;&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;And replaced it with the records below:&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th style=&quot;text-align: left&quot;&gt;Name&lt;/th&gt;
      &lt;th style=&quot;text-align: left&quot;&gt;Type&lt;/th&gt;
      &lt;th style=&quot;text-align: left&quot;&gt;TTL&lt;/th&gt;
      &lt;th style=&quot;text-align: left&quot;&gt;Value&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;www&lt;/code&gt;&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;A&lt;/code&gt;&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;86400&lt;/code&gt;&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;185.199.111.153&lt;/code&gt;&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;www&lt;/code&gt;&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;A&lt;/code&gt;&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;86400&lt;/code&gt;&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;185.199.110.153&lt;/code&gt;&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;www&lt;/code&gt;&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;A&lt;/code&gt;&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;86400&lt;/code&gt;&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;185.199.108.153&lt;/code&gt;&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;www&lt;/code&gt;&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;A&lt;/code&gt;&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;86400&lt;/code&gt;&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;185.199.109.153&lt;/code&gt;&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;www&lt;/code&gt;&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;AAAA&lt;/code&gt;&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;86400&lt;/code&gt;&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;2606:50c0:8001::153&lt;/code&gt;&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;www&lt;/code&gt;&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;AAAA&lt;/code&gt;&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;86400&lt;/code&gt;&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;2606:50c0:8000::153&lt;/code&gt;&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;www&lt;/code&gt;&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;AAAA&lt;/code&gt;&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;86400&lt;/code&gt;&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;2606:50c0:8003::153&lt;/code&gt;&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;www&lt;/code&gt;&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;AAAA&lt;/code&gt;&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;86400&lt;/code&gt;&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;2606:50c0:8002::153&lt;/code&gt;&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;Here, the A records point to the &lt;a href=&quot;https://en.wikipedia.org/wiki/Internet_Protocol_version_4&quot;&gt;IPv4&lt;/a&gt; addresses of GitHub’s servers, and the AAAA records point to the &lt;a href=&quot;https://en.wikipedia.org/wiki/IPv6&quot;&gt;IPv6&lt;/a&gt; addresses of the same servers. I didn’t need to make any changes to the site’s GitHub repo.&lt;/p&gt;
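
&lt;p&gt;With eight near-identical records it’s easy to paste an address into a record of the wrong type. As a minimal sketch (standard library only, function name hypothetical), the Python snippet below checks that every A record from the table holds a syntactically valid IPv4 address, and every AAAA record an IPv6 address. It only checks the syntax of the values; it doesn’t query DNS, so it says nothing about whether the addresses are still the ones GitHub currently uses.&lt;/p&gt;

```python
import ipaddress

# (type, value) pairs copied from the table of www records
RECORDS = [
    ("A", "185.199.111.153"),
    ("A", "185.199.110.153"),
    ("A", "185.199.108.153"),
    ("A", "185.199.109.153"),
    ("AAAA", "2606:50c0:8001::153"),
    ("AAAA", "2606:50c0:8000::153"),
    ("AAAA", "2606:50c0:8003::153"),
    ("AAAA", "2606:50c0:8002::153"),
]

# IP version that each record type is expected to hold
EXPECTED_VERSION = {"A": 4, "AAAA": 6}


def mismatched_records(records):
    """Return the records whose value is not a valid address of the
    IP version implied by the record type."""
    bad = []
    for rtype, value in records:
        try:
            version = ipaddress.ip_address(value).version
        except ValueError:
            # Value is not an IP address at all
            version = None
        if version != EXPECTED_VERSION.get(rtype):
            bad.append((rtype, value))
    return bad


print(mismatched_records(RECORDS))  # prints [] when all records are consistent
```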

&lt;p&gt;After this change, DNSSEC worked as intended for the &lt;em&gt;www&lt;/em&gt; subdomain, and the Internet Standards tool now &lt;a href=&quot;https://web.archive.org/web/20230217140008/https://internet.nl/site/www.bitsgalore.org/1928399/&quot;&gt;resulted in a 97% score&lt;/a&gt;:&lt;/p&gt;

&lt;figure class=&quot;image&quot;&gt;
  &lt;img src=&quot;/images/2023/02/internetnl-14022023.png&quot; alt=&quot;Screenshot of website test summary info for domain www.bitsgalore.org. It shows the site got an overall score of 97%.&quot; /&gt;
&lt;/figure&gt;

&lt;p&gt;There are still a couple of minor things that could be improved (mainly related to GitHub’s servers it seems). Overall I’m happy with this result though, especially when keeping in mind that the site only scored a meagre &lt;a href=&quot;https://web.archive.org/web/20230210222235/https://internet.nl/site/www.bitsgalore.org/1921771/&quot;&gt;55%&lt;/a&gt; just a week earlier.&lt;/p&gt;

&lt;h2 id=&quot;email&quot;&gt;Email&lt;/h2&gt;

&lt;p&gt;I also ran separate tests on &lt;a href=&quot;https://internet.nl/test-mail/&quot;&gt;email&lt;/a&gt; for both my domains, and used the results to apply some optimizations. This is beyond the scope of this post (which is already getting too long), but I think it deserves a brief mention. For example, in November 2022 Google implemented a &lt;a href=&quot;https://support.google.com/mail/answer/81126?hl=en&amp;amp;ref_topic=7279058#auth-reqs&quot;&gt;policy&lt;/a&gt; that requires that the domains of incoming email messages are secured with either &lt;a href=&quot;https://en.wikipedia.org/wiki/Sender_Policy_Framework&quot;&gt;SPF&lt;/a&gt; or &lt;a href=&quot;https://en.wikipedia.org/wiki/DomainKeys_Identified_Mail&quot;&gt;DKIM&lt;/a&gt;. If an incoming message doesn’t comply with this, Google’s servers will either mark it as spam, or even outright reject it. I learnt about this the hard way myself at the time, when all of a sudden I was unable to reach any of my email contacts with a Gmail address!&lt;/p&gt;

&lt;h2 id=&quot;conclusion&quot;&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;The main thing I learnt from this domain move is the importance of researching registrars’ terms and conditions for domains that have been paid for several years in advance. I was surprised at how differently registrars handle this. The somewhat nebulous situation around transfer-out fees was another surprise, and it’s good to know one’s rights here. In the end, I estimate my online research in both areas saved me a total of around €150, which is quite substantial!&lt;/p&gt;

&lt;p&gt;This exercise also made me more aware of the importance of selecting a registrar that keeps up with modern Internet standards, by using name servers with IPv6 addresses, and offering signed domain names (DNSSEC). Tools like &lt;a href=&quot;https://internet.nl/&quot;&gt;the one by the Dutch Internet Standards Platform&lt;/a&gt; are hugely helpful for compliance testing against these standards. Overall, I’m very happy with the outcome. Both my domains are now largely compliant with modern Internet standards, while running at a lower cost compared to my previous registrar.&lt;/p&gt;

&lt;h2 id=&quot;useful-links-and-resources&quot;&gt;Useful links and resources&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://internet.nl/&quot;&gt;Internet.nl&lt;/a&gt; - online tool by the Dutch Internet Standards Platform that tests web and email domains for use of modern Internet standards.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://european-alternatives.eu/category/domain-name-registrar&quot;&gt;European domain name registrars&lt;/a&gt; - lists some more EU-based domain registrars.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://www.icann.org/resources/pages/name-holder-faqs-2017-10-10-en&quot;&gt;FAQs for Registrants: Transferring Your Domain Name&lt;/a&gt; - by ICANN.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://www.icann.org/en/registry-agreements/multiple/revised-verisign-registry-agreements-appendix-c-16-4-2001-en&quot;&gt;Revised VeriSign Registry Agreements: Appendix C&lt;/a&gt; - part of ICANN’s &lt;a href=&quot;https://en.wikipedia.org/wiki/Generic_top-level_domain&quot;&gt;Generic Top-Level Domain&lt;/a&gt; (gTLD) Registry Agreements.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;div class=&quot;footnotes&quot; role=&quot;doc-endnotes&quot;&gt;
  &lt;ol&gt;
    &lt;li id=&quot;fn:1&quot;&gt;
      &lt;p&gt;This was originally meant to be a personal blog covering a variety of subjects. That never really worked out, and in 2019 I re-launched the site as a home for my digital preservation writings old and new. &lt;a href=&quot;#fnref:1&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:3&quot;&gt;
      &lt;p&gt;See The Register for details - &lt;a href=&quot;https://www.theregister.com/2019/11/20/org_registry_sale_shambles/&quot;&gt;“Internet world despairs as non-profit .org sold for $$$$ to private equity firm, price caps axed”&lt;/a&gt; &lt;a href=&quot;#fnref:3&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:4&quot;&gt;
      &lt;p&gt;My other domain, which I only use for e-mail, is a .net domain, and I renew this annually. &lt;a href=&quot;#fnref:4&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:7&quot;&gt;
      &lt;p&gt;Before starting a domain transfer, it’s a good idea to first copy your domain’s DNS settings (which you can find in your registrar’s admin interface) to a text file. This will enable you to enter the relevant DNS records at your new registrar later. &lt;a href=&quot;#fnref:7&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:5&quot;&gt;
      &lt;p&gt;After the transfer it took several hours for the tool to pick up the new name servers. This is probably caching-related, and something to keep in mind when using tools like these on freshly transferred domains. &lt;a href=&quot;#fnref:5&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:6&quot;&gt;
      &lt;p&gt;I’m not completely sure the AAAA records are really necessary here. &lt;a href=&quot;#fnref:6&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
  &lt;/ol&gt;
&lt;/div&gt;
</description>
                <link>https://bitsgalore.org/2023/02/20/moving-my-internet-domains</link>
                <guid>https://bitsgalore.org/2023/02/20/moving-my-internet-domains</guid>
                <pubDate>2023-02-20T00:00:00+01:00</pubDate>
        </item>

        <item>
                <title>Writing yet another workflow tool for imaging portable media</title>
                <description>&lt;figure class=&quot;image&quot;&gt;
  &lt;img src=&quot;/images/2023/01/ipmlab-laptop-1024.jpg&quot; alt=&quot;Photo of a laptop running the Ipmlab software. In the foreground is a removable USB floppy drive with some 3.5 inch floppies lying on top of it. To the right of the laptop is a vintage floppy storage box that contains more floppies.&quot; /&gt;
&lt;/figure&gt;

&lt;p&gt;In 2017 I wrote &lt;a href=&quot;/2017/06/19/image-and-rip-optical-media-like-a-boss&quot;&gt;a blog post&lt;/a&gt; on &lt;a href=&quot;https://github.com/KBNLresearch/iromlab&quot;&gt;Iromlab&lt;/a&gt; (an acronym for “Image and Rip Optical Media Like A Boss”), a custom-built software tool that streamlines imaging and ripping of optical media using an Acronova Nimbie disc robot. The KB has been using Iromlab since 2019 as part of an &lt;a href=&quot;https://www.kb.nl/over-ons/projecten/digitalisering-optische-dragers&quot;&gt;ongoing effort&lt;/a&gt; to preserve the information contained in its vast collection of legacy optical media. This project is expected to reach its completion later this year, but as demonstrated &lt;a href=&quot;/2020/02/20/offline-digital-carriers-kb-deposit-collection&quot;&gt;by this earlier inventory&lt;/a&gt;, our deposit collection also contains various other types of legacy media that are under threat of becoming inaccessible. Out of these, 3.5 inch floppy disks are the most common data carriers (after optical media), so it made sense to focus on these as a next step.&lt;/p&gt;

&lt;p&gt;Using the existing Iromlab-based workflow as a starting point, I created a preliminary workflow tool that can be used for imaging our 3.5” floppies (and various other types of portable media). In this post I’ll explain how this tool came about, and highlight some of the challenges I encountered during its development.&lt;/p&gt;

&lt;!-- more --&gt;

&lt;h2 id=&quot;the-plan&quot;&gt;The plan&lt;/h2&gt;

&lt;p&gt;Since we already have a workflow in place for optical media, it seemed to make sense to use Iromlab (which uses IsoBuster as the main imaging application) as a starting point. This would allow us to re-use most of the software components of the existing optical media workflow with only minor modifications. It also implied that Windows (10) would be the target platform. So much for the theory, but as we’ll see below, things got slightly more complicated and messy along the way!&lt;/p&gt;

&lt;h2 id=&quot;iromlab-to-ipmlab&quot;&gt;Iromlab to Ipmlab&lt;/h2&gt;

&lt;p&gt;As a first step I forked the existing Iromlab code. Unlike optical media, the possibilities for automating the load and unload process are limited for floppies (although some hackers have successfully &lt;a href=&quot;https://hackaday.com/2012/03/31/floppy-autoloader-takes-the-pain-out-of-archiving-5000-amiga-disks/&quot;&gt;repurposed vintage floppy duplicators into DIY autoloaders&lt;/a&gt;). So, I started by removing all code that controls the Nimbie disc robot, and adapted the overall workflow accordingly. Next I removed everything that is specific to optical media, and made some small changes to the IsoBuster call, in order to better accommodate 3.5 inch floppies. As all software needs a name, I settled on “Ipmlab” (Image Portable Media Like A Boss). In a twist of irony, the development of Ipmlab wasn’t exactly boss-like, as will become clear from the remainder of this post.&lt;/p&gt;

&lt;h2 id=&quot;enter-aaru&quot;&gt;Enter Aaru&lt;/h2&gt;

&lt;p&gt;Although the tests with an early prototype initially went well, I wasn’t entirely happy with IsoBuster’s behaviour in case of floppies with damaged sectors. Depending on the configuration settings, IsoBuster either requires manual intervention in this case, or alternatively it aborts the imaging process altogether (filling in the missing sectors with placeholder bytes). On Twitter, Robin François suggested the open-source, cross-platform &lt;a href=&quot;https://www.aaru.app/&quot;&gt;Aaru&lt;/a&gt; software as a possible alternative imaging solution. This looked worthy of further exploration. Although I hadn’t used Aaru before, the results of some quick tests looked promising enough, so I added a simple Aaru wrapper module to Ipmlab.&lt;/p&gt;

&lt;h2 id=&quot;write-blocker-woes&quot;&gt;Write blocker woes&lt;/h2&gt;

&lt;p&gt;This all seemed to work fine at first, but once I connected any of my (external USB) floppy drives&lt;sup id=&quot;fnref:2&quot;&gt;&lt;a href=&quot;#fn:2&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;1&lt;/a&gt;&lt;/sup&gt; through a write blocker (Tableau T8u forensic USB Bridge), Aaru would sometimes throw device exceptions.&lt;/p&gt;

&lt;p&gt;It’s relevant here to mention that for testing I mostly used some old, DOS-formatted 3.5 inch floppies from my personal collection. These include &lt;a href=&quot;https://obsoletemedia.org/3-5-inch-microfloppy-high-density/&quot;&gt;“high density”&lt;/a&gt; disks with a capacity of 1.44 MB, as well as some older &lt;a href=&quot;https://obsoletemedia.org/3-5-inch-microfloppy/&quot;&gt;“double density”&lt;/a&gt; disks, which can only hold 720 KB worth of data. The main pattern in the Aaru crashes was that they typically happened whenever I tried to process a “high density” disk after having processed one or more “double density” ones, or vice versa. The crashes didn’t happen when the floppy drives were connected directly to my machine.&lt;/p&gt;

&lt;h2 id=&quot;ddrescue-tests&quot;&gt;Ddrescue tests&lt;/h2&gt;

&lt;p&gt;To get a better idea of what was going on, I ran more tests using an alternative imaging application (&lt;a href=&quot;https://www.gnu.org/software/ddrescue/&quot;&gt;Ddrescue&lt;/a&gt;), and repeated these tests on another machine with a different operating system (Linux Mint).&lt;/p&gt;

&lt;p&gt;I started by imaging one “double density” (720 KB) disk. Any subsequent floppies were typically imaged without problems, provided that these were also “double density” disks. These each resulted in a 737 KB disk image, which I could mount normally on my Linux machine. On the other hand, following up a “double density” disk with a “high density” disk resulted in the following output from Ddrescue:&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;Press Ctrl-C to interrupt
     ipos:   720896 B
     opos:   720896 B
non-tried:        0 B
  rescued:   737280 B
pct rescued:  100.00%, &lt;span class=&quot;nb&quot;&gt;read &lt;/span&gt;errors:        0
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This may look normal at first, as Ddrescue does not report any errors. On closer inspection though, we see that only 737 KB of data were extracted from the disk. This is unexpected, because a “high density” disk should really result in a 1.5 MB disk image. But the 737 KB value &lt;em&gt;does&lt;/em&gt; correspond exactly to the expected size for a “double density” disk! So in spite of the lack of any error messages from Ddrescue, about half of the data on the disk is missing from the image file!&lt;/p&gt;

&lt;p&gt;As an additional test I switched off the write blocker, switched it on again, and then repeated the above experiment, but now starting with some “high density” disks. These were all imaged correctly, each resulting in 1.5 MB image files. Following this up with any of my “double density” disks invariably resulted in Ddrescue reporting read errors like this:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;pct rescued:   48.88%, read errors:     7372
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;So apparently Ddrescue expected 1.44 MB of data (the size of the “high density” disk I inserted first), instead of the 720 KB of a “double density” disk. The size of the resulting disk image in this case was 1.5 MB, with the first 737 KB containing the data from the disk, followed by null bytes for the remaining part of the file.&lt;/p&gt;

&lt;p&gt;Running Ddrescue on an &lt;em&gt;empty&lt;/em&gt; floppy drive resulted in an image with only null bytes (all bad blocks), with an image size of either 737 KB or 1.5 MB, depending on whether a “double” or “high” density floppy had been inserted first.&lt;/p&gt;

&lt;p&gt;Repeating any of the above tests with Aaru mostly resulted in Aaru device exceptions. At first sight, this all suggested that the medium size that is exposed by the write blocker to the operating system somehow remains “stuck” to the size of the first floppy that was loaded after switching it on, and isn’t updated afterwards.&lt;/p&gt;
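
&lt;p&gt;A simple size check can catch this kind of silent truncation before an image is accepted. The following is a minimal Python sketch, not part of Ipmlab itself: the function names are hypothetical, and it assumes DOS-formatted 3.5 inch floppies with nominal capacities of 737280 bytes (“double density”) and 1474560 bytes (“high density”).&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;import os

# Nominal capacities in bytes of DOS-formatted 3.5 inch floppies (assumed values)
EXPECTED_SIZES = {
    'double density': 737280,   # 1440 sectors of 512 bytes
    'high density': 1474560,    # 2880 sectors of 512 bytes
}


def size_is_plausible(size_bytes, disk_type):
    '''Return True if an image size equals the nominal capacity of disk_type.'''
    return size_bytes == EXPECTED_SIZES[disk_type]


def image_is_plausible(image_path, disk_type):
    '''Check the size of an image file on disk against the expected capacity.'''
    return size_is_plausible(os.path.getsize(image_path), disk_type)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;For the Ddrescue output shown earlier, calling “size_is_plausible” with 737280 and “high density” returns False, which flags the truncated image even though Ddrescue reported no read errors. The check only works if the operator records which type of disk was inserted, since the image size alone cannot tell the two cases apart.&lt;/p&gt;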

&lt;h2 id=&quot;tableau-response&quot;&gt;Tableau response&lt;/h2&gt;

&lt;p&gt;I contacted Tableau (the manufacturer of the write blocker) about this, who suggested that the behaviour might be caused by an inability of the write blocker to differentiate between the actual floppy and the USB adapter device. They didn’t envisage a short-term solution for this because of the complexities involved, and advised me to look for some alternative solution that doesn’t use this write blocker.&lt;/p&gt;

&lt;h2 id=&quot;windows-writing-pesky-folders&quot;&gt;Windows writing pesky folders&lt;/h2&gt;

&lt;p&gt;This created a bit of a dilemma. By default, Windows 10 (the target platform for the workflow) automatically tries to write a &lt;a href=&quot;https://www.thewindowsclub.com/system-volume-information-folder&quot;&gt;“System Volume Information” folder&lt;/a&gt; to a newly detected floppy disk (or other storage medium for that matter):&lt;/p&gt;

&lt;figure class=&quot;image&quot;&gt;
  &lt;img src=&quot;/images/2023/01/svi-win10.png&quot; alt=&quot;Screenshot of file manager showing the contents of a floppy disk. In addition to the original files, it also shows a System Volume Information directory that was added by Windows 10 after inserting the floppy into the machine.&quot; /&gt;
  &lt;figcaption&gt;Contents of a floppy. The System Volume Information directory was not originally part of the floppy, but was automatically added by Windows 10 after insertion into the floppy drive.&lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;Of course we can prevent this by setting a floppy’s write-protect tab to the “protected” position (which is always a good idea). However, if an operator forgets to do this (and I think it’s reasonable to expect this will happen every once in a while), this would immediately result in changes to the source media. Reportedly &lt;a href=&quot;https://superuser.com/questions/1199823/how-to-prevent-creation-of-system-volume-information-folder-in-windows-10-for&quot;&gt;it is possible&lt;/a&gt; to disable the automatic creation of “System Volume Information” folders, but this involves some rather ugly messing with the Windows registry and the services settings.&lt;/p&gt;

&lt;h2 id=&quot;linux-to-the-rescue&quot;&gt;Linux to the rescue&lt;/h2&gt;

&lt;p&gt;After some deliberation with my colleagues on the operational side, we decided to abandon Windows as a target platform, and switch to &lt;a href=&quot;https://linuxmint.com/&quot;&gt;Linux Mint&lt;/a&gt; instead. Unlike Windows, Linux (Mint) doesn’t try to write anything to a floppy upon insertion. For additional protection we can &lt;a href=&quot;https://github.com/KBNLresearch/ipmlab/blob/main/doc/setupGuide.md#disable-automatic-mounting-of-removable-media&quot;&gt;disable automatic mounting of removable media&lt;/a&gt;, which is fairly easy to set up. In combination with using a floppy’s write-protect tab, this provides a level of protection against accidental write actions that looks pretty reasonable to me&lt;sup id=&quot;fnref:1&quot;&gt;&lt;a href=&quot;#fn:1&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;2&lt;/a&gt;&lt;/sup&gt;. Since much of the code was Linux-compatible from the outset, adapting Ipmlab was relatively straightforward.&lt;/p&gt;

&lt;h2 id=&quot;return-of-ddrescue&quot;&gt;Return of Ddrescue&lt;/h2&gt;

&lt;p&gt;As initial tests with the adapted development version of Ipmlab showed no problems, I moved to the final step of packaging the code. Or so I thought. When running the installed version of Ipmlab, Aaru now invariably failed with an unhandled exception. After submitting an &lt;a href=&quot;https://github.com/aaru-dps/Aaru/issues/749&quot;&gt;issue on this&lt;/a&gt;, Aaru’s main developer Natalia Portillo suggested the issue was most likely caused by the console handler class that is used by the current stable (5.3) version of Aaru, and that the latest development version (which uses a different console handler) might give better results. I was able to confirm this with a test with the latest (6.0) development version, which indeed worked without any problems.&lt;/p&gt;

&lt;p&gt;However, as there’s still a lot of work to do before a stable 6.0 release will be ready, this again introduced a minor dilemma. In the end, I put my Aaru plans on hold (at least for now), and added a Ddrescue wrapper module to Ipmlab. I then made the imaging application a user-defined configuration variable, giving a user the choice between either Ddrescue or Aaru. This means that once Aaru 6.0 is ready for release, Ipmlab will support it&lt;sup id=&quot;fnref:3&quot;&gt;&lt;a href=&quot;#fn:3&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;3&lt;/a&gt;&lt;/sup&gt;.&lt;/p&gt;

&lt;h2 id=&quot;ipmlab-features&quot;&gt;Ipmlab features&lt;/h2&gt;

&lt;p&gt;Like Iromlab, Ipmlab uses a simple batch structure, where each imaged medium is represented by a single directory that contains the disk image and all associated metadata. This includes:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;a report file in &lt;a href=&quot;https://en.wikipedia.org/wiki/Digital_Forensics_XML&quot;&gt;Digital Forensics XML format&lt;/a&gt; (created using &lt;a href=&quot;https://www.sleuthkit.org/sleuthkit/&quot;&gt;Sleuth Kit&lt;/a&gt;)&lt;/li&gt;
  &lt;li&gt;a file with bibliographical metadata&lt;/li&gt;
  &lt;li&gt;a &lt;a href=&quot;https://www.gnu.org/software/ddrescue/manual/ddrescue_manual.html#Mapfile-structure&quot;&gt;Ddrescue map file&lt;/a&gt; (only if Ddrescue is used)&lt;/li&gt;
  &lt;li&gt;various Aaru-specific metadata files (only if Aaru is used)&lt;/li&gt;
  &lt;li&gt;a file with SHA-512 checksums of all of the above files.&lt;/li&gt;
&lt;/ul&gt;
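
&lt;p&gt;A checksum file like the one in the last item could be produced along the following lines. This is a minimal Python sketch using the standard “hashlib” module; it is not Ipmlab’s actual code, and the function names are made up for illustration.&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;import hashlib
import os


def sha512_of_file(path, chunk_size=65536):
    '''Compute the SHA-512 hex digest of a file, reading it in chunks.'''
    digest = hashlib.sha512()
    with open(path, 'rb') as f:
        # Read fixed-size chunks so large disk images do not fill up memory
        for chunk in iter(lambda: f.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_directory(directory):
    '''Return a sorted list of (file name, digest) tuples for all files in a directory.'''
    results = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        # Skip subdirectories; only regular files are hashed
        if os.path.isfile(path):
            results.append((name, sha512_of_file(path)))
    return results
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Writing each tuple to a text file as a digest followed by the file name gives a listing that can be re-checked later with standard tools.&lt;/p&gt;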

&lt;p&gt;Each batch also contains:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;a batch manifest, which is a comma-delimited text file with all information that is needed to process the batch into ingest-ready Submission Information Packages further down the processing chain&lt;/li&gt;
  &lt;li&gt;a detailed batch log&lt;/li&gt;
  &lt;li&gt;an “end-of-batch” file that indicates that the batch was finalized&lt;/li&gt;
  &lt;li&gt;a file that indicates the version of Ipmlab.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Optionally, it’s possible to send some of the user input (PPN identifier or title string) to the corresponding Ipmlab entry widget from an external application through a &lt;a href=&quot;https://en.wikipedia.org/wiki/Network_socket&quot;&gt;socket connection&lt;/a&gt;. This is useful for integrating Ipmlab with e.g. external administrative applications.&lt;/p&gt;

&lt;figure class=&quot;image&quot;&gt;
  &lt;img src=&quot;/images/2023/01/ipmPostSubmit.png&quot; alt=&quot;Screenshot of Ipmlab user interface. Logging widget shows that Ddrescue is running in the background.&quot; /&gt;
  &lt;figcaption&gt;Ipmlab interface while processing a floppy.&lt;/figcaption&gt;
&lt;/figure&gt;

&lt;h2 id=&quot;limitations&quot;&gt;Limitations&lt;/h2&gt;

&lt;p&gt;At the time I’m writing this post, Ipmlab is still undergoing some further tests by colleagues from our digital preservation department. I expect that the outcome of these will result in some further changes to the software. We don’t currently have any reliable information on the layouts of the 3.5 inch floppies in the KB collection, but I expect these will be mostly DOS/Windows formatted ones. There is still a possibility that further tests reveal a significant proportion of more exotic layouts, such as old Mac-formatted disks. In that case, various changes to the workflow would be needed, as these disk types would require the use of an external hardware controller, and possibly a flux-based imaging approach.&lt;/p&gt;

&lt;h2 id=&quot;adapting-ipmlab&quot;&gt;Adapting Ipmlab&lt;/h2&gt;

&lt;p&gt;Source code and documentation for Ipmlab are available &lt;a href=&quot;https://github.com/KBNLresearch/ipmlab&quot;&gt;here&lt;/a&gt;. Like Iromlab, some features of Ipmlab are quite specific to the situation at the KB. The most obvious example is the use of unique PPN (Pica Production Number) identifiers for each medium, which are linked to records in the KB’s catalog. For a more generic user experience, it’s possible to disable the PPN lookup as a configuration option, which replaces the PPN entry field with a “title” field. The PPN lookup module could also be easily adapted to any other identifier or cataloguing system. Likewise, it’s fairly straightforward to implement alternative disk imaging applications.&lt;/p&gt;

&lt;h2 id=&quot;acknowledgments&quot;&gt;Acknowledgments&lt;/h2&gt;

&lt;p&gt;Thanks are due to Natalia Portillo and Robin François for their help and suggestions on Aaru.&lt;/p&gt;

&lt;h2 id=&quot;additional-links-and-resources&quot;&gt;Additional links and resources&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;https://github.com/KBNLresearch/ipmlab&quot;&gt;Ipmlab&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://github.com/KBNLresearch/ipmlab/tree/main/doc&quot;&gt;Ipmlab documentation&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;/2020/02/20/offline-digital-carriers-kb-deposit-collection&quot;&gt;Offline digital data carriers in the KB deposit collection&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;div class=&quot;footnotes&quot; role=&quot;doc-endnotes&quot;&gt;
  &lt;ol&gt;
    &lt;li id=&quot;fn:2&quot;&gt;
      &lt;p&gt;These are refurbished “old” drives; this is significant, as modern USB floppy drives often cannot read “double density” disks. &lt;a href=&quot;#fnref:2&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:1&quot;&gt;
      &lt;p&gt;It’s important to stress that disabling auto-mount does not protect against intentional write actions: a user can still mount a floppy manually, and write or delete files. &lt;a href=&quot;#fnref:1&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:3&quot;&gt;
      &lt;p&gt;Barring any major changes to Aaru’s command-line interface. &lt;a href=&quot;#fnref:3&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
  &lt;/ol&gt;
&lt;/div&gt;
</description>
                <link>https://bitsgalore.org/2023/01/23/writing-a-workflow-tool-for-imaging-portable-media</link>
                <guid>https://bitsgalore.org/2023/01/23/writing-a-workflow-tool-for-imaging-portable-media</guid>
                <pubDate>2023-01-23T00:00:00+01:00</pubDate>
        </item>

        <item>
                <title>How to preserve your personal Twitter archive</title>
                <description>&lt;figure class=&quot;image&quot;&gt;
  &lt;img src=&quot;/images/2022/11/fail_whale_by_ka_92-d5ra7vf.jpg&quot; alt=&quot;Painting of a whale, which is lifted above the beach by a swarm of red Twitter birds. In the foreground, an elderly couple watches the scene.&quot; /&gt;
  &lt;figcaption&gt;&lt;a href=&quot;https://web.archive.org/web/20130204212401/https://ka-92.deviantart.com/art/fail-whale-348157275&quot;&gt;&quot;Fail whale&quot;&lt;/a&gt; by &lt;a href=&quot;https://web.archive.org/web/20130128112538/http://ka-92.deviantart.com/&quot;&gt;Kuni (ka-92)&lt;/a&gt; (license unknown), based on &quot;Lifting a Dreamer&quot; by &lt;a href=&quot;http://www.yiyinglu.com/?portfolio=lifting-a-dreamer-aka-twitter-fail-whale&quot;&gt;Yiying Lu&lt;/a&gt;.&lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;As a total collapse of Twitter is becoming more likely every day, many Twitter users have started to archive their personal data from the platform while it still exists. Twitter allows you to request and download your personal archive. Even though this works well, and the quality of the archive is surprisingly good, it does have some shortcomings. The result of these shortcomings will be that some information in the archive (e.g. on followed accounts and followers) will be lost once Twitter ceases to exist. Some other information (in particular full, unshortened URLs) &lt;em&gt;is&lt;/em&gt; included in the archive, but it is not easily accessible from the main HTML interface. The good news is that some excellent tools exist to fix these shortcomings.&lt;/p&gt;

&lt;p&gt;In this post I outline the workflow I used to preserve my own Twitter archive, and while doing so I also provide some background information on the shortcomings of the Twitter archive. Since some of these steps may, at first sight, be a little daunting for less tech-savvy readers, I’ve tried to provide step-by-step instructions where possible.&lt;/p&gt;

&lt;!-- more --&gt;

&lt;h2 id=&quot;disclaimer&quot;&gt;Disclaimer&lt;/h2&gt;

&lt;p&gt;First a little disclaimer: I don’t want to suggest that what follows is the “best”, “proper” or even a “good” way to do this. For most of the issues I’m addressing here, several alternatives exist, and some may be better than what I’m suggesting here. Also, an in-depth analysis of the Twitter archive, and a comparison of different tools and approaches are both beyond the scope of this post (besides I don’t have the time for this). Ultimately, all I’m describing here is a workflow that, for now, looks “good enough” for my own purposes. I’m sharing it here because I &lt;em&gt;think&lt;/em&gt; it will be “good enough” for many others as well, or at least provide a reasonable starting point. With that said, I can’t give any guarantees here, so keep this in mind while reading what follows!&lt;/p&gt;

&lt;h2 id=&quot;request-and-download-twitter-archive&quot;&gt;Request and download Twitter archive&lt;/h2&gt;

&lt;p&gt;Most importantly, you need to request and download your Twitter data&lt;sup id=&quot;fnref:4&quot;&gt;&lt;a href=&quot;#fn:4&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;1&lt;/a&gt;&lt;/sup&gt;. The basic steps are:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;While logged in to Twitter, go to &lt;a href=&quot;https://twitter.com/settings/account&quot;&gt;your account’s settings&lt;/a&gt;.&lt;/li&gt;
  &lt;li&gt;Click on “Download an archive of your data”, and then simply follow the instructions.&lt;/li&gt;
  &lt;li&gt;Twitter will issue a notification when the archive is ready:
 &lt;img src=&quot;/images/2022/11/archive-notification.png&quot; alt=&quot;&quot; /&gt;
This may take a while. When I did this two weeks ago, I had to wait over 24 hours; various people have told me that currently it takes several days. Once ready, you can download the archive as one large ZIP file.&lt;/li&gt;
  &lt;li&gt;Once downloaded, unzip the file to a folder.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2 id=&quot;shortcomings-of-the-twitter-archive&quot;&gt;Shortcomings of the Twitter archive&lt;/h2&gt;

&lt;p&gt;As several people have pointed out before, the archive data provided by Twitter have a number of shortcomings. Below is a (probably incomplete) overview of the most obvious ones.&lt;/p&gt;

&lt;h3 id=&quot;shortened-tco-links&quot;&gt;Shortened t.co links&lt;/h3&gt;

&lt;p&gt;Perhaps most importantly, if you access your Twitter archive from your web browser by opening the “Your archive.html” file, any clickable hyperlinks you see are shortened t.co links that use &lt;a href=&quot;https://web.archive.org/web/20221116161843/https://help.twitter.com/en/using-twitter/url-shortener&quot;&gt;Twitter’s link shortener service&lt;/a&gt;:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/2022/11/tweet-link.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;These links will stop working once Twitter goes down. The original, unshortened links are actually stored in the underlying JSON data that are part of the archive (file “tweets.js” in the “data” folder). For example, &lt;a href=&quot;https://gist.github.com/bitsgalore/cfdff3ce67f1ffa85f67e87c778a9e75&quot;&gt;here’s the full data from the Tweet in that screenshot&lt;/a&gt;. Let’s zoom in on its “urls” attribute:&lt;/p&gt;

&lt;div class=&quot;language-json highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;nl&quot;&gt;&quot;urls&quot;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
          &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
            &lt;/span&gt;&lt;span class=&quot;nl&quot;&gt;&quot;url&quot;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;https://t.co/y2gpEVvjAd&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
            &lt;/span&gt;&lt;span class=&quot;nl&quot;&gt;&quot;expanded_url&quot;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;https://youtu.be/C47ZCosJPAw&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
            &lt;/span&gt;&lt;span class=&quot;nl&quot;&gt;&quot;display_url&quot;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;youtu.be/C47ZCosJPAw&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
            &lt;/span&gt;&lt;span class=&quot;nl&quot;&gt;&quot;indices&quot;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
              &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;246&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
              &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;269&quot;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
            &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
          &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
        &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;As you can see, the “urls” attribute contains a shortened t.co link (“url”), the unshortened link (“expanded_url”), and a display link (“display_url”). So all the data are there, but the unshortened links just aren’t accessible from the archive’s web interface.&lt;/p&gt;
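
&lt;p&gt;To illustrate how these fields could be used, here is a minimal Python sketch that replaces each shortened link in a tweet’s text with its “expanded_url” value. The sample tweet, the “full_text” and “entities” field names, and the “expand_urls” function are assumptions made for this illustration; only the “urls” entry itself is taken from the example above.&lt;/p&gt;

```python
import json

# Hypothetical tweet fragment; only the "urls" entry mirrors the example above.
tweet_json = """
{
  "full_text": "New video: https://t.co/y2gpEVvjAd",
  "entities": {
    "urls": [
      {
        "url": "https://t.co/y2gpEVvjAd",
        "expanded_url": "https://youtu.be/C47ZCosJPAw",
        "display_url": "youtu.be/C47ZCosJPAw"
      }
    ]
  }
}
"""


def expand_urls(tweet):
    """Return the tweet text with each t.co link replaced by its expanded_url."""
    text = tweet["full_text"]
    for entry in tweet["entities"]["urls"]:
        text = text.replace(entry["url"], entry["expanded_url"])
    return text


tweet = json.loads(tweet_json)
expanded = expand_urls(tweet)
print(expanded)
```

&lt;p&gt;This only shows the principle; the tools discussed further below do this for the whole archive.&lt;/p&gt;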

&lt;h3 id=&quot;full-size-images&quot;&gt;Full-size images&lt;/h3&gt;

&lt;p&gt;The Twitter archive only contains downscaled versions of posted images. Clicking on an image to expand it takes you to the live Twitter website. Again, this is something that will stop working once Twitter is gone.&lt;/p&gt;

&lt;h3 id=&quot;twitter-network-data&quot;&gt;Twitter network data&lt;/h3&gt;

&lt;p&gt;Although the Twitter archive does contain files with the accounts that you follow and the accounts that follow you, these are given as numerical identifiers that will most likely be meaningless if Twitter disappears. Here’s an example:&lt;/p&gt;

&lt;div class=&quot;language-json highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;w&quot;&gt;  &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
    &lt;/span&gt;&lt;span class=&quot;nl&quot;&gt;&quot;following&quot;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
      &lt;/span&gt;&lt;span class=&quot;nl&quot;&gt;&quot;accountId&quot;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;216697909&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
      &lt;/span&gt;&lt;span class=&quot;nl&quot;&gt;&quot;userLink&quot;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;https://twitter.com/intent/user?user_id=216697909&quot;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
    &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
  &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;Here, the user link with account ID 216697909 (&lt;a href=&quot;https://twitter.com/intent/user?user_id=216697909&quot;&gt;https://twitter.com/intent/user?user_id=216697909&lt;/a&gt;) resolves to &lt;a href=&quot;https://twitter.com/openpreserve&quot;&gt;the account of the Open Preservation Foundation&lt;/a&gt;. Once Twitter is gone, this user link will stop resolving, which will make it very hard to figure out which actual person or organization was associated with it.&lt;/p&gt;
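
&lt;p&gt;As a rough sketch, the identifiers can be pulled out of such entries with a few lines of Python. The sample below assumes the data are available as a plain JSON list of objects shaped like the example above; the second entry and the “account_ids” function are made up for illustration, and the actual archive files may wrap the list in a JavaScript assignment that would have to be stripped first (not handled here).&lt;/p&gt;

```python
import json

# Hypothetical sample: the first entry mirrors the example above,
# the second one is invented for illustration.
following_json = """
[
  {"following": {"accountId": "216697909",
                 "userLink": "https://twitter.com/intent/user?user_id=216697909"}},
  {"following": {"accountId": "12345",
                 "userLink": "https://twitter.com/intent/user?user_id=12345"}}
]
"""


def account_ids(entries, key="following"):
    """Return the accountId of every entry in the list."""
    return [entry[key]["accountId"] for entry in entries]


ids = account_ids(json.loads(following_json))
print(ids)
```

&lt;p&gt;Note that this only yields the numerical identifiers, not the names behind them.&lt;/p&gt;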

&lt;p&gt;Fortunately, some excellent solutions exist that can fix most of the above shortcomings, and they are pretty easy to use as well. Let’s start with addressing the network data issue, as this is something you can do immediately, without having to wait for your Twitter archive.&lt;/p&gt;

&lt;h2 id=&quot;preserve-your-twitter-network-with-fedifinder&quot;&gt;Preserve your Twitter network with FediFinder&lt;/h2&gt;

&lt;p&gt;&lt;a href=&quot;https://fedifinder.glitch.me/&quot;&gt;FediFinder&lt;/a&gt; is an online tool written by Luca Hammer. Its main purpose is to find Fediverse accounts that correspond to your Twitter connections. However, it also works great for tracking your entire Twitter network, including your followers, accounts that you follow, and list members.&lt;/p&gt;

&lt;p&gt;Just follow these steps:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;
    &lt;p&gt;While you’re logged into your Twitter account, go to &lt;a href=&quot;https://fedifinder.glitch.me/&quot;&gt;https://fedifinder.glitch.me/&lt;/a&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Click on the “Authorize to extract handles” button (this will give FediFinder read permissions to your Twitter account):&lt;/p&gt;

    &lt;p&gt;&lt;img src=&quot;/images/2022/11/ff-authorize.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;In the page that appears, click on “Scan followings”, “Scan followers” and “Load lists”. For each list, click on “Scan members”. If all goes well you’ll see something like this:&lt;/p&gt;

    &lt;p&gt;&lt;img src=&quot;/images/2022/11/ff-scan-finished.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Scroll past the list of search results to the bottom of the page, and click on the small “accounts.csv” link&lt;sup id=&quot;fnref:1&quot;&gt;&lt;a href=&quot;#fn:1&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;2&lt;/a&gt;&lt;/sup&gt;:&lt;/p&gt;

    &lt;p&gt;&lt;img src=&quot;/images/2022/11/ff-accounts.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;
  &lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Once downloaded, you can open the file in any spreadsheet software. For each account, it contains the Twitter user name, the real name, any lists of which the account is a member, associated Fediverse handles, the account’s location, and its profile description.&lt;/p&gt;
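
&lt;p&gt;If you prefer a script over a spreadsheet, the file can also be inspected with Python’s built-in “csv” module. The sketch below uses a small in-memory stand-in with hypothetical column names, since the exact header of “accounts.csv” may differ; the “summarise” function is likewise just an illustration.&lt;/p&gt;

```python
import csv
import io

# Hypothetical stand-in for accounts.csv; the real column names may differ.
sample = io.StringIO(
    "username,name,location\n"
    "openpreserve,Open Preservation Foundation,\n"
)


def summarise(handle):
    """Return the column names and the number of data rows in a CSV file."""
    reader = csv.DictReader(handle)
    rows = list(reader)
    return reader.fieldnames, len(rows)


columns, count = summarise(sample)
print(columns, count)
```

&lt;p&gt;For the real file, pass an opened file handle for the downloaded “accounts.csv” instead of the in-memory sample.&lt;/p&gt;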

&lt;p&gt;Optionally, afterwards you may want to remove FediFinder from your Twitter account’s connected app list using &lt;a href=&quot;https://twitter.com/settings/connected_apps/&quot;&gt;this link&lt;/a&gt;:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/2022/11/ff-revoke.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;h2 id=&quot;improve-archive-with-twitter-archive-parser&quot;&gt;Improve archive with Twitter archive parser&lt;/h2&gt;

&lt;p&gt;Tim Hutton has written &lt;a href=&quot;https://github.com/timhutton/twitter-archive-parser&quot;&gt;twitter-archive-parser&lt;/a&gt;, which is a Python tool that fixes most of the remaining issues, and some other issues as well&lt;sup id=&quot;fnref:5&quot;&gt;&lt;a href=&quot;#fn:5&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;3&lt;/a&gt;&lt;/sup&gt;. Most importantly, it creates both HTML and &lt;a href=&quot;https://en.wikipedia.org/wiki/Markdown&quot;&gt;Markdown&lt;/a&gt; versions of the archive, with all shortened t.co URLs replaced with their original versions. Optionally, it can also be instructed to download full-size versions of images&lt;sup id=&quot;fnref:3&quot;&gt;&lt;a href=&quot;#fn:3&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;4&lt;/a&gt;&lt;/sup&gt;.&lt;/p&gt;

&lt;p&gt;To use it, follow these steps:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;
    &lt;p&gt;Install Python 3 on your system if you don’t have it already. If you’re a Windows user and you’re not sure how to do this, check out &lt;a href=&quot;https://github.com/pettarin/python-on-windows&quot;&gt;Alberto Pettarin’s easy to follow instructions&lt;/a&gt;.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Then download the twitter-archive-parser script. For this, just right-click &lt;a href=&quot;https://raw.githubusercontent.com/timhutton/twitter-archive-parser/main/parser.py&quot;&gt;this link&lt;/a&gt;, select “Save link as”, and save the file into the folder where you extracted the archive&lt;sup id=&quot;fnref:2&quot;&gt;&lt;a href=&quot;#fn:2&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;5&lt;/a&gt;&lt;/sup&gt;.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Start a command-prompt or terminal, and change the working directory to the folder where you extracted the archive. Windows users may want to have another look at &lt;a href=&quot;https://github.com/pettarin/python-on-windows&quot;&gt;Alberto Pettarin’s explainer&lt;/a&gt;, in particular the “Using The Command Prompt” and “Changing The Working Directory” sections.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Run the twitter-archive-parser script from the command-prompt or terminal, using the following command:&lt;/p&gt;

    &lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt; python parser.py
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;    &lt;/div&gt;

    &lt;p&gt;Depending on your operating system, you may need to replace “python” with “python3”:&lt;/p&gt;

    &lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt; python3 parser.py
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;    &lt;/div&gt;

    &lt;p&gt;The script will most likely ask you to confirm the installation of one or two Python modules (“requests” and “imagesize”); if this happens, just type “y” and press the Enter key.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Finally, the script will ask if you want to download original-size images. Type “y” if you want this, or “n” if not. Note that downloading the images can take quite a bit of time (depending on the size of your Twitter archive, and the number of images and multimedia it contains).&lt;/p&gt;

    &lt;p&gt;As an aside, I noticed that archive parser was unable to download some images. I don’t particularly care about this myself, but if images are crucially important to you, the “media” folder contains a download log (file “download_log.txt”) with full details of the download status of each image.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;And that’s all there is to it!&lt;/p&gt;

&lt;h2 id=&quot;alternative-approaches-and-variations&quot;&gt;Alternative approaches and variations&lt;/h2&gt;

&lt;p&gt;As I already mentioned in the introduction, I’m not making any claims that the above steps are the “proper” way to do this, and various alternative approaches exist. In this section I’ll highlight a few of these.&lt;/p&gt;

&lt;p&gt;After I published the first draft of this post, I found &lt;a href=&quot;https://wiert.me/2022/11/12/exporting-your-twitter-content-converting-to-markdown-and-getting-the-image-alt-texts-thanks-isotopp-hbeckpdx-for-the-info-and-kcgreenn-dreamjar-for-the-comic/&quot;&gt;this earlier post by Jeroen Wiert Pluimers&lt;/a&gt;. It describes an overall workflow that is similar to the one described here (it also uses archive parser), but adds the extraction of alt-text image descriptions, exporting of bookmarks, and archiving of t.co URL shortener links to the &lt;a href=&quot;https://web.archive.org/&quot;&gt;Wayback Machine&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Ed Summers has written an &lt;a href=&quot;https://github.com/docnow/twitter-archive-unshorten&quot;&gt;alternative t.co unshortener&lt;/a&gt;, which is explained &lt;a href=&quot;https://inkdroid.org/2022/11/20/t-dot-co/&quot;&gt;in this blog post&lt;/a&gt;. Ed has also &lt;a href=&quot;https://inkdroid.org/2022/11/16/bookmarks/&quot;&gt;written this post on archiving Twitter bookmarks&lt;/a&gt;, which are not included in the archive data. And &lt;a href=&quot;https://gist.github.com/ryanfb/53f167feebde61ad262c4f09d879733e&quot;&gt;here’s a Ruby script by Ryan Baumann&lt;/a&gt; that exports your Twitter Bookmarks to JSON (note that the script deletes the original bookmarks to get around API limits). As I never use bookmarks, I’m not really interested in this myself, but this might be important to some users.&lt;/p&gt;

&lt;p&gt;Mike Hucka’s &lt;a href=&quot;https://github.com/mhucka/taupe&quot;&gt;Taupe tool&lt;/a&gt; extracts URLs from tweets, retweets, replies, quote tweets, and “likes” from a personal Twitter archive, and writes these to a comma-delimited text file. This is especially useful if you want to preserve linked resources (e.g. by sending them to a web archive).&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://github.com/tweetback/tweetback/&quot;&gt;Tweetback&lt;/a&gt; is an open-source software project by Zach Leatherman that creates a static website from your Twitter archive. See &lt;a href=&quot;https://www.zachleat.com/web/tweetback/&quot;&gt;this blog post&lt;/a&gt; for more information about it, as well as some links to static websites that were created with it. This looks like a really interesting option for those who want to publish their Twitter archive.&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://github.com/sqiouyilu/twitter-nest&quot;&gt;Twitter-nest&lt;/a&gt; is a set of tools by S. Qiouyi Lu that allow you to create a decentralized Twitter clone on WordPress.&lt;/p&gt;

&lt;p&gt;Internet Archive has also made it possible to upload your Twitter archive to its &lt;a href=&quot;https://archive.org/web/&quot;&gt;Wayback Machine&lt;/a&gt;. See &lt;a href=&quot;https://help.archive.org/help/how-to-archive-your-tweets-with-the-wayback-machine/&quot;&gt;this article for instructions&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Finally, &lt;a href=&quot;https://techcrunch.com/2022/11/21/quit-twitter-better-with-these-free-tools-that-make-archiving-a-breeze/&quot;&gt;this TechCrunch feature&lt;/a&gt; mentions some more tools that might be worth perusing.&lt;/p&gt;

&lt;h2 id=&quot;final-thoughts&quot;&gt;Final thoughts&lt;/h2&gt;

&lt;p&gt;It’s good to keep in mind that the development of tools like archive parser currently &lt;a href=&quot;https://digipres.club/web/@timhutton@mathstodon.xyz/109377490421206529&quot;&gt;moves at a pretty fast pace&lt;/a&gt;. Just as an example, when I ran archive parser only yesterday (19th of November), it wasn’t able to report Twitter followers and followings, whereas this functionality is included in the latest (20th of November) release. So I expect these tools will become even better over time (but don’t wait for it, as there’s a real chance that Twitter may be gone by then!).&lt;/p&gt;

&lt;p&gt;Please feel free to use the comment section to post links to alternative tools or methods, or if you spot any glaring errors in this post.&lt;/p&gt;

&lt;h2 id=&quot;additional-links-and-resources&quot;&gt;Additional links and resources&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://github.com/timhutton/twitter-archive-parser&quot;&gt;Twitter archive parser&lt;/a&gt; (“related tools” section lists some more tools that might be useful)&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://fedifinder.glitch.me/&quot;&gt;FediFinder&lt;/a&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://github.com/pettarin/python-on-windows&quot;&gt;A step-by-step guide on installing Python and using the Command Prompt for Windows&lt;/a&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://inkdroid.org/2022/11/16/bookmarks/&quot;&gt;Ed Summers on archiving Twitter bookmarks&lt;/a&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://ryanfb.github.io/etc/2022/11/21/exporting_as_many_of_your_twitter_bookmarks_as_possible.html&quot;&gt;Ryan Baumann - Exporting As Many of Your Twitter Bookmarks As Possible&lt;/a&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://gist.github.com/ryanfb/53f167feebde61ad262c4f09d879733e&quot;&gt;Ruby script by Ryan Baumann that exports Twitter bookmarks to JSON&lt;/a&gt; (this also deletes the original bookmarks to get around API limits, so use with caution!)&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://inkdroid.org/2022/11/20/t-dot-co/&quot;&gt;Alternative t.co unshorten approach by Ed Summers&lt;/a&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://github.com/docnow/twitter-archive-unshorten&quot;&gt;twitter-archive-unshorten tool&lt;/a&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://wiert.me/2022/11/12/exporting-your-twitter-content-converting-to-markdown-and-getting-the-image-alt-texts-thanks-isotopp-hbeckpdx-for-the-info-and-kcgreenn-dreamjar-for-the-comic/&quot;&gt;Jeroen Wiert Pluimers on exporting your Twitter content, converting to Markdown and getting the image alt-texts&lt;/a&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://github.com/mhucka/taupe&quot;&gt;Taupe tool - extracts URLs from Twitter archive&lt;/a&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://techcrunch.com/2022/11/21/quit-twitter-better-with-these-free-tools-that-make-archiving-a-breeze/&quot;&gt;Quit Twitter better with these free tools that make archiving a breeze (TechCrunch)&lt;/a&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://github.com/tweetback/tweetback/&quot;&gt;Tweetback by Zach Leatherman&lt;/a&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://www.zachleat.com/web/tweetback/&quot;&gt;Blog post on Tweetback&lt;/a&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://github.com/sqiouyilu/twitter-nest&quot;&gt;Twitter-nest by S. Qiouyi Lu&lt;/a&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://help.archive.org/help/how-to-archive-your-tweets-with-the-wayback-machine/&quot;&gt;How to archive your Tweets with the Wayback Machine&lt;/a&gt;&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;revision-history&quot;&gt;Revision history&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;21 November 2022: updated info about Twitter’s notification when the archive download is ready.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;21 November 2022: added references to Ryan Baumann’s bookmarks export script.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;21 November 2022: added references to earlier blog post by Jeroen Wiert Pluimers.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;22 November 2022: added references to Taupe tool by Mike Hucka.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;23 November 2022: added references to TechCrunch feature.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;29 November 2022: added references to Tweetback.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;1 December 2022: added references to Twitter-nest.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;15 December 2022: added references to Wayback archiving; renamed and restructured final sections.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;div class=&quot;footnotes&quot; role=&quot;doc-endnotes&quot;&gt;
  &lt;ol&gt;
    &lt;li id=&quot;fn:4&quot;&gt;
      &lt;p&gt;More details can be found &lt;a href=&quot;https://help.twitter.com/en/managing-your-account/accessing-your-twitter-data&quot;&gt;here&lt;/a&gt;, although some of the info looks slightly out of date. &lt;a href=&quot;#fnref:4&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:1&quot;&gt;
      &lt;p&gt;The large “Export fedifinder_accounts.csv” link will give you a file that only includes Fediverse accounts. This can be useful for automating your follows on Mastodon, but if you want detailed information on &lt;em&gt;all&lt;/em&gt; Twitter accounts you (also) need to use the small link! &lt;a href=&quot;#fnref:1&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:5&quot;&gt;
      &lt;p&gt;Among other things, it also fixes some issues with Direct Messages, which by default don’t include user handles. &lt;a href=&quot;#fnref:5&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:3&quot;&gt;
      &lt;p&gt;The version of Archive parser I used wasn’t able to produce lists of followers and followings. More recent versions (20 November 2022 and onward) do have this functionality, but the output is less detailed than FediFinder. Also, it doesn’t distinguish between direct follows and follows through a list. So it’s probably a good idea to use FediFinder for this. &lt;a href=&quot;#fnref:3&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:2&quot;&gt;
      &lt;p&gt;The name of this folder looks something like this:&lt;br /&gt;
  “twitter-2022-11-07-2366bc80316…4e7b77”. &lt;a href=&quot;#fnref:2&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
  &lt;/ol&gt;
&lt;/div&gt;
</description>
                <link>https://bitsgalore.org/2022/11/20/how-to-preserve-your-personal-twitter-archive</link>
                <guid>https://bitsgalore.org/2022/11/20/how-to-preserve-your-personal-twitter-archive</guid>
                <pubDate>2022-11-20T00:00:00+01:00</pubDate>
        </item>

        <item>
                <title>Wheel Out the Digital Dark Age Klaxon!</title>
                <description>&lt;figure class=&quot;image&quot;&gt;
&lt;iframe width=&quot;750&quot; height=&quot;422&quot; src=&quot;https://www.youtube-nocookie.com/embed/C47ZCosJPAw&quot; title=&quot;YouTube video player&quot; frameborder=&quot;0&quot; allow=&quot;accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture&quot; allowfullscreen=&quot;&quot;&gt;&lt;/iframe&gt;
&lt;/figure&gt;

&lt;p&gt;Dutch electro outfit the Digital Dark Age Crew are one of the forgotten legends that used to be a mainstay of Rotterdam’s late 90s to mid-2000s underground electro scene. Their music was characterised by relentless electro beats, sparse synth lines, and lyrics that typically commented on the fragility and transience of digital media and digital information in general. In a twisted turn of events, this very theme would eventually define the Digital Dark Age Crew’s own history, ultimately leading to the group’s dramatic demise in 2007. After a fifteen-year absence, the Digital Dark Age Crew have now made a long-overdue comeback with their new track “Wheel Out the Digital Dark Age Klaxon”, which was released today on the occasion of &lt;a href=&quot;https://www.dpconline.org/events/world-digital-preservation-day&quot;&gt;World Digital Preservation Day 2022&lt;/a&gt;. Time to take a look back at the history of the Digital Dark Age Crew, and their continued relevance today!&lt;/p&gt;

&lt;!-- more --&gt;

&lt;h2 id=&quot;early-history&quot;&gt;Early history&lt;/h2&gt;

&lt;p&gt;Much is unclear about the exact origins of the Digital Dark Age Crew, and over the years, several of the group’s former members have provided different and often conflicting accounts on this. However, music historians agree that the project was founded in 1997 by Dutch producer Marinus Nullbyte, who ultimately became the sole constant member throughout a succession of many lineup changes. Seen invariably donning a space helmet during live performances and in rare photos of the group, Nullbyte’s true identity remains a secret to this day, and has been the subject of much online speculation among fans.&lt;/p&gt;

&lt;h2 id=&quot;releases&quot;&gt;Releases&lt;/h2&gt;

&lt;p&gt;The Digital Dark Age Crew’s output was released mainly on cheap, home-burned CD-recordables. These would often spontaneously disintegrate within a year after purchase, which somehow added to the group’s mystique. From 2005 onward, they also made their releases available through &lt;a href=&quot;https://en.wikipedia.org/wiki/Myspace&quot;&gt;Myspace&lt;/a&gt;. Fourteen years later, it turned out that these were all lost as a result of Myspace’s infamous &lt;a href=&quot;https://mashable.com/article/myspace-data-loss&quot;&gt;server migration incident&lt;/a&gt;.&lt;/p&gt;

&lt;figure class=&quot;image&quot;&gt;
  &lt;img src=&quot;/images/2022/11/logo-ddac-blog.png&quot; alt=&quot;Human figure holding large floppy disk, with to its left the words Digital Dark Age Crew&quot; /&gt;
  &lt;figcaption&gt;Digital Dark Age Crew logo.&lt;/figcaption&gt;
&lt;/figure&gt;

&lt;h2 id=&quot;burn-a-million-pdfs&quot;&gt;Burn a Million PDFs&lt;/h2&gt;

&lt;p&gt;In their later years, the group shifted their focus to multimedia productions. Without doubt the most notorious of these was 2007’s “Watch the Digital Dark Age Crew Burn a Million PDFs”. While shooting video footage for this project just outside their headquarters, the group accidentally burned down their recording studio, in an unfortunate combination of misjudging the flammability of the source material they were working with, and a sudden change of wind direction. The fire, which resulted in the loss of all the master tapes of their output thus far, effectively marked the end of the Digital Dark Age Crew.&lt;/p&gt;

&lt;h2 id=&quot;digital-dark-age-klaxon&quot;&gt;Digital Dark Age Klaxon&lt;/h2&gt;

&lt;p&gt;Following the group’s dissolution, founding member Marinus Nullbyte turned his attention to a variety of other activities and projects. His fascination with the fragility of digital media continued, and was only further strengthened by the events that led to the downfall of his former group. Around 2016, Nullbyte first came up with the idea of a “digital dark age klaxon”, which was to be an intricate device designed to provide a warning signal in the event of impending digital doom. Details about the inner workings of this digital dark age klaxon remain somewhat nebulous, as Nullbyte has kept the project, which appears to be in a perpetual state of development, under close guard. He maintains though, as he has done for the past six years, that something tangible will be released “soon”.&lt;/p&gt;

&lt;h2 id=&quot;legacy&quot;&gt;Legacy&lt;/h2&gt;

&lt;p&gt;With much of their output gone forever, it’s tempting to think of the Digital Dark Age Crew as a group that is now largely forgotten, except perhaps by a few of their most determined old fans. But dig a little deeper, and it’s clear that their influence is still present today.&lt;/p&gt;

&lt;p&gt;For a start, it may be no coincidence that the term &lt;a href=&quot;https://en.wikipedia.org/wiki/Digital_dark_age&quot;&gt;“Digital Dark Age”&lt;/a&gt; made its &lt;a href=&quot;https://archive.ifla.org/IV/ifla63/63kuny1.pdf&quot;&gt;first appearance in the information science literature&lt;/a&gt; in 1997, the very year the Digital Dark Age Crew appeared on the scene. It’s not a long stretch to assume that the paper’s author took inspiration from both the message and the infectious electro sounds of the Digital Dark Age Crew’s early work. If true, the Digital Dark Age Crew may have set in motion a chain of events that ultimately led to “father of the internet” Vint Cerf &lt;a href=&quot;https://www.bbc.com/news/science-environment-31450389&quot;&gt;warning of a coming “digital Dark Age”&lt;/a&gt; in 2015.&lt;/p&gt;

&lt;p&gt;Moreover, Nullbyte’s post-Digital Dark Age Crew work on the digital dark age klaxon has been an ongoing source of puzzlement, speculation and inspiration among digital archivists and other information professionals the world over, and has even spawned &lt;a href=&quot;https://twitter.com/hashtag/WheelOutTheDigitalDarkAgeKlaxon?src=hashtag_click&amp;amp;f=live&quot;&gt;its own dedicated hashtag&lt;/a&gt; on Twitter (which, ironically, could &lt;a href=&quot;https://midrange.tedium.co/issues/elon-musk-social-media-cult-of-personality&quot;&gt;itself&lt;/a&gt; be the subject of &lt;a href=&quot;https://davidgerard.co.uk/blockchain/2022/11/03/welcome-to-twitter-mr-musk-heres-your-accordion/&quot;&gt;looming digital disaster&lt;/a&gt; at the time of writing).&lt;/p&gt;

&lt;figure class=&quot;image&quot;&gt;
  &lt;img src=&quot;/images/2022/11/m-nullbyte-1014px.jpg&quot; alt=&quot;Marinus Nullbyte, wearing a space helmet while holding a small synthesizer which has a microphone attached to it&quot; /&gt;
  &lt;figcaption&gt;Marinus Nullbyte of the Digital Dark Age Crew during a recording session for &quot;Wheel Out the Digital Dark Age Klaxon&quot;.&lt;/figcaption&gt;
&lt;/figure&gt;

&lt;h2 id=&quot;comeback&quot;&gt;Comeback&lt;/h2&gt;

&lt;p&gt;Earlier this year, Nullbyte decided to once again put on his old space helmet, and revive the Digital Dark Age Crew project. The first result of this latest incarnation of the group, now essentially a Marinus Nullbyte solo venture, is the new track “Wheel Out The Digital Dark Age Klaxon”. The trademark Digital Dark Age Crew elements are instantly recognisable. Musically, the track offers a fresh take on the classic Digital Dark Age Crew sound. The lyrics intersperse the familiar subject of digital decay with key moments from the history of the group itself. Both themes are held together by a chorus that alludes to the digital dark age klaxon, but which ultimately leaves the listener none the wiser about the specifics of this puzzling device!&lt;/p&gt;

&lt;p&gt;It will be interesting to see whether the return of the Digital Dark Age Crew is a one-off, or perhaps the start of a new string of releases. Whatever the case, for now please enjoy … Wheel Out the Digital Dark Age Klaxon!&lt;/p&gt;

&lt;figure class=&quot;image&quot;&gt;
&lt;iframe width=&quot;750&quot; height=&quot;422&quot; src=&quot;https://www.youtube-nocookie.com/embed/C47ZCosJPAw&quot; title=&quot;YouTube video player&quot; frameborder=&quot;0&quot; allow=&quot;accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture&quot; allowfullscreen=&quot;&quot;&gt;&lt;/iframe&gt;
&lt;figcaption&gt;Digital Dark Age Crew -  Wheel Out the Digital Dark Age Klaxon.&lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;Direct link to video on YouTube, in case the embedded player doesn’t work:&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://youtu.be/C47ZCosJPAw&quot;&gt;https://youtu.be/C47ZCosJPAw&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Video and audio (FLAC) files are available for download from Zenodo:&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://zenodo.org/record/7390809&quot;&gt;https://zenodo.org/record/7390809&lt;/a&gt;&lt;/p&gt;
</description>
                <link>https://bitsgalore.org/2022/11/03/wheel-out-the-digital-dark-age-klaxon</link>
                <guid>https://bitsgalore.org/2022/11/03/wheel-out-the-digital-dark-age-klaxon</guid>
                <pubDate>2022-11-03T00:00:00+01:00</pubDate>
        </item>

        <item>
                <title>Identification of physical storage media and devices with Python and the Windows API</title>
                <description>&lt;figure class=&quot;image&quot;&gt;
  &lt;img src=&quot;/images/2022/06/storage-media.jpg&quot; alt=&quot;Still life of assorted storage media&quot; /&gt;
&lt;/figure&gt;

&lt;p&gt;This blog post covers some techniques that can be used to identify storage media and storage devices using Python and the Windows API. This can be useful for distinguishing between different types of portable storage media, such as floppy disks and USB thumb drives. It also presents a demo script that integrates these techniques.&lt;/p&gt;

&lt;!-- more --&gt;

&lt;h2 id=&quot;preservation-of-portable-storage-media&quot;&gt;Preservation of portable storage media&lt;/h2&gt;

&lt;p&gt;In 2019 the KB started a &lt;a href=&quot;https://www.kb.nl/over-ons/projecten/digitalisering-optische-dragers&quot;&gt;major initiative&lt;/a&gt; to safeguard the information on optical data carriers in its collection, such as CD-ROMs, DVDs and audio CDs. The &lt;a href=&quot;https://github.com/KBNLresearch/iromlab&quot;&gt;Iromlab&lt;/a&gt; software is a central component of the workflow we’re using for this. The optical media preservation project will most likely reach its completion near the end of this year. As the KB collection also contains many other types of portable digital storage media (see &lt;a href=&quot;/2020/02/20/offline-digital-carriers-kb-deposit-collection&quot;&gt;this blog post for an overview&lt;/a&gt;), we’re currently looking into ways to preserve those as well. The first candidate will be 3.5 inch floppy disks. I’m currently working on a derivative of the &lt;em&gt;Iromlab&lt;/em&gt; software that can be used for imaging floppies, but also other types of portable media, such as USB thumb drives. Like &lt;em&gt;Iromlab&lt;/em&gt;, it wraps around &lt;a href=&quot;https://www.isobuster.com/&quot;&gt;IsoBuster&lt;/a&gt; to do the actual imaging. Since I’d like the new software to be able to deal with a variety of physical storage media types and devices, it would be useful to have a way to automatically identify the medium type prior to the imaging stage. On Windows (which is the target environment for this work), this kind of hardware identification is possible through the &lt;a href=&quot;https://docs.microsoft.com/en-us/windows/win32/api/&quot;&gt;Win32 API&lt;/a&gt;.&lt;/p&gt;

&lt;h2 id=&quot;challenges&quot;&gt;Challenges&lt;/h2&gt;

&lt;p&gt;In Python, parts of the Win32 API can be accessed using the &lt;a href=&quot;https://github.com/mhammond/pywin32&quot;&gt;pywin32&lt;/a&gt; wrapper extensions, but using these extensions can be a bit of a challenge. There are a couple of reasons for this. First, the pywin32 documentation is incomplete and partially outdated. This means you’ll often have to rely on Microsoft’s documentation of the underlying C++ API. Using the API also involves some fairly low-level operations, such as creating file handles, defining output buffers and parsing binary output. Many of the (few) relevant code examples that can be found online are also outdated, which meant I had to combine examples and documentation from a variety of sources in order to get things working. Taken together, this all makes working with the Win32 API pretty daunting.&lt;/p&gt;

&lt;h2 id=&quot;purpose-of-this-post&quot;&gt;Purpose of this post&lt;/h2&gt;

&lt;p&gt;In this post I’ll try to document how I made basic media and device identification work for me. Given any device attached to a logical Windows drive (e.g. the “A drive”, “D drive”, etcetera), the objectives here are to identify:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;the storage media type (e.g. hard disk or floppy disk);&lt;/li&gt;
  &lt;li&gt;the hardware device, and the storage media types that are supported by it.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;I’ll link to the relevant documentation throughout. At the end of this post I also present a simple demo script that ties everything together, and which could be used as a basis for your own code.&lt;/p&gt;

&lt;h2 id=&quot;preparation&quot;&gt;Preparation&lt;/h2&gt;

&lt;p&gt;In order to use the techniques described here, you’ll need &lt;a href=&quot;https://github.com/mhammond/pywin32&quot;&gt;pywin32&lt;/a&gt;, which you can install using:&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;pip &lt;span class=&quot;nb&quot;&gt;install &lt;/span&gt;pywin32
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Then create a new Python file, and add the following imports:&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;sys&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;struct&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;argparse&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;win32api&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;win32file&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;winioctlcon&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h2 id=&quot;device-name&quot;&gt;Device name&lt;/h2&gt;

&lt;p&gt;First of all we need the device name that corresponds to the logical drive we’re interested in (e.g. drive A, C, D, and so on). Using the “A” drive as an example:&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;drive&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;A&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;# Low-level device name of device assigned to logical drive
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;driveDevice&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;  &lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;se&quot;&gt;\\\\&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;se&quot;&gt;\\&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;drive&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
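
&lt;p&gt;If the drive letter comes from user input, it may help to normalise it first. The sketch below is not part of the demo script; the helper name &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;drive_device_path&lt;/code&gt; is made up for illustration, and it assumes that only single-letter drives (with or without a trailing colon or backslash) need to be supported:&lt;/p&gt;

```python
import string


def drive_device_path(drive):
    """Return the low-level device path for a logical drive letter.

    Hypothetical helper: accepts input such as "a", "A:" or "A:\\".
    """
    letter = drive.strip().rstrip(":\\").upper()
    if len(letter) != 1 or letter not in string.ascii_uppercase:
        raise ValueError("not a drive letter: " + repr(drive))
    # Raw string, so the backslashes do not need to be doubled
    return r"\\.\{}:".format(letter)


print(drive_device_path("a"))
```

&lt;p&gt;The returned string has the same form as &lt;em&gt;driveDevice&lt;/em&gt; above, so it can be used in its place.&lt;/p&gt;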

&lt;h2 id=&quot;create-file-handle&quot;&gt;Create file handle&lt;/h2&gt;

&lt;p&gt;Next we create a file handle for this device, using the &lt;a href=&quot;http://timgolden.me.uk/pywin32-docs/win32file__CreateFile_meth.html&quot;&gt;win32file.CreateFile&lt;/a&gt; method:&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;handle&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;win32file&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nc&quot;&gt;CreateFile&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;driveDevice&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
                              &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
                              &lt;span class=&quot;n&quot;&gt;win32file&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;FILE_SHARE_READ&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
                              &lt;span class=&quot;bp&quot;&gt;None&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
                              &lt;span class=&quot;n&quot;&gt;win32file&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;OPEN_EXISTING&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
                              &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
                              &lt;span class=&quot;bp&quot;&gt;None&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h2 id=&quot;retrieve-disk-geometry-info&quot;&gt;Retrieve disk geometry info&lt;/h2&gt;

&lt;p&gt;We can now use the &lt;a href=&quot;http://timgolden.me.uk/pywin32-docs/win32file__DeviceIoControl_meth.html&quot;&gt;win32file.DeviceIoControl&lt;/a&gt; function to obtain information about a physical disk:&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;diskGeometry&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;win32file&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nc&quot;&gt;DeviceIoControl&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;handle&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
                            &lt;span class=&quot;n&quot;&gt;winioctlcon&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;IOCTL_DISK_GET_DRIVE_GEOMETRY&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
                            &lt;span class=&quot;bp&quot;&gt;None&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
                            &lt;span class=&quot;mi&quot;&gt;24&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;In this case, the &lt;em&gt;DeviceIoControl&lt;/em&gt; call takes four arguments:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;the file handle for the device;&lt;/li&gt;
  &lt;li&gt;the control code &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;winioctlcon.IOCTL_DISK_GET_DRIVE_GEOMETRY&lt;/code&gt; (documented &lt;a href=&quot;https://docs.microsoft.com/en-us/windows/win32/api/winioctl/ni-winioctl-ioctl_disk_get_drive_geometry&quot;&gt;here&lt;/a&gt;), which tells &lt;em&gt;DeviceIoControl&lt;/em&gt; to retrieve information about a physical disk’s geometry;&lt;/li&gt;
  &lt;li&gt;an input buffer, which is set to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;None&lt;/code&gt; in this case because it is not used;&lt;/li&gt;
  &lt;li&gt;the output buffer size in bytes.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The output buffer size must be equal to (or larger than) the output that is returned by the function call. The output (a string of raw bytes) is defined by the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;DISK_GEOMETRY&lt;/code&gt; structure, which is &lt;a href=&quot;https://docs.microsoft.com/en-us/windows/win32/api/winioctl/ns-winioctl-disk_geometry&quot;&gt;documented here&lt;/a&gt;. It contains 5 fields. The first field is a &lt;a href=&quot;https://docs.microsoft.com/en-us/windows/win32/api/winnt/ns-winnt-large_integer-r1&quot;&gt;large integer&lt;/a&gt; (8 bytes); the remaining fields are all &lt;a href=&quot;https://docs.microsoft.com/en-us/openspecs/windows_protocols/ms-dtyp/262627d8-3418-4627-9218-4ffe110850b2&quot;&gt;4-byte unsigned integers&lt;/a&gt;. This means the total size of the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;DISK_GEOMETRY&lt;/code&gt; structure is 24 bytes, so we use this as the buffer size.&lt;/p&gt;
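
&lt;p&gt;The 24-byte figure can be double-checked without touching any hardware. The sketch below mirrors the five fields in a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ctypes&lt;/code&gt; structure; this is an alternative to slicing the bytes by hand, not what the demo script does. The class name &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;DiskGeometry&lt;/code&gt; and the sample bytes are made up for illustration (the values are those of a standard 1.44 MB floppy, not output captured from a real drive):&lt;/p&gt;

```python
import ctypes


class DiskGeometry(ctypes.LittleEndianStructure):
    """Mirror of the Win32 DISK_GEOMETRY structure (illustrative)."""
    _fields_ = [
        ("Cylinders", ctypes.c_int64),            # LARGE_INTEGER, 8 bytes
        ("MediaType", ctypes.c_uint32),           # MEDIA_TYPE code
        ("TracksPerCylinder", ctypes.c_uint32),
        ("SectorsPerTrack", ctypes.c_uint32),
        ("BytesPerSector", ctypes.c_uint32),
    ]


# Size of the structure, and therefore the minimum output buffer size
print(ctypes.sizeof(DiskGeometry))  # 24

# Made-up sample: 80 cylinders, media type 2, 2 tracks per cylinder,
# 18 sectors per track, 512 bytes per sector
sample = b"".join(value.to_bytes(size, "little")
                  for value, size in [(80, 8), (2, 4), (2, 4), (18, 4), (512, 4)])

geometry = DiskGeometry.from_buffer_copy(sample)
capacity = (geometry.Cylinders * geometry.TracksPerCylinder *
            geometry.SectorsPerTrack * geometry.BytesPerSector)
print(geometry.MediaType, capacity)  # 2 1474560
```

&lt;p&gt;Multiplying the four geometry fields gives the raw capacity in bytes, which can serve as a sanity check on the parsed values.&lt;/p&gt;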

&lt;h3 id=&quot;parse-disk_geometry&quot;&gt;Parse DISK_GEOMETRY&lt;/h3&gt;

&lt;p&gt;The &lt;em&gt;MediaType&lt;/em&gt; field is the second item of the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;DISK_GEOMETRY&lt;/code&gt; structure. It is an unsigned integer (4 bytes) that starts at byte offset 8 of our &lt;em&gt;diskGeometry&lt;/em&gt; variable. We can use Python’s &lt;a href=&quot;https://docs.python.org/3/library/struct.html&quot;&gt;struct module&lt;/a&gt; to interpret the raw bytes into an integer value:&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;offset&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;8&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;mediaTypeCode&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;struct&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;unpack&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;lt;I&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;diskGeometry&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;offset&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;offset&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;])[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Here the “&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;&amp;lt;I&lt;/code&gt;” format string tells &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;struct.unpack&lt;/code&gt; that the bytes represent a little-endian unsigned integer. Note that &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;struct.unpack&lt;/code&gt; always returns a tuple, hence the “&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;[0]&lt;/code&gt;” index at the end.&lt;/p&gt;
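
&lt;p&gt;To illustrate why the byte order matters, the sketch below interprets the same four (made-up) bytes in two ways. It uses &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;int.from_bytes&lt;/code&gt; for the little-endian reading, which is simply another way to obtain the same integer, and a big-endian &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;struct.unpack&lt;/code&gt; call for contrast:&lt;/p&gt;

```python
import struct

raw = bytes([2, 0, 0, 0])  # four made-up bytes, least significant byte first

# Little-endian interpretation, which is what Windows uses
print(int.from_bytes(raw, "little"))  # 2

# The same bytes read as big-endian give a very different number;
# note the one-element tuple that struct.unpack returns
print(struct.unpack(">I", raw))  # (33554432,)
```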

&lt;p&gt;The resulting &lt;em&gt;mediaTypeCode&lt;/em&gt; value is an integer number. These numbers can be mapped back to media type strings using the &lt;em&gt;MEDIA_TYPE&lt;/em&gt; enumeration in &lt;a href=&quot;https://github.com/mhammond/pywin32/blob/main/win32/Lib/winioctlcon.py&quot;&gt;winioctlcon.py&lt;/a&gt;. These strings are in turn documented &lt;a href=&quot;https://docs.microsoft.com/en-us/windows/win32/api/winioctl/ne-winioctl-media_type&quot;&gt;here&lt;/a&gt;.&lt;/p&gt;
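
&lt;p&gt;A lookup of this kind could look roughly like the sketch below. The names are copied by hand from the first 13 members of the &lt;em&gt;MEDIA_TYPE&lt;/em&gt; enumeration in Microsoft’s documentation, so the table is deliberately incomplete; the function name &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;media_type_name&lt;/code&gt; is made up for illustration and is not necessarily how the demo script does it:&lt;/p&gt;

```python
# First 13 members of the MEDIA_TYPE enumeration (codes 0 to 12),
# copied by hand; later members are left out of this sketch
MEDIA_TYPES = dict(enumerate([
    "Unknown", "F5_1Pt2_512", "F3_1Pt44_512", "F3_2Pt88_512",
    "F3_20Pt8_512", "F3_720_512", "F5_360_512", "F5_320_512",
    "F5_320_1024", "F5_180_512", "F5_160_512", "RemovableMedia",
    "FixedMedia",
]))


def media_type_name(code):
    """Return the MEDIA_TYPE name for a code, or a placeholder if unlisted."""
    return MEDIA_TYPES.get(code, "Unlisted ({})".format(code))


print(media_type_name(2))   # F3_1Pt44_512
print(media_type_name(11))  # RemovableMedia
```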

&lt;h2 id=&quot;additional-device-info&quot;&gt;Additional device info&lt;/h2&gt;

&lt;p&gt;Although the above method already allows us to identify, as an example, many types of floppy disks, it cannot distinguish between a USB thumb drive and a CD-ROM drive, both of which are simply identified as “RemovableMedia”. For some more granularity, we can call &lt;a href=&quot;http://timgolden.me.uk/pywin32-docs/win32file__DeviceIoControl_meth.html&quot;&gt;win32file.DeviceIoControl&lt;/a&gt; with the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;IOCTL_STORAGE_GET_MEDIA_TYPES_EX&lt;/code&gt; control code (documented &lt;a href=&quot;https://docs.microsoft.com/en-us/windows/win32/api/winioctl/ni-winioctl-ioctl_storage_get_media_types_ex&quot;&gt;here&lt;/a&gt;):&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;getMediaTypes&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;win32file&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nc&quot;&gt;DeviceIoControl&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;handle&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
                            &lt;span class=&quot;n&quot;&gt;winioctlcon&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;IOCTL_STORAGE_GET_MEDIA_TYPES_EX&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
                            &lt;span class=&quot;bp&quot;&gt;None&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
                            &lt;span class=&quot;mi&quot;&gt;2048&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The function arguments are largely identical to the earlier (disk geometry) call, but this time we use a 2048 byte output buffer. This is a somewhat arbitrary value. Unlike &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;IOCTL_DISK_GET_DRIVE_GEOMETRY&lt;/code&gt;, the output size of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;IOCTL_STORAGE_GET_MEDIA_TYPES_EX&lt;/code&gt; is not fixed (see also the explanation that follows below), so I’m simply using a fairly large value to ensure the buffer will be large enough to fit the output under all circumstances. The output (again a string of raw bytes) is defined by the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;GET_MEDIA_TYPES&lt;/code&gt; structure, which is &lt;a href=&quot;https://docs.microsoft.com/en-us/windows/win32/api/winioctl/ns-winioctl-get_media_types&quot;&gt;documented here&lt;/a&gt;.&lt;/p&gt;

&lt;h3 id=&quot;parse-get_media_types&quot;&gt;Parse GET_MEDIA_TYPES&lt;/h3&gt;

&lt;p&gt;The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;GET_MEDIA_TYPES&lt;/code&gt; structure is made up of the following fields:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;a 4-byte unsigned integer that represents a device type code;&lt;/li&gt;
  &lt;li&gt;another 4-byte unsigned integer that represents the number of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;DEVICE_MEDIA_INFO&lt;/code&gt; structures;&lt;/li&gt;
  &lt;li&gt;an array of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;DEVICE_MEDIA_INFO&lt;/code&gt; structures, the first of which starts directly after the two integers (at byte offset 8).&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;We can read the device type code and the number of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;DEVICE_MEDIA_INFO&lt;/code&gt; structures like this:&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;deviceCode&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;struct&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;unpack&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;lt;I&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;getMediaTypes&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;])[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;mediaInfoCount&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;struct&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;unpack&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;lt;I&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;getMediaTypes&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;8&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;])[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The resulting &lt;em&gt;deviceCode&lt;/em&gt; value is an integer number, which again can be mapped back to a device type string using an enumeration in &lt;a href=&quot;https://github.com/mhammond/pywin32/blob/main/win32/Lib/winioctlcon.py&quot;&gt;winioctlcon.py&lt;/a&gt;. The value of &lt;em&gt;mediaInfoCount&lt;/em&gt; tells us the number of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;DEVICE_MEDIA_INFO&lt;/code&gt; structures we need to parse.&lt;/p&gt;
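
&lt;p&gt;As a rough sketch, a device type lookup could be as simple as the dictionary below. It only lists a handful of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;FILE_DEVICE_*&lt;/code&gt; codes that are relevant for storage devices, copied by hand from the Windows headers; the function names are made up for illustration. The two tape codes are worth singling out, because tape devices need separate handling when parsing &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;DEVICE_MEDIA_INFO&lt;/code&gt;:&lt;/p&gt;

```python
# Hand-copied subset of FILE_DEVICE_* codes; many others exist
DEVICE_TYPES = {
    0x02: "FILE_DEVICE_CD_ROM",
    0x07: "FILE_DEVICE_DISK",
    0x1F: "FILE_DEVICE_TAPE",
    0x20: "FILE_DEVICE_TAPE_FILE_SYSTEM",
    0x33: "FILE_DEVICE_DVD",
}

# 31 and 32 in decimal
TAPE_DEVICE_CODES = {0x1F, 0x20}


def device_type_name(code):
    """Return the FILE_DEVICE_* name for a code, or a placeholder if unlisted."""
    return DEVICE_TYPES.get(code, "Unlisted ({})".format(code))


def is_tape_device(code):
    """True if the device code denotes a tape device."""
    return code in TAPE_DEVICE_CODES


print(device_type_name(7), is_tape_device(7))    # FILE_DEVICE_DISK False
print(device_type_name(31), is_tape_device(31))  # FILE_DEVICE_TAPE True
```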

&lt;h3 id=&quot;parse-device_media_info&quot;&gt;Parse DEVICE_MEDIA_INFO&lt;/h3&gt;

&lt;p&gt;The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;DEVICE_MEDIA_INFO&lt;/code&gt; structure itself is documented &lt;a href=&quot;https://docs.microsoft.com/en-us/windows/win32/api/winioctl/ns-winioctl-device_media_info&quot;&gt;here&lt;/a&gt;. At first sight this may look a little intimidating, but essentially it just describes a “union” of 3 possible 32-byte structures. This means that each &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;DEVICE_MEDIA_INFO&lt;/code&gt; instance holds exactly one of those structures. For the purpose of this post, we’re only interested in the &lt;em&gt;MediaType&lt;/em&gt; field, which sits at an identical location (the second item of the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;DEVICE_MEDIA_INFO&lt;/code&gt; structure, right after an 8-byte cylinders value) for two of the three possible variants. Only for tape devices is &lt;em&gt;MediaType&lt;/em&gt; the first item. Tape devices can be identified from the value of &lt;em&gt;deviceCode&lt;/em&gt; (31 or 32 in the code below). With this information, we can use the code below to iterate over all &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;DEVICE_MEDIA_INFO&lt;/code&gt; structures, and extract their respective media type codes:&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c1&quot;&gt;# Start position in GET_MEDIA_TYPES structure
# (remember we already read two 4-byte integers from it, 
# hence the 8 byte start offset!)
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;offset&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;8&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;# Loop over DEVICE_MEDIA_INFO structures
&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;range&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;mediaInfoCount&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;deviceCode&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;31&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;32&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]:&lt;/span&gt;
        &lt;span class=&quot;c1&quot;&gt;# Tape device, mediaTypeCode is first item
&lt;/span&gt;        &lt;span class=&quot;n&quot;&gt;mediaTypeCode&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;struct&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;unpack&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;lt;I&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
                          &lt;span class=&quot;n&quot;&gt;getMediaTypes&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;offset&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;offset&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;])[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;offset&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;8&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;else&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
        &lt;span class=&quot;c1&quot;&gt;# Not a tape device, so skip 8 byte cylinders value
&lt;/span&gt;        &lt;span class=&quot;n&quot;&gt;offset&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;8&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;mediaTypeCode&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;struct&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;unpack&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&amp;lt;I&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
                          &lt;span class=&quot;n&quot;&gt;getMediaTypes&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;offset&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;offset&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;])[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;

    &lt;span class=&quot;c1&quot;&gt;# Skip to position of next DEVICE_MEDIA_INFO structure
&lt;/span&gt;    &lt;span class=&quot;n&quot;&gt;offset&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;24&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The resulting &lt;em&gt;mediaTypeCode&lt;/em&gt; values are integer numbers that can be mapped back to media type strings using the &lt;em&gt;MEDIA_TYPE&lt;/em&gt; and &lt;em&gt;STORAGE_MEDIA_TYPE&lt;/em&gt; enumerations in &lt;a href=&quot;https://github.com/mhammond/pywin32/blob/main/win32/Lib/winioctlcon.py&quot;&gt;winioctlcon.py&lt;/a&gt;. These strings are in turn documented &lt;a href=&quot;https://docs.microsoft.com/en-us/windows/win32/api/winioctl/ne-winioctl-media_type&quot;&gt;here&lt;/a&gt; and &lt;a href=&quot;https://docs.microsoft.com/en-us/windows/win32/api/winioctl/ne-winioctl-storage_media_type&quot;&gt;here&lt;/a&gt;.&lt;/p&gt;
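
&lt;p&gt;Because the loop above overwrites &lt;em&gt;mediaTypeCode&lt;/em&gt; on every pass, it can be convenient to wrap the parsing in a function that collects all codes in a list. The sketch below does this for a synthetic buffer, so it can be tried without any hardware attached. The function names and the sample buffer are made up for illustration, and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;int.from_bytes&lt;/code&gt; is used here instead of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;struct.unpack&lt;/code&gt;:&lt;/p&gt;

```python
TAPE_DEVICE_CODES = (31, 32)
MEDIA_INFO_SIZE = 32  # size in bytes of one DEVICE_MEDIA_INFO structure


def u32(buf, offset):
    """Read a little-endian 4-byte unsigned integer at the given offset."""
    return int.from_bytes(buf[offset:offset + 4], "little")


def parse_media_types(buf):
    """Return (device code, list of media type codes) from raw output bytes."""
    device_code = u32(buf, 0)
    count = u32(buf, 4)
    # MediaType is the first item for tape devices; for all other
    # devices it follows an 8-byte cylinders value
    field_offset = 0 if device_code in TAPE_DEVICE_CODES else 8
    codes = []
    for index in range(count):
        start = 8 + index * MEDIA_INFO_SIZE
        codes.append(u32(buf, start + field_offset))
    return device_code, codes


def pack_u32(*values):
    """Pack integers as consecutive little-endian 4-byte values."""
    return b"".join(value.to_bytes(4, "little") for value in values)


# Made-up buffer: disk device (code 7) that reports two structures,
# with media type codes 11 and 12; all other fields are zero
disk_buf = pack_u32(7, 2)
for code in (11, 12):
    disk_buf += (0).to_bytes(8, "little") + pack_u32(code, 0, 0, 0, 0, 0)

print(parse_media_types(disk_buf))  # (7, [11, 12])
```

&lt;p&gt;Computing each start position from the index, rather than advancing a running offset, keeps the tape and non-tape branches identical apart from the position of the field inside the structure.&lt;/p&gt;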

&lt;h2 id=&quot;putting-it-all-together&quot;&gt;Putting it all together&lt;/h2&gt;

&lt;p&gt;I created a simple &lt;a href=&quot;https://github.com/KBNLresearch/detectStorageMediaType/blob/main/detectStorageMediaType.py&quot;&gt;demo script&lt;/a&gt; that ties everything discussed here together. It also contains lookup functions that translate the media type and device codes into human-readable strings. You can run it from the command prompt with one or more logical drive names as command-line arguments. For example:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;python detectStorageMediaType.py A D E
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The script then tries to establish the media type (from the disk geometry), the device type and the media types that are supported by the device, and reports the results back like this:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;Drive:                   A
Media type:              F3_1Pt44_512

Drive:                   D
Media type:              RemovableMedia
Device type:             FILE_DEVICE_CD_ROM
Supported media types:
                         CD_ROM
                         RemovableMedia

Drive:                   E
Media type:              RemovableMedia
Device type:             FILE_DEVICE_DISK
Supported media types:
                         RemovableMedia
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;You may note that the output is incomplete in some cases, which is because the API calls do not work for all devices. In the above example, the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;IOCTL_STORAGE_GET_MEDIA_TYPES_EX&lt;/code&gt; control code could not be used to access the floppy drive attached to logical drive &lt;em&gt;A&lt;/em&gt;, so no device output is produced for that drive.&lt;/p&gt;

&lt;p&gt;This example also showcases the level of detail that is reported for floppy disks. As can be seen &lt;a href=&quot;https://docs.microsoft.com/en-us/windows/win32/api/winioctl/ne-winioctl-media_type&quot;&gt;here&lt;/a&gt;, the code &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;F3_1Pt44_512&lt;/code&gt; indicates a 3.5” floppy disk with a capacity of 1.44 MB and 512 bytes per sector.&lt;/p&gt;

&lt;p&gt;Finally, the logical drive &lt;em&gt;D&lt;/em&gt; in the above example was a virtual optical drive with an ISO image of a DVD-ROM attached to it. Despite that, the device is identified as “FILE_DEVICE_CD_ROM” (not “FILE_DEVICE_DVD”!), and DVDs are not listed as a supported media type. This could be a limitation of my test approach, which was based on Windows 10 running in VirtualBox inside a Linux host environment&lt;sup id=&quot;fnref:1&quot;&gt;&lt;a href=&quot;#fn:1&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;1&lt;/a&gt;&lt;/sup&gt;. This would need additional testing with physical hardware.&lt;/p&gt;

&lt;h2 id=&quot;final-note&quot;&gt;Final note&lt;/h2&gt;

&lt;p&gt;This post only scratches the surface of what’s possible with the Windows API, but hopefully it will be useful to others to start their own explorations. I’ve only tested the code presented here with a limited number of devices (both physical and virtual ones) attached to a virtual machine running Windows 10. If you run into any surprises, or have suggestions for improvements, feel free to leave a comment here, or open a pull request for the demo script.&lt;/p&gt;

&lt;h2 id=&quot;further-resources&quot;&gt;Further resources&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;https://github.com/KBNLresearch/detectStorageMediaType&quot;&gt;Demo script&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;http://timgolden.me.uk/pywin32-docs/&quot;&gt;PyWin32 Documentation&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://docs.microsoft.com/en-us/windows/win32/api/&quot;&gt;Programming reference for the Win32 API (Microsoft)&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;div class=&quot;footnotes&quot; role=&quot;doc-endnotes&quot;&gt;
  &lt;ol&gt;
    &lt;li id=&quot;fn:1&quot;&gt;
      &lt;p&gt;For reasons unknown to me I was unable to attach a physical DVD drive to this virtual machine for my tests. &lt;a href=&quot;#fnref:1&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
  &lt;/ol&gt;
&lt;/div&gt;
</description>
                <link>https://bitsgalore.org/2022/06/14/identification-of-physical-storage-media-with-python-and-the-windows-api</link>
                <guid>https://bitsgalore.org/2022/06/14/identification-of-physical-storage-media-with-python-and-the-windows-api</guid>
                <pubDate>2022-06-14T00:00:00+02:00</pubDate>
        </item>

        <item>
                <title>Introducing Isolyzer 1.4</title>
                <description>&lt;figure class=&quot;image&quot;&gt;
  &lt;img src=&quot;/images/2022/04/cds.jpg&quot; alt=&quot;Compact Discs still life&quot; /&gt;
&lt;/figure&gt;

&lt;p&gt;It’s been a while since the last release of the &lt;a href=&quot;https://github.com/KBNLresearch/isolyzer&quot;&gt;Isolyzer&lt;/a&gt; tool, but after four years of near-inactivity I just published &lt;a href=&quot;https://github.com/KBNLresearch/isolyzer/releases/tag/1.4.0&quot;&gt;Isolyzer 1.4&lt;/a&gt;. In this post I provide some background information on how this release came about, and I briefly explain the main changes.&lt;/p&gt;

&lt;!-- more --&gt;

&lt;h2 id=&quot;history-of-isolyzer&quot;&gt;History of Isolyzer&lt;/h2&gt;

&lt;p&gt;For those unfamiliar with the Isolyzer tool, here’s a brief recap. Isolyzer started its life in 2015. At that time I had just started working on optical media preservation and disc imaging. One of the problems I ran into was that imaging optical media would occasionally result in incomplete (i.e. truncated) disc images. Worse, I found it near impossible to reliably identify these incomplete disc images with existing software tools. After a lot of digging into the specs, I figured out how to estimate expected file sizes using the block-level information in the ISO 9660 file system. I then applied this in a dedicated Python tool. Since many “ISO images” that exist in the wild are actually hybrids of different file systems (e.g. ISO 9660, Apple HFS or HFS+ and UDF), I gradually added support for those file systems as well. As a result, Isolyzer also became increasingly useful as a tool for extracting information about file systems inside ISO images. More details about Isolyzer’s history can be found &lt;a href=&quot;/2017/01/13/detecting-broken-iso-images-introducing-isolyzer&quot;&gt;here&lt;/a&gt; and &lt;a href=&quot;/2017/07/12/update-on-isolyzer-udf-hfs-and-more&quot;&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;h2 id=&quot;apple-file-system-block-size-confusion&quot;&gt;Apple file system block size confusion&lt;/h2&gt;

&lt;p&gt;The initial trigger that started the work on this release was an email from Tyler Thorsted. He had run into a number of problematic ISO images of CD-ROMs that had been created with Roxio Toast, a widely used CD-burning software application for Apple Macintosh. Although these images contained an Apple HFS file system with Apple partition maps, the underlying file systems were not properly identified and parsed by Isolyzer (and some other tools as well). After some poking around with a hex editor, it turned out that the file system in the offending images was arranged into 2048-byte blocks, instead of the 512-byte blocks that were expected by Isolyzer. This seemed easy enough to fix: instead of assuming a fixed 512-byte block size, I changed the code to use the block size value from the “zero block” structure, and then use this to iterate over the partition maps.&lt;/p&gt;

&lt;h2 id=&quot;more-block-size-confusion-and-a-better-fix&quot;&gt;More block size confusion (and a better fix)&lt;/h2&gt;

&lt;p&gt;Sadly, Tyler noticed that the fixed code resulted in missing partition map output for some images that were handled correctly by Isolyzer prior to my fix. For this particular case, the culprit was that the block size value in the “zero block” (which read 2048 bytes) didn’t correspond to the partition map data (which were arranged into 512-byte blocks). The “official” documentation that I could find provided some rather conflicting information on the correct implementation of partition maps, and their relation to the zero block. In the end I worked around this by using a simple iterative procedure that looks for partition maps at a number of pre-defined byte offsets into the image. The resulting match at the first (smallest) byte offset then gives the correct, actual block size. More details and links to sample files can be found in the (now closed) &lt;a href=&quot;https://github.com/KBNLresearch/isolyzer/issues/22&quot;&gt;issue on Github&lt;/a&gt;.&lt;/p&gt;
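&lt;p&gt;The iterative procedure can be sketched roughly as follows in Python. This is a minimal illustration and not a copy of Isolyzer’s code: the function name and the list of candidate block sizes are assumptions, and it relies on the first partition map entry sitting in block 1 and starting with the two-byte signature “PM”.&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# Candidate block sizes to try, smallest first (assumed values)
CANDIDATE_BLOCK_SIZES = (512, 1024, 2048, 4096)
APM_SIGNATURE = b&quot;PM&quot;


def detect_apm_block_size(image_bytes):
    # Return the smallest candidate block size for which a partition
    # map signature is found, or None if no candidate matches
    for block_size in CANDIDATE_BLOCK_SIZES:
        # The first partition map entry is expected in block 1,
        # i.e. at a byte offset equal to the block size
        if image_bytes[block_size:block_size + 2] == APM_SIGNATURE:
            return block_size
    return None
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Taking the smallest matching offset matters here: a partition map with 512-byte blocks and four or more entries also has an entry at byte offset 2048, so a match at 2048 alone would not be conclusive.&lt;/p&gt;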

&lt;h2 id=&quot;high-sierra-file-system&quot;&gt;High Sierra file system&lt;/h2&gt;

&lt;p&gt;Tyler also made a &lt;a href=&quot;https://github.com/KBNLresearch/isolyzer/issues/26&quot;&gt;feature request&lt;/a&gt; for support of the &lt;a href=&quot;https://web.archive.org/web/20220111023846/https://www.os2museum.com/files/docs/cdrom/CDROM_Working_Paper-1986.pdf&quot;&gt;“High Sierra” file system&lt;/a&gt;. This was the de facto standard CD-ROM file system for a few years during the late eighties, before it was made redundant by ISO 9660. The High Sierra format is very similar to ISO 9660, and shares the same data structures and fields (although the fields often appear in a different order). This made it relatively easy to add support for it in Isolyzer. As before, see the &lt;a href=&quot;https://github.com/KBNLresearch/isolyzer/issues/26&quot;&gt;Github issue&lt;/a&gt; for more details and a link to a sample file.&lt;/p&gt;
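&lt;p&gt;To give an idea of how similar the two formats are, here is a small Python sketch that tells them apart by the identifier string in the first volume descriptor. It is an illustration only (the function name is hypothetical, and this is not how Isolyzer is structured internally). It assumes the descriptor sits at sector 16, with the identifier at byte 1 for ISO 9660 and at byte 9 for High Sierra.&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;VD_OFFSET = 16 * 2048  # first volume descriptor, at sector 16


def identify_cd_file_system(image_bytes):
    # ISO 9660: identifier CD001 at byte 1 of the descriptor
    if image_bytes[VD_OFFSET + 1:VD_OFFSET + 6] == b&quot;CD001&quot;:
        return &quot;ISO 9660&quot;
    # High Sierra: identifier CDROM at byte 9, because each
    # descriptor starts with an 8-byte block number field
    if image_bytes[VD_OFFSET + 9:VD_OFFSET + 14] == b&quot;CDROM&quot;:
        return &quot;High Sierra&quot;
    return None
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;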

&lt;h2 id=&quot;xml-schema&quot;&gt;XML schema&lt;/h2&gt;

&lt;p&gt;I also created an &lt;a href=&quot;https://github.com/KBNLresearch/isolyzer/blob/main/xsd/isolyzer-v-1-0.xsd&quot;&gt;XML schema&lt;/a&gt; that makes it possible to validate Isolyzer’s output, and added a namespace definition. This should also make it easier to embed Isolyzer’s output into other XML files, if needed. I should stress that the schema has only had limited testing so far, so please get in contact (or &lt;a href=&quot;https://github.com/KBNLresearch/isolyzer/issues/new/choose&quot;&gt;report an issue&lt;/a&gt;) if you come across unexpected behaviour.&lt;/p&gt;

&lt;h2 id=&quot;native-wildcard-expansion-on-non-windows-platforms&quot;&gt;Native wildcard expansion on non-Windows platforms&lt;/h2&gt;

&lt;p&gt;One long-standing annoyance (at least for me) has been Isolyzer’s handling of wildcard expressions at the command-line, which needed to be wrapped in quotation marks on Linux to work correctly. This is rooted in the slightly different ways wildcards are handled by Linux and Windows, respectively. This release fixes this by delegating wildcard expansion to the operating system for non-Windows operating systems. On Windows, Isolyzer still takes care of wildcard expansion (as it did before), since Windows does not do this natively.&lt;/p&gt;
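&lt;p&gt;The general idea can be sketched in Python as shown below. This is a simplified illustration with a hypothetical function name, not Isolyzer’s actual implementation. On non-Windows systems the shell has already expanded the wildcards, so the arguments are used as they are. On Windows each argument is expanded with the standard &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;glob&lt;/code&gt; module.&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;import glob
import platform


def expand_input_args(args, system=None):
    # Return the list of input files for the given command-line
    # arguments; system can be set explicitly for testing
    system = system or platform.system()
    if system != &quot;Windows&quot;:
        # The shell has already expanded any wildcards
        return list(args)
    expanded = []
    for arg in args:
        matches = sorted(glob.glob(arg))
        # Keep the argument as-is if nothing matches, so the caller
        # can report it as a missing file
        expanded.extend(matches or [arg])
    return expanded
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;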

&lt;h2 id=&quot;python-2-no-longer-supported&quot;&gt;Python 2 no longer supported&lt;/h2&gt;

&lt;p&gt;This release also no longer provides support for Python 2 (which is now obsolete), so you’ll have to use Python 3. Windows users can also use the stand-alone binaries, which don’t require Python at all.&lt;/p&gt;

&lt;h2 id=&quot;other-changes&quot;&gt;Other changes&lt;/h2&gt;

&lt;p&gt;In addition to the above changes, this release also includes several minor bug fixes. I have also expanded the set of test files, and set up some unit tests that use these test files. These changes are all invisible to the user, but they make the Isolyzer release process more straightforward.&lt;/p&gt;

&lt;h2 id=&quot;installation&quot;&gt;Installation&lt;/h2&gt;

&lt;p&gt;For a fresh single-user install with pip use:&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;pip &lt;span class=&quot;nb&quot;&gt;install &lt;/span&gt;isolyzer &lt;span class=&quot;nt&quot;&gt;--user&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;To upgrade an existing version of Isolyzer, use:&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;pip &lt;span class=&quot;nb&quot;&gt;install &lt;/span&gt;isolyzer &lt;span class=&quot;nt&quot;&gt;--upgrade&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;--user&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Alternatively, Windows users can use the binaries that are available from &lt;a href=&quot;https://github.com/KBNLresearch/isolyzer/releases/tag/1.4.0&quot;&gt;the release page&lt;/a&gt;. As always, these binaries are completely stand-alone, and don’t require Python on your machine.&lt;/p&gt;

&lt;h2 id=&quot;feedback-welcome&quot;&gt;Feedback welcome&lt;/h2&gt;

&lt;p&gt;As always, feedback on this release is very welcome. Please feel free to &lt;a href=&quot;https://github.com/KBNLresearch/isolyzer/issues/new/choose&quot;&gt;report an issue&lt;/a&gt; if anything doesn’t work as expected!&lt;/p&gt;

&lt;h2 id=&quot;acknowledgements&quot;&gt;Acknowledgements&lt;/h2&gt;

&lt;p&gt;Thanks are due to Tyler Thorsted for the useful online discussions on the Apple and High Sierra issues, and for providing the necessary test files.&lt;/p&gt;

&lt;h2 id=&quot;revision-history&quot;&gt;Revision history&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;7 June 2022: changed all release candidate references to the final 1.4.0 release.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;further-resources&quot;&gt;Further resources&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;https://github.com/KBNLresearch/isolyzer/releases/tag/1.4.0&quot;&gt;Isolyzer 1.4.0 release page&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://github.com/KBNLresearch/isolyzer&quot;&gt;Isolyzer on Github (with documentation)&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
</description>
                <link>https://bitsgalore.org/2022/04/20/introducing-isolyzer-1-4</link>
                <guid>https://bitsgalore.org/2022/04/20/introducing-isolyzer-1-4</guid>
                <pubDate>2022-04-20T00:00:00+02:00</pubDate>
        </item>

        <item>
                <title>Generating lossy access JP2s from lossless preservation masters</title>
                <description>&lt;figure class=&quot;image&quot;&gt;
  &lt;img src=&quot;/images/2022/03/jm-cote-wiki.jpg&quot; alt=&quot;Plumbers Tool Box&quot; /&gt;
  &lt;figcaption&gt;&lt;a href=&quot;https://commons.wikimedia.org/wiki/File:France_in_XXI_Century._Intencive_breeding.jpg&quot;&gt;Intensive Breeding&lt;/a&gt; by Jean Marc Cote, Public domain, via Wikimedia Commons.&lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;At the KB we’ve been using JP2 (&lt;a href=&quot;https://en.wikipedia.org/wiki/JPEG_2000&quot;&gt;JPEG 2000&lt;/a&gt; Part 1) as our primary image format for digitised newspapers, books and periodicals since 2007. The digitisation work is contracted out to external vendors, who supply the digitised pages as losslessly compressed preservation masters, as well as lossily compressed access images that are used within the &lt;a href=&quot;https://www.delpher.nl/&quot;&gt;Delpher&lt;/a&gt; platform.&lt;/p&gt;

&lt;p&gt;Right now the KB is in the process of &lt;a href=&quot;https://web.archive.org/web/20210215160819/https://www.kb.nl/en/news/2021/dutch-national-library-steps-into-a-new-future-of-digital-archiving&quot;&gt;migrating its digital collections to a new preservation system&lt;/a&gt;. This prompted the question whether it would be feasible to generate access JP2s from the preservation masters in-house at some point in the future, using software that runs inside the preservation system&lt;sup id=&quot;fnref:6&quot;&gt;&lt;a href=&quot;#fn:6&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;1&lt;/a&gt;&lt;/sup&gt;. As a first step towards answering that question, I created some simple proof of concept workflows, using three different JPEG 2000 codecs. I then tested these workflows with preservation master images from our collection. The main objective of this work was to find a workflow that both meets our current digitisation requirements, and is also sufficiently performant.&lt;/p&gt;

&lt;!-- more --&gt;

&lt;h2 id=&quot;master-and-access-requirements&quot;&gt;Master and access requirements&lt;/h2&gt;

&lt;p&gt;The following table lists the requirements of our preservation master and access JP2s:&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th style=&quot;text-align: left&quot;&gt;Parameter&lt;/th&gt;
      &lt;th style=&quot;text-align: left&quot;&gt;Value (master)&lt;/th&gt;
      &lt;th style=&quot;text-align: left&quot;&gt;Value (access)&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;File format&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;JP2 (JPEG 2000 Part 1)&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;JP2 (JPEG 2000 Part 1)&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Compression type&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Reversible 5-3 wavelet filter&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Irreversible 9-7 wavelet filter&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Colour transform&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Yes (only for colour images)&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Yes (only for colour images)&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Number of decomposition levels&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;5&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;5&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Progression order&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;RPCL&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;RPCL&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Tile size&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;1024 x 1024&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;1024 x 1024&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Code block size&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;64 x 64 (2&lt;sup&gt;6&lt;/sup&gt; x 2&lt;sup&gt;6&lt;/sup&gt;)&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;64 x 64 (2&lt;sup&gt;6&lt;/sup&gt; x 2&lt;sup&gt;6&lt;/sup&gt;)&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Precinct size&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;256 x 256 (2&lt;sup&gt;8&lt;/sup&gt;) for 2 highest resolution levels; 128 x 128 (2&lt;sup&gt;7&lt;/sup&gt;) for remaining resolution levels&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;256 x 256 (2&lt;sup&gt;8&lt;/sup&gt;) for 2 highest resolution levels; 128 x 128 (2&lt;sup&gt;7&lt;/sup&gt;) for remaining resolution levels&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Number of quality layers&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;11&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;8&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Target compression ratio layers&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;2560:1 [1] ; 1280:1 [2] ;  640:1 [3] ; 320:1 [4] ; 160:1 [5] ; 80:1 [6] ; 40:1 [7] ; 20:1 [8] ; 10:1 [9] ; 5:1 [10] ; - [11]&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;2560:1 [1] ; 1280:1 [2] ;  640:1 [3] ; 320:1 [4] ; 160:1 [5] ; 80:1 [6] ; 40:1 [7] ; 20:1 [8]&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Error resilience&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Start-of-packet headers; end-of-packet headers; segmentation symbols&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Start-of-packet headers; end-of-packet headers; segmentation symbols&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Sampling rate&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Stored in “Capture Resolution” fields&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Stored in “Capture Resolution” fields&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Capture metadata&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Embedded as XMP metadata in XML box&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Embedded as XMP metadata in XML box&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;As the table shows, most parameters are identical in both cases, except:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;The preservation masters are compressed losslessly, whereas for the access images irreversible (lossy) compression is used, using a fixed compression ratio of 20:1.&lt;/li&gt;
  &lt;li&gt;The preservation masters contain 11 quality layers, whereas the access images only contain 8 quality layers.&lt;/li&gt;
&lt;/ol&gt;
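
&lt;p&gt;The target compression ratios in the table follow a simple pattern: each quality layer halves the compression ratio of the previous one. The following Python sketch (the function name is hypothetical) reproduces both series from the ratio of the last lossy layer and the number of lossy layers.&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;def layer_ratios(final_ratio, n_lossy_layers):
    # Each layer halves the compression ratio of the previous one,
    # ending at final_ratio for the last lossy layer
    return [final_ratio * 2 ** i for i in reversed(range(n_lossy_layers))]


# Access image: 8 layers, ending at 20:1
access_layers = layer_ratios(20, 8)
# Master: 10 lossy layers ending at 5:1, plus a final lossless layer
master_layers = layer_ratios(5, 10)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Note that the first eight layers of the master use the same target ratios as the eight layers of the access image.&lt;/p&gt;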

&lt;h2 id=&quot;deriving-the-access-image&quot;&gt;Deriving the access image&lt;/h2&gt;

&lt;p&gt;In order to derive an access image from a preservation master, two approaches are possible:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Create a “subset” from the preservation master that only contains the lower 8 quality layers (i.e. discarding the highest 3 layers).&lt;/li&gt;
  &lt;li&gt;Do a full decode of the preservation master (e.g. to TIFF), and then re-compress the result to lossy JP2.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The main advantage of the first (“subset”) approach is its computational efficiency: it only involves some simple reformatting of data from the source image’s codestream, without any need to decode or compress the image data. I explored this in &lt;a href=&quot;/2013/08/19/optimising-archival-jp2s-derivation-access-copies&quot;&gt;this 2013 blog post&lt;/a&gt;, and at the time I was able to make it work with Kakadu’s “transcode” tool and Aware’s JPEG 2000 SDK&lt;sup id=&quot;fnref:3&quot;&gt;&lt;a href=&quot;#fn:3&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;2&lt;/a&gt;&lt;/sup&gt;. However, its success depends largely on the correct implementation of the quality layers in the preservation masters. For example, if the 8th quality layer in a preservation master was accidentally compressed at some other compression ratio than the expected 20:1 value, the resulting access JP2 could be (much) smaller or larger than expected. Complicating things further, even though &lt;a href=&quot;https://jpylyzer.openpreservation.org/&quot;&gt;jpylyzer&lt;/a&gt; will tell you both the overall compression ratio of a JP2 as well as the number of quality layers, it does not provide any information about the compression ratios of individual layers&lt;sup id=&quot;fnref:1&quot;&gt;&lt;a href=&quot;#fn:1&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;3&lt;/a&gt;&lt;/sup&gt;.&lt;/p&gt;

&lt;p&gt;Because of this, I only explored the full decode + re-compress approach here. Although computationally less efficient than the “subset” approach, it has the advantage that the result is independent of the implementation of the quality layers in the preservation masters.&lt;/p&gt;

&lt;h2 id=&quot;test-environment&quot;&gt;Test environment&lt;/h2&gt;

&lt;p&gt;For my tests I used an ordinary desktop PC with a quad-core Intel i5-6500 processor (3.20 GHz) and 12 GB RAM. The operating system was Linux Mint 20.1 Ulyssa (which is based on Ubuntu Focal Fossa 20.04).&lt;/p&gt;

&lt;h2 id=&quot;codecs&quot;&gt;Codecs&lt;/h2&gt;

&lt;p&gt;I initially planned to create a small proof of concept workflow based on &lt;a href=&quot;https://kakadusoftware.com/&quot;&gt;Kakadu&lt;/a&gt;, as I already had some old test scripts for compressing TIFF images to JP2s that follow the KB’s master and access requirements. Then my colleague Sam Alloing suggested having a look at the &lt;a href=&quot;https://github.com/GrokImageCompression/grok&quot;&gt;Grok codec&lt;/a&gt;. Although I had been aware of Grok for some time, I had never got around to taking it for a spin, mainly because I haven’t been working much on anything related to JPEG 2000 for the past few years. Since Grok is a fork of &lt;a href=&quot;https://www.openjpeg.org/&quot;&gt;OpenJPEG&lt;/a&gt;, which the KB already uses &lt;a href=&quot;https://lab.kb.nl/about-us/blog/kb-national-library-netherlands-adopts-openjpeg-delpher-and-more-0&quot;&gt;to decode JP2 images on the Delpher platform&lt;/a&gt;, it then made sense to include OpenJPEG as well. So, in the end I used:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;OpenJPEG 2.4.2&lt;/li&gt;
  &lt;li&gt;Grok 9.7.3&lt;/li&gt;
  &lt;li&gt;Kakadu 7.9&lt;sup id=&quot;fnref:4&quot;&gt;&lt;a href=&quot;#fn:4&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;4&lt;/a&gt;&lt;/sup&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I compiled both Grok and OpenJPEG from the source code. For Kakadu I used the pre-compiled demonstration binaries.&lt;/p&gt;

&lt;h2 id=&quot;test-procedure&quot;&gt;Test procedure&lt;/h2&gt;

&lt;p&gt;For each of the three codecs, I created a simple Bash script that takes an input and an output directory as its arguments. For each JP2 image in the input directory, the script goes through the following steps:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Decode (uncompress) the JP2 to uncompressed TIFF.&lt;/li&gt;
  &lt;li&gt;Compress the TIFF to lossy JP2, using (to the maximum extent possible) the KB’s access JP2 requirements.&lt;/li&gt;
  &lt;li&gt;Delete the TIFF file.&lt;/li&gt;
&lt;/ol&gt;
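
&lt;p&gt;The actual test scripts are written in Bash, but the three steps above can be sketched in a codec-neutral way in Python. In this sketch all names are hypothetical: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;decode_cmd&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;encode_cmd&lt;/code&gt; are functions supplied by the caller that return the command line for whichever codec is used, so no codec-specific options are assumed here.&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;import pathlib
import subprocess


def master_to_access(dir_in, dir_out, decode_cmd, encode_cmd):
    # decode_cmd and encode_cmd take a source and a destination path,
    # and return the argument list for the codec at hand
    dir_out = pathlib.Path(dir_out)
    dir_out.mkdir(parents=True, exist_ok=True)
    for jp2_in in sorted(pathlib.Path(dir_in).glob(&quot;*.jp2&quot;)):
        tiff_tmp = dir_out / (jp2_in.stem + &quot;.tif&quot;)
        jp2_out = dir_out / jp2_in.name
        try:
            # Step 1: decode the master to uncompressed TIFF
            subprocess.run(decode_cmd(str(jp2_in), str(tiff_tmp)), check=True)
            # Step 2: compress the TIFF to a lossy access JP2
            subprocess.run(encode_cmd(str(tiff_tmp), str(jp2_out)), check=True)
        finally:
            # Step 3: delete the temporary TIFF, even if a step failed
            tiff_tmp.unlink(missing_ok=True)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;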

&lt;p&gt;Once all input JP2s have been processed, the script then runs the &lt;a href=&quot;https://github.com/KBNLresearch/jprofile/&quot;&gt;jprofile tool&lt;/a&gt; on the output directory. Jprofile (which uses &lt;a href=&quot;https://jpylyzer.openpreservation.org/&quot;&gt;jpylyzer&lt;/a&gt; under its hood) uses &lt;a href=&quot;https://www.bitsgalore.org/2012/09/04/automated-assessment-jp2-against-technical-profile&quot;&gt;Schematron rules&lt;/a&gt; to verify to what extent the generated JP2s conform to the KB access requirements.&lt;/p&gt;

&lt;p&gt;The test scripts (which also contain the encoding parameter values for each codec) can be found here:&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th style=&quot;text-align: left&quot;&gt;Codec&lt;/th&gt;
      &lt;th style=&quot;text-align: left&quot;&gt;Link to script&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;OpenJPEG&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;&lt;a href=&quot;https://github.com/KBNLresearch/jp2totiff/blob/master/mastertoaccess-opj.sh&quot;&gt;https://github.com/KBNLresearch/jp2totiff/blob/master/mastertoaccess-opj.sh&lt;/a&gt;&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Grok&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;&lt;a href=&quot;https://github.com/KBNLresearch/jp2totiff/blob/master/mastertoaccess-grok.sh&quot;&gt;https://github.com/KBNLresearch/jp2totiff/blob/master/mastertoaccess-grok.sh&lt;/a&gt;&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Kakadu&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;&lt;a href=&quot;https://github.com/KBNLresearch/jp2totiff/blob/master/mastertoaccess-kdu.sh&quot;&gt;https://github.com/KBNLresearch/jp2totiff/blob/master/mastertoaccess-kdu.sh&lt;/a&gt;&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;h2 id=&quot;performance&quot;&gt;Performance&lt;/h2&gt;

&lt;p&gt;I ran each of the scripts on a directory with 26 preservation master JP2s (144 MB) from the KB’s collection of digitised books. Before running any of the scripts, I used the following command to empty my machine’s cache memory:&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nb&quot;&gt;sudo &lt;/span&gt;sysctl vm.drop_caches&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;3
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;I then used the operating system’s built-in &lt;a href=&quot;https://linux.die.net/man/1/time&quot;&gt;“time” tool&lt;/a&gt; to measure the processing time needed by each of the scripts:&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;time&lt;/span&gt; ~/kb/jp2totiff/mastertoaccess-grok.sh ./master-1 ./access-1-grok&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; 2&amp;gt; time-grok.txt
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The main metrics provided by this command are:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;“real” - the actual amount of time passed between starting the script and its termination.&lt;/li&gt;
  &lt;li&gt;“user” - the total CPU time spent in user mode, summed over all processor cores (which is why it can exceed the “real” time).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The table below shows the performance statistics for the three scripts:&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th style=&quot;text-align: left&quot;&gt;Codec&lt;/th&gt;
      &lt;th style=&quot;text-align: left&quot;&gt;time (real)&lt;/th&gt;
      &lt;th style=&quot;text-align: left&quot;&gt;time (user)&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;OpenJPEG&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;0m50.715s&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;1m20.497s&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Grok&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;0m22.143s&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;1m1.308s&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Kakadu&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;0m25.507s&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;0m48.990s&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;It’s worth noting that each of these figures encompasses a full decode-encode cycle, with some additional overhead added by jprofile, and system commands that remove the temporary TIFF files. I was surprised to see that at 22 seconds, the Grok-based script was even (marginally) faster than the Kakadu-based one, which clocks in at 26 seconds&lt;sup id=&quot;fnref:5&quot;&gt;&lt;a href=&quot;#fn:5&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;5&lt;/a&gt;&lt;/sup&gt;. The script that uses OpenJPEG is considerably slower at 51 seconds.&lt;/p&gt;
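
&lt;p&gt;The “real” and “user” figures in the table can also be combined into a rough indication of how well each script keeps the four cores busy. The Python sketch below (the function name is hypothetical) converts the values to seconds, and divides the user time by the real time. Since “sys” time is not reported in the table and each script runs several separate processes, the result should be treated as indicative only.&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;def to_seconds(time_string):
    # Convert a value like 1m20.497s to seconds
    minutes, seconds = time_string.rstrip(&quot;s&quot;).split(&quot;m&quot;)
    return int(minutes) * 60 + float(seconds)


# (real, user) values taken from the performance table
timings = {
    &quot;OpenJPEG&quot;: (&quot;0m50.715s&quot;, &quot;1m20.497s&quot;),
    &quot;Grok&quot;: (&quot;0m22.143s&quot;, &quot;1m1.308s&quot;),
    &quot;Kakadu&quot;: (&quot;0m25.507s&quot;, &quot;0m48.990s&quot;),
}

for codec, (real, user) in timings.items():
    ratio = to_seconds(user) / to_seconds(real)
    print(f&quot;{codec}: user/real = {ratio:.2f}&quot;)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;For the figures in the table this gives a ratio of roughly 1.6 for OpenJPEG, 2.8 for Grok and 1.9 for Kakadu.&lt;/p&gt;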

&lt;h2 id=&quot;conformance-to-kb-access-requirements&quot;&gt;Conformance to KB access requirements&lt;/h2&gt;

&lt;p&gt;The next table summarises the jprofile analysis, by listing the deviations from the KB access requirements for each codec:&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th style=&quot;text-align: left&quot;&gt;Codec&lt;/th&gt;
      &lt;th style=&quot;text-align: left&quot;&gt;Deviations from KB access requirements&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;OpenJPEG&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;XML box missing, resolution box missing, ICC profile missing&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Grok&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;XML box missing&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Kakadu&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;-&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;The OpenJPEG JP2s fall short on three aspects. An XML box with XMP metadata, a resolution box, and an ICC profile are all missing. This is not surprising, as OpenJPEG simply doesn’t support these features at this stage. In the Grok JP2s, only the expected XML box is missing. This is because Grok wraps XMP metadata in a so-called “UUID box”. This behaviour is consistent with the &lt;a href=&quot;https://en.wikipedia.org/wiki/ISO/IEC_base_media_file_format&quot;&gt;ISO/IEC base media file format&lt;/a&gt;, and is supported by e.g. Exiftool and jpylyzer. Only the Kakadu JP2s are 100% compliant with the requirements. However, since the exact location of XMP metadata doesn’t really matter for access, both the Kakadu and the Grok JP2s would be satisfactory for our purposes.&lt;/p&gt;

&lt;h2 id=&quot;conclusions&quot;&gt;Conclusions&lt;/h2&gt;

&lt;p&gt;Although based on only a small sample dataset, this proof of concept demonstrates that both Grok and Kakadu would be suitable for generating lossy access JP2s from our preservation masters. The performance of both codecs turned out to be comparable for the test data used. This means that with Grok we now have an open-source codec that is both sufficiently feature-rich and performant to be a viable alternative to commercial codecs like Kakadu. One potential hurdle for some users might be Grok’s build process, which can be slightly involved because it requires very recent versions of &lt;a href=&quot;https://cmake.org/&quot;&gt;CMake&lt;/a&gt; and &lt;a href=&quot;https://gcc.gnu.org/&quot;&gt;gcc&lt;/a&gt;. However, using &lt;a href=&quot;https://github.com/KBNLresearch/jp2totiff/blob/master/doc/grok-installation.md&quot;&gt;Grok’s documentation&lt;/a&gt; and &lt;a href=&quot;https://wiki.harvard.edu/confluence/display/DigitalImaging/Installing+OpenJPEG+on+Windows+10%2C+Linux%2C+and+MacOS&quot;&gt;these useful additional instructions by Harvard’s Bill Comstock&lt;/a&gt; I found the process easier than expected in the end. I’ve &lt;a href=&quot;https://github.com/KBNLresearch/jp2totiff/blob/master/doc/grok-installation.md&quot;&gt;documented the full build and installation process that worked for me here&lt;/a&gt;.&lt;/p&gt;

&lt;h2 id=&quot;acknowledgements&quot;&gt;Acknowledgements&lt;/h2&gt;

&lt;p&gt;Thanks are due to Grok developer Aaron Boxer for fixing two small issues I ran into while running my Grok tests, and Sam Alloing for suggesting to look into Grok.&lt;/p&gt;

&lt;h2 id=&quot;revision-history&quot;&gt;Revision history&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;5 July 2022 -  re-ran performance test with added &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;-threads&lt;/code&gt; option for OpenJPEG, as suggested by Aaron Boxer in the comments.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;further-resources&quot;&gt;Further resources&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;https://github.com/KBNLresearch/jp2totiff&quot;&gt;Git repository with test scripts&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://github.com/KBNLresearch/jp2totiff/blob/master/doc/grok-installation.md&quot;&gt;My Grok build and installation instructions&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://wiki.harvard.edu/confluence/display/DigitalImaging/Installing+OpenJPEG+on+Windows+10%2C+Linux%2C+and+MacOS&quot;&gt;Bill Comstock, “Installing OpenJPEG (and Grok) on Windows 10, Linux, and MacOS”&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;/2013/08/19/optimising-archival-jp2s-derivation-access-copies&quot;&gt;Optimising archival JP2s for the derivation of access copies&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://github.com/KBNLresearch/jprofile&quot;&gt;Jprofile - Automated JP2 profiling for digitisation batches&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;div class=&quot;footnotes&quot; role=&quot;doc-endnotes&quot;&gt;
  &lt;ol&gt;
    &lt;li id=&quot;fn:6&quot;&gt;
      &lt;p&gt;To be completely clear, at this stage this work is just an exploration of something we might do at some time in the future (or possibly not at all); there are no plans to actually implement this yet. &lt;a href=&quot;#fnref:6&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:3&quot;&gt;
      &lt;p&gt;As of 2022, Aware appears to have switched its focus to the development of biometric software, and its website does not mention the JPEG 2000 SDK anymore. &lt;a href=&quot;#fnref:3&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:1&quot;&gt;
      &lt;p&gt;Adding this functionality to jpylyzer would require much more in-depth parsing of the codestream data than is currently the case. &lt;a href=&quot;#fnref:1&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:4&quot;&gt;
      &lt;p&gt;Note that this is a pretty old version. &lt;a href=&quot;#fnref:4&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:5&quot;&gt;
      &lt;p&gt;These figures are not 100% comparable, because the Kakadu-based script includes an additional processing step to extract embedded metadata from the source file using ExifTool (Grok does this automatically at the codec level). &lt;a href=&quot;#fnref:5&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
  &lt;/ol&gt;
&lt;/div&gt;
</description>
                <link>https://bitsgalore.org/2022/03/30/generating-lossy-access-jp2s-from-lossless-preservation-masters</link>
                <guid>https://bitsgalore.org/2022/03/30/generating-lossy-access-jp2s-from-lossless-preservation-masters</guid>
                <pubDate>2022-03-30T00:00:00+02:00</pubDate>
        </item>

        <item>
                <title>On The Significant Properties of Spreadsheets</title>
                <description>&lt;figure class=&quot;image&quot;&gt;
  &lt;img src=&quot;/images/2021/09/clippy-800.png&quot; alt=&quot;Clippy saying It looks like you&apos;re migrating a spreadsheet to ... TIFF?!&quot; /&gt;
&lt;/figure&gt;

&lt;p&gt;Earlier this month saw the publication of &lt;a href=&quot;https://zenodo.org/record/5468116&quot;&gt;The Significant Properties of Spreadsheets&lt;/a&gt;. This is the final report of a six-year research effort by the Open Preservation Foundation’s Archives Interest Group (AIG), which is composed of participants from the National Archives of the Netherlands (NANETH), the National Archives of Estonia (NAE), the Danish National Archives (DNA), and Preservica. The report caught my attention for two reasons. First, there’s the subject matter of spreadsheets, on which I’ve written a few posts in the past&lt;sup id=&quot;fnref:1&quot;&gt;&lt;a href=&quot;#fn:1&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;1&lt;/a&gt;&lt;/sup&gt;. Second, it marks a surprising (at least to me!) return of “significant properties”, a concept that was omnipresent in the digital preservation world between, roughly, 2005 and 2010, but which has largely fallen into disuse since then. In this post I’m sharing some of my thoughts on the report.&lt;/p&gt;

&lt;!-- more --&gt;

&lt;h2 id=&quot;overall-aim&quot;&gt;Overall aim&lt;/h2&gt;

&lt;p&gt;The authors describe the rationale behind their work as follows:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;Preserving files in spreadsheet formats is a priority for every member. We need to answer questions such as ‘should we migrate?’ and ‘how do we measure the success or quality of the migration?’. For the latter, we need to know what aspects of the file are important (significant), which led us to the decision to investigate the significant properties of spreadsheets.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Various definitions of “significant properties” exist. The authors followed the 2007 &lt;a href=&quot;https://significantproperties.kdl.kcl.ac.uk/wp22_significant_properties.pdf&quot;&gt;InSPECT report&lt;/a&gt; here, which defines them as:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;the characteristics of digital objects that must be preserved over time in order to ensure the continued accessibility, usability, and meaning of the objects, and their capacity to be accepted as evidence of what they purport to record.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The authors describe the role of “significant properties” in the preservation process as follows:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;When the digital objects (e.g. files in a format) or the technology to use them (e.g. viewers) are at risk of becoming obsolete, preservation actions may be required (e.g. file format migration or viewer software emulation). Ensuring that the significant properties are reasonably preserved as a result of these preservation actions is then a means of validating these actions.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;So, by this view, the importance of “significant properties” lies in their utility to validate preservation actions.&lt;/p&gt;

&lt;h2 id=&quot;simple-versus-complex-spreadsheets&quot;&gt;Simple versus complex spreadsheets&lt;/h2&gt;

&lt;p&gt;Throughout the report, the authors differentiate between what they call “simple” (or “static”) and “complex” (or “dynamic”) spreadsheets. They define “simple” spreadsheets as:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;spreadsheets that are mainly used for (human) visualisation and contain static data values organised into tabular format on one or more worksheets.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Irrespective of whether you agree with the “simple” versus “complex” distinction, the above definition is problematic because it mixes up actual spreadsheet characteristics (“contains static data”) with a subjective statement of how such spreadsheets are supposed to be used (“mainly used for human visualisation”). Moreover, it’s not at all clear what the “human visualisation” assessment is based on (some written intention statement by the original creators, an institutional policy, or something else?).&lt;/p&gt;

&lt;p&gt;By contrast, “complex” spreadsheets are defined as:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;spreadsheets that contain formulae, notes, macros, dates, links to external data sources or other functions or dynamic behaviour.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2 id=&quot;migrating-spreadsheets-to-tiff-and-pdfa&quot;&gt;Migrating spreadsheets to TIFF and PDF/A&lt;/h2&gt;

&lt;p&gt;The authors explain that this distinction was motivated by a practical format migration use case. In particular, on page 10 they write that (emphasis added by me):&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;simple spreadsheets would likely render more or less the same in most spreadsheet-rendering applications at every moment of time. &lt;strong&gt;One would lose no information when migrating to other, primarily rendering-oriented file formats, like the Tagged Image File Format (TIFF) currently accepted by DNA.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;On page 11, they continue:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt; In case of ‘simple/static’ spreadsheets with formatting (fonts, colours, styles, cell width, etc.), no significant loss of information would occur if the spreadsheets were migrated, to e.g. TIFF or PDF/A.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;And:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;In short, ‘simple/static’ spreadsheets can be migrated to non-spreadsheet specific file formats or formats that are not meant to preserve dynamic behaviour.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;These are truly mind-boggling statements, and I hardly know where to even start here. First, migrating a spreadsheet to an image format like TIFF would result in the loss of &lt;em&gt;most&lt;/em&gt; of the information of the source document, including &lt;em&gt;all&lt;/em&gt; coded text and numbers. This would break machine-readability, and would also make the data values inaccessible to visually impaired users. Most of the functionality of the source file would be lost as well, such as the ability to navigate and access individual cells, rows and columns, search and filter cell values, and copy content, to name but a few. These objections apply to PDF/A as well (albeit to a somewhat lesser degree). I already addressed this &lt;a href=&quot;/2016/12/09/pdfa-as-a-preferred-sustainable-format-for-spreadsheets&quot;&gt;in my 2016 post on PDF/A as a preferred, sustainable format for spreadsheets&lt;/a&gt; (which, incidentally, is even cited in the report). Statements like these show a fundamental lack of understanding of digital information and file formats, and I’m surprised (and frankly a bit shocked) to find them in a contemporary report by digital preservation professionals&lt;sup id=&quot;fnref:8&quot;&gt;&lt;a href=&quot;#fn:8&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;2&lt;/a&gt;&lt;/sup&gt;.&lt;/p&gt;

&lt;p&gt;On “complex” spreadsheets, the authors note that (page 11):&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;Migrating to non-spreadsheet formats could cause severe information loss.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;While this is of course true, it largely applies to “simple” spreadsheets too. Moreover, even migrating to other spreadsheet formats could result in information loss.&lt;/p&gt;

&lt;h2 id=&quot;identification-of-significant-properties&quot;&gt;Identification of significant properties&lt;/h2&gt;

&lt;p&gt;The main outcome of the work is a categorised list of “significant properties” of spreadsheets. The authors arrived at this by adapting the &lt;a href=&quot;https://significantproperties.kdl.kcl.ac.uk/inspect-finalreport.pdf&quot;&gt;methodology from the InSPECT project&lt;/a&gt;. I found the description of the InSPECT methodology and its application to spreadsheets in the AIG report quite hard to follow in places, so in this section I’ll attempt to provide a brief (and somewhat simplified) summary.&lt;/p&gt;

&lt;p&gt;The InSPECT methodology involves two main components. The first one is the “object analysis”. The overall objective here is the compilation of an extensive list of spreadsheet properties, based on sample files, characterisation tools and technical specifications. These were then categorised into “property groups”. Subsequently, the “property groups” were linked to expected “user behaviours” (e.g. “View data in cells”), which were classified into more generic “functions”. The associations between functions, behaviours and property groups are visualised in what the authors call the “spaghetti diagram”, which can be viewed &lt;a href=&quot;https://zenodo.org/record/5468116/files/FBS%20diagram%20%28final%20report%29.png&quot;&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;A particularly interesting strand of the object analysis work was the development of a &lt;a href=&quot;https://github.com/RvanVeenendaal/Spreadsheet-Complexity-Analyser&quot;&gt;Spreadsheet Complexity Analyser&lt;/a&gt;. This is a tool that classifies Microsoft Excel spreadsheets (both XLS and XLSX) as “simple” or “complex”, based on extracted document- and cell-level properties (which can be reported as well).&lt;/p&gt;

&lt;p&gt;The second component is a “stakeholder analysis”. Here, the authors presented the properties and property groups from the object analysis to various types of stakeholders (e.g. archivists, users), and asked them which properties they deemed “significant”.&lt;/p&gt;

&lt;p&gt;The authors combined the results from the “object analysis” and the “stakeholder analysis”, which resulted in &lt;a href=&quot;https://zenodo.org/record/5468116/files/Combined%20%28relevant%20and%20significant%20properties%29.xlsx?download=1&quot;&gt;this list&lt;/a&gt;. Out of the 334 properties in the list, 105 were deemed “significant” by the stakeholders at the individual property level. At the property group level, only 15 out of the 38 property groups were deemed “significant”, which corresponds to 140 properties (i.e. all properties that are part of these 15 groups).&lt;/p&gt;

&lt;p&gt;In the following sections I will comment on some things that caught my attention.&lt;/p&gt;

&lt;h2 id=&quot;specificity-of-properties&quot;&gt;Specificity of properties&lt;/h2&gt;

&lt;p&gt;One thing that struck me while browsing the &lt;a href=&quot;https://zenodo.org/record/5468116/files/Combined%20%28relevant%20and%20significant%20properties%29.xlsx?download=1&quot;&gt;list of properties&lt;/a&gt; is that some of the identified properties are very general and lack specificity. As an example, properties such as “Database Functions”, “Engineering Functions” and “Date and Time Functions” only describe broad &lt;em&gt;categories&lt;/em&gt; of spreadsheet functions, but not the actual functions themselves. I don’t really understand the value of these categories within the context of validating preservation actions (which, after all, is ultimately the larger aim of this work). For instance, imagine we migrate an XLSX spreadsheet to ODS, and for some weird reason the &lt;a href=&quot;https://support.microsoft.com/en-us/office/weekday-function-60e44483-2ed1-439f-8bd0-e404c190949a&quot;&gt;&lt;em&gt;WEEKDAY&lt;/em&gt;&lt;/a&gt; function is changed to &lt;a href=&quot;https://support.microsoft.com/en-us/office/year-function-c64f017a-1354-490d-981f-578e8ec8d3b9&quot;&gt;&lt;em&gt;YEAR&lt;/em&gt;&lt;/a&gt;. Both are “Date and Time Functions”, but the result of such a migration would still be nonsense. What we’d really like to know in this case is whether &lt;em&gt;the exact functions&lt;/em&gt; are preserved. Moreover, according to &lt;a href=&quot;https://wiki.documentfoundation.org/Feature_Comparison:_LibreOffice_-_Microsoft_Office#Desktop_Spreadsheet_applications:_LibreOffice_Calc_vs._Microsoft_Excel&quot;&gt;the Document Foundation Wiki&lt;/a&gt;, Excel and LibreOffice Calc each have some functions that the other application lacks (29 and 22, respectively). This is a potential source of information loss when migrating between their respective formats.
A significant properties approach would only be able to account for such information losses if the properties are defined at the level of individual spreadsheet functions&lt;sup id=&quot;fnref:6&quot;&gt;&lt;a href=&quot;#fn:6&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;3&lt;/a&gt;&lt;/sup&gt;. Needless to say, this would increase the number of properties to take into account considerably, and in the absence of any way to automate this it would also make things even more time consuming.&lt;/p&gt;
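
&lt;p&gt;To give an idea of what a check at the level of individual functions might involve, here is a minimal sketch in Python. It is not taken from the AIG report or from the Spreadsheet Complexity Analyser. It assumes the formulas of the source and migrated files have already been extracted into plain strings keyed by cell address (that extraction step, and any format-specific formula syntax, is left out entirely), and the names &lt;code&gt;functions_used&lt;/code&gt; and &lt;code&gt;changed_functions&lt;/code&gt; are made up for this illustration:&lt;/p&gt;

```python
import re

# Matches a function name directly followed by an opening parenthesis.
# Simplification: text inside string literals is not excluded.
FUNCTION_NAME = re.compile(r"([A-Z][A-Z0-9_.]*)\s*\(")


def functions_used(formula):
    """Return the function names in a formula string, in order of appearance."""
    return FUNCTION_NAME.findall(formula.upper())


def changed_functions(source_cells, migrated_cells):
    """Compare two {cell address: formula} mappings (a hypothetical input format).

    Returns {address: (source functions, migrated functions)} for every cell
    whose function names differ after migration.
    """
    differences = {}
    for address in sorted(set(source_cells) | set(migrated_cells)):
        before = functions_used(source_cells.get(address, ""))
        after = functions_used(migrated_cells.get(address, ""))
        if before != after:
            differences[address] = (before, after)
    return differences


# Hypothetical example data: WEEKDAY was changed to YEAR during migration.
source = {"B2": "=WEEKDAY(A2)", "B3": "=SUM(A1:A2)"}
migrated = {"B2": "=YEAR(A2)", "B3": "=SUM(A1:A2)"}
print(changed_functions(source, migrated))
```

&lt;p&gt;For the example data this prints &lt;code&gt;{'B2': (['WEEKDAY'], ['YEAR'])}&lt;/code&gt;, flagging a change that a category-level property such as “Date and Time Functions” would not catch. A real implementation would also have to deal with functions that are renamed or unavailable in the target application, which this sketch does not attempt.&lt;/p&gt;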

&lt;p&gt;The properties related to macros&lt;sup id=&quot;fnref:7&quot;&gt;&lt;a href=&quot;#fn:7&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;4&lt;/a&gt;&lt;/sup&gt; provide another example. Microsoft Excel uses &lt;a href=&quot;https://en.wikipedia.org/wiki/Visual_Basic_for_Applications&quot;&gt;Visual Basic for Applications&lt;/a&gt; (VBA) as the macro language for its XLS and XLSX formats. However, the Open Document Format does not dictate any specific macro or scripting language, and simply declares this as &lt;a href=&quot;https://docs.oasis-open.org/office/OpenDocument/v1.3/os/part3-schema/OpenDocument-v1.3-os-part3-schema.html#attribute-script_language&quot;&gt;implementation dependent&lt;/a&gt;. By default, LibreOffice uses its own Basic implementation, which is &lt;a href=&quot;https://help.libreoffice.org/latest/en-US/text/shared/guide/ms_user.html&quot;&gt;largely incompatible with Microsoft’s VBA language&lt;/a&gt;. In addition, it allows the use of other languages such as &lt;a href=&quot;https://wiki.documentfoundation.org/Macros/Python_Guide/Introduction&quot;&gt;Python&lt;/a&gt; and &lt;a href=&quot;https://webodf.org/blog/2012-04-13.html&quot;&gt;JavaScript&lt;/a&gt;. This variety of (mostly incompatible) macro languages introduces various preservation challenges. In spite of this, “macro language” or “scripting language” is not included in the “macros” property group. This raises some questions on how useful these properties actually are for validating preservation actions.&lt;/p&gt;

&lt;h2 id=&quot;how-to-measure-the-properties&quot;&gt;How to measure the properties&lt;/h2&gt;

&lt;p&gt;Related to this, the authors provide no explanation of &lt;em&gt;how&lt;/em&gt; any of the identified properties can be measured, or with what software. My best guess is that they used some spreadsheet application&lt;sup id=&quot;fnref:4&quot;&gt;&lt;a href=&quot;#fn:4&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;5&lt;/a&gt;&lt;/sup&gt;. If this is indeed the case, how would one even start to do this at scale, with collections of thousands of spreadsheets? Again, the authors don’t even mention this. The Spreadsheet Complexity Analyser could be really useful here, and in fact, DNA tried this approach as part of the object analysis work (p. 19):&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;The Danish National Archives ran the SCA against about 16,000 Microsoft Excel spreadsheets (both binary formats and OOXML) to investigate the possible information loss when converting Excel spreadsheets to ODS. (…) The test showed that the conversion from XLS and XLSX to ODS and back to XLS and XLSX resulted in minimal data loss. Yet, data loss for significant structures such as cell typographies, fonts and hyperlinks were encountered.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;However, the Spreadsheet Complexity Analyser is currently only able to extract 13 spreadsheet-specific properties&lt;sup id=&quot;fnref:5&quot;&gt;&lt;a href=&quot;#fn:5&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;6&lt;/a&gt;&lt;/sup&gt;, which is only a fraction of all the properties that were deemed “significant”. So while this is an admirable start, it doesn’t even come close to what would be needed for any real-world application.&lt;/p&gt;

&lt;h2 id=&quot;where-are-the-examples-case-studies&quot;&gt;Where are the examples and case studies?&lt;/h2&gt;

&lt;p&gt;The introductory chapter of the AIG report explicitly mentions how the work is aimed at validating preservation actions. However, it is remarkably vague on how the outcomes of the InSPECT methodology could help in solving real-world spreadsheet preservation challenges. The authors do mention application areas such as preservation actions (e.g. format migration and emulation) and choosing preferred formats. However, they don’t provide any examples or case studies that demonstrate how the identified properties, property groups and user behaviours would help here in practice. The closest thing to an actual example is given in the concluding chapter, where they write (p. 39):&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;One example is that the Danish National Archives used the gained knowledge of spreadsheet properties in the decision to revise their accepted formats and adopt a spreadsheet-specific format, which probably will be the Open Document Spreadsheet format.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This refers to DNA’s current practice of accepting TIFF(!) as a preservation format for spreadsheets. The authors then write:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;[O]ur work provided the lists, tools and insights DNA required to revise their accepted format policy and adopt a spreadsheet-specific format in their Preservation Policy.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;It’s not my intention to mock DNA’s odd choice of TIFF here&lt;sup id=&quot;fnref:3&quot;&gt;&lt;a href=&quot;#fn:3&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;7&lt;/a&gt;&lt;/sup&gt;, but that TIFF is a terrible target format for spreadsheets should be blindingly obvious to anyone with even the slightest understanding of spreadsheets or digital formats in general. You really don’t need to spend six years researching hundreds of “significant properties” to arrive at this conclusion&lt;sup id=&quot;fnref:2&quot;&gt;&lt;a href=&quot;#fn:2&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;8&lt;/a&gt;&lt;/sup&gt;!&lt;/p&gt;

&lt;p&gt;Because of the absence of any further examples or case studies, it remains unclear how the authors envisage the use of their work in any real-world format migration or emulation scenario. For a start, the sheer number of properties would make any manual, non-automated analysis extremely cumbersome and time consuming. Interestingly, this is confirmed by the following quote from the DNA stakeholder analysis (p. 18):&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;Our experiences from the stakeholder interviews were that it can be extremely time and competency consuming to analyse every single property and behaviour for a complex content information type such as spreadsheets. In fact, for these kinds of analyses it can be counterproductive to conducting the interview if we do not try to stray away from the InSPECT approach and instead focus on facilitating a meaningful conversation with people and from this conversation try to deduce the behaviours necessary to preserve for future reuse of the data. The questions you instead can ask people are what they deem important to be able to do with the data and what data and associated functionality do they find important to preserve, if they were to reuse it in the future.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I’m not surprised by this observation, as it is consistent with earlier work by Euan Cochrane. In his 2012 &lt;a href=&quot;https://web.archive.org/web/20130218111126/http://archives.govt.nz/rendering-matters-report-results-research-digital-object-rendering&quot;&gt;Rendering Matters report&lt;/a&gt;, he showed that for a migration-based approach, at least 13.5 hours were needed to test only 100 Office files (a mixture of word-processing, spreadsheet, database and presentation formats) comprehensively. This doesn’t inspire much confidence in using such analyses at scale for any real-world applications.&lt;/p&gt;

&lt;h2 id=&quot;significance-in-context&quot;&gt;Significance in context&lt;/h2&gt;

&lt;p&gt;Some of the problems that I outlined in the previous sections could be remedied to some extent by introducing even more (and more detailed) properties, or by adding support for more properties to the Spreadsheet Complexity Analyser. However, this would make the analyses even more complex and time consuming. This shouldn’t come as a surprise, as Webb, Pearson &amp;amp; Koerbin already pointed out some of these difficulties in their &lt;a href=&quot;http://www.dlib.org/dlib/january13/webb/01webb.html&quot;&gt;2013 D-Lib paper&lt;/a&gt;:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;We have come to a tentative conclusion that recognising and taking action to maintain significant properties will be critical, but that the concept can be more of a stumbling block than a starting block, at least in the context of our own institution. We believe reference to significant properties in preservation planning requires some prior consideration of both the purposes for which digital content has been collected and the purposes of providing preservation attention. In effect, we are asking how can we know what attributes of digital materials we need to preserve if we haven’t articulated why we are preserving them?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Specifically, they ran into two problems. The first one was the complexity of defining the measurable levels of properties that were required to fulfill a collection’s preservation intention. The second one was the realisation that potential changes to digital objects as a result of some preservation action can best be evaluated around the time the preservation action takes place, using the tools that are then available. This doesn’t mean they discarded the concept of significant properties altogether, but rather they decided that the definition of significant properties is informed by practical experience with (the tools used to perform) preservation actions.&lt;/p&gt;

&lt;p&gt;Owens is also skeptical about the significant properties approach in his 2018 book &lt;a href=&quot;https://osf.io/preprints/lissa/5cpjt&quot;&gt;The Theory and Craft of Digital Preservation&lt;/a&gt;:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;At one point, many in the digital preservation field worked to try and identify context agnostic notions of the significant properties of various kinds of digital files and formats. While there is some merit to this approach, it has been largely abandoned in the face of the realization that significance is largely a matter of context. That is, significance isn’t an inherent property of forms of content, but an emergent feature of the context in which that content was created and the purpose it serves in an institutions’ collection.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The overarching problem with much of the work presented in the AIG report is the almost total lack of any such context. As a result, valuable time and effort are spent on the analysis of a myriad of atomic properties, many of which will only matter in certain specific contexts, and not at all in others. So, before embarking on any preservation action, I would like (at the absolute minimum) answers to questions such as:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Who are the (present and future) users of the spreadsheet collections?&lt;/li&gt;
  &lt;li&gt;What accessibility levels are required to meet these users’ needs (e.g. viewing in original environment, navigating, editing, accessibility for visually impaired users, machine-readability)?&lt;/li&gt;
  &lt;li&gt;What preservation actions are needed to enable these accessibility requirements (e.g. emulation, migration to one or more access formats)?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The answers to such questions largely fit into the “preservation intent” concept, which was (if I’m not mistaken) first introduced by Webb, Pearson &amp;amp; Koerbin:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;The Library’s preservation intent methodology is simply to engage collection curators in making explicit statements about which collection materials, and which copies of collection materials, need to remain accessible for an extended period, and which ones can be discarded when no longer in use or when access to them becomes troublesome. Curators are also asked to make broad statements clarifying what ‘accessible’ means by stating the priority elements that need to be re-presented in any future access for each kind of digital object type in their collections.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Owens considers preservation intent to be one of the foundations of any digital preservation work:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;In this model, preservation intent and collection development function are the foundation of your digital preservation work. All of the work one does to enable long-term access to digital information should be grounded in a stated purpose and intention that directly informs what is selected and what is retained.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;One important implication is that there is no “one size fits all” solution here. For example, the properties that are “significant” for an authentic rendering in an emulator will be quite different from those for a machine-readable access copy. Such nuances are not apparent from the AIG report, which seems to take the view that digital preservation is merely a matter of preserving the (contextless) “significant properties” (p. 41):&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;[I]f you look closely at your digital preservation strategy, it boils down to finding the best way to preserve the significant properties of information.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;By contrast, taking the preservation intent as a starting point, a practical solution for the above case might look something like this:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Always keep the original spreadsheets (in their original formats)&lt;/li&gt;
  &lt;li&gt;Use emulation to ensure faithful rendering of these files in their original environment&lt;/li&gt;
  &lt;li&gt;If needed, use migration to derive access copies in more convenient formats to suit specific use cases (e.g. text and data mining)&lt;sup id=&quot;fnref:10&quot;&gt;&lt;a href=&quot;#fn:10&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;9&lt;/a&gt;&lt;/sup&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This would not require any complex and time-consuming analysis of significant properties, because the original data are preserved in their original formats. The Spreadsheet Complexity Analyser could possibly help validate the migration to the access formats (at least to some degree).&lt;/p&gt;

&lt;h2 id=&quot;representativity-of-stakeholder-group&quot;&gt;Representativity of stakeholder group&lt;/h2&gt;

&lt;p&gt;In the InSPECT methodology, the “significance” that is attributed to a property depends to a large degree on the background and knowledge level of the stakeholders. To their credit, the authors readily acknowledge this in the report, and remark that:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;A stakeholder that never makes use of more advanced features, such as formulas and macros, will not deem these significant.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;A &lt;a href=&quot;https://planets-project.eu/docs/papers/Dappert_Significant_Characteristics_ECDL2009.pdf&quot;&gt;2009 report by Dappert and Farquhar&lt;/a&gt; already stressed how “significance” is largely in the eye of the stakeholder. So, one would expect that the stakeholders would be somewhat representative of the (future) users of the spreadsheet collections. However, the authors restricted their stakeholder analysis population to “individuals that are employed in the public sector”. They justify this by stating that:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;This is due to the fact that the organisations for which this study is carried out, the National Archives of various countries, preserve information from public institutes.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I understand the logic behind limiting the size of the stakeholder group (if only for practical reasons), but the justification seems to imply that these National Archives see their depositing institutes as their only (or primary) users. This strikes me as an overly narrow view, as national archives also serve independent researchers, authors, (data) journalists, and members of the general public. It is not clear to what extent the actual (potential) users were represented in the current stakeholder analysis. For instance, a researcher or data journalist may attribute high importance to the ability to access spreadsheet collections through text and data-mining techniques. For a visually impaired user, accessibility through a screenreader application will be vitally important.&lt;/p&gt;

&lt;h2 id=&quot;user-behaviours&quot;&gt;User behaviours&lt;/h2&gt;

&lt;p&gt;In addition, one component of the InSPECT methodology’s stakeholder analysis is the collection of “actual behaviours”, which are defined as “activities that a specific category of stakeholder will likely perform when using the object”. These are then mapped against properties, which eventually determines to a large degree which properties are deemed “significant”. However, the authors decided that this would be too daunting a task (p. 15):&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;With spreadsheets, however, this is an immense task. Stakeholders use spreadsheets for a wide range of activities with no established set of functions that has to be used every time. Furthermore, we felt this would be difficult to accomplish thoroughly during interviews with stakeholders, considering the size of the task. Therefore, these last steps were not performed by us during this research.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Instead, they limited themselves to drafting a list of “expected behaviours”, which InSPECT defines as “the different types of activities that a user – any type of user – may wish to perform”. But even this turned out to be overly ambitious (p. 21, emphasis added by me):&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;We soon realised that we would never establish an exhaustive list of all possible behaviours and &lt;strong&gt;chose to use those behaviours that we found most important from our perspective as archives preserving spreadsheets&lt;/strong&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2 id=&quot;significant-to-future-users&quot;&gt;Significant to (future) users?&lt;/h2&gt;

&lt;p&gt;The combined effect of the restricted stakeholder population, not taking into account “actual behaviours”, and limiting the “expected behaviours” to an archivist’s perspective raises some major concerns. Most importantly, it calls into question the degree to which the “significance” judgements truly reflect what is “significant” to actual and future users (which should be informed by the preservation intent). Ultimately, digital preservation is largely about preserving digital materials for future &lt;em&gt;use&lt;/em&gt;. If accounting for actual user behaviours, needs and requirements is too gargantuan a task, shouldn’t this lead to the inevitable conclusion that, at least in this situation, the InSPECT methodology is not suitable as a basis for validating preservation actions&lt;sup id=&quot;fnref:9&quot;&gt;&lt;a href=&quot;#fn:9&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;10&lt;/a&gt;&lt;/sup&gt;?&lt;/p&gt;

&lt;h2 id=&quot;conclusion&quot;&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;The AIG report on the significant properties of spreadsheets leaves me with some very mixed thoughts. On the one hand, it is clear that a lot of time, effort and dedication have gone into this work. The detailed breakdown of properties in the &lt;a href=&quot;https://zenodo.org/record/5468116/files/List%20of%20properties%20and%20property%20groups%20%28blue%20sheet%29.ods?download=1&quot;&gt;“blue sheet”&lt;/a&gt; contains a wealth of information that, I expect, will be of interest to any digital preservationist working on spreadsheet preservation. I particularly like the &lt;a href=&quot;https://github.com/RvanVeenendaal/Spreadsheet-Complexity-Analyser&quot;&gt;Spreadsheet Complexity Analyser&lt;/a&gt;, which looks like a genuinely useful tool.&lt;/p&gt;

&lt;p&gt;With that said, it’s not clear to me how the results of the main body of this work would support its own stated aim of validating preservation actions, and the authors provide no examples of this. It almost seems that they got lost in a labyrinthine web of properties, and lost sight of this overall aim along the way.&lt;/p&gt;

&lt;p&gt;I’m also concerned about the largely context-agnostic view of “significant properties” that the authors express throughout the report. On the one hand, this results in a scope that will be unnecessarily wide for most real-world preservation actions (meaning that the number of properties to consider becomes unmanageable quickly). Simultaneously, the authors made several decisions that, taken together, have the effect that the needs and requirements of (future) users of the spreadsheet collections are only minimally represented. I expect this will seriously limit the utility of this exercise as a basis for making preservation decisions.&lt;/p&gt;

&lt;h2 id=&quot;link-to-report&quot;&gt;Link to report&lt;/h2&gt;

&lt;p&gt;&lt;a href=&quot;https://zenodo.org/record/5468116&quot;&gt;The Significant Properties of Spreadsheets (OPF AIG Final Report)&lt;/a&gt;&lt;/p&gt;

&lt;h2 id=&quot;revision-history&quot;&gt;Revision history&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;8 October 2021: added mention of Euan Cochrane’s Rendering Matters report to examples, case studies section.&lt;/li&gt;
&lt;/ul&gt;

&lt;div class=&quot;footnotes&quot; role=&quot;doc-endnotes&quot;&gt;
  &lt;ol&gt;
    &lt;li id=&quot;fn:1&quot;&gt;
      &lt;p&gt;See my posts on &lt;a href=&quot;/2014/10/29/quattro-pro-dos-obsolete-format-last&quot;&gt;Quattro Pro for DOS&lt;/a&gt; and &lt;a href=&quot;/2016/12/09/pdfa-as-a-preferred-sustainable-format-for-spreadsheets&quot;&gt;PDF/A as a preferred, sustainable format for spreadsheets?&lt;/a&gt;. &lt;a href=&quot;#fnref:1&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:8&quot;&gt;
      &lt;p&gt;The discussion of the Danish National Archives Stakeholder Analysis on page 29 provides a more realistic view, which is absolutely clear on why migrating to TIFF is a horrible idea for spreadsheets. &lt;a href=&quot;#fnref:8&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:6&quot;&gt;
      &lt;p&gt;Also, I cannot even imagine how one would measure these function categories, as to the best of my knowledge they are not even explicitly coded in the spreadsheet. &lt;a href=&quot;#fnref:6&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:7&quot;&gt;
      &lt;p&gt;I should add here that macros were not considered “significant” by the stakeholders. &lt;a href=&quot;#fnref:7&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:4&quot;&gt;
      &lt;p&gt;E.g. (some version of) Microsoft Excel, LibreOffice Calc, or perhaps something else? &lt;a href=&quot;#fnref:4&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:5&quot;&gt;
      &lt;p&gt;The total number of extracted properties is actually 18, but 5 of these are filesystem-level attributes (file name, size and time stamps) that are not specific to spreadsheets. &lt;a href=&quot;#fnref:5&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:3&quot;&gt;
      &lt;p&gt;I suspect this might have its roots in some ill-advised decision in some distant past. I think most memory institutions will have examples of similarly unfortunate historically-evolved practices, and DNA should be praised here for being open about this. &lt;a href=&quot;#fnref:3&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:2&quot;&gt;
      &lt;p&gt;It’s worth noting here that a footnote on page 31 mentions how DNA are working on revising their format policy because the TIFF format “will not support the migration of dynamic properties”. It also mentions that “[o]ur research has made clear how significant these are and why they must be preserved”. This suggests that despite spending six years of research on the significant properties of spreadsheets, DNA still see no problem with TIFF as a target format for “simple”/”static” spreadsheets. &lt;a href=&quot;#fnref:2&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:10&quot;&gt;
      &lt;p&gt;Even migration to some image format might be useful as supporting evidence of how a spreadsheet should be rendered in its original environment (but please use PNG, not TIFF). &lt;a href=&quot;#fnref:10&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:9&quot;&gt;
      &lt;p&gt;Which, by the authors’ own account, is the underlying aim of this work. &lt;a href=&quot;#fnref:9&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
  &lt;/ol&gt;
&lt;/div&gt;
</description>
                <link>https://bitsgalore.org/2021/09/24/on-the-significant-properties-of-spreadsheets</link>
                <guid>https://bitsgalore.org/2021/09/24/on-the-significant-properties-of-spreadsheets</guid>
                <pubDate>2021-09-24T00:00:00+02:00</pubDate>
        </item>

        <item>
                <title>PDF processing and analysis with open-source tools</title>
                <description>&lt;figure class=&quot;image&quot;&gt;
  &lt;img src=&quot;/images/2021/09/plumber-tools.jpg&quot; alt=&quot;Photo of assortment of old plumbing tools.&quot; /&gt;
  &lt;figcaption&gt;&lt;a href=&quot;https://www.flickr.com/photos/130648318@N06/42662053232&quot;&gt;Plumbers Tool Box&lt;/a&gt; by &lt;a href=&quot;https://www.flickr.com/photos/130648318@N06/&quot;&gt;pszz&lt;/a&gt; on Flickr. Used under &lt;a href=&quot;https://creativecommons.org/licenses/by-nc-sa/2.0/&quot;&gt;CC BY-NC-SA 2.0&lt;/a&gt;.&lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;Over the years, I’ve been using a variety of open-source software tools for solving all sorts of issues with PDF documents. This post is an attempt to (finally) bring together my go-to PDF analysis and processing tools and commands for a variety of common tasks in one single place. It is largely based on a multitude of scattered lists, cheat-sheets and working notes that I made earlier. Starting with a brief overview of some general-purpose PDF toolkits, I then move on to a discussion of the following specific tasks:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Validation and integrity testing&lt;/li&gt;
  &lt;li&gt;PDF/A and PDF/UA compliance testing&lt;/li&gt;
  &lt;li&gt;Document information and metadata extraction&lt;/li&gt;
  &lt;li&gt;Policy/profile compliance testing&lt;/li&gt;
  &lt;li&gt;Text extraction&lt;/li&gt;
  &lt;li&gt;Link extraction&lt;/li&gt;
  &lt;li&gt;Image extraction&lt;/li&gt;
  &lt;li&gt;Conversion to other (graphics) formats&lt;/li&gt;
  &lt;li&gt;Inspection of embedded image information&lt;/li&gt;
  &lt;li&gt;Conversion of multiple images to PDF&lt;/li&gt;
  &lt;li&gt;Cross-comparison of two PDFs&lt;/li&gt;
  &lt;li&gt;Corrupted PDF repair&lt;/li&gt;
  &lt;li&gt;File size reduction of PDF with hi-res graphics&lt;/li&gt;
  &lt;li&gt;Inspection of low-level PDF structure&lt;/li&gt;
  &lt;li&gt;View, search and extract low-level PDF objects&lt;/li&gt;
  &lt;li&gt;Incremental updates and document versions: get information about the number of incremental updates, and restore previous versions&lt;/li&gt;
&lt;/ul&gt;

&lt;!-- more --&gt;

&lt;h2 id=&quot;how-this-selection-came-about&quot;&gt;How this selection came about&lt;/h2&gt;

&lt;p&gt;Even though this post covers a lot of ground, the selection of tasks and tools presented here is by no means meant to be exhaustive. It was guided to a great degree by the PDF-related issues I’ve encountered myself in my day-to-day work. Some of these tasks could be done using other tools (including ones that are not mentioned here), and in some cases these other tools may well be better choices. So there’s probably a fair amount of selection bias here, and I don’t want to make any claims of presenting the “best” way to do any of these tasks. Also, many of the example commands in this post can be further refined to particular needs (e.g. using additional options or alternative output formats), and they are probably best seen as (hopefully useful) starting points for the reader’s own explorations.&lt;/p&gt;

&lt;p&gt;All of the tools presented here are published as open-source, and most of them have a command-line interface. They all work under Linux (which is the main OS I’m using these days), but most of them are available for other platforms (including Windows) as well.&lt;/p&gt;

&lt;h2 id=&quot;pdf-multi-tools&quot;&gt;PDF multi-tools&lt;/h2&gt;

&lt;p&gt;Before diving into any specific tasks, let’s start with some general-purpose PDF tools and toolkits. Each of these is capable of a wide range of tasks (including some I won’t explicitly address here), and they can be seen as “Swiss army knives” of PDF processing. Whenever I need to get some PDF processing or analysis done and I’m not sure what tool to use, these are usually my starting points. In the majority of cases, at least one of them turns out to have the functionality I’m looking for, so it’s a good idea to check them out if you’re not familiar with them already.&lt;/p&gt;

&lt;h3 id=&quot;xpdfpoppler&quot;&gt;Xpdf/Poppler&lt;/h3&gt;

&lt;p&gt;&lt;a href=&quot;https://www.xpdfreader.com/&quot;&gt;Xpdf&lt;/a&gt; and &lt;a href=&quot;https://poppler.freedesktop.org/&quot;&gt;Poppler&lt;/a&gt; are both PDF viewers that include a collection of tools for processing and manipulating PDF files. Poppler is a fork of Xpdf, which adds a number of unique tools that are not part of the original Xpdf package. The tools included with Poppler are:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;pdfdetach&lt;/strong&gt;: lists or extracts embedded files (attachments)&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;pdffonts&lt;/strong&gt;: analyzes fonts&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;pdfimages&lt;/strong&gt;: extracts images&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;pdfinfo&lt;/strong&gt;: displays document information&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;pdfseparate&lt;/strong&gt;: page extraction tool&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;pdfsig&lt;/strong&gt;: verifies digital signatures&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;pdftocairo&lt;/strong&gt;: converts PDF to PNG/JPEG/PDF/PS/EPS/SVG using the &lt;a href=&quot;https://www.cairographics.org/&quot;&gt;Cairo&lt;/a&gt; graphics library&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;pdftohtml&lt;/strong&gt;: converts PDF to HTML&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;pdftoppm&lt;/strong&gt;: converts PDF to PPM/PNG/JPEG images&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;pdftops&lt;/strong&gt;: converts PDF to PostScript (PS)&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;pdftotext&lt;/strong&gt;: text extraction tool&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;pdfunite&lt;/strong&gt;: document merging tool&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The tools in Xpdf are largely identical, but don’t include &lt;em&gt;pdfseparate&lt;/em&gt;, &lt;em&gt;pdfsig&lt;/em&gt;, &lt;em&gt;pdftocairo&lt;/em&gt;, and &lt;em&gt;pdfunite&lt;/em&gt;. Also, Xpdf has a separate &lt;em&gt;pdftopng&lt;/em&gt; tool for converting PDF to PNG images (this functionality is covered by &lt;em&gt;pdftoppm&lt;/em&gt; in the Poppler version). On Debian-based systems the Poppler tools are part of the package &lt;em&gt;poppler-utils&lt;/em&gt;.&lt;/p&gt;

&lt;h3 id=&quot;pdfcpu&quot;&gt;Pdfcpu&lt;/h3&gt;

&lt;p&gt;&lt;a href=&quot;https://pdfcpu.io/&quot;&gt;Pdfcpu&lt;/a&gt; is a PDF processor that is written in the &lt;em&gt;Go&lt;/em&gt; language. The documentation explicitly mentions that its main focus is strong support for batch processing and scripting via a rich command line. It supports all PDF versions up to PDF 1.7 (ISO-32000).&lt;/p&gt;

&lt;h3 id=&quot;apache-pdfbox&quot;&gt;Apache PDFBox&lt;/h3&gt;

&lt;p&gt;&lt;a href=&quot;https://pdfbox.apache.org/&quot;&gt;Apache PDFBox&lt;/a&gt; is an open source Java library for working with PDF documents. It includes a set of &lt;a href=&quot;https://pdfbox.apache.org/2.0/commandline.html&quot;&gt;command-line tools&lt;/a&gt; for various PDF processing tasks. Binary distributions (as &lt;a href=&quot;https://en.wikipedia.org/wiki/JAR_(file_format)&quot;&gt;JAR&lt;/a&gt; packages) are available &lt;a href=&quot;https://pdfbox.apache.org/download.html&quot;&gt;here&lt;/a&gt; (you’ll need the “standalone” JARs).&lt;/p&gt;

&lt;h3 id=&quot;qpdf&quot;&gt;QPDF&lt;/h3&gt;

&lt;p&gt;&lt;a href=&quot;http://qpdf.sourceforge.net/&quot;&gt;QPDF&lt;/a&gt; is “a command-line program that does structural, content-preserving transformations on PDF files”.&lt;/p&gt;

&lt;h3 id=&quot;mupdf&quot;&gt;MuPDF&lt;/h3&gt;

&lt;p&gt;&lt;a href=&quot;https://www.mupdf.com/&quot;&gt;MuPDF&lt;/a&gt; is “a lightweight PDF, XPS, and E-book viewer”. It includes the &lt;a href=&quot;https://www.mupdf.com/docs/index.html&quot;&gt;mutool&lt;/a&gt; utility, which can do a number of PDF processing tasks.&lt;/p&gt;

&lt;h3 id=&quot;pdftk&quot;&gt;PDFtk&lt;/h3&gt;

&lt;p&gt;&lt;a href=&quot;https://www.pdflabs.com/tools/pdftk-server/&quot;&gt;PDFtk&lt;/a&gt; (server edition) is a “command-line tool for working with PDFs” that is “commonly used for client-side scripting or server-side processing of PDFs”. More information can be found in the &lt;a href=&quot;https://www.pdflabs.com/docs/pdftk-man-page/&quot;&gt;documentation&lt;/a&gt;, and the &lt;a href=&quot;https://www.pdflabs.com/docs/pdftk-cli-examples/&quot;&gt;command-line examples page&lt;/a&gt;. For Ubuntu/Linux Mint users, the most straightforward installation option is the “pdftk-java” Debian package. This is a Java fork of PDFtk&lt;sup id=&quot;fnref:7&quot;&gt;&lt;a href=&quot;#fn:7&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;1&lt;/a&gt;&lt;/sup&gt;.&lt;/p&gt;

&lt;h3 id=&quot;ghostscript&quot;&gt;Ghostscript&lt;/h3&gt;

&lt;p&gt;&lt;a href=&quot;https://www.ghostscript.com/&quot;&gt;Ghostscript&lt;/a&gt; is “an interpreter for the PostScript language and PDF files”. It provides rendering to a variety of raster and vector formats.&lt;/p&gt;

&lt;p&gt;The remaining sections of this post are dedicated to specific tasks. As you will see, many of these can be addressed using the multi-tools listed in this section.&lt;/p&gt;

&lt;h2 id=&quot;validation-and-integrity-testing&quot;&gt;Validation and integrity testing&lt;/h2&gt;

&lt;p&gt;PDFs that are damaged, structurally flawed or otherwise not conformant to the PDF format specification can result in a multitude of problems. A number of tools provide error checking and integrity testing functionality. This can range from limited structure checks, to full (claimed) validation against the filespec. It’s important to note that none of the tools mentioned here are perfect, and some faults that are picked up by one tool may be completely ignored by another one and vice versa. So it’s often a good idea to try multiple tools. A good example of this approach can be found in &lt;a href=&quot;https://openpreservation.org/blogs/trouble-shooting-pdf-validation-errors-a-case-of-pdf-hul-38/&quot;&gt;this blog post by Micky Lindlar&lt;/a&gt;.&lt;/p&gt;

&lt;h3 id=&quot;validate-with-pdfcpu&quot;&gt;Validate with Pdfcpu&lt;/h3&gt;

&lt;p&gt;The Pdfcpu command-line tool has a &lt;a href=&quot;https://pdfcpu.io/core/validate&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;validate&lt;/code&gt; command&lt;/a&gt; that checks a file’s compliance against &lt;a href=&quot;https://www.adobe.com/content/dam/acom/en/devnet/pdf/pdfs/PDF32000_2008.pdf&quot;&gt;PDF 32000-1:2008&lt;/a&gt; (i.e. the ISO version of PDF 1.7). It provides both a “strict” and a “relaxed” validation mode, where the “relaxed” mode (which is the default!) ignores some common violations of the PDF specification. The command-line is:&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;pdfcpu validate whatever.pdf
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The “strict” mode can be activated with the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;-m&lt;/code&gt; option:&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;pdfcpu validate &lt;span class=&quot;nt&quot;&gt;-m&lt;/span&gt; strict whatever.pdf
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h3 id=&quot;validate-with-jhove&quot;&gt;Validate with JHOVE&lt;/h3&gt;

&lt;p&gt;&lt;a href=&quot;http://jhove.openpreservation.org/&quot;&gt;JHOVE&lt;/a&gt; is a file format identification, validation and characterisation tool that includes a module for PDF validation. It is widely used in the digital heritage (libraries, archives) sector. Here’s a typical command-line example (note that you explicitly need to invoke the &lt;em&gt;PDF-hul&lt;/em&gt; module via the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;-m&lt;/code&gt; option; omitting this can give unexpected results):&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;jhove &lt;span class=&quot;nt&quot;&gt;-m&lt;/span&gt; PDF-hul &lt;span class=&quot;nt&quot;&gt;-i&lt;/span&gt; whatever.pdf
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Check out the &lt;a href=&quot;https://jhove.openpreservation.org/modules/pdf/&quot;&gt;documentation&lt;/a&gt; for more information about JHOVE’s PDF module, and its limitations.&lt;/p&gt;

&lt;h3 id=&quot;validate-with-arlington-pdf-model-checker&quot;&gt;Validate with Arlington PDF Model Checker&lt;/h3&gt;

&lt;p&gt;The &lt;a href=&quot;https://openpreservation.org/news/arlington-pdf-model-checker-released/&quot;&gt;Arlington PDF Model Checker&lt;/a&gt; checks a PDF against the &lt;a href=&quot;https://github.com/pdf-association/arlington-pdf-model&quot;&gt;Arlington PDF Model&lt;/a&gt;. The Arlington Model is a machine-readable representation of all object types that are defined by &lt;a href=&quot;https://www.pdfa-inc.org/product/iso-32000-2-pdf-2-0-bundle-sponsored-access/&quot;&gt;ISO 32000-2:2020&lt;/a&gt; (PDF 2.0) and all earlier PDF versions. This does not offer full PDF validation&lt;sup id=&quot;fnref:8&quot;&gt;&lt;a href=&quot;#fn:8&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;2&lt;/a&gt;&lt;/sup&gt;, but the coverage of PDF objects is vastly more comprehensive than JHOVE (or any other tool I’m aware of)&lt;sup id=&quot;fnref:9&quot;&gt;&lt;a href=&quot;#fn:9&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;3&lt;/a&gt;&lt;/sup&gt;. The software is based on VeraPDF (which is discussed further on), and Java installers can be downloaded from &lt;a href=&quot;https://software.verapdf.org/releases/arlington&quot;&gt;VeraPDF’s releases section&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;After installation, run the software like this:&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;arlington-pdf-model-checker whatever.pdf &lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; whatever.xml
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;By default, the Arlington PDF Model checker tries to automatically establish the PDF version, and then checks the file accordingly. Use the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;-f&lt;/code&gt; (alias: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;--flavour&lt;/code&gt;) option to force a specific version. As an example, the following command will result in validation against PDF 1.4:&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;arlington-pdf-model-checker &lt;span class=&quot;nt&quot;&gt;-f&lt;/span&gt; arlington1.4 whatever.pdf &lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; whatever.xml
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h3 id=&quot;check-integrity-with-qpdf&quot;&gt;Check integrity with QPDF&lt;/h3&gt;

&lt;p&gt;The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;--check&lt;/code&gt; option of QPDF (see above) performs checks on a PDF’s overall file structure. QPDF does not provide full-fledged validation, and the &lt;a href=&quot;http://qpdf.sourceforge.net/files/qpdf-manual.html&quot;&gt;documentation&lt;/a&gt; states that:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;A file for which --check reports no errors may still have errors in stream data content but should otherwise be structurally sound&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Nevertheless, QPDF is still useful for detecting various issues, especially in conjunction with the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;--verbose&lt;/code&gt; option. Here’s an example command-line:&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;qpdf &lt;span class=&quot;nt&quot;&gt;--check&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;--verbose&lt;/span&gt; whatever.pdf
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
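
&lt;p&gt;For checking many files at once, QPDF’s exit status can be used: according to the QPDF documentation, it is 0 if no errors or warnings were found, 3 if only warnings were found, and 2 in case of errors. Here’s a minimal sketch of a batch check based on this (the loop and the log file names are merely an illustration, and it assumes the current directory contains at least one PDF):&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# Check all PDFs in the current directory; QPDF&apos;s messages go to one log file per PDF
for f in *.pdf; do
    qpdf --check &quot;$f&quot; &amp;gt; &quot;${f%.pdf}-qpdf.log&quot; 2&amp;gt;&amp;amp;1
    status=$?
    # Report a one-line summary based on the exit status
    case $status in
        0) echo &quot;$f: no errors or warnings&quot; ;;
        3) echo &quot;$f: warnings only&quot; ;;
        *) echo &quot;$f: errors (exit status $status)&quot; ;;
    esac
done
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;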

&lt;h3 id=&quot;check-for-ghostscript-rendering-errors&quot;&gt;Check for Ghostscript rendering errors&lt;/h3&gt;

&lt;p&gt;Another useful technique is to process a PDF with Ghostscript (rendering the result to a “nullpage” device). For example:&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;gs &lt;span class=&quot;nt&quot;&gt;-dNOPAUSE&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-dBATCH&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-sDEVICE&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;nullpage whatever.pdf
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;In case of any problems with the input file, Ghostscript will report quite detailed information. As an example, here’s the output for a PDF with a truncated document trailer:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;   **** Error:  An error occurred while reading an XREF table.
   **** The file has been damaged.  This may have been caused
   **** by a problem while converting or transfering the file.
   **** Ghostscript will attempt to recover the data.
   **** However, the output may be incorrect.
   **** Warning:  There are objects with matching object and generation
   **** numbers.  The output may be incorrect.
   **** Error:  Trailer dictionary not found.
                Output may be incorrect.
   No pages will be processed (FirstPage &amp;gt; LastPage).

   **** This file had errors that were repaired or ignored.
   **** Please notify the author of the software that produced this
   **** file that it does not conform to Adobe&apos;s published PDF
   **** specification.
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h3 id=&quot;check-for-errors-with-mutool-info-command&quot;&gt;Check for errors with Mutool info command&lt;/h3&gt;

&lt;p&gt;Running Mutool (part of MuPDF, see above) with the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;info&lt;/code&gt; command returns information about internal PDF resources. In case of broken or malformed files the output includes error messages, which can be quite informative. Here’s an example command-line:&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;mutool info whatever.pdf 
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h3 id=&quot;check-for-errors-with-exiftool&quot;&gt;Check for errors with ExifTool&lt;/h3&gt;

&lt;p&gt;&lt;a href=&quot;https://exiftool.org/&quot;&gt;ExifTool&lt;/a&gt; is designed for reading, writing and editing meta-information for a plethora of file formats, including PDF. Although it does not do full-fledged validation, it will report error and warning messages for various read issues, and these can be useful for identifying problematic PDFs. For example, here we use ExifTool on a PDF with some internal byte corruption:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;exiftool corrupted.pdf
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Result:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;ExifTool Version Number         : 11.88
File Name                       : corrupted.pdf
Directory                       : .
File Size                       : 87 kB
File Modification Date/Time     : 2022:02:07 14:36:47+01:00
File Access Date/Time           : 2022:02:07 14:37:11+01:00
File Inode Change Date/Time     : 2022:02:07 14:36:59+01:00
File Permissions                : rw-rw-r--
File Type                       : PDF
File Type Extension             : pdf
MIME Type                       : application/pdf
PDF Version                     : 1.3
Linearized                      : No
Warning                         : Invalid xref table
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;In this case the byte corruption results in an “Invalid xref table” warning. Many other errors and warnings are possible. Check out &lt;a href=&quot;https://openpreservation.org/blogs/pdf-validation-with-exiftool-quick-and-not-so-dirty/&quot;&gt;this blog post by Yvonne Tunnat&lt;/a&gt; which discusses PDF “validation” with ExifTool in more detail.&lt;/p&gt;

&lt;h3 id=&quot;other-options&quot;&gt;Other options&lt;/h3&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://verapdf.org/&quot;&gt;VeraPDF&lt;/a&gt; can provide useful information on damaged or invalid PDF documents. However, VeraPDF is primarily aimed at validation against &lt;a href=&quot;https://en.wikipedia.org/wiki/PDF/A&quot;&gt;PDF/A&lt;/a&gt; and &lt;a href=&quot;https://en.wikipedia.org/wiki/PDF/UA&quot;&gt;PDF/UA&lt;/a&gt; profiles, which are both subsets of &lt;a href=&quot;https://en.wikipedia.org/wiki/PDF&quot;&gt;ISO 32000&lt;/a&gt; (which defines the PDF format’s full feature set). As a result, VeraPDF’s validation output can be somewhat difficult to interpret for “regular” PDFs (i.e. documents that are not PDF/A or PDF/UA). Nevertheless, experienced users may find VeraPDF useful for such files as well.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Several online resources recommend the &lt;em&gt;pdfinfo&lt;/em&gt; tool that is part of Xpdf and Poppler for integrity checking. However, while writing this post I ran a quick test of the tool on a PDF with a truncated document trailer&lt;sup id=&quot;fnref:1&quot;&gt;&lt;a href=&quot;#fn:1&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;4&lt;/a&gt;&lt;/sup&gt; (which is a very serious flaw), which was not flagged by &lt;em&gt;pdfinfo&lt;/em&gt; at all.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;
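
&lt;p&gt;Since no single tool catches everything, it can be convenient to run several of the checks from this section on the same file in one go. The following is just a rough sketch that strings together the commands shown above. The names of the output files are arbitrary, and each tool’s messages still need to be inspected by hand:&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;f=whatever.pdf
# Write the messages of each tool (both stdout and stderr) to a separate file
pdfcpu validate -m strict &quot;$f&quot; &amp;gt; pdfcpu.txt 2&amp;gt;&amp;amp;1
jhove -m PDF-hul -i &quot;$f&quot; &amp;gt; jhove.txt 2&amp;gt;&amp;amp;1
qpdf --check --verbose &quot;$f&quot; &amp;gt; qpdf.txt 2&amp;gt;&amp;amp;1
gs -dNOPAUSE -dBATCH -sDEVICE=nullpage &quot;$f&quot; &amp;gt; gs.txt 2&amp;gt;&amp;amp;1
mutool info &quot;$f&quot; &amp;gt; mutool.txt 2&amp;gt;&amp;amp;1
exiftool &quot;$f&quot; &amp;gt; exiftool.txt 2&amp;gt;&amp;amp;1
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;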

&lt;h2 id=&quot;pdfa-and-pdfua-compliance-testing-with-verapdf&quot;&gt;PDF/A and PDF/UA compliance testing with VeraPDF&lt;/h2&gt;

&lt;p&gt;&lt;a href=&quot;https://en.wikipedia.org/wiki/PDF/A&quot;&gt;PDF/A&lt;/a&gt; comprises a set of ISO-standardized profiles that are aimed at long-term preservation. &lt;a href=&quot;https://en.wikipedia.org/wiki/PDF/UA&quot;&gt;PDF/UA&lt;/a&gt; is another ISO-standardized profile that ensures accessibility for people with disabilities. These are not separate file formats, but rather profiles within ISO 32000 that put some constraints on PDF’s full set of features. &lt;a href=&quot;https://verapdf.org/&quot;&gt;VeraPDF&lt;/a&gt; was originally developed as an open source PDF/A validator that covers all parts of the PDF/A standards. Starting with version 1.18, it also added support for PDF/UA. The following command lists all available validation profiles:&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;verapdf &lt;span class=&quot;nt&quot;&gt;-l&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Result:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;  1a - PDF/A-1A validation profile
  1b - PDF/A-1B validation profile
  2a - PDF/A-2A validation profile
  2b - PDF/A-2B validation profile
  2u - PDF/A-2U validation profile
  3a - PDF/A-3A validation profile
  3b - PDF/A-3B validation profile
  3u - PDF/A-3U validation profile
  ua1 - PDF/UA-1 validation profile
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;When running VeraPDF, use the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;-f&lt;/code&gt; (flavour) option to set the desired validation profile. For example, for PDF/A-1A use something like this&lt;sup id=&quot;fnref:2&quot;&gt;&lt;a href=&quot;#fn:2&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;5&lt;/a&gt;&lt;/sup&gt;:&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;verapdf &lt;span class=&quot;nt&quot;&gt;-f&lt;/span&gt; 1a whatever.pdf &lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; whatever-1a.xml
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;And for PDF/UA:&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;verapdf &lt;span class=&quot;nt&quot;&gt;-f&lt;/span&gt; ua1 whatever.pdf &lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; whatever-ua.xml
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The &lt;a href=&quot;https://docs.verapdf.org/cli/validation/&quot;&gt;documentation&lt;/a&gt; provides more detailed instructions on how to use VeraPDF.&lt;/p&gt;

&lt;h2 id=&quot;document-information-and-metadata-extraction&quot;&gt;Document information and metadata extraction&lt;/h2&gt;

&lt;p&gt;A large number of tools are capable of displaying or extracting technical characteristics and various kinds of metadata, with varying degrees of detail. I’ll only highlight a few here.&lt;/p&gt;

&lt;h3 id=&quot;extract-general-characteristics-with-pdfinfo&quot;&gt;Extract general characteristics with pdfinfo&lt;/h3&gt;

&lt;p&gt;The &lt;em&gt;pdfinfo&lt;/em&gt; tool that is part of Xpdf and Poppler is useful for a quick overview of a document’s general characteristics. The basic command line is:&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;pdfinfo whatever.pdf
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Which gives the following result:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;Creator:        PdfCompressor 3.1.32
Producer:       CVISION Technologies
CreationDate:   Thu Sep  2 07:52:56 2021 CEST
ModDate:        Thu Sep  2 07:53:20 2021 CEST
Tagged:         no
UserProperties: no
Suspects:       no
Form:           none
JavaScript:     no
Pages:          1
Encrypted:      no
Page size:      439.2 x 637.92 pts
Page rot:       0
File size:      24728 bytes
Optimized:      yes
PDF version:    1.6
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
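
&lt;p&gt;If you only need one of these values in a script, a small filter can pick it out of the report. The sketch below is my own illustration rather than a pdfinfo feature: the function name &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pdfinfo_field&lt;/code&gt; is hypothetical, and it assumes the “Label: value” layout shown above.&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;pdfinfo_field() {
  # Print the value of one field from pdfinfo output read on stdin.
  # $1 is the field label as pdfinfo prints it, without the colon (e.g. Pages).
  awk -F: -v key=&quot;$1&quot; &apos;$1 == key { sub(/^[^:]*: */, &quot;&quot;); print; exit }&apos;
}

# Usage (assumes pdfinfo is on the PATH):
# pdfinfo whatever.pdf | pdfinfo_field &quot;PDF version&quot;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Only the text up to the first colon is treated as the label, so values that contain colons themselves (such as the dates) are printed in full.&lt;/p&gt;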

&lt;h3 id=&quot;extract-metadata-with-apache-tika&quot;&gt;Extract metadata with Apache Tika&lt;/h3&gt;

&lt;p&gt;&lt;a href=&quot;https://tika.apache.org/&quot;&gt;Apache Tika&lt;/a&gt; is a Java library that supports metadata and content extraction for a wide variety of file formats. For command-line use, download the &lt;em&gt;Tika-app&lt;/em&gt; runnable JAR from &lt;a href=&quot;https://tika.apache.org/download.html&quot;&gt;here&lt;/a&gt;. By default, Tika will extract both text and metadata, and report both in XHTML format. Tika has several command-line options that change this behaviour. A basic metadata extraction command is (you may need to adapt the path and name of the JAR file):&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;java &lt;span class=&quot;nt&quot;&gt;-jar&lt;/span&gt; ~/tika/tika-app-2.1.0.jar &lt;span class=&quot;nt&quot;&gt;-m&lt;/span&gt; whatever.pdf &lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; whatever.txt
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Result:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;Content-Length: 24728
Content-Type: application/pdf
X-TIKA:Parsed-By: org.apache.tika.parser.DefaultParser
X-TIKA:Parsed-By: org.apache.tika.parser.pdf.PDFParser
access_permission:assemble_document: true
access_permission:can_modify: true
access_permission:can_print: true
access_permission:can_print_degraded: true
access_permission:extract_content: true
access_permission:extract_for_accessibility: true
access_permission:fill_in_form: true
access_permission:modify_annotations: true
dc:format: application/pdf; version=1.6
dcterms:created: 2021-09-02T05:52:56Z
dcterms:modified: 2021-09-02T05:53:20Z
pdf:PDFVersion: 1.6
pdf:charsPerPage: 0
pdf:docinfo:created: 2021-09-02T05:52:56Z
pdf:docinfo:creator_tool: PdfCompressor 3.1.32
pdf:docinfo:modified: 2021-09-02T05:53:20Z
pdf:docinfo:producer: CVISION Technologies
pdf:encrypted: false
pdf:hasMarkedContent: false
pdf:hasXFA: false
pdf:hasXMP: true
pdf:producer: CVISION Technologies
pdf:unmappedUnicodeCharsPerPage: 0
resourceName: whatever.pdf
xmp:CreateDate: 2021-09-02T07:52:56Z
xmp:CreatorTool: PdfCompressor 3.1.32
xmp:MetadataDate: 2021-09-02T07:53:20Z
xmp:ModifyDate: 2021-09-02T07:53:20Z
xmpMM:DocumentID: uuid:2ec84d65-f99d-49fe-9aac-bd0c1fff5e66
xmpTPg:NPages: 1
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Tika offers several options for alternative output formats (e.g. XMP and JSON); these are all &lt;a href=&quot;https://tika.apache.org/2.1.0/gettingstarted.html&quot;&gt;explained here&lt;/a&gt; (section “Using Tika as a command line utility”).&lt;/p&gt;

&lt;h3 id=&quot;extract-metadata-with-exiftool&quot;&gt;Extract metadata with ExifTool&lt;/h3&gt;

&lt;p&gt;&lt;a href=&quot;https://exiftool.org/&quot;&gt;ExifTool&lt;/a&gt; is another good option for metadata extraction. Here’s an example:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;exiftool whatever.pdf
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Result:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;ExifTool Version Number         : 11.88
File Name                       : whatever.pdf
Directory                       : .
File Size                       : 24 kB
File Modification Date/Time     : 2021:09:02 12:23:32+02:00
File Access Date/Time           : 2022:02:07 15:04:11+01:00
File Inode Change Date/Time     : 2021:09:02 15:27:38+02:00
File Permissions                : rw-rw-r--
File Type                       : PDF
File Type Extension             : pdf
MIME Type                       : application/pdf
PDF Version                     : 1.6
Linearized                      : Yes
Create Date                     : 2021:09:02 07:52:56+02:00
Creator                         : PdfCompressor 3.1.32
Modify Date                     : 2021:09:02 07:53:20+02:00
XMP Toolkit                     : Adobe XMP Core 5.6-c017 91.164464, 2020/06/15-10:20:05
Metadata Date                   : 2021:09:02 07:53:20+02:00
Creator Tool                    : PdfCompressor 3.1.32
Format                          : application/pdf
Document ID                     : uuid:2ec84d65-f99d-49fe-9aac-bd0c1fff5e66
Instance ID                     : uuid:28d0af59-9373-4358-88f2-c8c4db3915ed
Producer                        : CVISION Technologies
Page Count                      : 1
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;ExifTool can also write the extracted metadata to a variety of output formats, which is explained in the documentation.&lt;/p&gt;

&lt;h3 id=&quot;extract-metadata-from-embedded-documents&quot;&gt;Extract metadata from embedded documents&lt;/h3&gt;

&lt;p&gt;One particularly useful feature of Tika is its ability to deal with embedded documents. As an example, &lt;a href=&quot;https://github.com/openpreserve/format-corpus/blob/master/pdfCabinetOfHorrors/digitally_signed_3D_Portfolio.pdf&quot;&gt;this file&lt;/a&gt; is a &lt;a href=&quot;https://helpx.adobe.com/acrobat/using/overview-pdf-portfolios.html&quot;&gt;PDF portfolio&lt;/a&gt;, which can contain multiple files and file types. Invoking Tika with the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;-J&lt;/code&gt; (“output metadata and content from all embedded files”) option results in JSON-formatted output that contains metadata (and also extracted text) for all files that are embedded in this document:&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;java &lt;span class=&quot;nt&quot;&gt;-jar&lt;/span&gt; ~/tika/tika-app-2.1.0.jar &lt;span class=&quot;nt&quot;&gt;-J&lt;/span&gt; digitally_signed_3D_Portfolio.pdf &lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; whatever.json
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h3 id=&quot;elaborate-feature-extraction-with-verapdf&quot;&gt;Elaborate feature extraction with VeraPDF&lt;/h3&gt;

&lt;p&gt;Although primarily aimed at PDF/A validation, &lt;a href=&quot;https://verapdf.org/&quot;&gt;VeraPDF&lt;/a&gt; can also be used as a powerful metadata and feature extractor for any PDF file (including files that don’t follow the PDF/A or PDF/UA profiles at all!). By default, VeraPDF is configured to only extract metadata from a PDF’s information dictionary, but this behaviour can be easily changed by modifying a configuration file, which is &lt;a href=&quot;https://docs.verapdf.org/cli/config/#features.xml&quot;&gt;explained in the documentation&lt;/a&gt;. This enables you to obtain detailed information about things like Actions, Annotations, colour spaces, document security features (including encryption), embedded files, fonts, images, and much more. Then use a command line like&lt;sup id=&quot;fnref:3&quot;&gt;&lt;a href=&quot;#fn:3&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;6&lt;/a&gt;&lt;/sup&gt;:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;verapdf --off --extract whatever.pdf &amp;gt; whatever.xml
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;VeraPDF can also be used to recursively process all files with a .pdf extension in a directory tree, using the following command-line (here, &lt;em&gt;myDir&lt;/em&gt; is the root of the directory tree):&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;verapdf --recurse --off --extract myDir &amp;gt; whatever.xml
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The &lt;a href=&quot;https://docs.verapdf.org/cli/feature-extraction/&quot;&gt;VeraPDF documentation&lt;/a&gt; discusses the feature extraction functionality in more detail.&lt;/p&gt;

&lt;h2 id=&quot;policy-or-profile-compliance-assessment-with-verapdf&quot;&gt;Policy or profile compliance assessment with VeraPDF&lt;/h2&gt;

&lt;p&gt;The results of the feature extraction exercise described in the previous section can also be used as input for policy-based assessments. For instance, archival institutions may have policies that prohibit PDFs with encryption, or with fonts that are not embedded. VeraPDF can do this as well, provided that the rules that make up the policy are expressed as a machine-readable &lt;a href=&quot;https://en.wikipedia.org/wiki/Schematron&quot;&gt;Schematron&lt;/a&gt; file. As an example, the Schematron file below is made up of two rules that each prohibit specific encryption-related features:&lt;/p&gt;

&lt;div class=&quot;language-xml highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;cp&quot;&gt;&amp;lt;?xml version=&quot;1.0&quot;?&amp;gt;&lt;/span&gt;

&lt;span class=&quot;nt&quot;&gt;&amp;lt;sch:schema&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;xmlns:sch=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;http://purl.oclc.org/dsdl/schematron&quot;&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;queryBinding=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;xslt&quot;&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;&amp;lt;sch:pattern&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;name=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;Disallow encrypt in trailer dictionary&quot;&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;&amp;gt;&lt;/span&gt;
        &lt;span class=&quot;nt&quot;&gt;&amp;lt;sch:rule&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;context=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;/report/jobs/job/featuresReport/documentSecurity&quot;&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;&amp;gt;&lt;/span&gt;
            &lt;span class=&quot;nt&quot;&gt;&amp;lt;sch:assert&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;test=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;not(encryptMetadata = &apos;true&apos;)&quot;&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;&amp;gt;&lt;/span&gt;Encrypt in trailer dictionary&lt;span class=&quot;nt&quot;&gt;&amp;lt;/sch:assert&amp;gt;&lt;/span&gt;
        &lt;span class=&quot;nt&quot;&gt;&amp;lt;/sch:rule&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;&amp;lt;/sch:pattern&amp;gt;&lt;/span&gt;    
    
    &lt;span class=&quot;nt&quot;&gt;&amp;lt;sch:pattern&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;name=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;Disallow other forms of encryption (e.g. open password)&quot;&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;&amp;gt;&lt;/span&gt;
        &lt;span class=&quot;nt&quot;&gt;&amp;lt;sch:rule&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;context=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;/report/jobs/job/taskResult/exceptionMessage&quot;&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;&amp;gt;&lt;/span&gt;
            &lt;span class=&quot;nt&quot;&gt;&amp;lt;sch:assert&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;test=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;not(contains(.,&apos;encrypted&apos;))&quot;&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;&amp;gt;&lt;/span&gt;Encrypted document&lt;span class=&quot;nt&quot;&gt;&amp;lt;/sch:assert&amp;gt;&lt;/span&gt;
        &lt;span class=&quot;nt&quot;&gt;&amp;lt;/sch:rule&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;&amp;lt;/sch:pattern&amp;gt;&lt;/span&gt;
    
&lt;span class=&quot;nt&quot;&gt;&amp;lt;/sch:schema&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;A PDF can subsequently be tested against these rules (here in the file “policy.sch”) using the following basic command-line:&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;verapdf &lt;span class=&quot;nt&quot;&gt;--extract&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;--policyfile&lt;/span&gt; policy.sch whatever.pdf &lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; whatever.xml
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The outcome of the policy-based assessment can be found in the output file’s &lt;em&gt;policyReport&lt;/em&gt; element. In the example below, the PDF did not meet one of the rules:&lt;/p&gt;

&lt;div class=&quot;language-xml highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nt&quot;&gt;&amp;lt;policyReport&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;passedChecks=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;0&quot;&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;failedChecks=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;1&quot;&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;xmlns:vera=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;http://www.verapdf.org/MachineReadableReport&quot;&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;&amp;lt;passedChecks/&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;&amp;lt;failedChecks&amp;gt;&lt;/span&gt;
        &lt;span class=&quot;nt&quot;&gt;&amp;lt;check&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;status=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;failed&quot;&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;test=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;not(encryptMetadata = &apos;true&apos;)&quot;&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;location=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;/report/jobs/job/featuresReport/documentSecurity&quot;&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;&amp;gt;&lt;/span&gt;
            &lt;span class=&quot;nt&quot;&gt;&amp;lt;message&amp;gt;&lt;/span&gt;Encrypt in trailer dictionary&lt;span class=&quot;nt&quot;&gt;&amp;lt;/message&amp;gt;&lt;/span&gt;
        &lt;span class=&quot;nt&quot;&gt;&amp;lt;/check&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;&amp;lt;/failedChecks&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;/policyReport&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
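
&lt;p&gt;For batch workflows it can be handy to reduce this report to a single number. The following is a rough sketch only: the function name &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;policy_failed_checks&lt;/code&gt; is hypothetical, and it assumes that the opening &lt;em&gt;policyReport&lt;/em&gt; tag sits on one line with its attributes laid out as in the example above. A proper XML parser would be more robust than this pattern matching.&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;policy_failed_checks() {
  # Read a VeraPDF XML report on stdin and print the failedChecks attribute
  # of the opening policyReport tag (prints nothing if no such tag is found).
  grep -o &apos;policyReport [^&gt;]*failedChecks=&quot;[0-9]*&quot;&apos; \
    | head -n 1 \
    | sed &apos;s/.*failedChecks=&quot;\([0-9]*\)&quot;/\1/&apos;
}

# Usage (assumes verapdf is on the PATH and policy.sch exists):
# verapdf --extract --policyfile policy.sch whatever.pdf | policy_failed_checks
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The match is anchored on the &lt;em&gt;policyReport&lt;/em&gt; tag name, so attributes with the same name elsewhere in the report are ignored.&lt;/p&gt;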

&lt;p&gt;More examples can be found in my 2017 post &lt;a href=&quot;/2017/06/01/policy-based-assessment-with-verapdf-a-first-impression&quot;&gt;Policy-based assessment with VeraPDF - a first impression&lt;/a&gt;.&lt;/p&gt;

&lt;h2 id=&quot;text-extraction&quot;&gt;Text extraction&lt;/h2&gt;

&lt;p&gt;Text extraction from PDF documents is notoriously hard. &lt;a href=&quot;https://filingdb.com/b/pdf-text-extraction&quot;&gt;This post&lt;/a&gt; gives a good overview of the main pitfalls. Tim Allison’s excellent &lt;a href=&quot;https://irsg.bcs.org/informer/wp-content/uploads/OverviewOfTextExtractionFromPDFs.pdf&quot;&gt;Brief Overview of the Portable Document Format (PDF) and Some Challenges for Text Extraction&lt;/a&gt; provides a more in-depth discussion, and this really is a must-read for anyone seriously interested in this subject. With that said, quite a few tools are available, and below I list a few that are useful starting points.&lt;/p&gt;

&lt;h3 id=&quot;extract-text-with-pdftotext&quot;&gt;Extract text with pdftotext&lt;/h3&gt;

&lt;p&gt;The &lt;em&gt;pdftotext&lt;/em&gt; tool that is part of Poppler and Xpdf is a good starting point. The basic command-line is:&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;pdftotext whatever.pdf whatever.txt
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The tool has lots of options to fine-tune the default behaviour, so make sure to check those out if the default output isn’t what you’re looking for. Note that the available options vary somewhat between the Poppler and Xpdf versions. The &lt;a href=&quot;https://manpages.debian.org/stretch/poppler-utils/pdftotext.1.en.html&quot;&gt;documentation of the Poppler version is available here&lt;/a&gt;, and &lt;a href=&quot;https://www.xpdfreader.com/pdftotext-man.html&quot;&gt;here is the Xpdf version&lt;/a&gt;.&lt;/p&gt;

&lt;h3 id=&quot;extract-text-with-pdfbox&quot;&gt;Extract text with PDFBox&lt;/h3&gt;

&lt;p&gt;PDFBox is also a good choice for text extraction. Here’s an example command (you may need to adapt the path to the JAR file and its name according to the location and version on your system):&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;java &lt;span class=&quot;nt&quot;&gt;-jar&lt;/span&gt; ~/pdfbox/pdfbox-app-2.0.24.jar ExtractText whatever.pdf whatever.txt
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;PDFBox also provides various options, which are &lt;a href=&quot;https://pdfbox.apache.org/1.8/commandline.html#extracttext&quot;&gt;documented here&lt;/a&gt;.&lt;/p&gt;

&lt;h3 id=&quot;extract-text-with-apache-tika&quot;&gt;Extract text with Apache Tika&lt;/h3&gt;

&lt;p&gt;I already mentioned &lt;a href=&quot;https://tika.apache.org/&quot;&gt;Apache Tika&lt;/a&gt; in the metadata extraction section. Tika is also a powerful text extraction tool, and it is particularly useful for situations where text extraction from multiple input formats is needed. For PDF it uses the PDF parser of PDFBox (see previous section). By default, Tika extracts both text and metadata, and reports both in XHTML format. If needed, you can change this behaviour with the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;--text&lt;/code&gt; option:&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;java &lt;span class=&quot;nt&quot;&gt;-jar&lt;/span&gt; ~/tika/tika-app-2.1.0.jar &lt;span class=&quot;nt&quot;&gt;--text&lt;/span&gt; whatever.pdf &lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; whatever.txt
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;Again, an explanation of all available options is &lt;a href=&quot;https://tika.apache.org/2.1.0/gettingstarted.html&quot;&gt;available here&lt;/a&gt; (section “Using Tika as a command line utility”).&lt;/p&gt;

&lt;h3 id=&quot;batch-processing-with-tika&quot;&gt;Batch processing with Tika&lt;/h3&gt;

&lt;p&gt;The above single-file command does not scale well for situations that require the processing of large volumes of PDFs&lt;sup id=&quot;fnref:5&quot;&gt;&lt;a href=&quot;#fn:5&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;7&lt;/a&gt;&lt;/sup&gt;. In such cases, it’s better to run Tika in batch mode. As an example, the command below will process all files in directory “myPDFs”, and store the results in output directory “tika-out”&lt;sup id=&quot;fnref:6&quot;&gt;&lt;a href=&quot;#fn:6&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;8&lt;/a&gt;&lt;/sup&gt;:&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;java &lt;span class=&quot;nt&quot;&gt;-jar&lt;/span&gt; ~/tika/tika-app-2.1.0.jar &lt;span class=&quot;nt&quot;&gt;--text&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-i&lt;/span&gt; ./myPDFs/ &lt;span class=&quot;nt&quot;&gt;-o&lt;/span&gt; ./tika-out/
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Alternatively, you could use TikaServer. A &lt;a href=&quot;https://tika.apache.org/download.html&quot;&gt;runnable JAR is available here&lt;/a&gt;. To use it, first start the server using:&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;java &lt;span class=&quot;nt&quot;&gt;-jar&lt;/span&gt; ~/tika/tika-server-standard-2.1.0.jar
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;Once the server is running, use &lt;a href=&quot;https://en.wikipedia.org/wiki/CURL&quot;&gt;cURL&lt;/a&gt; (from another terminal window) to submit text extraction requests:&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;curl &lt;span class=&quot;nt&quot;&gt;-T&lt;/span&gt; whatever.pdf http://localhost:9998/tika &lt;span class=&quot;nt&quot;&gt;--header&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;Accept: text/plain&quot;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; whatever.txt
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The full TikaServer documentation is &lt;a href=&quot;https://cwiki.apache.org/confluence/display/TIKA/TikaServer&quot;&gt;available here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Yet another option is &lt;a href=&quot;https://github.com/chrismattmann/tika-python&quot;&gt;Tika-python&lt;/a&gt;, which is a Python port of Tika that uses TikaServer under the hood (resulting in similar performance).&lt;/p&gt;

&lt;h2 id=&quot;link-extraction&quot;&gt;Link extraction&lt;/h2&gt;

&lt;p&gt;When extracting (hyper)links, it’s important to make a distinction between the following two cases:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Links that are encoded as a “link annotation”, which is a data structure in PDF that results in a clickable link&lt;/li&gt;
  &lt;li&gt;Non-clickable links/URLs that are just part of the body text.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The automated extraction of the first case is straightforward, while the second case depends on some kind of lexical analysis of the body text (typically based on regular expressions). For most practical applications the extraction of both types is desired.&lt;/p&gt;
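
&lt;p&gt;To give an idea of what the second case involves, here is a deliberately crude sketch that pulls anything that looks like an http(s) URL out of plain text. The function name &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;extract_urls&lt;/code&gt; and the pattern are my own assumptions for illustration; the pattern keeps any trailing punctuation and does not rejoin URLs that are broken across lines.&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;extract_urls() {
  # Read plain text on stdin and print each distinct http(s) URL once.
  # A URL is taken to run from the scheme up to the next whitespace character.
  grep -oE &apos;https?://[^[:space:]]+&apos; | sort -u
}

# Usage (assumes pdftotext is on the PATH; the dash sends the text to stdout):
# pdftotext whatever.pdf - | extract_urls
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;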

&lt;h3 id=&quot;extract-links-with-pdfx&quot;&gt;Extract links with pdfx&lt;/h3&gt;

&lt;p&gt;The &lt;a href=&quot;https://www.metachris.com/pdfx/&quot;&gt;pdfx&lt;/a&gt; tool is designed to detect and extract external references, including URLs. Its URL detection uses lexical analysis, and is based on &lt;a href=&quot;https://gist.github.com/gruber/8891611&quot;&gt;RegEx patterns written by John Gruber&lt;/a&gt;. The basic command line for URL extraction is:&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;pdfx &lt;span class=&quot;nt&quot;&gt;-v&lt;/span&gt; whatever.pdf &lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; whatever.txt
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;I did some limited testing with this tool in 2016. One issue I ran into is that pdfx &lt;a href=&quot;https://github.com/metachris/pdfx/issues/21&quot;&gt;truncates URLs that span more than one line&lt;/a&gt;. As of 2021, this issue still hasn’t been fixed, which seriously limits the usefulness of this (otherwise very interesting) tool. It’s worth mentioning that &lt;em&gt;pdfx&lt;/em&gt; also provides functionality to automatically download all referenced PDFs from any PDF document. I haven’t tested this myself.&lt;/p&gt;

&lt;h3 id=&quot;other-link-extraction-tools&quot;&gt;Other link extraction tools&lt;/h3&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;Some years ago Ross Spencer wrote &lt;a href=&quot;https://github.com/httpreserve/tikalinkextract&quot;&gt;a link extraction tool that uses Apache Tika&lt;/a&gt;. There’s more info in &lt;a href=&quot;https://openpreservation.org/blogs/hyperlinks-in-your-files-how-to-get-them-out-using-tikalinkextract/&quot;&gt;this blog post&lt;/a&gt;.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Around the same time I wrote &lt;a href=&quot;https://gist.github.com/bitsgalore/aab680a9bccfc5496948b776ee06397c&quot;&gt;this simple extraction script&lt;/a&gt; that wraps around Apache Tika and the &lt;a href=&quot;https://github.com/mvdan/xurls&quot;&gt;xurls&lt;/a&gt; tool. I used this to extract URLs from MS Word documents, but this should probably work for PDF too (I haven’t tested this though!).&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;image-extraction-with-pdfimages&quot;&gt;Image extraction with pdfimages&lt;/h2&gt;

&lt;p&gt;PDFs often contain embedded images, which can be extracted with the &lt;em&gt;pdfimages&lt;/em&gt; tool that is part of Xpdf/Poppler. At minimum, it takes as its arguments the name of the input PDF document and the “image-root”, which is actually just a text prefix that is used to generate the names of the output images. By default it writes its output to one of the &lt;a href=&quot;https://en.wikipedia.org/wiki/Netpbm&quot;&gt;Netpbm&lt;/a&gt; file formats, but for convenience you might want to use the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;-png&lt;/code&gt; option, which uses the PNG format instead:&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;pdfimages &lt;span class=&quot;nt&quot;&gt;-png&lt;/span&gt; whatever.pdf whatever
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Output images are now written as “whatever-000.png”, “whatever-001.png”, “whatever-002.png”, and so on. The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;-j&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;-jp2&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;-jbig2&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;-ccitt&lt;/code&gt; switches can be used to store JPEG, JPEG2000, JBIG2 and CCITT images in their native formats, respectively (or use &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;-all&lt;/code&gt;, which combines all of these options).&lt;/p&gt;

&lt;!--
### PDFBox

```bash
java -jar ~/pdfbox/pdfbox-app-2.0.24.jar ExtractImages whatever.pdf
```

This gave me the following error for a PDF with an embedded JPEG 2000 image:

```
SEVERE: Cannot read JPEG2000 image: Java Advanced Imaging (JAI) Image I/O Tools are not installed
```

This is a [known bug](https://issues.apache.org/jira/browse/PDFBOX-4681)

--&gt;

&lt;h2 id=&quot;conversion-to-other-graphics-formats-with-pdftocairo&quot;&gt;Conversion to other (graphics) formats with pdftocairo&lt;/h2&gt;

&lt;p&gt;The &lt;em&gt;pdftocairo&lt;/em&gt; tool (part of Poppler) can convert a PDF to a number of (mostly graphics) formats. The supported output formats are PNG, JPEG, TIFF, PostScript, Encapsulated PostScript, Scalable Vector Graphics and PDF. As an example, the following command will convert each page to a PNG image:&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;pdftocairo &lt;span class=&quot;nt&quot;&gt;-png&lt;/span&gt; whatever.pdf
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h2 id=&quot;list-embedded-image-information-with-pdfimages&quot;&gt;List embedded image information with pdfimages&lt;/h2&gt;

&lt;p&gt;The &lt;em&gt;pdfimages&lt;/em&gt; tool is also useful for getting an overview of all embedded images in a PDF, and their main characteristics (width, height, colour, encoding, resolution and size). Just use the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;-list&lt;/code&gt; option as shown below:&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;pdfimages &lt;span class=&quot;nt&quot;&gt;-list&lt;/span&gt; whatever.pdf
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This results in a nice table like this:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;page   num  type   width height color comp bpc  enc interp
-----------------------------------------------------------
   1     0 image    1830  2658  gray    1   1  jbig2  no
   1     1 image     600   773  gray    1   8  jpx    no



page   object ID x-ppi y-ppi size ratio
----------------------------------------
   1       16  0   301   301   99B 0.0%
   1       17  0   300   300 17.9K 4.0%
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h2 id=&quot;conversion-of-multiple-image-files-to-pdf&quot;&gt;Conversion of multiple image files to PDF&lt;/h2&gt;

&lt;h3 id=&quot;losslessly-convert-raster-images-to-pdf-with-img2pdf&quot;&gt;Losslessly convert raster images to pdf with img2pdf&lt;/h3&gt;

&lt;p&gt;The &lt;a href=&quot;https://gitlab.mister-muffin.de/josch/img2pdf&quot;&gt;img2pdf&lt;/a&gt; tool converts a list of image files to PDF. Unlike several other tools (such as ImageMagick), it does not re-encode the source images, but simply embeds them as PDF objects in their original formats. This means that the conversion is always lossless. The following example shows how to convert three &lt;a href=&quot;http://fileformats.archiveteam.org/wiki/JP2&quot;&gt;JP2 (JPEG 2000 Part 1)&lt;/a&gt; images:&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;img2pdf image1.jp2 image2.jp2 image3.jp2 &lt;span class=&quot;nt&quot;&gt;-o&lt;/span&gt; whatever.pdf
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;In the resulting PDF, each image is embedded as an image stream with the JPXDecode (JPEG 2000) filter.&lt;/p&gt;

&lt;h2 id=&quot;pdf-comparison-with-comparepdf&quot;&gt;PDF comparison with Comparepdf&lt;/h2&gt;

&lt;p&gt;The &lt;a href=&quot;http://www.qtrac.eu/&quot;&gt;Comparepdf&lt;/a&gt;&lt;sup id=&quot;fnref:4&quot;&gt;&lt;a href=&quot;#fn:4&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;9&lt;/a&gt;&lt;/sup&gt; tool compares pairs of PDFs, based on either text or visual appearance. By default it uses the program exit code to store the result of the comparison. The tool’s command-line help text explains the possible outcomes:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;A return value of 0 means no differences detected; 1 or 2 signifies an error; 10 means they differ visually, 13 means they differ textually, and 15 means they have different page counts&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;For clarity I used the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;-v&lt;/code&gt; switch in the examples below, which activates verbose output. To test if two PDFs contain the same text, use:&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;comparepdf &lt;span class=&quot;nt&quot;&gt;-ct&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-v&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;2 whatever.pdf wherever.pdf
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;If all goes well, the output is either “No differences detected” or “Files have different texts”.&lt;/p&gt;

&lt;p&gt;To compare the visual appearance of two PDFs, use:&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;comparepdf &lt;span class=&quot;nt&quot;&gt;-ca&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-v&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;2 whatever.pdf wherever.pdf
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;In this case the output either shows “No differences detected” or “Files look different”.&lt;/p&gt;
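
&lt;p&gt;Since the result of the comparison is stored in the exit code, it can be handy to translate that code into a readable message inside a script. The following is a minimal sketch that only uses the codes quoted from the help text above; the function name is made up for this example.&lt;/p&gt;

```shell
# Translate a comparepdf exit code into a readable message.
# The codes are the ones quoted from the tool's help text;
# the function name is a hypothetical choice for this sketch.
describe_comparepdf_status() {
    case "$1" in
        0)   echo "no differences detected" ;;
        1|2) echo "error" ;;
        10)  echo "visual difference" ;;
        13)  echo "textual difference" ;;
        15)  echo "different page counts" ;;
        *)   echo "unexpected exit code: $1" ;;
    esac
}
```

&lt;p&gt;A possible use would be &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;comparepdf -ct whatever.pdf wherever.pdf; describe_comparepdf_status $?&lt;/code&gt;.&lt;/p&gt;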

&lt;h2 id=&quot;repair-a-corrupted-pdf&quot;&gt;Repair a corrupted PDF&lt;/h2&gt;

&lt;p&gt;Sometimes it is possible to recover the contents of corrupted or otherwise damaged PDF documents. &lt;a href=&quot;https://superuser.com/questions/278562/how-can-i-fix-repair-a-corrupted-pdf-file&quot;&gt;This thread on Super User&lt;/a&gt; mentions two useful options.&lt;/p&gt;

&lt;h3 id=&quot;repair-with-ghostscript&quot;&gt;Repair with Ghostscript&lt;/h3&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;gs &lt;span class=&quot;nt&quot;&gt;-o&lt;/span&gt; whatever_repaired.pdf &lt;span class=&quot;nt&quot;&gt;-sDEVICE&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;pdfwrite &lt;span class=&quot;nt&quot;&gt;-dPDFSETTINGS&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;/prepress whatever_corrupted.pdf
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h3 id=&quot;repair-with-pdftocairo&quot;&gt;Repair with pdftocairo&lt;/h3&gt;

&lt;p&gt;A second option mentioned in the Super User thread is &lt;em&gt;pdftocairo&lt;/em&gt;, which is part of Poppler:&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;pdftocairo &lt;span class=&quot;nt&quot;&gt;-pdf&lt;/span&gt; whatever_corrupted.pdf whatever_repaired.pdf
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;It’s worth adding here that the success of any repair action largely depends on the nature and extent of the damage or corruption, so your mileage may vary. Always make sure to carefully check the result, and keep a copy of the original file.&lt;/p&gt;
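
&lt;p&gt;One way to make sure the original is never lost is to copy it aside before running any repair command. The sketch below is just one possible approach; the function name and the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;.orig&lt;/code&gt; suffix are arbitrary choices, not part of any of the tools mentioned here.&lt;/p&gt;

```shell
# Keep an untouched copy of a PDF before trying to repair it.
# The ".orig" suffix and the function name are arbitrary choices for this sketch.
backup_before_repair() {
    src="$1"
    backup="$1.orig"
    if [ ! -f "$src" ]; then
        echo "not a file: $src"
        return 1
    fi
    if [ -e "$backup" ]; then
        # Never overwrite an earlier backup: it may be the only pristine copy.
        echo "backup already exists: $backup"
        return 1
    fi
    # -p preserves the modification time and permissions of the original.
    cp -p "$src" "$backup"
}
```

&lt;p&gt;Calling &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;backup_before_repair whatever_corrupted.pdf&lt;/code&gt; before a repair attempt leaves a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;whatever_corrupted.pdf.orig&lt;/code&gt; file next to the source.&lt;/p&gt;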

&lt;h3 id=&quot;repair-with-pdftk&quot;&gt;Repair with PDFtk&lt;/h3&gt;

&lt;p&gt;Finally, &lt;em&gt;pdftk&lt;/em&gt; can, &lt;a href=&quot;https://www.pdflabs.com/docs/pdftk-cli-examples/&quot;&gt;according to its documentation&lt;/a&gt;, “repair a PDF’s corrupted XREF table and stream lengths, if possible”. This uses the following command line:&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;pdftk whatever_corrupted.pdf output whatever_repaired.pdf
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h2 id=&quot;reduce-size-of-pdf-with-hi-res-images-with-ghostscript&quot;&gt;Reduce size of PDF with hi-res images with Ghostscript&lt;/h2&gt;

&lt;p&gt;The following Ghostscript command (source &lt;a href=&quot;https://askubuntu.com/questions/113544/how-can-i-reduce-the-file-size-of-a-scanned-pdf-file/256449#256449&quot;&gt;here&lt;/a&gt;) can be useful to reduce the size of a large PDF with high-resolution graphics (note that this will result in quality loss):&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;gs &lt;span class=&quot;nt&quot;&gt;-sDEVICE&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;pdfwrite &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
   &lt;span class=&quot;nt&quot;&gt;-dCompatibilityLevel&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;1.4 &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
   &lt;span class=&quot;nt&quot;&gt;-dPDFSETTINGS&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;/ebook &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
   &lt;span class=&quot;nt&quot;&gt;-dNOPAUSE&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-dQUIET&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-dBATCH&lt;/span&gt; &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
   &lt;span class=&quot;nt&quot;&gt;-sOutputFile&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;whatever_small.pdf whatever_large.pdf
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h2 id=&quot;reduce-size-of-pdf-with-hi-res-images-with-imagemagick&quot;&gt;Reduce size of PDF with hi-res images with ImageMagick&lt;/h2&gt;

&lt;p&gt;As an alternative to the above Ghostscript command (which achieves a size reduction mainly by downsampling the images in the PDF to a lower resolution), you can also use &lt;a href=&quot;https://imagemagick.org/&quot;&gt;ImageMagick&lt;/a&gt;’s &lt;a href=&quot;https://imagemagick.org/script/convert.php&quot;&gt;&lt;em&gt;convert&lt;/em&gt; tool&lt;/a&gt;. This allows you to reduce the file size by changing any combination of resolution (&lt;a href=&quot;https://imagemagick.org/script/command-line-options.php#density&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;-density&lt;/code&gt;&lt;/a&gt; option), compression type (&lt;a href=&quot;https://imagemagick.org/script/command-line-options.php#compress&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;-compress&lt;/code&gt;&lt;/a&gt; option) and compression quality (&lt;a href=&quot;https://imagemagick.org/script/command-line-options.php#quality&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;-quality&lt;/code&gt;&lt;/a&gt; option).&lt;/p&gt;

&lt;p&gt;For example, the command below (source &lt;a href=&quot;https://askubuntu.com/questions/113544/how-can-i-reduce-the-file-size-of-a-scanned-pdf-file/469255#469255&quot;&gt;here&lt;/a&gt;) reduces the size of a source PDF by re-encoding all images as JPEGs with 70% quality at 300 ppi resolution:&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;convert &lt;span class=&quot;nt&quot;&gt;-density&lt;/span&gt; 300 &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
        &lt;span class=&quot;nt&quot;&gt;-compress&lt;/span&gt; jpeg &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
        &lt;span class=&quot;nt&quot;&gt;-quality&lt;/span&gt; 70 &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
         whatever_large.pdf whatever_small.pdf
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;If the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;-density&lt;/code&gt; value is omitted, &lt;em&gt;convert&lt;/em&gt; resamples all images to 72 ppi by default. If you don’t want that, make sure to set the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;-density&lt;/code&gt; value to the resolution of your source PDF (see the section “List embedded image information with pdfimages” on how to do that).&lt;/p&gt;
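
&lt;p&gt;To pick a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;-density&lt;/code&gt; value without reading the &lt;em&gt;pdfimages&lt;/em&gt; listing by eye, a small filter can report the highest horizontal resolution it contains. This is only a sketch: the function name is made up, and it assumes that x-ppi is the 13th whitespace-separated column of the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pdfimages -list&lt;/code&gt; output, preceded by a header line and a dashed separator line.&lt;/p&gt;

```shell
# Read "pdfimages -list" output on stdin and print the highest x-ppi value.
# Assumption: x-ppi is the 13th whitespace-separated column, and the first
# two lines are the column header and the dashed separator.
max_image_ppi() {
    awk 'NR > 2 { if ($13 + 0 > max) max = $13 + 0 } END { print max + 0 }'
}
```

&lt;p&gt;It could be used as &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pdfimages -list whatever_large.pdf | max_image_ppi&lt;/code&gt;. A PDF without images yields 0.&lt;/p&gt;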

&lt;p&gt;Even though ImageMagick’s &lt;em&gt;convert&lt;/em&gt; tool uses Ghostscript under the hood, it doesn’t preserve any text (and probably most other features) of the source PDF, so only use this if you’re only interested in the image data!&lt;/p&gt;

&lt;h2 id=&quot;inspect-low-level-pdf-structure&quot;&gt;Inspect low-level PDF structure&lt;/h2&gt;

&lt;p&gt;The following tools are useful for inspecting and browsing the internal (low-level object) structure of PDF files.&lt;/p&gt;

&lt;h3 id=&quot;inspect-with-pdfbox-pdfdebugger&quot;&gt;Inspect with PDFBox PDFDebugger&lt;/h3&gt;

&lt;p&gt;PDFBox includes a “PDF Debugger”, which you can start with the following command:&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;java &lt;span class=&quot;nt&quot;&gt;-jar&lt;/span&gt; ~/pdfbox/pdfbox-app-2.0.24.jar PDFDebugger whatever.pdf
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Subsequently a GUI window pops up that allows you to browse the PDF’s internal objects:&lt;/p&gt;

&lt;figure class=&quot;image&quot;&gt;
  &lt;img src=&quot;/images/2021/09/pdf-debugger.png&quot; alt=&quot;PDF Debugger screenshot&quot; /&gt;
  &lt;figcaption&gt;Screenshot of PDFBox PDFDebugger.&lt;/figcaption&gt;
&lt;/figure&gt;

&lt;h3 id=&quot;inspect-with-itext-rups&quot;&gt;Inspect with iText RUPS&lt;/h3&gt;

&lt;p&gt;The &lt;a href=&quot;https://github.com/itext/i7j-rups&quot;&gt;iText RUPS&lt;/a&gt; viewer provides similar functionality to PDF Debugger. You can download a self-contained runnable JAR &lt;a href=&quot;https://github.com/itext/i7j-rups/releases/latest&quot;&gt;here&lt;/a&gt; (select the “only-jars” ZIP file). Run it using:&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;java &lt;span class=&quot;nt&quot;&gt;-jar&lt;/span&gt; ~/itext-rups/itext-rups-7.1.16.jar
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Then open a PDF from the GUI, and browse your way through its internal structure:&lt;/p&gt;

&lt;figure class=&quot;image&quot;&gt;
  &lt;img src=&quot;/images/2021/09/itext-rups.png&quot; alt=&quot;iText RUPS screenshot&quot; /&gt;
  &lt;figcaption&gt;Screenshot of iText RUPS.&lt;/figcaption&gt;
&lt;/figure&gt;

&lt;h2 id=&quot;view-search-and-extract-pdf-objects-with-mutool-show&quot;&gt;View, search and extract PDF objects with mutool show&lt;/h2&gt;

&lt;p&gt;Mutool’s &lt;em&gt;show&lt;/em&gt; command allows you to print user-defined low-level PDF objects to stdout. A couple of things you can do with this:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Print the document trailer:
    &lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;mutool show whatever.pdf trailer
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;    &lt;/div&gt;

    &lt;p&gt;Result:&lt;/p&gt;

    &lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;trailer
&amp;lt;&amp;lt;
/DecodeParms &amp;lt;&amp;lt;
  /Columns 3
  /Predictor 12
&amp;gt;&amp;gt;
/Filter /FlateDecode
/ID [ &amp;lt;500AB94E8F45C149808B2EEE98528B78&amp;gt; &amp;lt;431017E495216040A953126BB73D0CD4&amp;gt; ]
/Index [ 11 10 ]
/Info 10 0 R
/Length 47
/Prev 24426
/Root 12 0 R
/Size 21
/Type /XRef
/W [ 1 2 0 ]
&amp;gt;&amp;gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;    &lt;/div&gt;
  &lt;/li&gt;
  &lt;li&gt;Print the cross-reference table:
    &lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;mutool show whatever.pdf xref
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;    &lt;/div&gt;

    &lt;p&gt;Result:&lt;/p&gt;
    &lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;xref
0 21
00000: 0000000000 00000 f 
00001: 0000019994 00000 n 
00002: 0000020399 00000 n 
00003: 0000020534 00000 n 
::
etc
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;    &lt;/div&gt;
  &lt;/li&gt;
  &lt;li&gt;Print an indirect object by its number:
    &lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;mutool show whatever.pdf 12
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;    &lt;/div&gt;

    &lt;p&gt;Result:&lt;/p&gt;

    &lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;12 0 obj
&amp;lt;&amp;lt;
/Metadata 4 0 R
/Pages 9 0 R
/Type /Catalog
&amp;gt;&amp;gt;
endobj
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;    &lt;/div&gt;
  &lt;/li&gt;
  &lt;li&gt;Extract only stream contents as raw binary data and write to a new file:
    &lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;mutool show &lt;span class=&quot;nt&quot;&gt;-be&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-o&lt;/span&gt; whatever.dat whatever.pdf 151
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;    &lt;/div&gt;

    &lt;p&gt;This command is particularly useful for extracting the raw data from a stream object (e.g. an image or multimedia file).&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;More advanced queries are possible as well. For example, the &lt;a href=&quot;https://mupdf.com/docs/manual-mutool-show.html&quot;&gt;mutool manual&lt;/a&gt; gives the following example, which shows all JPEG compressed stream objects in a file:&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;mutool show whatever.pdf &lt;span class=&quot;nb&quot;&gt;grep&lt;/span&gt; | &lt;span class=&quot;nb&quot;&gt;grep&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;&apos;/Filter/DCTDecode&apos;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Result:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;1 0 obj &amp;lt;&amp;lt;/BitsPerComponent 8/ColorSpace/DeviceRGB/Filter/DCTDecode/Height 516/Length 76403/Subtype/Image/Type/XObject/Width 1226&amp;gt;&amp;gt; stream
18 0 obj &amp;lt;&amp;lt;/BitsPerComponent 8/ColorSpace/DeviceRGB/Filter/DCTDecode/Height 676/Length 149186/Subtype/Image/Type/XObject/Width 1014&amp;gt;&amp;gt; stream
19 0 obj &amp;lt;&amp;lt;/BitsPerComponent 8/ColorSpace/DeviceRGB/Filter/DCTDecode/Height 676/Length 142232/Subtype/Image/Type/XObject/Width 1014&amp;gt;&amp;gt; stream
24 0 obj &amp;lt;&amp;lt;/BitsPerComponent 8/ColorSpace/DeviceRGB/Filter/DCTDecode/Height 676/Length 192073/Subtype/Image/Type/XObject/Width 1014&amp;gt;&amp;gt; stream
25 0 obj &amp;lt;&amp;lt;/BitsPerComponent 8/ColorSpace/DeviceRGB/Filter/DCTDecode/Height 676/Length 141081/Subtype/Image/Type/XObject/Width 1014&amp;gt;&amp;gt; stream
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
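
&lt;p&gt;The object numbers in this listing can be fed back into the stream extraction command shown earlier. As a rough sketch, the filter below prints only the leading object number of each line; the function name is hypothetical, and it assumes every line of interest starts with the object number, as in the result above.&lt;/p&gt;

```shell
# Read lines such as "18 0 obj ..." on stdin and print the object numbers.
# Lines whose first field is not a plain number are skipped.
stream_object_numbers() {
    awk '$1 ~ /^[0-9]+$/ { print $1 }'
}
```

&lt;p&gt;For example, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;mutool show whatever.pdf grep | grep &apos;/Filter/DCTDecode&apos; | stream_object_numbers&lt;/code&gt; gives a list of numbers that can each be passed to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;mutool show -be -o&lt;/code&gt; in a loop.&lt;/p&gt;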

&lt;h2 id=&quot;incremental-updates-and-document-versions&quot;&gt;Incremental updates and document versions&lt;/h2&gt;

&lt;p&gt;The &lt;a href=&quot;https://github.com/enferex/pdfresurrect&quot;&gt;&lt;em&gt;pdfresurrect&lt;/em&gt;&lt;/a&gt; tool&lt;sup id=&quot;fnref:10&quot;&gt;&lt;a href=&quot;#fn:10&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;10&lt;/a&gt;&lt;/sup&gt; allows you to inspect a PDF for incremental updates, and restore previous versions of the document.&lt;/p&gt;

&lt;h3 id=&quot;show-the-number-of-incremental-updatesversions&quot;&gt;Show the number of incremental updates/versions&lt;/h3&gt;

&lt;p&gt;To show the number of incremental updates, use:&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;pdfresurrect &lt;span class=&quot;nt&quot;&gt;-q&lt;/span&gt; whatever.pdf
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Result:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;whatever.pdf: 8
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;So, in this case the PDF contains 8 versions, which means that after its initial creation it was updated 7 times.&lt;/p&gt;
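
&lt;p&gt;When checking many files, the same arithmetic can be scripted. The sketch below assumes the result line always has the form shown above (a name, a colon and a space, then the version count); the function name is made up for illustration.&lt;/p&gt;

```shell
# Turn a "pdfresurrect -q" result line ("NAME: COUNT") into the number of
# incremental updates, i.e. the version count minus the initial version.
count_incremental_updates() {
    # Strip everything up to and including the last ": ".
    versions="${1##*: }"
    case "$versions" in
        ''|*[!0-9]*) echo "cannot parse: $1"; return 1 ;;
    esac
    echo $((versions - 1))
}
```

&lt;p&gt;With the result above, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;count_incremental_updates &quot;whatever.pdf: 8&quot;&lt;/code&gt; prints 7.&lt;/p&gt;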

&lt;h3 id=&quot;restore-all-versions&quot;&gt;Restore all versions&lt;/h3&gt;

&lt;p&gt;To restore all versions of a PDF, use:&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;pdfresurrect &lt;span class=&quot;nt&quot;&gt;-w&lt;/span&gt; whatever.pdf
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This creates a directory “whatever-versions” with all versions of the file, as well as a text file with a summary of the changes between the versions:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;./whatever-versions/
├── whatever-version-1.pdf
├── whatever-version-2.pdf
├── whatever-version-3.pdf
├── whatever-version-4.pdf
├── whatever-version-5.pdf
├── whatever-version-6.pdf
├── whatever-version-7.pdf
├── whatever-version-8.pdf
└── whatever-versions.summary
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h3 id=&quot;remove-information-from-previous-versions&quot;&gt;Remove information from previous versions&lt;/h3&gt;

&lt;p&gt;&lt;em&gt;Pdfresurrect&lt;/em&gt; also allows you to remove (“scrub”) the information from previous versions altogether, using the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;-s&lt;/code&gt; switch. However, the &lt;a href=&quot;https://github.com/enferex/pdfresurrect&quot;&gt;documentation&lt;/a&gt; warns that this functionality is experimental, and that it “should not be trusted for any serious security uses”. It also mentions that “currently this feature will likely not render a working pdf”. I did a quick test on a PDF with 8 document versions. For the resulting “scrubbed” PDF, running &lt;em&gt;pdfresurrect&lt;/em&gt; with the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;-q&lt;/code&gt; switch still resulted in 8 reported versions, and running it with the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;-w&lt;/code&gt; switch resulted in 8 restored versions (which oddly were not identical to the versions restored from the original file). In its current state I wouldn’t recommend using this feature.&lt;/p&gt;

&lt;h2 id=&quot;final-remarks&quot;&gt;Final remarks&lt;/h2&gt;

&lt;p&gt;I intend to make this post a “living” document, and will add more PDF “recipes” over time. Feel free to leave a comment in case you spot any errors or omissions!&lt;/p&gt;

&lt;h2 id=&quot;update-on-hacker-news-topic&quot;&gt;Update on Hacker News topic&lt;/h2&gt;

&lt;p&gt;Someone created a &lt;a href=&quot;https://news.ycombinator.com/item?id=33145498&quot;&gt;Hacker News topic on this post&lt;/a&gt;. The comments mention some additional tool suggestions that look useful. I might add some of these to a future revision.&lt;/p&gt;

&lt;h2 id=&quot;further-resources&quot;&gt;Further resources&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;https://doi.org/10.46430/phen0088&quot;&gt;Moritz Mähr, “Working with batches of PDF files”, The Programming Historian 9 (2020)&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://coptr.digipres.org/index.php/PDF&quot;&gt;PDF tools in Community Owned Digital Preservation Tool Registry (COPTR)&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;/2017/06/01/policy-based-assessment-with-verapdf-a-first-impression&quot;&gt;Policy-based assessment with VeraPDF - a first impression&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://filingdb.com/b/pdf-text-extraction&quot;&gt;What’s so hard about PDF text extraction?&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://irsg.bcs.org/informer/wp-content/uploads/OverviewOfTextExtractionFromPDFs.pdf&quot;&gt;Tim Allison, “Brief Overview of the Portable Document Format (PDF) and Some Challenges for Text Extraction”&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://openpreservation.org/blogs/pdf-validation-with-exiftool-quick-and-not-so-dirty/&quot;&gt;Yvonne Tunnat, “PDF Validation with ExifTool – quick and not so dirty”&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://openpreservation.org/blogs/trouble-shooting-pdf-validation-errors-a-case-of-pdf-hul-38/&quot;&gt;Micky Lindlar, “Trouble-shooting PDF validation errors – a case of PDF-HUL-38”&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://news.ycombinator.com/item?id=33145498&quot;&gt;Hacker News topic on this post&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;revision-history&quot;&gt;Revision history&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;7 September 2021: added sections on metadata extraction and Tika batch processing, following suggestions by Tim Allison.&lt;/li&gt;
  &lt;li&gt;8 September 2021: added section on inspecting low-level PDF structure with iText RUPS, as suggested by Mark Stephens; added sections on PDFtk as suggested by Tyler Thorsted; corrected errors in &lt;em&gt;pdftocairo&lt;/em&gt; and &lt;em&gt;gs&lt;/em&gt; examples.&lt;/li&gt;
  &lt;li&gt;9 September 2021: added section on image to PDF conversion.&lt;/li&gt;
  &lt;li&gt;27 January 2022: added reference to Tim Allison’s article on PDF text extraction.&lt;/li&gt;
  &lt;li&gt;7 February 2022: added sections on ExifTool, and added reference to Yvonne Tunnat’s blog post on PDF validation with ExifTool.&lt;/li&gt;
  &lt;li&gt;10 October 2022: added update on and link to Hacker News topic on this post.&lt;/li&gt;
  &lt;li&gt;28 November 2022: added reference to Micky Lindlar’s blog post on trouble-shooting PDF validation errors.&lt;/li&gt;
  &lt;li&gt;16 February 2023: added section on reducing PDF file size with ImageMagick’s &lt;em&gt;convert&lt;/em&gt; tool.&lt;/li&gt;
  &lt;li&gt;26 September 2024: corrected mutool stream content extraction example.&lt;/li&gt;
  &lt;li&gt;18 December 2024: added section on Arlington PDF Model Checker.&lt;/li&gt;
  &lt;li&gt;25 January 2025: added section on incremental updates and document versions.&lt;/li&gt;
&lt;/ul&gt;

&lt;div class=&quot;footnotes&quot; role=&quot;doc-endnotes&quot;&gt;
  &lt;ol&gt;
    &lt;li id=&quot;fn:7&quot;&gt;
      &lt;p&gt;The Debian package of the “original” PDFtk software was &lt;a href=&quot;https://www.joho.se/2020/10/01/pdftk-and-php-pdftk-on-ubuntu-18-04-without-using-snap/&quot;&gt;removed from the Ubuntu repositories&lt;/a&gt; around 2018 due to “dependency issues”. &lt;a href=&quot;#fnref:7&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:8&quot;&gt;
      &lt;p&gt;See the “Limitations” section in &lt;a href=&quot;https://github.com/pdf-association/arlington-pdf-model&quot;&gt;the Arlington Model readme&lt;/a&gt;. Personally, I’d be really excited to see some future software tool that combines the Arlington Model logic with additional checks for aspects that are not covered by it, such as file structure and cross-reference tables. This could really be the ultimate PDF validator, that would make several other tools in this section obsolete. &lt;a href=&quot;#fnref:8&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:9&quot;&gt;
      &lt;p&gt;Because of this near-complete coverage of PDF objects, it’s also likely to report more problems for any given file than other tools. Since PDF readers are generally quite forgiving of common deviations from the specification, many of these problems won’t affect rendering. There’s not much to do about this, as a validator by definition needs to be strict. &lt;a href=&quot;#fnref:9&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:1&quot;&gt;
      &lt;p&gt;Command line: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pdfinfo whatever.pdf&lt;/code&gt; &lt;a href=&quot;#fnref:1&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:2&quot;&gt;
      &lt;p&gt;In this example output is redirected to a file; this is generally a good idea because of the amount of XML output generated by VeraPDF. &lt;a href=&quot;#fnref:2&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:3&quot;&gt;
      &lt;p&gt;The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;--off&lt;/code&gt; switch disables PDF/A validation. Output is redirected to a file (recommended because, depending on the configuration used, VeraPDF can generate a &lt;em&gt;lot&lt;/em&gt; of output). &lt;a href=&quot;#fnref:3&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:5&quot;&gt;
      &lt;p&gt;This is because a new Java VM is started for each processed PDF, which will result in poor performance. &lt;a href=&quot;#fnref:5&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:6&quot;&gt;
      &lt;p&gt;Of course this also works for metadata extraction, and both text and metadata extraction can be combined in one single command. As an example, the following command will extract both text and metadata, including any embedded documents: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;java -jar ~/tika/tika-app-2.1.0.jar -J --text -i ./myPDFs/ -o ./tika-out/&lt;/code&gt; &lt;a href=&quot;#fnref:6&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:4&quot;&gt;
      &lt;p&gt;On Debian-based systems you can install it using &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;sudo apt install comparepdf&lt;/code&gt;. &lt;a href=&quot;#fnref:4&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:10&quot;&gt;
      &lt;p&gt;On Debian-based systems you can install it using &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;sudo apt install pdfresurrect&lt;/code&gt;. &lt;a href=&quot;#fnref:10&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
  &lt;/ol&gt;
&lt;/div&gt;
</description>
                <link>https://bitsgalore.org/2021/09/06/pdf-processing-and-analysis-with-open-source-tools</link>
                <guid>https://bitsgalore.org/2021/09/06/pdf-processing-and-analysis-with-open-source-tools</guid>
                <pubDate>2021-09-06T00:00:00+02:00</pubDate>
        </item>

        <item>
                <title>Towards a preservation workflow for mobile apps</title>
                <description>&lt;figure class=&quot;image&quot;&gt;
  &lt;img src=&quot;/images/2021/02/stewardess-phone.jpg&quot; alt=&quot;Production photo from &apos;2001: A Space Odyssey&apos;&quot; /&gt;
  &lt;figcaption&gt;Production photo from &quot;2001: A Space Odyssey&quot;. ©Stanley Kubrick Archives/TASCHEN.&lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;My &lt;a href=&quot;/2021/02/09/four-android-emulators-two-apps&quot;&gt;previous post&lt;/a&gt; addressed the emulation of mobile Android apps. In this follow-up, I’ll explore some other aspects of mobile app preservation, with a focus on acquisition and ingest processes. The &lt;a href=&quot;https://zenodo.org/record/3460450&quot;&gt;2019 iPres paper on the Acquisition and Preservation of Mobile eBook Apps&lt;/a&gt; by Maureen Pennock, Peter May and Michael Day was again the departure point. In its concluding section, the authors recommend:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;In terms of target formats for acquisition, we reach the undeniable conclusion that acquisition of the app in its packaged form (either an IPA file or an APK file) is optimal for ensuring organisations at least acquire a complete published object for preservation.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;And:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;[T]his form should at least also include sufficient metadata about inherent technical dependencies to understand what is needed to meet them.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In practical terms, this means that the workflows that are used for acquisition and (pre-)ingest must include components that are able to deal with the following aspects:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Acquisition of the app packages (either by direct deposit from the publisher, or using the app store).&lt;/li&gt;
  &lt;li&gt;Identification of the package format (APK for Android, IPA for iOS).&lt;/li&gt;
  &lt;li&gt;Identification of metadata about the app’s technical dependencies.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The main objective of this post is to get an idea of what would be needed to implement these components. Is it possible to do all of this with existing tools? If not, what are the gaps? The underlying assumption here is an emulation-based preservation strategy&lt;sup id=&quot;fnref:14&quot;&gt;&lt;a href=&quot;#fn:14&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;1&lt;/a&gt;&lt;/sup&gt;.&lt;/p&gt;

&lt;!-- more --&gt;

&lt;h2 id=&quot;outline-of-this-post&quot;&gt;Outline of this post&lt;/h2&gt;

&lt;p&gt;As for the acquisition component, Pennock, May and Day recommend direct publisher deposit, as this may avoid some potential problems related to digital rights management and dependencies on content that is hosted remotely. Since we’ve only started exploring mobile app preservation at this stage, it’s too early to make any assumptions about what acquisition route would work best in our case. Because of this, I started out by investigating to what extent it is possible to download APK and IPA packages from the Google Play Store and the Apple App Store, respectively.&lt;/p&gt;

&lt;p&gt;I then tried to do automatic format identification on some sample APK and IPA packages, using recent versions of &lt;a href=&quot;https://www.itforarchivists.com/siegfried/&quot;&gt;Siegfried&lt;/a&gt;, &lt;a href=&quot;http://darwinsys.com/file/&quot;&gt;Unix File&lt;/a&gt; and &lt;a href=&quot;http://tika.apache.org/&quot;&gt;Apache Tika&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Next, I looked at how the APK and IPA formats store metadata about an app’s technical dependencies, and how this information can be extracted.&lt;/p&gt;

&lt;p&gt;The following sections discuss these components for both the Android and iOS platforms. For convenience I included a summary of the main findings at the end of this post, followed by some observations on the value of additional documentation such as video recordings that show mobile apps in action.&lt;/p&gt;

&lt;h2 id=&quot;downloading-android-packages&quot;&gt;Downloading Android packages&lt;/h2&gt;

&lt;p&gt;Android apps are distributed as &lt;a href=&quot;https://en.wikipedia.org/wiki/Android_application_package&quot;&gt;Android Package (APK)&lt;/a&gt; installer files through the &lt;a href=&quot;https://play.google.com/store/apps?hl=en&quot;&gt;Google Play Store&lt;/a&gt;. However, the Play Store doesn’t allow you to download APK files on a non-Android device, which is a problem within a preservation workflow. Various third-party websites exist that offer the possibility to download APK installers, but it is often difficult to establish their trustworthiness. Some of these sites also re-package the original app data, which introduces various concerns related to security and authenticity. Because of this, I would strongly advise against using any of these services within a preservation workflow.&lt;/p&gt;

&lt;h3 id=&quot;virtual-machine-method&quot;&gt;Virtual machine method&lt;/h3&gt;

&lt;p&gt;A better (but somewhat cumbersome) solution would be to set up a virtual machine or emulator with Android&lt;sup id=&quot;fnref:1&quot;&gt;&lt;a href=&quot;#fn:1&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;2&lt;/a&gt;&lt;/sup&gt;, and use that to install the app directly from the Play Store. Once installed, it is possible to transfer the APK installer file from the virtual machine to the host machine using the &lt;a href=&quot;https://developer.android.com/studio/command-line/adb&quot;&gt;Android Debug Bridge (adb) tool&lt;/a&gt;. This involves the following steps:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Get a list of all installed packages on the virtual machine (redirecting output to a text file):
    &lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;adb shell pm list packages &lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; packages.txt
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;    &lt;/div&gt;
  &lt;/li&gt;
  &lt;li&gt;Look up the package id of the installed app in this file. Taking the &lt;a href=&quot;https://play.google.com/store/apps/details?id=com.Triplee.TripleeSocial&quot;&gt;ARize app&lt;/a&gt; from my previous post as an example, the identifier is &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;com.Triplee.TripleeSocial&lt;/code&gt;. We can then use the following command to find the full file path of the package on the VM:
    &lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;adb shell pm path com.Triplee.TripleeSocial
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;    &lt;/div&gt;
    &lt;p&gt;Result:&lt;/p&gt;
    &lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;package:/data/app/com.Triplee.TripleeSocial-r8iVFUp1MOSAc6LmHA1MDQ==/base.apk
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;    &lt;/div&gt;
  &lt;/li&gt;
  &lt;li&gt;Use the above file path to download the package to the host machine:
    &lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;adb pull /data/app/com.Triplee.TripleeSocial-r8iVFUp1MOSAc6LmHA1MDQ&lt;span class=&quot;o&quot;&gt;==&lt;/span&gt;/base.apk
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;    &lt;/div&gt;
  &lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;In this case, this results in a file “base.apk”.&lt;/p&gt;
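
&lt;p&gt;The three steps above could be scripted along the following lines. This is only a minimal sketch: the function names are made up for illustration, and it assumes that &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;adb&lt;/code&gt; is on the path and connected to the virtual machine. It also allows for the possibility that the package manager reports more than one file for a package.&lt;/p&gt;

```python
import subprocess


def apk_paths_from_pm_output(pm_output):
    """Return the file paths listed in the output of 'adb shell pm path'."""
    prefix = "package:"
    paths = []
    for line in pm_output.splitlines():
        line = line.strip()
        if line.startswith(prefix):
            # Drop the 'package:' prefix, keep the path on the device
            paths.append(line[len(prefix):])
    return paths


def pull_package(package_id):
    """Look up a package on the connected device and pull its file(s).

    Hypothetical helper; it runs the same adb commands as the manual steps.
    """
    result = subprocess.run(
        ["adb", "shell", "pm", "path", package_id],
        capture_output=True, text=True, check=True,
    )
    paths = apk_paths_from_pm_output(result.stdout)
    for path in paths:
        # Equivalent to 'adb pull'; the file ends up in the working directory
        subprocess.run(["adb", "pull", path], check=True)
    return paths
```

&lt;p&gt;Calling &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pull_package(&quot;com.Triplee.TripleeSocial&quot;)&lt;/code&gt; would then correspond to steps 2 and 3 for the ARize app.&lt;/p&gt;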

&lt;h3 id=&quot;gplaycli-method&quot;&gt;Gplaycli method&lt;/h3&gt;

&lt;p&gt;As the above method is a bit clumsy, I started looking for tools that allow downloading packages from the Play Store directly. Several such open-source tools exist, but many of these are abandoned projects that no longer work. After trying out a few of them, I ultimately had success with &lt;a href=&quot;https://github.com/matlink/gplaycli&quot;&gt;gplaycli&lt;/a&gt;, which is “a command line tool to search, install, update Android applications from the Google Play Store”. It allows you to download an APK, using the App ID as an identifier. Taking the ARize app as an example again, we can download the APK with the following command&lt;sup id=&quot;fnref:2&quot;&gt;&lt;a href=&quot;#fn:2&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;3&lt;/a&gt;&lt;/sup&gt;:&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;gplaycli &lt;span class=&quot;nt&quot;&gt;-d&lt;/span&gt; com.Triplee.TripleeSocial
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This resulted in a file “com.Triplee.TripleeSocial.apk”. I verified the file by doing a bitwise comparison against the APK obtained from the “virtual machine method” described in the previous section&lt;sup id=&quot;fnref:3&quot;&gt;&lt;a href=&quot;#fn:3&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;4&lt;/a&gt;&lt;/sup&gt;. This confirmed both files were identical. It’s worth mentioning that gplaycli is also &lt;a href=&quot;https://github.com/matlink/gplaycli/issues/8&quot;&gt;reported to work for downloading paid apps&lt;/a&gt; (provided the proper login credentials are used), but I haven’t tested this.&lt;/p&gt;
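
&lt;p&gt;Within a scripted workflow, the same kind of check could be done from Python. The sketch below is an illustration only (the function names are hypothetical): it compares two files by content using the standard library, and also computes a SHA-256 checksum that could be recorded alongside the package.&lt;/p&gt;

```python
import filecmp
import hashlib


def files_identical(path_a, path_b):
    """Return True if both files have exactly the same content."""
    # shallow=False forces a comparison of the file contents,
    # instead of relying on size and modification time only
    return filecmp.cmp(path_a, path_b, shallow=False)


def sha256_of(path, chunk_size=65536):
    """Return the SHA-256 checksum of a file as a hexadecimal string."""
    digest = hashlib.sha256()
    with open(path, "rb") as stream:
        # Read in chunks so that large packages do not have to fit in memory
        for chunk in iter(lambda: stream.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

&lt;p&gt;For the two files mentioned above, the call would be &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;files_identical(&quot;base.apk&quot;, &quot;com.Triplee.TripleeSocial.apk&quot;)&lt;/code&gt;.&lt;/p&gt;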

&lt;h2 id=&quot;android-package-identification&quot;&gt;Android package identification&lt;/h2&gt;

&lt;p&gt;As most archival ingest workflows include a format identification component, I tried to identify the ARize and Immer Android packages from my previous post with 3 widely used format identification tools. The table below shows the results:&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th style=&quot;text-align: left&quot;&gt;Tool&lt;/th&gt;
      &lt;th style=&quot;text-align: left&quot;&gt;Version&lt;/th&gt;
      &lt;th style=&quot;text-align: left&quot;&gt;ID&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;&lt;a href=&quot;https://www.itforarchivists.com/siegfried/&quot;&gt;Siegfried&lt;/a&gt;&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;1.9.1; DROID Signature File V97&lt;sup id=&quot;fnref:7&quot;&gt;&lt;a href=&quot;#fn:7&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;5&lt;/a&gt;&lt;/sup&gt;&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;x-fmt/412 (Java Archive Format)&lt;br /&gt;x-fmt/263 (ZIP Format)&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;&lt;a href=&quot;http://darwinsys.com/file/&quot;&gt;Unix File&lt;/a&gt;&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;5.32&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;application/zip&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;&lt;a href=&quot;http://tika.apache.org/&quot;&gt;Apache Tika&lt;/a&gt;&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;1.23&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;application/vnd.android.package-archive&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;Apache Tika was the only tool that identified both files as Android packages. Siegfried (which uses the &lt;a href=&quot;https://www.nationalarchives.gov.uk/PRONOM/Default.aspx&quot;&gt;PRONOM&lt;/a&gt; format signatures) identified one file as a regular ZIP file, and the other one as a &lt;a href=&quot;https://en.wikipedia.org/wiki/JAR_(file_format)&quot;&gt;Java Archive&lt;/a&gt;. Since the Android package format is based on the Java Archive format (which is in turn built on the ZIP format), this result is not necessarily wrong, but it lacks specificity. At the time of writing, the &lt;a href=&quot;https://www.nationalarchives.gov.uk/PRONOM/Default.aspx&quot;&gt;PRONOM&lt;/a&gt; technical registry does not have an entry for the Android package format&lt;sup id=&quot;fnref:6&quot;&gt;&lt;a href=&quot;#fn:6&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;6&lt;/a&gt;&lt;/sup&gt;, so this result is not surprising. A look at &lt;a href=&quot;https://github.com/apache/tika/blob/618345263ee41108e1a225dbcdbb8db16b2aae28/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml#L316&quot;&gt;Tika’s MIME type definition file&lt;/a&gt; reveals that Tika only uses the file extension to differentiate between Android packages and Java archives:&lt;/p&gt;

&lt;div class=&quot;language-xml highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nt&quot;&gt;&amp;lt;mime-type&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;type=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;application/vnd.android.package-archive&quot;&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;&amp;gt;&lt;/span&gt;
  &lt;span class=&quot;nt&quot;&gt;&amp;lt;sub-class-of&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;type=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;application/java-archive&quot;&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;/&amp;gt;&lt;/span&gt;
  &lt;span class=&quot;nt&quot;&gt;&amp;lt;glob&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;pattern=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;*.apk&quot;&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;/&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;/mime-type&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
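
&lt;p&gt;A check that does not depend on the file extension would have to look inside the container. As a rough illustration (this is not how any of the tested tools work, and the function name is hypothetical), the sketch below treats a file as a likely Android package if it is a ZIP container with an &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;AndroidManifest.xml&lt;/code&gt; entry at its root. This is a heuristic only, and not a substitute for a proper format signature.&lt;/p&gt;

```python
import zipfile


def looks_like_apk(path):
    """Heuristic: a ZIP container with AndroidManifest.xml at its root."""
    if not zipfile.is_zipfile(path):
        return False
    with zipfile.ZipFile(path) as archive:
        # Entry names are relative to the root of the container
        return "AndroidManifest.xml" in archive.namelist()
```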

&lt;h2 id=&quot;android-package-metadata&quot;&gt;Android package metadata&lt;/h2&gt;

&lt;p&gt;To ensure long-term access, it is vital that an archived app installer is accompanied by &lt;a href=&quot;https://en.wikipedia.org/wiki/Preservation_metadata&quot;&gt;preservation metadata&lt;/a&gt; about the technical environment that is needed to render it. At the very minimum this would include details about the required Android version(s), hardware features, and shared software libraries. This information (and much more) is stored in an Android Package’s &lt;a href=&quot;https://developer.android.com/guide/topics/manifest/manifest-intro&quot;&gt;App Manifest&lt;/a&gt;. The App Manifest is stored in a &lt;a href=&quot;https://en.wikipedia.org/wiki/Binary_XML&quot;&gt;binary XML&lt;/a&gt; format for which &lt;a href=&quot;https://reverseengineering.stackexchange.com/questions/21806/where-is-android-binary-xml-format-documented&quot;&gt;no publicly available documentation exists&lt;/a&gt;. This makes reading it somewhat challenging, although &lt;a href=&quot;https://stackoverflow.com/q/4191762/1209004&quot;&gt;various software solutions for decoding the App Manifest exist&lt;/a&gt;.&lt;/p&gt;

&lt;h2 id=&quot;extraction-of-app-manifest&quot;&gt;Extraction of App Manifest&lt;/h2&gt;

&lt;p&gt;&lt;a href=&quot;https://developer.android.com/studio/command-line/apkanalyzer.html&quot;&gt;Apkanalyzer&lt;/a&gt;, which is part of &lt;a href=&quot;https://developer.android.com/studio/&quot;&gt;Android Studio&lt;/a&gt;, is the tool that is officially supported by Google. However, running apkanalyzer only resulted in a sequence of Java exceptions for me&lt;sup id=&quot;fnref:4&quot;&gt;&lt;a href=&quot;#fn:4&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;7&lt;/a&gt;&lt;/sup&gt;. Besides, it’s not entirely clear if the &lt;a href=&quot;https://developer.android.com/studio/terms.html&quot;&gt;terms and conditions&lt;/a&gt; of Android Studio permit its use in an archival workflow&lt;sup id=&quot;fnref:5&quot;&gt;&lt;a href=&quot;#fn:5&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;8&lt;/a&gt;&lt;/sup&gt;. I eventually found &lt;a href=&quot;https://github.com/androguard/androguard&quot;&gt;Androguard&lt;/a&gt;, which is a Python-based tool that is primarily aimed at reverse-engineering Android apps. Using the command below, it will extract and decode an APK’s app manifest, resulting in a human-readable XML file:&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;androguard axml com.Triplee.TripleeSocial.apk &lt;span class=&quot;nt&quot;&gt;-o&lt;/span&gt; arize-android.xml
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h2 id=&quot;interesting-app-manifest-elements&quot;&gt;Interesting App Manifest elements&lt;/h2&gt;

&lt;p&gt;The decoded app manifest from the above example can be found in full &lt;a href=&quot;https://github.com/KBNLresearch/mobile-apps/blob/main/sample-files/arize-androidManifest.xml&quot;&gt;here&lt;/a&gt;. A detailed discussion of the app manifest is beyond the scope of this post, but it’s worth highlighting a few elements that are particularly interesting:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;The &lt;a href=&quot;https://developer.android.com/guide/topics/manifest/uses-sdk-element&quot;&gt;uses-sdk&lt;/a&gt; element contains information about the app’s compatibility with one or more Android versions:
    &lt;div class=&quot;language-xml highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nt&quot;&gt;&amp;lt;uses-sdk&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;android:minSdkVersion=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;24&quot;&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;android:targetSdkVersion=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;29&quot;&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;/&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;    &lt;/div&gt;
    &lt;p&gt;Here, the (confusingly named) &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;minSdkVersion&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;targetSdkVersion&lt;/code&gt; attributes define the minimum and target API levels of the app, respectively. In &lt;a href=&quot;https://developer.android.com/guide/topics/manifest/uses-sdk-element#ApiLevels&quot;&gt;the table here&lt;/a&gt; we see that API level 24 (the minimum level) corresponds to Android 7.0, and level 29 (the target level) to Android 10.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;The &lt;a href=&quot;https://developer.android.com/guide/topics/manifest/uses-feature-element&quot;&gt;uses-feature&lt;/a&gt; element is used to declare hardware or software features that are used by the app:
    &lt;div class=&quot;language-xml highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nt&quot;&gt;&amp;lt;uses-feature&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;android:name=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;android.hardware.camera&quot;&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;android:required=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;true&quot;&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;/&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;    &lt;/div&gt;
    &lt;p&gt;In the above example, it informs us that the app needs a camera.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;The &lt;a href=&quot;https://developer.android.com/guide/topics/manifest/uses-library-element&quot;&gt;uses-library&lt;/a&gt; element tells us about any shared libraries that the app depends on.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The above information largely defines the (emulated) technical environment that is required to run the app. Even though I’ve only skimmed the surface of the App Manifest here, its importance as a source for deriving technical and preservation metadata about an Android app should be clear.&lt;/p&gt;
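
&lt;p&gt;To give an idea of how these elements could be turned into preservation metadata, here is a minimal sketch that reads a decoded manifest with Python’s standard library. The function names and the structure of the result are assumptions made for illustration only, and the lookup table only covers the two API levels mentioned above.&lt;/p&gt;

```python
import xml.etree.ElementTree as ET

# Namespace URI behind the 'android:' attribute prefix
ANDROID_NS = "http://schemas.android.com/apk/res/android"

# Only the two API levels discussed above; extend from Google's table as needed
API_LEVEL_TO_ANDROID = {"24": "7.0", "29": "10"}


def android_attr(element, name):
    # ElementTree stores namespaced attributes as '{namespace-uri}name'
    return element.get("{%s}%s" % (ANDROID_NS, name))


def summarise_manifest(source):
    """Summarise a decoded manifest, given as a file name or file object."""
    root = ET.parse(source).getroot()
    summary = {"min_sdk": None, "target_sdk": None, "features": [], "libraries": []}
    for uses_sdk in root.iter("uses-sdk"):
        summary["min_sdk"] = android_attr(uses_sdk, "minSdkVersion")
        summary["target_sdk"] = android_attr(uses_sdk, "targetSdkVersion")
    for feature in root.iter("uses-feature"):
        # Keep the 'required' flag as the literal string from the manifest
        summary["features"].append(
            (android_attr(feature, "name"), android_attr(feature, "required"))
        )
    for library in root.iter("uses-library"):
        summary["libraries"].append(android_attr(library, "name"))
    # Translate the minimum API level to an Android version where known
    summary["min_android"] = API_LEVEL_TO_ANDROID.get(summary["min_sdk"])
    return summary
```

&lt;p&gt;Applied to the file from the Androguard example, this would be called as &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;summarise_manifest(&quot;arize-android.xml&quot;)&lt;/code&gt;.&lt;/p&gt;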

&lt;h2 id=&quot;downloading-ios-packages&quot;&gt;Downloading iOS packages&lt;/h2&gt;

&lt;p&gt;Apple iOS apps are distributed through the &lt;a href=&quot;https://www.apple.com/app-store/&quot;&gt;Apple App Store&lt;/a&gt;. Installer packages are published in the &lt;a href=&quot;https://en.wikipedia.org/wiki/.ipa&quot;&gt;iOS App Store Package (IPA)&lt;/a&gt; format. A good overview of the format can be found &lt;a href=&quot;https://web.archive.org/web/20200714200020/https://blog.razb.me/pulling-apart-an-ios-app/&quot;&gt;here&lt;/a&gt;. Like Google’s Play Store, Apple doesn’t allow you to download the packages on anything but an Apple device. Unlike the Android situation, there don’t appear to be any tools that are able to get around this limitation, and this seriously limits the possibilities to incorporate downloading iOS packages as part of a preservation workflow. It might be possible to work around these limitations to some degree by installing the app on either a physical or virtual&lt;sup id=&quot;fnref:10&quot;&gt;&lt;a href=&quot;#fn:10&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;9&lt;/a&gt;&lt;/sup&gt; iOS device, and then transferring the app to another machine. The open-source &lt;a href=&quot;https://libimobiledevice.org/&quot;&gt;libimobiledevice&lt;/a&gt; library appears to be capable of file transfers between iOS and other platforms. However, according to various online sources iOS doesn’t actually keep the original IPA files after installation&lt;sup id=&quot;fnref:8&quot;&gt;&lt;a href=&quot;#fn:8&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;10&lt;/a&gt;&lt;/sup&gt;. I’m unable to confirm this, as I don’t currently have an iOS device available for further testing.&lt;/p&gt;

&lt;h2 id=&quot;ios-package-identification&quot;&gt;iOS package identification&lt;/h2&gt;

&lt;p&gt;As I was unable to obtain any IPA installers from the Apple App Store, I downloaded some random IPA files from an &lt;a href=&quot;https://en.wikipedia.org/wiki/IOS_jailbreaking&quot;&gt;iOS jailbreaking&lt;/a&gt; website&lt;sup id=&quot;fnref:9&quot;&gt;&lt;a href=&quot;#fn:9&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;11&lt;/a&gt;&lt;/sup&gt;, and ran them through Siegfried, Unix File and Apache Tika. The table below shows the results:&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th style=&quot;text-align: left&quot;&gt;Tool&lt;/th&gt;
      &lt;th style=&quot;text-align: left&quot;&gt;Version&lt;/th&gt;
      &lt;th style=&quot;text-align: left&quot;&gt;ID&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;&lt;a href=&quot;https://www.itforarchivists.com/siegfried/&quot;&gt;Siegfried&lt;/a&gt;&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;1.9.1; DROID Signature File V97&lt;sup id=&quot;fnref:7:1&quot;&gt;&lt;a href=&quot;#fn:7&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;5&lt;/a&gt;&lt;/sup&gt;&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;x-fmt/263 (ZIP Format)&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;&lt;a href=&quot;http://darwinsys.com/file/&quot;&gt;Unix File&lt;/a&gt;&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;5.32&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;application/zip&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;&lt;a href=&quot;http://tika.apache.org/&quot;&gt;Apache Tika&lt;/a&gt;&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;1.23&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;application/x-itunes-ipa&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;These results are similar to the situation for Android packages. Only Apache Tika was able to identify these files as IPA packages. Both Siegfried and File could only detect the container format. An inspection of PRONOM confirmed that it doesn’t include the IPA format yet. Also, Tika’s specific result for this format is again &lt;a href=&quot;https://github.com/apache/tika/blob/master/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml#L3819&quot;&gt;only based on a file extension pattern&lt;/a&gt;:&lt;/p&gt;

&lt;div class=&quot;language-xml highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nt&quot;&gt;&amp;lt;mime-type&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;type=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;application/x-itunes-ipa&quot;&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;&amp;gt;&lt;/span&gt;
  &lt;span class=&quot;nt&quot;&gt;&amp;lt;sub-class-of&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;type=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;application/zip&quot;&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;/&amp;gt;&lt;/span&gt;
  &lt;span class=&quot;nt&quot;&gt;&amp;lt;_comment&amp;gt;&lt;/span&gt;Apple iOS IPA AppStore file&lt;span class=&quot;nt&quot;&gt;&amp;lt;/_comment&amp;gt;&lt;/span&gt;
  &lt;span class=&quot;nt&quot;&gt;&amp;lt;glob&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;pattern=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;*.ipa&quot;&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;/&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;/mime-type&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h2 id=&quot;ios-package-metadata&quot;&gt;iOS package metadata&lt;/h2&gt;

&lt;p&gt;For iOS apps, the &lt;a href=&quot;https://developer.apple.com/documentation/bundleresources/information_property_list&quot;&gt;information property list file&lt;/a&gt; (Info.plist) in the root of the bundle directory&lt;sup id=&quot;fnref:11&quot;&gt;&lt;a href=&quot;#fn:11&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;12&lt;/a&gt;&lt;/sup&gt; contains various metadata, including information about the required technical environment. Confusingly, &lt;a href=&quot;https://en.wikipedia.org/wiki/Property_list&quot;&gt;Apple property lists&lt;/a&gt; can be implemented in both XML and binary formats. For both test files I analyzed, the format was XML, but I’m not entirely sure if this is always the case. The Apple developer’s site provides &lt;a href=&quot;https://developer.apple.com/library/archive/documentation/Cocoa/Conceptual/PropertyLists/UnderstandXMLPlist/UnderstandXMLPlist.html#//apple_ref/doc/uid/10000048i-CH6-SW1&quot;&gt;a brief explanation of the format&lt;/a&gt;, and I’ve uploaded &lt;a href=&quot;https://github.com/KBNLresearch/mobile-apps/blob/main/sample-files/Info.plist&quot;&gt;an example file here&lt;/a&gt;. Even though the format only defines simple key-value pairs, Apple’s implementation is unusual to say the least. Instead of using the XML hierarchy, values are defined by their position relative to the “key” elements. The following fragment illustrates this:&lt;/p&gt;

&lt;div class=&quot;language-xml highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nt&quot;&gt;&amp;lt;key&amp;gt;&lt;/span&gt;MinimumOSVersion&lt;span class=&quot;nt&quot;&gt;&amp;lt;/key&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;string&amp;gt;&lt;/span&gt;7.0&lt;span class=&quot;nt&quot;&gt;&amp;lt;/string&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;key&amp;gt;&lt;/span&gt;UIDeviceFamily&lt;span class=&quot;nt&quot;&gt;&amp;lt;/key&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;array&amp;gt;&lt;/span&gt;
  &lt;span class=&quot;nt&quot;&gt;&amp;lt;integer&amp;gt;&lt;/span&gt;1&lt;span class=&quot;nt&quot;&gt;&amp;lt;/integer&amp;gt;&lt;/span&gt;
  &lt;span class=&quot;nt&quot;&gt;&amp;lt;integer&amp;gt;&lt;/span&gt;2&lt;span class=&quot;nt&quot;&gt;&amp;lt;/integer&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;/array&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;In this example, the value of &lt;em&gt;MinimumOSVersion&lt;/em&gt; is defined by the &lt;em&gt;string&lt;/em&gt; element that directly follows the &lt;em&gt;key&lt;/em&gt; element; likewise, the value of &lt;em&gt;UIDeviceFamily&lt;/em&gt; is defined by the &lt;em&gt;array&lt;/em&gt; element. This unusual layout means that simple parsing of these files with an XML library is not enough to interpret them in a meaningful way!&lt;/p&gt;

&lt;h2 id=&quot;extraction-of-information-property-list&quot;&gt;Extraction of information property list&lt;/h2&gt;

&lt;p&gt;I was unable to find any tools that directly extract and process the information property list (similar to what &lt;a href=&quot;https://github.com/androguard/androguard&quot;&gt;Androguard&lt;/a&gt; does for Android packages)&lt;sup id=&quot;fnref:12&quot;&gt;&lt;a href=&quot;#fn:12&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;13&lt;/a&gt;&lt;/sup&gt;. However, Python has a built-in &lt;a href=&quot;https://docs.python.org/3/library/plistlib.html&quot;&gt;plistlib&lt;/a&gt; module that is able to read and write property lists in both binary and XML format. I did some tests with it, and at first sight it appears to work well: reading the XML property lists for each of my test apps resulted in a Python dictionary that accurately represented the key-value pairs&lt;sup id=&quot;fnref:13&quot;&gt;&lt;a href=&quot;#fn:13&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;14&lt;/a&gt;&lt;/sup&gt;. Using this module, it would be fairly straightforward to write a tool that extracts the property list items directly out of an IPA file, and transform them into a more manageable format.&lt;/p&gt;
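
&lt;p&gt;A first sketch of such a tool could look like the code below. It is an illustration only, not a finished implementation: the function names are hypothetical, and it assumes the information property list sits at &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Payload/[name].app/Info.plist&lt;/code&gt; inside the IPA container. Since plistlib detects the XML and binary variants by itself, the same code should cover both.&lt;/p&gt;

```python
import plistlib
import re
import zipfile

# Assumed layout: Payload/SomeName.app/Info.plist inside the IPA container
INFO_PLIST_PATTERN = re.compile(r"^Payload/[^/]+\.app/Info\.plist$")


def read_info_plist(ipa_path):
    """Return the information property list of an IPA file as a dictionary."""
    with zipfile.ZipFile(ipa_path) as archive:
        for name in archive.namelist():
            if INFO_PLIST_PATTERN.match(name):
                # plistlib works out by itself whether the data are XML or binary
                return plistlib.loads(archive.read(name))
    raise ValueError("no Info.plist found in " + str(ipa_path))


def technical_requirements(info):
    """Select the keys that describe the required technical environment."""
    keys = ("MinimumOSVersion", "UIDeviceFamily", "UIRequiredDeviceCapabilities")
    return {key: info[key] for key in keys if key in info}
```

&lt;p&gt;The resulting dictionary could then be written out as JSON, or mapped to whatever preservation metadata schema is in use.&lt;/p&gt;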

&lt;h2 id=&quot;interesting-property-list-elements&quot;&gt;Interesting property list elements&lt;/h2&gt;

&lt;p&gt;As with the Android App Manifest before, I won’t go into a detailed discussion of all the items inside the information property list, but the following ones caught my attention:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;The &lt;a href=&quot;https://developer.apple.com/library/archive/documentation/DeviceInformation/Reference/iOSDeviceCompatibility/DeviceCompatibilityMatrix/DeviceCompatibilityMatrix.html&quot;&gt;UIRequiredDeviceCapabilities&lt;/a&gt; key declares “the hardware or specific capabilities” that an app needs in order to run. This &lt;a href=&quot;https://developer.apple.com/library/archive/documentation/General/Reference/InfoPlistKeyReference/Articles/iPhoneOSKeys.html#//apple_ref/doc/uid/TP40009252-SW3&quot;&gt;includes&lt;/a&gt;, among other things, access to networking features, a camera or a microphone.&lt;/li&gt;
  &lt;li&gt;The &lt;a href=&quot;https://developer.apple.com/documentation/bundleresources/information_property_list/minimumosversion&quot;&gt;MinimumOSVersion&lt;/a&gt; key defines the minimum operating system version required for the app to run.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Both are directly relevant for emulation purposes.&lt;/p&gt;

&lt;h2 id=&quot;summary-of-test-results&quot;&gt;Summary of test results&lt;/h2&gt;

&lt;p&gt;As this is quite a lengthy post, here’s a brief summary of the main results of the above tests:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Downloading Android APK packages from the Google Play store is possible, but it does require unofficial third-party tools like &lt;a href=&quot;https://github.com/matlink/gplaycli&quot;&gt;gplaycli&lt;/a&gt;. However, such tools may stop functioning if Google applies changes to its Play Store API. This has already happened previously, leading to various unmaintained tools that no longer work.&lt;/li&gt;
  &lt;li&gt;Access to the Apple App Store appears to be completely restricted from non-Apple devices. Since installing an app on iOS reportedly gets rid of the IPA container, workarounds that use a native iOS device as an intermediate medium most likely won’t be usable for preservation workflows.&lt;/li&gt;
  &lt;li&gt;Out of the three file format identification tools tested, only Apache Tika was able to correctly identify both APK and IPA files. However, Tika’s specific results are solely based on file extension patterns. At the time of writing PRONOM doesn’t cover these formats at all, and any tools that use its database (Siegfried, but also DROID and FIDO) only identify them at the higher container levels (ZIP, JAR). This could be easily remedied by developing PRONOM signatures for both formats&lt;sup id=&quot;fnref:16&quot;&gt;&lt;a href=&quot;#fn:16&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;15&lt;/a&gt;&lt;/sup&gt;.&lt;/li&gt;
  &lt;li&gt;Both the APK and IPA formats contain package-level metadata about an app’s technical dependencies, such as the minimal OS version and required hardware.&lt;/li&gt;
  &lt;li&gt;For the APK format a software tool that extracts this information is readily available. For the IPA format I could not find a comparable tool, but one could be developed with a limited amount of effort.&lt;/li&gt;
  &lt;li&gt;Based on the cursory look presented here, these package-level metadata appear to be adequate for establishing the emulated environment needed to run an app, but they do not expose any dependencies on remotely hosted content&lt;sup id=&quot;fnref:15&quot;&gt;&lt;a href=&quot;#fn:15&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;16&lt;/a&gt;&lt;/sup&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;value-of-documentation&quot;&gt;Value of documentation&lt;/h2&gt;

&lt;p&gt;Pennock, May and Day propose the use of “alternative solutions such as recording or documentation” in case the end access solution (e.g. the emulator) does not provide a sufficiently ‘authentic’ experience. In addition to this, Trevor Owens makes an argument for storing screenshots and video recordings of software in his book “The theory and craft of digital preservation”&lt;sup id=&quot;fnref:18&quot;&gt;&lt;a href=&quot;#fn:18&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;17&lt;/a&gt;&lt;/sup&gt;:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;[M]oving to an approach to virtualize or emulate old systems on new hardware will inevitably be a complex process and having even a reference image of what it looked like at a particular moment in time would likely be valuable as a way to evaluate the extent to which it is being authentically rendered. Here a general principle emerges. Documentation (like the screenshot) can be useful as both a means to preserve significance and also as a means to create reference material to triangulate an object’s or work’s significance.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;For both the ARize and Immer apps from my previous blog post, YouTube channels exist with videos that demonstrate how these apps work&lt;sup id=&quot;fnref:19&quot;&gt;&lt;a href=&quot;#fn:19&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;18&lt;/a&gt;&lt;/sup&gt;. I don’t know how common this situation is for other apps, but such videos could serve as a reference for (future) emulation efforts, and it might be worthwhile to collect them as part of the acquisition process, and store them alongside the app packages.&lt;/p&gt;

&lt;h2 id=&quot;final-thoughts&quot;&gt;Final thoughts&lt;/h2&gt;

&lt;p&gt;This post is only a first (and at this stage incomplete) attempt at piecing together candidate components for acquisition and (pre-)ingest workflows for Android and iOS apps. Please feel free to use the comment section in case you have any corrections or additions.&lt;/p&gt;

&lt;h2 id=&quot;further-resources&quot;&gt;Further resources&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;https://zenodo.org/record/3460450&quot;&gt;Considerations on the Acquisition and Preservation of Mobile eBook Apps&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://github.com/matlink/gplaycli&quot;&gt;gplaycli&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://github.com/androguard/androguard&quot;&gt;Androguard&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://web.archive.org/web/20200714200020/https://blog.razb.me/pulling-apart-an-ios-app/&quot;&gt;Pulling apart an iOS App&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://developer.android.com/guide/topics/manifest/manifest-intro&quot;&gt;Android App Manifest documentation&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://developer.apple.com/library/archive/documentation/General/Reference/InfoPlistKeyReference/Introduction/Introduction.html&quot;&gt;Information Property List Key Reference&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;div class=&quot;footnotes&quot; role=&quot;doc-endnotes&quot;&gt;
  &lt;ol&gt;
    &lt;li id=&quot;fn:14&quot;&gt;
      &lt;p&gt;I’m well aware that the possibilities for emulating iOS-based devices are still very limited (for both technical and legal reasons), but that may be the subject of another post. &lt;a href=&quot;#fnref:14&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:1&quot;&gt;
      &lt;p&gt;See my &lt;a href=&quot;/2021/02/09/four-android-emulators-two-apps&quot;&gt;previous post on Android emulation options&lt;/a&gt;. &lt;a href=&quot;#fnref:1&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:2&quot;&gt;
      &lt;p&gt;Note that the current (3.29) version of the tool has a small bug that results in a warning, which is &lt;a href=&quot;https://github.com/matlink/gplaycli/issues/272&quot;&gt;documented here&lt;/a&gt;. Other than that it does work as expected. &lt;a href=&quot;#fnref:2&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:3&quot;&gt;
      &lt;p&gt;I used the &lt;a href=&quot;https://linux.die.net/man/1/cmp&quot;&gt;cmp tool&lt;/a&gt; for this, with the command &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;cmp base.apk com.Triplee.TripleeSocial.apk&lt;/code&gt;. &lt;a href=&quot;#fnref:3&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:7&quot;&gt;
      &lt;p&gt;DROID Container Signature File 20201001.xml &lt;a href=&quot;#fnref:7&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt; &lt;a href=&quot;#fnref:7:1&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;sup&gt;2&lt;/sup&gt;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:6&quot;&gt;
      &lt;p&gt;PRONOM version at the time of writing: DROID_SignatureFile_V97.xml, 1st October 2020. &lt;a href=&quot;#fnref:6&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:4&quot;&gt;
      &lt;p&gt;Tried with Android Studio 4.1.2, running under Linux Mint 19.3. &lt;a href=&quot;#fnref:4&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:5&quot;&gt;
      &lt;p&gt;See also the “Android Emulator for long-term access” section in my &lt;a href=&quot;/2021/02/09/four-android-emulators-two-apps&quot;&gt;previous blog post&lt;/a&gt;. &lt;a href=&quot;#fnref:5&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:10&quot;&gt;
      &lt;p&gt;For example using a service like &lt;a href=&quot;https://corellium.com/&quot;&gt;Corellium&lt;/a&gt;. &lt;a href=&quot;#fnref:10&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:8&quot;&gt;
      &lt;p&gt;See e.g. &lt;a href=&quot;https://stackoverflow.com/a/29743193/1209004&quot;&gt;here&lt;/a&gt;, &lt;a href=&quot;https://www.reddit.com/r/jailbreak/comments/4dhbtb/question_ipa_location_in_ios_9/&quot;&gt;here&lt;/a&gt; and &lt;a href=&quot;https://medium.com/@lucideus/extracting-the-ipa-file-and-local-data-storage-of-an-ios-application-be637745624d&quot;&gt;here&lt;/a&gt;. &lt;a href=&quot;#fnref:8&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:9&quot;&gt;
      &lt;p&gt;I used the &lt;a href=&quot;https://iosninja.io/ipa-library&quot;&gt;iosninja.io&lt;/a&gt; site. I have no idea about the site’s legal status or the safety of the downloads on offer, so proceed with caution! I only used the downloaded IPAs for some simple technical tests without installing them. &lt;a href=&quot;#fnref:9&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:11&quot;&gt;
      &lt;p&gt;This is typically the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Payload/Application.app&lt;/code&gt; folder (where “Application” is replaced with the app’s name). &lt;a href=&quot;#fnref:11&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:12&quot;&gt;
      &lt;p&gt;There is &lt;a href=&quot;https://github.com/matiassingers/ipa-metadata&quot;&gt;Ipa-metadata&lt;/a&gt;, which is a tool for extracting “metadata and provisioning info about an .ipa file”. Although I was able to install it, running it on any of my test files would just return a “Callback must be a function” error, and nothing else. &lt;a href=&quot;#fnref:12&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:13&quot;&gt;
      &lt;p&gt;The demo script that I wrote for my tests is &lt;a href=&quot;https://github.com/KBNLresearch/mobile-apps/blob/main/scripts/readplist.py&quot;&gt;available here&lt;/a&gt;. &lt;a href=&quot;#fnref:13&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:16&quot;&gt;
      &lt;p&gt;I’d be happy to have a go at this myself. &lt;a href=&quot;#fnref:16&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:15&quot;&gt;
      &lt;p&gt;This needs further confirmation from a more in-depth look at the available documentation. &lt;a href=&quot;#fnref:15&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:18&quot;&gt;
      &lt;p&gt;The theory and craft of digital preservation. Johns Hopkins University Press, 2018. Link: &lt;a href=&quot;http://www.trevorowens.org/theory-and-craft-of-digital-preservation/&quot;&gt;http://www.trevorowens.org/theory-and-craft-of-digital-preservation/&lt;/a&gt; &lt;a href=&quot;#fnref:18&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:19&quot;&gt;
      &lt;p&gt;ARize app YouTube channel: &lt;a href=&quot;https://www.youtube.com/channel/UCPEDDkRVC7jegjC02-7VsUw/videos&quot;&gt;https://www.youtube.com/channel/UCPEDDkRVC7jegjC02-7VsUw/videos&lt;/a&gt;; Immer app YouTube channel: &lt;a href=&quot;https://www.youtube.com/channel/UCnrnDrJ5MXccQJxaicEi7cA&quot;&gt;https://www.youtube.com/channel/UCnrnDrJ5MXccQJxaicEi7cA&lt;/a&gt;. &lt;a href=&quot;#fnref:19&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
  &lt;/ol&gt;
&lt;/div&gt;
</description>
                <link>https://bitsgalore.org/2021/02/24/towards-a-preservation-workflow-for-mobile-apps</link>
                <guid>https://bitsgalore.org/2021/02/24/towards-a-preservation-workflow-for-mobile-apps</guid>
                <pubDate>2021-02-24T00:00:00+01:00</pubDate>
        </item>

        <item>
                <title>Four Android emulators, two apps</title>
                <description>&lt;figure class=&quot;image&quot;&gt;
  &lt;img src=&quot;/images/2021/02/android-header.png&quot; alt=&quot;Header image&quot; /&gt;
  &lt;figcaption&gt;&lt;a href=&quot;https://commons.wikimedia.org/wiki/File:Android_robot.svg&quot;&gt;&quot;Android Robot&quot;&lt;/a&gt; by Google Inc., used under &lt;a href=&quot;https://creativecommons.org/licenses/by/3.0&quot;&gt;CC BY 3.0&lt;/a&gt;, via Wikimedia Commons.&lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;So far the KB hasn’t actively pursued the preservation of mobile apps. However, born-digital publications in app-only form have become increasingly common, as well as “hybrid” publications, with apps that are supplemental to traditional (paper) books. At the request of our Digital Preservation department, I’ve started some exploratory investigations into how to preserve mobile apps in the near future. The &lt;a href=&quot;https://zenodo.org/record/3460450&quot;&gt;2019 iPres paper on the Acquisition and Preservation of Mobile eBook Apps&lt;/a&gt; by the British Library’s Maureen Pennock, Peter May and Michael Day provides an excellent starting point on the subject, and it highlights many of the challenges involved.&lt;/p&gt;

&lt;p&gt;Before we can start archiving mobile apps ourselves, some additional aspects need to be addressed in more detail. One of these is the question of how to ensure long-term access. &lt;a href=&quot;https://en.wikipedia.org/wiki/Emulator&quot;&gt;Emulation&lt;/a&gt; is the obvious strategy here, but I couldn’t find much information on the emulation of mobile platforms within a digital preservation context. In this blog post I present the results of some simple experiments, where I tried to emulate two selected apps. The main objective here was to explore the current state of emulation of mobile devices, and to get an initial impression of the suitability of some existing emulation solutions for long-term access.&lt;/p&gt;

&lt;p&gt;For practical reasons I’ve limited myself to the Android platform&lt;sup id=&quot;fnref:1&quot;&gt;&lt;a href=&quot;#fn:1&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;1&lt;/a&gt;&lt;/sup&gt;. Attentive readers may recall I &lt;a href=&quot;/2014/10/23/running-archived-android-apps-pc-first-impressions&quot;&gt;briefly touched on this subject back in 2014&lt;/a&gt;. As much of the information in that blog post has now become outdated, this new post presents a more up-to-date investigation. I should probably mention here that I don’t own or use any Android device, or any other kind of smartphone or tablet for that matter&lt;sup id=&quot;fnref:2&quot;&gt;&lt;a href=&quot;#fn:2&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;2&lt;/a&gt;&lt;/sup&gt;. This probably makes me the worst possible person to evaluate Android emulation, but who’s going to stop me trying anyway? No one, that’s who!&lt;/p&gt;

&lt;!-- more --&gt;

&lt;h2 id=&quot;android-emulation-options&quot;&gt;Android emulation options&lt;/h2&gt;

&lt;p&gt;The Emulation General Wiki gives &lt;a href=&quot;https://emulation.gametechwiki.com/index.php/Android_emulators&quot;&gt;a good overview of Android emulators&lt;/a&gt;. Most of these are closed-source, and the Wiki warns that some may come with malicious apps pre-installed. For the purposes of long-term preservation and access, open-source solutions are much more relevant. The Wiki lists the following open-source emulators:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;https://www.android-x86.org/&quot;&gt;Android-x86&lt;/a&gt; is a port of the &lt;a href=&quot;https://source.android.com/&quot;&gt;Android open source project&lt;/a&gt; to the &lt;a href=&quot;https://en.wikipedia.org/wiki/X86&quot;&gt;x86&lt;/a&gt; architecture. It is not an emulator, but rather an operating system that can be installed on either a physical device, or within a virtual machine (e.g. using VirtualBox or QEMU).&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://developer.android.com/studio/run/emulator&quot;&gt;Android Emulator&lt;/a&gt; is part of &lt;a href=&quot;https://developer.android.com/studio/&quot;&gt;Android Studio&lt;/a&gt;, Google’s official development environment for Android. Android Emulator “simulates Android devices on your computer so that you can test your application on a variety of devices and Android API levels without needing to have each physical device”.&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://github.com/anbox/anbox&quot;&gt;Anbox&lt;/a&gt; is “a container-based approach to boot a full Android system on a regular GNU/Linux system like Ubuntu”.&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;http://www.shashlik.io/&quot;&gt;Shashlik&lt;/a&gt; is another project for running Android apps on Linux.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It’s worth pointing out that most of these solutions aren’t really “emulators” in a strict sense. Android-x86 is an operating system that can be run inside a virtual machine (without the need for real hardware emulation) on x86-based platforms. Android Emulator can do full &lt;a href=&quot;https://en.wikipedia.org/wiki/ARM_architecture&quot;&gt;ARM hardware&lt;/a&gt; emulation, but is usually run with x86 system images. Anbox is not an emulator at all, but rather a &lt;a href=&quot;https://en.wikipedia.org/wiki/Compatibility_layer&quot;&gt;compatibility layer&lt;/a&gt;, similar to how &lt;a href=&quot;https://www.winehq.org/&quot;&gt;WINE&lt;/a&gt; allows one to run Windows applications on Unix-like operating systems. Shashlik uses a hybrid approach, by pairing a stripped-down Android base that is run within a modified version of QEMU with graphics rendering on the host machine&lt;sup id=&quot;fnref:14&quot;&gt;&lt;a href=&quot;#fn:14&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;3&lt;/a&gt;&lt;/sup&gt;. For the sake of simplicity, I will use the term “emulation” for all of the above in this post.&lt;/p&gt;

&lt;h2 id=&quot;test-setup&quot;&gt;Test setup&lt;/h2&gt;

&lt;p&gt;The experiments described in the remainder of this post focus on the emulation of two selected test apps:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;&lt;a href=&quot;https://arize.io/&quot;&gt;ARize&lt;/a&gt; is an “augmented reality” app. It is used, among other things, to add 3-D visualizations and animations to the Dutch-language children’s book “De avonturen van Max - Op zoek naar F…”&lt;sup id=&quot;fnref:5&quot;&gt;&lt;a href=&quot;#fn:5&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;4&lt;/a&gt;&lt;/sup&gt;.&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://immer.app/&quot;&gt;Immer&lt;/a&gt; is a book reading app that claims to “help[ing] you read more often, easily and enjoyably on your phone or tablet”.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;I tried to emulate these apps using the following emulated environments:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Android-x86, running in VirtualBox&lt;/li&gt;
  &lt;li&gt;Android-x86, running in QEMU&lt;/li&gt;
  &lt;li&gt;Anbox&lt;/li&gt;
  &lt;li&gt;Android Emulator (Android Studio)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;I did not include Shashlik, as this project has been &lt;a href=&quot;https://github.com/shashlik&quot;&gt;inactive since 2016&lt;/a&gt;. I evaluated each of the emulations by the following (admittedly crude and incomplete) criteria:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Does the emulated environment work at all?&lt;/li&gt;
  &lt;li&gt;Is it possible to install the ARize and Immer apps?&lt;/li&gt;
  &lt;li&gt;Do the installed apps work?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I ran all tests on a regular desktop PC with the following characteristics:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Quad-core &lt;a href=&quot;https://ark.intel.com/content/www/us/en/ark/compare.html?productIds=88184&quot;&gt;Intel Core i5-6500 processor&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;12 GB RAM&lt;/li&gt;
  &lt;li&gt;Operating system: Linux Mint 19.3 (Tricia)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Below follows a more detailed discussion of each of the emulated environments. I deliberately included quite a lot of detail here, mostly to make life easier for others who may want to do similar tests themselves. If you’re only interested in the main findings, you may want to skip these details and head right over to the “Summary and conclusions” section.&lt;/p&gt;

&lt;h2 id=&quot;android-x86--virtualbox&quot;&gt;Android-x86 + VirtualBox&lt;/h2&gt;

&lt;h3 id=&quot;setup&quot;&gt;Setup&lt;/h3&gt;

&lt;p&gt;&lt;a href=&quot;https://www.sjoerdlangkemper.nl/2020/05/06/testing-android-apps-on-a-virtual-machine/&quot;&gt;This blog post by Sjoerd Langkemper&lt;/a&gt; gives a good overview of how to set up a virtual machine running Android-x86 with VirtualBox, and I largely followed its instructions. I used VirtualBox version 6.0.24, r139119. I had some difficulty getting a functional virtual machine from the latest Android-x86 ISO installer image, so in the end I settled for the pre-built VirtualBox image from &lt;a href=&quot;https://www.osboxes.org/android-x86/&quot;&gt;osboxes.org&lt;/a&gt; (Android-x86 9.0-R2, 64-bit). Initially the virtual machine got stuck on a black screen at startup. I was able to fix this by setting the graphics controller (which can be found under “Display” in the VM settings) to “VBoxSVGA”. I also set the number of processors (“System” settings, “Processor” tab) to 2, as according to the &lt;a href=&quot;https://www.android-x86.org/documentation/virtualbox.html&quot;&gt;Android-x86 documentation&lt;/a&gt;:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;Processor(s) should be set above 1 if you have more than one virtual processor in your host system. Failure to do so means every single app (like Google Chrome) might crush if you try to use it.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Here’s what the virtual machine looks like after it has booted:&lt;/p&gt;

&lt;figure class=&quot;image&quot;&gt;
  &lt;img src=&quot;/images/2021/02/vbox_android_startup.png&quot; alt=&quot;Home app, Android-x86 on VirtualBox.&quot; /&gt;
  &lt;figcaption&gt;Home app, Android-x86 on VirtualBox.&lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;At first glance, everything pretty much works, although I did run into a number of crashes of the Chrome browser app. These all occurred while entering text into the address bar. Strangely, I couldn’t replicate these crashes in a later session, so I’m not sure about the exact cause.&lt;/p&gt;

&lt;h3 id=&quot;app-installation&quot;&gt;App installation&lt;/h3&gt;

&lt;p&gt;To install apps on Android, one would normally use the &lt;a href=&quot;https://play.google.com/store/apps&quot;&gt;Google Play Store&lt;/a&gt;. Within a typical preservation workflow, you’re more likely to have a local copy of the app’s &lt;a href=&quot;https://en.wikipedia.org/wiki/Android_application_package&quot;&gt;Android Package (APK)&lt;/a&gt;. Because of this, I tried to “emulate” this by using locally downloaded APKs for my experiments. To achieve this, I first used the &lt;a href=&quot;https://github.com/matlink/gplaycli&quot;&gt;gplaycli&lt;/a&gt; tool to download APK files of the test apps to my local (Linux) machine. Installing locally downloaded APKs on an emulated Android machine can be a bit tricky, as VirtualBox (and most other Android emulators) provides no easy way to set up shared folders between the host machine and the emulated device. The easiest method (which works for &lt;em&gt;all&lt;/em&gt; of the environments covered by this post) uses the &lt;a href=&quot;https://developer.android.com/studio/command-line/adb&quot;&gt;Android Debug Bridge (adb)&lt;/a&gt;, which is part of &lt;a href=&quot;https://developer.android.com/studio/&quot;&gt;Android Studio&lt;/a&gt;. &lt;a href=&quot;https://www.sjoerdlangkemper.nl/2020/05/06/testing-android-apps-on-a-virtual-machine/&quot;&gt;Langkemper’s blog post&lt;/a&gt; covers the use of the &lt;em&gt;adb&lt;/em&gt; tool in detail, so for brevity I’ll only show the basic steps here.&lt;/p&gt;

&lt;p&gt;First, we need to find the IP address of our virtual Android machine, which in my particular case turned out to be “127.0.0.1” (localhost)&lt;sup id=&quot;fnref:6&quot;&gt;&lt;a href=&quot;#fn:6&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;5&lt;/a&gt;&lt;/sup&gt;. Then we can use the Android Debug Bridge tool to connect to the virtual machine:&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;adb connect 127.0.0.1
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;If all goes well, you should see this:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;connected to 127.0.0.1:5555
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;If you get a “Connection refused” error instead, you might need to set a port forwarding rule for the virtual machine. In VirtualBox, go to “Network” in your VM’s settings. Click on “Advanced”, followed by “Port Forwarding”. Here, click on the green “+” icon (top-right). This adds a port forwarding rule. Now change the values of both “Host Port” and “Guest Port” to 5555 (defaults for both are 0):&lt;/p&gt;

&lt;figure class=&quot;image&quot;&gt;
  &lt;img src=&quot;/images/2021/02/vbox_ptforward.png&quot; alt=&quot;Setting port forwarding rules in VirtualBox.&quot; /&gt;
  &lt;figcaption&gt;Setting port forwarding rules in VirtualBox.&lt;/figcaption&gt;
&lt;/figure&gt;
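
&lt;p&gt;The same forwarding rule can also be added from the command line with VirtualBox’s &lt;em&gt;VBoxManage&lt;/em&gt; tool, while the virtual machine is powered off. The command below is only a sketch: “Android-x86” is a placeholder for whatever name your own VM has in VirtualBox, and “adb” is an arbitrary label for the rule. The two empty fields stand for the host and guest IP addresses, which can be left blank here:&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# Forward TCP port 5555 on the host to port 5555 on the guest (NAT adapter 1).
# &quot;Android-x86&quot; is a placeholder VM name, &quot;adb&quot; an arbitrary rule name.
VBoxManage modifyvm &quot;Android-x86&quot; --natpf1 &quot;adb,tcp,,5555,,5555&quot;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;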

&lt;p&gt;Then try to connect again. Once the connection is established, you can install the local APK file with &lt;em&gt;adb&lt;/em&gt; using its “install” subcommand, with the name of the package file as an argument:&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;adb &lt;span class=&quot;nb&quot;&gt;install &lt;/span&gt;com.Triplee.TripleeSocial.apk
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
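
&lt;p&gt;As a quick sanity check, &lt;em&gt;adb&lt;/em&gt; can also list the devices it is connected to, and the packages that are installed on them. A minimal sketch, which assumes that the installed package’s name contains the string “triplee” (the APK file name suggests this, but a file name and a package name need not match in general):&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# List the devices adb is currently connected to
adb devices
# List installed packages, keeping only names that contain &quot;triplee&quot; (assumed)
adb shell pm list packages | grep -i triplee
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;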

&lt;h3 id=&quot;results-for-test-apps&quot;&gt;Results for test apps&lt;/h3&gt;

&lt;p&gt;I was able to install the test apps without any problems, and their launcher icons showed up promptly after the installation:&lt;/p&gt;

&lt;figure class=&quot;image&quot;&gt;
  &lt;img src=&quot;/images/2021/02/vbox_android_apps.png&quot; alt=&quot;App launchers after installation (note ARize and Immer icons).&quot; /&gt;
  &lt;figcaption&gt;App launchers after installation (note ARize and Immer icons).&lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;However, the ARize app crashed immediately after being launched:&lt;/p&gt;

&lt;figure class=&quot;image&quot;&gt;
  &lt;img src=&quot;/images/2021/02/vbox_android_arize.png&quot; alt=&quot;ARize crash message after repeated launch attempts.&quot; /&gt;
  &lt;figcaption&gt;ARize crash message after repeated launch attempts.&lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;By contrast, the Immer app worked without any problems. Below are some screenshots that show Immer in action:&lt;/p&gt;

&lt;figure class=&quot;image&quot;&gt;
  &lt;img src=&quot;/images/2021/02/vbox_android_immer.png&quot; alt=&quot;Immer welcome screen (Android-x86 + VirtualBox).&quot; /&gt;
  &lt;figcaption&gt;Immer welcome screen (Android-x86 + VirtualBox).&lt;/figcaption&gt;
&lt;/figure&gt;

&lt;figure class=&quot;image&quot;&gt;
  &lt;img src=&quot;/images/2021/02/vbox_android_immer_2.png&quot; alt=&quot;Immer book selection screen (Android-x86 + VirtualBox).&quot; /&gt;
  &lt;figcaption&gt;Immer book selection screen (Android-x86 + VirtualBox).&lt;/figcaption&gt;
&lt;/figure&gt;

&lt;figure class=&quot;image&quot;&gt;
  &lt;img src=&quot;/images/2021/02/vbox_android_immer_3.png&quot; alt=&quot;Immer book reading interface (Android-x86 + VirtualBox).&quot; /&gt;
  &lt;figcaption&gt;Immer book reading interface (Android-x86 + VirtualBox).&lt;/figcaption&gt;
&lt;/figure&gt;

&lt;h2 id=&quot;android-x86--qemu&quot;&gt;Android-x86 + QEMU&lt;/h2&gt;

&lt;h3 id=&quot;setup-1&quot;&gt;Setup&lt;/h3&gt;

&lt;p&gt;Since the Debian packages of QEMU for the Linux Mint version I’m using are way out of date, I first compiled and installed the most recent (5.2) QEMU release from its source, using the instructions &lt;a href=&quot;https://www.qemu.org/download/#source&quot;&gt;here&lt;/a&gt;&lt;sup id=&quot;fnref:7&quot;&gt;&lt;a href=&quot;#fn:7&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;6&lt;/a&gt;&lt;/sup&gt;. For setting up Android-x86 within QEMU, I largely followed &lt;a href=&quot;https://linuxhint.com/android_qemu_play_3d_games_linux/&quot;&gt;this guide by Nitesh Kumar&lt;/a&gt; (starting from the &lt;em&gt;Android-x86 QEMU Installation Walkthrough&lt;/em&gt; section). However, these instructions initially failed for me, and I could trace this back to problems with QEMU’s OpenGL (which is related to graphics rendering) support&lt;sup id=&quot;fnref:8&quot;&gt;&lt;a href=&quot;#fn:8&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;7&lt;/a&gt;&lt;/sup&gt;. After some experimentation, I managed to make it work by removing all OpenGL-related command line arguments. I’ll briefly summarize the setup steps here:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Download the Android-x86 ISO installer image from &lt;a href=&quot;https://osdn.net/projects/android-x86/releases&quot;&gt;here&lt;/a&gt; (I picked the 64-bit 9.0 R2 release).&lt;/li&gt;
  &lt;li&gt;Create a virtual hard disk (size: 8 GB):
    &lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;qemu-img create &lt;span class=&quot;nt&quot;&gt;-f&lt;/span&gt; qcow2 androidx86_9_hda.img 8G
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;    &lt;/div&gt;
  &lt;/li&gt;
  &lt;li&gt;Boot the Android-x86 live ISO image inside a virtual machine, attaching also the virtual hard disk:
    &lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;qemu-system-x86_64 &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;-enable-kvm&lt;/span&gt; &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;-m&lt;/span&gt; 2048 &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;-smp&lt;/span&gt; 2 &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;-cpu&lt;/span&gt; host &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;-device&lt;/span&gt; ES1370 &lt;span class=&quot;nt&quot;&gt;-device&lt;/span&gt; virtio-mouse-pci &lt;span class=&quot;nt&quot;&gt;-device&lt;/span&gt; virtio-keyboard-pci &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;-serial&lt;/span&gt; mon:stdio &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;-boot&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;menu&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;on &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;-net&lt;/span&gt; nic &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;-net&lt;/span&gt; user,hostfwd&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;tcp::4444-:5555 &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;-hda&lt;/span&gt; androidx86_9_hda.img &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;-cdrom&lt;/span&gt; android-x86_64-9.0-r2.iso
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;    &lt;/div&gt;
  &lt;/li&gt;
  &lt;li&gt;Follow &lt;a href=&quot;https://linuxhint.com/android_qemu_play_3d_games_linux/&quot;&gt;Kumar’s guide&lt;/a&gt; (starting from the first screenshot) to install Android on the virtual machine.&lt;/li&gt;
  &lt;li&gt;After the installation process is completed, close down the virtual machine (you can do this by simply closing the QEMU window), and then re-start it using (same command as before, but omitting the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;-cdrom&lt;/code&gt; argument):
    &lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;qemu-system-x86_64 &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;-enable-kvm&lt;/span&gt; &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;-m&lt;/span&gt; 2048 &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;-smp&lt;/span&gt; 2 &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;-cpu&lt;/span&gt; host &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;-device&lt;/span&gt; ES1370 &lt;span class=&quot;nt&quot;&gt;-device&lt;/span&gt; virtio-mouse-pci &lt;span class=&quot;nt&quot;&gt;-device&lt;/span&gt; virtio-keyboard-pci &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;-serial&lt;/span&gt; mon:stdio &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;-boot&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;menu&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;on &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;-net&lt;/span&gt; nic &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;-net&lt;/span&gt; user,hostfwd&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;tcp::4444-:5555 &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;-hda&lt;/span&gt; androidx86_9_hda.img
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;    &lt;/div&gt;
  &lt;/li&gt;
&lt;/ol&gt;
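
&lt;p&gt;Since the two QEMU commands above differ only in the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;-cdrom&lt;/code&gt; argument, they could be wrapped in one small Bash script. This is only a sketch: the script name &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;run-android.sh&lt;/code&gt; and the optional ISO argument are hypothetical, while all QEMU options are copied unchanged from the commands above:&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;#!/usr/bin/env bash
# run-android.sh (hypothetical name): start the Android-x86 virtual machine.
# Usage: ./run-android.sh [installer.iso]

# Options shared by the installation run and all later runs
args=(
  -enable-kvm
  -m 2048
  -smp 2
  -cpu host
  -device ES1370 -device virtio-mouse-pci -device virtio-keyboard-pci
  -serial mon:stdio
  -boot menu=on
  -net nic
  -net user,hostfwd=tcp::4444-:5555
  -hda androidx86_9_hda.img
)

# Attach the installer ISO only if one was given as the first argument
if [ -n &quot;$1&quot; ]; then
  args+=(-cdrom &quot;$1&quot;)
fi

qemu-system-x86_64 &quot;${args[@]}&quot;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Running the script with the ISO file as its argument would then correspond to step 3, and running it without any arguments to step 5.&lt;/p&gt;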

&lt;p&gt;And here’s what this looks like after start-up:&lt;/p&gt;

&lt;figure class=&quot;image&quot;&gt;
  &lt;img src=&quot;/images/2021/02/qemu_android_startup.png&quot; alt=&quot;Home app, Android-x86 on QEMU.&quot; /&gt;
  &lt;figcaption&gt;Home app, Android-x86 on QEMU.&lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;One oddity is that the rendering of colours doesn’t look quite right, with reds shown as shades of blue (this is even more apparent when you open some web pages with the Chrome app). Perhaps this could be remedied by using better device settings in QEMU, but I haven’t looked into this any further.&lt;/p&gt;

&lt;h3 id=&quot;app-installation-1&quot;&gt;App installation&lt;/h3&gt;

&lt;p&gt;Because of the slightly different network configuration, I had to add a reference to the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;4444&lt;/code&gt; network port to make the &lt;em&gt;adb&lt;/em&gt; connection to the QEMU machine:&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;adb connect 127.0.0.1:4444
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;After this, the package install procedure is identical to the one I showed for VirtualBox&lt;sup id=&quot;fnref:9&quot;&gt;&lt;a href=&quot;#fn:9&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;8&lt;/a&gt;&lt;/sup&gt;.&lt;/p&gt;
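
&lt;p&gt;If more than one emulated device is connected at the same time (for instance both the VirtualBox and the QEMU machine), &lt;em&gt;adb&lt;/em&gt; needs to be told which one to use. A sketch of how this might look, using &lt;em&gt;adb&lt;/em&gt;’s &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;-s&lt;/code&gt; option with the address of the QEMU machine as the device identifier:&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# Install the APK on the device that was connected as 127.0.0.1:4444
adb -s 127.0.0.1:4444 install com.Triplee.TripleeSocial.apk
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;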

&lt;h3 id=&quot;results-for-test-apps-1&quot;&gt;Results for test apps&lt;/h3&gt;

&lt;p&gt;I was able to install both test apps without problems on the QEMU machine. As with VirtualBox, the ARize app consistently crashed after launch. The Immer app worked without any issues.&lt;/p&gt;

&lt;h2 id=&quot;anbox&quot;&gt;Anbox&lt;/h2&gt;

&lt;h3 id=&quot;setup-2&quot;&gt;Setup&lt;/h3&gt;

&lt;p&gt;I installed Anbox (version 4-56c25f1) by following the &lt;a href=&quot;https://github.com/anbox/anbox/blob/master/docs/install.md&quot;&gt;official Anbox documentation&lt;/a&gt;. After the installation, fire up the “Anbox Application Manager” from (depending on your Linux desktop) the desktop menu or launch bar&lt;sup id=&quot;fnref:10&quot;&gt;&lt;a href=&quot;#fn:10&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;9&lt;/a&gt;&lt;/sup&gt;:&lt;/p&gt;

&lt;figure class=&quot;image&quot;&gt;
  &lt;img src=&quot;/images/2021/02/anbox_launch.png&quot; alt=&quot;Anbox launcher.&quot; /&gt;
  &lt;figcaption&gt;Anbox launcher in Linux Mint desktop menu.&lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;Unlike the other platforms covered in this post, Anbox doesn’t try to “emulate” a single device, but rather provides a compatibility layer that allows you to run Android apps from the Application Manager. Each app is launched in its own window, as shown below:&lt;/p&gt;

&lt;figure class=&quot;image&quot;&gt;
  &lt;img src=&quot;/images/2021/02/anbox_desktop.png&quot; alt=&quot;Anbox Application Manager with calculator, file manager and clock apps.&quot; /&gt;
  &lt;figcaption&gt;Anbox Application Manager with calculator, file manager and clock apps.&lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;Looking at the information under Settings, the current Anbox release is based on Android 7.1.1, which is quite an old version. A quick test with the pre-installed Webview Browser showed Anbox couldn’t connect to the internet, which apparently is a &lt;a href=&quot;https://github.com/anbox/anbox/issues/1724&quot;&gt;known issue&lt;/a&gt;. After some searching I found &lt;a href=&quot;https://wiki.archlinux.org/index.php/Anbox#Via_NetworkManager&quot;&gt;a workaround here&lt;/a&gt;. I simply ran the following command:&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;nmcli con add &lt;span class=&quot;nb&quot;&gt;type &lt;/span&gt;bridge ifname anbox0 &lt;span class=&quot;nt&quot;&gt;--&lt;/span&gt; connection.id &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
anbox-net ipv4.method shared ipv4.addresses 192.168.250.1/24
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;I then closed and re-started the Anbox Application Manager, after which internet connectivity was working properly.&lt;/p&gt;

&lt;h3 id=&quot;app-installation-2&quot;&gt;App installation&lt;/h3&gt;

&lt;p&gt;The Anbox Application Manager automatically launches an ADB server process, so you don’t need to manually run &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;adb connect&lt;/code&gt;. Other than that, you can use the regular &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;adb install&lt;/code&gt; commands to install the APK files.&lt;/p&gt;

&lt;h3 id=&quot;results-for-test-apps-2&quot;&gt;Results for test apps&lt;/h3&gt;

&lt;p&gt;My attempt to install the ARize app failed with this error message:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;adb: failed to install com.Triplee.TripleeSocial.apk:
Failure [INSTALL_FAILED_NO_MATCHING_ABIS: Failed to extract native libraries, res=-113]
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;According to &lt;a href=&quot;https://stackoverflow.com/a/24572239&quot;&gt;this StackOverflow answer&lt;/a&gt;, this error can occur if an app uses native (e.g. ARM) libraries that are not compatible with the architecture of the (virtual) destination machine.&lt;/p&gt;
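&lt;p&gt;One way to check whether this explanation could apply to a given APK is to look at the native libraries it bundles. The sketch below is only an illustration, and was not part of the tests described here. It relies on the fact that an APK is a ZIP archive in which native libraries are stored as &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;lib/ABI/name.so&lt;/code&gt;. The function names and the in-memory demo archive are made up for this example:&lt;/p&gt;

```python
import io
import zipfile


def native_abis(apk):
    """Return the sorted ABI names for which an APK bundles native libraries.

    `apk` may be a file path or a binary file-like object.
    """
    abis = set()
    with zipfile.ZipFile(apk) as archive:
        for name in archive.namelist():
            parts = name.split("/")
            # Native libraries live at lib/ABI/name.so inside the archive
            if len(parts) == 3 and parts[0] == "lib" and parts[2].endswith(".so"):
                abis.add(parts[1])
    return sorted(abis)


def demo_apk():
    """Build a tiny in-memory stand-in for an ARM-only APK (illustration only)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("AndroidManifest.xml", b"")
        archive.writestr("lib/armeabi-v7a/libexample.so", b"")
        archive.writestr("lib/arm64-v8a/libexample.so", b"")
    buffer.seek(0)
    return buffer


print(native_abis(demo_apk()))  # prints ['arm64-v8a', 'armeabi-v7a']
```

&lt;p&gt;If the resulting list only contains ARM ABIs such as &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;armeabi-v7a&lt;/code&gt; or &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;arm64-v8a&lt;/code&gt;, an x86-based environment without ARM translation would be expected to reject the package. An empty list means the APK bundles no native libraries at all.&lt;/p&gt;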

&lt;p&gt;The Immer app installed without any problems, and I was also able to launch it. However, the text in the app is partially rendered outside the app window:&lt;/p&gt;

&lt;figure class=&quot;image&quot;&gt;
  &lt;img src=&quot;/images/2021/02/anbox_immer.png&quot; alt=&quot;Immer welcome screen, Anbox.&quot; /&gt;
  &lt;figcaption&gt;Immer welcome screen, Anbox.&lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;After I tried to re-size or maximize the app window, all text disappeared altogether. I was also unable to get the core book selection and reading functionality working (I just ended up with an empty screen), although clicking on the icon at the bottom did allow me to open and edit the app’s user profile.&lt;/p&gt;

&lt;h2 id=&quot;android-emulator-android-studio&quot;&gt;Android Emulator (Android Studio)&lt;/h2&gt;

&lt;h3 id=&quot;setup-3&quot;&gt;Setup&lt;/h3&gt;

&lt;p&gt;I installed Android Studio 4.1.2, which includes version 30.3.5.0 of Android Emulator. The emulator can be accessed from the “Tools” menu in the main Android Studio application. Here, the “AVD Manager” allows you to set up one or more &lt;a href=&quot;https://developer.android.com/studio/run/managing-avds&quot;&gt;Android Virtual Devices&lt;/a&gt;, each based on a user-specified “hardware profile, system image, storage area, skin, and other properties”. The setup process uses a wizard-like interface, which is pretty straightforward to use:&lt;/p&gt;

&lt;figure class=&quot;image&quot;&gt;
  &lt;img src=&quot;/images/2021/02/android_emulator_avd.png&quot; alt=&quot;First step of Android Virtual Device setup process.&quot; /&gt;
  &lt;figcaption&gt;First step of Android Virtual Device setup process.&lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;Nevertheless, it offers many customization options that allow you to mimic very specific device configurations. Both x86 and ARM system images are available for a variety of Android versions. The documentation advises against full ARM emulation because of the better performance of the x86 images. According to the &lt;a href=&quot;https://developer.android.com/studio/releases/emulator#support_for_arm_binaries_on_android_9_and_11_system_images&quot;&gt;Emulator 30.0.0 release notes&lt;/a&gt;:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;If you were previously unable to use the Android Emulator because your app depended on ARM binaries, you can now use the Android 9 x86 system image or any Android 11 system image to run your app – it is no longer necessary to download a specific system image to run ARM binaries. These Android 9 and Android 11 system images support ARM by default and provide dramatically improved performance when compared to those with full ARM emulation.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Because of this recommendation, I set up a device with Android 9 (for better comparison with the Android-x86 tests) using the x86 image. The emulator can then be launched from either Android Studio, or &lt;a href=&quot;https://developer.android.com/studio/run/emulator-commandline&quot;&gt;using the command line&lt;/a&gt;. On startup, the emulated device looks like this:&lt;/p&gt;

&lt;figure class=&quot;image&quot;&gt;
  &lt;img src=&quot;/images/2021/02/android_emulator_startup.png&quot; alt=&quot;Home app, Android 9 on Android Emulator.&quot; /&gt;
  &lt;figcaption&gt;Home app, Android 9 on Android Emulator.&lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;Unlike the previous Android-x86 emulations, which are based on the Android Open Source Project, Android Emulator uses the official system images by Google. As a result, the “look and feel” of the Android Emulator virtual devices is quite different, and they also come with a larger number of pre-installed apps. Because of this, Android Emulator most likely approximates the Android experience on a physical device more closely&lt;sup id=&quot;fnref:11&quot;&gt;&lt;a href=&quot;#fn:11&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;10&lt;/a&gt;&lt;/sup&gt;.&lt;/p&gt;

&lt;h3 id=&quot;app-installation-3&quot;&gt;App installation&lt;/h3&gt;

&lt;p&gt;Like Anbox, Android Emulator automatically launches an ADB server process, so there’s no need to manually run &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;adb connect&lt;/code&gt;. Installing APK files involves the usual &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;adb install&lt;/code&gt; commands.&lt;/p&gt;

&lt;h3 id=&quot;results-for-test-apps-3&quot;&gt;Results for test apps&lt;/h3&gt;

&lt;p&gt;Both the ARize and Immer apps installed without any problems. Unlike with the other emulators in this test, I was able to run the ARize app here, although with some limitations. The screenshot below shows the app’s “gallery” screen:&lt;/p&gt;

&lt;figure class=&quot;image&quot;&gt;
  &lt;img src=&quot;/images/2021/02/arize_gallery.png&quot; alt=&quot;ARize app gallery.&quot; /&gt;
  &lt;figcaption&gt;ARize app gallery.&lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;Clicking on an item in the gallery opens up a 3-D model that can be manipulated by the user. As an example, here’s a model of a necklace:&lt;/p&gt;

&lt;figure class=&quot;image&quot;&gt;
  &lt;img src=&quot;/images/2021/02/arize_necklace.png&quot; alt=&quot;3-D visualization of necklace in ARize app.&quot; /&gt;
  &lt;figcaption&gt;3-D visualization of necklace in ARize app.&lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;If I understand it correctly, clicking the “View AR” button should switch the app to “augmented reality” mode, where the 3-D model is combined with video from the phone’s camera. Although Android Emulator allows one to attach external cameras (in my case a webcam), this didn’t quite work for me, with the emulator reporting a warning&lt;sup id=&quot;fnref:13&quot;&gt;&lt;a href=&quot;#fn:13&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;11&lt;/a&gt;&lt;/sup&gt;. Whenever I tried to use the camera, the app showed a blank screen, after which the emulated device became unresponsive.&lt;/p&gt;

&lt;p&gt;The Immer app worked without any issues, as shown in the screenshot below:&lt;/p&gt;

&lt;figure class=&quot;image&quot;&gt;
  &lt;img src=&quot;/images/2021/02/android_emulator_immer.png&quot; alt=&quot;Immer book reading interface (Android Emulator).&quot; /&gt;
  &lt;figcaption&gt;Immer book reading interface (Android Emulator).&lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;Although far from perfect, these Android Emulator results still look promising, bearing in mind that none of the other tested emulators could even run the ARize app at all.&lt;/p&gt;

&lt;h2 id=&quot;summary-and-discussion-of-results&quot;&gt;Summary and discussion of results&lt;/h2&gt;

&lt;p&gt;The table below summarizes the main results of the emulation tests:&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th style=&quot;text-align: left&quot;&gt; &lt;/th&gt;
      &lt;th style=&quot;text-align: left&quot;&gt;Android-x86, VirtualBox&lt;/th&gt;
      &lt;th style=&quot;text-align: left&quot;&gt;Android-x86, QEMU&lt;/th&gt;
      &lt;th style=&quot;text-align: left&quot;&gt;Anbox&lt;/th&gt;
      &lt;th style=&quot;text-align: left&quot;&gt;Android Studio&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;&lt;strong&gt;Emulator version&lt;/strong&gt;&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;6.0.24, r139119&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;5.2&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;4-56c25f1&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;30.3.5.0&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;&lt;strong&gt;Android version&lt;/strong&gt;&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Android-x86 9.0-R2 (64-bit)&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Android-x86 9.0-R2 (64-bit)&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Customized system image based on v. 7.1.1 of Android Open Source Project&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Android 9.0 x86 system image (Google), API level 28&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;&lt;strong&gt;Emulation approach&lt;/strong&gt;&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Virtualization&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Virtualization&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Compatibility layer&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Virtualization (full ARM emulation optional, depending on system image)&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;&lt;strong&gt;ARize app installs&lt;/strong&gt;&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Yes&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Yes&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;No&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Yes&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;&lt;strong&gt;ARize app works&lt;/strong&gt;&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;No&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;No&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;-&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Partially (camera device not recognised; emulator unresponsive after using camera)&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;&lt;strong&gt;Immer app installs&lt;/strong&gt;&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Yes&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Yes&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Yes&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Yes&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;&lt;strong&gt;Immer app works&lt;/strong&gt;&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Yes&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Yes&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Partially (rendering and navigation issues)&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Yes&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;It is important to stress that the tests presented here are limited in scope and size, and should not be interpreted as representative of Android apps in general. With that in mind, it is possible to draw some tentative conclusions.&lt;/p&gt;

&lt;h3 id=&quot;android-x86-limitations&quot;&gt;Android-x86 limitations&lt;/h3&gt;

&lt;p&gt;First of all, the results suggest that emulation approaches based on Android-x86 may have some serious limitations. Going by various reports I found on sites like StackOverflow, the ARize app crashing on startup might be indicative of a more widespread problem. Sjoerd Langkemper mentions in his &lt;a href=&quot;https://www.sjoerdlangkemper.nl/2020/05/06/testing-android-apps-on-a-virtual-machine/&quot;&gt;blog post&lt;/a&gt; that:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;Testing on a virtual machine (VM) has some disadvantages. Testing on an actual Android phone is more reliable. Android is meant to run on ARM phones and not on x86 virtual machines, so things may randomly break when using a VM. Apps that ship with native libraries may not run at all in the VM, or they may run perfectly but don’t show up in the Play store.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;A similar explanation, citing the use of native ARM libraries that are not supported by Android-x86, is given &lt;a href=&quot;https://stackoverflow.com/a/60148570&quot;&gt;here&lt;/a&gt;. I’m not sure this is the culprit here, but if correct, this would seriously limit the usefulness of Android-x86 for long-term access.&lt;/p&gt;

&lt;h3 id=&quot;anbox-1&quot;&gt;Anbox&lt;/h3&gt;

&lt;p&gt;In its current form, Anbox seems of limited value for long-term access. With that said, I quite like its approach to providing access to Android apps, which is very different to the other emulators covered by this post. The project has an active developer community, and I’m curious how it will develop in the future.&lt;/p&gt;

&lt;h3 id=&quot;android-emulator-for-long-term-access&quot;&gt;Android Emulator for long-term access&lt;/h3&gt;

&lt;p&gt;By contrast, Android Emulator (from Android Studio) could be a very interesting solution for emulating Android apps. It is the only emulator that was able to run both test apps (although with some issues in the case of the ARize app). It also has an overall look and feel that is more faithful to a physical Android device. It’s worth pointing out here that under the hood, Android Emulator uses (a modified version of) QEMU&lt;sup id=&quot;fnref:12&quot;&gt;&lt;a href=&quot;#fn:12&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;12&lt;/a&gt;&lt;/sup&gt;. Right now I’m unable to judge whether Android Emulator would be truly suitable as a solution for long-term access. Some concerns:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;Google has developed it for the sole purpose of allowing Android developers to test their apps on a variety of (virtual) devices. It’s unlikely that Google will keep developing or maintaining it beyond Android’s lifetime. For long-term access, this implies that some organization should take over the maintenance of (a fork of) the software from that point onwards (assuming this is allowed under its licensing conditions, but see below).&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;The &lt;a href=&quot;https://developer.android.com/studio/terms.html&quot;&gt;Terms and conditions&lt;/a&gt; of the Android Software Development Kit (of which Android Emulator is a part) state that:&lt;/p&gt;

    &lt;blockquote&gt;
      &lt;p&gt;3.1 Subject to the terms of the License Agreement, Google grants you a limited, worldwide, royalty-free, non-assignable, non-exclusive, and non-sublicensable license to use the SDK solely to develop applications for compatible implementations of Android.&lt;/p&gt;
    &lt;/blockquote&gt;

    &lt;p&gt;and:&lt;/p&gt;

    &lt;blockquote&gt;
      &lt;p&gt;3.2 You may not use this SDK to develop applications for other platforms (including non-compatible implementations of Android) or to develop another SDK. You are of course free to develop applications for other platforms, including non-compatible implementations of Android, provided that this SDK is not used for that purpose.&lt;/p&gt;
    &lt;/blockquote&gt;

    &lt;p&gt;To my legally untrained eyes, this appears to rule out the use of any of the components of the Android SDK for preservation and long-term access.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Also, Android Studio’s licensing information mentions that it includes “proprietary code subject to [a] separate license”. It’s not clear to me if this affects the Emulator component. The emulator’s subdirectory contains an additional file of more than 3000 lines with licensing information that applies specifically to the emulator component. I haven’t gone through it in detail (and am not planning to do so), but the licensing situation does look somewhat complex.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I can’t really assess to what extent this may impact the (future) use of Android Emulator as a long-term access solution, but I’d be interested in the opinion of any legal experts who may be reading this.&lt;/p&gt;

&lt;h3 id=&quot;external-dependencies&quot;&gt;External dependencies&lt;/h3&gt;

&lt;p&gt;On a final note, it’s important to stress that the ability to run an app in an emulated environment is only one part of the preservation puzzle, and by itself it doesn’t guarantee its accessibility over time. As Pennock, May &amp;amp; Day write:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;If the app is to be acquired in the most robust and complete form possible then we must find some way to deal with apps which have an inherent reliance on content hosted externally to the app. These are likely to lose their integrity over time, particularly as linkage to archived web content does not yet (if at all) appear to have become standard practice in apps.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This also applies to the ARize and Immer apps, both of which rely on externally hosted content. To illustrate this, after disabling the internet connection on my PC, the ARize app showed this on start-up:&lt;/p&gt;

&lt;figure class=&quot;image&quot;&gt;
  &lt;img src=&quot;/images/2021/02/arize_noconnection.png&quot; alt=&quot;Startup screen of ARize app after disabling internet connection.&quot; /&gt;
  &lt;figcaption&gt;Startup screen of ARize app after disabling internet connection.&lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;Immer simply started up with a blank screen. So, without (access to) the externally hosted resources, both apps are essentially useless.&lt;/p&gt;

&lt;h2 id=&quot;further-resources&quot;&gt;Further resources&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://zenodo.org/record/3460450&quot;&gt;Considerations on the Acquisition and Preservation of Mobile eBook Apps&lt;/a&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://www.sjoerdlangkemper.nl/2020/05/06/testing-android-apps-on-a-virtual-machine/&quot;&gt;Testing Android apps on a virtual machine&lt;/a&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://linuxhint.com/android_qemu_play_3d_games_linux/&quot;&gt;How to Run Android in QEMU to Play 3D Android Games on Linux&lt;/a&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://astr0baby.wordpress.com/2019/07/09/android-8-1-in-qemu-and-burp-suite-ssl-interception/&quot;&gt;Android 8.1 in qemu and Burp Suite SSL interception&lt;/a&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://developer.android.com/studio/run/managing-avds&quot;&gt;Create and manage virtual devices in Android Emulator&lt;/a&gt;&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;div class=&quot;footnotes&quot; role=&quot;doc-endnotes&quot;&gt;
  &lt;ol&gt;
    &lt;li id=&quot;fn:1&quot;&gt;
      &lt;p&gt;The proprietary nature of iOS severely constrains any emulation options; I may address this in a future blog post. &lt;a href=&quot;#fnref:1&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:2&quot;&gt;
      &lt;p&gt;As a matter of fact I’m still using &lt;a href=&quot;https://en.wikipedia.org/wiki/Motorola_C139&quot;&gt;this basic dumb phone&lt;/a&gt;, which I bought back in 2006. &lt;a href=&quot;#fnref:2&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:14&quot;&gt;
      &lt;p&gt;See &lt;a href=&quot;http://www.shashlik.io/what-is/&quot;&gt;http://www.shashlik.io/what-is/&lt;/a&gt;. &lt;a href=&quot;#fnref:14&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:5&quot;&gt;
      &lt;p&gt;This instruction video shows how this works &lt;a href=&quot;https://youtu.be/h4syCHftyCs&quot;&gt;https://youtu.be/h4syCHftyCs&lt;/a&gt;. &lt;a href=&quot;#fnref:5&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:6&quot;&gt;
      &lt;p&gt;Finding the correct IP address can be a bit tricky. Langkemper’s blog suggests either looking at Android’s Wi-Fi preferences, or running &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ip a&lt;/code&gt; or &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ifconfig&lt;/code&gt; in the Android terminal emulator app. However, in my case the value shown in the Wi-Fi preferences is “10.0.2.15”, which is not recognised by &lt;em&gt;adb&lt;/em&gt;. The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ifconfig&lt;/code&gt; command reports 3 different entries (“wlan0”, “wifi_eth” and “lo”); eventually I found the value of the “lo” (“local loopback”) entry (“127.0.0.1”) did the trick. So you might need to experiment a bit to make things work. &lt;a href=&quot;#fnref:6&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:7&quot;&gt;
      &lt;p&gt;Compilation of QEMU requires the Python &lt;a href=&quot;https://ninja-build.org/&quot;&gt;Ninja package&lt;/a&gt;, so install this first by running &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;python3 -m pip install --user ninja&lt;/code&gt;. &lt;a href=&quot;#fnref:7&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:8&quot;&gt;
      &lt;p&gt;E.g. see &lt;a href=&quot;https://forums.opensuse.org/showthread.php/539026-Can-t-enable-opengl-on-the-qemu-machine&quot;&gt;here&lt;/a&gt; and &lt;a href=&quot;https://bugzilla.redhat.com/show_bug.cgi?id=1867343&quot;&gt;here&lt;/a&gt;. &lt;a href=&quot;#fnref:8&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:9&quot;&gt;
      &lt;p&gt;If you are connected to multiple devices at the same time, you’ll need to add the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;-s&lt;/code&gt; switch to your &lt;em&gt;adb&lt;/em&gt; calls to specify the target device. See &lt;a href=&quot;https://developer.android.com/studio/command-line/adb#directingcommands&quot;&gt;the documentation&lt;/a&gt; for details. &lt;a href=&quot;#fnref:9&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:10&quot;&gt;
      &lt;p&gt;This might seem obvious, but it’s not really clear from the documentation, so I thought I’d just mention it. &lt;a href=&quot;#fnref:10&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:11&quot;&gt;
      &lt;p&gt;But bear in mind I didn’t have any physical Android devices available while doing these tests. &lt;a href=&quot;#fnref:11&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:13&quot;&gt;
      &lt;p&gt;“Camera name ‘webcam0’ is not found in the list of connected cameras”. &lt;a href=&quot;#fnref:13&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:12&quot;&gt;
      &lt;p&gt;This &lt;a href=&quot;https://wiki.diebin.at/Under_the_hood_of_Android_Emulator_(appcert).html&quot;&gt;“Under the hood of Android Emulator”&lt;/a&gt; Wiki entry shows how QEMU is used within the emulator (note that it hasn’t been updated since 2011, so it may be well out of date). &lt;a href=&quot;#fnref:12&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
  &lt;/ol&gt;
&lt;/div&gt;
</description>
                <link>https://bitsgalore.org/2021/02/09/four-android-emulators-two-apps</link>
                <guid>https://bitsgalore.org/2021/02/09/four-android-emulators-two-apps</guid>
                <pubDate>2021-02-09T00:00:00+01:00</pubDate>
        </item>

        <item>
                <title>Mapping the Dutch web domain</title>
                <description>&lt;figure class=&quot;image&quot;&gt;
  &lt;img src=&quot;/images/2020/09/wwwspace.png&quot; alt=&quot;Satellite image of Wadden Sea&quot; /&gt;
  &lt;figcaption&gt;&lt;a href=&quot;https://commons.wikimedia.org/wiki/File:World_Wide_Web_-_Digital_Preservation.png&quot;&gt;“World Wide Web - Digital Preservation”&lt;/a&gt; by &lt;a href=&quot;https://www.wikidata.org/wiki/Q55754361&quot;&gt;Jørgen Stamp&lt;/a&gt;, used under &lt;a href=&quot;https://creativecommons.org/licenses/by/2.5/dk/deed.en&quot;&gt;CC BY 2.5 DK&lt;/a&gt;; &lt;a href=&quot;https://commons.wikimedia.org/wiki/File:Wadden_Sea.jpg&quot;&gt;“Wadden Sea”&lt;/a&gt; by Envisat satellite, used under &lt;a href=&quot;https://creativecommons.org/licenses/by-sa/3.0-igo&quot;&gt;CC BY-SA 3.0-IGO&lt;/a&gt;.&lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;Earlier this year I wrote &lt;a href=&quot;/2020/02/11/web-domain-geolocation-and-spatial-analysis&quot;&gt;a blog post&lt;/a&gt; about geo-locating web domains, and the subsequent analysis of the resulting data in &lt;em&gt;QGIS&lt;/em&gt;. At the time, this work was meant as a proof of concept, and I had only tried it out on a small set of test data. We have now applied this methodology to the whole of the Dutch (&lt;em&gt;.nl&lt;/em&gt;) web domain, and this follow-up post presents the results of this exercise.&lt;/p&gt;

&lt;!-- more --&gt;

&lt;h2 id=&quot;identification-of-frisian-sites-in-the-nl-domain&quot;&gt;Identification of Frisian sites in the .nl domain&lt;/h2&gt;

&lt;p&gt;Shortly after I wrote my earlier blog post on geo-locating web domains, my web archiving colleagues expressed an interest in applying the procedure to the Dutch top-level &lt;em&gt;.nl&lt;/em&gt; domain in its entirety. The immediate occasion for this was an ongoing &lt;a href=&quot;https://newyear.isoc.nl/2019/pres/2019-01-017-03-KeesTeszelsky-FRL.pdf&quot;&gt;initiative&lt;/a&gt; to map and harvest the Frisian web domain. &lt;a href=&quot;https://en.wikipedia.org/wiki/Friesland&quot;&gt;Friesland&lt;/a&gt; (Fryslân) is a province in the northern part of the Netherlands. It has &lt;a href=&quot;https://opendata.cbs.nl/statline/#/CBS/nl/dataset/37230ned/table?ts=1599490582090&quot;&gt;about 650 thousand inhabitants&lt;/a&gt;, over half of whom speak &lt;a href=&quot;https://en.wikipedia.org/wiki/West_Frisian_language&quot;&gt;West Frisian&lt;/a&gt; as their native language. In 2014 a dedicated Frisian &lt;em&gt;.frl&lt;/em&gt; web domain was established, and in 2019 this domain spanned about 15,000 top-level domain sites, with mostly Frisian-language content&lt;sup id=&quot;fnref:1&quot;&gt;&lt;a href=&quot;#fn:1&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;1&lt;/a&gt;&lt;/sup&gt;. As many Frisian web sites are still part of other top-level domains such as the “regular” &lt;em&gt;.nl&lt;/em&gt; domain, the identification of such sites could provide valuable additional information about the Frisian web. The reasoning behind this is that if an &lt;em&gt;.nl&lt;/em&gt; domain is hosted somewhere in Friesland, chances are it corresponds to a web site that either uses the Frisian language, or is related to Friesland in some other way. 
Since we already had access to a list of all &lt;em&gt;.nl&lt;/em&gt; domains (provided to us by registry operator &lt;a href=&quot;https://www.sidn.nl/en/theme/about-sidn&quot;&gt;Stichting Internet Domeinregistratie Nederland&lt;/a&gt;), it seemed viable to extend the earlier geo-location exercise to the entire &lt;em&gt;.nl&lt;/em&gt; top-level domain.&lt;/p&gt;

&lt;h2 id=&quot;geolocation-of-web-domains&quot;&gt;Geolocation of web domains&lt;/h2&gt;

&lt;p&gt;Geo-locating a web domain involves two steps&lt;sup id=&quot;fnref:2&quot;&gt;&lt;a href=&quot;#fn:2&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;2&lt;/a&gt;&lt;/sup&gt;:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;IP lookup&lt;/strong&gt;: for each domain, we need to establish its corresponding IP address. There are various ways (and tools) to do this, but for the current analysis I used the Unix &lt;a href=&quot;https://linux.die.net/man/1/host&quot;&gt;&lt;em&gt;host&lt;/em&gt; tool&lt;/a&gt;.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Lookup geolocation data&lt;/strong&gt;: in this step we look up the geolocation attributes (latitude, longitude, but also country and city names) from each identified IP address. Here I used the &lt;a href=&quot;https://github.com/maxmind/GeoIP2-python&quot;&gt;&lt;em&gt;GeoIP2&lt;/em&gt;&lt;/a&gt; Python module, with the  &lt;a href=&quot;https://dev.maxmind.com/geoip/geoip2/geolite2&quot;&gt;&lt;em&gt;GeoLite2&lt;/em&gt; City database&lt;/a&gt; (freely available after registration). This database is widely used; for example, the British Library used it for geo-locating their 2014 domain crawl&lt;sup id=&quot;fnref:5&quot;&gt;&lt;a href=&quot;#fn:5&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;3&lt;/a&gt;&lt;/sup&gt;.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ol&gt;
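
&lt;p&gt;The two steps above can be sketched in Python roughly as follows. This is a minimal illustration and not the actual script used for this analysis: the function names are hypothetical, and the second step assumes that the &lt;em&gt;geoip2&lt;/em&gt; package is installed and that a reader has been opened on a local copy of the GeoLite2 City database.&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;
```python
import re
import subprocess

# Matches the "has address" lines printed by the Unix host tool,
# e.g. "example.nl has address 192.0.2.10"
ADDRESS_LINE = re.compile(r"has address (\d{1,3}(?:\.\d{1,3}){3})")


def parse_host_output(text):
    """Return the first IPv4 address found in host output, or None."""
    match = ADDRESS_LINE.search(text)
    return match.group(1) if match else None


def lookup_ip(domain):
    """Step 1: resolve a domain to an IPv4 address by calling host."""
    result = subprocess.run(["host", "-t", "A", domain],
                            capture_output=True, text=True)
    return parse_host_output(result.stdout)


def lookup_location(reader, ip):
    """Step 2: look up geolocation attributes for an IP address.

    reader is assumed to be a geoip2.database.Reader instance that was
    opened on a GeoLite2 City database file.
    """
    import geoip2.errors  # third-party package, imported only when needed
    try:
        response = reader.city(ip)
    except geoip2.errors.AddressNotFoundError:
        # The IP address is not covered by the database
        return None
    return {
        "countryIsoCode": response.country.iso_code,
        "cityName": response.city.name,
        "latitude": response.location.latitude,
        "longitude": response.location.longitude,
        "accuracyRadius": response.location.accuracy_radius,
    }
```
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Keeping the parsing of the &lt;em&gt;host&lt;/em&gt; output in a separate function makes it possible to check that part without any network access.&lt;/p&gt;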

&lt;h2 id=&quot;processing-millions-of-domains-in-the-era-of-covid-19&quot;&gt;Processing millions of domains in the era of COVID-19&lt;/h2&gt;

&lt;p&gt;I first tried to run the geolocation script that I also used in my earlier blog post, using the full list of 5.86 million &lt;em&gt;.nl&lt;/em&gt; domains as an input file. I started the script on the 2nd of March on a local machine in my office at the KB, and by extrapolating the throughput after one day, I estimated it would need about 10 days to finish. Unfortunately, by that time the KB had closed down due to the COVID-19 outbreak, which meant I was unable to retrieve the results any time soon. I briefly considered re-running the script from home, but decided against it for practical reasons&lt;sup id=&quot;fnref:3&quot;&gt;&lt;a href=&quot;#fn:3&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;4&lt;/a&gt;&lt;/sup&gt;. Then some months later my colleague Kees Teszelszky (who initiated the activities on the Frisian web domain) suggested we might use the services of &lt;a href=&quot;https://www.surf.nl/en/services-and-support/purchasing-services-from-surfsara&quot;&gt;SURFsara&lt;/a&gt;, which hosts several cloud computing facilities that can be used free of charge by Dutch research institutions. After getting in touch with them, SURFsara gave us access to its &lt;a href=&quot;https://doc.hpccloud.surfsara.nl/&quot;&gt;HPC Cloud&lt;/a&gt; platform. Here I created a simple virtual machine running Ubuntu Server, and then installed the dependencies needed by the geolocation script.&lt;/p&gt;

&lt;h2 id=&quot;parallelization&quot;&gt;Parallelization&lt;/h2&gt;

&lt;p&gt;To improve the performance of the script itself, I also parallelized its IP lookup component (which would take up most of its running time). Tests on an early version of the parallelized script resulted in a nasty memory leak when run with the full domains list. Even though I was able to fix this (albeit in a rather ugly way), tests with the &lt;a href=&quot;https://github.com/KBNLresearch/geolocatedomains/blob/master/scripts/geolocatedomains.py&quot;&gt;improved script&lt;/a&gt; still showed a slow increase of memory consumption over time. I was not sure how this increase would develop over the course of processing 5.86 million domains, so to be on the safe side I simply split the domains list into 6 smaller files of at most 1 million domains each, and then ran the script consecutively on all these files.&lt;/p&gt;
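
&lt;p&gt;The general idea of combining parallel lookups with a split input list can be sketched as follows. This is an illustration only: the use of a thread pool, the number of workers and all function names are assumptions, and the linked script may be organized quite differently.&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;
```python
from concurrent.futures import ThreadPoolExecutor
from itertools import islice


def chunked(items, size):
    """Yield consecutive lists of at most size items."""
    iterator = iter(items)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def lookup_all(domains, lookup, workers=20):
    """Run lookup on every domain using a pool of worker threads.

    Results are returned in the same order as the input domains.
    The default number of workers is an arbitrary, hypothetical value.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lookup, domains))


def process_in_chunks(domains, lookup, chunk_size=1_000_000):
    """Process a large domain list one chunk at a time.

    Yields one list of (domain, result) pairs per chunk, so a caller
    can write each chunk to its own output file before moving on.
    """
    for chunk in chunked(domains, chunk_size):
        yield list(zip(chunk, lookup_all(chunk, lookup)))
```
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;With a chunk size of 1 million, a list of 5.86 million domains results in five full chunks and one smaller one. Threads are a reasonable fit here because the IP lookup mostly waits on the network.&lt;/p&gt;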

&lt;h2 id=&quot;geolocation-results&quot;&gt;Geolocation results&lt;/h2&gt;

&lt;p&gt;The processing of all 6 files completed without problems in about 54 hours. The main result of this is a set of 6 comma-delimited text files, with the following fields:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;domain&lt;/strong&gt;: name of the web domain&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;hasValidIP&lt;/strong&gt;: Boolean flag that indicates whether the domain could be mapped to an IP address&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;countryIsoCode&lt;/strong&gt;: country ISO code (if available)&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;cityName&lt;/strong&gt;: city name (if available)&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;latitude&lt;/strong&gt; / &lt;strong&gt;longitude&lt;/strong&gt;: latitude and longitude in decimal degrees (if available)&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;accuracyRadius&lt;/strong&gt;: indicator of the accuracy of the reported latitude / longitude pair in kilometers (if available)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Note that this does not tell us yet in which province each web domain is hosted. However, it’s pretty easy to add this information, as I will show below.&lt;/p&gt;

&lt;h2 id=&quot;adding-the-province-information&quot;&gt;Adding the province information&lt;/h2&gt;

&lt;p&gt;In order to establish the corresponding province of each domain, I used &lt;a href=&quot;https://www.qgis.org/&quot;&gt;QGIS&lt;/a&gt;, an open-source geographical information system, and a freely available &lt;a href=&quot;https://www.nationaalgeoregister.nl/geonetwork/srv/dut/catalog.search#/metadata/e73b01f6-28c7-4bb7-a782-e877e8113e2c&quot;&gt;vector layer of Dutch province boundaries&lt;/a&gt; in &lt;a href=&quot;https://en.wikipedia.org/wiki/Shapefile&quot;&gt;Shapefile&lt;/a&gt; format. In this Shapefile each province is represented as a polygon &lt;sup id=&quot;fnref:4&quot;&gt;&lt;a href=&quot;#fn:4&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;5&lt;/a&gt;&lt;/sup&gt;. After opening the shapefile, I imported the first geolocation CSV file. I subsequently used the &lt;em&gt;Join Attributes by Location …&lt;/em&gt; function to add, for each domain point location, the value of the corresponding Province field from the Shapefile (if available). I then exported the result to a new CSV file (see my &lt;a href=&quot;/2020/02/11/web-domain-geolocation-and-spatial-analysis&quot;&gt;earlier post&lt;/a&gt; for details). I repeated this for each of the 6 CSV files. Finally I combined the 6 resulting output CSV files into one single file.&lt;/p&gt;
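
&lt;p&gt;Conceptually, a spatial join of points to polygons comes down to a point-in-polygon test for each domain location. The sketch below illustrates that idea with the classic ray casting method. It is not how QGIS implements the join, and it ignores complications of real province boundaries such as islands and holes; the function names are hypothetical.&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;
```python
def point_in_polygon(x, y, polygon):
    """Ray casting test: is point (x, y) inside the polygon?

    polygon is a list of (x, y) vertex tuples; the last vertex is
    implicitly connected back to the first.
    """
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge straddle the horizontal line through the point?
        if (y1 > y) != (y2 > y):
            # x coordinate where the edge crosses that line
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            # Count only crossings to the right of the point
            if x_cross > x:
                inside = not inside
    return inside


def find_province(lon, lat, provinces):
    """Return the name of the first province containing the point, or None.

    provinces maps a province name to a list of (lon, lat) vertices.
    """
    for name, polygon in provinces.items():
        if point_in_polygon(lon, lat, polygon):
            return name
    return None
```
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;A point that falls outside all polygons (for example a hosting location abroad) simply gets no province value, which matches the “if available” behaviour of the join.&lt;/p&gt;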

&lt;p&gt;After importing the combined CSV file into QGIS again (which is slow), I was able to create the following visualization of all domain hosting locations in the immediate vicinity of the Netherlands:&lt;/p&gt;

&lt;figure class=&quot;image&quot;&gt;
  &lt;img src=&quot;/images/2020/09/domains-map.png&quot; alt=&quot;Map of identified server locations&quot; /&gt;
  &lt;figcaption&gt;Map of identified hosting locations (locations outside the immediate vicinity of the Netherlands not shown).&lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;Note that one hosting location may represent many domains, so the number of visible dots does not reflect the actual number of domains. Also, I should perhaps stress here that these are all &lt;em&gt;hosting&lt;/em&gt; locations (i.e. the location of the service provider that hosts a domain), &lt;em&gt;not&lt;/em&gt; locations of individual domain owners!&lt;/p&gt;

&lt;h2 id=&quot;spatial-distribution-of-the-dutch-web-domain&quot;&gt;Spatial distribution of the Dutch web domain&lt;/h2&gt;

&lt;p&gt;The results of this geolocation exercise give some interesting insights into the spatial distribution of the Dutch web domain. The chart below shows the distribution of hosting countries (based on the country ISO codes from the GeoLite2 database):&lt;/p&gt;

&lt;figure class=&quot;image&quot;&gt;
  &lt;img src=&quot;/images/2020/09/countries.png&quot; alt=&quot;Pie chart of active domain counts by country&quot; /&gt;
  &lt;figcaption&gt;Active domain counts by country.&lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;Here are the same data as a table:&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th style=&quot;text-align: left&quot;&gt;Country&lt;/th&gt;
      &lt;th style=&quot;text-align: right&quot;&gt;Count&lt;/th&gt;
      &lt;th style=&quot;text-align: right&quot;&gt;% of all active domains&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;NL&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;3.45617e+06&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;75.03&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;DE&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;419623&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;9.11&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;US&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;235962&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;5.12&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;IE&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;180379&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;3.92&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;FR&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;78023&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;1.69&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;DK&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;68584&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;1.49&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;BE&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;42333&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;0.92&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;GB&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;26748&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;0.58&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;LU&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;15569&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;0.34&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Other&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;56489&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;1.17&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;This shows that almost 25% of all active &lt;em&gt;.nl&lt;/em&gt; domains are hosted from locations outside the Netherlands (Germany, the United States and Ireland are the most popular foreign hosting countries). Zooming in on the Netherlands only, the following chart shows the distribution of provinces from which domains are hosted:&lt;/p&gt;

&lt;figure class=&quot;image&quot;&gt;
  &lt;img src=&quot;/images/2020/09/provinces.png&quot; alt=&quot;Pie chart of active domain counts by province&quot; /&gt;
  &lt;figcaption&gt;Active domain counts by province (% of NL-hosted domains).&lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;And again in table form:&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th style=&quot;text-align: left&quot;&gt;Province&lt;/th&gt;
      &lt;th style=&quot;text-align: right&quot;&gt;Count&lt;/th&gt;
      &lt;th style=&quot;text-align: right&quot;&gt;% of all NL-hosted domains&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Noord-Holland&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;2.60368e+06&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;75.34&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Zuid-Holland&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;270357&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;7.82&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Gelderland&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;221120&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;6.4&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Noord-Brabant&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;133056&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;3.85&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Utrecht&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;62215&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;1.8&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Flevoland&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;53595&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;1.55&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Overijssel&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;33781&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;0.98&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Groningen&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;28392&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;0.82&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Fryslân&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;20184&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;0.58&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Drenthe&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;14971&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;0.43&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Limburg&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;11154&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;0.32&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Zeeland&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;3325&lt;/td&gt;
      &lt;td style=&quot;text-align: right&quot;&gt;0.1&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;It’s remarkable to see that over 75% of all NL-hosted domains are hosted from the province of Noord-Holland (North Holland) alone. This might simply reflect the concentration of internet hosting companies, especially in the Amsterdam area. At only about 20 thousand domains, Friesland (Fryslân) is one of the “smallest” provinces in terms of this analysis.&lt;/p&gt;
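
&lt;p&gt;The counts and percentages in the two tables above are simple tallies over the per-domain records. A minimal sketch of such a tally is shown below. It is not the analysis script linked at the end of this post, and the function name is hypothetical; the records are assumed to be dictionaries keyed by field name, such as the rows produced by Python’s &lt;em&gt;csv.DictReader&lt;/em&gt;.&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;
```python
from collections import Counter


def tally(records, field):
    """Count records per value of field and add percentage shares.

    records is an iterable of dicts, such as rows from csv.DictReader.
    Records with an empty or missing value for field are left out.
    Returns a list of (value, count, percentage) tuples, largest first.
    """
    counts = Counter(r[field] for r in records if r.get(field))
    total = sum(counts.values())
    return [(value, count, round(100 * count / total, 2))
            for value, count in counts.most_common()]
```
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Calling this with the &lt;em&gt;countryIsoCode&lt;/em&gt; field gives a per-country breakdown; the same function applied to the province field added by the spatial join would give a per-province breakdown.&lt;/p&gt;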

&lt;h2 id=&quot;discussion&quot;&gt;Discussion&lt;/h2&gt;

&lt;p&gt;In this blog post I have outlined the geo-location and data processing procedure, and I have also shown how the results of this exercise can reveal spatial characteristics of a top-level web domain. However, it’s important to be aware of the limitations of the source data used here, as well as the methodology as such.&lt;/p&gt;

&lt;p&gt;First of all there’s the accuracy of the geolocation results. The latitude and longitude values in the GeoLite2 database are only approximate, and refer to larger spatial entities, ranging from cities to even whole countries. As a measure of the approximate accuracy at any location, it reports an “accuracy radius” value, which is &lt;a href=&quot;https://www.maxmind.com/en/geoip2-precision-insights&quot;&gt;defined as&lt;/a&gt;:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;The approximate accuracy radius, in kilometers, around the latitude and longitude for the geographical entity (country, subdivision, city or postal code) associated with the IP address.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The figure below shows the distribution of accuracy radius values for all active domains that are part of the &lt;em&gt;.nl&lt;/em&gt; domain:&lt;/p&gt;

&lt;figure class=&quot;image&quot;&gt;
  &lt;img src=&quot;/images/2020/09/accuracyradius.png&quot; alt=&quot;Bar chart of distribution of accuracy radius values for active domains&quot; /&gt;
  &lt;figcaption&gt;Distribution of accuracy radius values for active domains.&lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;The underlying data show an accuracy radius value of 100 km or more for 96% of all active domains. If these values reflect the true accuracy of the database, the results of the current analysis should be interpreted with a good deal of caution, especially for a small country like the Netherlands. It might be possible to get better results with the higher-precision databases of MaxMind (the company behind the GeoLite2 database), which are available commercially.&lt;/p&gt;

&lt;p&gt;Another thing to keep in mind is that not all active domains have an associated web site. For example, some domains are only used for e-mail (I actually own such an e-mail-only domain myself). I haven’t investigated this any further for this analysis.&lt;/p&gt;

&lt;h2 id=&quot;acknowledgements&quot;&gt;Acknowledgements&lt;/h2&gt;

&lt;p&gt;&lt;a href=&quot;https://www.sidn.nl/en/theme/about-sidn&quot;&gt;Stichting Internet Domeinregistratie Nederland&lt;/a&gt; provided us with the top-level &lt;em&gt;.nl&lt;/em&gt; domain data on which this analysis is based. &lt;a href=&quot;https://www.surf.nl/en/services-and-support/purchasing-services-from-surfsara&quot;&gt;SURFsara&lt;/a&gt; (and in particular Nuno Ferreira) is thanked for allowing us to use their HPC Cloud computing facilities, and for their help in getting things running. Thanks are also due to Kees Teszelszky for his help with this work.&lt;/p&gt;

&lt;h2 id=&quot;further-resources&quot;&gt;Further resources&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://github.com/KBNLresearch/geolocatedomains/blob/master/scripts/geolocatedomains.py&quot;&gt;Geolocation script&lt;/a&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://github.com/KBNLresearch/geolocatedomains/blob/master/scripts/analyze.py&quot;&gt;Analysis script for QGIS-produced CSV file&lt;/a&gt; - This script automatically creates a report with most of the table data presented in this blog post. It also exports the Frisian records to a separate subset.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://github.com/KBNLresearch/geolocatedomains/blob/master/notes/notes.md&quot;&gt;Unedited working notes&lt;/a&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;/2020/02/11/web-domain-geolocation-and-spatial-analysis&quot;&gt;Web domain geolocation and spatial analysis with QGIS&lt;/a&gt;&lt;sup id=&quot;fnref:6&quot;&gt;&lt;a href=&quot;#fn:6&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;6&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;div class=&quot;footnotes&quot; role=&quot;doc-endnotes&quot;&gt;
  &lt;ol&gt;
    &lt;li id=&quot;fn:1&quot;&gt;
      &lt;p&gt;See Kees Teszelszky - &lt;a href=&quot;https://doi.org/10.34894/PO7PZD&quot;&gt;‘Alle Begjin Is Swier’: The Use Of The Frisian Web Domain Web Data For Digital Humanities Research&lt;/a&gt;. &lt;a href=&quot;#fnref:1&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:2&quot;&gt;
      &lt;p&gt;See my &lt;a href=&quot;/2020/02/11/web-domain-geolocation-and-spatial-analysis&quot;&gt;earlier blog post&lt;/a&gt; for a more detailed discussion. &lt;a href=&quot;#fnref:2&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:5&quot;&gt;
      &lt;p&gt;Geo-location in the 2014 UK Domain Crawl: &lt;a href=&quot;https://blogs.bl.uk/webarchive/2015/07/geo-location-in-the-2014-uk-domain-crawl.html&quot;&gt;https://blogs.bl.uk/webarchive/2015/07/geo-location-in-the-2014-uk-domain-crawl.html&lt;/a&gt; &lt;a href=&quot;#fnref:5&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:3&quot;&gt;
      &lt;p&gt;Mostly related to network stability, and the fact that I needed the two machines I have available at home for other things that would most likely interfere with the geolocation procedure. &lt;a href=&quot;#fnref:3&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:4&quot;&gt;
      &lt;p&gt;According to its &lt;a href=&quot;https://www.nationaalgeoregister.nl/geonetwork/srv/dut/catalog.search#/metadata/e73b01f6-28c7-4bb7-a782-e877e8113e2c?tab=general&quot;&gt;description&lt;/a&gt;, the provinces file is suitable for map scales ranging from 1:750,000 to 1:1,000,000, which suggests a (very approximate) spatial accuracy of about 1 km, which is adequate for this purpose. &lt;a href=&quot;#fnref:4&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:6&quot;&gt;
      &lt;p&gt;Note that some of the content of this blog post is outdated, because it is based on an older version of QGIS. For example, with the current (3.14) version I did not need to perform the (somewhat tedious) coordinate system transformation step. &lt;a href=&quot;#fnref:6&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
  &lt;/ol&gt;
&lt;/div&gt;
</description>
                <link>https://bitsgalore.org/2020/09/09/mapping-the-dutch-web-domain</link>
                <guid>https://bitsgalore.org/2020/09/09/mapping-the-dutch-web-domain</guid>
                <pubDate>2020-09-09T00:00:00+02:00</pubDate>
        </item>

        <item>
                <title>Restoring Liesbet's Virtual Home, a digital treasure from the early Dutch web</title>
                <description>&lt;figure class=&quot;image&quot;&gt;
  &lt;img src=&quot;/images/2020/06/liesbet-door.png&quot; alt=&quot;Liesbet door&quot; /&gt;
  &lt;figcaption&gt;Original artwork copyright &amp;copy;Liesbet Zikkenheimer.&lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;In 2019, Dutch telecommunications company &lt;a href=&quot;https://en.wikipedia.org/wiki/KPN&quot;&gt;KPN&lt;/a&gt; announced its plans to phase out its subsidiary &lt;a href=&quot;https://en.wikipedia.org/wiki/XS4ALL&quot;&gt;XS4ALL&lt;/a&gt;, which is one of the oldest internet service providers in the Netherlands. With this decision, thousands of homepages and personal web sites that are hosted under the XS4ALL domain are at risk of disappearing forever. The web archiving team of the National Library of the Netherlands (KB) has started an initiative to rescue a selection of these homepages, which includes some of the oldest born-digital publications of the Dutch web. This blog post describes an attempt to rescue and restore one of the oldest and most unique homepages from this collection: Liesbet’s Virtual Home (Liesbet’s Atelier), the personal web site of Dutch Internet pioneer Liesbet Zikkenheimer, which has a history that goes back to 1995. First I give some background information about XS4ALL, and the KB-led rescue initiative. Then I move on to the various (mostly technical) aspects of restoring Liesbet’s Virtual Home. Finally, I address the challenges of capturing the restored site to an ingest-ready &lt;a href=&quot;https://en.wikipedia.org/wiki/Web_ARChive&quot;&gt;WARC&lt;/a&gt; file.&lt;/p&gt;

&lt;!-- more --&gt;

&lt;h2 id=&quot;xs4all&quot;&gt;XS4ALL&lt;/h2&gt;

&lt;p&gt;&lt;a href=&quot;https://en.wikipedia.org/wiki/XS4ALL&quot;&gt;XS4ALL&lt;/a&gt; is one of the oldest internet service providers in the Netherlands, and even one of the oldest providers in the whole world. The company was founded in 1993, and has its roots in the Dutch hacker scene. Since its inception, many pioneers of the Dutch internet hosted their homepages under the XS4ALL web domain. Some of these homepages have been online for more than 25 years, and as such they rank among the oldest born-digital publications of the Dutch web. In 2019, parent company KPN (which bought XS4ALL in 1998) announced their intention to phase out the XS4ALL brand. Eventually, all of the company’s services will continue under the KPN brand. This poses an acute threat to much of the (often unique) digital heritage that is hosted under the XS4ALL domain, as in the past a similar situation with provider &lt;a href=&quot;https://nl.wikipedia.org/wiki/EuroNet&quot;&gt;Euronet&lt;/a&gt; resulted in &lt;a href=&quot;https://www.tandfonline.com/doi/full/10.1080/24701475.2019.1603951&quot;&gt;the loss of thousands of early Dutch homepages&lt;/a&gt;.&lt;/p&gt;

&lt;h2 id=&quot;rescuing-the-xs4all-homepages&quot;&gt;Rescuing the XS4ALL homepages&lt;/h2&gt;

&lt;p&gt;So, the KB’s &lt;a href=&quot;https://www.kb.nl/en/organisation/research-expertise/long-term-usability-of-digital-resources/web-archiving&quot;&gt;web archiving team&lt;/a&gt; started an initiative to rescue a selection of the XS4ALL homepages, and add them to the web archive in a special XS4ALL collection. This initiative is supported financially by the &lt;a href=&quot;https://www.sidnfonds.nl/excerpt&quot;&gt;SIDN Fund&lt;/a&gt; and &lt;a href=&quot;https://www.stichtinginternet4all.nl/&quot;&gt;stichting Internet4all&lt;/a&gt;.&lt;/p&gt;

&lt;h2 id=&quot;digital-treasures&quot;&gt;Digital treasures&lt;/h2&gt;

&lt;p&gt;As of May 2020, the KB has selected &lt;a href=&quot;https://www.kb.nl/blogs/duurzame-toegang/bewaren-voor-iedereen-de-opbouw-van-de-webcollectie-xs4all-homepages&quot;&gt;3370 homepages&lt;/a&gt; for inclusion in the XS4ALL collection. Out of the homepages that have been archived so far, 404 are marked as “digital treasures”. These homepages are remarkable because of their age, or because of characteristics that are either unique, or, on the other hand, typical of a particular era or trend.&lt;/p&gt;

&lt;h2 id=&quot;liesbets-virtual-home&quot;&gt;Liesbet’s Virtual Home&lt;/h2&gt;

&lt;p&gt;One of these “treasures” is &lt;a href=&quot;https://ziklies.home.xs4all.nl/&quot;&gt;Liesbet’s Virtual Home&lt;/a&gt; (in Dutch: Liesbet’s Atelier). This is the old homepage of &lt;a href=&quot;http://zicnet.nl/&quot;&gt;Liesbet Zikkenheimer&lt;/a&gt;, a Dutch Internet pioneer with a background in industrial and graphic design. Her homepage is a “treasure” for several reasons. First of all, it has a history that goes back to 1995, which makes it one of the oldest Dutch homepages that are still available today. Second, Zikkenheimer is an important figure in the history of the Dutch internet. To mention a few examples, in 1997 she developed and published &lt;a href=&quot;https://web.archive.org/web/19980526231446/http://www.libelle.nl/libelle/dezeweek/dezeweek.html&quot;&gt;the online version of popular women’s magazine Libelle&lt;/a&gt;. She also created web sites for the Margriet and Viva magazines, and developed, published and managed several well-known web portals, most of which were primarily targeted at women. Finally, Liesbet’s Virtual Home is unique because of its structure and design. The site is literally structured like a physical house. Each page represents a particular room, and to get from, say, the living room to the loft, one needs to navigate through a hallway and two flights of stairs. It also has some interactive features that were quite unique at the time of its creation. So, the site meets every possible “digital treasure” criterion.&lt;/p&gt;

&lt;h2 id=&quot;problems-with-the-live-site&quot;&gt;Problems with the live site&lt;/h2&gt;

&lt;p&gt;Even though Liesbet’s Virtual Home &lt;a href=&quot;https://ziklies.home.xs4all.nl/&quot;&gt;is still online&lt;/a&gt;, several features of the site are no longer working. In particular:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;The navigation on several pages (including the landing page) is broken, because it is based on server-side &lt;a href=&quot;https://en.wikipedia.org/wiki/Image_map&quot;&gt;image maps&lt;/a&gt; that are no longer available.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Some interactive features like &lt;a href=&quot;https://ziklies.home.xs4all.nl/slaapk/e-slaap1.html&quot;&gt;this bedroom mirror&lt;/a&gt; don’t work anymore, because the underlying scripts are missing.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This raised the question whether it would be possible to create a “restored” version of Liesbet’s Virtual Home that has these features working again.&lt;/p&gt;

&lt;h2 id=&quot;local-copy-of-site-data&quot;&gt;Local copy of site data&lt;/h2&gt;

&lt;p&gt;After my colleague Kees Teszelszky got in contact with Zikkenheimer, she sent us a ZIP file with a locally stored copy of the site’s directory structure. That local copy had several issues of its own, and it quickly became obvious it couldn’t be used as a basis for a restored version of the site. However, the ZIP file did contain both the image map files and the scripts that are missing from the live site.&lt;/p&gt;

&lt;h2 id=&quot;crawling-the-live-site&quot;&gt;Crawling the live site&lt;/h2&gt;

&lt;p&gt;So, I decided to take the current live site as a starting point. I first tried to crawl and capture the site with &lt;a href=&quot;https://github.com/KBNLresearch/xs4all-resources/blob/master/scripts/scrapesite.sh&quot;&gt;this simple Bash script&lt;/a&gt; that uses the &lt;a href=&quot;https://www.gnu.org/software/wget/&quot;&gt;wget&lt;/a&gt; tool. At first sight this seemed to work reasonably well, but a closer inspection revealed that various components of the site were missing. A few examples:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;The &lt;a href=&quot;https://ziklies.home.xs4all.nl/e-toilet.html&quot;&gt;toilet&lt;/a&gt; pages are not referenced from their &lt;a href=&quot;https://ziklies.home.xs4all.nl/e-start.html&quot;&gt;parent pages&lt;/a&gt;. This has the result that wget never finds them.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Some components are only referenced through JavaScript. For example, &lt;a href=&quot;https://ziklies.home.xs4all.nl/woonk/woon03.html&quot;&gt;this page&lt;/a&gt; contains the following code that opens a video in a popup window:&lt;/p&gt;

    &lt;div class=&quot;language-html highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nt&quot;&gt;&amp;lt;A&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;HREF=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;javascript:openit(&apos;tvplus.mov&apos;)&quot;&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;    &lt;/div&gt;

    &lt;p&gt;Such items are not picked up by wget, which means they end up missing from the crawl.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;
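
&lt;p&gt;One possible way to detect references like the one in the second example is to scan the downloaded pages for such links with a regular expression. The sketch below is only an illustration of that idea and was not part of the actual workflow. The function names are hypothetical, and it assumes that the file name passed to &lt;em&gt;openit&lt;/em&gt; is relative to the page that contains the link.&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;
```python
import re
from urllib.parse import urljoin

# Matches link targets such as javascript:openit('tvplus.mov')
# and captures the quoted argument.
OPENIT_CALL = re.compile(
    r"""javascript:openit\(\s*['"]([^'"]+)['"]\s*\)""",
    re.IGNORECASE)


def find_popup_targets(html):
    """Return the file names passed to openit() in javascript: links."""
    return OPENIT_CALL.findall(html)


def extra_seed_urls(page_url, html):
    """Turn popup targets into absolute URLs.

    Assumption: targets are resolved relative to the containing page.
    """
    return [urljoin(page_url, target)
            for target in find_popup_targets(html)]
```
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;URLs found this way could then be offered to the crawler as additional starting points.&lt;/p&gt;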

&lt;h2 id=&quot;improved-crawl&quot;&gt;Improved crawl&lt;/h2&gt;

&lt;p&gt;After some experimentation, I was able to improve the crawl by using multiple seed URLs. This means that instead of traversing the site from its index page only, I included both the unreferenced “toilet” pages and a list of all the site’s visible directories and sub-directories (which I could identify from the initial crawl). First I put all of these in a text file (seed-urls.txt):&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;https://ziklies.home.xs4all.nl/
https://ziklies.home.xs4all.nl/toilet.html
https://ziklies.home.xs4all.nl/e-toilet.html
https://ziklies.home.xs4all.nl/atelier/
https://ziklies.home.xs4all.nl/bad/
https://ziklies.home.xs4all.nl/cas/
https://ziklies.home.xs4all.nl/gambia/
https://ziklies.home.xs4all.nl/keuken/
https://ziklies.home.xs4all.nl/slaapk/
https://ziklies.home.xs4all.nl/slaapk/gspot/
https://ziklies.home.xs4all.nl/toilet/
https://ziklies.home.xs4all.nl/woonk/
https://ziklies.home.xs4all.nl/woonk/agenda/
https://ziklies.home.xs4all.nl/zolder/
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;I then ran &lt;a href=&quot;https://github.com/KBNLresearch/xs4all-resources/blob/master/scripts/scrape-seeds.sh&quot;&gt;this Bash script&lt;/a&gt;, using the above text file as a command-line argument:&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;scrape-seeds.sh seed-urls.txt
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;As a check I subsequently did a recursive diff on the output directories of both the original crawl and the improved crawl:&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;diff &lt;span class=&quot;nt&quot;&gt;-r&lt;/span&gt; ./wget-original/ziklies.home.xs4all.nl/ ./wget-improved/ziklies.home.xs4all.nl/ &lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; diff-site-toilet.txt
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
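
&lt;p&gt;For a quick count of what the improved crawl adds, the same comparison can be sketched with Python’s standard &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;filecmp&lt;/code&gt; module. This is an illustrative alternative to the diff command above, and the function name &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;only_in_right&lt;/code&gt; is made up for this example. Like diff, it reports a directory that exists on one side only as a single entry, without descending into it:&lt;/p&gt;

```python
import filecmp
import os


def only_in_right(left, right):
    """Recursively list paths that exist under right but not under left."""
    found = []

    def walk(node, prefix):
        # Entries that are only present in the right-hand directory
        for name in sorted(node.right_only):
            found.append(os.path.join(prefix, name))
        # Descend into directories that exist on both sides
        for name, sub in sorted(node.subdirs.items()):
            walk(sub, os.path.join(prefix, name))

    walk(filecmp.dircmp(left, right), "")
    return found
```

&lt;p&gt;Calling it with the two crawl directories used in the diff command, and taking the length of the result, gives the number of entries that only occur in the improved crawl.&lt;/p&gt;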

&lt;p&gt;This showed that the improved crawl contained over 60 files that were not in the original crawl. So, I used the result of this “improved” crawl as a basis for all subsequent restoration steps. In the following sections I will go through the whole restoration process. More details are available in a separate, minimally edited &lt;a href=&quot;https://github.com/KBNLresearch/xs4all-resources/blob/master/doc/liesbets-atelier-restoration-notes.md&quot;&gt;Restoration notes&lt;/a&gt; document.&lt;/p&gt;

&lt;h2 id=&quot;links-to-old-website-domain&quot;&gt;Links to old website domain&lt;/h2&gt;

&lt;p&gt;Like all old XS4ALL homepages, Liesbet’s Virtual Home was originally hosted as a directory under XS4ALL’s root domain (&lt;a href=&quot;http://www.xs4all.nl/~ziklies/&quot;&gt;http://www.xs4all.nl/~ziklies/&lt;/a&gt;). At some point XS4ALL gave its customers their own sub-domain (in this case the current address at &lt;a href=&quot;https://ziklies.home.xs4all.nl/&quot;&gt;https://ziklies.home.xs4all.nl/&lt;/a&gt;), and redirected any URLs pointing to the “old” location to this sub-domain. Internally, Liesbet’s Virtual Home uses a mixture of relative URLs and absolute ones that still refer to the old location. This causes several issues if the site is hosted locally on a web server. Although it may be possible to remedy these issues using some clever server configuration, I couldn’t quite get this working. I ended up writing a &lt;a href=&quot;https://github.com/KBNLresearch/xs4all-resources/blob/master/scripts/rewriteurls.sh&quot;&gt;simple Bash script&lt;/a&gt; that replaces all references to the “old” location with relative links (which always work, irrespective of the domain).&lt;/p&gt;
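
&lt;p&gt;The core of such a rewrite is small. The sketch below shows the idea in Python; it is not the linked Bash script, and the function name and the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;page_path&lt;/code&gt; parameter (the path of the page relative to the site root, with forward slashes) are assumptions made for illustration. It only handles the exact prefix shown, so spelling variants of the old address would need extra rules, and a link to the bare old root on a top-level page would end up as an empty reference:&lt;/p&gt;

```python
OLD_PREFIX = "http://www.xs4all.nl/~ziklies/"


def rewrite_old_urls(html, page_path):
    """Replace references to the old location with links relative to page_path."""
    # "woonk/agenda/x.html" sits two levels below the site root,
    # so links from that page need a "../../" prefix.
    depth = page_path.count("/")
    return html.replace(OLD_PREFIX, "../" * depth)
```

&lt;p&gt;For a page in the site root the prefix is simply removed, whereas for a page under slaapk/gspot/ it is replaced by ../../.&lt;/p&gt;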

&lt;h2 id=&quot;audit-trail&quot;&gt;Audit trail&lt;/h2&gt;

&lt;p&gt;Since a restoration like this involves making changes to a unique digital heritage work, it’s a good idea to record these changes in a verifiable audit trail. To achieve this I simply set up the directory with the crawl as a &lt;a href=&quot;https://en.wikipedia.org/wiki/Git&quot;&gt;Git&lt;/a&gt; repository, and then created a snapshot (Git commit) for each change. The following screenshot (from the &lt;a href=&quot;https://git-scm.com/docs/gitk&quot;&gt;gitk&lt;/a&gt; Git repository browser) illustrates this:&lt;/p&gt;

&lt;figure class=&quot;image&quot;&gt;
  &lt;img src=&quot;/images/2020/06/liesbet-gitk.png&quot; alt=&quot;Gitk screenshot&quot; /&gt;
  &lt;figcaption&gt;Screenshot of Gitk Git repository browser.&lt;/figcaption&gt;
&lt;/figure&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;The upper-left pane lists all snapshots/commits (newest at the top, oldest at the bottom).&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;The lower-left pane shows all changes between the currently selected snapshot and the previous one. In this example, we can see that in file “kaart01.htm” three absolute page references (highlighted in red) were replaced by relative references (highlighted in green).&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;The lower-right pane lists all files that were changed in this snapshot.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This way, the commit history provides a complete audit trail of all changes. It also has the advantage that any changes can be easily undone, if needed. For example, the URL rewriting operation discussed in the previous section had an unintentional side-effect for the &lt;a href=&quot;https://ziklies.home.xs4all.nl/statistics.html&quot;&gt;statistics page&lt;/a&gt;. I could easily revert to the original URLs for this page with a single Git command.&lt;/p&gt;

&lt;h2 id=&quot;missing-links-to-toilet&quot;&gt;Missing links to toilet&lt;/h2&gt;

&lt;p&gt;As mentioned before, the &lt;a href=&quot;https://ziklies.home.xs4all.nl/start.html&quot;&gt;start.html&lt;/a&gt; and &lt;a href=&quot;https://ziklies.home.xs4all.nl/e-start.html&quot;&gt;e-start.html&lt;/a&gt; pages contain erroneous links to the “toilet” pages. Looking at the English-language page:&lt;/p&gt;

&lt;div class=&quot;language-html highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nt&quot;&gt;&amp;lt;B&amp;gt;&amp;lt;A&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;HREF=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;http://imagine.xs4all.nl/ziklies/&quot;&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;&amp;gt;&lt;/span&gt; Go to the toilet&lt;span class=&quot;nt&quot;&gt;&amp;lt;/a&amp;gt;&amp;lt;/B&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Here, the URL points to an external domain that doesn’t exist anymore. So, I changed this to:&lt;/p&gt;

&lt;div class=&quot;language-html highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nt&quot;&gt;&amp;lt;B&amp;gt;&amp;lt;A&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;HREF=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;e-toilet.html&quot;&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;&amp;gt;&lt;/span&gt; Go to the toilet&lt;span class=&quot;nt&quot;&gt;&amp;lt;/a&amp;gt;&amp;lt;/B&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;I applied a similar fix to the Dutch-language page.&lt;/p&gt;

&lt;h2 id=&quot;image-maps&quot;&gt;Image maps&lt;/h2&gt;

&lt;p&gt;A number of pages on the site use HTML &lt;a href=&quot;https://en.wikipedia.org/wiki/Image_map&quot;&gt;image maps&lt;/a&gt;. An example is the door image on the &lt;a href=&quot;https://ziklies.home.xs4all.nl/&quot;&gt;front page&lt;/a&gt;. This is the corresponding HTML source:&lt;/p&gt;

&lt;div class=&quot;language-html highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nt&quot;&gt;&amp;lt;A&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;HREF=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;/cgi-bin/imagemap/~ziklies/deurtje1.map&quot;&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;&amp;gt;&amp;lt;img&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;src=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;deurtje1.gif&quot;&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;Border=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;0&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;ISMAP&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;&amp;gt;&amp;lt;/A&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This is a &lt;em&gt;server-side&lt;/em&gt; image map, where the URL points to an external file (deurtje1.map). However, this file is not available anymore on the live site, which results in a &lt;a href=&quot;https://en.wikipedia.org/wiki/HTTP_404&quot;&gt;Page Not Found&lt;/a&gt; error. Fortunately, I was able to find this file in the ZIP archive provided by Zikkenheimer. Here’s what it looks like:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;default http://www.xs4all.nl/~ziklies/start.html
poly http://www.xs4all.nl/~ziklies/start.html 0,56 76,40 91,43 89,67 76,62 6,77 0,60 1,55
poly http://www.xs4all.nl/~ziklies/e-start.html 53,72 80,76 81,81 91,85 89,107 76,106 69,137 42,130 51,77
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This shows that the file simply defines areas within the image that are linked to URLs.&lt;/p&gt;
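
&lt;p&gt;To make that structure explicit, here is a minimal Python sketch that parses the three lines shown above into a shape, a URL and a list of (x, y) points. The function name &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;parse_map&lt;/code&gt; is hypothetical, and the handling of blank lines and of lines starting with “#” is an assumption about the file format rather than something that occurs in this particular file:&lt;/p&gt;

```python
def parse_map(text):
    """Parse lines of the form 'shape url x,y x,y ...' into dictionaries."""
    areas = []
    for line in text.splitlines():
        fields = line.split()
        if not fields or fields[0].startswith("#"):
            continue  # skip blank lines and (assumed) comment lines
        shape, url, points = fields[0], fields[1], fields[2:]
        # Each point is written as "x,y"; "default" lines have no points
        coords = [tuple(int(v) for v in p.split(",")) for p in points]
        areas.append({"shape": shape, "url": url, "coords": coords})
    return areas


MAP_TEXT = """default http://www.xs4all.nl/~ziklies/start.html
poly http://www.xs4all.nl/~ziklies/start.html 0,56 76,40 91,43 89,67 76,62 6,77 0,60 1,55
poly http://www.xs4all.nl/~ziklies/e-start.html 53,72 80,76 81,81 91,85 89,107 76,106 69,137 42,130 51,77
"""

for area in parse_map(MAP_TEXT):
    print(area["shape"], area["url"], len(area["coords"]))
```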

&lt;h2 id=&quot;from-server-side-to-client-side-image-maps&quot;&gt;From server-side to client-side image maps&lt;/h2&gt;

&lt;p&gt;Since &lt;em&gt;server-side&lt;/em&gt; image maps &lt;a href=&quot;https://eager.io/blog/a-quick-history-of-image-maps/&quot;&gt;come with some caveats&lt;/a&gt;, I took the liberty of re-implementing the server-side image map with a &lt;em&gt;client-side&lt;/em&gt; image map. Both are functionally identical, but client-side image maps are simpler to implement and less likely to break&lt;sup id=&quot;fnref:2&quot;&gt;&lt;a href=&quot;#fn:2&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;1&lt;/a&gt;&lt;/sup&gt;. Instead of using an external file, a client-side image map is simply an embedded element inside the page, which means we can replace the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;&amp;lt;A&amp;gt;&lt;/code&gt; element in the previous HTML snippet by this:&lt;/p&gt;

&lt;div class=&quot;language-html highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nt&quot;&gt;&amp;lt;img&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;src=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;deurtje1.gif&quot;&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;usemap=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;#deurtje1Map&quot;&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;alt=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;deurtje 1&quot;&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;border=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;0&quot;&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;map&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;name=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;deurtje1Map&quot;&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;&amp;lt;area&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;shape=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;poly&quot;&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;coords=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;0,56 76,40 91,43 89,67 76,62 6,77 0,60 1,55&quot;&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;href=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;http://www.xs4all.nl/~ziklies/start.html&quot;&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;&amp;lt;area&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;shape=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;poly&quot;&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;coords=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;53,72 80,76 81,81 91,85 89,107 76,106 69,137 42,130 51,77&quot;&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;href=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;http://www.xs4all.nl/~ziklies/e-start.html&quot;&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;&amp;lt;area&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;shape=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;default&quot;&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;href=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;http://www.xs4all.nl/~ziklies/start.html&quot;&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;/map&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Note that the values of the “coords” attributes are identical to the area definitions in the server-side image map. Below is a short video that shows how the restored image map works. Ringing the upper doorbell leads to the Dutch version of the site, whereas the lower doorbell opens the English version.&lt;/p&gt;

&lt;figure class=&quot;image&quot;&gt;
  &lt;video width=&quot;100%&quot; height=&quot;100%&quot; title=&quot;Image map demo&quot; controls=&quot;&quot;&gt;
    &lt;source src=&quot;/images/2020/06/imagemap.mp4&quot; type=&quot;video/mp4&quot; /&gt;
    Your browser does not support the video tag.
  &lt;/video&gt;
  &lt;figcaption&gt;Demonstration of image map.&lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;The site contains 4 more broken server-side image maps. I replaced all of these with client-side image maps in the restored version. For &lt;a href=&quot;https://ziklies.home.xs4all.nl/start.html&quot;&gt;one page&lt;/a&gt; the corresponding image map from the ZIP file contained some odd errors, so here I took the image map of the page’s &lt;a href=&quot;https://ziklies.home.xs4all.nl/e-start.html&quot;&gt;English-language counterpart&lt;/a&gt;, and then updated all links accordingly. After these changes the image map navigation is fully functional again.&lt;/p&gt;

&lt;h2 id=&quot;interactive-bedroom-mirror&quot;&gt;Interactive bedroom mirror&lt;/h2&gt;

&lt;p&gt;The &lt;a href=&quot;https://ziklies.home.xs4all.nl/slaapk/e-slaap1.html&quot;&gt;bedroom&lt;/a&gt; of Liesbet’s Virtual Home features an “interactive mirror”. It is a web form where the visitor can select combinations of clothing, hairstyle and earrings. After clicking on the “have a look into the mirror” button, the selected combination is shown as an image&lt;sup id=&quot;fnref:3&quot;&gt;&lt;a href=&quot;#fn:3&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;2&lt;/a&gt;&lt;/sup&gt;. However, as the underlying scripts are missing from the live site, it now gives a &lt;a href=&quot;https://en.wikipedia.org/wiki/HTTP_404&quot;&gt;Page Not Found&lt;/a&gt; error. As with the image maps before, the missing (Perl) scripts could be recovered from the ZIP file provided by Zikkenheimer. I added these to a (newly created) “cgi-bin” directory. I also had to make the scripts executable&lt;sup id=&quot;fnref:5&quot;&gt;&lt;a href=&quot;#fn:5&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;3&lt;/a&gt;&lt;/sup&gt;, and adjust their &lt;a href=&quot;https://en.wikipedia.org/wiki/Shebang_(Unix)&quot;&gt;Shebang&lt;/a&gt; strings to a valid interpreter location on my local machine (which, in my case, was different from the location used by the original web server).&lt;/p&gt;
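
&lt;p&gt;Both adjustments are easy to script. The following Python sketch shows one way to do it; the function name is hypothetical, and the default interpreter location /usr/bin/perl is an assumption that will differ between systems. Note that it replaces the whole first line, so any interpreter options on the original Shebang line would be lost:&lt;/p&gt;

```python
import os
import stat


def prepare_cgi_script(path, interpreter="/usr/bin/perl"):
    """Point the Shebang line at a local interpreter and set the executable bits."""
    # latin-1 and newline="" round-trip every byte and leave line endings alone
    with open(path, "r", encoding="latin-1", newline="") as f:
        lines = f.read().split("\n")
    if lines and lines[0].startswith("#!"):
        lines[0] = "#!" + interpreter
    with open(path, "w", encoding="latin-1", newline="") as f:
        f.write("\n".join(lines))
    # Equivalent to "chmod a+x" on the script
    mode = os.stat(path).st_mode
    os.chmod(path, mode | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH)
```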

&lt;p&gt;The main challenge was then to make the scripts play nicely with a web server. This is beyond the scope of this post, but I created an &lt;a href=&quot;https://github.com/KBNLresearch/xs4all-resources/blob/master/doc/liesbets-atelier-apache-notes.md&quot;&gt;Apache setup notes&lt;/a&gt; document that describes how I made this all work with a local instance of the Apache web server. Amazingly, the script, which is nearly 25 years old, still works perfectly with a modern version of Perl (here Perl 5). The following video gives a brief glimpse of the restored interactive mirror in action:&lt;/p&gt;

&lt;figure class=&quot;image&quot;&gt;
  &lt;video width=&quot;100%&quot; height=&quot;100%&quot; title=&quot;Interactive bedroom mirror demo&quot; controls=&quot;&quot;&gt;
    &lt;source src=&quot;/images/2020/06/mirror.mp4&quot; type=&quot;video/mp4&quot; /&gt;
    Your browser does not support the video tag.
  &lt;/video&gt;
  &lt;figcaption&gt;Demonstration of interactive bedroom mirror.&lt;/figcaption&gt;
&lt;/figure&gt;

&lt;h2 id=&quot;interactive-toilet-door&quot;&gt;Interactive toilet door&lt;/h2&gt;

&lt;p&gt;The &lt;a href=&quot;https://ziklies.home.xs4all.nl/e-toilet.html&quot;&gt;toilet page&lt;/a&gt; has an interactive feature where visitors can scratch a message onto a virtual toilet door. Zikkenheimer wrote to us that she initially updated the toilet door image by hand (presumably using submissions sent by e-mail), but that she later replaced this with a Python script that automatically updates the image. This script is no longer available from the live site, but it is included in the ZIP file. From a text string, it maps each &lt;a href=&quot;https://en.wikipedia.org/wiki/Glyph&quot;&gt;glyph&lt;/a&gt; to a corresponding GIF image, and then pastes these GIF images onto the pre-existing version of the toilet door image, using randomly selected image co-ordinates.&lt;/p&gt;

&lt;figure class=&quot;image&quot;&gt;
  &lt;img src=&quot;/images/2020/06/toilet-door.png&quot; alt=&quot;Toilet door&quot; /&gt;
  &lt;figcaption&gt;Interactive toilet door.&lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;Interestingly, the source of the current “live version” references another script that simply sends the entered message by e-mail&lt;sup id=&quot;fnref:6&quot;&gt;&lt;a href=&quot;#fn:6&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;4&lt;/a&gt;&lt;/sup&gt;. However, the Python script from the ZIP file shows that the toilet pages were originally hosted at a different server location, and &lt;a href=&quot;https://web.archive.org/web/19981205234146/http://prima12.xs4all.nl/liesbet/toilet.html&quot;&gt;an archived snapshot from the Internet Archive&lt;/a&gt; still exists. The source code of that snapshot also contains a link to the Python script.&lt;/p&gt;

&lt;h2 id=&quot;restoring-the-toilet-door&quot;&gt;Restoring the toilet door&lt;/h2&gt;

&lt;p&gt;I tried to make the interactive toilet door work again, but while doing this I ran into several problems. First of all, the underlying Python script is based on Python 1.4, which was one of the earliest Python releases, and its code is not compatible with modern Python interpreters. I tried to upgrade it to Python-3-compatible code. Overall this was pretty easy, but it uses &lt;a href=&quot;https://github.com/Solomoriah/gdmodule&quot;&gt;gdmodule&lt;/a&gt;, a Python module that is &lt;a href=&quot;https://github.com/Solomoriah/gdmodule/issues/3&quot;&gt;currently unsupported and not compatible&lt;/a&gt; with Python 3. I was able to make the script work with Python 2.7, but since Python 2 &lt;a href=&quot;https://www.python.org/doc/sunset-python-2/&quot;&gt;is no longer maintained&lt;/a&gt; this is not a sustainable solution.&lt;/p&gt;

&lt;p&gt;Also, the set of glyph images in the ZIP file appeared to be incomplete, only including lowercase characters (with an image for the “u” glyph strangely missing), and no uppercase characters or punctuation marks. If the user enters any of these missing characters, this results in an “Internal Server Error”. In addition, the script often places text outside of the toilet door image canvas. For both of these issues, it is unclear whether they are specific to the restored version, or simply reflect the state of the old “live” site. Because of these issues, I would not consider the restoration of this part of the site a success.&lt;/p&gt;
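
&lt;p&gt;For illustration, the glyph mapping step could be guarded against both problems along the following lines. This is a sketch under several assumptions and not the original script: the glyph file names (one GIF per character), the fixed glyph size, the single-line layout and the function name &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;plan_message&lt;/code&gt; are all hypothetical, and it only plans the positions without touching any image. Such guards would also change the behaviour of the work, which may or may not be desirable in a restoration:&lt;/p&gt;

```python
import random


def plan_message(message, available, canvas, glyph_size, rng=random):
    """Return (filename, x, y) for each character, starting at a random position."""
    # Guard 1: refuse characters that have no glyph image
    missing = sorted({c for c in message if c != " " and c not in available})
    if missing:
        raise ValueError("no glyph image for: " + "".join(missing))
    width, height = canvas
    gw, gh = glyph_size
    # Guard 2: pick the start so that the whole message stays on the canvas
    x0 = rng.randint(0, max(0, width - gw * len(message)))
    y0 = rng.randint(0, max(0, height - gh))
    return [(c + ".gif", x0 + i * gw, y0)
            for i, c in enumerate(message) if c != " "]
```

&lt;p&gt;A web front end could catch the ValueError and show a friendly message instead of an “Internal Server Error”.&lt;/p&gt;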

&lt;h2 id=&quot;unsupported-file-formats&quot;&gt;Unsupported file formats&lt;/h2&gt;

&lt;p&gt;The site also uses a number of file formats that are not supported by modern browsers. Some examples:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;The &lt;a href=&quot;https://ziklies.home.xs4all.nl/woonk/e-woon03.html&quot;&gt;living room&lt;/a&gt; features a clickable TV-set that is supposed to open a &lt;a href=&quot;http://fileformats.archiveteam.org/wiki/Quicktime&quot;&gt;Quicktime video&lt;/a&gt; in a pop-up window. In version 77.0.1 of the Firefox browser (the latest version at the time of writing), it triggers the following message:&lt;/p&gt;

    &lt;figure class=&quot;image&quot;&gt;
&lt;img src=&quot;/images/2020/06/quicktime-ff.png&quot; alt=&quot;No video with supported format and MIME type found&quot; /&gt;
&lt;figcaption&gt;Firefox error message on Quicktime video.&lt;/figcaption&gt;
&lt;/figure&gt;

    &lt;p&gt;In Chromium (83.0.4103.61), the pop-up window is empty, but it does download the file, so it can be played with external media player software.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;The clickable stereo on the same page links to a &lt;a href=&quot;http://fileformats.archiveteam.org/wiki/AU&quot;&gt;Sun Audio&lt;/a&gt; file. Depending on the browser used, clicking the link either activates a prompt to play the file in an external player (Firefox), or it is simply downloaded (Chromium)&lt;sup id=&quot;fnref:8&quot;&gt;&lt;a href=&quot;#fn:8&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;5&lt;/a&gt;&lt;/sup&gt;.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://ziklies.home.xs4all.nl/slaapk/e-slaapa.html&quot;&gt;This page&lt;/a&gt; features an embedded alarm clock in MIDI format, which is also not natively supported by modern web browsers. Chromium does download the file, whereas Firefox appears to ignore it altogether.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Last but not least, the bedroom page links to &lt;a href=&quot;https://ziklies.home.xs4all.nl/slaapk/gspot/index.html&quot;&gt;this&lt;/a&gt;, which in turn &lt;a href=&quot;https://ziklies.home.xs4all.nl/slaapk/gspot/gspot.dcr&quot;&gt;links&lt;/a&gt; to an embedded &lt;a href=&quot;https://en.wikipedia.org/wiki/Adobe_Shockwave&quot;&gt;Adobe Shockwave file&lt;/a&gt;. Adobe &lt;a href=&quot;https://helpx.adobe.com/shockwave/shockwave-end-of-life-faq.html&quot;&gt;stopped supporting this format in 2019&lt;/a&gt;. As a result, the format is no longer supported in web browsers. With no alternative rendering software available, the format is now functionally obsolete.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I haven’t addressed any of these issues in the current restoration attempt. A possible solution would be to use emulation or virtualization to view the site in a late-‘90s web browser&lt;sup id=&quot;fnref:9&quot;&gt;&lt;a href=&quot;#fn:9&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;6&lt;/a&gt;&lt;/sup&gt;. This may be worth further investigation.&lt;/p&gt;

&lt;h2 id=&quot;serving-the-site&quot;&gt;Serving the site&lt;/h2&gt;

&lt;p&gt;Throughout the restoration process I mostly used Python’s built-in &lt;a href=&quot;https://docs.python.org/3/library/http.server.html&quot;&gt;http.server&lt;/a&gt; to test any changes I made. This is a lightweight web server that doesn’t require any elaborate configuration, with no need to copy files to reserved locations on the file system. It does have some limitations that make it unsuitable for production use, so for serving the “completed” site I set up and configured an &lt;a href=&quot;https://httpd.apache.org/&quot;&gt;Apache&lt;/a&gt; web server instance. This allowed me to have the restored version of Liesbet’s Virtual Home running on my local machine, accessible from its original URL:&lt;/p&gt;

&lt;figure class=&quot;image&quot;&gt;
  &lt;img src=&quot;/images/2020/06/atelier-local-apache.png&quot; alt=&quot;Liesbet&apos;s Virtual Home screenshot&quot; /&gt;
  &lt;figcaption&gt;Screenshot of Liesbet&apos;s Virtual Home, served with local Apache instance.&lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;The installation and configuration process I followed is described in detail in a separate &lt;a href=&quot;https://github.com/KBNLresearch/xs4all-resources/blob/master/doc/liesbets-atelier-apache-notes.md&quot;&gt;Apache setup notes&lt;/a&gt; document.&lt;/p&gt;
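
&lt;p&gt;For the quick tests mentioned at the start of this section, the built-in server can also be started from a few lines of Python, which makes it possible to serve a directory other than the current one. This is a minimal sketch; the function name &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;make_server&lt;/code&gt; and the default port are arbitrary choices for this example, and the server only handles static files:&lt;/p&gt;

```python
import functools
import http.server


def make_server(directory, port=8000):
    """Create a local HTTP server that serves static files from directory."""
    handler = functools.partial(
        http.server.SimpleHTTPRequestHandler, directory=directory)
    # Bind to the loopback interface only, since this is meant for testing
    return http.server.ThreadingHTTPServer(("127.0.0.1", port), handler)
```

&lt;p&gt;Calling &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;make_server(&quot;./wget-improved/ziklies.home.xs4all.nl&quot;).serve_forever()&lt;/code&gt; then serves the crawl at http://127.0.0.1:8000/ until the process is interrupted.&lt;/p&gt;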

&lt;h2 id=&quot;warc-capture&quot;&gt;WARC capture&lt;/h2&gt;

&lt;p&gt;Like most web archives, the KB uses the &lt;a href=&quot;https://en.wikipedia.org/wiki/Web_ARChive&quot;&gt;WARC&lt;/a&gt; format for storing archived web sites. Since capturing offline web content is a subject I’d been &lt;a href=&quot;/2018/07/11/crawling-offline-web-content-the-nl-menu-case&quot;&gt;working on earlier as part of the NL-menu rescue operation&lt;/a&gt;, I started with &lt;a href=&quot;https://github.com/KBNLresearch/xs4all-resources/blob/master/scripts/scrape-local-site.sh&quot;&gt;this wget-based script&lt;/a&gt;, which is a modified version of the script I used for NL-menu. However, the wget crawl didn’t adequately capture the form (and script) behind the &lt;a href=&quot;https://ziklies.home.xs4all.nl/slaapk/e-slaap1.html&quot;&gt;interactive bedroom mirror&lt;/a&gt;. Some tests with the &lt;a href=&quot;https://github.com/webrecorder/webrecorder-desktop&quot;&gt;Webrecorder Desktop App&lt;/a&gt; showed that Webrecorder was able to capture individual input combinations, but this required manual input for each combination. With 512 possible combinations, this was not a viable solution, so I needed some way to automate this.&lt;/p&gt;

&lt;h2 id=&quot;warcio-to-the-rescue&quot;&gt;Warcio to the rescue&lt;/h2&gt;

&lt;p&gt;Happily, several people responded to &lt;a href=&quot;https://twitter.com/bitsgalore/status/1275405890947108866&quot;&gt;my request for help on Twitter&lt;/a&gt;. Webrecorder author Ilya Kreymer suggested that I have a look at the &lt;a href=&quot;https://github.com/webrecorder/warcio&quot;&gt;warcio library&lt;/a&gt; (of which he is also the lead developer), and, even better, he &lt;a href=&quot;https://twitter.com/IlyaKreymer/status/1275440674687471617&quot;&gt;provided some example code&lt;/a&gt; that showed how to do this. For a brief explanation of how this works, have a look at the input form below (for brevity I edited out some of the input choices):&lt;/p&gt;

&lt;div class=&quot;language-html highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nt&quot;&gt;&amp;lt;FORM&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;METHOD=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;POST&quot;&lt;/span&gt;
&lt;span class=&quot;na&quot;&gt;ACTION=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;/cgi-bin/barbie1.cgi&quot;&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;&amp;gt;&lt;/span&gt;

&lt;span class=&quot;nt&quot;&gt;&amp;lt;DL&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;DD&amp;gt;&amp;lt;b&amp;gt;&lt;/span&gt; First: wich pants or skirt? &lt;span class=&quot;nt&quot;&gt;&amp;lt;/b&amp;gt;&amp;lt;br&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;INPUT&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;TYPE=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;radio&quot;&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;NAME=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;onder&quot;&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;VALUE=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;1a&quot;&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;&amp;gt;&lt;/span&gt; A1
&lt;span class=&quot;nt&quot;&gt;&amp;lt;INPUT&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;TYPE=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;radio&quot;&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;NAME=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;onder&quot;&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;VALUE=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;2a&quot;&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;&amp;gt;&lt;/span&gt; A2
 ::
 ::
&lt;span class=&quot;nt&quot;&gt;&amp;lt;INPUT&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;TYPE=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;radio&quot;&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;NAME=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;onder&quot;&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;VALUE=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;7a&quot;&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;&amp;gt;&lt;/span&gt; A7&lt;span class=&quot;nt&quot;&gt;&amp;lt;br&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;/DL&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;br&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;DL&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;DD&amp;gt;&amp;lt;b&amp;gt;&lt;/span&gt;Second: wich top, sweater or blouse fits? &lt;span class=&quot;nt&quot;&gt;&amp;lt;/b&amp;gt;&amp;lt;br&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;INPUT&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;TYPE=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;radio&quot;&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;NAME=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;midden&quot;&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;VALUE=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;1b&quot;&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;&amp;gt;&lt;/span&gt; B1
&lt;span class=&quot;nt&quot;&gt;&amp;lt;INPUT&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;TYPE=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;radio&quot;&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;NAME=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;midden&quot;&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;VALUE=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;2b&quot;&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;&amp;gt;&lt;/span&gt; B2
 ::
 ::
&lt;span class=&quot;nt&quot;&gt;&amp;lt;INPUT&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;TYPE=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;radio&quot;&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;NAME=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;midden&quot;&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;VALUE=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;7b&quot;&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;&amp;gt;&lt;/span&gt; B7&lt;span class=&quot;nt&quot;&gt;&amp;lt;br&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;/DL&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;br&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;DL&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;DD&amp;gt;&amp;lt;b&amp;gt;&lt;/span&gt;Last: wich earrings and what to do with her hair? &lt;span class=&quot;nt&quot;&gt;&amp;lt;/b&amp;gt;&amp;lt;br&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;INPUT&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;TYPE=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;radio&quot;&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;NAME=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;top&quot;&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;VALUE=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;1c&quot;&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;&amp;gt;&lt;/span&gt; C1
&lt;span class=&quot;nt&quot;&gt;&amp;lt;INPUT&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;TYPE=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;radio&quot;&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;NAME=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;top&quot;&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;VALUE=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;2c&quot;&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;&amp;gt;&lt;/span&gt; C2
 ::
 ::
&lt;span class=&quot;nt&quot;&gt;&amp;lt;INPUT&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;TYPE=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;radio&quot;&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;NAME=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;top&quot;&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;VALUE=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;7c&quot;&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;&amp;gt;&lt;/span&gt; C7&lt;span class=&quot;nt&quot;&gt;&amp;lt;br&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;/DL&amp;gt;&amp;lt;br&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;center&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;INPUT&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;TYPE=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;submit&quot;&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;VALUE=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;have a look into the mirror&quot;&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;br&amp;gt;&amp;lt;/center&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;/FORM&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Some key points:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;The form sets three variables: “onder”, “midden” and “top”. Each of these can have 7 pre-defined values (with ranges [1a, 7a], [1b, 7b] and [1c, 7c], respectively).&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;The data that are entered in the form are sent to the server using a &lt;a href=&quot;https://en.wikipedia.org/wiki/POST_(HTTP)&quot;&gt;POST&lt;/a&gt; request.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;
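
&lt;p&gt;The set of possible requests follows directly from these two points. The Python sketch below enumerates the form data for every combination and shows how one of them is encoded as the body of a POST request. The names &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;FIELDS&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;form_combinations&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;allow_unset&lt;/code&gt; are made up for this example. With a value selected in each of the three groups there are 7 × 7 × 7 = 343 combinations. If a group may also be left without a selection, this becomes 8 × 8 × 8 = 512; treating an unselected group as a separate case is an assumption of this sketch:&lt;/p&gt;

```python
import itertools
from urllib.parse import urlencode

# Values taken from the form above: 1a..7a, 1b..7b and 1c..7c
FIELDS = {
    "onder": [str(i) + "a" for i in range(1, 8)],
    "midden": [str(i) + "b" for i in range(1, 8)],
    "top": [str(i) + "c" for i in range(1, 8)],
}


def form_combinations(fields, allow_unset=False):
    """Yield one dict of form data per combination of radio button choices."""
    names = list(fields)
    options = [fields[n] + ([None] if allow_unset else []) for n in names]
    for choice in itertools.product(*options):
        # A radio group without a selection is left out of the request
        yield {n: v for n, v in zip(names, choice) if v is not None}


first = next(form_combinations(FIELDS))
# The urlencoded string is what a browser would send as the POST body
print(urlencode(first))
```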

&lt;p&gt;With warcio, an individual combination of input values can be captured like this:&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;kn&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;warcio.capture_http&lt;/span&gt; &lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;capture_http&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;requests&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;url&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;http://ziklies.home.xs4all.nl&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;warcFile&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;ziklies.home.xs4all.nl.gz&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;

&lt;span class=&quot;k&quot;&gt;with&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;capture_http&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;warcFile&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;requests&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;post&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;url&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;onder&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;1a&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;midden&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;1b&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;top&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;1c&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Note how the &lt;em&gt;data&lt;/em&gt; parameter holds a dictionary with the three variables and their associated values. To capture the form’s full behavior, we can simply iterate over all input combinations, and then capture each of them. Having confirmed that this worked, I re-wrote my existing wget-based Bash script into a Python script that only uses warcio. The script is &lt;a href=&quot;https://github.com/KBNLresearch/xs4all-resources/blob/master/scripts/scrape-ziklies-local.py&quot;&gt;available here&lt;/a&gt;. This approach doesn’t work for the interactive toilet door, since, unlike the bedroom mirror, it processes free text, which means that the number of possible inputs is infinite.&lt;/p&gt;
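
&lt;p&gt;As a minimal sketch of that iteration (not the linked script itself), the 7 x 7 x 7 = 343 combinations can be generated with itertools.product and posted one by one inside a single capture_http context. The function names and the output file name are made up for this illustration, and the URL is simply the one from the single-capture example; the actual script may organise things differently.&lt;/p&gt;

```python
from itertools import product

URL = 'http://ziklies.home.xs4all.nl'   # taken from the single-capture example
WARC_FILE = 'mirror-combinations.warc.gz'  # hypothetical output file name


def form_combinations():
    """Return all 343 combinations of the three form variables as dictionaries."""
    onder = [f'{i}a' for i in range(1, 8)]    # 1a .. 7a
    midden = [f'{i}b' for i in range(1, 8)]   # 1b .. 7b
    top = [f'{i}c' for i in range(1, 8)]      # 1c .. 7c
    return [{'onder': o, 'midden': m, 'top': t}
            for o, m, t in product(onder, midden, top)]


def capture_all(url=URL, warc_file=WARC_FILE):
    """POST every combination to url, recording all traffic in one WARC file."""
    # Imported here so the combinations can be generated without warcio
    # installed; warcio asks for capture_http to be imported before requests.
    from warcio.capture_http import capture_http
    import requests

    with capture_http(warc_file):
        for data in form_combinations():
            requests.post(url, data=data)
```

&lt;p&gt;Calling capture_all() then sends one POST request per combination and records each request and response in the WARC file.&lt;/p&gt;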

&lt;h2 id=&quot;rendering-the-warc-with-pywb&quot;&gt;Rendering the WARC with Pywb&lt;/h2&gt;

&lt;p&gt;I finally verified the WARC capture by importing it into &lt;a href=&quot;https://github.com/webrecorder/pywb&quot;&gt;Pywb&lt;/a&gt;. More details on this can be found in my separate &lt;a href=&quot;https://github.com/KBNLresearch/xs4all-resources/blob/master/doc/liesbets-atelier-warc-notes.md&quot;&gt;WARC capture and rendering notes&lt;/a&gt;. Rendering the WARC did not result in any problems. To illustrate this, the screenshot below shows one output combination of the interactive bedroom mirror:&lt;/p&gt;

&lt;figure class=&quot;image&quot;&gt;
  &lt;img src=&quot;/images/2020/06/mirror-pywb.png&quot; alt=&quot;Mirror script screenshot&quot; /&gt;
  &lt;figcaption&gt;Output of interactive bedroom mirror script, rendered from WARC capture.&lt;/figcaption&gt;
&lt;/figure&gt;

&lt;h2 id=&quot;conclusions&quot;&gt;Conclusions&lt;/h2&gt;

&lt;p&gt;Although none of the procedures described in this blog post are particularly complex, the restoration of Liesbet’s Virtual Home involved a lot of trial and error, which ultimately made it a fairly laborious process. Nevertheless, this effort is justified given the value and historical significance of this homepage. It also involved some decisions that, from an authenticity point of view, are open to criticism. An example is the replacement of the missing server-side image maps by the more modern client-side variety. In any case, the experience gained from this project will also be useful for an upcoming attempt to restore a collection of old corporate websites from source data &lt;a href=&quot;/2019/09/09/recovering-90s-data-tapes-experiences-kb-web-archaeology&quot;&gt;that we recovered from data tapes last year&lt;/a&gt;.&lt;/p&gt;

&lt;h2 id=&quot;acknowledgements&quot;&gt;Acknowledgements&lt;/h2&gt;

&lt;p&gt;I would like to thank Liesbet Zikkenheimer for making the offline data of Liesbet’s Virtual Home available to us for this project. Jak Boumans is thanked for alerting us to this unique homepage, and for establishing the contact with its creator. Ilya Kreymer is thanked for his suggestion on warcio, and more generally for creating the Webrecorder software suite, without which much of this work would simply be impossible. Thanks are also due to Kees Teszelszky. Finally, this work was financially supported by the &lt;a href=&quot;https://www.sidn.nl/en&quot;&gt;SIDN Fund&lt;/a&gt; and &lt;a href=&quot;https://www.stichtinginternet4all.nl/&quot;&gt;Stichting Internet4all&lt;/a&gt;.&lt;/p&gt;

&lt;h2 id=&quot;additional-resources&quot;&gt;Additional resources&lt;/h2&gt;

&lt;h3 id=&quot;detailed-working-notes&quot;&gt;Detailed working notes&lt;/h3&gt;

&lt;p&gt;The following notes provide more details on the restoration steps, the Apache server setup and the WARC capture process, respectively:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://github.com/KBNLresearch/xs4all-resources/blob/master/doc/liesbets-atelier-restoration-notes.md&quot;&gt;Liesbet’s atelier restoration notes&lt;/a&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://github.com/KBNLresearch/xs4all-resources/blob/master/doc/liesbets-atelier-apache-notes.md&quot;&gt;Liesbet’s Atelier Apache setup notes&lt;/a&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://github.com/KBNLresearch/xs4all-resources/blob/master/doc/liesbets-atelier-warc-notes.md&quot;&gt;Liesbet’s atelier WARC capture and rendering notes&lt;/a&gt;&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;processing-scripts&quot;&gt;Processing scripts&lt;/h3&gt;

&lt;p&gt;All processing scripts used as part of this work are &lt;a href=&quot;https://github.com/KBNLresearch/xs4all-resources/tree/master/scripts&quot;&gt;available here&lt;/a&gt;.&lt;/p&gt;

&lt;h3 id=&quot;more-information-about-the-xs4all-homepages-rescue-initiative&quot;&gt;More information about the XS4ALL homepages rescue initiative&lt;/h3&gt;

&lt;p&gt;The posts below (both in Dutch) give some additional background information about the XS4ALL homepages rescue initiative:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://www.kb.nl/blogs/duurzame-toegang/bewaren-voor-iedereen-de-opbouw-van-de-webcollectie-xs4all-homepages&quot;&gt;Bewaren voor iedereen: de opbouw van de webcollectie XS4ALL-homepages&lt;/a&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://www.sidnfonds.nl/nieuws/redden-wat-van-waarde-is-webarchivering-homepages-xs4all&quot;&gt;Redden wat van waarde is: webarchivering homepages XS4ALL&lt;/a&gt;&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;revision-history&quot;&gt;Revision history&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;8 July 2020: added sections on toilet door restoration attempt&lt;/li&gt;
&lt;/ul&gt;

&lt;div class=&quot;footnotes&quot; role=&quot;doc-endnotes&quot;&gt;
  &lt;ol&gt;
    &lt;li id=&quot;fn:2&quot;&gt;
      &lt;p&gt;Some purists may consider this a technological anachronism. Client-side image maps were first introduced in HTML 3.2, which was published in 1997, whereas Liesbet’s &lt;a href=&quot;https://ziklies.home.xs4all.nl/new.html&quot;&gt;“what’s new” page&lt;/a&gt; shows that most of the site’s development activity took place between early 1995 and late 1997. This could be a problem for users who want to view the site in a period browser (e.g. inside an emulated environment), which may not support client-side image maps. &lt;a href=&quot;#fnref:2&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:3&quot;&gt;
      &lt;p&gt;Actually as a composite of 3 images. &lt;a href=&quot;#fnref:3&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:5&quot;&gt;
      &lt;p&gt;Under Linux this is simply a matter of issuing a command like &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;chmod 755 barbie.cgi&lt;/code&gt;. &lt;a href=&quot;#fnref:5&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:6&quot;&gt;
      &lt;p&gt;Documented here: &lt;a href=&quot;https://www.xs4all.nl/service/installeren/hosting/mail-a-form-toevoegen/&quot;&gt;https://www.xs4all.nl/service/installeren/hosting/mail-a-form-toevoegen/&lt;/a&gt;. &lt;a href=&quot;#fnref:6&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:8&quot;&gt;
      &lt;p&gt;I’m not actually sure if this behavior was any different on late-‘90s browsers. &lt;a href=&quot;#fnref:8&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:9&quot;&gt;
      &lt;p&gt;I did try to open the page using &lt;a href=&quot;http://oldweb.today/&quot;&gt;oldweb.today&lt;/a&gt;, but none of the browser environments I tried had the necessary Shockwave plugin installed. &lt;a href=&quot;#fnref:9&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
  &lt;/ol&gt;
&lt;/div&gt;
</description>
                <link>https://bitsgalore.org/2020/06/30/restoring-liesbets-virtual-home</link>
                <guid>https://bitsgalore.org/2020/06/30/restoring-liesbets-virtual-home</guid>
                <pubDate>2020-06-30T00:00:00+02:00</pubDate>
        </item>

        <item>
                <title>ISO/IEC TS 22424 standard on EPUB3 preservation</title>
                <description>&lt;figure class=&quot;image&quot;&gt;
  &lt;img src=&quot;/images/2020/04/scream.jpg&quot; alt=&quot;Scream&quot; /&gt;
  &lt;figcaption&gt;&lt;a href=&quot;https://commons.wikimedia.org/wiki/File:%27The_Scream%27,_undated_drawing_Edvard_Munch,_Bergen_Kunstmuseum.JPG&quot;&gt;“The Scream”&lt;/a&gt;, undated drawing by Edvard Munch, Bergen Kunstmuseum, Public domain.&lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;Earlier this week Library of Congress added &lt;a href=&quot;https://www.loc.gov/preservation/digital/formats/fdd/fdd000519.shtml&quot;&gt;a new entry on the standard “Digital publishing — EPUB3 preservation” (ISO/IEC TS 22424)&lt;/a&gt; to its excellent Digital Formats web site. This standard was developed by the &lt;a href=&quot;https://www.iso.org/committee/45374.html&quot;&gt;ISO Technical Committee on Document description and processing languages&lt;/a&gt;, and was published in January this year (2020).&lt;/p&gt;

&lt;p&gt;According to its authors, “the ISO/IEC TS 22424 series supports long-term preservation of EPUB publications via a dual strategy”. The standard is made up of 2 parts, which are sold as separate documents on the ISO website:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://www.iso.org/standard/73163.html&quot;&gt;Part 1: Principles (ISO/IEC TS 22424-1:2020)&lt;/a&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://www.iso.org/standard/73169.html&quot;&gt;Part 2: Metadata requirements (ISO/IEC TS 22424-2:2020)&lt;/a&gt;&lt;/p&gt;
  &lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;In this blog post I will take a closer look at both parts of the standard. What do they purport, what is their scope, and to what degree do they live up to their stated promises? Readers who are only interested in the most important findings may want to jump to the “Summary and discussion” section at the end of this post.&lt;/p&gt;

&lt;!-- more --&gt;

&lt;p&gt;As I didn’t have access to the final as-published versions of the documents, this review is based on a combination of the (non-paywalled) preview documents on the ISO website, and a pre-publication “Final text” from September 2019. The latter may be marginally different from the published standard, but I don’t expect these differences will affect the general conclusions of this review.&lt;/p&gt;

&lt;h2 id=&quot;part-1---principles&quot;&gt;Part 1 - Principles&lt;/h2&gt;

&lt;p&gt;Even though the full standard documents are behind a paywall, a &lt;a href=&quot;https://www.iso.org/obp/ui/#iso:std:iso-iec:ts:22424:-1:ed-1:v1:en&quot;&gt;free preview of Part 1 is available here&lt;/a&gt;. It contains the introductory chapter, as well as the full table of contents.&lt;/p&gt;

&lt;h3 id=&quot;purpose-and-scope&quot;&gt;Purpose and scope&lt;/h3&gt;

&lt;p&gt;The introductory chapter clearly defines the purpose of this part of the standard:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;This document facilitates the long-term preservation of EPUB publications by specifying in general level EPUB features which are mandatory for long-term preservation (such as font embedding) and features which should be avoided if possible.&lt;/p&gt;

  &lt;p&gt;This document can be seen as a stepping stone towards a detailed specification which would be related to EPUB in the same way as PDF/A, specified in ISO 19005-1 to ISO 19005-3, is related to the Portable Document Format (PDF).&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This suggests that the purpose of Part 1 of the standard is to make some first steps towards a “preservation-friendly” EPUB profile, analogous to the &lt;a href=&quot;https://en.wikipedia.org/wiki/PDF/A&quot;&gt;PDF/A&lt;/a&gt; profiles for the PDF format.&lt;/p&gt;

&lt;p&gt;According to the authors, long-term preservation requires:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;making the object such as [sic] EPUB publication fit for preservation, including features to be used and features to avoid;&lt;/li&gt;
  &lt;li&gt;packaging the object (and any metadata related to it) together with any additional data such as other versions of the object and other documentation into an Open Archival Information System (OAIS) submission information package (SIP).&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;They then write that “packaging is covered in ISO/IEC TS 22424-2” (Part 2 of the standard).&lt;/p&gt;

&lt;h3 id=&quot;part-1-is-actually-about-packaging-well-mostly&quot;&gt;Part 1 is actually about packaging (well, mostly)&lt;/h3&gt;

&lt;p&gt;Imagine my surprise then when I found the two main chapters of Part 1 to be titled “Packaging standards” (with a discussion of &lt;a href=&quot;https://wiki.dpconline.org/index.php?title=2.2.3_INFORMATION_PACKAGE_VARIANTS&quot;&gt;OAIS Information Packages&lt;/a&gt;), and “Construction of OAIS information packages”. The latter chapter contains an exhaustive list of requirements and recommendations on how EPUB documents should be packaged in an OAIS Submission Information Package (SIP). So, on the surface it looks like the content that is supposed to be in Part 2 of the standard (packaging) somehow ended up in Part 1 instead!&lt;/p&gt;

&lt;p&gt;However, a more in-depth look reveals that the generic packaging requirements (which are unrelated to EPUB) are mixed with requirements that actually &lt;em&gt;are&lt;/em&gt; specific to EPUB. For example, section 6.2.1 (“EPUB publications SHALL be sent to a repository system as well‐formed and complete Submission Information Packages”) lists requirements on validity against the EPUB specification, font embedding, multimedia content, remote resources and encryption. Confusingly, these are interspersed with other requirements on descriptive metadata and submission agreements (which are not EPUB-specific). Moreover, most of the packaging-related requirements and recommendations mentioned here are totally unrelated to EPUB, and would apply generically to any given format. This makes me wonder why these are part of this standard at all.&lt;/p&gt;

&lt;h3 id=&quot;some-requirements-already-covered-by-epub-specification&quot;&gt;Some requirements already covered by EPUB specification&lt;/h3&gt;

&lt;p&gt;In addition, some of the EPUB-specific requirements are already covered by the EPUB format specification. For example, section 6.4 (“Structure of information packages”) lists requirements on the presence of a manifest file inside an EPUB, the use of the EPUB Open Container Format, the presence of a “container.xml” file, and metadata inside the EPUB package and navigation documents. Any EPUB file that is valid against the EPUB format specification (which is required as per section 6.2.1) already satisfies these requirements, so their inclusion here is unnecessary.&lt;/p&gt;
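
&lt;p&gt;To give an idea of how basic these structural requirements are, here is a minimal Python sketch that checks two container-level properties with the standard zipfile module: whether the first entry is a “mimetype” file with the value “application/epub+zip”, and whether “META-INF/container.xml” is present. The function name is hypothetical, the checks reflect my reading of the Open Container Format, and this is in no way a substitute for validating a file against the EPUB specification.&lt;/p&gt;

```python
import io
import zipfile


def basic_ocf_check(epub):
    """Run a few basic container checks on an EPUB path or file-like object."""
    with zipfile.ZipFile(epub) as zf:
        names = zf.namelist()
        # OCF expects the 'mimetype' file to be the first entry in the archive
        mimetype_first = bool(names) and names[0] == 'mimetype'
        mimetype_value_ok = (mimetype_first and
                             zf.read('mimetype') == b'application/epub+zip')
        return {
            'mimetype_first': mimetype_first,
            'mimetype_value_ok': mimetype_value_ok,
            'has_container_xml': 'META-INF/container.xml' in names,
        }


# Demo on an in-memory archive; the container.xml content is a placeholder,
# not real container XML.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, 'w') as zf:
    zf.writestr('mimetype', 'application/epub+zip')
    zf.writestr('META-INF/container.xml', 'placeholder')
buffer.seek(0)
print(basic_ocf_check(buffer))
```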

&lt;h2 id=&quot;part-2---metadata-requirements&quot;&gt;Part 2 - Metadata requirements&lt;/h2&gt;

&lt;p&gt;For Part 2, there is again a &lt;a href=&quot;https://www.iso.org/obp/ui/#iso:std:73169:en&quot;&gt;free preview&lt;/a&gt; with an introductory chapter and the full table of contents.&lt;/p&gt;

&lt;h3 id=&quot;purpose-and-scope-1&quot;&gt;Purpose and scope&lt;/h3&gt;

&lt;p&gt;The introduction defines the purpose of this part of the standard as follows:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;This document facilitates the long-term preservation of EPUB publications by specifying metadata elements which are required or recommended for long-term preservation (such as identifiers) and the ways in which the EPUB publication and related metadata can be packaged. EPUB versions 3 and 3.0.1 are covered; if necessary, the EPUB version applicable is specified.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;So, this suggests this is all about metadata and packaging. In the “Scope” section the authors also say this:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;This document makes EPUB compliant with current practices of Open Archival Information Systems (OAIS) archives and technical requirements of repository systems. The former tend to rely on OAIS in their operations; the latter prefer to ingest electronic documents only in containers conforming to standards such as METS (Metadata Encoding and Transmission Standard).&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This statement is remarkable, since OAIS does not include any requirements about file formats, so the concept of “OAIS-compliant” file formats is meaningless!&lt;/p&gt;

&lt;h3 id=&quot;sip-level-metadata-vs-epub-level-metadata&quot;&gt;SIP-level metadata vs EPUB-level metadata&lt;/h3&gt;

&lt;p&gt;The main body of Part 2 contains requirements related to various types of metadata. For example, Chapter 6 covers “Packaging metadata”, which the authors describe as follows:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;This chapter covers mainly metadata about the SIP (container) which is usually submitted using METS elements and attributes.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;However, it turns out that some sections within this chapter are actually about metadata &lt;em&gt;within&lt;/em&gt; EPUB files. Examples are section 6.6 (“Core media type resource identifiers”), which defines requirements on identifiers used in the EPUB manifest (which is part of the EPUB file), and section 6.7 (“Foreign resource identifiers”).&lt;/p&gt;

&lt;p&gt;Like Part 1, some of the EPUB-specific requirements that are listed here are already covered by the EPUB specification. For example, section 7.2 (“Technical metadata”) states:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;EPUB version used SHALL be specified in the package element of the EPUB publication’s content.opf file.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Since ‘version’ is a mandatory attribute of an EPUB 3 Package Document, the inclusion of this requirement is unnecessary. Similarly, section 7.4 (“Structural metadata”) states:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;EPUB Open Container Format (OCF) SHALL be used to describe the structure of [sic] EPUB publication.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Again, this is true for &lt;em&gt;any&lt;/em&gt; valid EPUB file, so why is this included in this standard?&lt;/p&gt;

&lt;h3 id=&quot;oais-packages-again&quot;&gt;OAIS packages (again!)&lt;/h3&gt;

&lt;p&gt;The final two chapters of Part 2 are about the structure and content of OAIS Submission Information Packages. But OAIS information packages are already discussed at length in Part 1 (even though that discussion probably shouldn’t be there to begin with!), so why is this subject revisited here?&lt;/p&gt;

&lt;h2 id=&quot;summary-and-discussion&quot;&gt;Summary and discussion&lt;/h2&gt;

&lt;p&gt;Since this blog post turned out a lot lengthier than I originally intended, this section summarizes and then discusses the most important findings. Overall, I was really surprised to find so many serious issues in a standard that has been authored over a 3-year period by an ISO-coordinated committee, especially given that most of these issues are obvious from even a cursory inspection of the standards documents!&lt;/p&gt;

&lt;h3 id=&quot;lack-of-structure&quot;&gt;Lack of structure&lt;/h3&gt;

&lt;p&gt;Based on the abstracts on the ISO web site and the introductory sections in the standards documents I expected the following:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Part 1: requirements and recommendations that are a first step in the direction of a preservation-friendly EPUB profile, analogous to PDF/A for the PDF format.&lt;/li&gt;
  &lt;li&gt;Part 2: requirements and recommendations on the packaging of publications in EPUB format into OAIS Submission Information Packages (SIPs), along with accompanying metadata.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Instead, Part 1 of the standard is mostly about packaging, but elements of a “preservation-friendly” EPUB profile are buried within sections with package requirements and recommendations. On the other hand, Part 2 (which is supposed to be about packaging and metadata) also contains additional EPUB-specific requirements that one would expect to be part of the EPUB profile (that is, Part 1). Moreover, some of the EPUB-specific requirements are already covered by the EPUB format specification, which makes their inclusion in ISO/IEC TS 22424 unnecessary.&lt;/p&gt;

&lt;p&gt;The overall effect of this is that elements of the “dual strategy” that underlies ISO/IEC TS 22424 are now scattered across both parts of the standard in a seemingly arbitrary manner. This makes using and navigating the standard unnecessarily difficult. Since both parts of the standard are only available upon payment of CHF 118 and CHF 138, respectively, it also means that buyers of any of these documents will receive a product that is substantially different from its description on the ISO web site!&lt;/p&gt;

&lt;h3 id=&quot;path-towards-a-preservation-friendly-epub-profile&quot;&gt;Path towards a preservation-friendly EPUB profile&lt;/h3&gt;

&lt;p&gt;The introduction of Part 1 of the standard mentions that the EPUB-specific requirements here are only a “stepping stone” towards a more detailed specification (analogous to PDF/A), that could be developed at a later stage by the EPUB community. Even though the authors should be commended for their modesty in this regard, it does make me wonder if such an early-stages effort merits an ISO-published standard. I’m also not convinced of the effectiveness of a “stepping stone” document that is hidden behind a paywall. I do think a preservation-friendly EPUB profile would be useful, and I also think that various people from the digital libraries and archives sector would be willing to contribute to it. But only if this is done in an open way, and not by some arcane ISO-guided process that first keeps the progress hidden from its potential users for 3 years, and then buries the results behind a paywall (which appears to have been the case for the current standard).&lt;/p&gt;

&lt;h3 id=&quot;should-packaging-even-be-part-of-this-standard&quot;&gt;Should packaging even be part of this standard?&lt;/h3&gt;

&lt;p&gt;Although I’ve only given the packaging and package-level metadata requirements and recommendations a cursory look, on the surface many of them are generally applicable to any digital content, and are not specific to publications in EPUB format. At the same time, package-level requirements are typically strongly tied to institutional policies. For instance, Part 2 of the standard requires that the SIP-level METS and PREMIS metadata describe all resources that are inside an EPUB container, which seems a bit excessive to me. This all makes me wonder if OAIS-packaging should be part of this standard at all.&lt;/p&gt;

&lt;h3 id=&quot;no-epub-32&quot;&gt;No EPUB 3.2!&lt;/h3&gt;

&lt;p&gt;It is also worth noting that the current standard only covers EPUB 3 and EPUB 3.0.1 (which was published in 2014). It does not cover the current EPUB 3.2, but it does mention that this version will be covered in the next edition of the standard&lt;sup id=&quot;fnref:1&quot;&gt;&lt;a href=&quot;#fn:1&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;1&lt;/a&gt;&lt;/sup&gt;. This is surprising, given that EPUB 3.2 was already published in May 2019. It also means that the standard was already outdated at its date of publication.&lt;/p&gt;

&lt;h2 id=&quot;additional-resources&quot;&gt;Additional resources&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://www.iso.org/standard/73163.html&quot;&gt;Digital publishing — EPUB3 preservation — Part 1: Principles&lt;/a&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://www.iso.org/standard/73169.html&quot;&gt;Digital publishing — EPUB3 preservation — Part 2: Metadata requirements&lt;/a&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://www.loc.gov/preservation/digital/formats/fdd/fdd000519.shtml&quot;&gt;EPUB (Electronic publication) Version 3 Preservation. ISO/IEC TS 22424:2020 (Library of Congress Sustainability of Digital Formats)&lt;/a&gt;&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;div class=&quot;footnotes&quot; role=&quot;doc-endnotes&quot;&gt;
  &lt;ol&gt;
    &lt;li id=&quot;fn:1&quot;&gt;
      &lt;p&gt;EPUB 3.1 is not covered either, but this is understandable, since this version was never supported by any reader or validation software. &lt;a href=&quot;#fnref:1&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
  &lt;/ol&gt;
&lt;/div&gt;
</description>
                <link>https://bitsgalore.org/2020/04/30/iso-iec-ts-22424-standard-on-epub3-preservation</link>
                <guid>https://bitsgalore.org/2020/04/30/iso-iec-ts-22424-standard-on-epub3-preservation</guid>
                <pubDate>2020-04-30T00:00:00+02:00</pubDate>
        </item>

        <item>
                <title>Does Microsoft OneDrive export large ZIP files that are corrupt?</title>
                <description>&lt;figure class=&quot;image&quot;&gt;
  &lt;img src=&quot;/images/2020/03/broken-zip.jpg&quot; alt=&quot;Broken zip £12.50&quot; /&gt;
  &lt;figcaption&gt;&lt;a href=&quot;https://www.flickr.com/photos/dichohecho/3810699621/&quot;&gt;“Broken zip £12.50”&lt;/a&gt; by &lt;a href=&quot;https://www.flickr.com/photos/dichohecho/&quot;&gt;dichoecho&lt;/a&gt;, used under &lt;a href=&quot;https://creativecommons.org/licenses/by/2.0/&quot;&gt;CC BY&lt;/a&gt; / Cropped from original.&lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;We recently started using &lt;a href=&quot;https://en.wikipedia.org/wiki/OneDrive&quot;&gt;Microsoft OneDrive&lt;/a&gt; at work. The other day a colleague used OneDrive to share a folder with a large number of ISO images with me. Since I wanted to work with these files on my Linux machine at home, and no official OneDrive client for Linux exists at this point, I used OneDrive’s web client to download the contents of the folder. Doing so resulted in a 6 GB &lt;a href=&quot;https://en.wikipedia.org/wiki/Zip_(file_format)&quot;&gt;ZIP&lt;/a&gt; archive. When I tried to extract this ZIP file with my operating system’s (Linux Mint 19.3 MATE) archive manager, this resulted in an error dialog, saying that “An error occurred while loading the archive”:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/2020/03/archive-manager-onedrive.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;The output from the underlying extraction tool (&lt;a href=&quot;https://en.wikipedia.org/wiki/7-Zip&quot;&gt;&lt;em&gt;7-zip&lt;/em&gt;&lt;/a&gt;) reported a “Headers Error”, with an “Unconfirmed start of archive”. It also reported a warning that “There are data after the end of archive”. No actual data were extracted whatsoever. This all looked a bit worrying, so I decided to have a more in-depth look at this problem.&lt;/p&gt;

&lt;!-- more --&gt;

&lt;h2 id=&quot;extracting-with-unzip&quot;&gt;Extracting with Unzip&lt;/h2&gt;

&lt;p&gt;As a first test I tried to extract the file from the terminal with &lt;em&gt;unzip&lt;/em&gt; (v. 6.0), using the following command:&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;unzip kb-4d8a2f9a-5e0b-11ea-9376-40b0341fbf5f.zip
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This resulted in the following output:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;Archive:  kb-4d8a2f9a-5e0b-11ea-9376-40b0341fbf5f.zip
warning [kb-4d8a2f9a-5e0b-11ea-9376-40b0341fbf5f.zip]:  1859568605 extra bytes at beginning or within zipfile
  (attempting to process anyway)
error [kb-4d8a2f9a-5e0b-11ea-9376-40b0341fbf5f.zip]:  start of central directory not found;
  zipfile corrupt.
  (please check that you have transferred or created the zipfile in the
  appropriate BINARY mode and that you have compiled UnZip properly)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;So, according to &lt;em&gt;unzip&lt;/em&gt; the file is simply corrupt. &lt;em&gt;Unzip&lt;/em&gt; wasn’t able to extract any actual data.&lt;/p&gt;

&lt;h2 id=&quot;extracting-with-7-zip&quot;&gt;Extracting with 7-zip&lt;/h2&gt;

&lt;p&gt;Next I tried to extract the file with &lt;em&gt;7-zip&lt;/em&gt; (v. 16.02) using this command:&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;7z x kb-4d8a2f9a-5e0b-11ea-9376-40b0341fbf5f.zip
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This resulted in the following (lengthy) output:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;7-Zip [64] 16.02 : Copyright (c) 1999-2016 Igor Pavlov : 2016-05-21
p7zip Version 16.02 (locale=en_US.UTF-8,Utf16=on,HugeFiles=on,64 bits,4 CPUs Intel(R) Core(TM) i5-6500 CPU @ 3.20GHz (506E3),ASM,AES-NI)

Scanning the drive for archives:
1 file, 6154566547 bytes (5870 MiB)

Extracting archive: kb-4d8a2f9a-5e0b-11ea-9376-40b0341fbf5f.zip

ERRORS:
Headers Error
Unconfirmed start of archive


WARNINGS:
There are data after the end of archive

--
Path = kb-4d8a2f9a-5e0b-11ea-9376-40b0341fbf5f.zip
Type = zip
ERRORS:
Headers Error
Unconfirmed start of archive
WARNINGS:
There are data after the end of archive
Physical Size = 4330182775
Tail Size = 1824383772

ERROR: CRC Failed : kb-4d8a2f9a-5e0b-11ea-9376-40b0341fbf5f/afd3f61a-5e0e-11ea-ab97-40b0341fbf5f/08.wav
                                                                              
Sub items Errors: 1

Archives with Errors: 1

Warnings: 1

Open Errors: 1

Sub items Errors: 1
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Here we see the familiar “Headers Error” and “Unconfirmed start of archive” errors, as well as an error about a &lt;a href=&quot;https://en.wikipedia.org/wiki/Cyclic_redundancy_check&quot;&gt;cyclic redundancy check&lt;/a&gt; (CRC) that failed on one of the extracted files. Unlike &lt;em&gt;unzip&lt;/em&gt;, &lt;em&gt;7-zip&lt;/em&gt; does succeed in extracting some of the data, but since the extracted folder is only 4.3 GB in size, the extraction is incomplete (the ZIP file itself is about 6 GB!).&lt;/p&gt;

&lt;h2 id=&quot;4-gib-size-limit-and-zip64&quot;&gt;4 GiB size limit and ZIP64&lt;/h2&gt;

&lt;p&gt;At this point I started wondering if these issues could be related to the &lt;em&gt;size&lt;/em&gt; of this particular ZIP file, especially since I have been able to process zipped OneDrive folders before without any problems. The &lt;a href=&quot;https://en.wikipedia.org/wiki/Zip_(file_format)#ZIP64&quot;&gt;Wikipedia entry on &lt;em&gt;ZIP&lt;/em&gt;&lt;/a&gt; states that originally the format had a 4 GiB limit on the total size of the archive (as well as both the uncompressed and compressed size of a file). To overcome these limitations, a “ZIP64” extension was added to the format in &lt;a href=&quot;https://web.archive.org/web/20011203085830/http://www.pkware.com/support/appnote.txt&quot;&gt;version 4.5 of the ZIP specification&lt;/a&gt; (which was published in 2001). To be sure, I verified that both &lt;em&gt;unzip&lt;/em&gt; and &lt;em&gt;7-zip&lt;/em&gt; on my machine support ZIP64&lt;sup id=&quot;fnref:1&quot;&gt;&lt;a href=&quot;#fn:1&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;1&lt;/a&gt;&lt;/sup&gt;.&lt;/p&gt;
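
&lt;p&gt;The following Python sketch is an illustration only (it is not one of the tests described in this post). It checks the sizes reported so far against the 32-bit limit. The three numbers are copied from the &lt;em&gt;7-zip&lt;/em&gt; output above; the variable names are made up for this example, and reading “Tail Size” as the part of the file that &lt;em&gt;7-zip&lt;/em&gt; did not treat as archive data is an assumption:&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# Sizes in bytes, copied from the 7-zip output
file_size = 6154566547
physical_size = 4330182775
tail_size = 1824383772

# Largest value that an unsigned 32-bit field can hold
zip32_limit = 2**32 - 1

# True if the file is too large for the classic (non-ZIP64) format
needs_zip64 = file_size &amp;gt; zip32_limit

# True if the two sizes reported by 7-zip add up to the file size
sizes_add_up = physical_size + tail_size == file_size

print(needs_zip64, sizes_add_up)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;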

&lt;h2 id=&quot;small-onedrive-zip-and-home-rolled-large-zip&quot;&gt;Small OneDrive ZIP and home-rolled large ZIP&lt;/h2&gt;

&lt;p&gt;I did some additional tests to verify if my problem could be a ZIP64-related issue. First I downloaded a smaller (&amp;lt;4 GB) folder from OneDrive, and tried to extract the resulting ZIP file with &lt;em&gt;unzip&lt;/em&gt; and &lt;em&gt;7-zip&lt;/em&gt;. Both were able to extract the file without any issues. Next I created two 8 GB ZIP files from data on my local machine, one with the &lt;em&gt;zip&lt;/em&gt; tool and one with &lt;em&gt;7-zip&lt;/em&gt;. I then extracted each of these files with both &lt;em&gt;unzip&lt;/em&gt; and &lt;em&gt;7-zip&lt;/em&gt;. Again, both tools extracted these files without any problems. Since these tests demonstrate that &lt;em&gt;unzip&lt;/em&gt; and &lt;em&gt;7-zip&lt;/em&gt; can handle large ZIP files (which by definition use ZIP64) as well as smaller OneDrive ZIP files, this suggests that something odd is going on with OneDrive’s implementation of ZIP64.&lt;/p&gt;

&lt;h2 id=&quot;testing-the-zip-file-integrity&quot;&gt;Testing the ZIP file integrity&lt;/h2&gt;

&lt;p&gt;The &lt;em&gt;zip&lt;/em&gt; tool has a switch that can be used to test the integrity of a ZIP file. I ran it on the problematic file like this:&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;zip &lt;span class=&quot;nt&quot;&gt;-T&lt;/span&gt; kb-4d8a2f9a-5e0b-11ea-9376-40b0341fbf5f.zip
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Here’s the result:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;Could not find:
  kb-4d8a2f9a-5e0b-11ea-9376-40b0341fbf5f.z01

Hit c      (change path to where this split file is)
    q      (abort archive - quit)
 or ENTER  (try reading this split again): 
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;So, apparently the &lt;em&gt;zip&lt;/em&gt; utility thinks this is a multi-volume archive (which it isn’t). Running this command on any of my other test files (the small OneDrive file, and the large files created by &lt;em&gt;zip&lt;/em&gt; and &lt;em&gt;7-zip&lt;/em&gt;) didn’t result in any errors.&lt;/p&gt;

&lt;h2 id=&quot;tests-with-pythons-zipfile-module&quot;&gt;Tests with Python’s zipfile module&lt;/h2&gt;

&lt;p&gt;The Python programming language by default includes a &lt;a href=&quot;https://docs.python.org/3/library/zipfile.html&quot;&gt;&lt;em&gt;zipfile&lt;/em&gt;&lt;/a&gt; module, which has tools for reading and writing ZIP files. So, I wrote the following script, which opens the ZIP file in read mode, and then reads its contents (I used Python 3.6.9 for this):&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;zipfile&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;# Open ZIP file
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;myZip&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;zipfile&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nc&quot;&gt;ZipFile&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;kb-4d8a2f9a-5e0b-11ea-9376-40b0341fbf5f.zip&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
                        &lt;span class=&quot;n&quot;&gt;mode&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;r&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&apos;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;# Read all files in archive and check their CRCs and file headers.
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;myZip&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;testzip&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;# Close the ZIP file
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;myZip&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nf&quot;&gt;close&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Running the script raised the following error:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;zipfile.BadZipFile: zipfiles that span multiple disks are not supported
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This looks somewhat related to the outcome of &lt;em&gt;zip&lt;/em&gt;’s integrity test, which reported a multi-volume archive. Looking at the &lt;a href=&quot;https://github.com/python/cpython/blob/d3af92ecc2f41d920e9a66211e2ab631fc473163/Lib/zipfile.py#L232&quot;&gt;source code of the zipfile module&lt;/a&gt; shows that this particular error is raised if a check on 2 data fields from the “zip64 end of central dir locator” fails:&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;diskno&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;!=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;or&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;disks&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
  &lt;span class=&quot;k&quot;&gt;raise&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;BadZipFile&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;zipfiles that span multiple disks are not supported&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Here’s the description of this data structure in the &lt;a href=&quot;https://pkware.cachefly.net/webdocs/APPNOTE/APPNOTE-6.3.6.TXT&quot;&gt;format specification&lt;/a&gt;:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;zip64 end of central dir locator 
      signature                       4 bytes  (0x07064b50)
      number of the disk with the
      start of the zip64 end of 
      central directory               4 bytes
      relative offset of the zip64
      end of central directory record 8 bytes
      total number of disks           4 bytes
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Python’s &lt;em&gt;zipfile&lt;/em&gt; raises the error if either the value of the “number of the disk with the start of the zip64 end of central directory” field (variable &lt;em&gt;diskno&lt;/em&gt;) isn’t equal to 0, or the “total number of disks” (variable &lt;em&gt;disks&lt;/em&gt;) is larger than 1. So, I opened the file in a Hex editor, and zoomed in on the “zip64 end of central dir locator”:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/2020/03/onedrive-hex.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Here, the highlighted bytes (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;0x504b0607&lt;/code&gt;) make up the signature of the “zip64 end of central dir locator”&lt;sup id=&quot;fnref:2&quot;&gt;&lt;a href=&quot;#fn:2&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;2&lt;/a&gt;&lt;/sup&gt;. The 4 bytes inside the blue rectangle contain the “number of the disk” value. Here, its value is 0, which is the correct and expected value. The 4 bytes inside the red rectangle contain the “total number of disks” value, which is also 0. But this is really odd, since neither value should trigger the “zipfiles that span multiple disks are not supported” error! Also, a check on the 8 GB ZIP files that I had created myself with &lt;em&gt;zip&lt;/em&gt; and &lt;em&gt;7-zip&lt;/em&gt; showed both to have a value of 1 for this field. So what’s going on here?&lt;/p&gt;
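
&lt;p&gt;The same inspection can also be scripted. Below is a minimal Python sketch (not the method used for this post, which relied on a Hex editor) that searches the tail of a file for the locator signature and unpacks the fields listed in the specification excerpt above. The function names and the size of the tail that is read are assumptions made for this example. Also note that a plain byte search could in principle hit the same four bytes elsewhere in the data:&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;import struct

# Layout of the &quot;zip64 end of central dir locator&quot; (all little-endian):
# 4-byte signature, 4-byte disk number, 8-byte offset, 4-byte total disks
LOCATOR_SIGNATURE = b&quot;PK\x06\x07&quot;
LOCATOR_FORMAT = &quot;&amp;lt;4sLQL&quot;
LOCATOR_SIZE = struct.calcsize(LOCATOR_FORMAT)


# Return (diskno, offset, disks) for the last locator found in a block
# of bytes, or None if the block contains no complete locator
def parse_zip64_locator(tail):
    pos = tail.rfind(LOCATOR_SIGNATURE)
    if pos == -1 or pos + LOCATOR_SIZE &amp;gt; len(tail):
        return None
    _, diskno, offset, disks = struct.unpack_from(LOCATOR_FORMAT, tail, pos)
    return diskno, offset, disks


# Read the last part of a file; the default covers a maximum-length ZIP
# comment (65535 bytes), the end of central directory record (22 bytes)
# and the locator itself (20 bytes)
def read_tail(path, max_bytes=65535 + 22 + 20):
    with open(path, &quot;rb&quot;) as f:
        f.seek(0, 2)
        size = f.tell()
        f.seek(max(0, size - max_bytes))
        return f.read()
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Calling &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;parse_zip64_locator(read_tail(path))&lt;/code&gt; on a ZIP64 file returns the three values as a tuple; the last item is the “total number of disks”.&lt;/p&gt;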

&lt;h2 id=&quot;digging-into-zipfiles-history&quot;&gt;Digging into zipfile’s history&lt;/h2&gt;

&lt;p&gt;The most likely explanation I could think of was some difference between my local version of the Python &lt;em&gt;zipfile&lt;/em&gt; module and the latest published version on GitHub. Using GitHub’s &lt;a href=&quot;https://help.github.com/en/github/managing-files-in-a-repository/tracking-changes-in-a-file&quot;&gt;blame view&lt;/a&gt;, I inspected the revision history of the part of the check that raises the error. This revealed &lt;a href=&quot;https://github.com/python/cpython/commit/ab0716ed1ea2957396054730afbb80c1825f9786&quot;&gt;a recent change to &lt;em&gt;zipfile&lt;/em&gt;&lt;/a&gt;: prior to a patch that was submitted in May 2019, the offending check was done slightly differently:&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;diskno&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;!=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;or&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;disks&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;!=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
  &lt;span class=&quot;k&quot;&gt;raise&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;BadZipFile&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;zipfiles that span multiple disks are not supported&lt;/span&gt;&lt;span class=&quot;sh&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Note that in the old situation the test would fail if &lt;em&gt;disks&lt;/em&gt; was any value other than 1, whereas in the new situation it only fails if &lt;em&gt;disks&lt;/em&gt; is greater than 1. Given that for our OneDrive file the value is 0, this explains why the old version results in the error. The Git commit of the patch also includes the following note:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;Added support for ZIP files with disks set to 0. Such files are commonly created by builtin tools on Windows when use ZIP64 extension.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;So could this be the vital clue we need to solve this little file format mystery? Re-running my Python test script with the latest version of the &lt;em&gt;zipfile&lt;/em&gt; module did not result in any reported errors, so this looked hopeful for a start. But is the 0 value of “total number of disks” also the thing that makes unzip and 7-zip choke?&lt;/p&gt;
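
&lt;p&gt;To make the difference between the two versions of the check explicit, here is a small Python sketch. The two conditions are copied from the &lt;em&gt;zipfile&lt;/em&gt; snippets above; the function names are hypothetical and only serve this illustration:&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# Condition as it was before the May 2019 patch
def old_check_fails(diskno, disks):
    return diskno != 0 or disks != 1


# Condition as it is after the patch
def new_check_fails(diskno, disks):
    return diskno != 0 or disks &amp;gt; 1


# Compare both conditions for disk number 0 and 0, 1 or 2 total disks
for disks in (0, 1, 2):
    print(disks, old_check_fails(0, disks), new_check_fails(0, disks))
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Only for a “total number of disks” value of 0 do the two conditions disagree: the old one raises the error, the new one does not.&lt;/p&gt;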

&lt;h2 id=&quot;hacking-into-the-onedrive-zip-file&quot;&gt;Hacking into the OneDrive ZIP file&lt;/h2&gt;

&lt;p&gt;To put this to the test, I first made a copy of the OneDrive ZIP file. I opened this file in a Hex editor, and did a search on the hexadecimal string &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;0x504b0607&lt;/code&gt;, which is the signature that indicates the start of the “zip64 end of central dir locator”&lt;sup id=&quot;fnref:3&quot;&gt;&lt;a href=&quot;#fn:3&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;3&lt;/a&gt;&lt;/sup&gt;. I then changed the first byte of the “total number of disks” value (this is the 13th byte after the signature, indicated by the red rectangle in the screenshot) from &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;0x00&lt;/code&gt; to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;0x01&lt;/code&gt;:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/2020/03/onedrive-hex-fixed.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;This effectively sets the “total number of disks” value to 1 (unsigned little-endian 32-bit value). After saving the file, I repeated all my previous tests with &lt;em&gt;unzip&lt;/em&gt;, &lt;em&gt;7-zip&lt;/em&gt;, as well as &lt;em&gt;zip&lt;/em&gt;’s integrity check. The modified ZIP file passed all these tests without any problems! The contents of the file could be extracted normally, and the extraction is also complete. The file can also be opened normally in Linux Mint’s archive manager, as this screenshot shows:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/2020/03/onedrive-fixed-archive-manager.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;So, it turns out that the cause of the problem is the value of one field in the “zip64 end of central dir locator”, which can be provisionally fixed by nothing more than changing one single bit!&lt;/p&gt;
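
&lt;p&gt;For readers who prefer not to use a Hex editor, the same change could be scripted. The following is a minimal Python sketch under a number of assumptions: it is not the procedure used for this post, the function name and the size of the searched tail are made up for this example, and the byte search could in principle match the signature bytes elsewhere in the tail. Only run something like this on a copy of the file:&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;import struct

LOCATOR_SIGNATURE = b&quot;PK\x06\x07&quot;
LOCATOR_SIZE = 20        # signature + disk number + offset + total disks
DISKS_FIELD_OFFSET = 16  # &quot;total number of disks&quot; starts 16 bytes in


# Set &quot;total number of disks&quot; to 1 if it is 0; returns True if the
# file was changed, False if it was left alone
def fix_total_disks(path, max_tail=65535 + 22 + 20):
    with open(path, &quot;r+b&quot;) as f:
        # Read the last part of the file, where the locator lives
        f.seek(0, 2)
        size = f.tell()
        start = max(0, size - max_tail)
        f.seek(start)
        tail = f.read()
        pos = tail.rfind(LOCATOR_SIGNATURE)
        if pos == -1 or pos + LOCATOR_SIZE &amp;gt; len(tail):
            # No complete locator found
            return False
        field_pos = pos + DISKS_FIELD_OFFSET
        (disks,) = struct.unpack_from(&quot;&amp;lt;L&quot;, tail, field_pos)
        if disks != 0:
            # Value is not 0, so there is nothing to fix
            return False
        # Overwrite the field with 1 (unsigned little-endian 32-bit)
        f.seek(start + field_pos)
        f.write(struct.pack(&quot;&amp;lt;L&quot;, 1))
        return True
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;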

&lt;h2 id=&quot;other-reports-on-this-problem&quot;&gt;Other reports on this problem&lt;/h2&gt;

&lt;p&gt;If a widely-used platform by a major vendor such as Microsoft produces ZIP archives with major interoperability issues, I would expect others to have run into this before. &lt;a href=&quot;https://askubuntu.com/questions/1115238/unable-to-extract-onedrive-zip-file-on-ubuntu-18-10&quot;&gt;This question on &lt;em&gt;Ask Ubuntu&lt;/em&gt;&lt;/a&gt; appears to describe the same problem, and &lt;a href=&quot;https://onedrive.uservoice.com/forums/913528-onedrive-on-the-web/suggestions/35321278-fix-the-corrupted-zip-download-or-be-honest-about&quot;&gt;here’s another report on corrupted large ZIP files&lt;/a&gt; on a OneDrive-related forum. On Twitter Tyler Thorsted &lt;a href=&quot;https://twitter.com/CHLThor/status/1237416651995283457&quot;&gt;confirmed&lt;/a&gt; my results for a 5 GB ZIP file downloaded from OneDrive, adding that the Mac OS X archive utility didn’t like the file either. Still, I’m surprised I couldn’t find much else on this issue.&lt;/p&gt;

&lt;h2 id=&quot;similarity-to-apple-archive-utility-problem&quot;&gt;Similarity to Apple Archive utility problem&lt;/h2&gt;

&lt;p&gt;The problem looks superficially similar to an older issue with Apple’s Archive utility, which would write corrupt ZIP archives for cases where the ZIP64 extension is needed. From the &lt;a href=&quot;https://en.wikipedia.org/wiki/Zip_(file_format)#ZIP64&quot;&gt;Wikipedia entry on &lt;em&gt;ZIP&lt;/em&gt;&lt;/a&gt;:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;Mac OS Sierra’s Archive Utility notably does not support ZIP64, and can create corrupt archives when ZIP64 would be required.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;More details about this are available &lt;a href=&quot;https://web.archive.org/web/20140331005235/http://www.springyarchiver.com/blog/topic/topic/203&quot;&gt;here&lt;/a&gt;, &lt;a href=&quot;https://apple.stackexchange.com/questions/221020/large-zip-files-created-in-os-x-cannot-be-opened-in-windows&quot;&gt;here&lt;/a&gt; and &lt;a href=&quot;https://github.com/thejoshwolfe/yauzl/issues/69&quot;&gt;here&lt;/a&gt;. Interestingly a &lt;a href=&quot;https://twitter.com/kieranjol/status/1200118816686137344&quot;&gt;Twitter thread by Kieran O’Leary&lt;/a&gt; put me on track of this issue (which I hadn’t heard of before). It’s not clear to me if the OneDrive problem is identical or even related, but because of the similarities I thought it was at least worth a mention here.&lt;/p&gt;

&lt;h2 id=&quot;conclusion&quot;&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;The tests presented here demonstrate how large ZIP files exported from the Microsoft OneDrive web client cannot be read by widely-used tools such as &lt;em&gt;unzip&lt;/em&gt; and &lt;em&gt;7-zip&lt;/em&gt;. The problem only occurs for large (&amp;gt; 4 GiB) files that use the ZIP64 extension. The cause of this interoperability problem is the value of the “total number of disks” field in the “zip64 end of central dir locator”. In the OneDrive files, this value is set to 0 (zero), whereas most reader tools expect a value of 1. It is debatable whether the OneDrive files violate the &lt;a href=&quot;https://pkware.cachefly.net/webdocs/APPNOTE/APPNOTE-6.3.6.TXT&quot;&gt;ZIP format specification&lt;/a&gt;, since the spec doesn’t say anything about the permitted values of this field. Affected files can be provisionally “fixed” by changing the first byte of the “total number of disks” field in a hex editor. However, to ensure that existing files that are affected by this issue remain accessible in the long term, we need a more structural and sustainable solution. It is probably fairly trivial to modify existing ZIP reader tools and libraries such as &lt;em&gt;unzip&lt;/em&gt; and &lt;em&gt;7-zip&lt;/em&gt; to deal with these files. I’ll try to get in touch with the developers of some of these tools about this issue. Ideally things should also be fixed on Microsoft’s end. If any readers have contacts there, please bring this post to their attention!&lt;/p&gt;

&lt;h2 id=&quot;test-file&quot;&gt;Test file&lt;/h2&gt;

&lt;p&gt;I’ve created an openly-licensed test file that demonstrates the problem. It is available here:&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://zenodo.org/record/3715394&quot;&gt;https://zenodo.org/record/3715394&lt;/a&gt;&lt;/p&gt;

&lt;h2 id=&quot;update-17-march-2020&quot;&gt;Update (17 March 2020)&lt;/h2&gt;

&lt;p&gt;For &lt;em&gt;unzip&lt;/em&gt; I found &lt;a href=&quot;https://sourceforge.net/p/infozip/bugs/42/&quot;&gt;this ticket&lt;/a&gt; on the &lt;em&gt;Info-Zip&lt;/em&gt; issue tracker, which looks identical to the problem discussed in this post. The ticket was already created in 2013, but its current status is not entirely clear.&lt;/p&gt;

&lt;p&gt;For &lt;em&gt;7-zip&lt;/em&gt;, things are slightly complicated by the fact that for Unix &lt;a href=&quot;https://sourceforge.net/projects/p7zip/&quot;&gt;a separate &lt;em&gt;p7zip&lt;/em&gt; port&lt;/a&gt; exists, which currently is 3 major releases behind the &lt;a href=&quot;https://sourceforge.net/projects/sevenzip/&quot;&gt;main 7-zip&lt;/a&gt; project. In any case, I’ve just opened &lt;a href=&quot;https://sourceforge.net/p/p7zip/feature-requests/46/&quot;&gt;this feature request&lt;/a&gt; in the &lt;em&gt;p7zip&lt;/em&gt; issue tracker.&lt;/p&gt;

&lt;p&gt;Meanwhile Andy Jackson has been trying to &lt;a href=&quot;https://twitter.com/anjacks0n/status/1238852027045883904&quot;&gt;get this issue to the attention of Microsoft&lt;/a&gt;, so let’s see what happens from here.&lt;/p&gt;

&lt;h2 id=&quot;fix-onedrive-zip-script-update-8-june-2020&quot;&gt;Fix-OneDrive-Zip script (update 8 June 2020)&lt;/h2&gt;

&lt;p&gt;In the comments section, Paul Marquess posted a link to a small Perl script he wrote that automatically updates the “total number of disks” field of a problematic OneDrive ZIP file. The script is available here:&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://github.com/pmqs/Fix-OneDrive-Zip&quot;&gt;https://github.com/pmqs/Fix-OneDrive-Zip&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I ran a quick test with my openly-licensed test file, using the following command:&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;fix-onedrive-zip onedrive-zip-test-zeros.zip
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;After running the script, the file was indeed perfectly readable. Thanks Paul!&lt;/p&gt;

&lt;h2 id=&quot;revision-history&quot;&gt;Revision history&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;14 March 2020: added analysis with Python &lt;em&gt;zipfile&lt;/em&gt;, and updated conclusions accordingly.&lt;/li&gt;
  &lt;li&gt;17 March 2020: added update with links to Info-Zip and p7zip issue trackers.&lt;/li&gt;
  &lt;li&gt;18 March 2020: added link to test file.&lt;/li&gt;
  &lt;li&gt;8 June 2020: added reference to &lt;em&gt;Fix-OneDrive-Zip&lt;/em&gt; script by Paul Marquess.&lt;/li&gt;
&lt;/ul&gt;

&lt;div class=&quot;footnotes&quot; role=&quot;doc-endnotes&quot;&gt;
  &lt;ol&gt;
    &lt;li id=&quot;fn:1&quot;&gt;
      &lt;p&gt;For unzip you can check this by running it with the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;--version&lt;/code&gt; switch. If the output includes &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ZIP64_SUPPORT&lt;/code&gt; this means ZIP64 is supported. &lt;a href=&quot;#fnref:1&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:2&quot;&gt;
      &lt;p&gt;Note that this is the big-endian representation of the signature, whereas the ZIP format specification uses little-endian representations. See more on endianness &lt;a href=&quot;https://en.wikipedia.org/wiki/Endianness&quot;&gt;here&lt;/a&gt;. &lt;a href=&quot;#fnref:2&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:3&quot;&gt;
      &lt;p&gt;Since the “zip64 end of central dir locator” is located near the end of the file, the quickest way to find it is to scroll to the very end of the file in the Hex editor, and then do a reverse search (“Find Previous”) from there. &lt;a href=&quot;#fnref:3&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
  &lt;/ol&gt;
&lt;/div&gt;
</description>
                <link>https://bitsgalore.org/2020/03/11/does-microsoft-onedrive-export-large-ZIP-files-that-are-corrupt</link>
                <guid>https://bitsgalore.org/2020/03/11/does-microsoft-onedrive-export-large-ZIP-files-that-are-corrupt</guid>
                <pubDate>2020-03-11T00:00:00+01:00</pubDate>
        </item>

        <item>
                <title>Offline digital data carriers in the KB deposit collection</title>
                <description>&lt;figure class=&quot;image&quot;&gt;
  &lt;img src=&quot;/images/2020/02/carriers-stillife.jpg&quot; alt=&quot;Still life of assorted data carriers&quot; /&gt;
&lt;/figure&gt;

&lt;p&gt;Following earlier work on the preservation of optical media and data tapes, I recently got a request to make an inventory of offline digital data carriers in the KB’s deposit collection. The goal was to obtain approximate figures on the various carrier types in the collection. This was partially prompted by &lt;a href=&quot;https://www.netwerkdigitaalerfgoed.nl/activiteiten/digitaal-erfgoed-houdbaar/bedreigd-digitaal-erfgoed-op-fysieke-dragers/&quot;&gt;a project on at-risk digital heritage on physical carriers&lt;/a&gt; by the &lt;a href=&quot;https://www.netwerkdigitaalerfgoed.nl/en/&quot;&gt;Dutch Digital Heritage Network&lt;/a&gt; (NDE) that the KB is participating in. This blog post presents the results.&lt;/p&gt;

&lt;!-- more --&gt;

&lt;h2 id=&quot;data-carriers-in-the-kb-catalogue&quot;&gt;Data carriers in the KB catalogue&lt;/h2&gt;

&lt;p&gt;The KB catalogue is the primary source of information about carriers in our collection. It is fully searchable using the &lt;a href=&quot;https://www.loc.gov/standards/sru/&quot;&gt;SRU protocol&lt;/a&gt;.&lt;/p&gt;
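
&lt;p&gt;As an illustration of what such a search request looks like, here is a minimal Python sketch that builds an SRU request URL. The endpoint and the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;x-collection&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;maximumRecords&lt;/code&gt; parameters are taken from the query links in the results table of this post; the function name is hypothetical, and the sketch only constructs the URL without sending any request:&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;from urllib.parse import quote, urlencode

# Endpoint as it appears in the query links in this post
SRU_ENDPOINT = &quot;http://jsru.kb.nl/sru/sru&quot;


# Build an SRU request URL for a query string
def build_sru_url(query, collection=&quot;GGC&quot;, maximum_records=10):
    params = {
        &quot;query&quot;: query,
        &quot;x-collection&quot;: collection,
        &quot;maximumRecords&quot;: maximum_records,
    }
    # quote_via=quote encodes spaces as %20 rather than +
    return SRU_ENDPOINT + &quot;?&quot; + urlencode(params, quote_via=quote)


print(build_sru_url(&apos;extent any &quot;dvd*&quot;&apos;))
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Unlike the links in this post, the resulting URL also percent-encodes the quote and asterisk characters, which should be equivalent.&lt;/p&gt;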

&lt;h2 id=&quot;carrier-type-codes-not-based-on-controlled-vocabulary&quot;&gt;Carrier type codes not based on controlled vocabulary&lt;/h2&gt;

&lt;p&gt;The main piece of information to go by here is the value of the Dublin Core &lt;em&gt;extent&lt;/em&gt; field&lt;sup id=&quot;fnref:1&quot;&gt;&lt;a href=&quot;#fn:1&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;1&lt;/a&gt;&lt;/sup&gt;. However, in our catalogue the values of this field are not set from a controlled vocabulary, which means that one particular carrier type can be represented by different values. For example, here are some variations I found for CD-ROMs:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;cdrom&lt;/li&gt;
  &lt;li&gt;cd-rom&lt;/li&gt;
  &lt;li&gt;1 cd-rom&lt;/li&gt;
  &lt;li&gt;2 cd-roms&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For 5.25” floppy disks I came across:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;1 diskette 5.25”&lt;/li&gt;
  &lt;li&gt;1 floppydisk 5.25”&lt;/li&gt;
  &lt;li&gt;1 floppydisk 5.25” in doos&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This lack of consistency makes searching for specific carrier types rather difficult, and as a result most of the queries I used for this inventory are the result of trial and error.&lt;/p&gt;

&lt;h2 id=&quot;catalogue-record-can-represent-multiple-carriers&quot;&gt;Catalogue record can represent multiple carriers&lt;/h2&gt;

&lt;p&gt;In the examples above you may have noticed how in some cases the value represents multiple carriers (e.g. “2 cd-roms”). This is because our catalogue provides access at the level of publications, but not at the level of individual carriers that are part of a publication! A typical example to illustrate this, is a physical book with 2 supplemental CD-ROMs. In this case, the catalogue record represents the whole publication, and the &lt;em&gt;extent&lt;/em&gt; field will have a value like “2 cd-roms”. As a result, the catalogue cannot be readily used to get precise figures on carrier types. In most cases the number of carriers for a specific carrier type will be considerably greater than the number of matching records, since an individual record (publication) may include multiple carriers. This is not really a problem for the current exercise, as its objective is limited to getting approximate figures only.&lt;/p&gt;
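
&lt;p&gt;The gap between the number of records and the number of carriers can be illustrated with a small Python sketch. This is not part of the inventory itself: the function name is hypothetical, and the sketch assumes that an &lt;em&gt;extent&lt;/em&gt; value starts with the number of carriers when a number is given, which holds for the examples above but not necessarily for the catalogue as a whole:&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;import re


# Return the leading number in an extent value, or 1 if there is none
def carrier_count(extent):
    match = re.match(r&quot;\s*(\d+)\s&quot;, extent)
    return int(match.group(1)) if match else 1


# Example extent values taken from the CD-ROM list above
extents = [&quot;cdrom&quot;, &quot;cd-rom&quot;, &quot;1 cd-rom&quot;, &quot;2 cd-roms&quot;]

# Number of records versus estimated number of carriers
print(len(extents), sum(carrier_count(e) for e in extents))
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;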

&lt;h2 id=&quot;results&quot;&gt;Results&lt;/h2&gt;

&lt;p&gt;I summarised the results in the table below. Here I made a subdivision between optical, magnetic and electronic carrier types. The table contains 4 columns:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Carrier type:&lt;/strong&gt; the types listed here largely follow those proposed for an upcoming survey that will be launched as part of the NDE project on at-risk digital heritage on physical carriers (with some additions).&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Category:&lt;/strong&gt; this indicates the carrier type category (optical, magnetic or electronic).&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Query:&lt;/strong&gt; this is the &lt;a href=&quot;https://www.loc.gov/standards/sru/cql/&quot;&gt;SRU&lt;/a&gt; query I used to estimate the number of catalogue records for this carrier type. Follow the underlying hyperlinks to run the query yourself. Note that this field is empty for carrier types that, to the best of our knowledge, are not present in our deposit collection.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;strong&gt;Matching records:&lt;/strong&gt; this is the number of matching records in the catalogue. As explained in the previous section, this does not necessarily reflect the actual number of carriers, which may well be greater by a factor of 2 (or even more).&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th style=&quot;text-align: left&quot;&gt;Carrier type&lt;/th&gt;
      &lt;th style=&quot;text-align: left&quot;&gt;Category&lt;/th&gt;
      &lt;th style=&quot;text-align: left&quot;&gt;Query&lt;/th&gt;
      &lt;th style=&quot;text-align: left&quot;&gt;Matching records&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;CD-ROM&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Optical&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;&lt;a href=&quot;http://jsru.kb.nl/sru/sru?query=extent%20any%20&amp;quot;cdrom*%20cd-rom*&amp;quot;&amp;amp;x-collection=GGC&amp;amp;maximumRecords=10&quot;&gt;extent any “cdrom* cd-rom*”&lt;/a&gt;&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;8109&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;DVD&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Optical&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;&lt;a href=&quot;http://jsru.kb.nl/sru/sru?query=extent%20any%20&amp;quot;dvd*&amp;quot;&amp;amp;x-collection=GGC&amp;amp;maximumRecords=10&quot;&gt;extent any “dvd*”&lt;/a&gt;&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;711&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Blu-Ray&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Optical&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;&lt;a href=&quot;http://jsru.kb.nl/sru/sru?query=extent%20any%20&amp;quot;bluray*%20blu-ray*&amp;quot;&amp;amp;x-collection=GGC&amp;amp;maximumRecords=10&quot;&gt;extent any “bluray* blu-ray*”&lt;/a&gt;&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;4&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Audio CD&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Optical&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;&lt;a href=&quot;http://jsru.kb.nl/sru/sru?query=type%20any%20&amp;quot;geluidsdrager&amp;quot;%20and%20extent%20any%20&amp;quot;cd*%20compact&amp;quot;&amp;amp;x-collection=GGC&amp;amp;maximumRecords=10&quot;&gt;type any “geluidsdrager” and extent any “cd* compact”&lt;/a&gt;&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;4605&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;CD-i&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Optical&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;&lt;a href=&quot;http://jsru.kb.nl/sru/sru?query=extent%20any%20&amp;quot;cdi*%20cd-i*&amp;quot;&amp;amp;x-collection=GGC&amp;amp;maximumRecords=10&quot;&gt;extent any “cdi* cd-i*”&lt;/a&gt;&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;44&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Optical carrier, unspecified&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Optical&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;&lt;a href=&quot;http://jsru.kb.nl/sru/sru?query=extent%20any%20&amp;quot;optisch*%20schijf&amp;quot;&amp;amp;x-collection=GGC&amp;amp;maximumRecords=10&quot;&gt;extent any “optisch* schijf”&lt;/a&gt;&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;23&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Tape, unspecified&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Magnetic&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;&lt;a href=&quot;http://jsru.kb.nl/sru/sru?query=extent%20any%20&amp;quot;tape*&amp;quot;&amp;amp;x-collection=GGC&amp;amp;maximumRecords=10&quot;&gt;extent any “tape*”&lt;/a&gt;&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;2&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Compact cassette, data (datassette)&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Magnetic&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;&lt;a href=&quot;http://jsru.kb.nl/sru/sru?query=extent%20any%20%22cassetteband*%20datacassette*%22&amp;amp;x-collection=GGC&amp;amp;maximumRecords=10&quot;&gt;extent any “cassetteband* datacassette*”&lt;/a&gt;&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;6&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Floppy disk, 8”&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Magnetic&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;-&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;0&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Floppy disk, 5.25”&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Magnetic&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;&lt;a href=&quot;http://jsru.kb.nl/sru/sru?query=extent%20any%20&amp;quot;flop*%20diskette*&amp;quot;%20and%20extent%20any%20&amp;quot;5.25*%205,25*&amp;quot;&amp;amp;x-collection=GGC&amp;amp;maximumRecords=10&quot;&gt;extent any “flop* diskette*” and extent any “5.25* 5,25*”&lt;/a&gt;&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;184&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Floppy disk, 3.5”&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Magnetic&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;&lt;a href=&quot;http://jsru.kb.nl/sru/sru?query=extent%20any%20&amp;quot;flop*%20diskette*&amp;quot;%20and%20extent%20any%20&amp;quot;3.5*%203,5*&amp;quot;&amp;amp;x-collection=GGC&amp;amp;maximumRecords=10&quot;&gt;extent any “flop* diskette*” and extent any “3.5* 3,5*”&lt;/a&gt;&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;1194&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Floppy disk, unspecified&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Magnetic&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;&lt;a href=&quot;http://jsru.kb.nl/sru/sru?query=extent%20any%20&amp;quot;flop*%20diskette*&amp;quot;%20not%20extent%20any%20&amp;quot;5.25*%203.5*&amp;quot;&amp;amp;x-collection=GGC&amp;amp;maximumRecords=10&quot;&gt;extent any “flop* diskette*” not extent any “5.25* 3.5*”&lt;/a&gt;&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;822&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Zip Disk&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Magnetic&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;&lt;a href=&quot;http://jsru.kb.nl/sru/sru?query=extent%20any%20&amp;quot;zipdis*%20zip-dis*&amp;quot;&amp;amp;x-collection=GGC&amp;amp;maximumRecords=10&quot;&gt;extent any “zipdis* zip-dis*”&lt;/a&gt;&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;0&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Hard disk&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Magnetic&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;&lt;a href=&quot;http://jsru.kb.nl/sru/sru?query=extent%20any%20&amp;quot;hard*&amp;quot;%20and%20extent%20any%20&amp;quot;schijf%20schijv*%20disk*&amp;quot;&amp;amp;x-collection=GGC&amp;amp;maximumRecords=10&quot;&gt;extent any “hard*” and extent any “schijf schijv* disk*”&lt;/a&gt;&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;2&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Compact Flash card&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Electronic&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;-&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;0&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Sony Memory stick&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Electronic&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;-&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;0&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;SD card&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Electronic&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;-&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;0&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;USB thumb drive&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Electronic&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;&lt;a href=&quot;http://jsru.kb.nl/sru/sru?query=extent%20any%20&amp;quot;usb*&amp;quot;&amp;amp;x-collection=GGC&amp;amp;maximumRecords=10&quot;&gt;extent any “usb*”&lt;/a&gt;&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;44&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Solid-State drive&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Electronic&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;-&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;0&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;In the following sections I will highlight some of the most interesting observations that can be made from the table.&lt;/p&gt;

&lt;h2 id=&quot;optical-carriers&quot;&gt;Optical carriers&lt;/h2&gt;

&lt;p&gt;Unsurprisingly, optical carriers make up the majority of offline digital carriers in the KB deposit collection. An effort to preserve the contents of these carriers using the &lt;a href=&quot;/2017/06/19/image-and-rip-optical-media-like-a-boss&quot;&gt;&lt;em&gt;Iromlab&lt;/em&gt;&lt;/a&gt; software is currently ongoing, but this does not cover Blu-Ray discs. At the current numbers we could probably just image them manually.&lt;/p&gt;

&lt;h2 id=&quot;development-of-optical-carriers-since-2015&quot;&gt;Development of optical carriers since 2015&lt;/h2&gt;

&lt;p&gt;Since I had already queried the catalogue for optical carriers &lt;a href=&quot;https://zenodo.org/record/292341&quot;&gt;as part of an investigation I did in 2015&lt;/a&gt;, I also compared the current figures against those in 2015. The following figure shows the result for the most prevalent optical carrier types:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/2020/02/optical-2015-2020.png&quot; alt=&quot;&amp;quot;Optical carrier types, 2015 vs 2020&amp;quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;And here in table form:&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th style=&quot;text-align: left&quot;&gt;Carrier type&lt;/th&gt;
      &lt;th style=&quot;text-align: left&quot;&gt;Matching records (2015)&lt;/th&gt;
      &lt;th style=&quot;text-align: left&quot;&gt;Matching records (2020)&lt;/th&gt;
      &lt;th style=&quot;text-align: left&quot;&gt;Increase&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;CD-ROM&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;7980&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;8109&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;129&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Audio CD&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;4065&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;4605&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;540&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;DVD&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;554&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;711&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;157&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;The growth in the number of DVD records (a 28% increase) is particularly noteworthy. In absolute terms the number of audio CD records has increased even more. The growth of the CD-ROM collection has clearly levelled off, with an increase of only 129 records over 5 years.&lt;/p&gt;
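
&lt;p&gt;For readers who want to check the figures, here is a minimal Python sketch that recomputes the absolute and relative increases from the table above. The function name &lt;em&gt;growth&lt;/em&gt; is made up for this illustration:&lt;/p&gt;

```python
# Matching records (2015, 2020), copied from the table above
counts = {
    "CD-ROM": (7980, 8109),
    "Audio CD": (4065, 4605),
    "DVD": (554, 711),
}


def growth(old, new):
    """Return the absolute increase and the relative increase in percent."""
    absolute = new - old
    relative = 100 * absolute / old
    return absolute, relative


for carrier, (old, new) in counts.items():
    absolute, relative = growth(old, new)
    print(f"{carrier}: +{absolute} records ({relative:.1f}%)")
```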

&lt;h2 id=&quot;magnetic-carriers&quot;&gt;Magnetic carriers&lt;/h2&gt;

&lt;p&gt;In the magnetic carriers category the number of publications with floppy disks is noteworthy. Assuming each of these publications contains between 1 and 2 disks on average, the total number of floppy disks would be in the range 2200 - 4400. The majority of these are 3.5” disks, but for about 37% the exact type cannot be established from the catalogue alone.&lt;/p&gt;
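
&lt;p&gt;The estimate follows from summing the three floppy disk rows in the table and multiplying by the assumed number of disks per publication. A small sketch of that arithmetic (the function names are hypothetical, and the 1 to 2 disks per publication is the assumption stated above):&lt;/p&gt;

```python
# Matching records for the three floppy disk rows in the table above
floppy_records = {"5.25 inch": 184, "3.5 inch": 1194, "unspecified": 822}


def estimate_disks(records, disks_per_record=(1, 2)):
    """Return the lower and upper estimate of the total number of disks."""
    total = sum(records.values())
    low, high = disks_per_record
    return total * low, total * high


def share(records, key):
    """Return the percentage of all records that fall under one key."""
    return 100 * records[key] / sum(records.values())


print(estimate_disks(floppy_records))
print(round(share(floppy_records, "unspecified"), 1))
```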

&lt;p&gt;An interesting curiosity is a handful of compact cassettes with software for the &lt;a href=&quot;https://en.wikipedia.org/wiki/ZX_Spectrum&quot;&gt;&lt;em&gt;ZX Spectrum&lt;/em&gt;&lt;/a&gt; home computer.&lt;/p&gt;

&lt;p&gt;We also appear to have 2 tapes of unknown format, and 2 hard disks. These need further investigation.&lt;/p&gt;

&lt;h2 id=&quot;electronic-carriers&quot;&gt;Electronic carriers&lt;/h2&gt;

&lt;p&gt;Here we have 44 USB thumb drives, and we should probably consider imaging them sooner rather than later. We don’t appear to have any of the other carrier types in this category.&lt;/p&gt;

&lt;h2 id=&quot;final-remarks&quot;&gt;Final remarks&lt;/h2&gt;

&lt;p&gt;Despite the obvious limitations of the methodology, the above inventory provides a tentative overview of offline digital carriers in the KB deposit collection that will probably be useful for preservation planning. It might also guide future efforts at saving information on “at risk” carriers.&lt;/p&gt;

&lt;div class=&quot;footnotes&quot; role=&quot;doc-endnotes&quot;&gt;
  &lt;ol&gt;
    &lt;li id=&quot;fn:1&quot;&gt;
      &lt;p&gt;Oddly, this field is meant to record &lt;a href=&quot;https://www.dublincore.org/specifications/dublin-core/dcmi-terms/terms/extent/&quot;&gt;“The size or duration of the resource”&lt;/a&gt; as per the Dublin Core specification. &lt;a href=&quot;#fnref:1&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
  &lt;/ol&gt;
&lt;/div&gt;
</description>
                <link>https://bitsgalore.org/2020/02/20/offline-digital-carriers-kb-deposit-collection</link>
                <guid>https://bitsgalore.org/2020/02/20/offline-digital-carriers-kb-deposit-collection</guid>
                <pubDate>2020-02-20T00:00:00+01:00</pubDate>
        </item>

        <item>
                <title>Web domain geolocation and spatial analysis with QGIS</title>
                <description>&lt;figure class=&quot;image&quot;&gt;
  &lt;img src=&quot;/images/2020/02/domains-provinces-map.png&quot; alt=&quot;Map of domain locations&quot; /&gt;
&lt;/figure&gt;

&lt;p&gt;A few weeks ago one of my web archiving colleagues approached me with an interesting question. From a list of Dutch web domains, he wanted to identify the (Dutch) province in which each domain is hosted. He was particularly interested in domains hosted in the province of &lt;a href=&quot;https://en.wikipedia.org/wiki/Friesland&quot;&gt;Friesland&lt;/a&gt;. After some experimentation I was able to answer this question using a two-step procedure:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Geo-locate the web domains using a custom Python script.&lt;/li&gt;
  &lt;li&gt;Combine the results of the geolocation exercise with openly available geographical data using &lt;a href=&quot;https://en.wikipedia.org/wiki/QGIS&quot;&gt;&lt;em&gt;QGIS&lt;/em&gt;&lt;/a&gt;, an open-source geographical information system (GIS).&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Even though the outcome of the analysis is not particularly interesting, I imagine both the geolocation methodology and the GIS analysis steps might be useful to others. So, this blog post is primarily intended as a tutorial that gives a walkthrough of the steps I followed.&lt;/p&gt;

&lt;!-- more --&gt;

&lt;h2 id=&quot;web-domain-geolocation&quot;&gt;Web domain geolocation&lt;/h2&gt;

&lt;p&gt;In Python, the &lt;a href=&quot;https://github.com/maxmind/GeoIP2-python&quot;&gt;&lt;em&gt;GeoIP2&lt;/em&gt;&lt;/a&gt; module can be used to get geolocation data (latitude, longitude, but also country and city names) for any IP address. IP addresses can be queried either through a web service or against a local database. I went for the local database option, which requires an up-to-date version of the &lt;em&gt;GeoLite2&lt;/em&gt; City database. It can be downloaded from &lt;a href=&quot;https://dev.maxmind.com/geoip/geoip2/geolite2&quot;&gt;the &lt;em&gt;MaxMind&lt;/em&gt; developer site&lt;/a&gt;; to do so you first need to create an account (which is free). Note that this database is widely used; for example, the British Library used it for geo-locating their 2014 domain crawl&lt;sup id=&quot;fnref:5&quot;&gt;&lt;a href=&quot;#fn:5&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;1&lt;/a&gt;&lt;/sup&gt;.&lt;/p&gt;
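
&lt;p&gt;As a rough illustration of the local database option, the sketch below looks up one IP address with &lt;em&gt;GeoIP2&lt;/em&gt;. It is not taken from the actual script: the function names &lt;em&gt;open_reader&lt;/em&gt; and &lt;em&gt;locate&lt;/em&gt; and the dictionary keys are assumptions made for this example.&lt;/p&gt;

```python
def open_reader(database_path):
    """Open a GeoLite2 City database (needs the third-party geoip2 package)."""
    import geoip2.database  # imported lazily so the rest runs without it
    return geoip2.database.Reader(database_path)


def locate(reader, ip_address):
    """Return country code, city name, latitude and longitude for an IP.

    Any of the four values may be None if the database has no data for it.
    """
    response = reader.city(ip_address)
    return {
        "countryCode": response.country.iso_code,
        "city": response.city.name,
        "latitude": response.location.latitude,
        "longitude": response.location.longitude,
    }
```

&lt;p&gt;A typical call would be &lt;em&gt;locate(open_reader(&quot;GeoLite2-City.mmdb&quot;), ip_address)&lt;/em&gt;, where the database path is a placeholder. Note that &lt;em&gt;reader.city()&lt;/em&gt; raises &lt;em&gt;geoip2.errors.AddressNotFoundError&lt;/em&gt; for addresses that are not in the database, so a real script would need to catch that.&lt;/p&gt;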

&lt;h2 id=&quot;ip-lookup&quot;&gt;IP lookup&lt;/h2&gt;

&lt;p&gt;Since the &lt;em&gt;GeoIP2&lt;/em&gt; module expects IP addresses as input, we first need to establish the IP address that is associated with each web domain in our list. In theory it should be possible to do IP lookups natively in &lt;em&gt;Python&lt;/em&gt; (e.g. see &lt;a href=&quot;https://stackoverflow.com/questions/6422907/ip-address-by-domain-name&quot;&gt;here&lt;/a&gt;). However, after running into various problems, I ultimately wrote a simple wrapper function around the Unix &lt;a href=&quot;https://linux.die.net/man/1/host&quot;&gt;&lt;em&gt;host&lt;/em&gt; command&lt;/a&gt;. Unfortunately this does mean the script will only run in Unix/Linux environments.&lt;/p&gt;
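
&lt;p&gt;A wrapper of this kind could look roughly like the sketch below. This is an illustration under my own assumptions rather than the code from the script: the function names are made up, and it only picks up the first IPv4 address that &lt;em&gt;host&lt;/em&gt; reports.&lt;/p&gt;

```python
import subprocess


def parse_host_output(output):
    """Extract the first IPv4 address from the output of the host command."""
    for line in output.splitlines():
        # Typical line: "example.nl has address 192.0.2.10"
        if " has address " in line:
            return line.rsplit(" ", 1)[-1].strip()
    return None


def lookup_ip(domain):
    """Run host for a domain; return an IPv4 address, or None on failure."""
    try:
        result = subprocess.run(
            ["host", "-t", "A", domain],
            capture_output=True, text=True, timeout=10, check=False,
        )
    except (OSError, subprocess.TimeoutExpired):
        # host is not installed, or the lookup took too long
        return None
    return parse_host_output(result.stdout)
```

&lt;p&gt;Keeping the parsing in a separate function makes it possible to test it without network access.&lt;/p&gt;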

&lt;h2 id=&quot;geolocation-script&quot;&gt;Geolocation script&lt;/h2&gt;

&lt;p&gt;The &lt;a href=&quot;https://gist.github.com/bitsgalore/b05c73934aece90e5a1b2a53fcce6f5b&quot;&gt;complete geolocation script is available here&lt;/a&gt;. It takes three command-line arguments:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;An input file. This is a text file, with each line containing one web domain. It has the following format (note that these are not valid URLs because they don’t include a leading scheme/protocol component):
    &lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;kruisbandkliniek.nl
anima-communicatie.nl
nieuwebrabander.nl
isisschuurman.nl
wamail.nl
kopenzonderklussen.nl
muziekles-laren.nl
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;    &lt;/div&gt;
  &lt;/li&gt;
  &lt;li&gt;An output file.&lt;/li&gt;
  &lt;li&gt;The location of the database file.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The output file is a comma-delimited text file. For each domain, it reports:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;A &lt;em&gt;Boolean&lt;/em&gt; flag that indicates whether the domain can be mapped to an IP address.&lt;/li&gt;
  &lt;li&gt;A country ISO code (if available).&lt;/li&gt;
  &lt;li&gt;A city name (if available).&lt;/li&gt;
  &lt;li&gt;Latitude and longitude (if available).&lt;/li&gt;
&lt;/ol&gt;
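
&lt;p&gt;To give an idea of what such an output file could look like, here is a minimal sketch that writes one row per domain using the &lt;em&gt;csv&lt;/em&gt; module from the Python standard library. The column names are assumptions for this example and may differ from those in the actual script:&lt;/p&gt;

```python
import csv

# Hypothetical column names; the real script may use different headings
FIELDS = ["domain", "hasIP", "countryCode", "city", "latitude", "longitude"]


def write_rows(rows, handle):
    """Write one CSV row per domain; missing values become empty fields."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS)
    writer.writeheader()
    for row in rows:
        # Keys that are absent from a row are written as empty strings
        writer.writerow(row)
```

&lt;p&gt;Domains that cannot be mapped to an IP address end up as rows with empty location fields, which is why QGIS later reports discarded records during import.&lt;/p&gt;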

&lt;h2 id=&quot;qgis&quot;&gt;QGIS&lt;/h2&gt;

&lt;p&gt;After I ran the script on the list of Dutch domain names, I noticed that city name values were often missing. However, latitude and longitude were always available (provided that a domain can be mapped to an IP address), so that’s what we’ll work with here. In order to answer the question that started this whole exercise (“which domains are hosted in the province of Friesland”), we need to do some additional spatial analysis. &lt;a href=&quot;https://en.wikipedia.org/wiki/Geographic_information_system&quot;&gt;Geographic Information Systems&lt;/a&gt; (GIS) are a class of software that are specifically suited to this. So, I downloaded and installed the free and open-source &lt;a href=&quot;https://qgis.org/en/site/&quot;&gt;QGIS&lt;/a&gt; software.&lt;/p&gt;

&lt;p&gt;A general discussion of QGIS is beyond the scope of this blog post; however the &lt;em&gt;Programming Historian&lt;/em&gt; website has &lt;a href=&quot;https://programminghistorian.org/en/lessons/qgis-layers&quot;&gt;a good introduction to QGIS&lt;/a&gt;&lt;sup id=&quot;fnref:1&quot;&gt;&lt;a href=&quot;#fn:1&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;2&lt;/a&gt;&lt;/sup&gt;. For the analyses covered in this blog post I used QGIS 2.8.6 (which is the slightly outdated version that is in the default Ubuntu software repositories).&lt;/p&gt;

&lt;h2 id=&quot;visualise-the-point-data&quot;&gt;Visualise the point data&lt;/h2&gt;

&lt;p&gt;As a first step, let’s visualise the output of the geolocation script. Launch QGIS, then from the &lt;em&gt;Layer&lt;/em&gt; menu select &lt;em&gt;Add Layer&lt;/em&gt; / &lt;em&gt;Add Delimited Text Layer …&lt;/em&gt;:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/2020/02/add-text-layer.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;This opens up a dialog window where you can select the output file of the geolocation script. There are also some options you can set here, but in this case you can stick to the default values:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/2020/02/textlayer-dialog.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Under &lt;em&gt;Geometry definition&lt;/em&gt; you can select the columns with the X and Y coordinates. As it turns out, QGIS is clever enough to figure these out by itself from the column headings in the file. Click &lt;em&gt;OK&lt;/em&gt; to start importing the file. Note that if any records in the file don’t contain latitude/longitude values, you will see an error message like this:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;Errors in file /home/johan/test-geolocatDomains/out_NL_geo_Johan.csv
594 records discarded due to missing geometry definitions
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;It is safe to ignore this, as any remaining records are imported correctly. After closing the error message, you need to specify the coordinate reference system in the following dialog&lt;sup id=&quot;fnref:4&quot;&gt;&lt;a href=&quot;#fn:4&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;3&lt;/a&gt;&lt;/sup&gt;:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/2020/02/crs-points.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Since the point data file uses geographic coordinates (latitude/longitude pairs), select &lt;em&gt;WGS 84&lt;/em&gt; (which is also the default) and press &lt;em&gt;OK&lt;/em&gt;. If all goes well you will now see the imported points:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/2020/02/points-imported.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Note that you can zoom in to parts of the map view for a more detailed look. Before going any further, this is a good moment to save our QGIS project; to do so open the &lt;em&gt;Project&lt;/em&gt; menu on the left, select &lt;em&gt;Save&lt;/em&gt;, type a file name and then click on the &lt;em&gt;Save&lt;/em&gt; button. Done!&lt;/p&gt;

&lt;h2 id=&quot;add-a-provinces-layer&quot;&gt;Add a provinces layer&lt;/h2&gt;

&lt;p&gt;The objective of this exercise was to link web domains to provinces. A free, generalised vector layer of Dutch province boundaries is available &lt;a href=&quot;https://www.nationaalgeoregister.nl/geonetwork/srv/dut/catalog.search#/metadata/e73b01f6-28c7-4bb7-a782-e877e8113e2c&quot;&gt;here&lt;/a&gt; in a number of formats. I downloaded the &lt;a href=&quot;https://en.wikipedia.org/wiki/GeoPackage&quot;&gt;GeoPackage&lt;/a&gt; version (which is essentially an SQLite database). After unzipping the file, go to the &lt;em&gt;Layer&lt;/em&gt; menu and select &lt;em&gt;Add Vector Layer …&lt;/em&gt;:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/2020/02/add-vector.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Then open the downloaded GeoPackage&lt;sup id=&quot;fnref:6&quot;&gt;&lt;a href=&quot;#fn:6&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;4&lt;/a&gt;&lt;/sup&gt; file in the dialog that appears:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/2020/02/add-vector-2.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;After opening the file, the map view looks like this:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/2020/02/provinces-after-import.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Note that the provinces layer is visible as a tiny purple blob at the center of the image! The reason for this is that the default zoom level of the map view covers the whole geographical extent of the domain points data, which apparently includes locations all over the world! So let’s change that. In the &lt;em&gt;Layers&lt;/em&gt; panel, hover over the provinces layer, right-click and then select &lt;em&gt;Zoom to Layer&lt;/em&gt;:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/2020/02/zoom-provinces.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Result:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/2020/02/zoomed-provinces.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;As you can see, the provinces layer now obscures the point data. You can fix this by dragging the provinces layer to the bottom in the &lt;em&gt;Layers&lt;/em&gt; panel:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/2020/02/points-top.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;h2 id=&quot;customise-appearance-of-provinces-layer&quot;&gt;Customise appearance of provinces layer&lt;/h2&gt;

&lt;p&gt;Now let’s improve the appearance of the provinces layer. In the &lt;em&gt;Layers&lt;/em&gt; panel, hover over the provinces layer, right-click and then select &lt;em&gt;Properties&lt;/em&gt;:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/2020/02/layer-properties.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;In the dialog that appears, first click on &lt;em&gt;Style&lt;/em&gt; in the left pane. Click on &lt;em&gt;Single Symbol&lt;/em&gt; in the drop-down menu, change it to &lt;em&gt;Categorized&lt;/em&gt;, and then click on the &lt;em&gt;Apply&lt;/em&gt; button:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/2020/02/style-1.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Now click on the &lt;em&gt;Column&lt;/em&gt; drop-down menu, and select the column (attribute/field) that represents the map classes. In this case I used the &lt;em&gt;PROV_CODE&lt;/em&gt; column. Next, click on the &lt;em&gt;Classify&lt;/em&gt; button, which automatically assigns colours and legend values:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/2020/02/categorize.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Let’s also add some labels. Click on &lt;em&gt;Labels&lt;/em&gt; in the left pane, and tick the &lt;em&gt;Label this layer with&lt;/em&gt; checkbox. Use the drop-down menu to select the column (attribute/field) that is used to assign the labels. In my case I used the &lt;em&gt;PROV_NAAM&lt;/em&gt; (“province name”) column:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/2020/02/labels.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Click on the &lt;em&gt;Apply&lt;/em&gt; button again, and then on &lt;em&gt;OK&lt;/em&gt;. Our map view now looks like this:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/2020/02/provinces-points.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Which is pretty decent! You can use the &lt;em&gt;Save as Image …&lt;/em&gt; item in the &lt;em&gt;Project&lt;/em&gt; menu to export the map layout into a variety of image formats.&lt;/p&gt;

&lt;h2 id=&quot;spatial-analysis&quot;&gt;Spatial analysis&lt;/h2&gt;

&lt;p&gt;Having a pretty-looking map is nice, but in order to know which domains are hosted in which provinces we need to do some actual analysis. More specifically, for each web domain (point) we need to extract the corresponding attributes from the provinces vector layer. This is one of the most basic functionalities of QGIS. However, most of the analysis tools of QGIS require that all input layers have the same coordinate reference system. This is not the case here: the geographical coordinates of our point data are defined as latitude/longitude pairs, whereas the provinces vector layer uses the &lt;a href=&quot;https://www.spatialreference.org/ref/epsg/amersfoort-rd-new/&quot;&gt;Amersfoort / RD New&lt;/a&gt; coordinate system (which is commonly used for maps of the Netherlands)!&lt;/p&gt;

&lt;h2 id=&quot;transform-point-data-to-coordinate-system-of-provinces-layer&quot;&gt;Transform point data to coordinate system of provinces layer&lt;/h2&gt;

&lt;p&gt;So, before we go any further we first need to transform our point data to the province layer’s coordinate system&lt;sup id=&quot;fnref:2&quot;&gt;&lt;a href=&quot;#fn:2&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;5&lt;/a&gt;&lt;/sup&gt;. In the &lt;em&gt;Layers&lt;/em&gt; panel, hover over the points layer and right-click &lt;em&gt;Save As …&lt;/em&gt;:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/2020/02/points-save-as.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Set &lt;em&gt;Format&lt;/em&gt; to &lt;em&gt;Geography Markup Language [GML]&lt;/em&gt; and specify a file name and location. In the &lt;em&gt;CRS&lt;/em&gt; dropdown menu, select the coordinate system of the provinces layer (here: &lt;em&gt;Amersfoort / RD New&lt;/em&gt;). Leave all other options at their default values:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/2020/02/points-save-as-2.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;After pressing &lt;em&gt;OK&lt;/em&gt; the new layer is automatically added to the map view.&lt;/p&gt;
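
&lt;p&gt;The same transformation can also be done outside QGIS. The sketch below uses the third-party &lt;em&gt;pyproj&lt;/em&gt; library, which is not part of the workflow described in this post; the function names are made up for the example. EPSG:4326 is the code for &lt;em&gt;WGS 84&lt;/em&gt;, and EPSG:28992 is the code for &lt;em&gt;Amersfoort / RD New&lt;/em&gt;:&lt;/p&gt;

```python
def make_transformer():
    """Build a WGS 84 to Amersfoort / RD New transformer (needs pyproj)."""
    from pyproj import Transformer  # third-party, imported lazily
    # always_xy=True fixes the axis order to (longitude, latitude) in and
    # (x, y) out, regardless of the order defined by each CRS
    return Transformer.from_crs("EPSG:4326", "EPSG:28992", always_xy=True)


def to_rd(transformer, latitude, longitude):
    """Convert one latitude/longitude pair to RD x/y coordinates in metres."""
    x, y = transformer.transform(longitude, latitude)
    return x, y
```

&lt;p&gt;The axis order is the part that is easiest to get wrong here: the geolocation output stores latitude first, whereas the transformer in this sketch expects longitude first.&lt;/p&gt;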

&lt;p&gt;While at it, let’s also change the project coordinate system to &lt;em&gt;Amersfoort / RD New&lt;/em&gt;&lt;sup id=&quot;fnref:3&quot;&gt;&lt;a href=&quot;#fn:3&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;6&lt;/a&gt;&lt;/sup&gt;: in the &lt;em&gt;Project&lt;/em&gt; menu, click on &lt;em&gt;Project Properties …&lt;/em&gt;. In the dialog that now appears, click on &lt;em&gt;Amersfoort / RD New&lt;/em&gt; in the &lt;em&gt;Recently used coordinate reference systems&lt;/em&gt; pane, and then press &lt;em&gt;Apply&lt;/em&gt;:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/2020/02/set-project-crs.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Now press &lt;em&gt;OK&lt;/em&gt;, and you should see something like this:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/2020/02/mapview-amersfoort.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;h2 id=&quot;join-attributes-of-point-and-province-layers&quot;&gt;Join attributes of point and province layers&lt;/h2&gt;

&lt;p&gt;We’re now ready to combine the attributes of the point and vector layers. From the &lt;em&gt;Vector&lt;/em&gt; menu, select &lt;em&gt;Data Management Tools&lt;/em&gt;, and then &lt;em&gt;Join Attributes by Location …&lt;/em&gt;:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/2020/02/join-attributes-1.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Next, set the &lt;em&gt;Target vector layer&lt;/em&gt; to the layer with the (transformed) point data, and the &lt;em&gt;Join vector layer&lt;/em&gt; to the provinces vector layer. Leave &lt;em&gt;Attribute Summary&lt;/em&gt; at its default value (&lt;em&gt;Take attributes of first located feature&lt;/em&gt;), specify an output file and set the &lt;em&gt;Output table&lt;/em&gt; setting to &lt;em&gt;Keep all records&lt;/em&gt;:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/2020/02/join-attributes-2.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Now press &lt;em&gt;OK&lt;/em&gt;. After a while this dialog pops up:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/2020/02/layertotoc.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Click &lt;em&gt;Yes&lt;/em&gt;, and then close the &lt;em&gt;Join Attributes&lt;/em&gt; dialog window.&lt;/p&gt;
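
&lt;p&gt;Conceptually, a join by location boils down to a point-in-polygon test for each point. The sketch below illustrates the idea with a simple ray-casting test in plain Python. It is not how QGIS implements the tool, it ignores details such as holes and multi-part polygons, and all names in it are made up for the example:&lt;/p&gt;

```python
def point_in_polygon(x, y, polygon):
    """Ray-casting test: is the point (x, y) inside the polygon?

    The polygon is a list of (x, y) vertices. A horizontal ray is cast
    from the point to the right; an odd number of edge crossings means
    the point lies inside.
    """
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through y can cross
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x_cross > x:
                inside = not inside
    return inside


def join_by_location(points, provinces):
    """Attach the name of the first matching province to each point.

    Points outside all provinces are kept with None as province, which
    mirrors the Keep all records setting.
    """
    joined = []
    for name, x, y in points:
        match = None
        for province, polygon in provinces.items():
            if point_in_polygon(x, y, polygon):
                match = province
                break
        joined.append((name, match))
    return joined
```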

&lt;h2 id=&quot;export-to-text-file&quot;&gt;Export to text file&lt;/h2&gt;

&lt;p&gt;As a final step we export the layer that we created in the previous step to a delimited text file. In the &lt;em&gt;Layers&lt;/em&gt; panel, hover over the new layer and right-click &lt;em&gt;Save As …&lt;/em&gt;:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/2020/02/exportjoineddata.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;In the dialog that now pops up, set &lt;em&gt;Format&lt;/em&gt; to &lt;em&gt;Comma Separated Value&lt;/em&gt;, specify a file name and press &lt;em&gt;OK&lt;/em&gt;:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/2020/02/savevectorlayeras.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;You can now import the resulting text file in your favourite spreadsheet application. Here’s what it looks like in LibreOffice Calc:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/2020/02/domains-provinces-spreadheet.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;As you can see here, each domain record now contains a column with the corresponding province name (and some additional province-specific columns), and from this we can easily identify the domains that are hosted in Friesland (or any other province):&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/2020/02/domains-fryslan.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;
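
&lt;p&gt;The same selection can also be made outside the spreadsheet. The following is a minimal shell sketch and not part of the workflow described above: the function name &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;filter_province&lt;/code&gt; is hypothetical, and it simply matches the province name anywhere in a line, which assumes the name does not occur in any other column.&lt;/p&gt;

```shell
# Print the header line plus every record that mentions the given province.
# Usage: filter_province CSV_FILE PROVINCE_NAME
# Simplification: the name is matched anywhere in the line, not in one column.
filter_province() {
    awk -v name="$2" 'NR == 1 || index($0, name) != 0' "$1"
}
```

&lt;p&gt;Calling &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;filter_province joined.csv Friesland&lt;/code&gt; (where &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;joined.csv&lt;/code&gt; stands for the exported file) would print the header row followed by the matching domain records. The province name has to be spelled exactly as it appears in the provinces layer.&lt;/p&gt;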

&lt;h2 id=&quot;final-thoughts&quot;&gt;Final thoughts&lt;/h2&gt;

&lt;p&gt;As I’m originally a geographer by training, this was a nice opportunity to go back to my roots a bit and muck around with geographical data. The last time I did any serious work with geographic information systems was around 2007; back then even simple analyses like this one needed expensive proprietary software. I hadn’t used QGIS before, and even though I’ve only scratched the surface here it looks like a really useful addition to the toolbox of anyone working with geodata.&lt;/p&gt;

&lt;h2 id=&quot;further-resources&quot;&gt;Further resources&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://gist.github.com/bitsgalore/b05c73934aece90e5a1b2a53fcce6f5b&quot;&gt;Python script for geolocation of web domains&lt;/a&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://programminghistorian.org/en/lessons/qgis-layers&quot;&gt;Installing QGIS 2.0 and Adding Layers (the Programming Historian)&lt;/a&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://docs.qgis.org/3.4/en/docs/gentle_gis_introduction/coordinate_reference_systems.html&quot;&gt;Coordinate Reference Systems (QGIS Documentation)&lt;/a&gt;&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;div class=&quot;footnotes&quot; role=&quot;doc-endnotes&quot;&gt;
  &lt;ol&gt;
    &lt;li id=&quot;fn:5&quot;&gt;
      &lt;p&gt;Geo-location in the 2014 UK Domain Crawl: &lt;a href=&quot;https://blogs.bl.uk/webarchive/2015/07/geo-location-in-the-2014-uk-domain-crawl.html&quot;&gt;https://blogs.bl.uk/webarchive/2015/07/geo-location-in-the-2014-uk-domain-crawl.html&lt;/a&gt; &lt;a href=&quot;#fnref:5&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:1&quot;&gt;
      &lt;p&gt;Note that this is based on a pretty old version of QGIS, so some things may have changed since then. &lt;a href=&quot;#fnref:1&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:4&quot;&gt;
      &lt;p&gt;A good introduction to coordinate reference systems can be found here: &lt;a href=&quot;https://docs.qgis.org/3.4/en/docs/gentle_gis_introduction/coordinate_reference_systems.html&quot;&gt;https://docs.qgis.org/3.4/en/docs/gentle_gis_introduction/coordinate_reference_systems.html&lt;/a&gt; &lt;a href=&quot;#fnref:4&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:6&quot;&gt;
      &lt;p&gt;The other formats will most likely work fine as well, but I just selected GeoPackage, as unlike the Shapefile format it’s an open standard. &lt;a href=&quot;#fnref:6&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:2&quot;&gt;
      &lt;p&gt;Or the other way round. &lt;a href=&quot;#fnref:2&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:3&quot;&gt;
      &lt;p&gt;This step is not necessary for the analysis, but I just did this because I don’t really like the way the map view looks in the WGS 84 coordinate system. &lt;a href=&quot;#fnref:3&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
  &lt;/ol&gt;
&lt;/div&gt;
</description>
                <link>https://bitsgalore.org/2020/02/11/web-domain-geolocation-and-spatial-analysis</link>
                <guid>https://bitsgalore.org/2020/02/11/web-domain-geolocation-and-spatial-analysis</guid>
                <pubDate>2020-02-11T00:00:00+01:00</pubDate>
        </item>

        <item>
                <title>Recovering '90s Data Tapes - Experiences From the KB Web Archaeology project (iPres 2019 paper)</title>
                <description>&lt;figure class=&quot;image&quot;&gt;
  &lt;img src=&quot;/images/2019/09/tapes-dds-dlt.jpg&quot; alt=&quot;DDS-1 (left) and DLT-IV (right) tape&quot; /&gt;
&lt;/figure&gt;

&lt;p&gt;Earlier this year I published &lt;a href=&quot;/2019/01/31/roll-the-tape-recovering-90s-data-tapes-in-bitcurator&quot;&gt;this blog post&lt;/a&gt; on the recovery of data from ’90s data tapes. I will give a presentation on this during &lt;a href=&quot;https://ipres2019.org/&quot;&gt;the upcoming &lt;em&gt;iPres&lt;/em&gt; 2019 conference&lt;/a&gt;, and wrote a paper that discusses this work in more detail than my earlier blog post. The original paper (in PDF format) can be found &lt;a href=&quot;https://ipres2019.org/static/pdf/iPres2019_paper_9.pdf&quot;&gt;here&lt;/a&gt;. The paper references a wealth of useful resources, but some of these are not easily accessible because the &lt;a href=&quot;https://en.wikipedia.org/wiki/LaTeX&quot;&gt;&lt;em&gt;LaTeX&lt;/em&gt;&lt;/a&gt; template used does not handle hyperlinks well (this will be fixed in the final, post-conference version of the paper). Because of this I’ve created a web-friendly version of the paper below.&lt;/p&gt;

&lt;!-- more --&gt;
&lt;hr /&gt;

&lt;h2 id=&quot;abstract&quot;&gt;Abstract&lt;/h2&gt;

&lt;p&gt;The recovery of digital data from tape formats from the mid
to late ’90s is not well covered by existing digital preservation and
forensics literature. This paper addresses this knowledge gap with a
discussion of the hardware and software that can be used to read such
tapes. It introduces &lt;em&gt;tapeimgr&lt;/em&gt;, a user-friendly software application
that allows one to read tapes in a format-agnostic manner. It also
presents workflows that integrate the discussed hardware and software
components. It then shows how these workflows were used to recover the
contents of a set of DDS-1, DDS-3 and DLT-IV tapes from the mid to late
’90s. These tapes contain the source data of a number of “lost” web
sites that the National Library of the Netherlands (KB) is planning to
reconstruct at a later stage as part of its ongoing Web Archaeology
project. The paper also presents some first results of sites from 1995
that have already been reconstructed from these tapes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Keywords&lt;/strong&gt; – tapes, digital forensics, web archaeology&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Conference Topics&lt;/strong&gt; – The Cutting Edge: Technical Infrastructure and
Implementation&lt;/p&gt;

&lt;h2 id=&quot;introduction&quot;&gt;Introduction&lt;/h2&gt;

&lt;p&gt;When the National Library of the Netherlands (hereafter: KB) launched
its web archive in 2007, many sites from the “early” Dutch web had
already gone offline. As a result, the time period between (roughly)
1992 and 2000 is under-represented in the web archive. To improve the
coverage of web sites from this historically important era, the KB has
started to investigate the use of tools and methods from the emerging
field of “web archaeology” &lt;a href=&quot;https://doi.org/10.1177/0955749017725930&quot;&gt;[1]&lt;/a&gt;.
Analogous to how archaeologists study past
cultures from excavated physical artefacts, web archaeology is about
reconstructing “lost” web sites using data that are recovered from old
(and often obsolete) physical carriers. It is worth noting that Ross and
Gow introduced the concept of “digital archaeology” (of which web
archaeology is a special case) as early as 1999 &lt;a href=&quot;http://www.ukoln.ac.uk/services/elib/papers/supporting/pdf/p2.pdf&quot;&gt;[2]&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Over the last year, the KB web archiving team has reached out to a
number of creators of “early” Dutch web sites that are no longer online.
Many of these creators still possess offline information carriers with
the original source data of their sites. This would potentially allow us
to reconstruct those sites, and then ingest them into the web archive.
The overall approach would be similar to how we already reconstructed
the first Dutch web index &lt;em&gt;NL-Menu&lt;/em&gt; in 2018 &lt;a href=&quot;https://www.bitsgalore.org/2018/04/24/resurrecting-the-first-dutch-web-index-nl-menu-revisited&quot;&gt;[3]&lt;/a&gt;,
&lt;a href=&quot;https://www.bitsgalore.org/2018/07/11/crawling-offline-web-content-the-nl-menu-case&quot;&gt;[4]&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;A few of these creators have already provided us with sample sets of
carriers which, though limited in size, comprise a range of physical
formats, such as CD-ROMs, floppy disks, ZIP disks, USB thumb drives, and
(internal) hard disks. One sample set was provided to us by the former
owners of &lt;em&gt;xxLINK&lt;/em&gt;, a web development and hosting company that was
founded in 1994. It was the first Dutch company that provided these
services, and throughout the ’90s &lt;em&gt;xxLINK&lt;/em&gt; created web sites for a large
number of well-known Dutch companies and institutions&lt;sup id=&quot;fnref:1&quot;&gt;&lt;a href=&quot;#fn:1&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;1&lt;/a&gt;&lt;/sup&gt;. A
particularly interesting feature of the &lt;em&gt;xxLINK&lt;/em&gt; sample set is that it
includes 33 data tapes.&lt;/p&gt;

&lt;p&gt;There is a relative wealth of digital preservation and digital forensics
literature on the recovery of data from physical carriers. Examples
include Ross and Gow &lt;a href=&quot;http://www.ukoln.ac.uk/services/elib/papers/supporting/pdf/p2.pdf&quot;&gt;[2]&lt;/a&gt;,
Elford et al. &lt;a href=&quot;http://archive.ifla.org/IV/ifla74/papers/084-Webb-en.pdf&quot;&gt;[5]&lt;/a&gt;,
Woods and Brown &lt;a href=&quot;https://kamwoods.net/publications/woodsbrownarch09.pdf&quot;&gt;[6]&lt;/a&gt;,
Woods et al. &lt;a href=&quot;http://doi.acm.org/10.1145/1998076.1998088&quot;&gt;[7]&lt;/a&gt;,
Lee et al. &lt;a href=&quot;http://www.dlib.org/dlib/may12/lee/05lee.html&quot;&gt;[8]&lt;/a&gt;,
John &lt;a href=&quot;http://dx.doi.org/10.7207/twr12-03&quot;&gt;[9]&lt;/a&gt;
and Pennock et al. &lt;a href=&quot;https://doi.org/10.5281/zenodo.1321629&quot;&gt;[10]&lt;/a&gt;.
For many carrier types published workflow descriptions are readily available
(see e.g. Prael and Wickner &lt;a href=&quot;https://practicaltechnologyforarchives.org/issue4_prael_wickner/&quot;&gt;[11]&lt;/a&gt;,
Salo &lt;a href=&quot;https://radd.dsalo.info/wp-content/uploads/2017/10/BuildDocumentation.pdf&quot;&gt;[12]&lt;/a&gt;
and the workflows published by the &lt;em&gt;BitCurator Consortium&lt;/em&gt; &lt;a href=&quot;https://bitcuratorconsortium.org/workflows&quot;&gt;[13]&lt;/a&gt;,
to name but a few). Even though these cover a wide range of physical
carrier types, the existing literature provides surprisingly little
information on how to recover data from legacy tape formats. Among the
few exceptions are De Haan &lt;a href=&quot;https://doi.org/10.5281/zenodo.1255965&quot;&gt;[14]&lt;/a&gt;
and De Haan et al. &lt;a href=&quot;https://hart.amsterdam/image/2017/11/17/20171116_freeze_diy_handboek.pdf&quot;&gt;[15]&lt;/a&gt;,
who describe how they rescued 11 GB worth of data from three DLT tapes.
However, they do not provide much detail about the hardware and software setup
they used for this.&lt;/p&gt;

&lt;p&gt;Reading these legacy tape formats presents a number of challenges.
First, it requires specific hardware that is now largely obsolete. This
includes not only the actual tape readers, but also host adapters that
are needed to connect a tape reader to a modern forensic workstation,
cables and adapter plugs. Because of this, finding the “right” hardware
setup is often not straightforward. Furthermore, since the original
software that was used to write (and read) legacy data tapes may not be
available anymore (if it is known at all), the tapes should be read in a
format-agnostic way at the block device level. This can be done with
existing software tools, but these tools are not very user-friendly, and
the resulting workflows can be quite unwieldy. Also, the logical
interpretation of data files that have been recovered from tape requires
some additional work. Finally, even though there are still various
online resources that cover reading these tapes&lt;sup id=&quot;fnref:2&quot;&gt;&lt;a href=&quot;#fn:2&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;2&lt;/a&gt;&lt;/sup&gt;, the information
they provide is often fragmentary, or geared towards specific backup
software or hardware. This is especially true for older resources that
date back to the time when these tape formats were in heavy use.&lt;/p&gt;

&lt;p&gt;Hence, there appears to be a knowledge gap. The overall aim of this
paper is to fill this gap by discussing the hardware and software that
can be used to read such tapes, and by presenting practical workflows
that allowed us to recover the information from the &lt;em&gt;xxLINK&lt;/em&gt; tapes.
These workflows are largely based on current hard- and software. They
are also fully open source, and can be easily integrated into
Linux-based platforms, including the popular &lt;em&gt;BitCurator&lt;/em&gt;&lt;sup id=&quot;fnref:3&quot;&gt;&lt;a href=&quot;#fn:3&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;3&lt;/a&gt;&lt;/sup&gt;
environment.&lt;/p&gt;

&lt;h2 id=&quot;outline&quot;&gt;Outline&lt;/h2&gt;

&lt;p&gt;This paper starts with a brief overview of the tape formats in the
&lt;em&gt;xxLINK&lt;/em&gt; sample set. This is followed by a discussion of the hardware
that is needed for accessing tapes like these. This section also
provides some suggestions that will hopefully be useful to others who
are starting similar tape-related work. It then suggests a
format-agnostic procedure for reading the data on the tapes, and
presents a new software application that was developed specifically for
reading tapes in a simple and user-friendly manner. Next follows a
discussion of how this hardware and software setup was integrated into
workflows, and how these workflows were used to recover the data on the
&lt;em&gt;xxLINK&lt;/em&gt; tapes. This is followed by two sections that explain the
further processing of the recovered data: the extraction of the
resulting container files, and the subsequent reconstruction of any
“lost” web sites whose underlying data are enclosed in them. The latter
section also shows some first results of sites that were recovered from
a 1995 tape. The closing section summarizes the main conclusions.&lt;/p&gt;

&lt;h2 id=&quot;tape-formats&quot;&gt;Tape formats&lt;/h2&gt;

&lt;p&gt;The majority (19) of the tapes in the &lt;em&gt;xxLINK&lt;/em&gt; sample set are DDS tapes,
most of which were written in 1995. DDS (Digital Data Storage) is a
family of tape formats that are based on Digital Audio Tape (DAT). Using
the product codes I was able to identify the majority of these DDS tapes
as DDS-1, which was the first generation of DDS. DDS-1 was introduced in
1989, and has a maximum capacity of 2 GB (uncompressed), or 4 GB
(compressed). Two tapes could be identified as DDS-3, a format which was
introduced in 1996 with a maximum capacity of 12 GB (uncompressed), or
24 GB (compressed). A total of 7 DDS generations have been released, the
final one being DAT-320 in 2009&lt;sup id=&quot;fnref:4&quot;&gt;&lt;a href=&quot;#fn:4&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;4&lt;/a&gt;&lt;/sup&gt;. Backward read compatibility of DDS
drives is typically limited to 2 or 3 generations&lt;sup id=&quot;fnref:5&quot;&gt;&lt;a href=&quot;#fn:5&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;5&lt;/a&gt;&lt;/sup&gt;. The &lt;em&gt;xxLINK&lt;/em&gt; set
also contains 14 DLT-IV tapes which were mostly written in 1999. DLT-IV
is a member of the Digital Linear Tape (DLT) family of tape formats,
which dates back to 1984. DLT-IV was first introduced in 1994&lt;sup id=&quot;fnref:6&quot;&gt;&lt;a href=&quot;#fn:6&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;6&lt;/a&gt;&lt;/sup&gt;, and
has a capacity of up to 40 GB (uncompressed), or 80 GB (compressed)&lt;sup id=&quot;fnref:7&quot;&gt;&lt;a href=&quot;#fn:7&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;7&lt;/a&gt;&lt;/sup&gt;.
Figure 1 shows what these tapes look like.&lt;/p&gt;

&lt;figure class=&quot;image&quot;&gt;
  &lt;img src=&quot;/images/2019/09/tapes-dds-dlt.jpg&quot; alt=&quot;DDS-1 (left) and DLT-IV (right) tape&quot; /&gt;
  &lt;figcaption&gt;Figure 1: DDS-1 (left) and DLT-IV (right) tape&lt;/figcaption&gt;
&lt;/figure&gt;

&lt;h2 id=&quot;hardware&quot;&gt;Hardware&lt;/h2&gt;

&lt;p&gt;For all data recovery workflows that are part of the web archaeology
project I set up a dedicated forensic workstation that is running the
&lt;em&gt;BitCurator&lt;/em&gt; environment. Reading the vintage tape formats in the
&lt;em&gt;xxLINK&lt;/em&gt; sample set requires some specific additional hardware, most of
which can be bought used online at a low to moderate cost. Luckily, it
turned out our IT department was still in the possession of an old
(DDS-1-compatible) DDS-2 drive, as well as a DLT-IV drive. Both drives
are shown in Figure 2.&lt;/p&gt;

&lt;figure class=&quot;image&quot;&gt;
  &lt;img src=&quot;/images/2019/09//tapedrives-dds-dlt.jpg&quot; alt=&quot;DLT-IV (bottom) and DDS-2 (top) tape drives&quot; /&gt;
  &lt;figcaption&gt;Figure 2: DLT-IV (bottom) and DDS-2 (top) tape drives&lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;In order to read the DDS-3 tapes, I purchased an additional used DAT-72
drive that has backward-compatibility with DDS-3.&lt;/p&gt;

&lt;h3 id=&quot;scsi-host-adapter&quot;&gt;SCSI host adapter&lt;/h3&gt;

&lt;p&gt;As all three tape drives have parallel SCSI&lt;sup id=&quot;fnref:8&quot;&gt;&lt;a href=&quot;#fn:8&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;8&lt;/a&gt;&lt;/sup&gt; connectors, I needed a
SCSI host adapter (“SCSI card”) to connect them to the forensic
workstation. Used SCSI cards can be easily found online, and they are
usually sold cheaply. Nevertheless, finding a model that was compatible
with both our workstation and the tape drives turned out to be somewhat
complicated. This is due to a number of reasons.&lt;/p&gt;

&lt;p&gt;First, SCSI cards often have interfaces that are not compatible with
current hardware. Many older models have a conventional PCI
interface&lt;sup id=&quot;fnref:9&quot;&gt;&lt;a href=&quot;#fn:9&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;9&lt;/a&gt;&lt;/sup&gt;, but PCI has been largely replaced by PCI Express&lt;sup id=&quot;fnref:10&quot;&gt;&lt;a href=&quot;#fn:10&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;10&lt;/a&gt;&lt;/sup&gt; on
modern motherboards and desktop machines. Some cards have a 64-bit PCI
interface, which is only compatible with enterprise servers.&lt;/p&gt;

&lt;p&gt;Even if the interface is compatible, the physical dimensions of the card
can cause further complications. Older “full-height” PCI Express cards
will not fit into a “low-profile” (also known as “half-height”) slot,
and vice versa (most modern machines only support “low-profile” cards).
Many cards were originally sold with both a “full-height” and a
“low-profile” bracket, which allows one to easily change the bracket to
fit the target machine. When buying second-hand, it is not uncommon to find
that one of the original brackets is missing.&lt;/p&gt;

&lt;p&gt;Online sellers do not always explicitly mention characteristics like
these, and even if they do this information is not necessarily accurate.
A useful resource in this regard is the web site of the Microsemi
company, which has the technical specifications of the full range of
Adaptec SCSI adapters&lt;sup id=&quot;fnref:11&quot;&gt;&lt;a href=&quot;#fn:11&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;11&lt;/a&gt;&lt;/sup&gt;. Figure 3 shows the PCI Express host adapter
that we are using in our workstation.&lt;/p&gt;

&lt;figure class=&quot;image&quot;&gt;
  &lt;img src=&quot;/images/2019/09/scsi-controller.jpg&quot; alt=&quot;PCI Express SCSI host adapter&quot; /&gt;
  &lt;figcaption&gt;Figure 3: PCI Express SCSI host adapter&lt;/figcaption&gt;
&lt;/figure&gt;

&lt;h3 id=&quot;scsi-connectors-and-terminators&quot;&gt;SCSI connectors and terminators&lt;/h3&gt;

&lt;p&gt;Rather than being one well-defined standard, parallel SCSI is actually a
family of related standards that comprise a host of different
interfaces, not all of which are mutually compatible&lt;sup id=&quot;fnref:12&quot;&gt;&lt;a href=&quot;#fn:12&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;12&lt;/a&gt;&lt;/sup&gt;, &lt;sup id=&quot;fnref:13&quot;&gt;&lt;a href=&quot;#fn:13&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;13&lt;/a&gt;&lt;/sup&gt;. None
of these standards specify what connectors should be used to
interconnect SCSI devices. Over time, this has resulted in a myriad of
connector types that have been developed by different
manufacturers&lt;sup id=&quot;fnref:14&quot;&gt;&lt;a href=&quot;#fn:14&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;14&lt;/a&gt;&lt;/sup&gt;. These are typically identified by multiple names. As
an example, the commonly used 68-pin “DB68” connector is also referred
to as “MD68”, “High-Density”, “HD 68”, “Half-Pitch” and “HP68”, whereas
the “50-contact, Centronics-type” connector is alternatively known as a
“SCSI-1” or “Alternative 2, A-cable connector”. This complicates both
identifying the connector type on a particular device and
finding suitable adapter plugs and cables. For identifying a connector,
the web site of Paralan provides a useful illustrated overview of the
most common types&lt;sup id=&quot;fnref:15&quot;&gt;&lt;a href=&quot;#fn:15&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;15&lt;/a&gt;&lt;/sup&gt;.&lt;/p&gt;

&lt;p&gt;If the tape reader is the last device at either end of the SCSI chain,
it must be fitted with a “terminator”&lt;sup id=&quot;fnref:16&quot;&gt;&lt;a href=&quot;#fn:16&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;16&lt;/a&gt;&lt;/sup&gt;, which is a resistor circuit
that prevents the electrical signal from reflecting back from the ends
of the bus. Without a terminator, the tape drive will not work properly,
or, more likely, it will not work at all. External SCSI devices like our
tape drives use terminator plugs, as shown in Figure 4. For
internal devices, termination is often achieved through jumper settings,
or by physically removing the terminating resistors from their
sockets&lt;sup id=&quot;fnref:17&quot;&gt;&lt;a href=&quot;#fn:17&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;17&lt;/a&gt;&lt;/sup&gt;.&lt;/p&gt;

&lt;figure class=&quot;image&quot;&gt;
  &lt;img src=&quot;/images/2019/09/terminator.jpg&quot; alt=&quot;SCSI terminator attached to DLT-IV drive&quot; /&gt;
  &lt;figcaption&gt;Figure 4: SCSI terminator attached to DLT-IV drive&lt;/figcaption&gt;
&lt;/figure&gt;

&lt;h3 id=&quot;cleaning-cartridges&quot;&gt;Cleaning cartridges&lt;/h3&gt;

&lt;p&gt;Over time, the heads of a tape drive will get dirty due to a gradual
accumulation of dust, and sometimes also residue from the tapes that are
used. As this ultimately results in read errors, it is important to
periodically clean the drive with a dedicated cleaning cartridge. Most
drives have an indicator that lights up when cleaning is needed. The
cleaning procedure is usually very simple, and involves nothing more
than inserting the cleaning cartridge into the machine, which then
automatically starts the cleaning cycle. A single cleaning cartridge can
be used multiple (typically about 50) times. Although I was able to
purchase cleaning cartridges for both the DDS and the DLT-IV drives
online, it is unclear whether new cartridges are still manufactured
today. Since cleaning these drives manually is neither easy nor
recommended (in fact it is likely to result in damage), the availability
of cleaning cartridges could be a concern for keeping older tape formats
like these accessible in the long run.&lt;/p&gt;

&lt;h2 id=&quot;software&quot;&gt;Software&lt;/h2&gt;

&lt;p&gt;Once the hardware is set up, a number of options are available for
reading the data from the tapes. Often, tapes contain backup archives
that were written by backup utilities such as &lt;em&gt;tar&lt;/em&gt;, &lt;em&gt;cpio&lt;/em&gt;, &lt;em&gt;dump&lt;/em&gt; or
&lt;em&gt;NTBackup&lt;/em&gt;, to name but a few. One approach would be to restore the
contents of each tape using the original software that was used to write
it. Even though many of these utilities are still available today
(especially the Unix-based ones), this approach is not a practical one.
First of all, it would require prior knowledge of the tape’s archive
format. Although we may sometimes have this knowledge (e.g., the writing
on a tape’s label may indicate it was created with the &lt;em&gt;tar&lt;/em&gt; utility),
in practice we often simply don’t know how a tape was written at all.
Also, this approach would complicate things, because each format would
require its own custom workflow. Finally, it would not work with formats
for which the original software is not readily available on the forensic
workstation (e.g. the Microsoft Tape Format that was used by Windows
&lt;em&gt;NTBackup&lt;/em&gt;).&lt;/p&gt;

&lt;p&gt;A better approach is to use tools like &lt;em&gt;dd&lt;/em&gt;&lt;sup id=&quot;fnref:18&quot;&gt;&lt;a href=&quot;#fn:18&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;18&lt;/a&gt;&lt;/sup&gt; which are able to read
data directly at the block device level. This way, tapes can be read in
a format-agnostic manner. The general idea here is that we use &lt;em&gt;dd&lt;/em&gt; to
read all archive files on a tape, irrespective of their format. We then
use format identification tools such as &lt;em&gt;file(1)&lt;/em&gt;&lt;sup id=&quot;fnref:19&quot;&gt;&lt;a href=&quot;#fn:19&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;19&lt;/a&gt;&lt;/sup&gt;, &lt;em&gt;Apache
Tika&lt;/em&gt;&lt;sup id=&quot;fnref:20&quot;&gt;&lt;a href=&quot;#fn:20&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;20&lt;/a&gt;&lt;/sup&gt;, &lt;em&gt;Fido&lt;/em&gt;&lt;sup id=&quot;fnref:21&quot;&gt;&lt;a href=&quot;#fn:21&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;21&lt;/a&gt;&lt;/sup&gt; or &lt;em&gt;Siegfried&lt;/em&gt;&lt;sup id=&quot;fnref:22&quot;&gt;&lt;a href=&quot;#fn:22&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;22&lt;/a&gt;&lt;/sup&gt; to establish the format of
each archive file, and subsequently use dedicated, format-specific
utilities to extract their contents. This is similar to existing
forensic workflows that are used for other carrier types in e.g.
&lt;em&gt;BitCurator&lt;/em&gt;.&lt;/p&gt;
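
&lt;p&gt;As a rough illustration of the identification step, the sketch below runs &lt;em&gt;file(1)&lt;/em&gt; on a set of recovered files and suggests an extraction tool for each. The function names &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;tool_for_mime&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;identify_archives&lt;/code&gt; are hypothetical, and the mapping is an assumption that only covers the MIME types that &lt;em&gt;file(1)&lt;/em&gt; typically reports for &lt;em&gt;tar&lt;/em&gt; and &lt;em&gt;cpio&lt;/em&gt; archives. Any other format is reported as unknown.&lt;/p&gt;

```shell
# Map a MIME type, as reported by file(1), to the tool that can unpack it.
# The mapping is illustrative and far from complete.
tool_for_mime() {
    case "$1" in
        application/x-tar)  echo "tar" ;;
        application/x-cpio) echo "cpio" ;;
        *)                  echo "unknown" ;;
    esac
}

# Print one line per recovered file: name, MIME type and suggested tool.
# Usage: identify_archives FILE...
identify_archives() {
    for f in "$@"; do
        mime=$(file --brief --mime-type "$f")
        echo "$f $mime $(tool_for_mime "$mime")"
    done
}
```

&lt;p&gt;Running &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;identify_archives file*.dd&lt;/code&gt; in the directory with the recovered files would then give a quick overview of which files can be unpacked with which utility.&lt;/p&gt;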

&lt;h3 id=&quot;reading-a-tape-with-dd-and-mt&quot;&gt;Reading a tape with dd and mt&lt;/h3&gt;

&lt;p&gt;In the simplest case, reading data from a tape involves nothing more
than a &lt;em&gt;dd&lt;/em&gt; command line such as this one:&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nb&quot;&gt;dd &lt;/span&gt;&lt;span class=&quot;k&quot;&gt;if&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;/dev/nst0 &lt;span class=&quot;nv&quot;&gt;of&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;file0001.dd &lt;span class=&quot;nv&quot;&gt;bs&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;16384
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Here, the “&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;if&lt;/code&gt;” argument tells &lt;em&gt;dd&lt;/em&gt; to read input from the non-rewinding
tape device &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;/dev/nst0&lt;/code&gt;, and the value of “&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;of&lt;/code&gt;” defines the file where
output is written. The “&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;bs&lt;/code&gt;” argument defines a block size (here in
bytes), and this is where the first complication arises. The actual
value that must be used here depends on the software that was used to
write the tape, and its settings. If &lt;em&gt;dd&lt;/em&gt; is invoked with a value that
is too small, it will fail with a “cannot allocate memory” error. After
some experimentation I was able to establish the block size using the
following iterative procedure:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;
    &lt;p&gt;Starting with a block size of 512 bytes, try to read one single
record (and direct the output to the null device, as we don’t need
it):&lt;/p&gt;

    &lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nb&quot;&gt;dd &lt;/span&gt;&lt;span class=&quot;k&quot;&gt;if&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;/dev/nst0 &lt;span class=&quot;nv&quot;&gt;of&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;/dev/null &lt;span class=&quot;nv&quot;&gt;bs&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;512 &lt;span class=&quot;nv&quot;&gt;count&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;1
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;    &lt;/div&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Position the tape 1 record backward using the &lt;em&gt;mt&lt;/em&gt;&lt;sup id=&quot;fnref:23&quot;&gt;&lt;a href=&quot;#fn:23&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;23&lt;/a&gt;&lt;/sup&gt; tool (this
resets the read position to the start of the current session):&lt;/p&gt;

    &lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;mt &lt;span class=&quot;nt&quot;&gt;-f&lt;/span&gt; /dev/nst0 bsr 1
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;    &lt;/div&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;If step 1 raised an error in &lt;em&gt;dd&lt;/em&gt;, increase the block size value by
512 bytes, and repeat from step 1.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ol&gt;
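
&lt;p&gt;The three steps above can be sketched as a small shell function. This is a minimal sketch rather than a tested tool: the function name &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;find_block_size&lt;/code&gt; is hypothetical, and the upper limit of 1 MiB at which it gives up is an arbitrary assumption that is not part of the procedure itself.&lt;/p&gt;

```shell
# Estimate the block size of the session at the current tape position.
# Usage: find_block_size DEVICE [MAX_SIZE]
# Prints the block size in bytes, or fails once MAX_SIZE is exceeded.
find_block_size() (
    device="$1"
    max="${2:-1048576}"   # arbitrary upper limit (1 MiB), an assumption
    bs=512
    while [ "$bs" -le "$max" ]; do
        status=0
        # Step 1: try to read a single record, discarding the data
        dd if="$device" of=/dev/null bs="$bs" count=1 || status=$?
        # Step 2: move back to the start of the record
        mt -f "$device" bsr 1
        if [ "$status" -eq 0 ]; then
            echo "$bs"
            exit 0
        fi
        # Step 3: the read failed, so try the next multiple of 512
        bs=$((bs + 512))
    done
    exit 1
)
```

&lt;p&gt;A call such as &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;bs=$(find_block_size /dev/nst0)&lt;/code&gt; stores the result in a variable. The messages that &lt;em&gt;dd&lt;/em&gt; prints for each failed attempt go to standard error, so they do not end up in the captured value.&lt;/p&gt;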

&lt;p&gt;Repeating these steps until &lt;em&gt;dd&lt;/em&gt; exits without errors will yield the
correct block size. Re-running the &lt;em&gt;dd&lt;/em&gt; command at the start of this
section with this value will recover the first session on the tape to a
single output file. This leads to a second complication: a tape may
contain additional sessions. We can test for this by positioning the
tape 1 record forward with the &lt;em&gt;mt&lt;/em&gt; tool:&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;mt &lt;span class=&quot;nt&quot;&gt;-f&lt;/span&gt; /dev/nst0 fsr 1
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;If the &lt;em&gt;mt&lt;/em&gt; call doesn’t result in an error (i.e. &lt;em&gt;mt&lt;/em&gt;’s exit code
equals zero), at least one additional session exists. In that case we
use &lt;em&gt;mt&lt;/em&gt; again to position the tape 1 record backward (the start of the
second session). We then repeat the block-estimation procedure for the
second session, and read the data with &lt;em&gt;dd&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;All of the above steps are repeated until &lt;em&gt;mt&lt;/em&gt;’s &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;fsr&lt;/code&gt; command results
in a non-zero exit code, which means no additional sessions exist. The
end result is that for each session on the tape we now have a
corresponding output file.&lt;/p&gt;
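&lt;p&gt;The block-size search described above can be sketched as a small shell function. This is a minimal illustration under stated assumptions, not the way &lt;em&gt;tapeimgr&lt;/em&gt; is implemented: the function name &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;find_block_size&lt;/code&gt; and the variables &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;DD&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;MT&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;MAX_BLOCK_SIZE&lt;/code&gt; are hypothetical, and the upper limit of 1 MiB is an arbitrary safety stop.&lt;/p&gt;

```shell
#!/bin/sh
# Sketch only: DD, MT and MAX_BLOCK_SIZE are illustrative variables,
# not options of dd, mt or tapeimgr.
DD=${DD:-dd}
MT=${MT:-mt}
MAX_BLOCK_SIZE=${MAX_BLOCK_SIZE:-1048576}

# Print the smallest multiple of 512 with which one record can be read
# from the tape device given as $1. Returns 1 if no size up to
# MAX_BLOCK_SIZE works.
find_block_size() {
    device=$1
    bs=512
    while [ "$bs" -le "$MAX_BLOCK_SIZE" ]; do
        # Step 1: try to read a single record with the current block size.
        if "$DD" if="$device" of=/dev/null bs="$bs" count=1 2>/dev/null; then
            status=0
        else
            status=1
        fi
        # Step 2: move back one record, whether or not the read succeeded.
        "$MT" -f "$device" bsr 1 >/dev/null 2>/dev/null
        if [ "$status" -eq 0 ]; then
            echo "$bs"
            return 0
        fi
        # Step 3: the read failed, so try the next multiple of 512.
        bs=$((bs + 512))
    done
    return 1
}
```

&lt;p&gt;Calling &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;find_block_size /dev/nst0&lt;/code&gt; prints the first block size for which &lt;em&gt;dd&lt;/em&gt; exits without errors; that value can then be used as the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;bs&lt;/code&gt; argument when reading the session.&lt;/p&gt;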

&lt;h3 id=&quot;tapeimgr&quot;&gt;Tapeimgr&lt;/h3&gt;

&lt;p&gt;Even though the above procedure is not particularly complicated, having
to go through all these steps by hand would be very cumbersome.
Moreover, &lt;em&gt;dd&lt;/em&gt;’s ability to overwrite entire block devices with one
single command introduces a high risk of accidental data loss (hence its
“destroy disk” nickname). Also, it would be useful to have a more
user-friendly method for reading data tapes. For these reasons, I
developed &lt;em&gt;tapeimgr&lt;/em&gt;&lt;sup id=&quot;fnref:24&quot;&gt;&lt;a href=&quot;#fn:24&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;24&lt;/a&gt;&lt;/sup&gt;, a software application that allows
one to read data tapes using a simple graphical user
interface. Written in Python, it was loosely inspired by the popular
&lt;em&gt;Guymager&lt;/em&gt; forensic imaging software&lt;sup id=&quot;fnref:25&quot;&gt;&lt;a href=&quot;#fn:25&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;25&lt;/a&gt;&lt;/sup&gt;. Internally, &lt;em&gt;tapeimgr&lt;/em&gt; just
wraps around &lt;em&gt;dd&lt;/em&gt; and &lt;em&gt;mt&lt;/em&gt;, but it completely hides the complexities of
these tools from the user. The software runs on any Linux distribution,
and can be installed with &lt;em&gt;pip&lt;/em&gt;, Python’s default package manager. Its
only dependencies are &lt;em&gt;Python&lt;/em&gt; (version 3.2 or more recent),
the &lt;em&gt;TkInter&lt;/em&gt; package, and &lt;em&gt;dd&lt;/em&gt; and &lt;em&gt;mt&lt;/em&gt;. All of these are present by
default on most modern Linux distros.&lt;/p&gt;

&lt;p&gt;Figure 5 shows &lt;em&gt;tapeimgr&lt;/em&gt;’s interface.&lt;/p&gt;

&lt;figure class=&quot;image&quot;&gt;
  &lt;img src=&quot;/images/2019/09/tapeimgr.png&quot; alt=&quot;The tapeimgr interface&quot; /&gt;
  &lt;figcaption&gt;Figure 5: The tapeimgr interface&lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;At the very minimum, a user must select a
directory to which all output for a given tape is written. If necessary,
the read process can be further customized using a number of options
that are described in detail in the &lt;em&gt;tapeimgr&lt;/em&gt; documentation. There are
also entry fields for descriptive metadata, and the values that are
entered here are written to a metadata file in JSON format. This file
also contains some basic event and technical metadata, including SHA-512
checksums of each session (which is represented as a file) that is read
from the tape. After the user presses the &lt;em&gt;Start&lt;/em&gt; button, the progress
of the tape reading procedure can be monitored from a widget at the
bottom of the interface; the information displayed here is also written
to a log file. When &lt;em&gt;tapeimgr&lt;/em&gt; has finished reading a tape, it displays
a prompt that tells the user whether the read process completed
successfully. In case of any problems, the log file contains detailed
information about all steps in the tape reading process. In addition to
the graphical user interface, &lt;em&gt;tapeimgr&lt;/em&gt; also has a command-line
interface, which makes it possible to integrate it into other
applications.&lt;/p&gt;
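&lt;p&gt;To check the recovered session files independently against the checksums recorded in the metadata file, the SHA-512 values can be recomputed with the &lt;em&gt;sha512sum&lt;/em&gt; tool from GNU coreutils. The sketch below is an illustration only: the function name &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;checksum_sessions&lt;/code&gt; is hypothetical, and it assumes that the session files carry a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;.dd&lt;/code&gt; extension. Comparing the printed values with those in the JSON file is left as a manual step.&lt;/p&gt;

```shell
#!/bin/sh
# Print "checksum  filename" for every session file in the directory
# given as $1.
# Assumption: session files end in .dd; adjust the pattern if they do not.
checksum_sessions() {
    dir=$1
    for f in "$dir"/*.dd; do
        # Skip the unexpanded pattern when no matching files exist.
        [ -e "$f" ] || continue
        sha512sum "$f"
    done
}
```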

&lt;h3 id=&quot;limitations-of-tapeimgr&quot;&gt;Limitations of tapeimgr&lt;/h3&gt;

&lt;p&gt;At this stage, &lt;em&gt;tapeimgr&lt;/em&gt; has two important limitations. First, it only
supports tapes for which the block size is constant within each session.
More recent tape drives are often capable of writing tapes in “variable
block” mode, where the block size within a session varies. This is not
supported, although a possible (but so far untested) workaround may be
to set the Initial Block Size to an arbitrarily large value that is
equal to or larger than the largest block size on the tape&lt;sup id=&quot;fnref:26&quot;&gt;&lt;a href=&quot;#fn:26&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;26&lt;/a&gt;&lt;/sup&gt;.
Variations in block size &lt;em&gt;between&lt;/em&gt; sessions are no problem, and are
fully supported. Second, &lt;em&gt;tapeimgr&lt;/em&gt; is not able to recover data from tapes
that were partially overwritten. As an example, suppose that a tape
originally contained 3 sessions with a size of 200 MB each. If someone
later overwrote part of the first session with a 10 MB session at the
start of that tape, running the tape through &lt;em&gt;tapeimgr&lt;/em&gt; will only
recover the first 10 MB session, and ignore the remaining sessions. The
reason for this is that each write action adds an “End Of Media” (EOM)
marker just beyond the end of the written data, and once an EOM is
written, any previously recorded data beyond that point are no longer
accessible (reportedly workarounds exist, but these are specific to
kernel drivers)&lt;sup id=&quot;fnref:27&quot;&gt;&lt;a href=&quot;#fn:27&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;27&lt;/a&gt;&lt;/sup&gt;.&lt;/p&gt;

&lt;h2 id=&quot;reading-the-xxlink-tapes&quot;&gt;Reading the xxLINK tapes&lt;/h2&gt;

&lt;p&gt;With all the hardware and software in place, I first experimented with
reading some unimportant, disposable DDS-1 and DLT-IV test tapes. Based
on these tests I designed processing workflows, which I then documented
by creating detailed descriptions that cover all steps that have to be
followed to read a tape&lt;sup id=&quot;fnref:28&quot;&gt;&lt;a href=&quot;#fn:28&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;28&lt;/a&gt;&lt;/sup&gt;. Once I was confident that the workflows
were sufficiently robust, I applied them to the &lt;em&gt;xxLINK&lt;/em&gt; tapes. All but
one of the DDS-1 tapes could be read without problems. For one tape, the
recovery resulted in a 10-kB file with only null bytes, which most
likely means the tape is faulty. The two DDS-3 tapes could be read
successfully as well. Most of these tapes contained multiple (up to 4)
sessions. Of the 14 DLT-IV tapes, only 7 could be read without problems.
For the remaining 7, the reading procedure resulted in a zero-length
file, which means the tape drive interprets them as empty. A common
characteristic of all “failed” DLT-IV tapes is that they were written at
40 GB capacity, whereas the other tapes were written at 35 GB capacity.
This is odd, as our DLT-IV drive does in fact support 40 GB capacity
tapes (this was confirmed by writing some data to a blank test tape at
40 GB capacity, which could subsequently be read without problems).
Although the exact cause is unknown at this stage, it is possible that
these tapes are simply faulty, or perhaps they were erased or
overwritten at some point. Interestingly, the label on at least one of
the problematic tapes contains some writing that suggests it was already
faulty around the time it was written. Table 1 gives a
brief summary of the above results.&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th style=&quot;text-align: left&quot;&gt; &lt;/th&gt;
      &lt;th style=&quot;text-align: center&quot;&gt;DDS-1&lt;/th&gt;
      &lt;th style=&quot;text-align: center&quot;&gt;DDS-3&lt;/th&gt;
      &lt;th style=&quot;text-align: center&quot;&gt;DLT-IV&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;# tapes&lt;/td&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;17&lt;/td&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;2&lt;/td&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;14&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;# read successfully&lt;/td&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;16&lt;/td&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;2&lt;/td&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;7&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;&lt;em&gt;Table 1: Summary of xxLINK tapes read results&lt;/em&gt;&lt;/p&gt;

&lt;h2 id=&quot;extraction-of-recovered-container-files&quot;&gt;Extraction of recovered container files&lt;/h2&gt;

&lt;p&gt;It is important to stress that the above &lt;em&gt;tapeimgr&lt;/em&gt;-based workflow only
recovers the contents of the tapes at the bit level: for each session on
the tape it results in one bitstream (file). Additional steps are needed
to interpret the recovered bitstreams in a meaningful way. At the very
minimum, two additional steps are necessary. First, we need to
identify the format of the container files that were recovered from the
tapes. Once the container format is known, we can (hopefully) find some
format-specific software that is capable of extracting the content of
the container files. For format identification I ran the Unix &lt;em&gt;file(1)&lt;/em&gt;
command (v. 5.32)&lt;sup id=&quot;fnref:29&quot;&gt;&lt;a href=&quot;#fn:29&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;29&lt;/a&gt;&lt;/sup&gt; on all recovered files. The results are
summarized in Table 2.&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th style=&quot;text-align: left&quot;&gt;Format (file(1))&lt;/th&gt;
      &lt;th style=&quot;text-align: center&quot;&gt;Number of files&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;new-fs dump file (big endian)&lt;/td&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;28&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;new-fs dump file (little endian)&lt;/td&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;8&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;tar archive&lt;/td&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;4&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;POSIX tar archive&lt;/td&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;2&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;POSIX tar archive (GNU)&lt;/td&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;5&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;&lt;em&gt;Table 2: Formats of recovered files according to file(1)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Most files in the &lt;em&gt;xxLINK&lt;/em&gt; data set are Unix &lt;em&gt;dump&lt;/em&gt; archives&lt;sup id=&quot;fnref:30&quot;&gt;&lt;a href=&quot;#fn:30&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;30&lt;/a&gt;&lt;/sup&gt;.
&lt;em&gt;Dump&lt;/em&gt; is an old backup utility, and its archive files can be extracted
using the &lt;em&gt;restore&lt;/em&gt; tool&lt;sup id=&quot;fnref:31&quot;&gt;&lt;a href=&quot;#fn:31&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;31&lt;/a&gt;&lt;/sup&gt;. Even though &lt;em&gt;dump&lt;/em&gt; and &lt;em&gt;restore&lt;/em&gt; are
largely obsolete today, the software is still available in the Debian
repositories, and as a result these tools can be easily installed on
most Linux-based platforms. A few words of caution: first, by default
&lt;em&gt;restore&lt;/em&gt; extracts the contents of a &lt;em&gt;dump&lt;/em&gt; file to the system’s root
directory, i.e. it tries to recover a full backup. For our purposes this
behaviour is clearly unwanted, and could even wreak havoc on the
forensic workstation’s file system. However, extraction to a
user-defined directory is possible by running &lt;em&gt;restore&lt;/em&gt; in “interactive”
mode&lt;sup id=&quot;fnref:32&quot;&gt;&lt;a href=&quot;#fn:32&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;32&lt;/a&gt;&lt;/sup&gt;. A disadvantage of having to use the interactive mode is that
it makes bulk processing of &lt;em&gt;dump&lt;/em&gt; files virtually impossible. This
could be a serious problem when one has to deal with very large numbers
of these files. Second, it is important to check the file system of the
disk to which the container file is extracted. I initially tried to
extract the &lt;em&gt;dump&lt;/em&gt; files to an &lt;em&gt;NTFS&lt;/em&gt;-formatted&lt;sup id=&quot;fnref:33&quot;&gt;&lt;a href=&quot;#fn:33&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;33&lt;/a&gt;&lt;/sup&gt; external hard disk.
However, it turned out that the names of some files and directories
inside the archive were not compatible with &lt;em&gt;NTFS&lt;/em&gt;, and as a result
these files were not extracted. Since the &lt;em&gt;dump&lt;/em&gt; archives were
originally created from a Unix-based file system, this is not
surprising. Also, any file attributes that are not supported by &lt;em&gt;NTFS&lt;/em&gt;
(e.g. access permissions and ownership) are lost when extracting to
&lt;em&gt;NTFS&lt;/em&gt;. Extraction to another disk that was formatted as &lt;em&gt;Ext4&lt;/em&gt;&lt;sup id=&quot;fnref:34&quot;&gt;&lt;a href=&quot;#fn:34&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;34&lt;/a&gt;&lt;/sup&gt;
(which is the default file system for most Linux distributions) resolved
this issue.&lt;/p&gt;

&lt;p&gt;The remaining files are all &lt;em&gt;tar&lt;/em&gt; archives, a format that is still
widely used today. These files can be extracted by simply running the
&lt;em&gt;tar&lt;/em&gt;&lt;sup id=&quot;fnref:35&quot;&gt;&lt;a href=&quot;#fn:35&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;35&lt;/a&gt;&lt;/sup&gt; command like this:&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nb&quot;&gt;tar&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-xvf&lt;/span&gt; /path/to/file0001.dd &lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; /dev/null
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The earlier observations on the file system of the disk to which the
container is extracted also apply to &lt;em&gt;tar&lt;/em&gt; files.&lt;/p&gt;
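&lt;p&gt;Since &lt;em&gt;tar&lt;/em&gt; extracts into the current working directory by default, it can be convenient to give each container file its own output directory. The following is a minimal sketch that uses &lt;em&gt;tar&lt;/em&gt;’s &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;-C&lt;/code&gt; option for this; the function name &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;extract_tar&lt;/code&gt; and the directory layout are illustrative assumptions.&lt;/p&gt;

```shell
#!/bin/sh
# Extract the tar archive $1 into directory $2, creating the directory
# first if it does not exist yet.
extract_tar() {
    archive=$1
    outdir=$2
    mkdir -p "$outdir" || return 1
    # -x: extract, -f: read from the given archive file,
    # -C: change to outdir before extracting.
    tar -xf "$archive" -C "$outdir"
}
```

&lt;p&gt;For example, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;extract_tar /path/to/file0001.dd /path/to/extracted/file0001&lt;/code&gt; keeps the contents of each session separate, provided the target directory resides on a suitable file system.&lt;/p&gt;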

&lt;p&gt;Using &lt;em&gt;restore&lt;/em&gt; and &lt;em&gt;tar&lt;/em&gt; I was able to successfully extract the
contents of all container files. For a small number of &lt;em&gt;dump&lt;/em&gt; files,
&lt;em&gt;restore&lt;/em&gt; reported errors about files that could not be found.
Although the exact cause is unknown at this stage, a possible
explanation could be that in these cases a single &lt;em&gt;dump&lt;/em&gt; was stored as
two volumes on separate physical tapes. This will need further
investigation. Nevertheless, overall the interpretation of the &lt;em&gt;xxLINK&lt;/em&gt;
tapes at the container level is quite straightforward.&lt;/p&gt;

&lt;p&gt;It is worth noting that the extraction may be more complicated for other
container formats. For example, a number of Microsoft backup tools for
the Windows platform (e.g. &lt;em&gt;NTBackup&lt;/em&gt;) used to write data to tape using
the &lt;em&gt;Microsoft Tape Format&lt;/em&gt;&lt;sup id=&quot;fnref:36&quot;&gt;&lt;a href=&quot;#fn:36&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;36&lt;/a&gt;&lt;/sup&gt;. This is a proprietary format that is
only officially supported by the original creator software, which is not
freely available, and only runs under Windows (however, a few
open-source tools exist that claim to support the format). As the
&lt;em&gt;xxLINK&lt;/em&gt; data set does not include this format, it was not investigated
as part of this work.&lt;/p&gt;

&lt;h2 id=&quot;reconstruction-of-sites&quot;&gt;Reconstruction of sites&lt;/h2&gt;

&lt;p&gt;Once the contents of the container files are extracted, we can start
reconstructing the web sites. At this stage of the Web Archaeology
project we have only just made a start with this; however, it is
possible to present some first results. As a first step we need to
inspect the contents of the data that were extracted from the container
files, and identify any files and directories that contain web site
data. This includes not only any directories with the sites’ source
data, but also server configuration files, which contain valuable
information about the sites. For instance, from the configuration files
it is possible to see at which domains and URLs they were originally
hosted, and how internal forwards were handled. With this information it
is possible to host any of the old sites on a locally running web server
at their original domains. A detailed discussion of the
technicalities is beyond the scope of this paper, but the general approach is
similar to the one we used earlier to reconstruct the &lt;em&gt;NL-Menu&lt;/em&gt; web
index in 2018 &lt;a href=&quot;https://www.bitsgalore.org/2018/04/24/resurrecting-the-first-dutch-web-index-nl-menu-revisited&quot;&gt;[3]&lt;/a&gt;,
&lt;a href=&quot;https://www.bitsgalore.org/2018/07/11/crawling-offline-web-content-the-nl-menu-case&quot;&gt;[4]&lt;/a&gt;.
It comprises the following steps&lt;sup id=&quot;fnref:37&quot;&gt;&lt;a href=&quot;#fn:37&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;37&lt;/a&gt;&lt;/sup&gt;:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;
    &lt;p&gt;Set up a web server (typically &lt;em&gt;Apache&lt;/em&gt;)&lt;sup id=&quot;fnref:38&quot;&gt;&lt;a href=&quot;#fn:38&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;38&lt;/a&gt;&lt;/sup&gt;, and restrict access
to the server to &lt;em&gt;localhost&lt;/em&gt; (this ensures that any hosted sites are
only accessible locally on the machine on which the server is
running).&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Copy the contents of the site (i.e. its directory tree) to the
default root folder used by the web server (typically &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;/var/www&lt;/code&gt;),
and update the file permissions.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Configure the site by creating a configuration file (or by adding an
entry to an existing configuration file).&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Activate the configuration file.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Add the site’s domain to the &lt;em&gt;hosts&lt;/em&gt; file (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;/etc/hosts&lt;/code&gt;).&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Restart the web server.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ol&gt;
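&lt;p&gt;Steps 2, 4, 5 and 6 above lend themselves to scripting. What follows is a rough sketch for a Debian/Ubuntu-style &lt;em&gt;Apache&lt;/em&gt; installation, not the script that was actually used. It assumes that the configuration file from step 3 already exists as &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;/etc/apache2/sites-available/DOMAIN.conf&lt;/code&gt;, that the script runs with root privileges, and that read access for all users is sufficient as a permission setting. The names &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;deploy_site&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;run&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;DRY_RUN&lt;/code&gt; are hypothetical.&lt;/p&gt;

```shell
#!/bin/sh
# Sketch of steps 2, 4, 5 and 6 for a single site.
# With DRY_RUN=1 (the default here) commands are printed, not executed.
DRY_RUN=${DRY_RUN:-1}

# Either print a command line or execute it, depending on DRY_RUN.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# $1: domain of the site, $2: directory with the extracted site data
deploy_site() {
    domain=$1
    srcdir=$2
    # Step 2: copy the directory tree and make it readable for the server.
    run cp -r "$srcdir" "/var/www/$domain"
    run chmod -R a+rX "/var/www/$domain"
    # Step 4: activate the (already existing) configuration file.
    run a2ensite "$domain.conf"
    # Step 5: let the domain resolve to the local machine.
    run sh -c "echo '127.0.0.1 $domain' >> /etc/hosts"
    # Step 6: restart the web server.
    run systemctl restart apache2
}
```

&lt;p&gt;Running &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;deploy_site www.example.nl /path/to/extracted/site&lt;/code&gt; with the default setting only lists the commands, which makes it possible to review them before setting &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;DRY_RUN=0&lt;/code&gt;.&lt;/p&gt;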

&lt;p&gt;After following these steps, the site is now locally accessible at its
original URL, and it can be viewed in a browser, or archived with web
crawler software such as &lt;em&gt;wget&lt;/em&gt;. Since the directory structure of the
web site data on the &lt;em&gt;xxLINK&lt;/em&gt; tapes is quite uniform, it was possible to
automate these steps to a large extent. Using this approach, I have so
far reconstructed about 20 sites from one of the 1995 DDS-1 tapes by
hosting them on an &lt;em&gt;Apache&lt;/em&gt; web server instance. A few examples will
illustrate the diversity of the sites in the &lt;em&gt;xxLINK&lt;/em&gt; data set. Figure
6 shows the home page of &lt;em&gt;xxLINK&lt;/em&gt;’s web site.&lt;/p&gt;

&lt;figure class=&quot;image&quot;&gt;
  &lt;img src=&quot;/images/2019/09/xxlink.png&quot; alt=&quot;xxLINK home page&quot; /&gt;
  &lt;figcaption&gt;Figure 6: xxLINK home page&lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;Figure 7 shows a snapshot of the home page of the web site of Schiphol Airport, which
pre-dates the earliest snapshot of this site in the Internet Archive&lt;sup id=&quot;fnref:39&quot;&gt;&lt;a href=&quot;#fn:39&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;39&lt;/a&gt;&lt;/sup&gt; by
several months.&lt;/p&gt;

&lt;figure class=&quot;image&quot;&gt;
  &lt;img src=&quot;/images/2019/09/schiphol.png&quot; alt=&quot;Home page of Schiphol Airport&quot; /&gt;
  &lt;figcaption&gt;Figure 7: Home page of Schiphol Airport&lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;Figure 8 shows a report on drugs policy in the Netherlands,
which was published as part of the site of the Dutch Ministry of Health,
Welfare and Sport.&lt;/p&gt;

&lt;figure class=&quot;image&quot;&gt;
  &lt;img src=&quot;/images/2019/09/vwsdrugs.png&quot; alt=&quot;Report on Drugs policy in the Netherlands on web site of Dutch Ministry of Health, Welfare and Sport&quot; /&gt;
  &lt;figcaption&gt;Figure 8: Report on Drugs policy in the Netherlands on web site of Dutch Ministry of Health, Welfare and Sport&lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;Finally, Figure 9 shows a
contest published on the site of Dutch publisher Database Publications.
The objective of the contest was to correctly identify the web addresses
of the home pages shown in the image; free copies of &lt;em&gt;CorelDraw 5.0&lt;/em&gt;
were available to five lucky winners.&lt;/p&gt;

&lt;figure class=&quot;image&quot;&gt;
  &lt;img src=&quot;/images/2019/09/prijsvraag.png&quot; alt=&quot;Home page identification contest on web site of Database Publications&quot; /&gt;
  &lt;figcaption&gt;Figure 9: Home page identification contest on web site of Database Publications&lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;The site reconstruction procedure will most likely need further
refinement. For instance, most of the sites on the 1995 tape are
relatively simple static HTML sites, but a few include forms that use
CGI scripts, which currently do not work. Also, it is possible that the
sites on the more recent (1999) tapes are more complex, but this needs
further investigation. Once the analysis and processing of the data from
the remaining tapes is complete, a more in-depth report on the
reconstruction procedure will be published separately.&lt;/p&gt;

&lt;h2 id=&quot;conclusions&quot;&gt;Conclusions&lt;/h2&gt;

&lt;p&gt;In this paper I showed how old DDS and DLT-IV tapes from the ’90s can be
read on a modern desktop workstation running Linux (in this case the
Ubuntu-based &lt;em&gt;BitCurator&lt;/em&gt; environment). I also explained how I created
workflows that allow one to recover data from these tapes in a
format-agnostic way, using the user-friendly &lt;em&gt;tapeimgr&lt;/em&gt; software.
Finally I discussed how I then extracted the contents of the resulting
files, and how I used this to reconstruct a number of “lost” web sites
from 1995. The workflow descriptions are available on GitHub&lt;sup id=&quot;fnref:40&quot;&gt;&lt;a href=&quot;#fn:40&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;40&lt;/a&gt;&lt;/sup&gt;, and
they will most likely evolve further over time. They are published under
a permissive license that allows anyone to adapt them and create
derivatives. They describe all aspects of the tape reading process in
detail, including the hardware components used and their
characteristics, links to relevant documentation, instructions on how to
handle the tapes (e.g. how to write-protect them), how to operate the
tape readers, and how to use them in conjunction with &lt;em&gt;tapeimgr&lt;/em&gt;. The
level of detail provided should be sufficient to allow others to
reproduce these workflows, and adapt them to their needs if necessary.
Since the process of reading data tapes on Linux-based systems is quite
standardized, other tape formats that are not covered by this paper can
probably be processed in a similar way.&lt;/p&gt;

&lt;h2 id=&quot;acknowledgments&quot;&gt;Acknowledgments&lt;/h2&gt;

&lt;p&gt;First of all thanks are due to Wendy van Dijk and Elizabeth Mattijsen
for making the &lt;em&gt;xxLINK&lt;/em&gt; tapes available to us, and to Kees Teszelszky,
Peter de Bode and Jasper Faase for initiating the Web Archaeology
project, and establishing the contact with &lt;em&gt;xxLINK&lt;/em&gt;. Peter Boel and René
van Egdom are thanked for their help digging out the tape drives and
other obscure hardware peripherals, and Willem Jan Faber for various
helpful hardware-related suggestions. Finally, thanks are due to the
anonymous reviewers who provided valuable feedback on an earlier draft
of this paper.&lt;/p&gt;

&lt;h2 id=&quot;references&quot;&gt;References&lt;/h2&gt;

&lt;p&gt;[1]  B. Sierman and K. Teszelszky, “How can we improve our web
    collection? An evaluation of webarchiving at the KB National
    Library of the Netherlands (2007-2017),” &lt;em&gt;Alexandria: The Journal
    of National and International Library and Information
    Issues&lt;/em&gt;, Aug. 2017. DOI: 10.1177/0955749017725930. [Online]. Available: &lt;a href=&quot;https://doi.org/10.1177/0955749017725930&quot;&gt;https://doi.org/10.1177/0955749017725930&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;[2]  S. Ross and A. Gow, “Digital archaeology: Rescuing neglected and
    damaged data resources. A JISC/NPO study within the Electronic
    Libraries (eLib) Programme on the preservation of electronic materials,” University of Glasgow, 1999. [Online].
    Available: &lt;a href=&quot;http://www.ukoln.ac.uk/services/elib/papers/supporting/pdf/p2.pdf&quot;&gt;http://www.ukoln.ac.uk/services/elib/papers/supporting/pdf/p2.pdf&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;[3]  J. van der Knijff, &lt;em&gt;Resurrecting the first Dutch web
    index: NL-menu revisited&lt;/em&gt;, 2018. [Online]. Available: &lt;a href=&quot;https://www.bitsgalore.org/2018/04/24/resurrecting-the-first-dutch-web-index-nl-menu-revisited&quot;&gt;https://www.bitsgalore.org/2018/04/24/resurrecting-the-first-dutch-web-index-nl-menu-revisited&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;[4]  ——, &lt;em&gt;Crawling offline web content: The NL-menu
    case&lt;/em&gt;, 2018. [Online]. Available: &lt;a href=&quot;https://www.bitsgalore.org/2018/07/11/crawling-offline-web-content-the-nl-menu-case&quot;&gt;https://www.bitsgalore.org/2018/07/11/crawling-offline-web-content-the-nl-menu-case&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;[5]  D. Elford, N. Del Pozo, S. Mihajlovic, D. Pearson, G. Clifton,
    and C. Webb, “Media matters: Developing processes for preserving
    digital objects on physical carriers at the National Library of
    Australia,” in &lt;em&gt;Proceedings, 74th IFLA General
    Conference and Council&lt;/em&gt;, 2008. [Online].
    Available: &lt;a href=&quot;http://archive.ifla.org/IV/ifla74/papers/084-Webb-en.pdf&quot;&gt;http://archive.ifla.org/IV/ifla74/papers/084-Webb-en.pdf&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;[6]  K. Woods and G. Brown, “From imaging to access - effective
    preservation of legacy removable media,” in &lt;em&gt;Proceedings, Archiving
    2009&lt;/em&gt;, 2009. [Online]. Available: &lt;a href=&quot;https://kamwoods.net/publications/woodsbrownarch09.pdf&quot;&gt;https://kamwoods.net/publications/woodsbrownarch09.pdf&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;[7]  K. Woods, C. Lee, and S. Garfinkel, “Extending digital repository
    architectures to support disk image preservation and access,” in
    &lt;em&gt;Proceedings of the 11th Annual International ACM/IEEE Joint Conference on Digital Libraries&lt;/em&gt;,
    ser. JCDL ’11, Ottawa, Ontario, Canada: ACM, 2011, pp. 57-66,
    ISBN: 978-1-4503-0744-4. DOI: 10.1145/1998076.1998088. [Online].
    Available: &lt;a href=&quot;http://doi.acm.org/10.1145/1998076.1998088&quot;&gt;http://doi.acm.org/10.1145/1998076.1998088&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;[8]  C. Lee, M. Kirschenbaum, A. Chassanoff, P. Olsen, and K. Woods,
    “BitCurator: Tools and techniques for digital forensics in
    collecting institutions,” &lt;em&gt;D-Lib Magazine&lt;/em&gt;, vol. 18, 5/6 2012.
    [Online]. Available: &lt;a href=&quot;http://www.dlib.org/dlib/may12/lee/05lee.html&quot;&gt;http://www.dlib.org/dlib/may12/lee/05lee.html&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;[9]  J. John, “Digital forensics and preservation,” Digital
    Preservation Coalition, 2012. [Online].
    Available: &lt;a href=&quot;http://dx.doi.org/10.7207/twr12-03&quot;&gt;http://dx.doi.org/10.7207/twr12-03&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;[10] M. Pennock, P. May, M. Day, K. Davies, S. Whibley, A. Kimura, and E.
    Halvarsson, “The flashback project: Rescuing disk-based content from
    the 1980’s to the current day,” in &lt;em&gt;Proceedings, 11th
    Digital Curation Conference&lt;/em&gt;, 2016. [Online]. Available:
    &lt;a href=&quot;https://doi.org/10.5281/zenodo.1321629&quot;&gt;https://doi.org/10.5281/zenodo.1321629&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;[11] A. Prael and A. Wickner, “Getting to know FRED: Introducing
    workflows for born-digital content,” &lt;em&gt;Practical Technology for
    Archives&lt;/em&gt;, vol. 4, 2015. [Online]. Available: &lt;a href=&quot;https://practicaltechnologyforarchives.org/issue4_prael_wickner/&quot;&gt;https://practicaltechnologyforarchives.org/issue4_prael_wickner/&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;[12] D. Salo, &lt;em&gt;Building audio, video and data-rescue kits&lt;/em&gt;, 2017.
    [Online]. Available: &lt;a href=&quot;https://radd.dsalo.info/wp-content/uploads/2017/10/BuildDocumentation.pdf&quot;&gt;https://radd.dsalo.info/wp-content/uploads/2017/10/BuildDocumentation.pdf&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;[13] BitCurator Consortium, &lt;em&gt;Workflows&lt;/em&gt;. [Online].
    Available: &lt;a href=&quot;https://bitcuratorconsortium.org/workflows&quot;&gt;https://bitcuratorconsortium.org/workflows&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;[14] T. De Haan, “Project the digital city revives, a case study of web
    archaeology,” &lt;em&gt;Proceedings of the 13th
    International Conference on Digital Preservation&lt;/em&gt;, 2016.
    [Online]. Available: &lt;a href=&quot;https://doi.org/10.5281/zenodo.1255965&quot;&gt;https://doi.org/10.5281/zenodo.1255965&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;[15] T. De Haan, R. Jansma, and P. Vogel, &lt;em&gt;Do it yourself handboek voor
    webarcheologie&lt;/em&gt;, 2017. [Online]. Available:
    &lt;a href=&quot;https://hart.amsterdam/image/2017/11/17/20171116_freeze_diy_handboek.pdf&quot;&gt;https://hart.amsterdam/image/2017/11/17/20171116_freeze_diy_handboek.pdf&lt;/a&gt;.&lt;/p&gt;

&lt;hr /&gt;

&lt;p&gt;Originally published in &lt;a href=&quot;https://ipres2019.org/static/pdf/iPres2019_paper_9.pdf&quot;&gt;Proceedings of the 16th
International Conference on Digital Preservation, 2019&lt;/a&gt;&lt;/p&gt;
&lt;div class=&quot;footnotes&quot; role=&quot;doc-endnotes&quot;&gt;
  &lt;ol&gt;
    &lt;li id=&quot;fn:1&quot;&gt;
      &lt;p&gt;Elizabeth Mattijsen, old xxLINK-homepage:
&lt;a href=&quot;https://liz.nl/xxlink-homepage.htm&quot;&gt;https://liz.nl/xxlink-homepage.htm&lt;/a&gt; &lt;a href=&quot;#fnref:1&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:2&quot;&gt;
      &lt;p&gt;See e.g. the links in the “Tapes” section at
&lt;a href=&quot;https://github.com/KBNLresearch/forensicImagingResources/blob/master/doc/df-resources.md&quot;&gt;https://github.com/KBNLresearch/forensicImagingResources/blob/master/doc/df-resources.md&lt;/a&gt; &lt;a href=&quot;#fnref:2&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:3&quot;&gt;
      &lt;p&gt;BitCurator: &lt;a href=&quot;https://bitcurator.net/&quot;&gt;https://bitcurator.net/&lt;/a&gt; &lt;a href=&quot;#fnref:3&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:4&quot;&gt;
      &lt;p&gt;“Digital Data Storage”, Wikipedia:
&lt;a href=&quot;https://en.wikipedia.org/wiki/Digital_Data_Storage&quot;&gt;https://en.wikipedia.org/wiki/Digital_Data_Storage&lt;/a&gt; &lt;a href=&quot;#fnref:4&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:5&quot;&gt;
      &lt;p&gt;“HP StorageWorks DDS/DAT Media - DDS/DAT Media Compatibility
Matrix”, Hewlett Packard:
&lt;a href=&quot;https://support.hpe.com/hpsc/doc/public/display?docId=emr_na-lpg50457&quot;&gt;https://support.hpe.com/hpsc/doc/public/display?docId=emr_na-lpg50457&lt;/a&gt; &lt;a href=&quot;#fnref:5&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:6&quot;&gt;
      &lt;p&gt;“Digital Linear Tape”, Wikipedia:
&lt;a href=&quot;https://en.wikipedia.org/wiki/Digital_Linear_Tape&quot;&gt;https://en.wikipedia.org/wiki/Digital_Linear_Tape&lt;/a&gt; &lt;a href=&quot;#fnref:6&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:7&quot;&gt;
      &lt;p&gt;“DLT Drive Media and Cleaning Tape Compatibility Guide”,
TapeandMedia.com:
&lt;a href=&quot;https://www.tapeandmedia.com/dlt_capacity_info.asp&quot;&gt;https://www.tapeandmedia.com/dlt_capacity_info.asp&lt;/a&gt; &lt;a href=&quot;#fnref:7&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:8&quot;&gt;
      &lt;p&gt;“Parallel SCSI”, Wikipedia:
&lt;a href=&quot;https://en.wikipedia.org/wiki/Parallel_SCSI&quot;&gt;https://en.wikipedia.org/wiki/Parallel_SCSI&lt;/a&gt; &lt;a href=&quot;#fnref:8&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:9&quot;&gt;
      &lt;p&gt;“Conventional PCI”, Wikipedia:
&lt;a href=&quot;https://en.wikipedia.org/wiki/Conventional_PCI&quot;&gt;https://en.wikipedia.org/wiki/Conventional_PCI&lt;/a&gt; &lt;a href=&quot;#fnref:9&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:10&quot;&gt;
      &lt;p&gt;“PCI Express”, Wikipedia:
&lt;a href=&quot;https://en.wikipedia.org/wiki/PCI_Express&quot;&gt;https://en.wikipedia.org/wiki/PCI_Express&lt;/a&gt; &lt;a href=&quot;#fnref:10&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:11&quot;&gt;
      &lt;p&gt;“Adaptec Support”, Microsemi:
&lt;a href=&quot;https://storage.microsemi.com/en-us/support/scsi/&quot;&gt;https://storage.microsemi.com/en-us/support/scsi/&lt;/a&gt; &lt;a href=&quot;#fnref:11&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:12&quot;&gt;
      &lt;p&gt;“Parallel SCSI”, Wikipedia:
&lt;a href=&quot;https://en.wikipedia.org/wiki/Parallel_SCSI&quot;&gt;https://en.wikipedia.org/wiki/Parallel_SCSI&lt;/a&gt; &lt;a href=&quot;#fnref:12&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:13&quot;&gt;
      &lt;p&gt;“LVD, SE, HVD, SCSI Compatibility - Or Not”, Paralan:
&lt;a href=&quot;http://www.paralan.com/scsiexpert.html&quot;&gt;http://www.paralan.com/scsiexpert.html&lt;/a&gt; &lt;a href=&quot;#fnref:13&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:14&quot;&gt;
      &lt;p&gt;“SCSI connector”, Wikipedia:
&lt;a href=&quot;https://en.wikipedia.org/wiki/SCSI_connector&quot;&gt;https://en.wikipedia.org/wiki/SCSI_connector&lt;/a&gt; &lt;a href=&quot;#fnref:14&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:15&quot;&gt;
      &lt;p&gt;“What kind of SCSI do I have?”, Paralan:
&lt;a href=&quot;http://www.paralan.com/sediff.html&quot;&gt;http://www.paralan.com/sediff.html&lt;/a&gt; &lt;a href=&quot;#fnref:15&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:16&quot;&gt;
      &lt;p&gt;“Parallel SCSI”, Wikipedia:
&lt;a href=&quot;https://en.wikipedia.org/wiki/Parallel_SCSI&quot;&gt;https://en.wikipedia.org/wiki/Parallel_SCSI&lt;/a&gt; &lt;a href=&quot;#fnref:16&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:17&quot;&gt;
      &lt;p&gt;“SCSI termination Q&amp;amp;A”, Adaptec:
&lt;a href=&quot;https://storage.microsemi.com/en-us/support/scsi/3940/aha-3940uwd/hw_install/scsi_termination.htm&quot;&gt;https://storage.microsemi.com/en-us/support/scsi/3940/aha-3940uwd/hw_install/scsi_termination.htm&lt;/a&gt; &lt;a href=&quot;#fnref:17&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:18&quot;&gt;
      &lt;p&gt;“dd (Unix)”, Wikipedia:
&lt;a href=&quot;https://en.wikipedia.org/wiki/Dd_%28Unix%29&quot;&gt;https://en.wikipedia.org/wiki/Dd_%28Unix%29&lt;/a&gt; &lt;a href=&quot;#fnref:18&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:19&quot;&gt;
      &lt;p&gt;“file (command)”, Wikipedia:
&lt;a href=&quot;https://en.wikipedia.org/wiki/File_(command)&quot;&gt;https://en.wikipedia.org/wiki/File_(command)&lt;/a&gt; &lt;a href=&quot;#fnref:19&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:20&quot;&gt;
      &lt;p&gt;Apache Tika: &lt;a href=&quot;https://tika.apache.org/&quot;&gt;https://tika.apache.org/&lt;/a&gt; &lt;a href=&quot;#fnref:20&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:21&quot;&gt;
      &lt;p&gt;Fido: &lt;a href=&quot;http://fido.openpreservation.org/&quot;&gt;http://fido.openpreservation.org/&lt;/a&gt; &lt;a href=&quot;#fnref:21&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:22&quot;&gt;
      &lt;p&gt;Siegfried: &lt;a href=&quot;https://www.itforarchivists.com/siegfried/&quot;&gt;https://www.itforarchivists.com/siegfried/&lt;/a&gt; &lt;a href=&quot;#fnref:22&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:23&quot;&gt;
      &lt;p&gt;“mt(1) - Linux man page”, die.net:
&lt;a href=&quot;https://linux.die.net/man/1/mt&quot;&gt;https://linux.die.net/man/1/mt&lt;/a&gt; &lt;a href=&quot;#fnref:23&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:24&quot;&gt;
      &lt;p&gt;“Tapeimgr”: &lt;a href=&quot;https://github.com/KBNLresearch/tapeimgr&quot;&gt;https://github.com/KBNLresearch/tapeimgr&lt;/a&gt; &lt;a href=&quot;#fnref:24&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:25&quot;&gt;
      &lt;p&gt;“Guymager homepage”: &lt;a href=&quot;https://guymager.sourceforge.io/&quot;&gt;https://guymager.sourceforge.io/&lt;/a&gt; &lt;a href=&quot;#fnref:25&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:26&quot;&gt;
      &lt;p&gt;“‘Cannot allocate memory’ when reading from SCSI tape”, Unix
Stack Exchange: &lt;a href=&quot;https://unix.stackexchange.com/a/366217&quot;&gt;https://unix.stackexchange.com/a/366217&lt;/a&gt; &lt;a href=&quot;#fnref:26&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:27&quot;&gt;
      &lt;p&gt;“Tape Driver Semantics”, Amanda Wiki:
&lt;a href=&quot;https://wiki.zmanda.com/index.php/Tape_Driver_Semantics&quot;&gt;https://wiki.zmanda.com/index.php/Tape_Driver_Semantics&lt;/a&gt; &lt;a href=&quot;#fnref:27&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:28&quot;&gt;
      &lt;p&gt;“KB Forensic Imaging Resources”:
&lt;a href=&quot;https://github.com/KBNLresearch/forensicImagingResources/tree/master/doc&quot;&gt;https://github.com/KBNLresearch/forensicImagingResources/tree/master/doc&lt;/a&gt; &lt;a href=&quot;#fnref:28&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:29&quot;&gt;
      &lt;p&gt;“file (command)”, Wikipedia:
&lt;a href=&quot;https://en.wikipedia.org/wiki/File_(command)&quot;&gt;https://en.wikipedia.org/wiki/File_(command)&lt;/a&gt; &lt;a href=&quot;#fnref:29&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:30&quot;&gt;
      &lt;p&gt;“Unix dump”, ArchiveTeam File Formats Wiki:
&lt;a href=&quot;http://fileformats.archiveteam.org/wiki/Unix_dump&quot;&gt;http://fileformats.archiveteam.org/wiki/Unix_dump&lt;/a&gt; &lt;a href=&quot;#fnref:30&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:31&quot;&gt;
      &lt;p&gt;“restore(8) - Linux man page”,
die.net: &lt;a href=&quot;https://linux.die.net/man/8/restore&quot;&gt;https://linux.die.net/man/8/restore&lt;/a&gt; &lt;a href=&quot;#fnref:31&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:32&quot;&gt;
      &lt;p&gt;A step-by-step description can be found here:
&lt;a href=&quot;https://github.com/KBNLresearch/forensicImagingResources/blob/master/doc/extract-dumpfile.md&quot;&gt;https://github.com/KBNLresearch/forensicImagingResources/blob/master/doc/extract-dumpfile.md&lt;/a&gt; &lt;a href=&quot;#fnref:32&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:33&quot;&gt;
      &lt;p&gt;“NTFS”, Wikipedia: &lt;a href=&quot;https://en.wikipedia.org/wiki/NTFS&quot;&gt;https://en.wikipedia.org/wiki/NTFS&lt;/a&gt; &lt;a href=&quot;#fnref:33&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:34&quot;&gt;
      &lt;p&gt;“Ext4”, Wikipedia: &lt;a href=&quot;https://en.wikipedia.org/wiki/Ext4&quot;&gt;https://en.wikipedia.org/wiki/Ext4&lt;/a&gt; &lt;a href=&quot;#fnref:34&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:35&quot;&gt;
      &lt;p&gt;“tar(1) - Linux man page”,
die.net: &lt;a href=&quot;https://linux.die.net/man/1/tar&quot;&gt;https://linux.die.net/man/1/tar&lt;/a&gt; &lt;a href=&quot;#fnref:35&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:36&quot;&gt;
      &lt;p&gt;Microsoft Tape Format Specification Version 1.00a:
&lt;a href=&quot;http://laytongraphics.com/mtf/MTF_100a.PDF&quot;&gt;http://laytongraphics.com/mtf/MTF_100a.PDF&lt;/a&gt; &lt;a href=&quot;#fnref:36&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:37&quot;&gt;
      &lt;p&gt;These steps are described in more detail here:
&lt;a href=&quot;https://github.com/KBNLresearch/nl-menu-resources/blob/master/doc/serving-static-website-with-Apache.md&quot;&gt;https://github.com/KBNLresearch/nl-menu-resources/blob/master/doc/serving-static-website-with-Apache.md&lt;/a&gt; &lt;a href=&quot;#fnref:37&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:38&quot;&gt;
      &lt;p&gt;The Apache HTTP Server Project: &lt;a href=&quot;https://httpd.apache.org/&quot;&gt;https://httpd.apache.org/&lt;/a&gt; &lt;a href=&quot;#fnref:38&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:39&quot;&gt;
      &lt;p&gt;Link: &lt;a href=&quot;https://web.archive.org/web/19961018155616/http://www.schiphol.nl/&quot;&gt;https://web.archive.org/web/19961018155616/http://www.schiphol.nl/&lt;/a&gt; &lt;a href=&quot;#fnref:39&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:40&quot;&gt;
      &lt;p&gt;“KB Forensic Imaging Resources”:
&lt;a href=&quot;https://github.com/KBNLresearch/forensicImagingResources/tree/master/doc&quot;&gt;https://github.com/KBNLresearch/forensicImagingResources/tree/master/doc&lt;/a&gt; &lt;a href=&quot;#fnref:40&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
  &lt;/ol&gt;
&lt;/div&gt;
</description>
                <link>https://bitsgalore.org/2019/09/09/recovering-90s-data-tapes-experiences-kb-web-archaeology</link>
                <guid>https://bitsgalore.org/2019/09/09/recovering-90s-data-tapes-experiences-kb-web-archaeology</guid>
                <pubDate>2019-09-09T00:00:00+02:00</pubDate>
        </item>

        <item>
                <title>A simple disk imaging workflow tool</title>
                <description>&lt;figure class=&quot;image&quot;&gt;
  &lt;img src=&quot;/images/2019/04/floppies.jpg&quot; alt=&quot;Photograph of SATA hard disk, USB Flash drive and 3.5-inch floppy disks&quot; /&gt;
  &lt;figcaption&gt;SATA hard disk, USB Flash drive and 3.5&quot; floppy disks&lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;As I explained in the introduction of &lt;a href=&quot;/2019/01/31/roll-the-tape-recovering-90s-data-tapes-in-bitcurator&quot;&gt;this earlier blog post&lt;/a&gt;, as part of our ongoing web archaeology project we are currently developing workflows for reading data from a variety of physical carrier formats. After the earlier work on &lt;a href=&quot;/2019/01/31/roll-the-tape-recovering-90s-data-tapes-in-bitcurator&quot;&gt;data tapes&lt;/a&gt; and &lt;a href=&quot;/2019/03/22/a-simple-workflow-tool-for-imaging-optical-media-using-readom-and-ddrescue&quot;&gt;optical media&lt;/a&gt;, the next job was to image a small box with 3.5” floppy disks. Easy enough, and my first thought was to fire up &lt;a href=&quot;https://guymager.sourceforge.io/&quot;&gt;&lt;em&gt;Guymager&lt;/em&gt;&lt;/a&gt; and be done with it. This turned out to be less straightforward than expected, which led to the development of yet another workflow tool: &lt;a href=&quot;https://github.com/KBNLresearch/diskimgr&quot;&gt;&lt;em&gt;diskimgr&lt;/em&gt;&lt;/a&gt;. In the remainder of this post I will first show the issues I ran into with &lt;em&gt;Guymager&lt;/em&gt;, and then demonstrate how these issues are remedied by &lt;em&gt;diskimgr&lt;/em&gt;.&lt;/p&gt;

&lt;!-- more --&gt;

&lt;h2 id=&quot;guymager-workflow&quot;&gt;Guymager workflow&lt;/h2&gt;

&lt;p&gt;When I tried to image some floppies with &lt;em&gt;Guymager&lt;/em&gt;, I quickly ran into several problems, all of which involved the handling of user-added descriptive metadata. &lt;em&gt;Guymager&lt;/em&gt; does in fact accommodate such metadata, but the way this is implemented introduces some practical issues. To illustrate this, here’s &lt;em&gt;Guymager&lt;/em&gt;’s default entry form:&lt;/p&gt;

&lt;figure class=&quot;image&quot;&gt;
  &lt;img src=&quot;/images/2019/04/guymager-entry.png&quot; alt=&quot;Screenshot of Guymager entry form, Expert Witness format&quot; /&gt;
  &lt;figcaption&gt;Guymager entry form, Expert Witness format&lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;The form provides various fields that can be used to enter descriptive metadata, but they are &lt;em&gt;only&lt;/em&gt; available if one chooses to write the disk image in &lt;a href=&quot;https://www.loc.gov/preservation/digital/formats/fdd/fdd000406.shtml&quot;&gt;&lt;em&gt;Expert Witness Format&lt;/em&gt;&lt;/a&gt;. Changing the format to &lt;em&gt;Linux dd raw image&lt;/em&gt; (which simply generates a raw copy of the imaged medium’s bytestream) disables these fields:&lt;/p&gt;

&lt;figure class=&quot;image&quot;&gt;
  &lt;img src=&quot;/images/2019/04/guymager-dd.png&quot; alt=&quot;Screenshot of Guymager entry form, dd format&quot; /&gt;
  &lt;figcaption&gt;Guymager entry form, dd format&lt;/figcaption&gt;
&lt;/figure&gt;

&lt;h2 id=&quot;expert-witness-format&quot;&gt;Expert Witness Format&lt;/h2&gt;

&lt;p&gt;One possible solution would be to image the floppies to &lt;em&gt;Expert Witness Format&lt;/em&gt; (&lt;em&gt;EWF&lt;/em&gt;), in which case the entered descriptor fields are embedded in the disk image. However, I really don’t want to do this. First, saving to &lt;em&gt;EWF&lt;/em&gt; would make further processing of the disk image (e.g. mounting the file system, or attaching it to a virtual machine or emulator) more difficult, since it requires that the reading application (emulator, disk mount tool) supports not only the floppy’s native file system (typically &lt;a href=&quot;https://forensicswiki.org/wiki/FAT&quot;&gt;&lt;em&gt;FAT&lt;/em&gt;&lt;/a&gt; for floppies that were written with MS-DOS or Windows), but also the added &lt;em&gt;EWF&lt;/em&gt; layer. Also, while &lt;em&gt;EWF&lt;/em&gt;’s support for data compression can be tremendously useful for imaging large hard disks, it is largely unnecessary for 1.44 MB floppies.&lt;/p&gt;

&lt;h2 id=&quot;guymager-metadata-entry&quot;&gt;Guymager metadata entry&lt;/h2&gt;

&lt;p&gt;Last but not least, &lt;em&gt;Guymager&lt;/em&gt;’s interface introduces another practical hurdle. When &lt;em&gt;Guymager&lt;/em&gt; is launched, it shows a list of available devices:&lt;/p&gt;

&lt;figure class=&quot;image&quot;&gt;
  &lt;img src=&quot;/images/2019/04/guymager-mainscreen.png&quot; alt=&quot;Screenshot of Guymager startup screen with device list&quot; /&gt;
  &lt;figcaption&gt;Guymager startup screen with device list&lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;The user then selects the device they want to image, after which the aforementioned data entry form pops up. However, for floppies the descriptive metadata that one would enter here is typically found on the label on the floppy itself, &lt;em&gt;which by this time is inaccessible because the floppy is inside the floppy drive&lt;/em&gt;!&lt;/p&gt;

&lt;figure class=&quot;image&quot;&gt;
  &lt;img src=&quot;/images/2019/04/floppy-label.jpg&quot; alt=&quot;Photograph of floppy with descriptive information on label&quot; /&gt;
  &lt;figcaption&gt;Floppy with descriptive information on label&lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;Of course there are workarounds to this (e.g. you could copy the information from the labels beforehand, put it in a text file and then paste it into &lt;em&gt;Guymager&lt;/em&gt;’s entry fields), but overall this would be pretty clumsy.&lt;/p&gt;

&lt;h2 id=&quot;linux-dd-raw&quot;&gt;Linux dd raw&lt;/h2&gt;

&lt;p&gt;An alternative solution would be to select the &lt;em&gt;Linux dd raw&lt;/em&gt; option in &lt;em&gt;Guymager&lt;/em&gt;, and then add the metadata by hand afterwards. Again (as I also argued in &lt;a href=&quot;/2019/03/22/a-simple-workflow-tool-for-imaging-optical-media-using-readom-and-ddrescue&quot;&gt;my previous blog post&lt;/a&gt;), this is pretty cumbersome, and prone to all sorts of errors.&lt;/p&gt;

&lt;h2 id=&quot;diskimgr&quot;&gt;Diskimgr&lt;/h2&gt;

&lt;p&gt;As &lt;a href=&quot;https://github.com/KBNLresearch/omimgr&quot;&gt;&lt;em&gt;omimgr&lt;/em&gt;&lt;/a&gt; already solves the above problems for optical media, I simply used the code of &lt;em&gt;omimgr&lt;/em&gt; as a starting point, and adapted it into &lt;a href=&quot;https://github.com/KBNLresearch/diskimgr&quot;&gt;&lt;em&gt;diskimgr&lt;/em&gt;&lt;/a&gt;. &lt;em&gt;Diskimgr&lt;/em&gt; is a general-purpose disk imaging tool that can be used for a wide variety of digital media, such as floppy disks, USB Flash drives and hard disks. It provides a simple graphical user interface for the entry of descriptive metadata, and all entered and generated metadata are written to a &lt;em&gt;JSON&lt;/em&gt; file, along with the image file.&lt;/p&gt;

&lt;figure class=&quot;image&quot;&gt;
  &lt;img src=&quot;/images/2019/04/diskimgr-1.png&quot; alt=&quot;Screenshot of diskimgr interface&quot; /&gt;
  &lt;figcaption&gt;Diskimgr interface&lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;Internally &lt;em&gt;diskimgr&lt;/em&gt; wraps around &lt;a href=&quot;https://linux.die.net/man/1/dd&quot;&gt;&lt;em&gt;Unix dd&lt;/em&gt;&lt;/a&gt; and &lt;a href=&quot;https://linux.die.net/man/1/ddrescue&quot;&gt;&lt;em&gt;ddrescue&lt;/em&gt;&lt;/a&gt;. The general workflow of &lt;em&gt;diskimgr&lt;/em&gt; is similar to the one employed by &lt;em&gt;omimgr&lt;/em&gt;: first it tries to read a user-defined medium with &lt;em&gt;dd&lt;/em&gt;. If &lt;em&gt;dd&lt;/em&gt; fails, it prompts the user to give it another try with &lt;em&gt;ddrescue&lt;/em&gt;. If &lt;em&gt;ddrescue&lt;/em&gt; was unable to recover all the data from the medium, additional &lt;em&gt;ddrescue&lt;/em&gt; passes may be run to further improve the result. As an example, the screenshot below was taken after a &lt;em&gt;diskimgr&lt;/em&gt; session with a damaged 3.5” floppy disk:&lt;/p&gt;

&lt;figure class=&quot;image&quot;&gt;
  &lt;img src=&quot;/images/2019/04/ddrescue-pass2.png&quot; alt=&quot;Screenshot of diskimgr after two ddrescue passes&quot; /&gt;
  &lt;figcaption&gt;Diskimgr after two ddrescue passes&lt;/figcaption&gt;
&lt;/figure&gt;

&lt;p&gt;After an initial attempt to image this floppy with &lt;em&gt;dd&lt;/em&gt; failed with errors, a first pass with &lt;em&gt;ddrescue&lt;/em&gt; resulted in a 106 kB block of unreadable data. A second &lt;em&gt;ddrescue&lt;/em&gt; pass with the &lt;em&gt;Direct disc mode&lt;/em&gt; option switched on reduced the size of the unreadable block to a mere 512 bytes (one sector).&lt;/p&gt;
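
&lt;p&gt;The fallback logic described above can be sketched in a few lines of Python. This is a minimal illustration and not &lt;em&gt;diskimgr&lt;/em&gt;’s actual code: the function names, the map file argument and the injectable &lt;code&gt;run&lt;/code&gt; parameter are assumptions made for this example, and unlike &lt;em&gt;diskimgr&lt;/em&gt; the sketch falls back to &lt;em&gt;ddrescue&lt;/em&gt; automatically instead of prompting the user. The &lt;em&gt;ddrescue&lt;/em&gt; options used are &lt;code&gt;-b&lt;/code&gt; (sector size), &lt;code&gt;-r&lt;/code&gt; (retry passes) and &lt;code&gt;-d&lt;/code&gt; (direct disc access).&lt;/p&gt;

```python
import subprocess


def dd_command(device, image, block_size=512):
    """Build a dd command line (hypothetical helper, not part of diskimgr)."""
    return ["dd", f"if={device}", f"of={image}", f"bs={block_size}", "conv=notrunc"]


def ddrescue_command(device, image, mapfile, block_size=512, retries=4, direct=False):
    """Build a ddrescue command line; direct=True adds -d (direct disc access)."""
    cmd = ["ddrescue", "-b", str(block_size), "-r", str(retries)]
    if direct:
        cmd.append("-d")
    return cmd + [device, image, mapfile]


def image_device(device, image, mapfile, run=subprocess.run):
    """Try dd first; if dd exits with an error, try again with ddrescue.

    Returns a tuple of the tool that was run last and its exit status.
    The 'run' parameter is an assumption of this sketch: it makes it
    possible to swap in a stand-in for subprocess.run when testing.
    """
    result = run(dd_command(device, image))
    if result.returncode == 0:
        return "dd", 0
    # dd failed: fall back to ddrescue, which keeps track of
    # unreadable areas in the map file
    result = run(ddrescue_command(device, image, mapfile))
    return "ddrescue", result.returncode
```

&lt;p&gt;One caveat: to my knowledge &lt;em&gt;ddrescue&lt;/em&gt; returns exit status 0 whenever it terminates normally, even if some sectors could not be read. So the exit status alone does not tell whether the image is complete, and the map file would need to be inspected for that.&lt;/p&gt;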

&lt;h2 id=&quot;metadata&quot;&gt;Metadata&lt;/h2&gt;

&lt;p&gt;Descriptive metadata, a SHA-512 checksum of the disk image, and a host of event metadata are written to a &lt;em&gt;JSON&lt;/em&gt; file in a format that is largely identical to the one used by &lt;em&gt;omimgr&lt;/em&gt;. Below is an example:&lt;/p&gt;

&lt;div class=&quot;language-json highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
    &lt;/span&gt;&lt;span class=&quot;nl&quot;&gt;&quot;acquisitionEnd&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;2019-04-09T13:10:40.503984+02:00&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
    &lt;/span&gt;&lt;span class=&quot;nl&quot;&gt;&quot;acquisitionStart&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;2019-04-09T13:09:59.835833+02:00&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
    &lt;/span&gt;&lt;span class=&quot;nl&quot;&gt;&quot;autoRetry&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;kc&quot;&gt;false&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
    &lt;/span&gt;&lt;span class=&quot;nl&quot;&gt;&quot;blockDevice&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;/dev/sdc&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
    &lt;/span&gt;&lt;span class=&quot;nl&quot;&gt;&quot;checksumType&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;SHA-512&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
    &lt;/span&gt;&lt;span class=&quot;nl&quot;&gt;&quot;checksums&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
        &lt;/span&gt;&lt;span class=&quot;nl&quot;&gt;&quot;ks.img&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;79a17d3fa536b8fa750257b01d05124dadb888f1171e9ca5cc3398a2c16de81b1687b52c70135b966409a723ef5f3960536a6e994847c5ebe7d5eaffefa62dc7&quot;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
    &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
    &lt;/span&gt;&lt;span class=&quot;nl&quot;&gt;&quot;description&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;KS metingen origineel&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
    &lt;/span&gt;&lt;span class=&quot;nl&quot;&gt;&quot;diskimgrVersion&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;0.1.0b3&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
    &lt;/span&gt;&lt;span class=&quot;nl&quot;&gt;&quot;extension&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;img&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
    &lt;/span&gt;&lt;span class=&quot;nl&quot;&gt;&quot;identifier&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;cc630cda-5ab7-11e9-bc82-dc4a3e5f53bf&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
    &lt;/span&gt;&lt;span class=&quot;nl&quot;&gt;&quot;interruptedFlag&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;kc&quot;&gt;false&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
    &lt;/span&gt;&lt;span class=&quot;nl&quot;&gt;&quot;maxRetries&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;4&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
    &lt;/span&gt;&lt;span class=&quot;nl&quot;&gt;&quot;notes&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
    &lt;/span&gt;&lt;span class=&quot;nl&quot;&gt;&quot;prefix&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;ks&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
    &lt;/span&gt;&lt;span class=&quot;nl&quot;&gt;&quot;readCommandLine&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;dd if=/dev/sdc of=/home/johan/test/6/ks.img bs=512 conv=notrunc&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
    &lt;/span&gt;&lt;span class=&quot;nl&quot;&gt;&quot;readMethod&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;dd&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
    &lt;/span&gt;&lt;span class=&quot;nl&quot;&gt;&quot;readMethodVersion&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;dd (coreutils) 8.25&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
    &lt;/span&gt;&lt;span class=&quot;nl&quot;&gt;&quot;rescueDirectDiscMode&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;kc&quot;&gt;false&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
    &lt;/span&gt;&lt;span class=&quot;nl&quot;&gt;&quot;successFlag&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;kc&quot;&gt;true&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
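
&lt;p&gt;Because the checksums are stored together with the image file names, a later fixity check is straightforward. The following Python sketch recomputes the SHA-512 hash of each file listed under &lt;code&gt;checksums&lt;/code&gt; and compares it with the stored value. It assumes that the image files sit in the same directory as the &lt;em&gt;JSON&lt;/em&gt; file; the function names, and the file name &lt;code&gt;metadata.json&lt;/code&gt; used in the usage note, are hypothetical and not part of &lt;em&gt;diskimgr&lt;/em&gt;.&lt;/p&gt;

```python
import hashlib
import json
from pathlib import Path


def sha512_of(path, chunk_size=1024 * 1024):
    """Return the SHA-512 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha512()
    with open(path, "rb") as stream:
        while True:
            chunk = stream.read(chunk_size)
            if not chunk:
                break
            digest.update(chunk)
    return digest.hexdigest()


def verify_images(metadata_path):
    """Compare each image's current SHA-512 hash with the stored one.

    Assumes the image files are located in the same directory as the
    metadata file. Returns a dict that maps each file name to True
    (hash matches) or False (hash differs).
    """
    metadata_path = Path(metadata_path)
    metadata = json.loads(metadata_path.read_text(encoding="utf-8"))
    if metadata["checksumType"] != "SHA-512":
        raise ValueError("unexpected checksum type: " + metadata["checksumType"])
    return {
        name: sha512_of(metadata_path.parent / name) == expected
        for name, expected in metadata["checksums"].items()
    }
```

&lt;p&gt;For the example above, calling &lt;code&gt;verify_images(&quot;metadata.json&quot;)&lt;/code&gt; would return a dictionary with a single entry for &lt;code&gt;ks.img&lt;/code&gt;, whose value is &lt;code&gt;True&lt;/code&gt; only if the image file is unchanged.&lt;/p&gt;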

&lt;h2 id=&quot;main-uses&quot;&gt;Main uses&lt;/h2&gt;

&lt;p&gt;While &lt;em&gt;diskimgr&lt;/em&gt; can be used for virtually any kind of block device, including entire hard disks, it is not intended to be a full replacement for tools such as &lt;em&gt;Guymager&lt;/em&gt;. For instance, for imaging a 500 GB hard disk I’d probably still prefer &lt;em&gt;Guymager&lt;/em&gt;. However, for situations where one wants to image a large number of small-size media (such as floppies), &lt;em&gt;Guymager&lt;/em&gt; is less than ideal, and &lt;em&gt;diskimgr&lt;/em&gt; might be worth a try. It also provides a user-friendly alternative to using &lt;em&gt;dd&lt;/em&gt; and &lt;em&gt;ddrescue&lt;/em&gt; from the command-line.&lt;/p&gt;

&lt;h2 id=&quot;final-remarks&quot;&gt;Final remarks&lt;/h2&gt;

&lt;p&gt;Just like &lt;em&gt;tapeimgr&lt;/em&gt; and &lt;em&gt;omimgr&lt;/em&gt;, &lt;em&gt;diskimgr&lt;/em&gt; only works on Linux-based systems. Again, this is an initial release which has had limited testing, so use at your own risk. If you run into any issues, feel free to &lt;a href=&quot;https://github.com/KBNLresearch/diskimgr/issues&quot;&gt;report them here&lt;/a&gt;.&lt;/p&gt;

&lt;h2 id=&quot;link-to-diskimgr&quot;&gt;Link to diskimgr&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Diskimgr&lt;/em&gt; and its documentation can be found here:&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://github.com/KBNLresearch/diskimgr&quot;&gt;&lt;em&gt;diskimgr - Simple workflow tool for imaging block devices&lt;/em&gt;&lt;/a&gt;&lt;/p&gt;
</description>
                <link>https://bitsgalore.org/2019/04/10/a-simple-disk-imaging-workflow-tool</link>
                <guid>https://bitsgalore.org/2019/04/10/a-simple-disk-imaging-workflow-tool</guid>
                <pubDate>2019-04-10T00:00:00+02:00</pubDate>
        </item>

        <item>
                <title>A simple workflow tool for imaging optical media using readom and ddrescue</title>
                <description>&lt;p&gt;In 2015 I wrote &lt;a href=&quot;/2015/11/13/preserving-optical-media-from-the-command-line&quot;&gt;a blog post on preserving optical media from the command-line&lt;/a&gt;. Among other things, it suggested a rudimentary workflow for imaging CD-ROMs and DVDs using the &lt;a href=&quot;http://linux.die.net/man/1/readom&quot;&gt;&lt;em&gt;readom&lt;/em&gt;&lt;/a&gt; and &lt;a href=&quot;http://linux.die.net/man/1/ddrescue&quot;&gt;&lt;em&gt;ddrescue&lt;/em&gt;&lt;/a&gt; tools. Even though we now have a &lt;a href=&quot;/2017/06/19/image-and-rip-optical-media-like-a-boss&quot;&gt;highly automated workflow in place&lt;/a&gt; for bulk processing optical media from our deposit collection, &lt;em&gt;readom&lt;/em&gt; and &lt;em&gt;ddrescue&lt;/em&gt; still prove to be useful for various special cases that don’t quite fit into this workflow. The materials that we are currently receiving as part of our web archaeology activities are a good example. These are typically small sets of recordable CD-ROMs that are often quite old, and such discs are highly likely to be in less than perfect condition. For these cases a highly automated, &lt;a href=&quot;https://github.com/KBNLresearch/iromlab&quot;&gt;&lt;em&gt;iromlab&lt;/em&gt;&lt;/a&gt;-like workflow is unnecessary, and to some degree even impractical. Nevertheless, it would be useful to have &lt;em&gt;some&lt;/em&gt; degree of automation, especially for things like the addition and packaging of associated metadata. This prompted the development of the &lt;a href=&quot;https://github.com/KBNLresearch/omimgr&quot;&gt;&lt;em&gt;omimgr&lt;/em&gt;&lt;/a&gt; workflow tool. In the remainder of this blog post I will give an overview of &lt;em&gt;omimgr&lt;/em&gt;.&lt;/p&gt;

&lt;!-- more --&gt;

&lt;h2 id=&quot;the-command-line-workflow&quot;&gt;The command-line workflow&lt;/h2&gt;

&lt;p&gt;In &lt;a href=&quot;/2015/11/13/preserving-optical-media-from-the-command-line&quot;&gt;my 2015 blog post&lt;/a&gt; I reviewed a number of command-line imaging tools for optical media. For CD-ROMs and DVDs I recommended first trying to image the disc with &lt;em&gt;readom&lt;/em&gt;, and falling back on &lt;em&gt;ddrescue&lt;/em&gt; in case &lt;em&gt;readom&lt;/em&gt; fails. The logic behind this is that &lt;em&gt;readom&lt;/em&gt; was specifically designed for reading optical media, which makes it preferable over generic block device recovery tools such as &lt;a href=&quot;https://guymager.sourceforge.io/&quot;&gt;&lt;em&gt;Guymager&lt;/em&gt;&lt;/a&gt;, &lt;a href=&quot;http://linux.die.net/man/1/dd&quot;&gt;&lt;em&gt;dd&lt;/em&gt;&lt;/a&gt; or &lt;em&gt;ddrescue&lt;/em&gt;. However, &lt;em&gt;readom&lt;/em&gt; gives up rather easily on discs that are damaged or otherwise degraded, and for these cases &lt;em&gt;ddrescue&lt;/em&gt; is often capable of recovering surprising amounts of data. For example, our earlier success at &lt;a href=&quot;/2018/04/24/resurrecting-the-first-dutch-web-index-nl-menu-revisited&quot;&gt;resurrecting the first Dutch web index&lt;/a&gt; was largely thanks to &lt;em&gt;ddrescue&lt;/em&gt;’s ability to work its magic on the degraded CD-recordable that contained the source data. So, a basic command-line workflow that is based on &lt;em&gt;readom&lt;/em&gt; and &lt;em&gt;ddrescue&lt;/em&gt; would look like this:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;
    &lt;p&gt;Unmount the disc:&lt;/p&gt;

    &lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;umount /dev/sr0
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;    &lt;/div&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Try to create an ISO image of the disc with &lt;em&gt;readom&lt;/em&gt;:&lt;/p&gt;

    &lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;readom &lt;span class=&quot;nv&quot;&gt;retries&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;4 &lt;span class=&quot;nv&quot;&gt;dev&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;/dev/sr0 &lt;span class=&quot;nv&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;disc.iso
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;    &lt;/div&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;If &lt;em&gt;readom&lt;/em&gt; fails, try to image the disc with &lt;em&gt;ddrescue&lt;/em&gt;:&lt;/p&gt;

    &lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;ddrescue &lt;span class=&quot;nt&quot;&gt;-b&lt;/span&gt; 2048 &lt;span class=&quot;nt&quot;&gt;-r4&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-v&lt;/span&gt; /dev/sr0 disc.iso disc.map
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;    &lt;/div&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;If &lt;em&gt;ddrescue&lt;/em&gt; was unable to recover all the data on the disc, try to improve the result by re-running &lt;em&gt;ddrescue&lt;/em&gt; in direct disc mode:&lt;/p&gt;

    &lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;ddrescue &lt;span class=&quot;nt&quot;&gt;-d&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-b&lt;/span&gt; 2048 &lt;span class=&quot;nt&quot;&gt;-r4&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-v&lt;/span&gt; /dev/sr0 disc.iso disc.map
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;    &lt;/div&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;If there are still read errors after the above command, try to improve the result by re-running &lt;em&gt;ddrescue&lt;/em&gt; with another optical drive (e.g. an external USB-drive):&lt;/p&gt;

    &lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;ddrescue &lt;span class=&quot;nt&quot;&gt;-b&lt;/span&gt; 2048 &lt;span class=&quot;nt&quot;&gt;-r4&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-v&lt;/span&gt; /dev/sr1 disc.iso disc.map
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;    &lt;/div&gt;

    &lt;p&gt;(Note that steps 4 and 5 can be repeated for multiple optical drives, if needed).&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Check the extracted ISO image for completeness with &lt;a href=&quot;https://github.com/KBNLresearch/isolyzer&quot;&gt;&lt;em&gt;isolyzer&lt;/em&gt;&lt;/a&gt;:&lt;/p&gt;

    &lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;isolyzer disc.iso
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;    &lt;/div&gt;

    &lt;p&gt;If the value of the &lt;em&gt;smallerThanExpected&lt;/em&gt; element equals &lt;em&gt;False&lt;/em&gt;, this is an indication that the ISO image is probably intact.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ol&gt;
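
&lt;p&gt;The first three steps can be chained in a small shell function. The following is a minimal sketch rather than a definitive implementation: the function name &lt;em&gt;image_disc&lt;/em&gt; and its two arguments are made up for illustration, and it assumes that &lt;em&gt;readom&lt;/em&gt; returns a non-zero exit status when it fails. Deciding whether a re-run in direct disc mode or with another drive is needed requires a look at the &lt;em&gt;ddrescue&lt;/em&gt; output, so steps 4 to 6 are deliberately left out.&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# Hypothetical helper (not part of any existing tool): image a disc with
# readom, and fall back to ddrescue if readom reports an error.
image_disc() {
    local dev=&quot;$1&quot;      # optical drive, e.g. /dev/sr0
    local prefix=&quot;$2&quot;   # base name of the output files, e.g. disc

    # Unmount the disc; carry on if it was not mounted in the first place
    umount &quot;$dev&quot; || true

    # First attempt: readom (assumed to exit with non-zero status on failure)
    if readom retries=4 dev=&quot;$dev&quot; f=&quot;$prefix.iso&quot;; then
        return 0
    fi

    # Fallback: ddrescue with 2048-byte sectors and up to 4 retry passes
    ddrescue -b 2048 -r4 -v &quot;$dev&quot; &quot;$prefix.iso&quot; &quot;$prefix.map&quot;
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Calling this as &lt;em&gt;image_disc /dev/sr0 disc&lt;/em&gt; would result in a file &lt;em&gt;disc.iso&lt;/em&gt; and, if the fallback was needed, a map file &lt;em&gt;disc.map&lt;/em&gt; that later &lt;em&gt;ddrescue&lt;/em&gt; runs can build upon.&lt;/p&gt;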

&lt;p&gt;However, this is only part of the story. In most cases we will also want to record various types of metadata about the created disc image. A pretty minimal set includes:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;Fixity information: a SHA-512 checksum of the ISO image.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Descriptive metadata: an identifier that is associated with the disc, a description (this may simply be copied from the writing on a disc or its inlay card), and a text annotation for recording anything else about the disc that is noteworthy (e.g. its condition, or the fact that the entered description was based on handwritten text that is not clearly legible).&lt;/p&gt;

    &lt;p&gt;&lt;img src=&quot;/images/2019/03/cd-writing.jpg&quot; alt=&quot;CD-ROM with handwritten text&quot; /&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Event metadata about the image acquisition process: the imaging software (and its associated version) that was used, the options it was invoked with, the status of the imaging process, and the outcome of any quality checks on the generated ISO image.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;
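
&lt;p&gt;Of these, the fixity information is the easiest to produce by hand. As a minimal sketch, assuming the &lt;em&gt;sha512sum&lt;/em&gt; tool from GNU coreutils is available (the function names and the &lt;em&gt;.sha512&lt;/em&gt; sidecar file are arbitrary choices made for this example):&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# Hypothetical helper: write the SHA-512 checksum of a file to a sidecar
# file with the same name plus a .sha512 extension (and print it as well)
record_checksum() {
    sha512sum &quot;$1&quot; | tee &quot;$1.sha512&quot;
}

# Hypothetical helper: check that a file still matches its recorded checksum
# (run this from the same directory that was used for record_checksum)
verify_checksum() {
    sha512sum -c &quot;$1.sha512&quot;
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;For example, &lt;em&gt;record_checksum disc.iso&lt;/em&gt; directly after imaging, and &lt;em&gt;verify_checksum disc.iso&lt;/em&gt; at any later time.&lt;/p&gt;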

&lt;p&gt;Adding these metadata by hand is pretty cumbersome, especially when processing multiple discs. It is also prone to all sorts of errors. So, I wrapped all the imaging, quality checks and metadata generation into a user-friendly piece of software with a graphical interface, similar to the earlier &lt;a href=&quot;https://github.com/KBNLresearch/tapeimgr&quot;&gt;&lt;em&gt;tapeimgr&lt;/em&gt;&lt;/a&gt; tool (which served as a template for &lt;em&gt;omimgr&lt;/em&gt;).&lt;/p&gt;

&lt;h2 id=&quot;omimgr-operation&quot;&gt;Omimgr operation&lt;/h2&gt;

&lt;p&gt;Imaging optical media with &lt;em&gt;omimgr&lt;/em&gt; is simple. On start-up, it shows the following entry form:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/2019/03/omimgr-1.png&quot; alt=&quot;Screenshot of omimgr interface at startup&quot; /&gt;&lt;/p&gt;

&lt;p&gt;The entry fields are largely self-explanatory (but they are all &lt;a href=&quot;https://github.com/KBNLresearch/omimgr&quot;&gt;documented here&lt;/a&gt;). The user can select an output directory, specify a device path that points to the optical drive (by default it uses the internal drive at &lt;em&gt;/dev/sr0&lt;/em&gt;), and a preferred read method (by default &lt;em&gt;omimgr&lt;/em&gt; starts with &lt;em&gt;readom&lt;/em&gt;)&lt;sup id=&quot;fnref:3&quot;&gt;&lt;a href=&quot;#fn:3&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;1&lt;/a&gt;&lt;/sup&gt;. There is also a set of entry fields for basic descriptive metadata, and it is possible to assign an identifier (or generate one automatically). Imaging starts after the user presses the &lt;em&gt;Start&lt;/em&gt; button.&lt;/p&gt;

&lt;h2 id=&quot;if-readom-fails&quot;&gt;If readom fails&lt;/h2&gt;

&lt;p&gt;If the initial attempt to image a disc with &lt;em&gt;readom&lt;/em&gt; resulted in any errors, &lt;em&gt;omimgr&lt;/em&gt; shows the following dialog box:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/2019/03/error-readom.png&quot; alt=&quot;Screenshot of omimgr interface after readom error&quot; /&gt;&lt;/p&gt;

&lt;p&gt;After pressing &lt;em&gt;Yes&lt;/em&gt;, &lt;em&gt;omimgr&lt;/em&gt; tries to image the disc with &lt;em&gt;ddrescue&lt;/em&gt;&lt;sup id=&quot;fnref:1&quot;&gt;&lt;a href=&quot;#fn:1&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;2&lt;/a&gt;&lt;/sup&gt;. If the &lt;em&gt;ddrescue&lt;/em&gt; run resulted in any errors as well, &lt;em&gt;omimgr&lt;/em&gt; offers the possibility to re-run it with different settings. There is no limit on the number of successive &lt;em&gt;ddrescue&lt;/em&gt; runs on a disc, and, importantly, new runs do not overwrite the existing ISO image, but instead improve upon it. This makes it possible to use multiple optical drives on a disc.&lt;/p&gt;

&lt;h2 id=&quot;interrupting-and-resuming&quot;&gt;Interrupting and resuming&lt;/h2&gt;

&lt;p&gt;Since &lt;em&gt;ddrescue&lt;/em&gt; may need a &lt;em&gt;lot&lt;/em&gt; of time to recover data from a faulty disc (12-24 hours is no exception), it is possible to interrupt ongoing imaging processes with the &lt;em&gt;Interrupt&lt;/em&gt; button. Interrupted &lt;em&gt;ddrescue&lt;/em&gt; sessions can be resumed at any later time by selecting the session’s output directory. The &lt;em&gt;Load existing metadata&lt;/em&gt; button will then load any previously entered descriptive metadata.&lt;/p&gt;

&lt;h2 id=&quot;metadata&quot;&gt;Metadata&lt;/h2&gt;

&lt;p&gt;At the end of each session, &lt;em&gt;omimgr&lt;/em&gt; writes a metadata file in &lt;em&gt;JSON&lt;/em&gt; format. Here’s an example:&lt;/p&gt;

&lt;div class=&quot;language-json highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
    &lt;/span&gt;&lt;span class=&quot;nl&quot;&gt;&quot;acquisitionEnd&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;2019-03-22T13:38:51.969934+01:00&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
    &lt;/span&gt;&lt;span class=&quot;nl&quot;&gt;&quot;acquisitionStart&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;2019-03-22T13:37:43.060185+01:00&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
    &lt;/span&gt;&lt;span class=&quot;nl&quot;&gt;&quot;autoRetry&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;kc&quot;&gt;false&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
    &lt;/span&gt;&lt;span class=&quot;nl&quot;&gt;&quot;checksumType&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;SHA-512&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
    &lt;/span&gt;&lt;span class=&quot;nl&quot;&gt;&quot;checksums&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
        &lt;/span&gt;&lt;span class=&quot;nl&quot;&gt;&quot;backupjrc.iso&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;714426e7f965e4f6b33571ae4d60d945928dbee8c06f74225a138eaaa4ea4b2b7442620227e94920a0bc7ac17a6c7096fb310746cfff2c04b5c3e778ae8998ce&quot;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
    &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
    &lt;/span&gt;&lt;span class=&quot;nl&quot;&gt;&quot;description&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;Backup JRC 31-03-2000&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
    &lt;/span&gt;&lt;span class=&quot;nl&quot;&gt;&quot;extension&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;iso&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
    &lt;/span&gt;&lt;span class=&quot;nl&quot;&gt;&quot;identifier&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;e23f9158-4c9e-11e9-bbfc-dc4a3e413173&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
    &lt;/span&gt;&lt;span class=&quot;nl&quot;&gt;&quot;imageTruncated&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;kc&quot;&gt;false&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
    &lt;/span&gt;&lt;span class=&quot;nl&quot;&gt;&quot;interruptedFlag&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;kc&quot;&gt;false&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
    &lt;/span&gt;&lt;span class=&quot;nl&quot;&gt;&quot;isolyzerSuccess&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;kc&quot;&gt;true&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
    &lt;/span&gt;&lt;span class=&quot;nl&quot;&gt;&quot;maxRetries&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;4&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
    &lt;/span&gt;&lt;span class=&quot;nl&quot;&gt;&quot;notes&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;Outer edge of CD shows signs of corrosion&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
    &lt;/span&gt;&lt;span class=&quot;nl&quot;&gt;&quot;omDevice&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;/dev/sr1&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
    &lt;/span&gt;&lt;span class=&quot;nl&quot;&gt;&quot;omimgrVersion&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;0.1.0&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
    &lt;/span&gt;&lt;span class=&quot;nl&quot;&gt;&quot;prefix&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;backupjrc&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
    &lt;/span&gt;&lt;span class=&quot;nl&quot;&gt;&quot;readCommandLine&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;readom retries=4 dev=/dev/sr1 f=/home/johan/test/backupjrc.iso&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
    &lt;/span&gt;&lt;span class=&quot;nl&quot;&gt;&quot;readMethod&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;readom&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
    &lt;/span&gt;&lt;span class=&quot;nl&quot;&gt;&quot;readMethodVersion&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;readom 1.1.11 (Linux)&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
    &lt;/span&gt;&lt;span class=&quot;nl&quot;&gt;&quot;rescueDirectDiscMode&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;kc&quot;&gt;false&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
    &lt;/span&gt;&lt;span class=&quot;nl&quot;&gt;&quot;successFlag&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;kc&quot;&gt;true&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Note that the metadata file contains all of the entered descriptive metadata, a SHA-512 checksum of the ISO image, and a host of event metadata.&lt;/p&gt;
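
&lt;p&gt;Because the checksum is keyed on the name of the image file, it is fairly easy to re-check an image against its metadata file later on. The sketch below is only an illustration: the function name &lt;em&gt;check_image&lt;/em&gt; is made up, and it assumes that both &lt;a href=&quot;https://jqlang.github.io/jq/&quot;&gt;&lt;em&gt;jq&lt;/em&gt;&lt;/a&gt; and &lt;em&gt;sha512sum&lt;/em&gt; are installed, and that the metadata file follows the structure shown above.&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# Hypothetical helper: compare the SHA-512 checksum of an image against the
# value recorded for it in the &quot;checksums&quot; object of a metadata file.
# Returns 0 if both values are identical, and 1 otherwise.
check_image() {
    local image=&quot;$1&quot;
    local metadata=&quot;$2&quot;
    local name expected actual

    # Checksums are keyed on the file name, without the directory part
    name=$(basename &quot;$image&quot;)
    expected=$(jq -r --arg n &quot;$name&quot; &apos;.checksums[$n]&apos; &quot;$metadata&quot;)
    actual=$(sha512sum &quot;$image&quot; | cut -d &apos; &apos; -f 1)

    [ &quot;$expected&quot; = &quot;$actual&quot; ]
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;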

&lt;h2 id=&quot;limitations&quot;&gt;Limitations&lt;/h2&gt;

&lt;p&gt;Currently, &lt;em&gt;omimgr&lt;/em&gt; only works on Linux-based systems. It does not support audio CDs, and can only be used for CD-ROMs and DVDs. &lt;a href=&quot;https://en.wikipedia.org/wiki/Blue_Book_(CD_standard)&quot;&gt;&lt;em&gt;Blue Book / CD-Extra&lt;/em&gt;&lt;/a&gt; discs aren’t supported either. I wouldn’t rule out that these types of discs may eventually be supported in some future version of &lt;em&gt;omimgr&lt;/em&gt;, but for now they don’t have any priority (currently the main use case is the imaging of recordable CD-ROMs and DVDs).&lt;/p&gt;

&lt;p&gt;It is also worth noting that &lt;em&gt;omimgr&lt;/em&gt; only exposes a limited subset of &lt;em&gt;readom&lt;/em&gt;’s and &lt;em&gt;ddrescue&lt;/em&gt;’s functionality to the user. If you are looking for a full graphical front-end to &lt;em&gt;ddrescue&lt;/em&gt; that gives access to all of its options, you should probably check out &lt;a href=&quot;https://launchpad.net/ddrescue-gui&quot;&gt;&lt;em&gt;DDRescue-GUI&lt;/em&gt;&lt;/a&gt; instead&lt;sup id=&quot;fnref:2&quot;&gt;&lt;a href=&quot;#fn:2&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;3&lt;/a&gt;&lt;/sup&gt;. Finally, the current version of &lt;em&gt;omimgr&lt;/em&gt; is an initial release, which so far has had limited testing, so use at your own risk (as always), and feel free to &lt;a href=&quot;https://github.com/KBNLresearch/omimgr/issues&quot;&gt;report any issues&lt;/a&gt; that you may come across.&lt;/p&gt;

&lt;h2 id=&quot;link-to-omimgr&quot;&gt;Link to omimgr&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Omimgr&lt;/em&gt; and its documentation can be found here:&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://github.com/KBNLresearch/omimgr&quot;&gt;&lt;em&gt;omimgr - Simple workflow tool for imaging optical media&lt;/em&gt;&lt;/a&gt;&lt;/p&gt;

&lt;div class=&quot;footnotes&quot; role=&quot;doc-endnotes&quot;&gt;
  &lt;ol&gt;
    &lt;li id=&quot;fn:3&quot;&gt;
      &lt;p&gt;Note that most of the default values shown here can be easily changed by modifying a configuration file. &lt;a href=&quot;#fnref:3&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:1&quot;&gt;
      &lt;p&gt;Checking the &lt;em&gt;Auto-retry&lt;/em&gt; option bypasses this dialog, in which case &lt;em&gt;omimgr&lt;/em&gt; will automatically start &lt;em&gt;ddrescue&lt;/em&gt; without any user intervention. &lt;a href=&quot;#fnref:1&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:2&quot;&gt;
      &lt;p&gt;Incidentally &lt;em&gt;omimgr&lt;/em&gt; contains a few lines of code that I borrowed from &lt;em&gt;DDRescue-GUI&lt;/em&gt;. &lt;a href=&quot;#fnref:2&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
  &lt;/ol&gt;
&lt;/div&gt;
</description>
                <link>https://bitsgalore.org/2019/03/22/a-simple-workflow-tool-for-imaging-optical-media-using-readom-and-ddrescue</link>
                <guid>https://bitsgalore.org/2019/03/22/a-simple-workflow-tool-for-imaging-optical-media-using-readom-and-ddrescue</guid>
                <pubDate>2019-03-22T00:00:00+01:00</pubDate>
        </item>

        <item>
                <title>Roll the tape - recovering '90s data tapes in BitCurator</title>
                <description>&lt;p&gt;When the &lt;a href=&quot;https://www.kb.nl/en/organisation/research-expertise/long-term-usability-of-digital-resources/web-archiving&quot;&gt;KB web archive&lt;/a&gt; was launched in 2007, many sites from the “early” Dutch web had already gone offline. As a result, the time period between (roughly) 1992 and 2000 is seriously under-represented in our web archive. To improve the coverage of web sites from this historically important era, we are now looking into &lt;a href=&quot;https://hart.amsterdam/image/2016/11/28/20160730_redds_tjardadehaan.pdf&quot;&gt;Web Archaeology&lt;/a&gt; tools and methods. Over the last year our web archiving team has reached out to creators of “early” Dutch web sites that are no longer online. It’s not uncommon to find that these creators still have boxes of offline carriers with the original source data of those sites. Using these data, we would (in many cases) be able to reconstruct the sites, similarly to how we &lt;a href=&quot;/2018/04/24/resurrecting-the-first-dutch-web-index-nl-menu-revisited&quot;&gt;reconstructed the first Dutch web index&lt;/a&gt; last year. Once reconstructed, they could then be ingested into our web archive.&lt;/p&gt;

&lt;!-- more --&gt;

&lt;h2 id=&quot;physical-carrier-formats&quot;&gt;Physical carrier formats&lt;/h2&gt;

&lt;p&gt;At this early stage of the web archaeology project, a few site creators have already lent us sample sets of offline carriers. Even though these sets are limited in size, they already contain quite a wide range of physical formats: CD-ROMs, floppy disks, ZIP disks, USB thumb drives, (internal) hard disks, and a variety of tape formats. For reading the data on these carriers, we’ve set up a desktop workstation running the &lt;a href=&quot;https://bitcurator.net/&quot;&gt;&lt;em&gt;BitCurator&lt;/em&gt;&lt;/a&gt; environment.&lt;/p&gt;

&lt;h2 id=&quot;tapes&quot;&gt;Tapes&lt;/h2&gt;

&lt;p&gt;One of the first sample sets we received contains a collection of over 30 data tapes from the mid to late ’90s. Roughly half of these are &lt;a href=&quot;https://en.wikipedia.org/wiki/Digital_Data_Storage&quot;&gt;&lt;em&gt;DDS-1&lt;/em&gt;&lt;/a&gt; tapes, a format based on &lt;a href=&quot;https://en.wikipedia.org/wiki/Digital_Audio_Tape&quot;&gt;&lt;em&gt;Digital Audio Tape&lt;/em&gt;&lt;/a&gt;. The other half are &lt;em&gt;DLT-IV&lt;/em&gt; tapes, a type of &lt;a href=&quot;https://en.wikipedia.org/wiki/Digital_Linear_Tape&quot;&gt;&lt;em&gt;Digital Linear Tape&lt;/em&gt;&lt;/a&gt;. The remainder of this blog post explains how we set up a workflow for reading these tapes. It also highlights some of the particular challenges we encountered along the way, and gives some (hopefully useful) hints and suggestions for others who are interested in setting up a similar workflow.&lt;/p&gt;

&lt;h2 id=&quot;tape-drives&quot;&gt;Tape drives&lt;/h2&gt;

&lt;p&gt;Obviously, to read these vintage tape formats you first need tape drives that support them. Luckily, our IT department turned out to have a working &lt;em&gt;DDS-2&lt;/em&gt; drive (which also reads &lt;em&gt;DDS-1&lt;/em&gt; tapes), as well as a &lt;em&gt;DLT-IV&lt;/em&gt; drive tucked away on a shelf.&lt;/p&gt;

&lt;figure class=&quot;image&quot;&gt;
  &lt;img src=&quot;/images/2019/01/dlt-insert.jpg&quot; alt=&quot;Photograph of DLT-IV drive with tape&quot; /&gt;
  &lt;figcaption&gt;DLT-IV drive with tape&lt;/figcaption&gt;
&lt;/figure&gt;

&lt;h2 id=&quot;scsi-madness&quot;&gt;SCSI madness&lt;/h2&gt;

&lt;p&gt;Since tape drives typically use &lt;a href=&quot;https://en.wikipedia.org/wiki/Parallel_SCSI&quot;&gt;parallel &lt;em&gt;SCSI&lt;/em&gt;&lt;/a&gt; connectors, hooking them up to a modern PC is not straightforward, and requires the installation of a &lt;a href=&quot;https://en.wikipedia.org/wiki/SCSI_host_adapter&quot;&gt;&lt;em&gt;SCSI&lt;/em&gt; host adapter&lt;/a&gt; (AKA &lt;em&gt;SCSI&lt;/em&gt; controller). Although used ones are available cheap online, choosing the right one can be tricky. Many older models are &lt;a href=&quot;https://en.wikipedia.org/wiki/Conventional_PCI&quot;&gt;&lt;em&gt;conventional PCI cards&lt;/em&gt;&lt;/a&gt;, which are not compatible with most modern motherboards (which these days are more likely to have &lt;a href=&quot;https://en.wikipedia.org/wiki/PCI_Express&quot;&gt;&lt;em&gt;PCI Express&lt;/em&gt;&lt;/a&gt; slots). Also, beware that many &lt;em&gt;SCSI&lt;/em&gt; adapters have a 64-bit &lt;em&gt;PCI&lt;/em&gt; interface, which is only compatible with enterprise servers&lt;sup id=&quot;fnref:1&quot;&gt;&lt;a href=&quot;#fn:1&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;1&lt;/a&gt;&lt;/sup&gt;. Since online sellers often don’t mention the interface type, &lt;a href=&quot;https://storage.microsemi.com/en-us/support/scsi/&quot;&gt;this website with Adaptec &lt;em&gt;SCSI&lt;/em&gt; Card Specifications&lt;/a&gt; is a useful resource for checking potential compatibility issues.&lt;/p&gt;

&lt;p&gt;In addition, rather than being one well-defined standard, parallel &lt;em&gt;SCSI&lt;/em&gt; is really a hot mess of different standards that use different interfaces and connector types. Not all of these interfaces and connectors are mutually compatible, and interface mismatches &lt;a href=&quot;https://twitter.com/charles_forsyth/status/1004356758893154305&quot;&gt;can result in actual hardware damage&lt;/a&gt;. The Wikipedia entry on parallel &lt;em&gt;SCSI&lt;/em&gt; &lt;a href=&quot;https://en.wikipedia.org/wiki/Parallel_SCSI#Compatibility&quot;&gt;contains some useful information on compatibility issues&lt;/a&gt;; a more &lt;a href=&quot;http://www.paralan.com/scsiexpert.html&quot;&gt;in-depth discussion can be found here&lt;/a&gt;. The same web site (a treasure trove of all things &lt;em&gt;SCSI&lt;/em&gt;) also has this &lt;a href=&quot;http://www.paralan.com/sediff.html&quot;&gt;illustrated overview of the most common connector types&lt;/a&gt;, which I found immensely helpful for identifying the connector types of our tape drives. Because of the myriad &lt;em&gt;SCSI&lt;/em&gt; connector types, you may also need adapter plugs or cables to connect the tape drive to the &lt;em&gt;SCSI&lt;/em&gt; controller (in our case we used &lt;a href=&quot;https://web.archive.org/web/20181002103944/https://www.ramelectronics.net/sm-044-r.aspx&quot;&gt;this adapter plug&lt;/a&gt; to connect the 68-pin high-density cable of our &lt;em&gt;DDS&lt;/em&gt; drive to the &lt;em&gt;SCSI&lt;/em&gt; controller’s &lt;em&gt;VHDCI&lt;/em&gt; connector). Finding matching cables and adapters can be quite a challenge, not least because multiple names are used for most &lt;em&gt;SCSI&lt;/em&gt; connector types. For instance, the commonly used 68-pin “DB68” connector is also known as “MD68”, “High-Density”, “HD 68”, “Half-Pitch” and “HP68”. Aaargh!&lt;/p&gt;

&lt;p&gt;Another thing to keep in mind is that any unused &lt;em&gt;SCSI&lt;/em&gt; buses on the tape drive must be fitted with a &lt;a href=&quot;https://en.wikipedia.org/wiki/Parallel_SCSI#Termination&quot;&gt;terminator&lt;/a&gt;, so if your drive doesn’t have one already you’ll have to track down a matching type.&lt;/p&gt;

&lt;h2 id=&quot;tapeimgr-software&quot;&gt;Tapeimgr software&lt;/h2&gt;

&lt;p&gt;With the &lt;em&gt;SCSI&lt;/em&gt; controller inserted into our &lt;em&gt;BitCurator&lt;/em&gt; workstation, I hooked up one of the tape drives, and tried to read some test tapes&lt;sup id=&quot;fnref:2&quot;&gt;&lt;a href=&quot;#fn:2&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;2&lt;/a&gt;&lt;/sup&gt;. Since we want to read the tape data in a format-agnostic way that is independent of the software that was originally used to write the tapes, I used Unix &lt;a href=&quot;https://en.wikipedia.org/wiki/Dd_%28Unix%29&quot;&gt;&lt;em&gt;dd&lt;/em&gt;&lt;/a&gt; (and the &lt;a href=&quot;https://linux.die.net/man/1/mt&quot;&gt;&lt;em&gt;mt&lt;/em&gt;&lt;/a&gt; tool to issue tape transport commands). After some experimentation, I was able to write a simple Bash script that sequentially reads all files on the tape. I then rewrote the script into what was to become the &lt;a href=&quot;https://github.com/KBNLresearch/tapeimgr&quot;&gt;&lt;em&gt;tapeimgr&lt;/em&gt;&lt;/a&gt; software. &lt;em&gt;Tapeimgr&lt;/em&gt; (which was loosely inspired by the &lt;a href=&quot;https://guymager.sourceforge.io/&quot;&gt;&lt;em&gt;Guymager&lt;/em&gt;&lt;/a&gt; software) allows one to read data from a tape using a simple and user-friendly graphical interface. Internally, &lt;em&gt;tapeimgr&lt;/em&gt; just wraps around &lt;em&gt;dd&lt;/em&gt; and &lt;em&gt;mt&lt;/em&gt;, but the complexities of these tools are hidden from the user. For a given tape, the software reads all files; before reading a file, it first runs an iterative procedure to establish the block size that was used for writing it. Once the end of a tape is reached, &lt;em&gt;tapeimgr&lt;/em&gt; computes &lt;em&gt;SHA512&lt;/em&gt; checksums of all recovered files, and these are subsequently written to a .json file (alongside some basic descriptive and event metadata). 
A detailed extraction log with the full output of &lt;em&gt;dd&lt;/em&gt; is also written for each tape.&lt;/p&gt;
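
&lt;p&gt;To give an idea of what such an iterative block size procedure can look like, here is a rough sketch. It is not &lt;em&gt;tapeimgr&lt;/em&gt;’s actual code: the function name, the 512-byte step and the 1 MiB upper limit are assumptions made for this example. It further assumes a non-rewinding tape device (e.g. &lt;em&gt;/dev/nst0&lt;/em&gt;), and a tape driver that makes &lt;em&gt;dd&lt;/em&gt; fail for as long as the requested block size is smaller than the block size on the tape.&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# Hypothetical sketch: find a usable block size for the file at the current
# tape position by trial and error. Prints the block size on success.
find_block_size() {
    local dev=&quot;$1&quot;   # non-rewinding tape device, e.g. /dev/nst0
    local bs=512     # initial guess, increased in steps of 512 bytes
    local status

    while [ &quot;$bs&quot; -le 1048576 ]; do
        # Try to read a single block with the current block size
        dd if=&quot;$dev&quot; of=/dev/null bs=&quot;$bs&quot; count=1
        status=$?

        # Move the tape back one record, to where the read attempt started
        mt -f &quot;$dev&quot; bsr 1

        if [ &quot;$status&quot; -eq 0 ]; then
            echo &quot;$bs&quot;
            return 0
        fi
        bs=$((bs + 512))
    done

    # No usable block size found below the upper limit
    return 1
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The value that is printed could then be passed on to the &lt;em&gt;dd&lt;/em&gt; call that reads the actual file, e.g. &lt;em&gt;dd if=/dev/nst0 of=file0001.dd bs=$(find_block_size /dev/nst0)&lt;/em&gt; (where &lt;em&gt;file0001.dd&lt;/em&gt; is just an example file name).&lt;/p&gt;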

&lt;figure class=&quot;image&quot;&gt;
  &lt;img src=&quot;/images/2019/01/tapeimgr-2.png&quot; alt=&quot;Screenshot of Tapeimgr interface&quot; /&gt;
  &lt;figcaption&gt;Tapeimgr interface&lt;/figcaption&gt;
&lt;/figure&gt;

&lt;h2 id=&quot;post-processing&quot;&gt;Post-processing&lt;/h2&gt;

&lt;p&gt;Since &lt;em&gt;tapeimgr&lt;/em&gt; is format-agnostic, it’s up to the user to figure out how to further process the recovered files. Identifying the file format is the first step, and this can be done using the usual suspects such as Unix &lt;a href=&quot;https://linux.die.net/man/1/file&quot;&gt;&lt;em&gt;file(1)&lt;/em&gt;&lt;/a&gt; and &lt;a href=&quot;https://www.itforarchivists.com/siegfried/&quot;&gt;&lt;em&gt;Siegfried&lt;/em&gt;&lt;/a&gt; (both accessible through right-click context menus in &lt;em&gt;BitCurator&lt;/em&gt;). Once the format is known, format-specific tools (e.g. &lt;a href=&quot;https://linux.die.net/man/1/tar&quot;&gt;&lt;em&gt;tar&lt;/em&gt;&lt;/a&gt;) can be used to extract the files’ contents.&lt;/p&gt;
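
&lt;p&gt;From the command line, these two steps could be combined along the following lines. This is a minimal sketch: the function name &lt;em&gt;list_recovered&lt;/em&gt; is made up for this example, and it relies on the assumption that &lt;em&gt;file(1)&lt;/em&gt; describes &lt;em&gt;tar&lt;/em&gt; archives with a string that contains “tar archive”.&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# Hypothetical helper: report the format of each recovered file, and list
# the contents of those files that file(1) identifies as tar archives
list_recovered() {
    local f
    for f in &quot;$@&quot;; do
        # Print the format as identified by file(1)
        file &quot;$f&quot;

        # List (but do not extract) the contents of tar archives
        if file -b &quot;$f&quot; | grep -q &quot;tar archive&quot;; then
            tar -tvf &quot;$f&quot;
        fi
    done
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Listing the contents first (rather than extracting them right away) makes it possible to check for things like absolute paths before anything is written to disk.&lt;/p&gt;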

&lt;h2 id=&quot;recovering-the-first-sample-set&quot;&gt;Recovering the first sample set&lt;/h2&gt;

&lt;p&gt;After we were confident that our tape processing workflow worked correctly, we used it to process the sample set of &lt;em&gt;DDS&lt;/em&gt; and &lt;em&gt;DLT-IV&lt;/em&gt; tapes that were lent to us. The majority of the 19 &lt;em&gt;DDS&lt;/em&gt; tapes in the sample set could be read without problems. Only 3 tapes resulted in any issues. Two tapes could not be read at all; both of them turned out to be &lt;em&gt;DDS-3&lt;/em&gt; tapes, which are not supported by the &lt;em&gt;DDS-2&lt;/em&gt; tape drive we used. A &lt;em&gt;DDS-3&lt;/em&gt; or &lt;em&gt;DDS-4&lt;/em&gt; drive should be able to read these tapes. For one other tape the extraction resulted in a 10-kB file with only null bytes, which most likely means the tape is faulty. Of the 14 &lt;em&gt;DLT-IV&lt;/em&gt; tapes, 7 could be read without problems. For the remaining 7, the reading procedure only resulted in a zero-length file. Interestingly, a common characteristic of all “failed” tapes is that they were written at 40.0 GB capacity, whereas the other tapes were written at 35.0 GB capacity. This is odd, as our &lt;em&gt;DLT-IV&lt;/em&gt; drive &lt;em&gt;does&lt;/em&gt; support 40.0 GB capacity tapes (which was confirmed by writing some data to a blank test tape at 40.0 GB capacity, which could subsequently be read without problems). This needs some further investigation.&lt;/p&gt;

&lt;h2 id=&quot;next-steps&quot;&gt;Next steps&lt;/h2&gt;

&lt;p&gt;The next step (which we haven’t started yet) is to extract the contents of the recovered files. A cursory look suggests that most recovered files in our sample set use the &lt;a href=&quot;http://fileformats.archiveteam.org/wiki/Unix_dump&quot;&gt;Unix &lt;em&gt;dump&lt;/em&gt; format&lt;/a&gt;, which can be opened and extracted relatively easily. There are also some &lt;a href=&quot;https://en.wikipedia.org/wiki/Tar_(computing)&quot;&gt;&lt;em&gt;tar&lt;/em&gt;&lt;/a&gt; archives, which are even easier to extract. Once that is done, the real job of reconstructing the web sites that they contain can start, but that will be a different story altogether.&lt;/p&gt;

&lt;h2 id=&quot;workflow-descriptions-and-other-resources&quot;&gt;Workflow descriptions and other resources&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;The workflow descriptions for &lt;a href=&quot;https://github.com/KBNLresearch/forensicImagingResources/blob/master/doc/tape-dds.md&quot;&gt;&lt;em&gt;DDS-1&lt;/em&gt;&lt;/a&gt; and &lt;a href=&quot;https://github.com/KBNLresearch/forensicImagingResources/blob/master/doc/tape-dlt.md&quot;&gt;&lt;em&gt;DLT-IV&lt;/em&gt;&lt;/a&gt; tapes are available on GitHub. They also include a detailed description of all hardware components we used, including links to original documentation (if available).&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;The &lt;a href=&quot;https://github.com/KBNLresearch/tapeimgr&quot;&gt;&lt;em&gt;tapeimgr&lt;/em&gt; software is available here&lt;/a&gt;.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Finally, we’re maintaining a categorised &lt;a href=&quot;https://github.com/KBNLresearch/forensicImagingResources/blob/master/doc/df-resources.md&quot;&gt;Digital forensics and web archaeology resources list&lt;/a&gt;, which contains links to many additional tape-related resources.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;acknowledgements&quot;&gt;Acknowledgements&lt;/h2&gt;

&lt;p&gt;Thanks are due to Peter Boel and René van Egdom for their help digging out the tape drives and other obscure hardware peripherals, and Willem Jan Faber for various helpful hardware-related suggestions.&lt;/p&gt;

&lt;hr /&gt;

&lt;p&gt;Originally published at the &lt;a href=&quot;https://openpreservation.org/blog/2019/01/31/roll-the-tape-recovering-90s-data-tapes-in-bitcurator/&quot;&gt;Open Preservation Foundation blog&lt;/a&gt;&lt;/p&gt;
&lt;div class=&quot;footnotes&quot; role=&quot;doc-endnotes&quot;&gt;
  &lt;ol&gt;
    &lt;li id=&quot;fn:1&quot;&gt;
      &lt;p&gt;Also see &lt;a href=&quot;https://upload.wikimedia.org/wikipedia/commons/1/15/PCI_Keying.svg&quot;&gt;this useful diagram&lt;/a&gt; that shows different PCI card types. &lt;a href=&quot;#fnref:1&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:2&quot;&gt;
      &lt;p&gt;Importantly, for the testing phase we only used some unimportant tapes that we still happened to have lying around. This was done to minimise any chance of accidental damage to the tapes that were lent to us (we did not know in advance whether the tape drives were still working correctly!). &lt;a href=&quot;#fnref:2&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
  &lt;/ol&gt;
&lt;/div&gt;
</description>
                <link>https://bitsgalore.org/2019/01/31/roll-the-tape-recovering-90s-data-tapes-in-bitcurator</link>
                <guid>https://bitsgalore.org/2019/01/31/roll-the-tape-recovering-90s-data-tapes-in-bitcurator</guid>
                <pubDate>2019-01-31T00:00:00+01:00</pubDate>
        </item>

        <item>
                <title>Crawling offline web content&#58; the NL-menu case</title>
                <description>&lt;p&gt;In a &lt;a href=&quot;/2018/04/24/resurrecting-the-first-dutch-web-index-nl-menu-revisited&quot;&gt;previous blog post&lt;/a&gt; I showed how we resurrected &lt;em&gt;NL-menu&lt;/em&gt;, the first Dutch web index. It explains how we recovered the site’s data from an old CD-ROM, and how we subsequently created a local copy of the site by &lt;a href=&quot;https://github.com/KBNLresearch/nl-menu-resources/blob/master/doc/serving-static-website-with-Apache.md&quot;&gt;serving the CD-ROM’s contents on the &lt;em&gt;Apache&lt;/em&gt; web server&lt;/a&gt;. This follow-up post covers the final step: crawling the resurrected site to a &lt;a href=&quot;https://en.wikipedia.org/wiki/Web_ARChive&quot;&gt;WARC&lt;/a&gt; file that can be ingested into our web archive.&lt;/p&gt;

&lt;!-- more --&gt;

&lt;h2 id=&quot;previous-work&quot;&gt;Previous work&lt;/h2&gt;

&lt;p&gt;A &lt;a href=&quot;https://kia.pleio.nl/file/download/55806143/Report%20on%20web-archiving%20in%20the%20Dutch%20National%20Archives.pdf&quot;&gt;2016 report by Jeroen van Luin&lt;/a&gt; documents the web archiving workflows that are used by the National Archives of the Netherlands. Interestingly, it also covers the archiving of websites that are no longer online from local copies. Since this is similar to our &lt;em&gt;NL-menu&lt;/em&gt; case, I took this as a starting point. The report mentions two general workflows for crawling from localhost:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;One based on the &lt;a href=&quot;https://github.com/internetarchive/heritrix3&quot;&gt;&lt;em&gt;Heritrix&lt;/em&gt;&lt;/a&gt; crawler.&lt;/li&gt;
  &lt;li&gt;Another one that uses the &lt;a href=&quot;https://www.gnu.org/software/wget/&quot;&gt;&lt;em&gt;wget&lt;/em&gt;&lt;/a&gt; tool.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2 id=&quot;preparation-change-machine-system-date-disable-network-connection&quot;&gt;Preparation: change machine system date, disable network connection&lt;/h2&gt;

&lt;p&gt;In most cases it is desirable for the site snapshot to appear in the Wayback timeline around the date it was actually online. This can be achieved by setting the computer’s system date to that date before running the crawl. In the case of &lt;em&gt;NL-menu&lt;/em&gt; I used the “last modified” time stamp of the files on the CD-ROM filesystem as an approximation. On a Linux-based system the following command will do the trick:&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nb&quot;&gt;sudo date&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;--set&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;2004-01-23 21:03:09.000&quot;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Also, to completely rule out anything from the “live” web leaking into the crawl, it may be prudent to disable the network connection at this point (both wired and wireless connections).&lt;/p&gt;

&lt;h2 id=&quot;heritrix&quot;&gt;Heritrix&lt;/h2&gt;

&lt;p&gt;I started out with some limited tests with &lt;em&gt;Heritrix&lt;/em&gt; 3, but this resulted in several problems. Most importantly, &lt;em&gt;Heritrix&lt;/em&gt; appeared to ignore the &lt;em&gt;hosts&lt;/em&gt; file on my machine (this file maps the domain &lt;em&gt;www.nl-menu.nl&lt;/em&gt; to the &lt;em&gt;localhost&lt;/em&gt; IP address). As a result, with the network disabled the crawl job would run indefinitely without ever downloading any data. After enabling the network, &lt;em&gt;Heritrix&lt;/em&gt; would crawl &lt;a href=&quot;http://www.nl-menu.nl/&quot;&gt;the “live” site at &lt;em&gt;www.nl-menu.nl&lt;/em&gt;&lt;/a&gt; instead of the locally resurrected version&lt;sup id=&quot;fnref:1&quot;&gt;&lt;a href=&quot;#fn:1&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;1&lt;/a&gt;&lt;/sup&gt;. Because of this, I quickly gave up on &lt;em&gt;Heritrix&lt;/em&gt;, and moved on to &lt;em&gt;wget&lt;/em&gt;.&lt;/p&gt;
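
&lt;p&gt;The &lt;em&gt;hosts&lt;/em&gt; file itself is not shown in this post. As a rough sketch, on a typical Linux system the mapping described above would be a single line in &lt;em&gt;/etc/hosts&lt;/em&gt; along these lines (the exact whitespace is an assumption; the address is the usual IPv4 loopback address):&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;127.0.0.1    www.nl-menu.nl
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Tools that use the system resolver (&lt;em&gt;wget&lt;/em&gt; among them) then reach the local web server when asked for &lt;em&gt;www.nl-menu.nl&lt;/em&gt;.&lt;/p&gt;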

&lt;h2 id=&quot;wget-first-attempt&quot;&gt;wget: first attempt&lt;/h2&gt;

&lt;p&gt;After some experimentation, the following set of &lt;em&gt;wget&lt;/em&gt; options appeared to work reasonably well&lt;sup id=&quot;fnref:2&quot;&gt;&lt;a href=&quot;#fn:2&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;2&lt;/a&gt;&lt;/sup&gt;:&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;wget &lt;span class=&quot;nt&quot;&gt;--mirror&lt;/span&gt; &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;--page-requisites&lt;/span&gt; &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;--warc-file&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;nl-menu&quot;&lt;/span&gt; &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;--warc-cdx&lt;/span&gt; &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;--output-file&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;nl-menu.log&quot;&lt;/span&gt; &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    http://www.nl-menu.nl/
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The above command results in a 200 MB compressed WARC file, a mirror of the crawled directory tree, a CDX index file (not really necessary, but useful for quality checks), and a log file.&lt;/p&gt;

&lt;h2 id=&quot;rendering-the-warc&quot;&gt;Rendering the WARC&lt;/h2&gt;

&lt;p&gt;In order to test the rendering of the WARC locally, I installed &lt;a href=&quot;https://github.com/webrecorder/pywb&quot;&gt;&lt;em&gt;pywb&lt;/em&gt;&lt;/a&gt; (which is part of the Webrecorder project). I then created a test archive using:&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;wb-manager init my-web-archive
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;and then added my newly created WARC using:&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;wb-manager add my-web-archive ~/NL-menu/warc-wget/nl-menu.warc.gz
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;I then started the server with the command:&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;wayback
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;After this the archive is available from &lt;a href=&quot;http://localhost:8080/my-web-archive/&quot;&gt;http://localhost:8080/my-web-archive/&lt;/a&gt;. Below is a screenshot of one page:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/2018/07/NL-menu-pywb.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Note that the &lt;em&gt;Archived&lt;/em&gt; date as shown by &lt;em&gt;pywb&lt;/em&gt; corresponds to our modified system date (i.e. 23 January 2004).&lt;/p&gt;

&lt;h2 id=&quot;completeness-checks&quot;&gt;Completeness checks&lt;/h2&gt;

&lt;p&gt;An analysis of the completeness of the &lt;em&gt;wget&lt;/em&gt; capture&lt;sup id=&quot;fnref:4&quot;&gt;&lt;a href=&quot;#fn:4&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;3&lt;/a&gt;&lt;/sup&gt; revealed that over 660 files from the &lt;em&gt;NL-menu&lt;/em&gt; source directory tree were missing in the crawl. Most (90%) of these were missing because they are simply not referenced (by way of a hyperlink) by any of the website resources that are crawled from the site root. Of the remaining 64 missing files, 51 are referenced through JavaScript variables (which are understandably not picked up by wget’s crawl mechanism). Other, less common reasons were:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;A file is only referenced through a &lt;em&gt;value&lt;/em&gt; attribute of an &lt;em&gt;input&lt;/em&gt; element.&lt;/li&gt;
  &lt;li&gt;A file is only referenced through a &lt;em&gt;src&lt;/em&gt; attribute of a &lt;em&gt;frame&lt;/em&gt; element.&lt;/li&gt;
&lt;/ul&gt;
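
&lt;p&gt;The analysis itself is documented in the working notes mentioned in the footnote. Purely as an illustration (this is not necessarily how the numbers above were obtained), a comparison of this kind can be sketched in a few lines of shell. The sketch assumes a text file with one expected URL per line, and it assumes that the first space-separated field of each line in the CDX file written by &lt;em&gt;wget&lt;/em&gt; holds the crawled URL, with a single header line on top. The function name &lt;em&gt;missing_urls&lt;/em&gt; is made up for this example:&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# List URLs that are expected but do not occur in the CDX index.
# Usage: missing_urls EXPECTED_URLS CDX_FILE
missing_urls() {
    tmpdir=$(mktemp -d)
    # Expected URLs, sorted and de-duplicated
    sort -u -o "$tmpdir/expected" "$1"
    # Crawled URLs: skip the CDX header line, keep the first field
    tail -n +2 "$2" | cut -d ' ' -f 1 | sort -u -o "$tmpdir/crawled"
    # Lines that only occur in the first list
    comm -23 "$tmpdir/expected" "$tmpdir/crawled"
    rm -r "$tmpdir"
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Each line of output is then a URL that never made it into the crawl.&lt;/p&gt;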

&lt;p&gt;This raises the obvious question: can we force &lt;em&gt;wget&lt;/em&gt; to crawl &lt;em&gt;all&lt;/em&gt; files in the source directory? Of course we can!&lt;/p&gt;

&lt;h2 id=&quot;wget-use-input-file-switch-with-url-list&quot;&gt;wget: use --input-file switch with URL list&lt;/h2&gt;

&lt;p&gt;The solution here is to use &lt;em&gt;wget&lt;/em&gt;’s &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;--input-file&lt;/code&gt; switch, which takes a file with a list of URLs that are crawled sequentially. As a first step we need to create a listing of all files in the source directory of the website, and then transform each file entry into a corresponding URL. I did this using the following command:&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;find /var/www/www.nl-menu.nl &lt;span class=&quot;nt&quot;&gt;-type&lt;/span&gt; f &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    | &lt;span class=&quot;nb&quot;&gt;sed&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-e&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;&apos;s/\/var\/www\//http:\/\//g&apos;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; urls.txt
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;I then ran &lt;em&gt;wget&lt;/em&gt;&lt;sup id=&quot;fnref:5&quot;&gt;&lt;a href=&quot;#fn:5&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;4&lt;/a&gt;&lt;/sup&gt;:&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;wget &lt;span class=&quot;nt&quot;&gt;--page-requisites&lt;/span&gt; &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;--warc-file&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;nl-menu&quot;&lt;/span&gt; &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;--warc-cdx&lt;/span&gt; &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;--output-file&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;nl-menu.log&quot;&lt;/span&gt; &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;--input-file&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;urls.txt
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This results in a WARC file that contains &lt;em&gt;all&lt;/em&gt; files from the source directory. But it does introduce a different problem: when the WARC is accessed using &lt;em&gt;pywb&lt;/em&gt;, it shows up as over 80 thousand individual captures (i.e. each file appears to be treated as an individual capture)! This makes rendering of the WARC near impossible (if only because loading the list of captures is extremely slow to begin with).&lt;/p&gt;

&lt;p&gt;I got in touch with &lt;em&gt;pywb&lt;/em&gt; author Ilya Kreymer, who pointed out that &lt;em&gt;pywb&lt;/em&gt;’s behaviour here is the combined effect of a bug in &lt;em&gt;pywb&lt;/em&gt; and a peculiarity of the &lt;em&gt;NL-menu&lt;/em&gt; directory tree: unlike most websites, it does not contain an index document (e.g. &lt;em&gt;index.html&lt;/em&gt;) at the domain root level. Instead, the domain root contains 2 directories which hold the Dutch and English-language versions of the site, respectively:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;/var/www/www.nl-menu.nl/
├── nlmenu.en
│   ├── index.html
│   ├── ...
│   ├── ...
│   └── ...
└── nlmenu.nl
    ├── index.html
    ├── ...
    ├── ...
    └── ...
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The result of this is that the input URL list (which is derived from files in the directory tree) does not contain the site’s root URL (&lt;em&gt;http://www.nl-menu.nl&lt;/em&gt;), which in turn ends up missing in the WARC. Ultimately this leads &lt;em&gt;pywb&lt;/em&gt; to do a prefix query which in this case results in 80 thousand URLs!&lt;/p&gt;

&lt;h2 id=&quot;improved-url-list&quot;&gt;Improved URL list&lt;/h2&gt;

&lt;p&gt;Ilya suggested avoiding this problem by explicitly adding the domain root to the URL list:&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nb&quot;&gt;echo&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;http://www.nl-menu.nl/&quot;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; urls.txt
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Incidentally we also need to add entries for the root directories of the Dutch and English language sub-sites:&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nb&quot;&gt;echo&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;http://www.nl-menu.nl/nlmenu.nl/&quot;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;&amp;gt;&lt;/span&gt; urls.txt
&lt;span class=&quot;nb&quot;&gt;echo&lt;/span&gt; &lt;span class=&quot;s2&quot;&gt;&quot;http://www.nl-menu.nl/nlmenu.en/&quot;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;&amp;gt;&lt;/span&gt; urls.txt
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Then we can add the remaining files (and rewrite file paths as URLs) as before:&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;find /var/www/www.nl-menu.nl &lt;span class=&quot;nt&quot;&gt;-type&lt;/span&gt; f &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    | &lt;span class=&quot;nb&quot;&gt;sed&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-e&lt;/span&gt; &lt;span class=&quot;s1&quot;&gt;&apos;s/\/var\/www\//http:\/\//g&apos;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;&amp;gt;&lt;/span&gt; urls.txt
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
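
&lt;p&gt;Because the order of these steps matters (the first &lt;em&gt;echo&lt;/em&gt; starts a fresh file, and everything after it must append), it can be convenient to wrap them in one function. The following is a minimal sketch of that idea, not the exact procedure used here: the function name and its arguments are made up, it adds &lt;em&gt;every&lt;/em&gt; top-level directory rather than only the two sub-sites, and it assumes that file names need no URL escaping:&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# Write a URL list for wget's --input-file switch and print the number of URLs.
# Usage: build_url_list DOCROOT HOST OUTFILE
#   DOCROOT: parent directory of the site directory, e.g. /var/www
#   HOST:    host name, which is also the name of the site directory
build_url_list() {
    docroot=$1
    host=$2
    outfile=$3
    {
        # Domain root first; no file in the tree corresponds to it
        echo "http://$host/"
        # Top-level directories, written with a trailing slash
        find "$docroot/$host" -mindepth 1 -maxdepth 1 -type d \
            | sed -e "s|^$docroot/|http://|" -e 's|$|/|'
        # All regular files, with the file path rewritten as a URL
        find "$docroot/$host" -type f \
            | sed -e "s|^$docroot/|http://|"
    } | tee "$outfile" | wc -l
}

# Example call with the paths used in this post:
# build_url_list /var/www www.nl-menu.nl urls.txt
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The number that is printed can be compared against the file count of the source directory as a quick sanity check before starting the crawl.&lt;/p&gt;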

&lt;p&gt;Finally run &lt;em&gt;wget&lt;/em&gt;:&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;wget &lt;span class=&quot;nt&quot;&gt;--page-requisites&lt;/span&gt; &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;--warc-file&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;nl-menu&quot;&lt;/span&gt; &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;--warc-cdx&lt;/span&gt; &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;--output-file&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;nl-menu.log&quot;&lt;/span&gt; &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;--input-file&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;urls.txt
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This results in a WARC that is complete &lt;em&gt;and&lt;/em&gt; renders in &lt;em&gt;pywb&lt;/em&gt; as well. On a side note, although rendering and navigating the site works well overall, there are still some small issues. For one thing, coming from the Dutch or English-language home page, clicking on the link to the site’s &lt;em&gt;registration&lt;/em&gt; page results in a &lt;em&gt;Url Not Found&lt;/em&gt; error, even though this page loads without problems coming from any other page on the site. I don’t really understand why this happens, although I suspect a combination of JavaScript and early-2000s &lt;a href=&quot;https://en.wikipedia.org/wiki/Framing_(World_Wide_Web)&quot;&gt;frames&lt;/a&gt; madness may be to blame here.&lt;/p&gt;

&lt;h2 id=&quot;authenticity-and-provenance&quot;&gt;Authenticity and provenance&lt;/h2&gt;

&lt;p&gt;My &lt;a href=&quot;http://openpreservation.org/blog/2018/04/24/resurrecting-the-first-dutch-web-index-nl-menu-revisited/&quot;&gt;previous blog post&lt;/a&gt; contained the following observation on authenticity and provenance:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;For one thing, we need to record metadata that makes it absolutely clear that our archival snapshot was taken from a locally reconstructed copy, rather than the original site.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;So the more specific question is: does the WARC that we just created contain any metadata about the provenance of this snapshot? To find out, let’s use the &lt;em&gt;warcdump&lt;/em&gt; tool that is part of the &lt;a href=&quot;https://github.com/internetarchive/warctools&quot;&gt;&lt;em&gt;warctools&lt;/em&gt;&lt;/a&gt; toolkit:&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;warcdump nl-menu.warc.gz &lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; nl-menu-dump.txt
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This results in a (huge) text file with metadata about all archived resources inside the WARC. Here is an example of one (request) record:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;archive record at nl-menu.warc.gz:75708014
Headers:
    WARC-Type:request
    WARC-Target-URI:&amp;lt;http://www.nl-menu.nl/nlmenu.en/index.html&amp;gt;
    Content-Type:application/http;msgtype=request
    WARC-Date:2004-01-23T20:04:54Z
    WARC-Record-ID:&amp;lt;urn:uuid:4b05df6b-d408-4f2f-8efb-bbd38098cbdb&amp;gt;
    WARC-IP-Address:127.0.0.1
    WARC-Warcinfo-ID:&amp;lt;urn:uuid:7dacf508-ab81-4244-ae9c-ce04a1e18123&amp;gt;
    WARC-Block-Digest:sha1:MJEIKQUIEOMARJPUEYXZFT4TTTUKX2IE
    Content-Length:159
Content Headers:
    Content-Type : application/http;msgtype=request
    Content-Length : 159
Content:
    GET /nlmenu\x2Een/index\x2Ehtml HTTP/1\x2E1\xD\xAUser\x2DAgent\x3A Wget/1\x2E19 \x28linux\x2Dgnu\x29\xD\xAAccept\x3A \x2A/\x2A\xD\xAAccept\x2DEncoding\x3A identity\xD\xAHost\x3A www\x2Enl\x2Dmenu\x2Enl\xD\xAConnection\x3A Keep\x2DAlive\xD\xA\xD\xA
    ...
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Note this line:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;WARC-IP-Address:127.0.0.1
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The field &lt;em&gt;WARC-IP-Address&lt;/em&gt; is defined in the &lt;a href=&quot;https://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.1/#warc-ip-address&quot;&gt;WARC specification&lt;/a&gt; as:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;The WARC-IP-Address is the numeric Internet address contacted to retrieve any included content. An IPv4 address shall be written as a “dotted quad”; an IPv6 address shall be written as specified in [RFC4291]. For a HTTP retrieval, this will be the IP address used at retrieval time corresponding to the hostname in the record’s target-Uri.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In this case, the value 127.0.0.1 (=localhost) unambiguously shows that this resource was crawled from a local copy, and not from the live web. So the required provenance metadata does indeed exist&lt;sup id=&quot;fnref:3&quot;&gt;&lt;a href=&quot;#fn:3&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;5&lt;/a&gt;&lt;/sup&gt;.&lt;/p&gt;
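
&lt;p&gt;To see whether this holds for the WARC as a whole rather than for one record, the values of this field can be tallied. The function below is a rough sketch and not part of the workflow described here; its name is made up, and it simply assumes that every line starting with &lt;em&gt;WARC-IP-Address:&lt;/em&gt; is a WARC header line:&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# Count how often each WARC-IP-Address value occurs in a compressed WARC.
# Usage: warc_ip_summary WARC_FILE
warc_ip_summary() {
    # -a makes grep treat the (partly binary) WARC as text;
    # tr strips the carriage returns that end WARC header lines
    gzip -dc "$1" | grep -a '^WARC-IP-Address:' | tr -d '\r' | sort | uniq -c
}

# Example call:
# warc_ip_summary nl-menu.warc.gz
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;For a crawl that was run entirely against the local copy, one would expect a single line of output, for 127.0.0.1.&lt;/p&gt;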

&lt;h2 id=&quot;conclusion&quot;&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;The above steps complete the recovery and archiving of the &lt;em&gt;NL-menu&lt;/em&gt; website. In the coming months the KB will start some more “web archaeology” activities, and it will be interesting to see to what extent these &lt;em&gt;wget&lt;/em&gt; “recipes” will work for other offline web content. In any case they may be a useful starting point.&lt;/p&gt;

&lt;h2 id=&quot;acknowledgements&quot;&gt;Acknowledgements&lt;/h2&gt;

&lt;p&gt;Thanks are due to Ilya Kreymer, Raffaele Messuti and Andy Jackson for their helpful suggestions on &lt;em&gt;wget&lt;/em&gt; and &lt;em&gt;pywb&lt;/em&gt;, and René Voorburg for his suggestions on improving and quality-checking the crawl.&lt;/p&gt;

&lt;h2 id=&quot;additional-resources&quot;&gt;Additional resources&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;Jeroen van Luin, &lt;a href=&quot;https://kia.pleio.nl/file/download/55806143/Report%20on%20web-archiving%20in%20the%20Dutch%20National%20Archives.pdf&quot;&gt;Experiences with web archiving in the Dutch National Archives&lt;/a&gt;. Report, National Archives of the Netherlands&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://github.com/KBNLresearch/nl-menu-resources/blob/master/doc/qa-archived-site.md&quot;&gt;Rough working notes on quality/completeness checks&lt;/a&gt; - unedited notes on some of the tests I did to assess the crawl quality (mostly in terms of completeness)&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://github.com/webrecorder/pywb&quot;&gt;pywb&lt;/a&gt;, Core Python Web Archiving Toolkit for replay and recording of web archives&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://github.com/internetarchive/warctools&quot;&gt;warctools&lt;/a&gt;&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;hr /&gt;

&lt;p&gt;Originally published at the &lt;a href=&quot;https://openpreservation.org/blog/2018/07/11/crawling-offline-web-content-the-nl-menu-case/&quot;&gt;Open Preservation Foundation blog&lt;/a&gt;&lt;/p&gt;
&lt;div class=&quot;footnotes&quot; role=&quot;doc-endnotes&quot;&gt;
  &lt;ol&gt;
    &lt;li id=&quot;fn:1&quot;&gt;
      &lt;p&gt;A site still exists at this domain, even though its contents are now largely unrelated to the original &lt;em&gt;NL-menu&lt;/em&gt; &lt;a href=&quot;#fnref:1&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:2&quot;&gt;
      &lt;p&gt;Compared with the examples in the NA documentation, this leaves out the &lt;em&gt;-w&lt;/em&gt; (wait) switch (since we are crawling from a local machine it is safe to crawl at maximum speed), the &lt;em&gt;-k&lt;/em&gt; (convert links) switch, and the &lt;em&gt;-E&lt;/em&gt; (adjust extension) switch. It also adds the &lt;em&gt;--warc-cdx&lt;/em&gt; switch (which writes an index file) and the &lt;em&gt;--output-file&lt;/em&gt; switch (which writes a log). &lt;a href=&quot;#fnref:2&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:4&quot;&gt;
      &lt;p&gt;A detailed discussion of the methods that were used for this analysis is beyond the scope of this blog post; however some &lt;a href=&quot;https://github.com/KBNLresearch/nl-menu-resources/blob/master/doc/qa-archived-site.md&quot;&gt;rough working notes are available here&lt;/a&gt;. &lt;a href=&quot;#fnref:4&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:5&quot;&gt;
      &lt;p&gt;Note that I removed the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;--mirror&lt;/code&gt; option here, as it seems that this causes &lt;em&gt;wget&lt;/em&gt; to do a recursive crawl &lt;em&gt;for each single URL&lt;/em&gt; in the list. The result of this is that &lt;em&gt;wget&lt;/em&gt; keeps crawling for hours without any (new) data being added to the crawl result. &lt;a href=&quot;#fnref:5&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:3&quot;&gt;
      &lt;p&gt;How this particular type of provenance metadata is exposed to a user of the web archive is a different matter &lt;a href=&quot;#fnref:3&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
  &lt;/ol&gt;
&lt;/div&gt;
</description>
                <link>https://bitsgalore.org/2018/07/11/crawling-offline-web-content-the-nl-menu-case</link>
                <guid>https://bitsgalore.org/2018/07/11/crawling-offline-web-content-the-nl-menu-case</guid>
                <pubDate>2018-07-11T00:00:00+02:00</pubDate>
        </item>

        <item>
                <title>Resurrecting the first Dutch web index&#58; NL-menu revisited</title>
                <description>&lt;p&gt;&lt;em&gt;NL-menu&lt;/em&gt; was the first Dutch web index. The site was originally founded by a consortium of &lt;a href=&quot;https://en.wikipedia.org/wiki/SURFnet&quot;&gt;SURFnet&lt;/a&gt;, Dutch universities and the  KB. From the mid-nineties onwards it was maintained solely by the KB. &lt;em&gt;NL-menu&lt;/em&gt; was &lt;a href=&quot;https://www.robcoers.nl/nl-menu-is-straks-niet-meer-leve-nl-menu/&quot;&gt;discontinued in 2004&lt;/a&gt;, after which the site was taken offline. In 2006 the domain name was sold to a private company that used it for hosting a web index that was partially based on the original &lt;em&gt;NL-menu&lt;/em&gt; site.&lt;/p&gt;

&lt;!-- more --&gt;

&lt;p&gt;Meanwhile, the original &lt;em&gt;NL-menu&lt;/em&gt; has been largely lost to the mists of time. Even though the Internet Archive’s Wayback Machine contains &lt;a href=&quot;https://web.archive.org/web/*/www.nl-menu.nl&quot;&gt;rather a lot of snapshots of the site&lt;/a&gt;, these are incomplete, and don’t capture the original look and feel. For example, &lt;a href=&quot;https://web.archive.org/web/20020603232609/http://www.nl-menu.nl:80/nlmenu.nl/fset/gz.html&quot;&gt;this page&lt;/a&gt; is a snapshot from June 2002:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/2018/04/wayback1.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;However, this doesn’t look even remotely like the site as it was in 2002. Just for one thing, at the top-left we see a &lt;a href=&quot;https://en.wikipedia.org/wiki/Bing_(search_engine)&quot;&gt;&lt;em&gt;Bing&lt;/em&gt;&lt;/a&gt; search box, but &lt;em&gt;Bing&lt;/em&gt; didn’t even exist until 2009! An inspection of the crawl time stamps (these can be seen by clicking on the top-right &lt;em&gt;About this capture&lt;/em&gt; button) reveals that this “snapshot” is really an amalgam of elements that were crawled at wildly varying dates, some as recently as 2018:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/2018/04/wayback-timestamps.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;NL-menu&lt;/em&gt; is not part of the &lt;a href=&quot;https://www.kb.nl/en/organisation/research-expertise/long-term-usability-of-digital-resources/web-archiving&quot;&gt;KB Web Archive&lt;/a&gt;, as the KB only started its web archiving activities in 2007. The only remaining “complete” copies of &lt;em&gt;NL-menu&lt;/em&gt; are three (recordable) CD-ROMs that were burned shortly before the site was taken offline in 2004.&lt;/p&gt;

&lt;p&gt;As &lt;em&gt;NL-menu&lt;/em&gt; is a unique source of information about the (relatively) early history of the Dutch Internet, we made an attempt at reconstructing the site as it appeared in early 2004. This involved the following steps:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Recover the data from the remaining CD-ROMs&lt;/li&gt;
  &lt;li&gt;Set up a local copy of the site by serving the recovered data on a web server&lt;/li&gt;
  &lt;li&gt;Crawl the recovered site for inclusion in our web archive&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The remainder of this blog describes how we went about the first two steps.&lt;/p&gt;

&lt;h2 id=&quot;recovering-the-data&quot;&gt;Recovering the data&lt;/h2&gt;

&lt;p&gt;A first attempt at viewing the contents of the CD-ROMs in a file manager resulted in read errors for &lt;em&gt;all&lt;/em&gt; discs. This is not surprising, given the instability of CD-Rs, and the fact these discs were burned in early 2004. So, we tried to recover the contents of the discs with the dedicated data-recovery tool &lt;a href=&quot;https://www.gnu.org/software/ddrescue/&quot;&gt;&lt;em&gt;ddrescue&lt;/em&gt;&lt;/a&gt;. We used the following command line:&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;ddrescue &lt;span class=&quot;nt&quot;&gt;-d&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-b&lt;/span&gt; 2048 &lt;span class=&quot;nt&quot;&gt;-r4&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-v&lt;/span&gt; /dev/sr0 NL-menu-ddrescue.iso NL-menu-ddrescue.log
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Here &lt;em&gt;-d&lt;/em&gt; tells &lt;em&gt;ddrescue&lt;/em&gt; to read the disc using direct disc access mode, &lt;em&gt;-b&lt;/em&gt; sets the block size (which is 2048 bytes for a CD-ROM); &lt;em&gt;-r4&lt;/em&gt; sets the maximum number of retries in case of bad sectors to 4, and &lt;em&gt;-v&lt;/em&gt; activates verbose output mode. File &lt;em&gt;NL-menu-ddrescue.iso&lt;/em&gt; is the image file with the recovered data; &lt;em&gt;NL-menu-ddrescue.log&lt;/em&gt; is a so-called &lt;a href=&quot;https://www.gnu.org/software/ddrescue/manual/ddrescue_manual.html#Mapfile-structure&quot;&gt;&lt;em&gt;mapfile&lt;/em&gt;&lt;/a&gt;, which holds information on the recovery status of blocks of data.&lt;/p&gt;
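
&lt;p&gt;Because the mapfile is plain text, its contents are easy to summarise with a few lines of code. The following is a minimal sketch in Python, not something we used in the actual recovery. It assumes the layout described in the &lt;em&gt;ddrescue&lt;/em&gt; manual (comment lines starting with “#”, one line with the current status, then one line per block with position, size and status), with values written in hexadecimal as &lt;em&gt;ddrescue&lt;/em&gt; itself writes them. The example mapfile content is made up for illustration.&lt;/p&gt;

```python
from collections import Counter

SECTOR_SIZE = 2048  # bytes per CD-ROM data sector, matching the -b option


def summarise_mapfile(text):
    """Return the number of bytes per ddrescue status character."""
    totals = Counter()
    status_line_seen = False
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if not status_line_seen:
            # The first data line holds the current position and status
            status_line_seen = True
            continue
        fields = line.split()
        size, status = int(fields[1], 0), fields[2]
        totals[status] += size
    return totals


# Hypothetical mapfile content, made up for illustration
EXAMPLE = """\
# Mapfile. Created by GNU ddrescue
# current_pos  current_status  current_pass
0x00010000     +               1
#      pos        size  status
0x00000000  0x00008000  +
0x00008000  0x00001000  -
0x00009000  0x00007000  +
"""

totals = summarise_mapfile(EXAMPLE)
bad_sectors = totals["-"] // SECTOR_SIZE  # "-" marks bad sectors
print(f"finished: {totals['+']} bytes, bad sectors: {bad_sectors}")
```

&lt;p&gt;In the mapfile, “+” marks blocks that were read successfully and “-” marks bad sectors, so dividing the “-” total by the sector size gives the number of unreadable sectors.&lt;/p&gt;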

&lt;p&gt;One of the advantages of &lt;em&gt;ddrescue&lt;/em&gt; is that it can be run multiple consecutive times for each disc, using different optical drives if necessary. This is extremely useful, as it is not uncommon to find that some sectors on a disc result in read errors on one drive, whereas those sectors are read without problems by another drive (and vice versa).&lt;/p&gt;

&lt;h2 id=&quot;results-of-recovery-process&quot;&gt;Results of recovery process&lt;/h2&gt;

&lt;p&gt;Out of the three CD-ROMs, only one copy could be fully recovered without any unreadable sectors. The recovery process required multiple passes with &lt;em&gt;ddrescue&lt;/em&gt;, using two computers and four different optical drives (two internal drives, and two external USB drives).&lt;/p&gt;

&lt;p&gt;Only half of the second disc could be recovered after a 16-hour recovery pass with &lt;em&gt;ddrescue&lt;/em&gt;. An inspection of the resulting ISO image in a hex editor showed the recovered sectors of this disc to be byte-identical to the corresponding sectors of the first disc (which was recovered in full). Having established this, we made no further attempts at recovering more data from this disc (since it is simply a copy of the first disc).&lt;/p&gt;

&lt;p&gt;For the third disc, 99.8% of the data could be recovered after four rounds of &lt;em&gt;ddrescue&lt;/em&gt;, using four optical drives. The image below shows a visualisation of the recovery process (made with &lt;a href=&quot;https://sourceforge.net/projects/ddrescueview/&quot;&gt;&lt;em&gt;ddrescueview&lt;/em&gt;&lt;/a&gt;):&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/2018/04/ddrescue-cd3.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Here, each block represents one 2048-byte sector, where a red block is a sector with read errors. In this case 468 sectors spread across the disc are unreadable. This means that any files or folder definitions that occupy any of those sectors will be damaged. The resulting ISO image turned out to be readable, but one of the top-level directories (which contains half of the files on the disc) is not shown when the image is mounted or opened in an archive manager. So, we excluded this ISO image from any further processing as well. Unfortunately this disc did &lt;em&gt;not&lt;/em&gt; turn out to be merely a copy of the first disc.&lt;/p&gt;

&lt;h2 id=&quot;inspecting-the-contents-of-the-iso-image&quot;&gt;Inspecting the contents of the ISO image&lt;/h2&gt;

&lt;p&gt;After mounting the ISO image of the first disc (i.e. the one that was recovered without errors) on a Linux machine, the following directory structure appears:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/2018/04/caja-1.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;The &lt;em&gt;nlmenu.nl&lt;/em&gt; directory contains the Dutch-language version of the site, and &lt;em&gt;nlmenu.en&lt;/em&gt; the English-language version (oddly, there’s no top-level index page!). Here are the contents of the &lt;em&gt;nlmenu.nl&lt;/em&gt; directory:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/2018/04/caja-2.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;If we open &lt;em&gt;index.html&lt;/em&gt; in a browser (Firefox) we see this:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/2018/04/index-from-fs.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;We can see here that several images are not rendered; also none of the (internal) hyperlinks work. This happens because all file paths in the underlying HTML are defined relative to the site’s root directory, and these don’t resolve properly on the local file system. In order to render the site correctly we have to serve it from a locally installed web server.&lt;/p&gt;
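
&lt;p&gt;The effect can be illustrated with Python’s standard &lt;em&gt;urljoin&lt;/em&gt; function, which resolves a reference against a base URL in roughly the way a browser does. This is only a sketch: the mount point &lt;em&gt;/mnt/nl-menu&lt;/em&gt; is hypothetical, and the root-relative path is borrowed from the Wayback URL mentioned earlier rather than taken from the index page itself.&lt;/p&gt;

```python
from urllib.parse import urljoin

# A root-relative reference of the kind used throughout the site
# (assumption: path borrowed from the Wayback URL; not checked against index.html)
reference = "/nlmenu.nl/fset/gz.html"

# Page opened straight from the mounted ISO image (hypothetical mount point)
from_file_system = urljoin("file:///mnt/nl-menu/nlmenu.nl/index.html", reference)

# Same page served by a web server whose document root is the top of the image
from_web_server = urljoin("http://www.nl-menu.nl/nlmenu.nl/index.html", reference)

print(from_file_system)
print(from_web_server)
```

&lt;p&gt;The first result points to a location directly under the root of the local file system, outside the mounted image, whereas the second one stays within the site.&lt;/p&gt;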

&lt;h2 id=&quot;serving-the-cd-rom-contents-with-a-web-server&quot;&gt;Serving the CD-ROM contents with a web server&lt;/h2&gt;

&lt;p&gt;So, we installed the &lt;a href=&quot;https://en.wikipedia.org/wiki/Apache_HTTP_Server&quot;&gt;&lt;em&gt;Apache&lt;/em&gt; web server&lt;/a&gt; on a Linux machine, and then configured it to serve the unpacked contents of the ISO image. More details on how we did this can be found in &lt;a href=&quot;https://github.com/KBNLresearch/nl-menu-resources/blob/master/doc/serving-static-website-with-Apache.md&quot;&gt;these technical notes&lt;/a&gt;. Configuring the &lt;em&gt;hosts&lt;/em&gt; file (as explained in the technical notes) allowed us to render the site on &lt;a href=&quot;https://en.wikipedia.org/wiki/Localhost&quot;&gt;localhost&lt;/a&gt; from its original URL:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/2018/04/nl-menu-nl.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;We could crawl the resurrected site with Heritrix for inclusion in our web archive, but a number of authenticity-related issues need to be sorted out first. For one thing, we need to record metadata that makes it absolutely clear that our archival snapshot was taken from a locally reconstructed copy, rather than the original site. Also, it’s not completely straightforward how the archiving date should be defined (is this 2004 or 2018?). We’d be very interested to hear how other web archives are dealing with these issues.&lt;/p&gt;

&lt;h2 id=&quot;publicly-available-version-of-the-recovered-site&quot;&gt;Publicly available version of the recovered site&lt;/h2&gt;

&lt;p&gt;The KB web archive is only accessible on-site in our reading rooms. Since the KB owns the rights to &lt;em&gt;NL-menu&lt;/em&gt;, we decided to make the reconstructed site available on the KB Research website. In order to make this work, we applied a couple of small changes to the original files:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Relative references to website resources were re-written to reflect the location of the site on the kbresearch domain.&lt;/li&gt;
  &lt;li&gt;All references to the original &lt;em&gt;nl-menu.nl&lt;/em&gt; domain were updated to the &lt;em&gt;kbresearch.nl&lt;/em&gt; domain (most importantly this prevents JavaScript-triggered redirects to the original live &lt;em&gt;nl-menu.nl&lt;/em&gt; domain).&lt;/li&gt;
  &lt;li&gt;The Dutch index page was copied to the site root, so that it’s used as a top-level index (this was done because the CD-ROM has no top-level index page).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The above changes were all made using this &lt;a href=&quot;https://github.com/KBNLresearch/nl-menu-resources/blob/master/scripts/fixhtml.sh&quot;&gt;script&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The result of all this is available here:&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://www.kbresearch.nl/nl-menu/nl-menu/&quot;&gt;http://www.kbresearch.nl/nl-menu/nl-menu/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This reflects the state of &lt;em&gt;NL-menu&lt;/em&gt; shortly before it closed down in 2004.&lt;/p&gt;

&lt;p&gt;Although at first glance the reconstructed site appears to be of much better quality than any of the available Wayback snapshots, there are a couple of caveats:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Links to categories that contain sub categories don’t work in Firefox. A workaround is to right-click on the link and open it in a new tab (or disable JavaScript). In Chrome/Chromium these links work normally.&lt;/li&gt;
  &lt;li&gt;The site contains a number of forms that don’t work because the associated CGI scripts are missing (these scripts are not on the CD-ROM).&lt;/li&gt;
  &lt;li&gt;Some pages show the URL &lt;em&gt;http://www.kbresearch.nl&lt;/em&gt;, instead of the original &lt;em&gt;http://www.nl-menu.nl&lt;/em&gt;. This is an unintended side-effect of the script that was used to update the links.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;There may be more issues; please feel free to contact us if you spot anything that doesn’t look quite right!&lt;/p&gt;

&lt;h2 id=&quot;acknowledgements&quot;&gt;Acknowledgements&lt;/h2&gt;

&lt;p&gt;Thanks are due to the following people for their advice and suggestions: Annemarie Beunen, Willem Jan Faber, Kees Teszelszky, and Lammert Zwaagstra.&lt;/p&gt;

&lt;h2 id=&quot;additional-resources&quot;&gt;Additional resources&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;http://www.kbresearch.nl/nl-menu/nl-menu/&quot;&gt;NL-menu (2004 snapshot at kbresearch.nl)&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://github.com/KBNLresearch/nl-menu-resources/blob/master/doc/serving-static-website-with-Apache.md&quot;&gt;Serving a static website with the Apache web server (technical notes)&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://zenodo.org/record/881109&quot;&gt;How can we improve our web collection? An evaluation of web archiving at the KB National Library of the Netherlands (2007-2017)&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;hr /&gt;

&lt;p&gt;Originally published at the &lt;a href=&quot;https://openpreservation.org/blog/2018/04/24/resurrecting-the-first-dutch-web-index-nl-menu-revisited/&quot;&gt;Open Preservation Foundation blog&lt;/a&gt;&lt;/p&gt;
</description>
                <link>https://bitsgalore.org/2018/04/24/resurrecting-the-first-dutch-web-index-nl-menu-revisited</link>
                <guid>https://bitsgalore.org/2018/04/24/resurrecting-the-first-dutch-web-index-nl-menu-revisited</guid>
                <pubDate>2018-04-24T00:00:00+02:00</pubDate>
        </item>

        <item>
                <title>Update on Isolyzer&#58; UDF, HFS+ and more!</title>
                <description>&lt;p&gt;Earlier this year I &lt;a href=&quot;/2017/01/13/detecting-broken-iso-images-introducing-isolyzer&quot;&gt;blogged about &lt;em&gt;Isolyzer&lt;/em&gt;&lt;/a&gt;, a tool designed to help the detection of broken ISO images. Today I released a shiny new &lt;a href=&quot;https://github.com/KBNLresearch/isolyzer&quot;&gt;beta version&lt;/a&gt; that adds a significant amount of new functionality. Below is an overview of the main changes, followed by some warnings and caveats.&lt;/p&gt;

&lt;!-- more --&gt;

&lt;h2 id=&quot;support-of-more-file-systems&quot;&gt;Support of more file systems&lt;/h2&gt;

&lt;p&gt;Where previous versions only supported disc images with an &lt;a href=&quot;https://en.wikipedia.org/wiki/ISO_9660&quot;&gt;&lt;em&gt;ISO 9660&lt;/em&gt;&lt;/a&gt; file system (with limited support for hybrid &lt;em&gt;ISO 9660&lt;/em&gt;/Apple &lt;em&gt;HFS&lt;/em&gt; file systems), the new release can deal with a much broader range of file systems. In particular, it adds support for the &lt;a href=&quot;https://en.wikipedia.org/wiki/Universal_Disk_Format&quot;&gt;&lt;em&gt;Universal Disk Format&lt;/em&gt;&lt;/a&gt; (&lt;em&gt;UDF&lt;/em&gt;) and Apple’s &lt;a href=&quot;https://en.wikipedia.org/wiki/HFS_Plus&quot;&gt;&lt;em&gt;HFS+&lt;/em&gt;&lt;/a&gt; file system. Unlike previous versions, &lt;em&gt;Isolyzer&lt;/em&gt; can now also deal with Apple disc layouts that don’t contain a &lt;a href=&quot;https://en.wikipedia.org/wiki/Apple_Partition_Map&quot;&gt;partition map&lt;/a&gt; (see also &lt;a href=&quot;https://en.wikipedia.org/wiki/Hybrid_disc#Multiple_file_systems&quot;&gt;here&lt;/a&gt; for more details on Apple disc layouts). Crucially, all of the above are supported both as stand-alone file systems (e.g. a CD image with exclusively a &lt;em&gt;HFS+&lt;/em&gt; file system) as well as in various hybrid configurations (e.g. &lt;a href=&quot;http://www.afterdawn.com/glossary/term.cfm/udf_bridge&quot;&gt;&lt;em&gt;UDF Bridge&lt;/em&gt;&lt;/a&gt; format).&lt;/p&gt;

&lt;p&gt;Details on how &lt;em&gt;Isolyzer&lt;/em&gt; performs the calculations for estimating the expected image size in each of these situations can be found &lt;a href=&quot;https://github.com/KBNLresearch/isolyzer#calculation-of-the-expected-file-size&quot;&gt;here&lt;/a&gt;. As it turns out, for UDF this calculation is not as straightforward as one would expect. The result is that &lt;em&gt;Isolyzer&lt;/em&gt; typically under-estimates the true image size by several sectors. In most cases this is unlikely to be a problem; nevertheless, it may be possible to improve this in future versions.&lt;/p&gt;
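
&lt;p&gt;To give an idea of the principle, here is a simplified sketch in Python for the plain &lt;em&gt;ISO 9660&lt;/em&gt; case only. It is not &lt;em&gt;Isolyzer&lt;/em&gt;’s actual code (see the documentation linked above for what the tool really does). It reads the volume space size and the logical block size from the Primary Volume Descriptor, at the byte offsets defined by the &lt;em&gt;ISO 9660&lt;/em&gt; standard, and multiplies them. The helper that builds a synthetic test image is purely hypothetical.&lt;/p&gt;

```python
SECTOR_SIZE = 2048
PVD_OFFSET = 16 * SECTOR_SIZE  # the Primary Volume Descriptor sits in sector 16


def expected_iso_size(image):
    """Return the expected size in bytes of a plain ISO 9660 image held in memory."""
    pvd = image[PVD_OFFSET:PVD_OFFSET + SECTOR_SIZE]
    if len(pvd) != SECTOR_SIZE or pvd[0] != 1 or pvd[1:6] != b"CD001":
        raise ValueError("no ISO 9660 Primary Volume Descriptor found")
    # Both fields are stored in both byte orders; read the little-endian copies
    volume_space_size = int.from_bytes(pvd[80:84], "little")     # number of blocks
    logical_block_size = int.from_bytes(pvd[128:130], "little")  # bytes per block
    return volume_space_size * logical_block_size


def make_test_image(blocks_declared, blocks_present):
    """Build a synthetic, otherwise empty image (hypothetical helper for this example)."""
    image = bytearray(blocks_present * SECTOR_SIZE)
    pvd = bytearray(SECTOR_SIZE)
    pvd[0] = 1            # descriptor type 1: Primary Volume Descriptor
    pvd[1:6] = b"CD001"   # standard identifier
    pvd[80:84] = blocks_declared.to_bytes(4, "little")
    pvd[84:88] = blocks_declared.to_bytes(4, "big")
    pvd[128:130] = SECTOR_SIZE.to_bytes(2, "little")
    pvd[130:132] = SECTOR_SIZE.to_bytes(2, "big")
    image[PVD_OFFSET:PVD_OFFSET + SECTOR_SIZE] = pvd
    return bytes(image)


# An image that declares 100 blocks but only contains 60 of them
truncated = make_test_image(blocks_declared=100, blocks_present=60)
expected = expected_iso_size(truncated)
missing = max(0, expected - len(truncated))
print(f"expected {expected} bytes, found {len(truncated)} bytes, missing {missing}")
```

&lt;p&gt;An image that is smaller than the expected size is a strong indication that it is truncated or otherwise incomplete.&lt;/p&gt;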

&lt;h2 id=&quot;changes-to-the-output-format&quot;&gt;Changes to the output format&lt;/h2&gt;

&lt;p&gt;The addition of multiple file system support required some changes to &lt;em&gt;Isolyzer&lt;/em&gt;’s output format. The main change is the addition of a &lt;em&gt;fileSystems&lt;/em&gt; element, which in turn holds one or more &lt;em&gt;fileSystem&lt;/em&gt; elements that each contain information about a detected file system. The updated output format is &lt;a href=&quot;https://github.com/KBNLresearch/isolyzer#isolyzer-output&quot;&gt;documented here&lt;/a&gt;, and some examples are &lt;a href=&quot;https://github.com/KBNLresearch/isolyzer#examples&quot;&gt;available here&lt;/a&gt;. Note that these changes may break existing workflows that use &lt;em&gt;Isolyzer&lt;/em&gt;. This is also the main reason for making this a major (1.x) release. It is still possible that the format will change somewhat in the final (stable) release; this will largely depend on any feedback we may get from users of the tool.&lt;/p&gt;

&lt;h2 id=&quot;documentation-and-test-images&quot;&gt;Documentation and test images&lt;/h2&gt;

&lt;p&gt;Finally &lt;em&gt;Isolyzer&lt;/em&gt;’s documentation has been given a major overhaul, as you can see from &lt;a href=&quot;https://github.com/KBNLresearch/isolyzer/&quot;&gt;the main Github page&lt;/a&gt;. The repo now includes a &lt;a href=&quot;https://github.com/KBNLresearch/isolyzer/tree/master/testFiles&quot;&gt;new set of small test images&lt;/a&gt; that cover all of the currently supported file systems (with the exception of &lt;em&gt;HFS&lt;/em&gt;/&lt;em&gt;HFS+&lt;/em&gt; file systems with a partition map, for which I’ve been unable to create or find a sufficiently small sample).&lt;/p&gt;

&lt;h2 id=&quot;installation&quot;&gt;Installation&lt;/h2&gt;

&lt;p&gt;You can install &lt;em&gt;Isolyzer&lt;/em&gt; with &lt;a href=&quot;https://en.wikipedia.org/wiki/Pip_(package_manager)&quot;&gt;&lt;em&gt;pip&lt;/em&gt;&lt;/a&gt;, using the following command:&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;pip &lt;span class=&quot;nb&quot;&gt;install &lt;/span&gt;isolyzer &lt;span class=&quot;nt&quot;&gt;--user&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;For Windows users &lt;a href=&quot;https://github.com/KBNLresearch/isolyzer/releases/tag/1.0.0&quot;&gt;64- and 32-bit binaries are available here&lt;/a&gt;. These binaries are completely stand-alone and don’t require Python on your machine.&lt;/p&gt;

&lt;h2 id=&quot;feedback-appreciated&quot;&gt;Feedback appreciated&lt;/h2&gt;

&lt;p&gt;As always feedback on &lt;em&gt;Isolyzer&lt;/em&gt; is highly appreciated, if only because any problem its users come across we’ll probably run into ourselves at some point! (Please note though that because of upcoming holidays I may be a little slow in following up any queries.)&lt;/p&gt;

&lt;h2 id=&quot;links&quot;&gt;Links&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;https://github.com/KBNLresearch/isolyzer/&quot;&gt;&lt;em&gt;Isolyzer&lt;/em&gt; on Github&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://github.com/KBNLresearch/isolyzer/releases/tag/1.0.0&quot;&gt;Windows binaries (64- and 32-bit)&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;postscript-2-february-2019&quot;&gt;Postscript (2 February 2019)&lt;/h2&gt;

&lt;p&gt;Removed some outdated and possibly confusing information from the “Installation” section.&lt;/p&gt;

&lt;hr /&gt;

&lt;p&gt;Originally published at the &lt;a href=&quot;https://openpreservation.org/blog/2017/07/12/update-on-isolyzer-udf-hfs-and-more/&quot;&gt;Open Preservation Foundation blog&lt;/a&gt;&lt;/p&gt;
</description>
                <link>https://bitsgalore.org/2017/07/12/update-on-isolyzer-udf-hfs-and-more</link>
                <guid>https://bitsgalore.org/2017/07/12/update-on-isolyzer-udf-hfs-and-more</guid>
                <pubDate>2017-07-12T00:00:00+02:00</pubDate>
        </item>

        <item>
                <title>Image and Rip Optical Media Like A Boss!</title>
                <description>&lt;p&gt;Over the last months we’ve been working on the development of a provisional workflow for preserving the content of  optical media in our collection. The main result thus far is &lt;a href=&quot;https://github.com/KBNLresearch/iromlab&quot;&gt;&lt;em&gt;Iromlab&lt;/em&gt;&lt;/a&gt;, a custom workflow application that streamlines the imaging and ripping process. This blogpost gives an overview of &lt;em&gt;Iromlab&lt;/em&gt;, as well as the reasons why we created it in the first place.&lt;/p&gt;

&lt;!-- more --&gt;

&lt;h2 id=&quot;the-challenges&quot;&gt;The challenges&lt;/h2&gt;

&lt;p&gt;The first challenge is the sheer size of the collection: somewhere between 15,000 and 20,000 discs according to our latest estimates. Processing such volumes efficiently requires a workflow that is highly automated. Secondly, whereas the majority of optical discs in our collection are CD-ROMs, about 25% are audio CDs, which need to be processed quite differently. From the outset it was clear that the workflow should be able to handle both types of discs (we also have a significant number of DVDs, but these are processed in a similar manner to CD-ROMs). Finally, it is crucial that the disc images remain linked (through metadata) to their respective records in our catalogue. A particular quirk here is that in most cases, a catalogue record does not describe an individual disc. Often the discs are supplemental to a physical (paper) book. In that case the catalogue record describes the book, with a metadata field indicating that the book contains, for example, a CD-ROM. Another common situation is that the catalogue record describes multiple discs. For example, it may describe a language course that is made up of one DVD, 2 CD-ROMs and 3 audio CDs. In all these cases, there is a &lt;a href=&quot;https://en.wikipedia.org/wiki/One-to-many_(data_model)&quot;&gt;one-to-many&lt;/a&gt; relationship between the catalogue record and the entities (books, optical discs) that it describes. Because of this, it is important that the workflow retains a record of the sequential order of discs within one catalogue record.&lt;/p&gt;

&lt;h2 id=&quot;disc-robot&quot;&gt;Disc robot&lt;/h2&gt;

&lt;p&gt;The level of automation that is required for this job implied we would need to use some kind of disc robot. Based on &lt;a href=&quot;https://arxiv.org/abs/1309.4932&quot;&gt;earlier work by colleagues at the British Library&lt;/a&gt;, we bought an &lt;a href=&quot;http://www.acronova.com/product/auto-blu-ray-duplicator-publisher-ripper-nimbie-usb-nb21/9/review.html&quot;&gt;&lt;em&gt;Acronova Nimbie&lt;/em&gt;&lt;/a&gt; machine (NB21-DVD type) for evaluation. The first thing that became obvious was that the software bundled with the machine has many limitations, making it unsuitable for our needs. Somewhat disappointingly, the bundle does not include any tools or an API to operate the disc loading mechanism (load, unload or reject a disc). However, a quick test with the &lt;a href=&quot;https://www.dbpoweramp.com/batch-ripper.htm&quot;&gt;&lt;em&gt;dBpoweramp&lt;/em&gt; batch ripper software&lt;/a&gt; (part of &lt;a href=&quot;https://www.dbpoweramp.com/&quot;&gt;&lt;em&gt;dBpoweramp&lt;/em&gt;&lt;/a&gt;, which had been on top of our shortlist of candidate tools for audio ripping) revealed that this software includes command-line driver tools for loading, unloading and rejecting. This makes it possible to operate the disc robot from a custom-built script (which wraps around these command-line tools). It also enables one to use the machine with pretty much &lt;em&gt;any&lt;/em&gt; extraction or audio ripping software that has a command-line interface: after loading a disc, the disc robot’s optical drive can be accessed like any ordinary (internal or external) drive. This sparked the development of &lt;a href=&quot;https://github.com/KBNLresearch/iromlab&quot;&gt;&lt;em&gt;Iromlab&lt;/em&gt;&lt;/a&gt; (Image and Rip Optical Media Like A Boss), a custom workflow application to streamline the imaging and ripping process.&lt;/p&gt;

&lt;h2 id=&quot;main-workflow-components&quot;&gt;Main workflow components&lt;/h2&gt;

&lt;p&gt;The workflow is built around a number of tried and tested software components. For the disc type identification it uses &lt;a href=&quot;https://linux.die.net/man/1/cd-info&quot;&gt;&lt;em&gt;cd-info&lt;/em&gt;&lt;/a&gt;, which is part of &lt;a href=&quot;https://www.gnu.org/software/libcdio/&quot;&gt;&lt;em&gt;libcdio&lt;/em&gt;&lt;/a&gt;, the “GNU Compact Disc Input and Control Library”. Extraction of data tracks from CD-ROMs and DVDs is done with &lt;a href=&quot;https://www.isobuster.com/&quot;&gt;&lt;em&gt;IsoBuster&lt;/em&gt;&lt;/a&gt;. Audio ripping is done with &lt;a href=&quot;https://www.dbpoweramp.com/&quot;&gt;&lt;em&gt;dBpoweramp&lt;/em&gt;&lt;/a&gt;. Since &lt;em&gt;dBpoweramp&lt;/em&gt; only has a graphical interface, we contacted its author, and through a small development contract with the KB he wrote a &lt;a href=&quot;https://github.com/KBNLresearch/iromlab/tree/master/dBpowerampconsolerip&quot;&gt;command-line tool&lt;/a&gt; for the core ripping software. This enabled us to integrate &lt;em&gt;dBpoweramp&lt;/em&gt; into the workflow as well. Ripped audio files are verified for completeness using either &lt;a href=&quot;http://www.etree.org/shnutils/shntool/&quot;&gt;&lt;em&gt;Shntool&lt;/em&gt;&lt;/a&gt; (WAVE format) or &lt;a href=&quot;https://xiph.org/flac/&quot;&gt;&lt;em&gt;flac&lt;/em&gt;&lt;/a&gt; (FLAC format). ISO images are checked with &lt;a href=&quot;/2017/01/13/detecting-broken-iso-images-introducing-isolyzer&quot;&gt;&lt;em&gt;Isolyzer&lt;/em&gt;&lt;/a&gt;. Finally, operation of the disc robot is done using the &lt;em&gt;dBpoweramp&lt;/em&gt; driver tools.&lt;/p&gt;

&lt;h2 id=&quot;architecture&quot;&gt;Architecture&lt;/h2&gt;

&lt;p&gt;The workflow (which is written in Python) integrates the above components. The figure below gives an overview:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/2017/06/iromlabArchitectureSmall.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;In short, &lt;em&gt;Iromlab&lt;/em&gt; consists of a main module (&lt;em&gt;iromlab.pyw&lt;/em&gt;) that defines a graphical interface that is used to enter data about each disc. For each disc the entered data are sent as job files to a file-based processing queue. These jobs are monitored by a worker module (&lt;em&gt;cdworker.py&lt;/em&gt;), which orchestrates the main heavy-lifting (disc identification, imaging, etcetera). The jobs are processed in &lt;a href=&quot;https://en.wikipedia.org/wiki/FIFO_(computing_and_electronics)&quot;&gt;“First In First Out”&lt;/a&gt; order. As the physical discs are inserted into the machine in the same order, this ensures that the workflow keeps track of the relationship between the discs and the entered data (in the absence of data entry errors!). Since the worker module runs as a subprocess (separate thread from the data entry interface), new discs can be loaded continuously while the imaging/ripping process is ongoing. The overall goal was to make the workflow as simple as possible to use for an operator, which meant that manual data entry is reduced to a minimum.&lt;/p&gt;

&lt;h2 id=&quot;batch-structure&quot;&gt;Batch structure&lt;/h2&gt;

&lt;p&gt;All output of an &lt;em&gt;Iromlab&lt;/em&gt; session is written to a batch. Each disc is represented by a directory which contains the extracted data (ISO image, WAVE/Flac files), log files of the extraction software, &lt;em&gt;cd-info&lt;/em&gt; output and a SHA-512 checksum file. In addition, each batch has a &lt;em&gt;batch manifest&lt;/em&gt;, which is a comma-delimited file that contains all information that is needed to process the batch into ingest-ready &lt;a href=&quot;https://documents.clockss.org/index.php?title=Definition_of_SIP&quot;&gt;Submission Information Packages&lt;/a&gt; (SIPs) further down the processing chain. This includes a simple “success” flag that indicates whether the imaging of a disc was successful. Finally, the batch contains a log file with detailed output about the individual processes in the workflow. For an example of the batch structure, check out this small &lt;a href=&quot;https://github.com/KBNLresearch/iromlabDemobatch&quot;&gt;online demo batch&lt;/a&gt;. Note that for copyright and file size reasons, all ISO and FLAC files from the original batch are replaced by empty (zero-byte) files in this demo.&lt;/p&gt;
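
&lt;p&gt;As an illustration of the fixity step, the sketch below computes SHA-512 checksums for all files in a disc directory with Python’s standard &lt;em&gt;hashlib&lt;/em&gt; module, and writes them to a checksum file. The name and layout of that file are assumptions for this example (they are not copied from &lt;em&gt;Iromlab&lt;/em&gt;), and the directory contents are dummy data.&lt;/p&gt;

```python
import hashlib
import tempfile
from pathlib import Path


def sha512_of_file(path, chunk_size=1024 * 1024):
    """Compute the SHA-512 digest of a file, reading it in chunks to limit memory use."""
    digest = hashlib.sha512()
    with open(path, "rb") as stream:
        while True:
            chunk = stream.read(chunk_size)
            if not chunk:
                break
            digest.update(chunk)
    return digest.hexdigest()


def write_checksum_file(disc_dir, name="checksums.sha512"):
    """Write one 'digest  file name' line per file (the file name is an assumption)."""
    lines = []
    for path in sorted(disc_dir.iterdir()):
        if path.is_file() and path.name != name:
            lines.append(f"{sha512_of_file(path)}  {path.name}")
    checksum_file = disc_dir / name
    checksum_file.write_text("\n".join(lines) + "\n")
    return checksum_file


with tempfile.TemporaryDirectory() as tmp:
    disc_dir = Path(tmp)
    # Dummy stand-ins for an ISO image and a log file
    (disc_dir / "disc.iso").write_bytes(b"dummy image data")
    (disc_dir / "cd-info.log").write_bytes(b"dummy log\n")
    checksum_lines = write_checksum_file(disc_dir).read_text().splitlines()

print("\n".join(checksum_lines))
```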

&lt;h2 id=&quot;data-entry&quot;&gt;Data entry&lt;/h2&gt;

&lt;p&gt;Upon startup, &lt;em&gt;Iromlab&lt;/em&gt; launches a simple graphical interface that is used for data entry. An operator then presses a button to create a new batch. Once the batch is initialised, the following fields need to be entered for each disc:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/2017/06/iromAllesBestandsformaten.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Here, &lt;em&gt;PPN&lt;/em&gt; is the identifier of the catalogue record that is associated with the disc. &lt;em&gt;Volume number&lt;/em&gt; defines the sequential order in case multiple discs are associated with one record (e.g. a CD box set). Note that &lt;em&gt;Carrier type&lt;/em&gt; does not influence the imaging or ripping process; it is only used to describe the disc at the metadata level. After pressing “Submit”, &lt;em&gt;Iromlab&lt;/em&gt; looks up the entered &lt;em&gt;PPN&lt;/em&gt; identifier in the KB catalogue using an &lt;a href=&quot;https://en.wikipedia.org/wiki/Search/Retrieve_via_URL&quot;&gt;&lt;em&gt;SRU&lt;/em&gt;&lt;/a&gt; query, and then prompts the operator to confirm the publication title that was returned:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/2017/06/iromConfirmTitle.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;After confirming this, &lt;em&gt;Iromlab&lt;/em&gt; prompts the operator to insert the disc into the disc robot:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/2017/06/loadDisc.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;This second prompt tries to enforce a fixed processing order (i.e. enter the data first, and then insert the disc), which reduces the risk of synchronisation errors between the jobs queue and the stack of physical discs in the disc robot. After pressing “OK”, the job is added to the processing queue.&lt;/p&gt;

&lt;figure class=&quot;image&quot;&gt;
  &lt;img src=&quot;/images/2017/06/workstationSmall.jpg&quot; alt=&quot;Photograph of Iromlab with disc robot&quot; /&gt;
  &lt;figcaption&gt;Iromlab in action; disc robot on the left.&lt;/figcaption&gt;
&lt;/figure&gt;

&lt;h2 id=&quot;processing-a-disc&quot;&gt;Processing a disc&lt;/h2&gt;

&lt;p&gt;Meanwhile the worker module continuously scans the processing queue for job files. Once it picks up a job, it reads the information from the job file, and  tries to load a new disc. The processing of each disc then involves the following steps:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Analyse the disc with &lt;em&gt;cd-info&lt;/em&gt;.&lt;/li&gt;
  &lt;li&gt;Record information about the sector layout of the physical disc (this is needed to ensure ISO images extracted from &lt;a href=&quot;/2017/04/25/imaging-cd-extra-blue-book-discs&quot;&gt;CD-Extra / Bluebook&lt;/a&gt; discs are accessible).&lt;/li&gt;
  &lt;li&gt;Rip any audio tracks to WAVE or FLAC; extract data tracks to ISO images.&lt;/li&gt;
  &lt;li&gt;Record information about the extraction process (logs, exit codes).&lt;/li&gt;
  &lt;li&gt;Check the integrity of the extracted/ripped files.&lt;/li&gt;
  &lt;li&gt;Record fixity information (SHA-512 checksums).&lt;/li&gt;
  &lt;li&gt;Record any information that is necessary further down the processing chain for the creation of ingest-ready SIPs.&lt;/li&gt;
  &lt;li&gt;Add an entry for the disc to the batch manifest.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;further-processing-of-the-batch&quot;&gt;Further processing of the batch&lt;/h2&gt;

&lt;p&gt;In order to create ingest-ready SIPs from a batch, some further processing is necessary. In particular:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;The integrity of the batch needs to be verified (i.e. is the batch complete, were all discs imaged successfully?).&lt;/li&gt;
  &lt;li&gt;Any discs that have problems need to be removed from the batch. Since in our case we decided that each SIP will contain all discs that belong to a catalogue record, this also means that for each problematic disc, &lt;em&gt;all&lt;/em&gt; discs that belong to its corresponding catalogue record are removed (i.e. moved to a separate “error batch”).&lt;/li&gt;
  &lt;li&gt;If the batch passed the verification step, it can be transformed into ingest-ready SIPs.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The above steps are outside of &lt;em&gt;Iromlab&lt;/em&gt;’s scope, but they are covered by the separate &lt;a href=&quot;https://github.com/KBNLresearch/omSipCreator&quot;&gt;&lt;em&gt;omSipCreator&lt;/em&gt;&lt;/a&gt; tool. This is a work in progress that still needs some refinements, and it might be the subject of a follow-up blog.&lt;/p&gt;
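
&lt;p&gt;To make the second step more concrete, here is a minimal Python sketch of how all discs of a catalogue record can be set aside as soon as one of them has failed. It does not reflect &lt;em&gt;omSipCreator&lt;/em&gt;’s actual code, and the manifest columns and values used here are hypothetical.&lt;/p&gt;

```python
import csv
import io

# Hypothetical batch manifest; column names and values are made up for illustration
MANIFEST = """\
jobID,PPN,volumeNo,success
job-001,123456789,1,True
job-002,123456789,2,False
job-003,987654321,1,True
"""


def split_batch(manifest_text):
    """Split manifest rows into discs that can go into SIPs and discs for the error batch."""
    rows = list(csv.DictReader(io.StringIO(manifest_text)))
    # Catalogue records for which at least one disc was not imaged successfully
    failed_ppns = {row["PPN"] for row in rows if row["success"] != "True"}
    good = [row for row in rows if row["PPN"] not in failed_ppns]
    errors = [row for row in rows if row["PPN"] in failed_ppns]
    return good, errors


good, errors = split_batch(MANIFEST)
print("ready for SIP creation:", [row["jobID"] for row in good])
print("moved to error batch:", [row["jobID"] for row in errors])
```

&lt;p&gt;Note that the first disc of the first record ends up in the error batch even though it was imaged successfully, simply because the second disc of the same record failed.&lt;/p&gt;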

&lt;h2 id=&quot;using-iromlab-outside-the-kb&quot;&gt;Using Iromlab outside the KB&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Iromlab&lt;/em&gt; is tailored specifically to the situation at the KB, and it is not meant to be a general-purpose workflow solution. Nevertheless, others might find (parts of) the software useful. For instance, it would be quite straightforward to adapt those parts of the software that query the KB catalogue to other institutional catalogues or databases. Also, even though &lt;em&gt;Iromlab&lt;/em&gt; is tailor-made to a very specific combination of hardware and software, most of the software dependencies are implemented using simple wrapper modules. Adding new ones for alternative (imaging or ripping) tools would be pretty easy.&lt;/p&gt;

&lt;h2 id=&quot;documentation&quot;&gt;Documentation&lt;/h2&gt;

&lt;p&gt;User documentation for &lt;em&gt;Iromlab&lt;/em&gt; is &lt;a href=&quot;https://github.com/KBNLresearch/iromlab/tree/master/doc&quot;&gt;available here&lt;/a&gt;. It includes a &lt;a href=&quot;https://github.com/KBNLresearch/iromlab/blob/master/doc/userGuide.md&quot;&gt;User Guide&lt;/a&gt;, as well as detailed instructions on how to &lt;a href=&quot;https://github.com/KBNLresearch/iromlab/blob/master/doc/setupIromlab.md&quot;&gt;install the software&lt;/a&gt;, and how to set up and configure &lt;a href=&quot;https://github.com/KBNLresearch/iromlab/blob/master/doc/setupIsobuster.md&quot;&gt;&lt;em&gt;IsoBuster&lt;/em&gt;&lt;/a&gt; and &lt;a href=&quot;https://github.com/KBNLresearch/iromlab/blob/master/doc/setupDbpoweramp.md&quot;&gt;&lt;em&gt;dBpoweramp&lt;/em&gt;&lt;/a&gt;.&lt;/p&gt;

&lt;h2 id=&quot;acknowledgements&quot;&gt;Acknowledgements&lt;/h2&gt;

&lt;p&gt;Thanks are due to &lt;em&gt;dBpoweramp&lt;/em&gt; creator “Mr. Spoon” for his work on the &lt;em&gt;dBpoweramp&lt;/em&gt; command-line tool, and for allowing us to re-distribute the binaries. &lt;em&gt;IsoBuster&lt;/em&gt;’s author Peter van Hove is thanked for some modifications he made to &lt;em&gt;IsoBuster&lt;/em&gt;’s command-line interface.&lt;/p&gt;

&lt;h2 id=&quot;links&quot;&gt;Links&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;https://github.com/KBNLresearch/iromlab&quot;&gt;&lt;em&gt;Iromlab&lt;/em&gt; on Github&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://github.com/KBNLresearch/iromlab/tree/master/doc&quot;&gt;Documentation&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://github.com/KBNLresearch/iromlabDemobatch&quot;&gt;&lt;em&gt;Iromlab&lt;/em&gt; demo batch&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;hr /&gt;

&lt;p&gt;Originally published at the &lt;a href=&quot;https://openpreservation.org/blog/2017/06/19/image-and-rip-optical-media-like-a-boss/&quot;&gt;Open Preservation Foundation blog&lt;/a&gt;&lt;/p&gt;
</description>
                <link>https://bitsgalore.org/2017/06/19/image-and-rip-optical-media-like-a-boss</link>
                <guid>https://bitsgalore.org/2017/06/19/image-and-rip-optical-media-like-a-boss</guid>
                <pubDate>2017-06-19T00:00:00+02:00</pubDate>
        </item>

        <item>
                <title>Policy-based assessment with VeraPDF - a first impression</title>
                <description>&lt;p&gt;Some four years ago I wrote &lt;a href=&quot;/2013/07/25/identification-pdf-preservation-risks-sequel&quot;&gt;a blog post&lt;/a&gt; that demonstrated how &lt;em&gt;Apache Preflight&lt;/em&gt; (the PDF/A validator tool that is part of &lt;a href=&quot;https://pdfbox.apache.org/&quot;&gt;&lt;em&gt;Apache PDFBox&lt;/em&gt;&lt;/a&gt;) can be used to detect features in a PDF that are potential preservation risks. A &lt;a href=&quot;/2014/01/27/identification-pdf-preservation-risks-analysis-govdocs-selected-corpus&quot;&gt;follow-up blog&lt;/a&gt; applied &lt;a href=&quot;https://en.wikipedia.org/wiki/Schematron&quot;&gt;&lt;em&gt;Schematron&lt;/em&gt;&lt;/a&gt; rules to the &lt;em&gt;Preflight&lt;/em&gt; output in an attempt at doing policy-based assessments. The results of that work were quite promising, but dealing with &lt;em&gt;Preflight&lt;/em&gt;’s multitude of (especially font-related) validation errors proved to be a challenge.&lt;/p&gt;

&lt;p&gt;The idea of using a &lt;em&gt;PDF/A&lt;/em&gt; validator for policy-based assessments of “regular” &lt;em&gt;PDF&lt;/em&gt; files (i.e. &lt;em&gt;PDF&lt;/em&gt;s that are not necessarily &lt;em&gt;PDF/A&lt;/em&gt;) was explicitly addressed as a use case for &lt;a href=&quot;http://verapdf.org/&quot;&gt;&lt;em&gt;veraPDF&lt;/em&gt;&lt;/a&gt;. With &lt;em&gt;VeraPDF&lt;/em&gt; now having entered its “final testing phase”, I thought this was a good time for a small test-drive of &lt;em&gt;veraPDF&lt;/em&gt;’s capabilities in this area. All test results are based on &lt;em&gt;VeraPDF&lt;/em&gt; 1.4.7.&lt;/p&gt;

&lt;!-- more --&gt;

&lt;h2 id=&quot;test-data&quot;&gt;Test data&lt;/h2&gt;

&lt;p&gt;For this test I used &lt;em&gt;PDF&lt;/em&gt;s from the &lt;a href=&quot;https://web.archive.org/web/20130503115947/http://acroeng.adobe.com/wp/&quot;&gt;&lt;em&gt;Adobe Acrobat Engineering&lt;/em&gt; website&lt;/a&gt; (sadly gone since 2015). As in my 2013 blog post, I limited the analysis to:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;all files in the &lt;em&gt;General&lt;/em&gt; section of the &lt;a href=&quot;https://web.archive.org/web/20150228065249/http://acroeng.adobe.com:80/wp/?page_id=101&quot;&gt;&lt;em&gt;Font Testing&lt;/em&gt;&lt;/a&gt; category;&lt;/li&gt;
  &lt;li&gt;all files in the &lt;em&gt;Classic Multimedia&lt;/em&gt; section of the &lt;a href=&quot;https://web.archive.org/web/20150228104639/http://acroeng.adobe.com:80/wp/?page_id=61&quot;&gt;&lt;em&gt;Multimedia &amp;amp; 3D Tests&lt;/em&gt;&lt;/a&gt; category.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The dataset is quite small, but contains many complex and otherwise challenging &lt;em&gt;PDF&lt;/em&gt;s, which make it an interesting dataset for testing.&lt;/p&gt;

&lt;h2 id=&quot;policy&quot;&gt;Policy&lt;/h2&gt;

&lt;p&gt;The policy is similar to the one used in &lt;a href=&quot;/2014/01/27/identification-pdf-preservation-risks-analysis-govdocs-selected-corpus&quot;&gt;my 2014 blog post&lt;/a&gt;, and it is defined by the following objectives:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;No encryption / password protection&lt;/li&gt;
  &lt;li&gt;All fonts are embedded&lt;/li&gt;
  &lt;li&gt;No embedded files&lt;/li&gt;
  &lt;li&gt;No file attachments&lt;/li&gt;
  &lt;li&gt;No multimedia content (audio, video, 3-D objects)&lt;/li&gt;
  &lt;li&gt;No PDFs that raise an exception or result in a processing error in &lt;em&gt;VeraPDF&lt;/em&gt; (&lt;em&gt;PDF&lt;/em&gt; validity proxy)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;(Note that the 2014 blog post also mentioned the absence of &lt;em&gt;JavaScript&lt;/em&gt; as an additional objective. However, it turned out that the necessary output for this is not currently reported by &lt;em&gt;VeraPDF&lt;/em&gt;.)&lt;/p&gt;

&lt;p&gt;Subsequently I ‘translated’ each of these objectives into &lt;em&gt;Schematron&lt;/em&gt; rules. For a basic &lt;em&gt;how-to&lt;/em&gt; see the  &lt;a href=&quot;http://docs.verapdf.org/policy/&quot;&gt;&lt;em&gt;veraPDF Policy Checking&lt;/em&gt; documentation&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The full &lt;em&gt;Schematron&lt;/em&gt; file can be found &lt;a href=&quot;https://github.com/KBNLresearch/pdfPolicyVeraPDF/blob/master/schemas/demo-policy.sch&quot;&gt;here&lt;/a&gt;.&lt;/p&gt;
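
&lt;p&gt;As a rough illustration of what such a translation can look like, the sketch below covers only the font embedding and multimedia objectives. The assert messages are the ones that show up in the results tables, but the element paths in the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;context&lt;/code&gt; attributes and the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;embedded&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;subType&lt;/code&gt; element names are assumptions about the layout of the features report. Check them against actual &lt;em&gt;VeraPDF&lt;/em&gt; output, and against the full policy file linked above, before relying on them:&lt;/p&gt;

&lt;div class=&quot;language-xml highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&amp;lt;?xml version="1.0"?&amp;gt;
&amp;lt;sch:schema xmlns:sch="http://purl.oclc.org/dsdl/schematron" queryBinding="xslt"&amp;gt;

    &amp;lt;!-- Objective 2: all fonts are embedded.
         Assumes each font descriptor reports an "embedded" element. --&amp;gt;
    &amp;lt;sch:pattern&amp;gt;
        &amp;lt;sch:rule context="/report/jobs/job/featuresReport/documentResources/fonts/font/fontDescriptor"&amp;gt;
            &amp;lt;sch:assert test="not(embedded='false')"&amp;gt;Font is not embedded&amp;lt;/sch:assert&amp;gt;
        &amp;lt;/sch:rule&amp;gt;
    &amp;lt;/sch:pattern&amp;gt;

    &amp;lt;!-- Objective 5: no multimedia content.
         Assumes each annotation reports its type in a "subType" element. --&amp;gt;
    &amp;lt;sch:pattern&amp;gt;
        &amp;lt;sch:rule context="/report/jobs/job/featuresReport/annotations/annotation"&amp;gt;
            &amp;lt;sch:assert test="not(subType='Screen')"&amp;gt;Screen annotation&amp;lt;/sch:assert&amp;gt;
            &amp;lt;sch:assert test="not(subType='Movie')"&amp;gt;Movie annotation&amp;lt;/sch:assert&amp;gt;
            &amp;lt;sch:assert test="not(subType='3D')"&amp;gt;3D annotation&amp;lt;/sch:assert&amp;gt;
        &amp;lt;/sch:rule&amp;gt;
    &amp;lt;/sch:pattern&amp;gt;

&amp;lt;/sch:schema&amp;gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;An assert fails whenever its &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;test&lt;/code&gt; expression evaluates to false for a matching element, in which case the text inside the assert is reported as the message for that failed check.&lt;/p&gt;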

&lt;h2 id=&quot;verapdf-configuration&quot;&gt;VeraPDF configuration&lt;/h2&gt;

&lt;p&gt;It is important to note that, unlike in my earlier &lt;em&gt;Apache Preflight&lt;/em&gt; experiments, the &lt;em&gt;Schematron&lt;/em&gt; rules do not rely on the &lt;em&gt;PDF/A&lt;/em&gt; validation output! Instead, &lt;em&gt;VeraPDF&lt;/em&gt; can be instructed to include a ‘features report’ in its output, which directly points to technical features such as font properties, annotation types, security features, and so on. Most of the features that are needed for a policy-based assessment are disabled by default. So, we first need to activate these in the configuration (file &lt;em&gt;features.xml&lt;/em&gt; in &lt;em&gt;VeraPDF&lt;/em&gt;’s &lt;em&gt;config&lt;/em&gt; directory). I edited it as below:&lt;/p&gt;

&lt;div class=&quot;language-xml highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;cp&quot;&gt;&amp;lt;?xml version=&quot;1.0&quot; encoding=&quot;UTF-8&quot; standalone=&quot;yes&quot;?&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;featuresConfig&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;&amp;lt;enabledFeatures&amp;gt;&lt;/span&gt;
        &lt;span class=&quot;nt&quot;&gt;&amp;lt;feature&amp;gt;&lt;/span&gt;ANNOTATION&lt;span class=&quot;nt&quot;&gt;&amp;lt;/feature&amp;gt;&lt;/span&gt;
        &lt;span class=&quot;nt&quot;&gt;&amp;lt;feature&amp;gt;&lt;/span&gt;DOCUMENT_SECURITY&lt;span class=&quot;nt&quot;&gt;&amp;lt;/feature&amp;gt;&lt;/span&gt;
        &lt;span class=&quot;nt&quot;&gt;&amp;lt;feature&amp;gt;&lt;/span&gt;EMBEDDED_FILE&lt;span class=&quot;nt&quot;&gt;&amp;lt;/feature&amp;gt;&lt;/span&gt;
        &lt;span class=&quot;nt&quot;&gt;&amp;lt;feature&amp;gt;&lt;/span&gt;FONT&lt;span class=&quot;nt&quot;&gt;&amp;lt;/feature&amp;gt;&lt;/span&gt;
        &lt;span class=&quot;nt&quot;&gt;&amp;lt;feature&amp;gt;&lt;/span&gt;INFORMATION_DICTIONARY&lt;span class=&quot;nt&quot;&gt;&amp;lt;/feature&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;&amp;lt;/enabledFeatures&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;/featuresConfig&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h2 id=&quot;basic-operation&quot;&gt;Basic operation&lt;/h2&gt;

&lt;p&gt;Supposing that the &lt;em&gt;PDF&lt;/em&gt;s we want to analyze are in directory &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;~/myPdfs&lt;/code&gt;, and that the &lt;em&gt;Schematron&lt;/em&gt; rules that represent our policy are in the file &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;demo-policy.sch&lt;/code&gt;, we can do a policy-based validation of all these files with one single command:&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;verapdf &lt;span class=&quot;nt&quot;&gt;-x&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;--policyfile&lt;/span&gt; demo-policy.sch ~/myPdfs/&lt;span class=&quot;k&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; myPdfsOut.xml
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Here the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;-x&lt;/code&gt; switch activates feature extraction. The output file &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;myPdfsOut.xml&lt;/code&gt; contains, for each &lt;em&gt;PDF&lt;/em&gt;, an element with &lt;em&gt;PDF/A&lt;/em&gt; validation output, an element with the features report, and an element with the policy report.&lt;/p&gt;
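
&lt;p&gt;Schematically, and leaving out nearly all attributes and child elements, the entry for one &lt;em&gt;PDF&lt;/em&gt; looks roughly like the sketch below. The file name is hypothetical, and apart from &lt;em&gt;message&lt;/em&gt; the element names are assumptions about the report layout that may differ between &lt;em&gt;VeraPDF&lt;/em&gt; versions, so compare them against an actual output file:&lt;/p&gt;

&lt;div class=&quot;language-xml highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&amp;lt;report&amp;gt;
    &amp;lt;jobs&amp;gt;
        &amp;lt;job&amp;gt;
            &amp;lt;!-- name of the analysed file (hypothetical example) --&amp;gt;
            &amp;lt;item&amp;gt;&amp;lt;name&amp;gt;~/myPdfs/example.pdf&amp;lt;/name&amp;gt;&amp;lt;/item&amp;gt;
            &amp;lt;!-- PDF/A validation output --&amp;gt;
            &amp;lt;validationReport&amp;gt; ... &amp;lt;/validationReport&amp;gt;
            &amp;lt;!-- features report: fonts, annotations, security, ... --&amp;gt;
            &amp;lt;featuresReport&amp;gt; ... &amp;lt;/featuresReport&amp;gt;
            &amp;lt;!-- outcome of the Schematron policy checks --&amp;gt;
            &amp;lt;policyReport&amp;gt;
                &amp;lt;failedChecks&amp;gt;
                    &amp;lt;check&amp;gt;
                        &amp;lt;message&amp;gt;Font is not embedded&amp;lt;/message&amp;gt;
                    &amp;lt;/check&amp;gt;
                &amp;lt;/failedChecks&amp;gt;
            &amp;lt;/policyReport&amp;gt;
        &amp;lt;/job&amp;gt;
    &amp;lt;/jobs&amp;gt;
&amp;lt;/report&amp;gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;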

&lt;h2 id=&quot;analysis-script&quot;&gt;Analysis script&lt;/h2&gt;

&lt;p&gt;Typically the &lt;em&gt;VeraPDF&lt;/em&gt; output is rather unwieldy. To facilitate things I wrote a &lt;a href=&quot;https://github.com/KBNLresearch/pdfPolicyVeraPDF&quot;&gt;custom analysis script&lt;/a&gt;, which does the following things:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;It runs &lt;em&gt;VeraPDF&lt;/em&gt;&lt;/li&gt;
  &lt;li&gt;It creates a &lt;a href=&quot;https://github.com/KBNLresearch/pdfPolicyVeraPDF/blob/master/examples/fonts_san.xml&quot;&gt;trimmed-down version of the output file&lt;/a&gt; that only contains the policy report. Also, for each PDF, it removes duplicate instances of failed (policy) checks (e.g. if a check on font embedding fails for 10 different fonts, only one reference to the failed check is kept)&lt;/li&gt;
  &lt;li&gt;It creates a &lt;a href=&quot;https://github.com/KBNLresearch/pdfPolicyVeraPDF/blob/master/examples/fonts_summary.csv&quot;&gt;comma-delimited summary file&lt;/a&gt; which lists for each PDF its path/name, followed by the description of each unique failed validation rule (taken from the &lt;em&gt;message&lt;/em&gt; element in &lt;em&gt;VeraPDF&lt;/em&gt;’s output).&lt;/li&gt;
&lt;/ol&gt;

&lt;h2 id=&quot;running-the-analysis&quot;&gt;Running the analysis&lt;/h2&gt;

&lt;p&gt;For this analysis I ran the above script for both the &lt;em&gt;fonts&lt;/em&gt; and &lt;em&gt;multimedia&lt;/em&gt; files, using the following command line (here for the &lt;em&gt;fonts&lt;/em&gt; files):&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;~/pdfPolicyVeraPDF/policyValidate.sh /home/johan/pdfAcrobatEngineering/fonts /home/johan/pdfPolicyVeraPDF/schemas/demo-policy.sch fonts
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h2 id=&quot;results-fonts-category&quot;&gt;Results, fonts category&lt;/h2&gt;

&lt;p&gt;The following table lists, for each &lt;em&gt;PDF&lt;/em&gt; in the &lt;em&gt;fonts&lt;/em&gt; category, the corresponding (unique) validation errors (taken from the summary CSV file). Note that the text strings in the right column correspond to text values in the &lt;em&gt;assert&lt;/em&gt; elements of the policy file.&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th style=&quot;text-align: left&quot;&gt;Test file&lt;/th&gt;
      &lt;th style=&quot;text-align: left&quot;&gt;Failed assert(s)&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;EmbeddedCmap.pdf&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Font is not embedded&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;embedded_fonts.pdf&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Font is not embedded&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;embedded_pm65.pdf&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt; &lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;notembedded_pm65.pdf&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Font is not embedded&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;printtestfont_nonopt.pdf&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt; &lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;printtestfont_opt.pdf&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt; &lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;substitution_fonts.pdf&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Font is not embedded&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;text_images_pdf1.2.pdf&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Font is not embedded&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;TEXT.pdf&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Font is not embedded&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Type3_WWW-HTML.PDF&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Font is not embedded&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;These results show that most of the &lt;em&gt;PDF&lt;/em&gt;s fail our policy on the font embedding objective.&lt;/p&gt;

&lt;h2 id=&quot;results-multimedia-category&quot;&gt;Results, multimedia category&lt;/h2&gt;

&lt;p&gt;Similarly, below are the results for the &lt;em&gt;multimedia&lt;/em&gt; category:&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th style=&quot;text-align: left&quot;&gt;Test file&lt;/th&gt;
      &lt;th style=&quot;text-align: left&quot;&gt;Failed assert(s)&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;20020402_CALOS.pdf&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Font is not embedded;Movie annotation&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;3-D_PDF.pdf&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;3D annotation&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;AdobeChassisDemo-commented.pdf&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;3D annotation&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;AdobeChassisDemo-commented_Review.pdf&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;3D annotation&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;AVI+Transitions Demo.pdf&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Document not parsable&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Binder_6-3DPages.pdf&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;3D annotation&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Disney-Flash.pdf&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Font is not embedded;Screen annotation&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;drape_raster_contour_sample.pdf&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Font is not embedded;3D annotation&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;gXsummer2004-stream.pdf&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Document not parsable&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Jpeg_linked.pdf&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Encrypted document;Document not parsable&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;LabelExample.pdf&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Encrypted document;Document not parsable&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;movie_down1.pdf&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Movie annotation&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;movie.pdf&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Movie annotation&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;MultiMedia_Acro6.pdf&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Encrypted document;Document not parsable&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;MusicalScore.pdf&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Font is not embedded;Screen annotation&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;phlmapbeta7.pdf&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Font is not embedded;Screen annotation&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;remotemovieurl.pdf&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Font is not embedded;Movie annotation&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;ScriptEvents.pdf&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Font is not embedded;Screen annotation&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Service Form_media.pdf&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Font is not embedded;Screen annotation&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;SVG-AnnotAnim.pdf&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Font is not embedded&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;SVG.pdf&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Font is not embedded&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Trophy.pdf&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Font is not embedded;Screen annotation&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;us_population.pdf&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt; &lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;Here the reasons for failing the policy are more diverse. Many of these &lt;em&gt;PDF&lt;/em&gt;s contain &lt;em&gt;Screen&lt;/em&gt;, &lt;em&gt;Movie&lt;/em&gt; or &lt;em&gt;3D&lt;/em&gt; annotations. Non-embedded fonts are common as well. Three &lt;em&gt;PDF&lt;/em&gt;s were not parsable because of encryption. This turns out to be &lt;a href=&quot;https://github.com/veraPDF/veraPDF-apps/issues/202&quot;&gt;a bug&lt;/a&gt; that is fixed in newer versions of &lt;em&gt;VeraPDF&lt;/em&gt;. Two files (&lt;em&gt;AVI+Transitions Demo.pdf&lt;/em&gt; and &lt;em&gt;gXsummer2004-stream.pdf&lt;/em&gt;) were not parsable at all. These files could not be opened in Adobe Acrobat either. Finally, one 49 MB file (which is not listed in the table) resulted in an out-of-memory error that crashed &lt;em&gt;VeraPDF&lt;/em&gt; altogether. I &lt;a href=&quot;https://github.com/veraPDF/veraPDF-apps/issues/195&quot;&gt;reported this as a bug&lt;/a&gt;.&lt;/p&gt;

&lt;h2 id=&quot;general-observations&quot;&gt;General observations&lt;/h2&gt;

&lt;p&gt;First of all I was impressed with the amount of detailed information that &lt;em&gt;VeraPDF&lt;/em&gt; can provide about a &lt;em&gt;PDF&lt;/em&gt; file. I was also pleasantly surprised at the relative ease of doing policy-based assessments. This is mainly thanks to &lt;em&gt;VeraPDF&lt;/em&gt;’s features report, which allows one to address features such as specific annotation types directly. During my earlier attempts at policy-based assessment with &lt;em&gt;Apache Preflight&lt;/em&gt;, the detection of non-embedded fonts was particularly difficult (have a look at the &lt;a href=&quot;https://github.com/openpreserve/pdfPolicyValidate/blob/master/schemas/pdf_policy_preflight_test.sch#L55&quot;&gt;Schematron file&lt;/a&gt; to see what I mean). With &lt;em&gt;VeraPDF&lt;/em&gt; this only needs &lt;a href=&quot;https://github.com/KBNLresearch/pdfPolicyVeraPDF/blob/master/schemas/demo-policy.sch#L42&quot;&gt;one single line&lt;/a&gt; (though admittedly this probably means that errors related to damaged or malformed fonts won’t be reported).
Thanks to &lt;em&gt;VeraPDF&lt;/em&gt;’s built-in functionality to do the Schematron validation, it is no longer necessary to use an external Schematron validator (though this is still possible).&lt;/p&gt;

&lt;h2 id=&quot;actions-missing-in-action&quot;&gt;Actions missing in action?&lt;/h2&gt;

&lt;p&gt;One thing I missed is the reporting of &lt;em&gt;Actions&lt;/em&gt;. Without this, it is not possible to identify &lt;em&gt;PDF&lt;/em&gt;s that contain &lt;em&gt;JavaScript&lt;/em&gt; (and some other features as well). An option to include &lt;em&gt;Actions&lt;/em&gt; in the ‘Feature Report’ would make a welcome addition. As the &lt;em&gt;PDF/A&lt;/em&gt; validation profiles already include checks on &lt;em&gt;Actions&lt;/em&gt;, this is probably pretty straightforward (see also &lt;a href=&quot;https://github.com/veraPDF/veraPDF-apps/issues/174&quot;&gt;this issue&lt;/a&gt;).&lt;/p&gt;

&lt;h2 id=&quot;writing-a-policy-file&quot;&gt;Writing a policy file&lt;/h2&gt;

&lt;p&gt;Not having worked on &lt;em&gt;PDF&lt;/em&gt;-related things for a while myself, it took me some time to figure out how to put together the (Schematron) policy file. The &lt;em&gt;VeraPDF&lt;/em&gt; &lt;a href=&quot;http://docs.verapdf.org/policy/&quot;&gt;documentation gives some guidance&lt;/a&gt;, but I couldn’t find an exhaustive description of every possible feature in the features report. This meant I first had to run &lt;em&gt;VeraPDF&lt;/em&gt; (with feature extraction enabled) on a number of files that I &lt;em&gt;knew&lt;/em&gt; to contain certain features I wanted to include in my policy (e.g. embedded fonts, multimedia), inspect the &lt;em&gt;XML&lt;/em&gt; output, and then write my Schematron rules based on that output. As I have a pretty good knowledge of the specific &lt;em&gt;PDF&lt;/em&gt; data structures involved, I was able to do this, but it did make me wonder about users who don’t have that technical knowledge. Possible solutions would be:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Additional documentation of all possible output elements in the features report. This &lt;a href=&quot;http://docs.verapdf.org/cli/feature-extraction/&quot;&gt;seems to be in the works already&lt;/a&gt; (though not complete yet)&lt;/li&gt;
  &lt;li&gt;Inclusion of some example policy files. Actually &lt;em&gt;veraPDF&lt;/em&gt;’s Github repo &lt;a href=&quot;https://github.com/veraPDF/veraPDF-policy-docs/tree/master/Schemas&quot;&gt;contains a number of these&lt;/a&gt; already, but they are not (yet) referenced by the documentation, and I only found out about them after I ran my tests.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It would also help if users of &lt;em&gt;veraPDF&lt;/em&gt; would publish and share their policy files.&lt;/p&gt;

&lt;p&gt;Finally it just occurred to me this is a good occasion to give one more bump to &lt;a href=&quot;https://doi.org/10.5281/zenodo.801661&quot;&gt;this 2009 report I wrote on long-term preservation risks of PDF&lt;/a&gt;. It explicitly lists the data structures (e.g. annotations, actions) that are associated with specific (risky) features, which might provide users some guidance as to what features are potentially interesting for inclusion in a policy.&lt;/p&gt;

&lt;h2 id=&quot;links&quot;&gt;Links&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;https://github.com/KBNLresearch/pdfPolicyVeraPDF&quot;&gt;PDF policy-based validation demo, veraPDF&lt;/a&gt; - Github repo with scripts, Schematron policy file and all output files&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;http://verapdf.org/&quot;&gt;VeraPDF&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://doi.org/10.5281/zenodo.801661&quot;&gt;Adobe Portable Document Format - Inventory of long-term preservation risks&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;hr /&gt;

&lt;p&gt;Originally published at the &lt;a href=&quot;https://openpreservation.org/blog/2017/06/01/policy-based-assessment-with-verapdf-a-first-impression/&quot;&gt;Open Preservation Foundation blog&lt;/a&gt;&lt;/p&gt;
</description>
                <link>https://bitsgalore.org/2017/06/01/policy-based-assessment-with-verapdf-a-first-impression</link>
                <guid>https://bitsgalore.org/2017/06/01/policy-based-assessment-with-verapdf-a-first-impression</guid>
                <pubDate>2017-06-01T00:00:00+02:00</pubDate>
        </item>

        <item>
                <title>Imaging CD-Extra / Blue Book discs</title>
                <description>&lt;p&gt;The development work on an imaging/ripping workflow for optical media is shaping up steadily, and you can expect a write-up with more information about our software and hardware setup here in the near future (you can get a sneak peek &lt;a href=&quot;https://github.com/KBNLresearch/iromlab&quot;&gt;here&lt;/a&gt;). However, this blog is about a very specific problem that we ran into while testing the workflow with a selection of discs from our collection. This selection included a few discs that follow the &lt;a href=&quot;https://en.wikipedia.org/wiki/Blue_Book_(CD_standard)&quot;&gt;&lt;em&gt;Blue Book&lt;/em&gt;&lt;/a&gt; standard (also known as &lt;em&gt;CD-Extra&lt;/em&gt;). This standard defines a method for combining audio and data tracks on one disc. A &lt;em&gt;CD-Extra&lt;/em&gt; disc contains two sessions, where the first session holds all audio tracks, and the second session holds a data track. &lt;em&gt;Blue Book&lt;/em&gt; was (and still is) widely used for audio CDs with bonus videos or software.&lt;/p&gt;

&lt;!-- more --&gt;

&lt;p&gt;In our workflow, these discs are handled in the following way:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Identify the audio and data sessions with the &lt;a href=&quot;https://linux.die.net/man/1/cd-info&quot;&gt;&lt;em&gt;cd-info&lt;/em&gt;&lt;/a&gt; tool (part of the &lt;a href=&quot;https://www.gnu.org/software/libcdio/&quot;&gt;&lt;em&gt;libcdio&lt;/em&gt;&lt;/a&gt; library).&lt;/li&gt;
  &lt;li&gt;Rip the audio tracks in the first session to &lt;em&gt;WAVE&lt;/em&gt; or &lt;em&gt;FLAC&lt;/em&gt; files using &lt;a href=&quot;https://www.dbpoweramp.com/&quot;&gt;&lt;em&gt;dBpoweramp&lt;/em&gt;&lt;/a&gt;.&lt;/li&gt;
  &lt;li&gt;Verify the audio files for completeness with &lt;a href=&quot;http://www.etree.org/shnutils/shntool/&quot;&gt;&lt;em&gt;shntool&lt;/em&gt;&lt;/a&gt; (&lt;em&gt;WAVE&lt;/em&gt;) or &lt;a href=&quot;https://xiph.org/flac/&quot;&gt;&lt;em&gt;flac&lt;/em&gt;&lt;/a&gt; (&lt;em&gt;FLAC&lt;/em&gt;).&lt;/li&gt;
  &lt;li&gt;Extract the data track in the second session to an ISO image with &lt;a href=&quot;https://www.isobuster.com/&quot;&gt;&lt;em&gt;IsoBuster&lt;/em&gt;&lt;/a&gt;.&lt;/li&gt;
  &lt;li&gt;Verify the ISO image for completeness with &lt;a href=&quot;https://github.com/KBNLresearch/isolyzer&quot;&gt;&lt;em&gt;isolyzer&lt;/em&gt;&lt;/a&gt;.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2 id=&quot;size-of-iso-image-smaller-than-expected&quot;&gt;Size of ISO image smaller than expected&lt;/h2&gt;

&lt;p&gt;All &lt;em&gt;Blue Book&lt;/em&gt; discs in our test passed steps 1 through 4 without any issues, but failed the final &lt;em&gt;isolyzer&lt;/em&gt; check. This &lt;em&gt;isolyzer&lt;/em&gt; check involves a comparison of the file size of the ISO image against the &lt;em&gt;expected&lt;/em&gt; size, as calculated from the image’s Primary Volume Descriptor fields (and Apple HFS blocks, if present). If the actual size is smaller than the expected size, this indicates the image is incomplete. For &lt;em&gt;all&lt;/em&gt; ISO images that were extracted from a &lt;em&gt;Blue Book&lt;/em&gt; disc, the actual file size was significantly smaller than  expected. Moreover, the images couldn’t be mounted in Linux, or opened in file archiver software (e.g. 7-Zip). Below is an excerpt from the &lt;em&gt;isolyzer&lt;/em&gt; output of one of the offending images:&lt;/p&gt;

&lt;div class=&quot;language-xml highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nt&quot;&gt;&amp;lt;tests&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;&amp;lt;containsISO9660Signature&amp;gt;&lt;/span&gt;True&lt;span class=&quot;nt&quot;&gt;&amp;lt;/containsISO9660Signature&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;&amp;lt;containsApplePartitionMap&amp;gt;&lt;/span&gt;True&lt;span class=&quot;nt&quot;&gt;&amp;lt;/containsApplePartitionMap&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;&amp;lt;containsAppleHFSHeader&amp;gt;&lt;/span&gt;False&lt;span class=&quot;nt&quot;&gt;&amp;lt;/containsAppleHFSHeader&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;&amp;lt;containsAppleMasterDirectoryBlock&amp;gt;&lt;/span&gt;False&lt;span class=&quot;nt&quot;&gt;&amp;lt;/containsAppleMasterDirectoryBlock&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;&amp;lt;parsedAppleZeroBlock&amp;gt;&lt;/span&gt;True&lt;span class=&quot;nt&quot;&gt;&amp;lt;/parsedAppleZeroBlock&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;&amp;lt;parsedPrimaryVolumeDescriptor&amp;gt;&lt;/span&gt;True&lt;span class=&quot;nt&quot;&gt;&amp;lt;/parsedPrimaryVolumeDescriptor&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;&amp;lt;sizeExpected&amp;gt;&lt;/span&gt;609912832&lt;span class=&quot;nt&quot;&gt;&amp;lt;/sizeExpected&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;&amp;lt;sizeActual&amp;gt;&lt;/span&gt;554373120&lt;span class=&quot;nt&quot;&gt;&amp;lt;/sizeActual&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;&amp;lt;sizeDifference&amp;gt;&lt;/span&gt;-55539712&lt;span class=&quot;nt&quot;&gt;&amp;lt;/sizeDifference&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;&amp;lt;sizeAsExpected&amp;gt;&lt;/span&gt;False&lt;span class=&quot;nt&quot;&gt;&amp;lt;/sizeAsExpected&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;&amp;lt;smallerThanExpected&amp;gt;&lt;/span&gt;True&lt;span class=&quot;nt&quot;&gt;&amp;lt;/smallerThanExpected&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;/tests&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;So, in this case the ISO image is about 56 MB smaller than expected, even though &lt;em&gt;IsoBuster&lt;/em&gt; did not report any errors during the extraction process. One thing that caught my eye after running a few of these problematic images through &lt;em&gt;isolyzer&lt;/em&gt;, was that the value of &lt;em&gt;sizeDifference&lt;/em&gt; always roughly corresponded to the size of the uncompressed audio on the disc. The &lt;em&gt;sizeExpected&lt;/em&gt; value is calculated from the &lt;em&gt;Volume Space Size&lt;/em&gt; field (which defines the number of logical blocks in the ISO 9660 file system) in the ISO’s Primary Volume Descriptor. This made me wonder: does the &lt;em&gt;Volume Space Size&lt;/em&gt; value really reflect the number of sectors occupied by the &lt;em&gt;data track&lt;/em&gt; (which I was quietly assuming), or does it perhaps reflect &lt;em&gt;all sectors on the disc&lt;/em&gt; (including those in the audio session)?&lt;/p&gt;

&lt;p&gt;Finding the answer to this question was more difficult than I expected, but my initial suspicions were confirmed by this entry on the Debian mailing list:&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://lists.debian.org/debian-user/2005/01/msg02339.html&quot;&gt;Re: reading the raw iso from a CD-Extra (multisession CD)&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Which states:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;[T]he sector numbers in the file system refer[s] to sectors of the original CD rather than sectors of session2.iso.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Similarly, from this thread on the &lt;em&gt;libcdio&lt;/em&gt; mailing list:&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://lists.gnu.org/archive/html/libcdio-devel/2010-02/msg00048.html&quot;&gt;[Libcdio-devel] Retrieving DATA session from multisession audio disc&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;Remember, the path table and directory structure of the iso reflect the fact that the ISO filesystem starts on sector 222145 (49:23:70) of the CD.  If it is burned to another CD at a different position, it won’t work.  Likewise, any program that reads the iso will need to be able to compensate for the offset.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This confirms that all references to blocks of data (sectors) in these ISO images are defined &lt;em&gt;relative to the start of the physical CD&lt;/em&gt;, and &lt;strong&gt;not&lt;/strong&gt; relative to the start of the ISO image! So the questions are:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;How can we verify that these images are complete?&lt;/li&gt;
  &lt;li&gt;How can we access these images at all?&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Using the information from (mostly) the aforementioned &lt;em&gt;libcdio&lt;/em&gt; mailing list thread I was able to answer both questions. The steps below will work on most Linux-based systems (and possibly some Windows-based ones as well).&lt;/p&gt;

&lt;h2 id=&quot;cd-sector-layout&quot;&gt;CD sector layout&lt;/h2&gt;

&lt;p&gt;As a first step we need some information on the sector layout of the physical disc, and in particular the start sector of the second session (which contains the data track). You can do this by running &lt;em&gt;cd-info&lt;/em&gt; while the disc is in the drive:&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;cd-info /dev/sr0 &lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; leesleeuw_cdinfo.txt
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;In the resulting output, the start sector of the data session is listed in several places. First look at the &lt;em&gt;CD-ROM Track List&lt;/em&gt; section:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;CD-ROM Track List (1 - 3)
    #: MSF       LSN    Type   Green? Copy? Channels Premphasis?
    1: 00:02:00  000000 audio  false  no    2        no
    2: 01:25:24  006249 audio  false  no    2        no
    3: 06:05:46  027271 data   false  no   
170: 66:14:61  297961 leadout (668 MB raw, 668 MB formatted)
Media Catalog Number (MCN): 0000000000000
TRACK  1 ISRC: 000000000000
TRACK  2 ISRC: 000000000000
TRACK  3 ISRC: 000000000000
Last CD Session LSN: 27271
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This tells us that the CD contains three tracks, where track 1 and track 2 are audio tracks, and track 3 is a data track. The LSN column shows the start sector of each track; for track 3 this is 027271. This value is repeated in the &lt;em&gt;Last CD Session LSN&lt;/em&gt; (sector number of last session). Finally it can be found again in the &lt;em&gt;CD Analysis Report&lt;/em&gt; at the bottom of the file:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;session #2 starts at track  3, LSN: 27271, ISO 9660 blocks: 297809
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;So, in this case session 2 (which contains the data track) starts on sector 27271 of the disc.&lt;/p&gt;
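
&lt;p&gt;If the &lt;em&gt;cd-info&lt;/em&gt; output is stored as a text file, the start sector can also be picked up by a script. Below is a minimal Python sketch that looks for the session line of the &lt;em&gt;CD Analysis Report&lt;/em&gt;. The function name and the regular expression are illustrative assumptions, based only on the single output line shown above; other &lt;em&gt;cd-info&lt;/em&gt; versions may format this line differently.&lt;/p&gt;

```python
import re

# Pattern modelled on the one "session #..." line shown above (an assumption,
# not a documented cd-info output format)
SESSION_LINE = re.compile(
    r"session #(\d+) starts at track\s+(\d+), LSN: (\d+), ISO 9660 blocks: (\d+)"
)


def find_data_session(cdinfo_text):
    """Return (session, track, start sector, ISO 9660 blocks) for the last
    session line found in the text, or None if there is no such line."""
    result = None
    for match in SESSION_LINE.finditer(cdinfo_text):
        result = tuple(int(value) for value in match.groups())
    return result


sample = "session #2 starts at track  3, LSN: 27271, ISO 9660 blocks: 297809\n"
print(find_data_session(sample))  # (2, 3, 27271, 297809)
```

&lt;p&gt;The third value of the returned tuple is the start sector that is needed in the steps below.&lt;/p&gt;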

&lt;h2 id=&quot;verify-iso-image-for-completeness&quot;&gt;Verify ISO image for completeness&lt;/h2&gt;

&lt;p&gt;Now that we know the start sector, we can use this to verify the ISO image. For this I added the new &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;--offset&lt;/code&gt; option to &lt;em&gt;isolyzer&lt;/em&gt;. This lets you specify a start sector offset, which is subtracted from the expected size estimate that is calculated from the Primary Volume Descriptor&lt;sup id=&quot;fnref:1&quot;&gt;&lt;a href=&quot;#fn:1&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;1&lt;/a&gt;&lt;/sup&gt;. So we call &lt;em&gt;isolyzer&lt;/em&gt;  like this:&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;isolyzer leesleeuw.iso &lt;span class=&quot;nt&quot;&gt;--offset&lt;/span&gt; 27271
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The &lt;em&gt;tests&lt;/em&gt; element now looks like this:&lt;/p&gt;

&lt;div class=&quot;language-xml highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nt&quot;&gt;&amp;lt;tests&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;&amp;lt;containsISO9660Signature&amp;gt;&lt;/span&gt;True&lt;span class=&quot;nt&quot;&gt;&amp;lt;/containsISO9660Signature&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;&amp;lt;containsApplePartitionMap&amp;gt;&lt;/span&gt;True&lt;span class=&quot;nt&quot;&gt;&amp;lt;/containsApplePartitionMap&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;&amp;lt;containsAppleHFSHeader&amp;gt;&lt;/span&gt;False&lt;span class=&quot;nt&quot;&gt;&amp;lt;/containsAppleHFSHeader&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;&amp;lt;containsAppleMasterDirectoryBlock&amp;gt;&lt;/span&gt;False&lt;span class=&quot;nt&quot;&gt;&amp;lt;/containsAppleMasterDirectoryBlock&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;&amp;lt;parsedAppleZeroBlock&amp;gt;&lt;/span&gt;True&lt;span class=&quot;nt&quot;&gt;&amp;lt;/parsedAppleZeroBlock&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;&amp;lt;parsedPrimaryVolumeDescriptor&amp;gt;&lt;/span&gt;True&lt;span class=&quot;nt&quot;&gt;&amp;lt;/parsedPrimaryVolumeDescriptor&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;&amp;lt;sizeExpected&amp;gt;&lt;/span&gt;554061824&lt;span class=&quot;nt&quot;&gt;&amp;lt;/sizeExpected&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;&amp;lt;sizeActual&amp;gt;&lt;/span&gt;554373120&lt;span class=&quot;nt&quot;&gt;&amp;lt;/sizeActual&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;&amp;lt;sizeDifference&amp;gt;&lt;/span&gt;311296&lt;span class=&quot;nt&quot;&gt;&amp;lt;/sizeDifference&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;&amp;lt;sizeAsExpected&amp;gt;&lt;/span&gt;False&lt;span class=&quot;nt&quot;&gt;&amp;lt;/sizeAsExpected&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;&amp;lt;smallerThanExpected&amp;gt;&lt;/span&gt;False&lt;span class=&quot;nt&quot;&gt;&amp;lt;/smallerThanExpected&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;/tests&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The &lt;em&gt;isolyzer&lt;/em&gt; output now shows that the size of the ISO image is about 311 KB (152 sectors) larger than the expected value; this is completely fine.&lt;/p&gt;
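
&lt;p&gt;The arithmetic behind these numbers can be reproduced by hand. The Python sketch below is only an illustration of the calculation described above (it is not &lt;em&gt;isolyzer&lt;/em&gt;’s actual code, and the function name is made up). It assumes 2048-byte logical blocks, and uses the block count of 297809 that follows from both the first &lt;em&gt;sizeExpected&lt;/em&gt; value and the &lt;em&gt;cd-info&lt;/em&gt; output.&lt;/p&gt;

```python
SECTOR_SIZE = 2048  # bytes per logical block (assumed for this example disc)


def expected_size(volume_space_size, offset_sectors=0, block_size=SECTOR_SIZE):
    """Expected image size in bytes, after subtracting the session start offset."""
    return (volume_space_size - offset_sectors) * block_size


# Example disc: 297809 blocks in total, data session starts at sector 27271
size_actual = 554373120
size_expected = expected_size(297809, offset_sectors=27271)
size_difference = size_actual - size_expected

print(expected_size(297809))            # 609912832 (no offset applied)
print(size_expected)                    # 554061824
print(size_difference)                  # 311296
print(size_difference // SECTOR_SIZE)   # 152
```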

&lt;h2 id=&quot;access-the-file-system&quot;&gt;Access the file system&lt;/h2&gt;

&lt;p&gt;Now that we’re (reasonably) sure the ISO image is complete, the next step is to access its contents. If the image has a &lt;a href=&quot;https://en.wikipedia.org/wiki/Hybrid_disc#Multiple_file_systems&quot;&gt;hybrid file system&lt;/a&gt;, it may be possible to mount it directly on some platforms. For instance, opening an image that contains an Apple partition with Linux Mint’s Disk Image Mounter will mount the Apple partition (but not the ISO 9660 file system, which may not necessarily point to the same files!). Fortunately, &lt;a href=&quot;https://lists.gnu.org/archive/html/libcdio-devel/2010-02/msg00050.html&quot;&gt;this message on the &lt;em&gt;libcdio&lt;/em&gt; mailing list&lt;/a&gt; by Thomas Schmitt (one of the authors of the &lt;a href=&quot;https://en.wikipedia.org/wiki/Libburnia&quot;&gt;&lt;em&gt;libburnia&lt;/em&gt;&lt;/a&gt; library) explains how to mount these images. The trick here is to insert a block of data at the start of our ISO image that corresponds to the size of the sectors that are missing from the physical disc (i.e. the sectors that are part of the first session). The effect of this is that  all sector references will again match the actual sector locations in the image.&lt;/p&gt;

&lt;p&gt;First we create a file that contains 27271 sectors that are filled with zero-bytes (note that the &lt;em&gt;seek&lt;/em&gt; position equals &lt;em&gt;Session Start Sector - 1&lt;/em&gt;):&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nb&quot;&gt;dd &lt;/span&gt;&lt;span class=&quot;k&quot;&gt;if&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;/dev/zero &lt;span class=&quot;nv&quot;&gt;bs&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;2K &lt;span class=&quot;nv&quot;&gt;count&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;1 &lt;span class=&quot;nv&quot;&gt;seek&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;27270 &lt;span class=&quot;nv&quot;&gt;of&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;leesleeuw_tmp.iso
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Next we append our ISO image to this file:&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nb&quot;&gt;cat &lt;/span&gt;leesleeuw.iso &lt;span class=&quot;o&quot;&gt;&amp;gt;&amp;gt;&lt;/span&gt;leesleeuw_tmp.iso
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Finally we mount the image:&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nb&quot;&gt;sudo &lt;/span&gt;mount &lt;span class=&quot;nt&quot;&gt;-t&lt;/span&gt; iso9660 &lt;span class=&quot;nt&quot;&gt;-o&lt;/span&gt; loop,sbsector&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;27270 leesleeuw_tmp.iso /media/johan
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;We can now navigate the file system with the file manager.&lt;/p&gt;
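
&lt;p&gt;For batch processing, the two padding steps (&lt;em&gt;dd&lt;/em&gt; followed by &lt;em&gt;cat&lt;/em&gt;) could also be done from a script. The following Python sketch is one possible way to do this; the function name is hypothetical, and it assumes 2048-byte sectors. Mounting the result still requires the &lt;em&gt;mount&lt;/em&gt; command shown above.&lt;/p&gt;

```python
import os
import shutil

SECTOR_SIZE = 2048  # bytes per sector (assumption, matches bs=2K above)


def pad_image(iso_path, padded_path, start_sector, sector_size=SECTOR_SIZE):
    """Write start_sector zero-filled sectors, followed by the contents of iso_path."""
    with open(padded_path, "wb") as padded:
        # Extending an empty file with truncate() fills it with zero bytes
        padded.truncate(start_sector * sector_size)
        # Move to the end of the padding before appending the image
        padded.seek(0, os.SEEK_END)
        with open(iso_path, "rb") as iso:
            shutil.copyfileobj(iso, padded)
```

&lt;p&gt;For the example disc this would be called as &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pad_image(&quot;leesleeuw.iso&quot;, &quot;leesleeuw_tmp.iso&quot;, 27271)&lt;/code&gt;, which gives the same 27271 sectors of padding as the &lt;em&gt;dd&lt;/em&gt; command.&lt;/p&gt;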

&lt;h2 id=&quot;create-access-iso&quot;&gt;Create access ISO&lt;/h2&gt;

&lt;p&gt;In order to use the ISO with an emulator or virtual machine, we have to do some additional work. In theory, modifying all sector addresses in the ISO would do the trick, but as far as I’m aware there is no software tool that is capable of this. Instead, we’ll use the mounted file system from the previous step to create a completely new ISO image. We can do this with the following command:&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;xorrisofs &lt;span class=&quot;nt&quot;&gt;-r&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-J&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-o&lt;/span&gt; leesleeuw_new.iso /media/johan
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Again, credits go to Thomas Schmitt who suggested this on the &lt;a href=&quot;https://lists.gnu.org/archive/html/libcdio-devel/2010-02/msg00053.html&quot;&gt;&lt;em&gt;libcdio&lt;/em&gt; mailing list&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Finally we check the image with &lt;em&gt;isolyzer&lt;/em&gt;:&lt;/p&gt;

&lt;div class=&quot;language-xml highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nt&quot;&gt;&amp;lt;tests&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;&amp;lt;containsISO9660Signature&amp;gt;&lt;/span&gt;True&lt;span class=&quot;nt&quot;&gt;&amp;lt;/containsISO9660Signature&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;&amp;lt;containsApplePartitionMap&amp;gt;&lt;/span&gt;False&lt;span class=&quot;nt&quot;&gt;&amp;lt;/containsApplePartitionMap&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;&amp;lt;containsAppleHFSHeader&amp;gt;&lt;/span&gt;False&lt;span class=&quot;nt&quot;&gt;&amp;lt;/containsAppleHFSHeader&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;&amp;lt;containsAppleMasterDirectoryBlock&amp;gt;&lt;/span&gt;False&lt;span class=&quot;nt&quot;&gt;&amp;lt;/containsAppleMasterDirectoryBlock&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;&amp;lt;parsedPrimaryVolumeDescriptor&amp;gt;&lt;/span&gt;True&lt;span class=&quot;nt&quot;&gt;&amp;lt;/parsedPrimaryVolumeDescriptor&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;&amp;lt;sizeExpected&amp;gt;&lt;/span&gt;508913664&lt;span class=&quot;nt&quot;&gt;&amp;lt;/sizeExpected&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;&amp;lt;sizeActual&amp;gt;&lt;/span&gt;508913664&lt;span class=&quot;nt&quot;&gt;&amp;lt;/sizeActual&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;&amp;lt;sizeDifference&amp;gt;&lt;/span&gt;0&lt;span class=&quot;nt&quot;&gt;&amp;lt;/sizeDifference&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;&amp;lt;sizeAsExpected&amp;gt;&lt;/span&gt;True&lt;span class=&quot;nt&quot;&gt;&amp;lt;/sizeAsExpected&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;&amp;lt;smallerThanExpected&amp;gt;&lt;/span&gt;False&lt;span class=&quot;nt&quot;&gt;&amp;lt;/smallerThanExpected&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;/tests&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;I tried this approach for a couple of ISO images with old Windows installables. I mounted the images to a Windows 2000 machine running in VirtualBox, and in all cases I was able to access the images and install the software without problems.&lt;/p&gt;

&lt;h2 id=&quot;recommendations-and-caveats&quot;&gt;Recommendations and caveats&lt;/h2&gt;

&lt;p&gt;Based on some limited tests, the above approach looks useful for providing access to extracted data sessions from CD-Extra discs that would otherwise be inaccessible. There are some caveats though:&lt;/p&gt;

&lt;p&gt;First, the derived ISO images should &lt;em&gt;only&lt;/em&gt; be treated as access images, not as preservation masters! There are several reasons for this. An important one is that if the original image contained an Apple partition, the reformatting procedure will get rid of it! Since Apple partitions may point to different data than the ISO 9660 file system, this may result in loss of data (files).&lt;/p&gt;

&lt;p&gt;Second, sometimes the data session may reference the audio session. A simple example would be an audio player application that lets a user play audio tracks. This functionality will be lost in an emulated environment. I’m not aware of any solutions for this, especially given the lack of (open) disc image formats that are able to capture all data on a multisession CD. However, it is not impossible that such a format will appear at some point in the future. If that happens, it may be possible to reformat the ripped audio and data tracks into that format, but this would only work if a full record of the sector layout of the physical disc is available.&lt;/p&gt;

&lt;p&gt;Since we need some of this information already for accessing the ISO images, the most important recommendation that follows from this work is to always keep a record of the sector layout of CD-Extra / Blue Book discs. This can be done by simply running &lt;a href=&quot;https://linux.die.net/man/1/cd-info&quot;&gt;&lt;em&gt;cd-info&lt;/em&gt;&lt;/a&gt; on each disc, and storing its output as metadata with the ripped/imaged files.&lt;/p&gt;

&lt;p&gt;Finally, I was surprised at the complete lack of any software tool that is capable of manipulating sector offsets in an ISO image. Having such a tool would enable us to create readable access ISOs in a more straightforward way (i.e. without first having to add padding bytes to the source image and then having to mount the resulting image). It might also be less error-prone. I wonder if there’s any interest from the wider community to invest in the development of such a tool?&lt;/p&gt;

&lt;h2 id=&quot;links&quot;&gt;Links&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;https://github.com/KBNLresearch/isolyzer&quot;&gt;Isolyzer&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://github.com/KBNLresearch/isolyzer/raw/master/testFiles/multisession.iso&quot;&gt;Sample multisession ISO image&lt;/a&gt; (6 MB download; start sector of the data session is 21917)&lt;/li&gt;
&lt;/ul&gt;

&lt;hr /&gt;

&lt;p&gt;Originally published at the &lt;a href=&quot;https://openpreservation.org/blog/2017/04/25/imaging-cd-extra-blue-book-discs/&quot;&gt;Open Preservation Foundation blog&lt;/a&gt;&lt;/p&gt;
&lt;div class=&quot;footnotes&quot; role=&quot;doc-endnotes&quot;&gt;
  &lt;ol&gt;
    &lt;li id=&quot;fn:1&quot;&gt;
      &lt;p&gt;This is analogous to the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;-N&lt;/code&gt; option in &lt;em&gt;cdinfo&lt;/em&gt;. &lt;a href=&quot;#fnref:1&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
  &lt;/ol&gt;
&lt;/div&gt;
</description>
                <link>https://bitsgalore.org/2017/04/25/imaging-cd-extra-blue-book-discs</link>
                <guid>https://bitsgalore.org/2017/04/25/imaging-cd-extra-blue-book-discs</guid>
                <pubDate>2017-04-25T00:00:00+02:00</pubDate>
        </item>

        <item>
                <title>Detecting broken ISO images&#58; introducing Isolyzer</title>
                <description>&lt;p&gt;In my &lt;a href=&quot;/2017/01/04/breaking-waves-and-some-flacs&quot;&gt;previous blog post&lt;/a&gt; I addressed the detection of broken audio files in an automated workflow for ripping audio CDs. For (data) CD-ROMs and DVDs that are imaged to an ISO image, a similar problem exists: how can we be reasonably sure that the created image is complete? In this blog post I will discuss some possible ways of doing this using existing tools, along with their limitations. I then introduce &lt;em&gt;Isolyzer&lt;/em&gt;, a new tool that might be a useful addition to the existing methods.&lt;/p&gt;

&lt;!-- more --&gt;

&lt;h2 id=&quot;checksums&quot;&gt;Checksums&lt;/h2&gt;

&lt;p&gt;A  number of techniques exist to verify a newly created ISO image. A seemingly obvious solution would be to do a checksum comparison on both the ISO image and the physical carrier. For instance, the following will work on any Linux system:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;md5sum myimage.iso
md5sum /dev/sr0
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The first line computes an MD5 checksum from the ISO image; the second line repeats this for the physical carrier. This method is not completely fail-safe. In some tests I did over a year ago, I ran into a very strange issue where my attempts to image a CD would sometimes &lt;a href=&quot;http://qanda.digipres.org/1076/incomplete-image-after-imaging-rom-prevent-and-detect-this&quot;&gt;result in incomplete reads&lt;/a&gt;, and, as a result, truncated ISO images. The problem was most likely caused by faulty hardware (the machine on which I ran those tests more or less died shortly afterwards). Most worryingly, the machine would sometimes return incomplete data, both while creating the ISO image as well as during the subsequent checksum calculation on the physical carrier. The result of this was that the computed checksums were identical in both cases, &lt;em&gt;which meant that the image passed the checksum quality check, even though it was incomplete&lt;/em&gt;!&lt;/p&gt;
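
&lt;p&gt;In an automated workflow the same comparison could be scripted. The Python sketch below is a rough equivalent of the two &lt;em&gt;md5sum&lt;/em&gt; calls; the function names are illustrative, and reading a device such as &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;/dev/sr0&lt;/code&gt; this way is assumed to require read permission on that device. It has the same weakness: if both reads are truncated in the same way, the checksums still match.&lt;/p&gt;

```python
import hashlib


def md5_of(path, chunk_size=1024 * 1024):
    """Compute the MD5 hex digest of a file or block device, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as stream:
        for chunk in iter(lambda: stream.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksums_match(image_path, carrier_path):
    """True if image and carrier give the same MD5 (not proof that the image is complete)."""
    return md5_of(image_path) == md5_of(carrier_path)
```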

&lt;h2 id=&quot;isovfy&quot;&gt;Isovfy&lt;/h2&gt;

&lt;p&gt;The popular &lt;a href=&quot;https://en.wikipedia.org/wiki/Cdrtools&quot;&gt;&lt;em&gt;cdrtools&lt;/em&gt;&lt;/a&gt; suite includes a tool called &lt;a href=&quot;http://linux.die.net/man/8/isoinfo&quot;&gt;&lt;em&gt;isovfy&lt;/em&gt;&lt;/a&gt;. Its man page describes it as follows:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;isovfy is a utility to verify the integrity of an iso9660 image. Most of the tests in isovfy were added after bugs were discovered in early versions of mkisofs. It isn’t all that clear how useful this is anymore, but it doesn’t hurt to have this around.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I already commented on this tool in an &lt;a href=&quot;/2015/11/13/preserving-optical-media-from-the-command-line/&quot;&gt;earlier blog post&lt;/a&gt;:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;The documentation of the tool isn’t very clear about what specific checks it performs. In one of my tests I fed it an ISO image that had its last 50 MB missing (truncated). This did not result in any error or warning message! Most of the reported isovfy errors that I came across in my tests simply reflected the file system on the physical CD not conforming to ISO 9660 (this seems to be pretty common).&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;You can try this yourself by running &lt;em&gt;isovfy&lt;/em&gt; on the following two ISO images:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;https://github.com/KBNLresearch/verifyISOSize/blob/master/testFiles/minimal.iso?raw=true&quot;&gt;This is a small (350 kB) ISO image&lt;/a&gt; that is intact&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://github.com/KBNLresearch/verifyISOSize/blob/master/testFiles/minimal_trunc.iso?raw=true&quot;&gt;Here’s the same image with most of its data truncated&lt;/a&gt; (the image is only 48 KB)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I ran both images through &lt;em&gt;isovfy&lt;/em&gt; (version 3.02a06); both resulted in the following output:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;Root at extent 17, 2048 bytes
[0,0]
No errors found
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This demonstrates that &lt;em&gt;isovfy&lt;/em&gt; is not very useful for detecting truncated ISO files.&lt;/p&gt;

&lt;h2 id=&quot;digging-into-the-specs&quot;&gt;Digging into the specs&lt;/h2&gt;

&lt;p&gt;At this point I decided it was time to start digging into some specs. The &lt;a href=&quot;http://wiki.osdev.org/ISO_9660&quot;&gt;ISO 9660 page on the OSDev Wiki&lt;/a&gt; gives a good explanation of the internal organisation of an ISO 9660 image. From this I learnt that the &lt;a href=&quot;http://wiki.osdev.org/ISO_9660#The_Primary_Volume_Descriptor&quot;&gt;Primary Volume Descriptor&lt;/a&gt; (which is a data structure that is present on all ISO images) contains two interesting fields:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;em&gt;Volume Space Size&lt;/em&gt;, which is the “number of Logical Blocks in which the volume is recorded”;&lt;/li&gt;
  &lt;li&gt;&lt;em&gt;Logical Block Size&lt;/em&gt;, which is “the size in bytes of a logical block”.&lt;/li&gt;
&lt;/ul&gt;
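
&lt;p&gt;As an illustration of where these two fields live, here is a minimal Python sketch that reads them from an image. It is not the script discussed in this post; the function name is made up, and it assumes that the Primary Volume Descriptor is the first descriptor at sector 16 (the usual case), with &lt;em&gt;Volume Space Size&lt;/em&gt; at byte offset 80 and &lt;em&gt;Logical Block Size&lt;/em&gt; at byte offset 128 of the descriptor.&lt;/p&gt;

```python
PVD_OFFSET = 16 * 2048  # the volume descriptors start at sector 16


def read_pvd_size_fields(image_bytes):
    """Return (Volume Space Size, Logical Block Size) from an ISO 9660 image."""
    pvd = image_bytes[PVD_OFFSET:PVD_OFFSET + 2048]
    # Byte 0 holds the descriptor type (1 = primary), bytes 1-5 the "CD001" signature
    if len(pvd) != 2048 or pvd[0] != 1 or pvd[1:6] != b"CD001":
        raise ValueError("no Primary Volume Descriptor found at sector 16")
    # Both fields are stored twice (little-endian first, then big-endian);
    # this sketch only reads the little-endian copies
    volume_space_size = int.from_bytes(pvd[80:84], "little")
    logical_block_size = int.from_bytes(pvd[128:130], "little")
    return volume_space_size, logical_block_size
```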

&lt;p&gt;In theory, multiplying both figures should give the expected size of the ISO image, and this would provide a useful way to check if data are missing. To test this, I wrote a Python script that parses an ISO’s Primary Volume Descriptor fields, calculates the expected file size and then compares this against the actual file size. Running the script against some 20 ISO images I had lying around showed that for 7 files the expected size was indeed identical to the actual file size. For most images, the actual size turned out to be marginally larger than expected (typically about 300-600 kB). For 3 images, the actual size was about twice the expected size. Digging deeper, I found out that these were hybrid images that contain an Apple partition on top of the ISO 9660 file system. According to &lt;a href=&quot;https://en.wikipedia.org/wiki/Hybrid_disc#Multiple_file_systems&quot;&gt;this Wikipedia article&lt;/a&gt;, these hybrid discs come in two varieties:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Hybrid discs that contain an Apple Partition Map (located at 512 bytes into the disc/image).&lt;/li&gt;
  &lt;li&gt;Hybrid discs without a Partition Map. These contain a Master Directory Block (located at 1024 bytes into the disc/image).&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;In my case all of the 3 hybrid images turned out to be of the first category. Using the information &lt;a href=&quot;https://en.wikipedia.org/wiki/Apple_Partition_Map#Layout&quot;&gt;here&lt;/a&gt; and &lt;a href=&quot;https://opensource.apple.com/source/IOStorageFamily/IOStorageFamily-116/IOApplePartitionScheme.h&quot;&gt;here&lt;/a&gt; I was able to add detection of such hybrid images to my code, as well as a simple parser for the ‘zero block’ structure that contains two fields that define the partition’s size: &lt;em&gt;Block Size&lt;/em&gt; and &lt;em&gt;Block Count&lt;/em&gt;. For my hybrid images, multiplying both figures resulted in a value that was close to (but again marginally smaller than) the actual file size.&lt;/p&gt;

&lt;p&gt;Finally, I also added detection of the second hybrid disc category (no Partition Map, but Master Directory Block). The Master Directory Block also contains Block Size and Block Count fields that allow one to calculate the size of the file system.&lt;/p&gt;
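
&lt;p&gt;For the first category, the size calculation from the ‘zero block’ can be sketched in a few lines of Python. This is a simplified illustration, not the actual code: the function name is hypothetical, and the field layout (a 2-byte &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ER&lt;/code&gt; signature, followed by a 2-byte &lt;em&gt;Block Size&lt;/em&gt; and a 4-byte &lt;em&gt;Block Count&lt;/em&gt;, all big-endian, at the very start of the image) is my reading of the Apple header file linked above.&lt;/p&gt;

```python
def read_zero_block_size(image_bytes):
    """Return Block Size x Block Count from an Apple 'zero block', or None if absent."""
    block = image_bytes[0:8]
    # Assumed layout (big-endian): signature "ER", 2-byte Block Size, 4-byte Block Count
    if len(block) != 8 or block[0:2] != b"ER":
        return None
    block_size = int.from_bytes(block[2:4], "big")
    block_count = int.from_bytes(block[4:8], "big")
    return block_size * block_count
```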

&lt;h2 id=&quot;isolyzer&quot;&gt;Isolyzer&lt;/h2&gt;

&lt;p&gt;I wrapped up the results of the above analyses into &lt;a href=&quot;https://github.com/KBNLresearch/verifyISOSize&quot;&gt;&lt;em&gt;Isolyzer&lt;/em&gt;&lt;/a&gt;, which is a dedicated (Python) tool for checking the size of an ISO image. What it does is this:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Locate the image’s Primary Volume Descriptor (PVD).&lt;/li&gt;
  &lt;li&gt;From the PVD, read the Volume Space Size (number of sectors/blocks) and Logical Block Size (number of bytes for each block) fields.&lt;/li&gt;
  &lt;li&gt;Calculate the expected file size as ( Volume Space Size x Logical Block Size ).&lt;/li&gt;
  &lt;li&gt;If the image contains an Apple Partition Map, read the Block Size and Block Count fields from the ‘zero block’&lt;/li&gt;
  &lt;li&gt;Calculate the expected file size as ( Block Size x Block Count )&lt;/li&gt;
  &lt;li&gt;If the image contains an Apple Master Directory Block, read its Block Size and Block Count fields&lt;/li&gt;
  &lt;li&gt;Calculate the expected file size as ( Block Size x Block Count )&lt;/li&gt;
  &lt;li&gt;Calculate the final expected file size as the largest value out of any of the above 3 values&lt;/li&gt;
  &lt;li&gt;Compare this against the actual size of the image file.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;In addition to this, Isolyzer also extracts and reports technical metadata from the Primary Volume Descriptor and the Zero Block.&lt;/p&gt;

&lt;p&gt;Currently the test results are reported in the following format (this may well change in upcoming releases):&lt;/p&gt;

&lt;div class=&quot;language-xml highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nt&quot;&gt;&amp;lt;tests&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;&amp;lt;containsISO9660Signature&amp;gt;&lt;/span&gt;True&lt;span class=&quot;nt&quot;&gt;&amp;lt;/containsISO9660Signature&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;&amp;lt;containsApplePartitionMap&amp;gt;&lt;/span&gt;False&lt;span class=&quot;nt&quot;&gt;&amp;lt;/containsApplePartitionMap&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;&amp;lt;containsAppleHFSHeader&amp;gt;&lt;/span&gt;False&lt;span class=&quot;nt&quot;&gt;&amp;lt;/containsAppleHFSHeader&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;&amp;lt;containsAppleMasterDirectoryBlock&amp;gt;&lt;/span&gt;False&lt;span class=&quot;nt&quot;&gt;&amp;lt;/containsAppleMasterDirectoryBlock&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;&amp;lt;parsedPrimaryVolumeDescriptor&amp;gt;&lt;/span&gt;True&lt;span class=&quot;nt&quot;&gt;&amp;lt;/parsedPrimaryVolumeDescriptor&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;&amp;lt;sizeExpected&amp;gt;&lt;/span&gt;358400&lt;span class=&quot;nt&quot;&gt;&amp;lt;/sizeExpected&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;&amp;lt;sizeActual&amp;gt;&lt;/span&gt;358400&lt;span class=&quot;nt&quot;&gt;&amp;lt;/sizeActual&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;&amp;lt;sizeDifference&amp;gt;&lt;/span&gt;0&lt;span class=&quot;nt&quot;&gt;&amp;lt;/sizeDifference&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;&amp;lt;sizeAsExpected&amp;gt;&lt;/span&gt;True&lt;span class=&quot;nt&quot;&gt;&amp;lt;/sizeAsExpected&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;&amp;lt;smallerThanExpected&amp;gt;&lt;/span&gt;False&lt;span class=&quot;nt&quot;&gt;&amp;lt;/smallerThanExpected&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;/tests&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;In the above example the &lt;em&gt;sizeExpected&lt;/em&gt; field is the size as calculated from the ISO/Apple headers, and &lt;em&gt;sizeActual&lt;/em&gt; is the actual size. In this case both are identical. Below is some output for a truncated ISO:&lt;/p&gt;

&lt;div class=&quot;language-xml highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nt&quot;&gt;&amp;lt;tests&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;&amp;lt;containsISO9660Signature&amp;gt;&lt;/span&gt;True&lt;span class=&quot;nt&quot;&gt;&amp;lt;/containsISO9660Signature&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;&amp;lt;containsApplePartitionMap&amp;gt;&lt;/span&gt;False&lt;span class=&quot;nt&quot;&gt;&amp;lt;/containsApplePartitionMap&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;&amp;lt;containsAppleHFSHeader&amp;gt;&lt;/span&gt;False&lt;span class=&quot;nt&quot;&gt;&amp;lt;/containsAppleHFSHeader&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;&amp;lt;containsAppleMasterDirectoryBlock&amp;gt;&lt;/span&gt;False&lt;span class=&quot;nt&quot;&gt;&amp;lt;/containsAppleMasterDirectoryBlock&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;&amp;lt;parsedPrimaryVolumeDescriptor&amp;gt;&lt;/span&gt;True&lt;span class=&quot;nt&quot;&gt;&amp;lt;/parsedPrimaryVolumeDescriptor&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;&amp;lt;sizeExpected&amp;gt;&lt;/span&gt;358400&lt;span class=&quot;nt&quot;&gt;&amp;lt;/sizeExpected&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;&amp;lt;sizeActual&amp;gt;&lt;/span&gt;49157&lt;span class=&quot;nt&quot;&gt;&amp;lt;/sizeActual&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;&amp;lt;sizeDifference&amp;gt;&lt;/span&gt;-309243&lt;span class=&quot;nt&quot;&gt;&amp;lt;/sizeDifference&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;&amp;lt;sizeAsExpected&amp;gt;&lt;/span&gt;False&lt;span class=&quot;nt&quot;&gt;&amp;lt;/sizeAsExpected&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;&amp;lt;smallerThanExpected&amp;gt;&lt;/span&gt;True&lt;span class=&quot;nt&quot;&gt;&amp;lt;/smallerThanExpected&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;/tests&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;So, in this case &lt;em&gt;sizeDifference&lt;/em&gt; is negative, and the &lt;em&gt;smallerThanExpected&lt;/em&gt; flag equals ‘True’ (which indicates a damaged image).&lt;/p&gt;
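
&lt;p&gt;The relation between these fields can be illustrated with a minimal Python sketch. This is not Isolyzer’s actual code: the function name &lt;em&gt;compare_sizes&lt;/em&gt; is hypothetical, and the flag logic (difference equals actual size minus expected size) is an assumption inferred from the two example outputs above.&lt;/p&gt;

```python
def compare_sizes(size_expected, size_actual):
    """Derive the size-related flags from an expected and an actual size."""
    size_difference = size_actual - size_expected
    return {
        "sizeDifference": size_difference,
        "sizeAsExpected": size_difference == 0,
        # The image is shorter than its headers claim: data are missing
        "smallerThanExpected": size_expected > size_actual,
    }


intact = compare_sizes(358400, 358400)
truncated = compare_sizes(358400, 49157)

print(intact["sizeDifference"], intact["smallerThanExpected"])        # 0 False
print(truncated["sizeDifference"], truncated["smallerThanExpected"])  # -309243 True
```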

&lt;h2 id=&quot;feedback-wanted&quot;&gt;Feedback wanted&lt;/h2&gt;

&lt;p&gt;At this stage Isolyzer is a bit experimental and pretty rough around the edges, and I wouldn’t recommend it for production use. Nevertheless I’m curious about any feedback on the tool. Do others find this useful? Are things missing (i.e. other hybrid disc types I’m not aware of), or did I get anything completely wrong?&lt;/p&gt;

&lt;p&gt;One thing that puzzles me a bit is that for the majority of ISO images I’ve come across, the expected size as calculated by Isolyzer is marginally smaller than the actual size. The difference is typically in the order of about 300-600 kB. I’m not quite sure what’s causing this, although &lt;a href=&quot;http://twiki.org/cgi-bin/view/Wikilearn/CdromMd5sumsAfterBurning&quot;&gt;this article&lt;/a&gt; mentions that some CD writing software packages add padding bytes when writing a CD. I wasn’t able to verify this, although &lt;a href=&quot;http://superuser.com/a/220353&quot;&gt;this SuperUser answer&lt;/a&gt; on validating a burnt DVD suggests it as well. If anyone knows more about this, please let me know!&lt;/p&gt;

&lt;h2 id=&quot;download-links&quot;&gt;Download links&lt;/h2&gt;

&lt;p&gt;Isolyzer can be found &lt;a href=&quot;https://github.com/KBNLresearch/verifyISOSize&quot;&gt;here on Github&lt;/a&gt;. It can be installed using &lt;em&gt;pip&lt;/em&gt;; &lt;a href=&quot;https://github.com/KBNLresearch/verifyISOSize#installation-with-pip&quot;&gt;see the instructions here&lt;/a&gt;. For Windows users who cannot/don’t want to install Python I also provided stand-alone Windows binaries, which are available for download &lt;a href=&quot;https://github.com/KBNLresearch/verifyISOSize/releases&quot;&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;hr /&gt;

&lt;p&gt;Originally published at the &lt;a href=&quot;https://openpreservation.org/blog/2017/01/13/detecting-broken-iso-images-introducing-isolyzer/&quot;&gt;Open Preservation Foundation blog&lt;/a&gt;&lt;/p&gt;
</description>
                <link>https://bitsgalore.org/2017/01/13/detecting-broken-iso-images-introducing-isolyzer</link>
                <guid>https://bitsgalore.org/2017/01/13/detecting-broken-iso-images-introducing-isolyzer</guid>
                <pubDate>2017-01-13T00:00:00+01:00</pubDate>
        </item>

        <item>
                <title>Breaking WAVEs (and some FLACs too)</title>
                <description>&lt;p&gt;At the KB we have a large collection of offline optical media. Most of these are CD-ROMs, but we also have a sizeable proportion of audio CDs. We’re currently in the process of designing a workflow for stabilising the contents of these materials using disk imaging. For audio CDs this involves ‘ripping’ the tracks to audio files. Since the workflow will be automated to a high degree, basic quality checks on the created audio files are needed. In particular, we want to be sure that the created audio files are complete, as it is possible that some hardware failure during the ripping process could result in truncated or otherwise incomplete files.&lt;/p&gt;

&lt;p&gt;To get a better idea of which software tools are best suited to this task, I created a small dataset of audio files which I deliberately damaged. I subsequently ran each of these files through a set of candidate tools, and then checked which tools were able to detect the faulty files. The first half of this blog post focuses on the &lt;a href=&quot;http://fileformats.archiveteam.org/wiki/WAVE&quot;&gt;&lt;em&gt;WAVE&lt;/em&gt;&lt;/a&gt; format; the second half covers the &lt;a href=&quot;http://fileformats.archiveteam.org/wiki/FLAC&quot;&gt;&lt;em&gt;FLAC&lt;/em&gt;&lt;/a&gt; format (we haven’t yet decided which format to use).&lt;/p&gt;

&lt;!-- more --&gt;

&lt;h2 id=&quot;wave-dataset&quot;&gt;&lt;em&gt;WAVE&lt;/em&gt; dataset&lt;/h2&gt;

&lt;p&gt;For the &lt;em&gt;WAVE&lt;/em&gt; dataset I started out with a &lt;a href=&quot;https://github.com/KBNLresearch/detectDamagedAudio/blob/master/data/frogs-01.wav&quot;&gt;small, intact &lt;em&gt;WAVE&lt;/em&gt; file&lt;/a&gt;. Using a Hex editor I then made the following derivatives of this file:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;https://github.com/KBNLresearch/detectDamagedAudio/blob/master/data/frogs-01-last-byte-missing.wav&quot;&gt;frogs-01-last-byte-missing.wav&lt;/a&gt; - one byte is missing at the end of the file&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://github.com/KBNLresearch/detectDamagedAudio/blob/master/data/frogs-01-last-2032-bytes-missing.wav&quot;&gt;frogs-01-last-2032-bytes-missing.wav&lt;/a&gt; - a chunk of  2032 bytes is missing at the end of the file&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://github.com/KBNLresearch/detectDamagedAudio/blob/master/data/frogs-01-byte-missing-at-offset-811537.wav&quot;&gt;frogs-01-byte-missing-at-offset-811537.wav&lt;/a&gt; - one byte is missing at offset 811537&lt;/li&gt;
&lt;/ul&gt;
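
&lt;p&gt;Similar derivatives could also be produced without a Hex editor. The following is a minimal Python sketch, not the method used for the files above: the helper names are hypothetical, and it assumes that offsets are counted from zero.&lt;/p&gt;

```python
def drop_trailing_bytes(data: bytes, count: int) -> bytes:
    """Return a copy of data with the last `count` bytes removed."""
    return data[:len(data) - count]


def drop_byte_at(data: bytes, offset: int) -> bytes:
    """Return a copy of data with the single byte at `offset` (zero-based) removed."""
    return data[:offset] + data[offset + 1:]


# Demonstration on a tiny in-memory byte string instead of a real WAVE file
original = b"RIFFdata"
print(drop_trailing_bytes(original, 1))  # b'RIFFdat'
print(drop_byte_at(original, 4))         # b'RIFFata'
```

&lt;p&gt;For a real file, the bytes would be read with &lt;em&gt;open(path, &quot;rb&quot;)&lt;/em&gt; and the result written to a new file, leaving the intact original untouched.&lt;/p&gt;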

&lt;h2 id=&quot;candidate-tools-wave&quot;&gt;Candidate tools, &lt;em&gt;WAVE&lt;/em&gt;&lt;/h2&gt;

&lt;p&gt;The candidate tools I used to analyse the &lt;em&gt;WAVE&lt;/em&gt; files are:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;http://jhove.openpreservation.org/&quot;&gt;&lt;strong&gt;jhove&lt;/strong&gt;&lt;/a&gt; includes a &lt;a href=&quot;http://jhove.openpreservation.org/modules/wave/&quot;&gt;&lt;em&gt;WAVE&lt;/em&gt; validation module&lt;/a&gt;, which makes it an obvious choice. The tested version is  1.14.6, 2016-05-12.&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;http://www.etree.org/shnutils/shntool/&quot;&gt;&lt;strong&gt;shntool&lt;/strong&gt;&lt;/a&gt; is a “multi-purpose WAVE data processing and reporting utility”. It was first released in 2000. The tested version is 3.0.7.&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://ffmpeg.org/&quot;&gt;&lt;strong&gt;ffmpeg&lt;/strong&gt;&lt;/a&gt; is a popular conversion tool for audio and video formats. The tested version is 3.2.2.&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://mediaarea.net/en/MediaInfo&quot;&gt;&lt;strong&gt;mediainfo&lt;/strong&gt;&lt;/a&gt; is a widely-used feature extraction tool for audiovisual files. The tested version is v0.7.81.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Note that of the above tools, only Jhove and Shntool are designed to detect problems in &lt;em&gt;WAVE&lt;/em&gt; files. Both Ffmpeg and Mediainfo were primarily designed for other purposes (format conversion and technical metadata extraction), and they were &lt;em&gt;not&lt;/em&gt; designed to detect defective files! I included these tools here mainly because they are widely used, and I was curious whether they would throw up anything interesting in case of defective files&lt;sup id=&quot;fnref:2&quot;&gt;&lt;a href=&quot;#fn:2&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;1&lt;/a&gt;&lt;/sup&gt;.
I ran the tools with the following command-line arguments (replacing “foo.wav” with the actual file name):&lt;/p&gt;

&lt;h3 id=&quot;jhove&quot;&gt;Jhove&lt;/h3&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;jhove &lt;span class=&quot;nt&quot;&gt;-m&lt;/span&gt; WAVE-hul foo.wav
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h3 id=&quot;shntool&quot;&gt;Shntool&lt;/h3&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;shntool info foo.wav
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h3 id=&quot;ffmpeg&quot;&gt;Ffmpeg&lt;/h3&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;ffmpeg &lt;span class=&quot;nt&quot;&gt;-v&lt;/span&gt; error &lt;span class=&quot;nt&quot;&gt;-i&lt;/span&gt; foo.wav &lt;span class=&quot;nt&quot;&gt;-f&lt;/span&gt; null -
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h3 id=&quot;mediainfo&quot;&gt;Mediainfo&lt;/h3&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;mediainfo foo.wav
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;I automated this using a simple &lt;a href=&quot;https://github.com/KBNLresearch/detectDamagedAudio/blob/master/runtoolsWAV.sh&quot;&gt;shell script&lt;/a&gt; that runs each tool on all files, and then writes the output to a set of text files.&lt;/p&gt;
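
&lt;p&gt;The same loop can also be sketched in Python. This is not the linked shell script, only a rough equivalent under some assumptions: the function names and the one-report-per-tool layout are hypothetical, and all four tools are assumed to be available on the system path. The command lines themselves are the ones listed above.&lt;/p&gt;

```python
import subprocess
from pathlib import Path


def build_commands(audio_file):
    """Return the command line of each candidate tool for one audio file."""
    return {
        "jhove": ["jhove", "-m", "WAVE-hul", audio_file],
        "shntool": ["shntool", "info", audio_file],
        "ffmpeg": ["ffmpeg", "-v", "error", "-i", audio_file, "-f", "null", "-"],
        "mediainfo": ["mediainfo", audio_file],
    }


def run_tools(audio_files, output_dir):
    """Run every tool on every file and append its output to one text file per tool."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    for audio_file in audio_files:
        for tool, command in build_commands(str(audio_file)).items():
            # Raises FileNotFoundError if the tool is not installed
            result = subprocess.run(command, capture_output=True,
                                    text=True, errors="replace")
            with open(out / (tool + ".txt"), "a", encoding="utf-8") as report:
                report.write("# " + str(audio_file) + "\n")
                report.write(result.stdout)
                report.write(result.stderr)
```

&lt;p&gt;Standard output and standard error are both kept, because some of the tools report problems on the error stream only.&lt;/p&gt;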

&lt;h2 id=&quot;results-wave&quot;&gt;Results, &lt;em&gt;WAVE&lt;/em&gt;&lt;/h2&gt;

&lt;p&gt;The full output results of each tool can be found &lt;a href=&quot;https://github.com/KBNLresearch/detectDamagedAudio/tree/master/outputWAV&quot;&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;h3 id=&quot;jhove-1&quot;&gt;Jhove&lt;/h3&gt;

&lt;p&gt;The ‘Status’ field in Jhove’s output summarises the validation outcome. Here are the results for each file:&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th style=&quot;text-align: left&quot;&gt;File&lt;/th&gt;
      &lt;th style=&quot;text-align: left&quot;&gt;Result&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;frogs-01.wav&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Status: Well-Formed and valid&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;frogs-01-last-byte-missing.wav&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Status: Well-Formed and valid&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;frogs-01-last-2032-bytes-missing.wav&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Status: Well-Formed and valid&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;frogs-01-byte-missing-at-offset-811537.wav&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Status: Well-Formed and valid&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;So, Jhove was unable to detect &lt;em&gt;any&lt;/em&gt; of the damaged files at all!&lt;/p&gt;

&lt;h3 id=&quot;shntool-1&quot;&gt;Shntool&lt;/h3&gt;

&lt;p&gt;Shntool checks a &lt;em&gt;WAVE&lt;/em&gt; on six criteria, which are listed in its output under ‘Possible problems’:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;Possible problems:
    File contains ID3v2 tag:    no
    Data chunk block-aligned:   yes
    Inconsistent header:        no
    File probably truncated:    no
    Junk appended to file:      no
    Odd data size has pad byte: n/a
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The thing to watch here is the ‘File probably truncated’ item:&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th style=&quot;text-align: left&quot;&gt;File&lt;/th&gt;
      &lt;th style=&quot;text-align: left&quot;&gt;Result&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;frogs-01.wav&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;File probably truncated:    no&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;frogs-01-last-byte-missing.wav&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;File probably truncated:    yes (missing 1 byte)&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;frogs-01-last-2032-bytes-missing.wav&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;File probably truncated:    yes (missing 2032 bytes)&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;frogs-01-byte-missing-at-offset-811537.wav&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;File probably truncated:    yes (missing 1 byte)&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;So, Shntool was able to detect all damaged files.&lt;/p&gt;
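
&lt;p&gt;In an automated workflow this line has to be picked out of Shntool’s report. A minimal Python sketch of how that might look is given below; the function name &lt;em&gt;probably_truncated&lt;/em&gt; is hypothetical, and the parsing assumes the report layout shown above.&lt;/p&gt;

```python
def probably_truncated(shntool_output):
    """Return True or False from the 'File probably truncated' line, or None if absent."""
    for line in shntool_output.splitlines():
        label, _, value = line.partition(":")
        if label.strip() == "File probably truncated":
            return value.strip().startswith("yes")
    return None


sample = """Possible problems:
    File contains ID3v2 tag:    no
    Data chunk block-aligned:   yes
    Inconsistent header:        no
    File probably truncated:    no
    Junk appended to file:      no
    Odd data size has pad byte: n/a
"""

print(probably_truncated(sample))  # False
print(probably_truncated("    File probably truncated:    yes (missing 1 byte)"))  # True
```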

&lt;h3 id=&quot;ffmpeg-1&quot;&gt;Ffmpeg&lt;/h3&gt;

&lt;p&gt;For our Ffmpeg call we monitor any errors that are sent to the standard error stream. The results:&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th style=&quot;text-align: left&quot;&gt;File&lt;/th&gt;
      &lt;th style=&quot;text-align: left&quot;&gt;Result&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;frogs-01.wav&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;-&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;frogs-01-last-byte-missing.wav&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;[pcm_s16le @ 0x3545380] Invalid PCM packet, data has size 3 but at least a size of 4 was expected&lt;br /&gt;Error while decoding stream #0:0: Invalid data found when processing input&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;frogs-01-last-2032-bytes-missing.wav&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;-&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;frogs-01-byte-missing-at-offset-811537.wav&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;[pcm_s16le @ 0x2768380] Invalid PCM packet, data has size 3 but at least a size of 4 was expected&lt;br /&gt;Error while decoding stream #0:0: Invalid data found when processing input&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;Interestingly, Ffmpeg reports an error for both files that have 1 byte missing, but it doesn’t for the file that has 2032 bytes missing. This suggests that Ffmpeg is &lt;em&gt;not&lt;/em&gt; suitable for detecting broken &lt;em&gt;WAVE&lt;/em&gt; files.&lt;/p&gt;
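
&lt;p&gt;One plausible explanation, which has not been verified against the Ffmpeg source code, is block alignment. The error message suggests that each sample frame takes 4 bytes, which would be consistent with 16-bit stereo PCM. Removing a multiple of 4 bytes leaves the audio data ending on a frame boundary, whereas removing a single byte leaves an incomplete 3-byte frame. The sketch below shows the arithmetic; the helper name &lt;em&gt;leftover_bytes&lt;/em&gt; is hypothetical, and the intact audio data are assumed to be block-aligned.&lt;/p&gt;

```python
BLOCK_ALIGN = 4  # bytes per sample frame; assumed: 2 channels x 2 bytes (16-bit stereo)


def leftover_bytes(bytes_removed, block_align=BLOCK_ALIGN):
    """Size of the incomplete final frame after removing bytes from block-aligned audio data."""
    return -bytes_removed % block_align


print(leftover_bytes(1))     # 3
print(leftover_bytes(2032))  # 0
```

&lt;p&gt;If this assumption holds, any truncation by a multiple of 4 bytes would go unnoticed by this particular check.&lt;/p&gt;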

&lt;h3 id=&quot;mediainfo-1&quot;&gt;Mediainfo&lt;/h3&gt;

&lt;p&gt;Mediainfo didn’t report errors or warnings for any of these files. This is not surprising, but it does 
confirm that Mediainfo cannot be used for detecting broken &lt;em&gt;WAVE&lt;/em&gt; files.&lt;/p&gt;

&lt;h2 id=&quot;flac-dataset&quot;&gt;&lt;em&gt;FLAC&lt;/em&gt; dataset&lt;/h2&gt;

&lt;p&gt;Analogous to the &lt;em&gt;WAVE&lt;/em&gt; dataset, I started out with a &lt;a href=&quot;https://github.com/KBNLresearch/detectDamagedAudio/blob/master/data/frogs-01.flac&quot;&gt;small, intact &lt;em&gt;FLAC&lt;/em&gt; file&lt;/a&gt;, which I then butchered into the following derivative files:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;https://github.com/KBNLresearch/detectDamagedAudio/blob/master/data/frogs-01-last-byte-missing.flac&quot;&gt;frogs-01-last-byte-missing.flac&lt;/a&gt; - one byte is missing at the end of the file&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://github.com/KBNLresearch/detectDamagedAudio/blob/master/data/frogs-01-last-1000-bytes-missing.flac&quot;&gt;frogs-01-last-1000-bytes-missing.flac&lt;/a&gt; - a chunk of  1000 bytes is missing at the end of the file&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://github.com/KBNLresearch/detectDamagedAudio/blob/master/data/frogs-01-byte-missing-at-offset-651202.flac&quot;&gt;frogs-01-byte-missing-at-offset-651202.flac&lt;/a&gt; - one byte is missing at offset 651202&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;candidate-tools-flac&quot;&gt;Candidate tools, &lt;em&gt;FLAC&lt;/em&gt;&lt;/h2&gt;

&lt;p&gt;The set of candidate tools is identical to the one used for the &lt;em&gt;WAVE&lt;/em&gt; analysis, with two exceptions:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;https://xiph.org/flac/&quot;&gt;&lt;strong&gt;flac&lt;/strong&gt;&lt;/a&gt; is the reference implementation of the &lt;em&gt;FLAC&lt;/em&gt; format. The tested version is 1.3.0.&lt;/li&gt;
  &lt;li&gt;Since Jhove does not include a &lt;em&gt;FLAC&lt;/em&gt; module, it was not used.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;flac&quot;&gt;Flac&lt;/h3&gt;

&lt;p&gt;The Flac tool is able to encode audio to &lt;em&gt;FLAC&lt;/em&gt;, and decode and analyze &lt;em&gt;FLAC&lt;/em&gt; files. For these tests I ran it with the &lt;em&gt;-t&lt;/em&gt; (or &lt;em&gt;--test&lt;/em&gt;) option:&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;flac &lt;span class=&quot;nt&quot;&gt;-t&lt;/span&gt; foo.flac
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This decodes a &lt;em&gt;FLAC&lt;/em&gt; without writing the decoded data to a file. Any errors during the decoding process are reported to the standard error stream.&lt;/p&gt;

&lt;h2 id=&quot;results-flac&quot;&gt;Results, &lt;em&gt;FLAC&lt;/em&gt;&lt;/h2&gt;

&lt;p&gt;The full output results of each tool can be found &lt;a href=&quot;https://github.com/KBNLresearch/detectDamagedAudio/tree/master/outputFLAC&quot;&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;h3 id=&quot;shntool-2&quot;&gt;Shntool&lt;/h3&gt;

&lt;p&gt;Even though Shntool supports &lt;em&gt;FLAC&lt;/em&gt;, it was not able to detect the missing data in any of the files:&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th style=&quot;text-align: left&quot;&gt;File&lt;/th&gt;
      &lt;th style=&quot;text-align: left&quot;&gt;Result&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;frogs-01.flac&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;File probably truncated:    no&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;frogs-01-last-byte-missing.flac&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;File probably truncated:    no&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;frogs-01-last-1000-bytes-missing.flac&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;File probably truncated:    no&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;frogs-01-byte-missing-at-offset-651202.flac&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;File probably truncated:    no&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;So, Shntool does not provide any meaningful information on whether a &lt;em&gt;FLAC&lt;/em&gt; is damaged.&lt;/p&gt;

&lt;h3 id=&quot;ffmpeg-2&quot;&gt;Ffmpeg&lt;/h3&gt;

&lt;p&gt;Here are the results for Ffmpeg:&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th style=&quot;text-align: left&quot;&gt;File&lt;/th&gt;
      &lt;th style=&quot;text-align: left&quot;&gt;Result&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;frogs-01.flac&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;-&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;frogs-01-last-byte-missing.flac&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;[flac @ 0x294b860] overread: 1&lt;br /&gt;Error while decoding stream #0:0: Invalid data found when processing input&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;frogs-01-last-1000-bytes-missing.flac&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;[flac @ 0x3c5d860] overread: 1&lt;br /&gt;Error while decoding stream #0:0: Invalid data found when processing input&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;frogs-01-byte-missing-at-offset-651202.flac&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;[flac @ 0x279faa0] overread: 1&lt;br /&gt;Error while decoding stream #0:0: Invalid data found when processing input&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;So, Ffmpeg was able to identify all damaged &lt;em&gt;FLAC&lt;/em&gt;s.&lt;/p&gt;

&lt;h3 id=&quot;mediainfo-2&quot;&gt;Mediainfo&lt;/h3&gt;

&lt;p&gt;Similar to the &lt;em&gt;WAVE&lt;/em&gt; results, Mediainfo again didn’t report errors or warnings for any of these files.&lt;/p&gt;

&lt;h3 id=&quot;flac-1&quot;&gt;Flac&lt;/h3&gt;

&lt;p&gt;Finally the results for the Flac tool:&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th style=&quot;text-align: left&quot;&gt;File&lt;/th&gt;
      &lt;th style=&quot;text-align: left&quot;&gt;Result&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;frogs-01.flac&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;-&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;frogs-01-last-byte-missing.flac&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;ERROR while decoding data&lt;br /&gt;state = FLAC__STREAM_DECODER_END_OF_STREAM&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;frogs-01-last-1000-bytes-missing.flac&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;ERROR while decoding data&lt;br /&gt;state = FLAC__STREAM_DECODER_END_OF_STREAM&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;frogs-01-byte-missing-at-offset-651202.flac&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;ERROR while decoding data&lt;br /&gt;state = FLAC__STREAM_DECODER_READ_FRAME&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;So, the Flac tool was able to identify all defective files&lt;sup id=&quot;fnref:1&quot;&gt;&lt;a href=&quot;#fn:1&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;2&lt;/a&gt;&lt;/sup&gt;.&lt;/p&gt;

&lt;h2 id=&quot;conclusion&quot;&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;Out of the candidate tools considered here, only Shntool was able to identify all damaged &lt;em&gt;WAVE&lt;/em&gt; files in this experiment. As a result, this (ancient!) tool still appears to be the best choice for detecting damaged &lt;em&gt;WAVE&lt;/em&gt; files. Surprisingly, Jhove was unable to detect &lt;em&gt;any&lt;/em&gt; of the damaged files at all, and is probably best avoided for this particular purpose. For &lt;em&gt;FLAC&lt;/em&gt;, both the Flac tool (&lt;em&gt;FLAC&lt;/em&gt; reference implementation) and Ffmpeg were able to detect all damaged files, and both appear to be suitable tools.&lt;/p&gt;

&lt;h2 id=&quot;dataset-and-scripts&quot;&gt;Dataset and scripts&lt;/h2&gt;

&lt;p&gt;All example files, scripts and raw tool output are available here:&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://github.com/KBNLresearch/detectDamagedAudio&quot;&gt;https://github.com/KBNLresearch/detectDamagedAudio&lt;/a&gt;&lt;/p&gt;

&lt;h2 id=&quot;post-scriptum-update-on-mediainfo-and-mediaconch&quot;&gt;Post scriptum: update on MediaInfo and MediaConch&lt;/h2&gt;

&lt;p&gt;In response to this post the developers of MediaInfo &lt;a href=&quot;https://github.com/MediaArea/MediaInfoLib/pull/352&quot;&gt;added support for detecting truncated &lt;em&gt;WAVE&lt;/em&gt; files&lt;/a&gt;. This should cover all of the damaged &lt;em&gt;WAVE&lt;/em&gt; files presented here. Moreover, their Twitter account announced that &lt;a href=&quot;https://twitter.com/MediaArea_Net/status/817303297786867712&quot;&gt;detection of &lt;em&gt;FLAC&lt;/em&gt; flaws&lt;/a&gt; is planned for the &lt;a href=&quot;https://mediaarea.net/MediaConch/&quot;&gt;MediaConch tool&lt;/a&gt;, but that they are looking for sponsors for this.&lt;/p&gt;

&lt;hr /&gt;

&lt;p&gt;Originally published at the &lt;a href=&quot;https://openpreservation.org/blog/2017/01/04/breaking-waves-and-some-flacs/&quot;&gt;Open Preservation Foundation blog&lt;/a&gt;&lt;/p&gt;
&lt;div class=&quot;footnotes&quot; role=&quot;doc-endnotes&quot;&gt;
  &lt;ol&gt;
    &lt;li id=&quot;fn:2&quot;&gt;
      &lt;p&gt;Also, &lt;a href=&quot;http://superuser.com/a/100290/681049&quot;&gt;this thread on &lt;em&gt;superuser.com&lt;/em&gt;&lt;/a&gt; recommends Ffmpeg for checking the integrity of video files. &lt;a href=&quot;#fnref:2&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:1&quot;&gt;
      &lt;p&gt;On a side note, I noticed that the error stream of the Flac tool sometimes contained a sequence of 21 non-printable ‘0x08’ (backspace) characters. This is probably a bug. &lt;a href=&quot;#fnref:1&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
  &lt;/ol&gt;
&lt;/div&gt;
</description>
                <link>https://bitsgalore.org/2017/01/04/breaking-waves-and-some-flacs</link>
                <guid>https://bitsgalore.org/2017/01/04/breaking-waves-and-some-flacs</guid>
                <pubDate>2017-01-04T00:00:00+01:00</pubDate>
        </item>

        <item>
                <title>PDF/A as a preferred, sustainable format for spreadsheets?</title>
                <description>&lt;p&gt;Earlier this week the National Archives of the Netherlands (NANeth) published a &lt;a href=&quot;http://www.nationaalarchief.nl/sites/default/files/docs/na_rapport_voorkeursformaten-web_0.pdf&quot;&gt;report on preferred file formats&lt;/a&gt;. It gives an overview of NANeth’s ‘preferred’ and ‘acceptable’ formats for 9 content categories, and also explains the reasoning behind the selected formats. Even though in Dutch language only, the report is well worth a look. However, I found a few of the choices a little surprising, especially the ‘spreadsheet’ category for which it lists the following ‘preferred’ and ‘acceptable’ formats:&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th style=&quot;text-align: left&quot;&gt;Preferred&lt;/th&gt;
      &lt;th style=&quot;text-align: left&quot;&gt;Acceptable&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;ODS, CSV, PDF/A&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;XLS, XLSX&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;The report gives the following explanation on the ‘preferred’ formats (translated from Dutch):&lt;/p&gt;

&lt;blockquote&gt;
  &lt;ul&gt;
    &lt;li&gt;ODS - ODS is part of the OpenDocument standard (ODF, NEN-ISO/IEC
26300:2007), which is listed as the standard for office documents on the &lt;a href=&quot;https://www.forumstandaardisatie.nl/lijst-open-standaarden/in_lijst/verplicht-pas-toe-leg-uit&quot;&gt;‘act or explain’ list&lt;/a&gt; of ‘Forum Standaardisatie’&lt;sup id=&quot;fnref:5&quot;&gt;&lt;a href=&quot;#fn:5&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;1&lt;/a&gt;&lt;/sup&gt;&lt;/li&gt;
    &lt;li&gt;CSV - for the storage of non-interactive information in cells, a comma-delimited (.csv) text file can be used instead of a spreadsheet&lt;/li&gt;
    &lt;li&gt;PDF/A - PDF/A is a widely used open standard and a NEN/ISO standard (ISO:19005). PDF/A-1 and PDF/A-2 are part of the ‘act or explain’ list of ‘Forum Standaardisatie’. Note: some (interactive) functionality will not be available after conversion to PDF/A. If this functionality is deemed essential, this will be a reason for not choosing PDF/A&lt;/li&gt;
  &lt;/ul&gt;

&lt;/blockquote&gt;

&lt;p&gt;In the remainder of this blog post I will point out some problems with the choice of PDF/A and its justification.&lt;/p&gt;

&lt;!-- more --&gt;

&lt;h2 id=&quot;demo-spreadsheet&quot;&gt;Demo spreadsheet&lt;/h2&gt;

&lt;p&gt;To illustrate my arguments, I created a &lt;a href=&quot;https://github.com/bitsgalore/spreadsheetsPDF/raw/master/demoNumbersCalculations.xlsx&quot;&gt;simple demo spreadsheet&lt;/a&gt; in xlsx format (created in Microsoft Excel 2010). It contains two columns:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Column A: random number between 0 and 100 (as static values)&lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Column B: formula that takes the value from Column A and adds its square root:&lt;/p&gt;

    &lt;pre&gt;&lt;code class=&quot;language-data&quot;&gt;=A2 + SQRT(A2)
&lt;/code&gt;&lt;/pre&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;displayed-precision-not-equal-to-stored-precision&quot;&gt;Displayed precision not equal to stored precision&lt;/h2&gt;

&lt;p&gt;Without applying any special formatting, this is what the spreadsheet looks like in MS Excel 2010:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/2016/12/numbers2.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;The first thing of interest here is that the displayed values in the cells are different from those that are actually stored! For example, the value that is shown in cell A2 is:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-data&quot;&gt;52.06077146
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Note that 8 decimal places are shown. But by looking at the formula bar you can see a different value:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-data&quot;&gt;52.0607714623856
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;which contains 13 decimal places. Since Excel internally &lt;a href=&quot;https://en.wikipedia.org/wiki/Numeric_precision_in_Microsoft_Excel&quot;&gt;stores numbers at a precision of 15 significant figures&lt;/a&gt;, only the latter corresponds to the actual (stored) value.&lt;/p&gt;
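
&lt;p&gt;The effect can be mimicked in a few lines of Python. This is only an illustration of the rounding, not of Excel’s actual formatting code; it assumes that Python’s double-precision floats behave closely enough to Excel’s stored values for this purpose.&lt;/p&gt;

```python
stored = 52.0607714623856          # value shown in the formula bar

displayed = format(stored, ".8f")  # 8 decimal places, as shown in the cell
print(displayed)                   # 52.06077146

# Reading the displayed text back does not recover the stored value
print(float(displayed) == stored)  # False

# 15 significant figures are enough to reproduce the stored value
print(format(stored, ".15g"))      # 52.0607714623856
```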

&lt;h2 id=&quot;loss-of-precision-after-exporting-to-pdfa&quot;&gt;Loss of precision after exporting to PDF/A&lt;/h2&gt;

&lt;p&gt;I exported the spreadsheet to PDF/A-1a using Acrobat PDFMaker. The result can be found &lt;a href=&quot;https://github.com/bitsgalore/spreadsheetsPDF/raw/master/demoNumbersCalculations.pdf&quot;&gt;here&lt;/a&gt;. Below is what the PDF looks like when opened in Adobe Acrobat:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/2016/12/numbers2_pdfa.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;So, the PDF only contains the values at Excel’s displayed precision (in this case typically 9-10 significant figures), and the remaining precision got lost in the conversion.&lt;/p&gt;

&lt;p&gt;In addition, unlike the source spreadsheet, the PDF &lt;em&gt;only&lt;/em&gt; contains static numbers. This means that information about the relation between the values in Columns A and B (i.e. the formula) is completely lost.&lt;/p&gt;

&lt;h2 id=&quot;loss-of-precision-after-exporting-to-csv&quot;&gt;Loss of precision after exporting to CSV&lt;/h2&gt;

&lt;p&gt;Interestingly, exporting to a comma-delimited text file resulted in the same loss of precision! See &lt;a href=&quot;https://github.com/bitsgalore/spreadsheetsPDF/blob/master/demoNumbersCalculations.csv&quot;&gt;the exported CSV file here&lt;/a&gt;. For brevity I won’t go into any further detail on CSV, but it’s important to be aware that this issue exists.&lt;/p&gt;

&lt;h2 id=&quot;effects-of-cell-formatting&quot;&gt;Effects of cell formatting&lt;/h2&gt;

&lt;p&gt;A possible way around the rounding issue would be to use Excel’s &lt;em&gt;Format Cells&lt;/em&gt; dialog, which allows one to set a fixed number of decimal places to be used for display:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/2016/12/formatcells.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;This is also less than ideal, if only for the reason that a fixed value will result in the display of non-significant figures. For example, applying a setting of 14 decimal places to the value in cell A2 results in:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-data&quot;&gt;52.06077146238560
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;which is different from the stored value:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-data&quot;&gt;52.0607714623856
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Moreover, this approach gets extremely cumbersome for spreadsheets that contain numbers at different precisions (e.g. it is pretty common to have one column with integer values, and another one with floating-point numbers).&lt;/p&gt;

&lt;p&gt;In practice, Excel’s number formatting is often used to &lt;em&gt;reduce&lt;/em&gt; the number of displayed digits (e.g. to make the columns more visually pleasing, or to avoid messy output when printing). &lt;a href=&quot;https://github.com/bitsgalore/spreadsheetsPDF/raw/master/demoDisplay2DigitsOnly.xlsx&quot;&gt;Here’s a version of the spreadsheet&lt;/a&gt; where I adjusted the formatting to display two decimal places only, and &lt;a href=&quot;https://github.com/bitsgalore/spreadsheetsPDF/raw/master/demoDisplay2DigitsOnly.pdf&quot;&gt;here is the resulting PDF&lt;/a&gt;. It looks like this:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/2016/12/formatcells_pdfa.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;So in this case even more information is lost!&lt;/p&gt;

&lt;h2 id=&quot;interactive-or-dynamic&quot;&gt;Interactive or dynamic?&lt;/h2&gt;

&lt;p&gt;The preferred formats document does acknowledge that PDF/A may not always be suited for spreadsheets, using the following statement (in Dutch):&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;Let wel: bepaalde (interactieve) functionaliteit zal na omzetting naar PDF/A formaat niet meer beschikbaar zijn. Als deze functionaliteit als essentieel wordt beschouwd, is dit een reden om niet voor
PDF/A te kiezen&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Which translates in English as:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;Note: some (interactive) functionality will not be available after conversion to PDF/A. If this functionality is deemed essential, this will be a reason for not choosing PDF/A&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This statement is problematic for various reasons. First, whether functionality is deemed ‘essential’ largely depends on the context and intended user base. By stressing the &lt;em&gt;interactive&lt;/em&gt; aspect, the authors imply (perhaps unintentionally?) that any spreadsheet that does not involve user interaction can be safely converted to PDF/A. But what does ‘interactive’ mean in this context? Taking my earlier &lt;a href=&quot;https://github.com/bitsgalore/spreadsheetsPDF/raw/master/demoNumbersCalculations.xlsx&quot;&gt;sample spreadsheet&lt;/a&gt; as an example: a user may ‘interact’ with that spreadsheet by changing the values in Column A, after which all values in Column B are recalculated. Does that make it interactive? If yes, applying the ‘interactivity’ criterion like this would cover &lt;em&gt;any&lt;/em&gt; spreadsheet for which the value in any cell is dependent on one or more values in other cells. This applies to most spreadsheets, apart from those that only contain static data. But in that case a distinction between ‘static’ and ‘dynamic’ spreadsheets might be more useful&lt;sup id=&quot;fnref:2&quot;&gt;&lt;a href=&quot;#fn:2&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;2&lt;/a&gt;&lt;/sup&gt;.&lt;/p&gt;

&lt;h2 id=&quot;reading-pdfa-spreadsheets&quot;&gt;Reading PDF/A spreadsheets&lt;/h2&gt;

&lt;p&gt;Finally, I’m quite puzzled as to how a PDF/A representation of a spreadsheet is meant to be read. Who are the intended users? What is the target software? What is the context? Sure, a PDF may be sufficient for on-screen viewing, but what if a (future) user wants to recover the original row and column values? Excel is not capable of this (in fact it cannot even import a PDF at all). What if someone wants to use the data for some actual calculations? Data extraction from PDF is notoriously difficult (hence the phrase &lt;a href=&quot;https://twitter.com/search?q=%22pdf%20is%20where%20data%20goes%20to%20die%22&amp;amp;src=typd&quot;&gt;“pdf is where data goes to die”&lt;/a&gt;), which is mainly due to the lack of structure of the format&lt;sup id=&quot;fnref:3&quot;&gt;&lt;a href=&quot;#fn:3&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;3&lt;/a&gt;&lt;/sup&gt;.&lt;/p&gt;

&lt;h2 id=&quot;concluding-remarks&quot;&gt;Concluding remarks&lt;/h2&gt;

&lt;p&gt;The above observations only scratch the surface of the perils of using PDF for spreadsheet data. To be clear: there may be situations where &lt;em&gt;PDF/A&lt;/em&gt; is a good (and possibly even the best) choice. For example, spreadsheets are often used for printable forms, and having these as a PDF/A representation may be perfectly fine&lt;sup id=&quot;fnref:4&quot;&gt;&lt;a href=&quot;#fn:4&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;4&lt;/a&gt;&lt;/sup&gt;. Nevertheless, NANeth’s recommendations on choosing between their ‘preferred formats’ appear to be suboptimal, because they do not take into account the purpose for which a spreadsheet was created, its content, its intended use and the intended (future) user(s). In particular, using ‘interactivity’ as the main criterion seems somewhat dangerous.&lt;/p&gt;

&lt;h2 id=&quot;data&quot;&gt;Data&lt;/h2&gt;

&lt;p&gt;The example files that are referred to in this blog post are all available here:&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://github.com/bitsgalore/spreadsheetsPDF&quot;&gt;https://github.com/bitsgalore/spreadsheetsPDF&lt;/a&gt;&lt;/p&gt;

&lt;hr /&gt;

&lt;p&gt;Originally published at the &lt;a href=&quot;https://openpreservation.org/blog/2016/12/09/pdfa-as-a-preferred-sustainable-format-for-spreadsheets/&quot;&gt;Open Preservation Foundation blog&lt;/a&gt;&lt;/p&gt;
&lt;div class=&quot;footnotes&quot; role=&quot;doc-endnotes&quot;&gt;
  &lt;ol&gt;
    &lt;li id=&quot;fn:5&quot;&gt;
      &lt;p&gt;Forum Standaardisatie is a Dutch government body that promotes the use of open standards in the public sector. &lt;a href=&quot;#fnref:5&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:2&quot;&gt;
      &lt;p&gt;On a related note, it is well known that (formulae in) spreadsheets often contain errors, and that these can have major implications (there are numerous examples on the &lt;a href=&quot;http://www.eusprig.org/horror-stories.htm&quot;&gt;&lt;em&gt;Horror Stories&lt;/em&gt; section&lt;/a&gt; of the &lt;a href=&quot;http://www.eusprig.org/&quot;&gt;European Spreadsheet Risks Interest Group&lt;/a&gt;). Once converted to PDF/A, such errors are impossible to detect. &lt;a href=&quot;#fnref:2&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:3&quot;&gt;
      &lt;p&gt;Andy Jackson once compared this to &lt;a href=&quot;https://twitter.com/anjacks0n/status/471242447813898242&quot;&gt;“reconstructing the cow from the burger”&lt;/a&gt;. &lt;a href=&quot;#fnref:3&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:4&quot;&gt;
      &lt;p&gt;Incidentally, spreadsheet forms can be highly interactive (e.g. by letting a user enter data by selecting a value from a drop-down list); this is again an indication that interactivity may not be a good criterion for deciding on PDF/A as a target format. &lt;a href=&quot;#fnref:4&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
  &lt;/ol&gt;
&lt;/div&gt;
</description>
                <link>https://bitsgalore.org/2016/12/09/pdfa-as-a-preferred-sustainable-format-for-spreadsheets</link>
                <guid>https://bitsgalore.org/2016/12/09/pdfa-as-a-preferred-sustainable-format-for-spreadsheets</guid>
                <pubDate>2016-12-09T00:00:00+01:00</pubDate>
        </item>

        <item>
                <title>Valid, but not accessible&#58; crazy fixed EPUB layouts</title>
                <description>&lt;p&gt;&lt;a href=&quot;https://github.com/IDPF/epubcheck&quot;&gt;&lt;em&gt;EpubCheck&lt;/em&gt;&lt;/a&gt; is an invaluable tool for assessing the quality of &lt;em&gt;EPUB&lt;/em&gt; files. Still, it is possible that &lt;em&gt;EPUB&lt;/em&gt;s that are valid according to the format specification (and thus &lt;em&gt;EpubCheck&lt;/em&gt;) are nevertheless inaccessible to some users. Some weeks ago a colleague sent me an &lt;em&gt;EPUB&lt;/em&gt; 2 file that produced some really strange behaviour across a number of viewer applications. For a start, the text wouldn’t reflow properly after re-sizing the viewer window, and increasing the font size resulted in garbled text. Running the file through &lt;em&gt;EpubCheck&lt;/em&gt; did return some validation errors, but none of these were related to the behaviour I was getting. Closer inspection revealed some very peculiar stylesheet and &lt;em&gt;HTML&lt;/em&gt; use.&lt;/p&gt;

&lt;!-- more --&gt;

&lt;h2 id=&quot;crazy-fixed-layout&quot;&gt;Crazy Fixed Layout&lt;/h2&gt;

&lt;p&gt;As I cannot share the original file for rights reasons, I fired up the &lt;a href=&quot;https://sigil-ebook.com/&quot;&gt;&lt;em&gt;Sigil&lt;/em&gt;&lt;/a&gt; e-book editor and made a handcrafted &lt;em&gt;EPUB&lt;/em&gt; that reproduces its behaviour. You can &lt;a href=&quot;https://github.com/KBNLresearch/epubPolicyTests/blob/master/build/epub20_crazy_fixed_layout.epub?raw=true&quot;&gt;download the file here&lt;/a&gt;. If you open it in an e-book viewer, it will probably look perfectly normal at first sight. For example, here’s a screenshot I made using the &lt;a href=&quot;https://calibre-ebook.com/&quot;&gt;&lt;em&gt;Calibre&lt;/em&gt;&lt;/a&gt; viewer:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/2016/04/calibre_normal.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Next I reduced the width of the viewer window. One would expect the text to re-flow to the new width. Instead this happened:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/2016/04/calibre_resized_screen.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;After increasing the font size, I ended up with this:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/2016/04/calibre_largefont.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;I got similar results in &lt;a href=&quot;https://chrome.google.com/webstore/detail/readium/fepbnnnkkadjhjahcafoaglimekefifl&quot;&gt;&lt;em&gt;Chrome’s Readium&lt;/em&gt; extension&lt;/a&gt;. On my e-Ink reader, a Sony PRS-T2, the book rendered as follows:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/2016/04/sony_fixedlayout.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;However, I wasn’t able to change the font size.&lt;/p&gt;

&lt;h2 id=&quot;analysis&quot;&gt;Analysis&lt;/h2&gt;

&lt;p&gt;The file &lt;a href=&quot;https://github.com/KBNLresearch/epubPolicyTests/blob/master/epubcheckout/4.0.1/epub20_crazy_fixed_layout.xml&quot;&gt;passes validation in &lt;em&gt;EpubCheck&lt;/em&gt; 4.0.1&lt;/a&gt; without errors. However, the output does contain a series of warnings about the use of absolute positions in a stylesheet:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-data&quot;&gt;CSS-017, WARN, [CSS selector specifies absolute position.], OEBPS/Styles/styles.css (6-2)
CSS-017, WARN, [CSS selector specifies absolute position.], OEBPS/Styles/styles.css (24-1)
CSS-017, WARN, [CSS selector specifies absolute position.], OEBPS/Styles/styles.css (43-1)
::
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;To really understand what causes the problem, we need to look inside the file’s &lt;em&gt;HTML&lt;/em&gt; and &lt;em&gt;CSS&lt;/em&gt; resources. Here’s some of the &lt;a href=&quot;https://github.com/KBNLresearch/epubPolicyTests/blob/master/content/epub20_crazy_fixed_layout/OEBPS/Text/Section0001.xhtml&quot;&gt;&lt;em&gt;HTML&lt;/em&gt; that underlies the text&lt;/a&gt;:&lt;/p&gt;

&lt;div class=&quot;language-html highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nt&quot;&gt;&amp;lt;p&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;id=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;p01&quot;&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;class=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;para&quot;&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;&amp;gt;&lt;/span&gt;This is an &lt;span class=&quot;nt&quot;&gt;&amp;lt;em&amp;gt;&lt;/span&gt;EPUB&lt;span class=&quot;nt&quot;&gt;&amp;lt;/em&amp;gt;&lt;/span&gt; 2 file that uses a fixed layout.&lt;span class=&quot;nt&quot;&gt;&amp;lt;/p&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;p&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;id=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;p02&quot;&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;class=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;para&quot;&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;&amp;gt;&lt;/span&gt;This is achieved by placing each line inside a&lt;span class=&quot;nt&quot;&gt;&amp;lt;/p&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;p&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;id=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;p03&quot;&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;class=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;para&quot;&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;&amp;gt;&amp;lt;em&amp;gt;&lt;/span&gt;paragraph&lt;span class=&quot;nt&quot;&gt;&amp;lt;/em&amp;gt;&lt;/span&gt; element. Each &lt;span class=&quot;nt&quot;&gt;&amp;lt;em&amp;gt;&lt;/span&gt;paragraph&lt;span class=&quot;nt&quot;&gt;&amp;lt;/em&amp;gt;&lt;/span&gt; element&lt;span class=&quot;nt&quot;&gt;&amp;lt;/p&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;p&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;id=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;p04&quot;&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;class=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;para&quot;&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;&amp;gt;&lt;/span&gt;is placed at a fixed position on the page. Even&lt;span class=&quot;nt&quot;&gt;&amp;lt;/p&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;p&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;id=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;p05&quot;&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;class=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;para&quot;&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;&amp;gt;&lt;/span&gt;though this file is valid &lt;span class=&quot;nt&quot;&gt;&amp;lt;em&amp;gt;&lt;/span&gt;EPUB&lt;span class=&quot;nt&quot;&gt;&amp;lt;/em&amp;gt;&lt;/span&gt;, this is a pretty&lt;span class=&quot;nt&quot;&gt;&amp;lt;/p&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;p&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;id=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;p06&quot;&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;class=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;para&quot;&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;&amp;gt;&lt;/span&gt; terrible idea, because in most readers the text&lt;span class=&quot;nt&quot;&gt;&amp;lt;/p&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;p&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;id=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;p07&quot;&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;class=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;para&quot;&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;&amp;gt;&lt;/span&gt;will not reflow after resizing the viewer window.&lt;span class=&quot;nt&quot;&gt;&amp;lt;/p&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;So, every line is wrapped inside a &lt;em&gt;paragraph&lt;/em&gt; element, each of which has a unique &lt;em&gt;id&lt;/em&gt; attribute. These are matched by style definitions in the &lt;a href=&quot;https://github.com/KBNLresearch/epubPolicyTests/blob/master/content/epub20_crazy_fixed_layout/OEBPS/Styles/styles.css&quot;&gt;&lt;em&gt;EPUB&lt;/em&gt;’s stylesheet&lt;/a&gt;. Here are the definitions for the first two lines:&lt;/p&gt;

&lt;div class=&quot;language-css highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nf&quot;&gt;#p01&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
&lt;span class=&quot;nl&quot;&gt;position&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;absolute&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;nl&quot;&gt;left&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;m&quot;&gt;40px&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;nl&quot;&gt;top&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;m&quot;&gt;80px&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;nl&quot;&gt;letter-spacing&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;m&quot;&gt;0.42px&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;nl&quot;&gt;word-spacing&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;m&quot;&gt;0.1em&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;nf&quot;&gt;#p02&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
&lt;span class=&quot;nl&quot;&gt;position&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;absolute&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;nl&quot;&gt;left&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;m&quot;&gt;40px&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;nl&quot;&gt;top&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;m&quot;&gt;120px&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;nl&quot;&gt;letter-spacing&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;m&quot;&gt;0.42px&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;nl&quot;&gt;word-spacing&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;m&quot;&gt;0.1em&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Each style definition specifies a line’s position on the canvas (&lt;em&gt;left&lt;/em&gt;, &lt;em&gt;top&lt;/em&gt;); moreover, these co-ordinates are defined as &lt;em&gt;absolute&lt;/em&gt; positions. This means that each line is placed at a fixed position, regardless of whether this makes any sense given the actual dimensions of the viewer window (or device), or the user’s preferred font size. It seems that the intention of the producer of the original &lt;em&gt;EPUB&lt;/em&gt; (from which I derived my example) was to create some sort of &lt;a href=&quot;http://www.idpf.org/epub/fxl/&quot;&gt;“fixed layout”&lt;/a&gt; document. However, this doesn’t make much sense for books with simple, text-only layouts (as in this case). Worse, depending on the viewing device and the user the file may be effectively inaccessible. For example, someone with a visual impairment may only be able to read an &lt;em&gt;EPUB&lt;/em&gt; using very large font sizes, which in this case results in garbled text.&lt;/p&gt;
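
&lt;p&gt;For comparison, a reflowable version needs no positioning at all. The following is only a minimal sketch, assuming the one-line paragraphs are merged into ordinary paragraphs that keep the &lt;em&gt;para&lt;/em&gt; class; the property values are illustrative and are not taken from the original file:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-css&quot;&gt;/* Hypothetical reflowable alternative (not from the original file):
   no position, left or top, so the reading system decides where lines break. */
.para
{
margin: 0 0 1em 0;
line-height: 1.4;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Because nothing here is tied to pixel co-ordinates, the text can rewrap when the viewer window is resized, and it can scale with the user’s preferred font size.&lt;/p&gt;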

&lt;h2 id=&quot;crazy-columns&quot;&gt;Crazy Columns&lt;/h2&gt;

&lt;p&gt;Things can get even worse. I once came across an &lt;em&gt;EPUB&lt;/em&gt; that used similar tricks to achieve a two-column layout. Again I’m not able to share the original file, so I created &lt;a href=&quot;https://github.com/KBNLresearch/epubPolicyTests/blob/master/build/epub20_crazy_columns.epub?raw=true&quot;&gt;another &lt;em&gt;EPUB&lt;/em&gt; that mimics its behaviour&lt;/a&gt;. In the &lt;em&gt;Calibre&lt;/em&gt; viewer it looks like this:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/2016/04/calibre_columns.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;As with the first example, the text doesn’t reflow after resizing the viewer window, and increasing the font size resulted in this:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/2016/04/calibre_columns_largefont.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;This is what I got when I opened the file in my Sony e-Ink reader:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/2016/04/sony_crazycolumns1.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;After I increased the font size this happened:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/2016/04/sony_crazycolumns2.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Similarly, when I tried to copy the text in the file to the clipboard, and then pasted it in a text editor, I ended up with this:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;This is an EPUB filepage. Even though thisthat uses a two-columnfile is valid EPUB, there’slayout. For each column,no way to establish theevery line is placed atlogical reading order ofa fixed position on thethe text.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Ouch!&lt;/p&gt;

&lt;h2 id=&quot;analysis-1&quot;&gt;Analysis&lt;/h2&gt;

&lt;p&gt;Again, throwing this file at &lt;em&gt;EpubCheck&lt;/em&gt; 4 &lt;a href=&quot;https://github.com/KBNLresearch/epubPolicyTests/blob/master/epubcheckout/4.0.1/epub20_crazy_columns.xml&quot;&gt;doesn’t result in any validation errors&lt;/a&gt;, although just like the previous file there are some warnings about the use of absolute positions in the stylesheet:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-data&quot;&gt;CSS-017, WARN, [CSS selector specifies absolute position.], OEBPS/Styles/styles.css (13-1)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;A peek &lt;a href=&quot;https://github.com/KBNLresearch/epubPolicyTests/blob/master/content/epub20_crazy_columns/OEBPS/Text/Section0001.xhtml&quot;&gt;inside the &lt;em&gt;HTML&lt;/em&gt;&lt;/a&gt; reveals the true horrors of this &lt;em&gt;EPUB&lt;/em&gt;. This is how the text is encoded:&lt;/p&gt;

&lt;div class=&quot;language-html highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nt&quot;&gt;&amp;lt;div&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;class=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;pos&quot;&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;style=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;left: 40px; top: 100px;&quot;&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;&amp;gt;&lt;/span&gt;This is an &lt;span class=&quot;nt&quot;&gt;&amp;lt;em&amp;gt;&lt;/span&gt;EPUB&lt;span class=&quot;nt&quot;&gt;&amp;lt;/em&amp;gt;&lt;/span&gt; file&lt;span class=&quot;nt&quot;&gt;&amp;lt;/div&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;div&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;class=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;pos&quot;&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;style=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;left: 260px; top: 100px;&quot;&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;&amp;gt;&lt;/span&gt;page. Even though this&lt;span class=&quot;nt&quot;&gt;&amp;lt;/div&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;div&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;class=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;pos&quot;&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;style=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;left: 40px; top: 140px;&quot;&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;&amp;gt;&lt;/span&gt;that uses a two-column&lt;span class=&quot;nt&quot;&gt;&amp;lt;/div&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;div&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;class=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;pos&quot;&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;style=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;left: 260px; top: 140px;&quot;&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;&amp;gt;&lt;/span&gt;file is valid &lt;span class=&quot;nt&quot;&gt;&amp;lt;em&amp;gt;&lt;/span&gt;EPUB&lt;span class=&quot;nt&quot;&gt;&amp;lt;/em&amp;gt;&lt;/span&gt;, there&apos;s&lt;span class=&quot;nt&quot;&gt;&amp;lt;/div&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;So, every line of each column is wrapped in a division element that has a fixed position. The class &lt;em&gt;pos&lt;/em&gt; in the &lt;a href=&quot;https://github.com/KBNLresearch/epubPolicyTests/blob/master/content/epub20_crazy_columns/OEBPS/Styles/styles.css&quot;&gt;stylesheet&lt;/a&gt; defines the general layout of each division element. In this case, it specifies that all positions are (again) absolute:&lt;/p&gt;

&lt;div class=&quot;language-css highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nc&quot;&gt;.pos&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;nl&quot;&gt;position&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;absolute&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Technically this is pretty similar to the first example. Note that the above &lt;em&gt;HTML&lt;/em&gt; doesn’t contain any semantic information on the fact that there are two separate columns. Worse, the order of the text in the HTML doesn’t even follow the actual reading order! This also explains the results after copying and pasting. &lt;a href=&quot;https://en.wikipedia.org/wiki/Screen_reader&quot;&gt;Screen reader&lt;/a&gt; applications will not be able to handle this either, which makes books like these inaccessible to many visually impaired users. All of this could have been avoided if the book’s producer had followed the &lt;a href=&quot;https://www.w3.org/TR/css3-multicol/&quot;&gt;W3C multi-column layout specification&lt;/a&gt;.&lt;/p&gt;
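
&lt;p&gt;As a rough sketch of that alternative, assuming the text is stored in its normal reading order inside a single container element (the class name &lt;em&gt;columns&lt;/em&gt; is hypothetical, and support for these properties varies between reading systems):&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-css&quot;&gt;/* Hypothetical class name; not part of the example file */
.columns
{
column-count: 2;
column-gap: 2em;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;With this approach the markup keeps the logical reading order, and the reading system decides how to flow the text across the columns. Older rendering engines may need vendor-prefixed versions of these properties.&lt;/p&gt;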

&lt;h2 id=&quot;conclusion&quot;&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;I don’t know how common (or rare) &lt;em&gt;EPUB&lt;/em&gt;s like the above are. They may just be weird edge cases. Nevertheless, their existence indicates that checking for validity alone may not be sufficient to ensure accessibility for all users (in particular those with a visual impairment). In any case, files like these can be identified relatively easily by checking &lt;em&gt;EpubCheck&lt;/em&gt;’s output for the presence of a &lt;em&gt;CSS-017&lt;/em&gt; warning (“CSS selector specifies absolute position”)&lt;sup id=&quot;fnref:1&quot;&gt;&lt;a href=&quot;#fn:1&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;1&lt;/a&gt;&lt;/sup&gt;. These examples also underline the importance of guidelines and best practices. Several good resources for making accessible &lt;em&gt;EPUB&lt;/em&gt; are available from the &lt;a href=&quot;http://www.idpf.org/accessibility/guidelines/&quot;&gt;&lt;em&gt;EPUB&lt;/em&gt; 3 Accessibility Guidelines&lt;/a&gt;, including a useful &lt;a href=&quot;http://www.idpf.org/accessibility/guidelines/content/qa/qa-checklist.php&quot;&gt;Accessibility QA Checklist&lt;/a&gt;. I would also be interested in hearing other people’s experiences with “weird” &lt;em&gt;EPUB&lt;/em&gt;s like these.&lt;/p&gt;

&lt;h2 id=&quot;postscript&quot;&gt;Postscript&lt;/h2&gt;

&lt;p&gt;Alberto Pettarin pointed me to his blog post &lt;a href=&quot;http://www.albertopettarin.it/blog/2015/02/21/current-fixed-layout-ebooks-considered-harmful.html&quot;&gt;&lt;em&gt;(Current) Fixed Layout eBooks Considered Harmful&lt;/em&gt;&lt;/a&gt;. Written in 2015, it addresses the problems with current implementations of fixed layouts in &lt;em&gt;EPUB&lt;/em&gt;, and if you found this blog post interesting, I would suggest checking out Alberto’s blog as well.&lt;/p&gt;

&lt;p&gt;Alberto’s &lt;a href=&quot;https://twitter.com/acutebit/status/718031931221360640&quot;&gt;Twitter feed&lt;/a&gt; also drew my attention to an interesting &lt;em&gt;EPUB&lt;/em&gt; with the program of the recent &lt;em&gt;EPUB&lt;/em&gt; Summit in Bordeaux. You can &lt;a href=&quot;http://edrlab.org/edrlab/wp-content/uploads/2016/04/EDRLabprogram_EN_HD_final.epub_.zip&quot;&gt;download it here&lt;/a&gt; (you need to unzip it first!). The file is interesting because:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;It does not pass validation by &lt;em&gt;EpubCheck&lt;/em&gt; (the mimetype file entry is not the first file resource in the archive)&lt;/li&gt;
  &lt;li&gt;It uses a fixed, multi-column layout that doesn’t scale in either &lt;em&gt;Readium&lt;/em&gt; or &lt;em&gt;Calibre&lt;/em&gt;’s viewer (changing the font size has no effect), and I’m wondering if it is usable at all on any handheld devices!&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;There’s some irony in that this file was published by &lt;a href=&quot;http://edrlab.org/edrlab/&quot;&gt;&lt;em&gt;EDRLab&lt;/em&gt;&lt;/a&gt;, an organisation that describes itself as “the European headquarter for IDPF and Readium Foundation”, and which mentions “support for people who have print disabilities” as a “key part” of its mission. Oh well …&lt;/p&gt;

&lt;h2 id=&quot;link-to-dataset&quot;&gt;Link to dataset&lt;/h2&gt;

&lt;p&gt;The &lt;em&gt;EPUB&lt;/em&gt;s used for this blog post are part of the &lt;a href=&quot;https://github.com/KBNLresearch/epubPolicyTests&quot;&gt;EPUB KB policy testing repository&lt;/a&gt;. This is an annotated set of openly licensed &lt;em&gt;EPUB&lt;/em&gt; files that were specifically created for testing purposes.&lt;/p&gt;

&lt;hr /&gt;

&lt;p&gt;Originally published at the &lt;a href=&quot;http://blog.kbresearch.nl/2016/04/04/valid-but-not-accessible-epub-crazy-fixed-layouts/&quot;&gt;KB Research blog&lt;/a&gt;&lt;/p&gt;
&lt;div class=&quot;footnotes&quot; role=&quot;doc-endnotes&quot;&gt;
  &lt;ol&gt;
    &lt;li id=&quot;fn:1&quot;&gt;
      &lt;p&gt;Note that &lt;em&gt;EpubCheck&lt;/em&gt; 3 (now outdated) &lt;a href=&quot;https://github.com/KBNLresearch/epubPolicyTests/blob/master/epubcheckout/3.0.1/epub20_crazy_columns.xml&quot;&gt;does not report this warning&lt;/a&gt;, so always use &lt;em&gt;EpubCheck&lt;/em&gt; 4. &lt;a href=&quot;#fnref:1&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
  &lt;/ol&gt;
&lt;/div&gt;
</description>
                <link>https://bitsgalore.org/2016/04/04/valid-but-not-accessible-epub-crazy-fixed-layouts</link>
                <guid>https://bitsgalore.org/2016/04/04/valid-but-not-accessible-epub-crazy-fixed-layouts</guid>
                <pubDate>2016-04-04T00:00:00+02:00</pubDate>
        </item>

        <item>
                <title>The future of EPUB? A first look at the EPUB 3.1 Editor’s draft</title>
                <description>&lt;p&gt;About a month ago the &lt;a href=&quot;http://idpf.org/&quot;&gt;International Digital Publishing Forum&lt;/a&gt;, the standards body behind the &lt;em&gt;EPUB&lt;/em&gt; format, published an 
&lt;a href=&quot;http://www.idpf.org/epub/31/spec/epub-spec.html&quot;&gt;Editor’s Draft of &lt;em&gt;EPUB&lt;/em&gt; 3.1&lt;/a&gt;. This is meant to be the successor of the &lt;a href=&quot;http://idpf.org/epub/301&quot;&gt;current 3.0.1 version&lt;/a&gt;. 
IDPF has set up a &lt;a href=&quot;http://idpf.org/news/first-editors-draft-of-epub-31-available-for-review&quot;&gt;community review&lt;/a&gt;, which allows interested parties to comment on the draft. The proposed changes relative to &lt;em&gt;EPUB&lt;/em&gt; 3.0.1 are summarised &lt;a href=&quot;http://www.idpf.org/epub/31/spec/epub-changes.html&quot;&gt;in this document&lt;/a&gt;. A note at the top states (emphasis added by me):&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;The EPUB working group has opted for a &lt;em&gt;radical change approach&lt;/em&gt; to the addition and deletion of features in the 3.1 revision to &lt;em&gt;move the standard aggressively forward&lt;/em&gt; with the overarching goals of alignment with the Open Web Platform and simplification of the core specifications.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;As Gary McGath &lt;a href=&quot;https://fileformats.wordpress.com/2016/02/05/epub-3-1/&quot;&gt;pointed out earlier&lt;/a&gt;, this is a pretty bold statement for what is essentially a minor version. The authors of the draft also mention that they expect it “will provoke strong reactions both for and against”, and that changes that raise “strong negative reactions” from the community “will be reviewed for future drafts”.&lt;/p&gt;

&lt;p&gt;This blog post is an attempt to identify the main implications of the current draft for libraries and archives: to what degree would the proposed changes affect (long-term) accessibility? Since the current draft is particularly notable for its aggressive &lt;em&gt;removal&lt;/em&gt; of various existing &lt;em&gt;EPUB&lt;/em&gt; features, I will focus on these. These observations are all based on the 30 January 2016 draft of the changes document.&lt;/p&gt;

&lt;!-- more --&gt;

&lt;h2 id=&quot;removed-support-for-epubcfi-for-linking&quot;&gt;&lt;a href=&quot;http://www.idpf.org/epub/31/spec/epub-changes.html#sec-epub31-cfi&quot;&gt;Removed support for EPUBCFI for linking&lt;/a&gt;&lt;/h2&gt;

&lt;p&gt;The &lt;em&gt;EPUB&lt;/em&gt; Canonical Fragment Identifier (&lt;a href=&quot;http://www.idpf.org/epub/linking/cfi/epub-cfi.html&quot;&gt;EPUBCFI&lt;/a&gt;) “defines a standardized method for referencing arbitrary content within an EPUB Publication”. Until &lt;em&gt;EPUB&lt;/em&gt; 3.0.1, Reading Systems were required to support EPUBCFI for hyperlinking within and between documents. This requirement is dropped in &lt;em&gt;EPUB&lt;/em&gt; 3.1 (although it would still be possible to use EPUBCFI for annotations and bookmarks).&lt;/p&gt;

&lt;p&gt;In principle this change could result in problems if an &lt;em&gt;EPUB&lt;/em&gt; that uses CFI for hyperlinks is opened in a 3.1 reading system: in that case the hyperlinks would not work. However, according to &lt;em&gt;EPUB&lt;/em&gt; editor Matt Garrish, &lt;a href=&quot;https://github.com/IDPF/epub-revision/issues/662#issuecomment-193793093&quot;&gt;authors simply do not use CFI for hyperlinking&lt;/a&gt;. He also mentions a check by Google on their corpus of millions of books, which only turned up a few instances of CFI use. One of these was a link in an &lt;em&gt;EPUB&lt;/em&gt; best practices book, while the remaining ones were all part of the &lt;em&gt;EPUB&lt;/em&gt; test suite documents. If these results are representative of all &lt;em&gt;EPUB&lt;/em&gt;s “in the wild”, the implications of the change would be negligible.&lt;/p&gt;

&lt;h2 id=&quot;reduced-set-of-metadata-elements-in-package-document&quot;&gt;&lt;a href=&quot;http://www.idpf.org/epub/31/spec/epub-changes.html#sec-pkg-metadata&quot;&gt;Reduced set of metadata elements in Package Document&lt;/a&gt;&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;EPUB&lt;/em&gt; 3.1 imposes restrictions on the metadata elements that can be embedded in the Package Document. Up to version 3.0.1, the full &lt;a href=&quot;http://dublincore.org/documents/dces/&quot;&gt;Dublin Core Metadata Element Set&lt;/a&gt; was supported, whereas in 3.1 only the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;dc:identifier&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;dc:title&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;dc:language&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;dc:creator&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;dc:publisher&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;dc:type&lt;/code&gt; elements are allowed. Additional metadata &lt;em&gt;can&lt;/em&gt; be included, but they need to be defined in a separate resource (file), which is referenced from the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;metadata&lt;/code&gt; element using the &lt;a href=&quot;http://www.idpf.org/epub/31/spec/epub-packages.html#sec-link-elem&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;link&lt;/code&gt;&lt;/a&gt; element. Below is an example that uses a &lt;a href=&quot;https://en.wikipedia.org/wiki/MARC_standards&quot;&gt;MARC&lt;/a&gt; file:&lt;/p&gt;

&lt;div class=&quot;language-xml highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nt&quot;&gt;&amp;lt;link&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;rel=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;record&quot;&lt;/span&gt;
&lt;span class=&quot;na&quot;&gt;href=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;meta/9780000000001.xml&quot;&lt;/span&gt; 
&lt;span class=&quot;na&quot;&gt;media-type=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;application/marc&quot;&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;/&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Complicating things further, the &lt;a href=&quot;http://www.idpf.org/epub/31/spec/epub-packages.html#sec-link-elem&quot;&gt;EPUB 3.1 Packages draft&lt;/a&gt; says:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;Linked resources that are not Publication Resources are not subject to Core Media Type requirements [EPUB31] and may be located inside or outside [EPUB31] the EPUB Container. Retrieval of Remote Resources is optional.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;So, linked metadata resources can have &lt;em&gt;any&lt;/em&gt; possible format, and they may not even be included in the &lt;em&gt;EPUB&lt;/em&gt; container. Even though these changes would have no direct consequences for long-term accessibility, they would seriously complicate document processing (e.g. ingest) workflows that rely on the metadata in the Package Document. It would also affect end users who rely on these metadata fields to &lt;a href=&quot;https://github.com/IDPF/epub-revision/issues/642#issuecomment-181450515&quot;&gt;sort and find their ebooks&lt;/a&gt;&lt;sup id=&quot;fnref:5&quot;&gt;&lt;a href=&quot;#fn:5&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;1&lt;/a&gt;&lt;/sup&gt;.&lt;/p&gt;

&lt;h2 id=&quot;removal-of-the-ncx&quot;&gt;&lt;a href=&quot;http://www.idpf.org/epub/31/spec/epub-changes.html#sec-pkg-ncx&quot;&gt;Removal of the NCX&lt;/a&gt;&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;EPUB&lt;/em&gt; 2 documents contain the &lt;a href=&quot;http://www.idpf.org/epub/20/spec/OPF_2.0.1_draft.htm#Section2.4.1&quot;&gt;&lt;em&gt;NCX&lt;/em&gt;&lt;/a&gt; file (“Navigation Control file for XML”), which provides a mechanism to navigate a publication. It is essentially a hierarchical table of contents. The &lt;em&gt;NCX&lt;/em&gt; was &lt;a href=&quot;http://www.idpf.org/epub/301/spec/epub-publications.html#ncx-superseded&quot;&gt;superseded&lt;/a&gt; by the &lt;a href=&quot;http://www.idpf.org/epub/301/spec/epub-contentdocs.html#sec-xhtml-nav&quot;&gt;&lt;em&gt;Navigation Document&lt;/em&gt;&lt;/a&gt; in &lt;em&gt;EPUB&lt;/em&gt; 3.0.1. However, the &lt;em&gt;NCX&lt;/em&gt; was still allowed in &lt;em&gt;EPUB&lt;/em&gt; 3.0.1 publications, which was useful for keeping &lt;em&gt;EPUB&lt;/em&gt; 3 publications compatible with older (&lt;em&gt;EPUB&lt;/em&gt; 2-based) reading systems&lt;sup id=&quot;fnref:2&quot;&gt;&lt;a href=&quot;#fn:2&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;2&lt;/a&gt;&lt;/sup&gt;. The 3.1 draft forbids the &lt;em&gt;NCX&lt;/em&gt; altogether, which means that such “hybrid” &lt;em&gt;EPUB&lt;/em&gt;s are no longer possible without breaking the specification.&lt;/p&gt;

&lt;p&gt;The main consequence of this is that it would make &lt;em&gt;EPUB&lt;/em&gt; 3.1 files incompatible with older reading systems. More specifically, basic navigation functionality such as direct access to a chapter from the table of contents would not work.&lt;/p&gt;

&lt;p&gt;To get an approximate idea of the impact of this, I had a look at the &lt;a href=&quot;http://epubtest.org/&quot;&gt;&lt;em&gt;EPUB&lt;/em&gt; 3 support grid&lt;/a&gt;, which gives detailed information about the support of specific &lt;em&gt;EPUB&lt;/em&gt; 3 features for commonly used devices, apps, and reading systems. &lt;a href=&quot;http://epubtest.org/testsuite/epub3/feature/toc-nav/&quot;&gt;This link&lt;/a&gt; shows support of the &lt;a href=&quot;http://www.idpf.org/epub/301/spec/epub-contentdocs.html#sec-xhtml-nav-def-types-toc&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;toc nav&lt;/code&gt;&lt;/a&gt; element, which defines the primary navigational hierarchy in the Navigation Document. Only 55% (34 out of 62) of all tested reading systems fully support the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;toc nav&lt;/code&gt; element, with 37% (23  out of 62) not supporting it at all&lt;sup id=&quot;fnref:3&quot;&gt;&lt;a href=&quot;#fn:3&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;3&lt;/a&gt;&lt;/sup&gt;. This may not be a big deal for users of software-based reading systems (which make up the majority of the support grid), but users of (older) E-ink readers often don’t have the option to upgrade their devices. A good example is this (now discontinued) &lt;a href=&quot;http://epubtest.org/evaluation/52/&quot;&gt;Sony e-Ink hardware reader&lt;/a&gt;. Unfortunately, E-ink devices appear to be underrepresented in the support grid. For example, it contains no information whatsoever on any of the popular Kobo readers.&lt;/p&gt;

&lt;p&gt;The proposal to remove the &lt;em&gt;NCX&lt;/em&gt; provoked strong reactions in the &lt;a href=&quot;https://github.com/IDPF/epub-revision/issues/633&quot;&gt;community review&lt;/a&gt;, with one respondent stating it would lead to &lt;a href=&quot;https://github.com/IDPF/epub-revision/issues/633#issuecomment-170398642&quot;&gt;“dropping support for millions of eInk reading systems”&lt;/a&gt;. It would also contradict &lt;a href=&quot;http://www.idpf.org/epub/301/spec/epub-publications.html#ncx-superseded&quot;&gt;this statement from the &lt;em&gt;EPUB&lt;/em&gt; 3.0.1 specification&lt;/a&gt; (emphasis added by me):&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;The NCX feature defined in [OPF2] is superseded by the EPUB Navigation Document [ContentDocs301]. &lt;em&gt;EPUB 3 Publications&lt;/em&gt; may include an NCX (as defined in OPF 2.0.1) for EPUB 2 Reading System forwards compatibility purposes, but EPUB 3 Reading Systems must ignore the NCX.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The explicit reference to &lt;em&gt;EPUB 3 Publications&lt;/em&gt; (&lt;strong&gt;not&lt;/strong&gt; &lt;em&gt;EPUB 3.0.1 Publications&lt;/em&gt;!!) implies that the statement applies to &lt;em&gt;EPUB&lt;/em&gt; 3 in general. Removing the &lt;em&gt;NCX&lt;/em&gt; in another &lt;em&gt;EPUB&lt;/em&gt; 3 release would be at odds with this.&lt;/p&gt;

&lt;h2 id=&quot;removal-of-the-guide-element&quot;&gt;&lt;a href=&quot;http://www.idpf.org/epub/31/spec/epub-changes.html#sec-pkg-guide&quot;&gt;Removal of the guide Element&lt;/a&gt;&lt;/h2&gt;

&lt;p&gt;The &lt;a href=&quot;http://www.idpf.org/epub/20/spec/OPF_2.0.1_draft.htm#Section2.6&quot;&gt;&lt;em&gt;guide&lt;/em&gt; element&lt;/a&gt; was an optional data structure in &lt;em&gt;EPUB&lt;/em&gt; 2 that provided “convenient access” to structural components of a publication. It was &lt;a href=&quot;http://www.idpf.org/epub/301/spec/epub-publications.html#sec-guide-elem&quot;&gt;deprecated&lt;/a&gt; in &lt;em&gt;EPUB&lt;/em&gt; 3.0.1. Without any data on the actual usage of this feature, it is difficult to say much about the impact of its complete removal (this was also &lt;a href=&quot;https://github.com/IDPF/epub-revision/issues/644#issuecomment-191706305&quot;&gt;pointed out by one respondent to the community review&lt;/a&gt;).&lt;/p&gt;

&lt;h2 id=&quot;removal-of-the-bindings-element&quot;&gt;&lt;a href=&quot;http://www.idpf.org/epub/31/spec/epub-changes.html#sec-pkg-bindings&quot;&gt;Removal of the bindings Element&lt;/a&gt;&lt;/h2&gt;

&lt;p&gt;In &lt;em&gt;EPUB&lt;/em&gt; 3.0.1 the &lt;a href=&quot;http://www.idpf.org/epub/301/spec/epub-publications.html#sec-bindings-elem&quot;&gt;&lt;em&gt;bindings&lt;/em&gt;&lt;/a&gt; element  could be used to define fallbacks for &lt;a href=&quot;http://www.idpf.org/epub/301/spec/epub-publications.html#gloss-publication-resource-foreign&quot;&gt;foreign resources&lt;/a&gt;. &lt;a href=&quot;https://github.com/IDPF/epub-revision/issues/639&quot;&gt;According to &lt;em&gt;EPUB&lt;/em&gt; editor Matt Garrish&lt;/a&gt; “this feature is not widely used or supported”, and the impact on accessibility appears to be negligible.&lt;/p&gt;

&lt;h2 id=&quot;removal-of-the-switch-element&quot;&gt;&lt;a href=&quot;http://www.idpf.org/epub/31/spec/epub-changes.html#sec-cdoc-switch&quot;&gt;Removal of the switch Element&lt;/a&gt;&lt;/h2&gt;

&lt;p&gt;The &lt;a href=&quot;http://www.idpf.org/epub/301/spec/epub-contentdocs.html#sec-xhtml-content-switch&quot;&gt;&lt;em&gt;switch&lt;/em&gt;&lt;/a&gt; element in &lt;em&gt;EPUB&lt;/em&gt; 3.0.1 allows one to define alternative representations of XML fragments. Here’s an example:&lt;/p&gt;

&lt;div class=&quot;language-xml highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nt&quot;&gt;&amp;lt;epub:switch&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;id=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;cmlSwitch&quot;&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;&amp;gt;&lt;/span&gt;
   
   &lt;span class=&quot;nt&quot;&gt;&amp;lt;epub:case&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;required-namespace=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;http://www.xml-cml.org/schema&quot;&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;&amp;gt;&lt;/span&gt;
      &lt;span class=&quot;nt&quot;&gt;&amp;lt;cml&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;xmlns=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;http://www.xml-cml.org/schema&quot;&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;&amp;gt;&lt;/span&gt;
         &lt;span class=&quot;nt&quot;&gt;&amp;lt;molecule&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;id=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;sulfuric-acid&quot;&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;&amp;gt;&lt;/span&gt;
            &lt;span class=&quot;nt&quot;&gt;&amp;lt;formula&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;id=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;f1&quot;&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;concise=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;H 2 S 1 O 4&quot;&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;/&amp;gt;&lt;/span&gt;
         &lt;span class=&quot;nt&quot;&gt;&amp;lt;/molecule&amp;gt;&lt;/span&gt;
      &lt;span class=&quot;nt&quot;&gt;&amp;lt;/cml&amp;gt;&lt;/span&gt;
   &lt;span class=&quot;nt&quot;&gt;&amp;lt;/epub:case&amp;gt;&lt;/span&gt;
   
   &lt;span class=&quot;nt&quot;&gt;&amp;lt;epub:default&amp;gt;&lt;/span&gt;
      &lt;span class=&quot;nt&quot;&gt;&amp;lt;p&amp;gt;&lt;/span&gt;H&lt;span class=&quot;nt&quot;&gt;&amp;lt;sub&amp;gt;&lt;/span&gt;2&lt;span class=&quot;nt&quot;&gt;&amp;lt;/sub&amp;gt;&lt;/span&gt;SO&lt;span class=&quot;nt&quot;&gt;&amp;lt;sub&amp;gt;&lt;/span&gt;4&lt;span class=&quot;nt&quot;&gt;&amp;lt;/sub&amp;gt;&amp;lt;/p&amp;gt;&lt;/span&gt;
   &lt;span class=&quot;nt&quot;&gt;&amp;lt;/epub:default&amp;gt;&lt;/span&gt;
   
&lt;span class=&quot;nt&quot;&gt;&amp;lt;/epub:switch&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Here, we have a chemical formula in &lt;a href=&quot;https://en.wikipedia.org/wiki/Chemical_Markup_Language&quot;&gt;&lt;em&gt;ChemML&lt;/em&gt;&lt;/a&gt; format and in standard &lt;em&gt;HTML&lt;/em&gt;. &lt;em&gt;ChemML&lt;/em&gt; is not natively supported in &lt;em&gt;EPUB&lt;/em&gt;, so by default a reader will display the &lt;em&gt;HTML&lt;/em&gt; version. However, wrapping both in a &lt;em&gt;switch&lt;/em&gt; element would allow a &lt;em&gt;ChemML&lt;/em&gt;-capable reader to render that representation instead.&lt;/p&gt;

&lt;p&gt;I asked &lt;em&gt;EPUB&lt;/em&gt; editor Matt Garrish how an &lt;em&gt;EPUB&lt;/em&gt; 3.1-compliant reader would render content that is wrapped in a &lt;em&gt;switch&lt;/em&gt; element. He &lt;a href=&quot;https://github.com/IDPF/epub-revision/issues/637#issuecomment-193894732&quot;&gt;replied&lt;/a&gt; that by default &lt;em&gt;all&lt;/em&gt; of the switch content would be rendered. So for the example above, a reader would try to render both the &lt;em&gt;HTML&lt;/em&gt; and the &lt;em&gt;ChemML&lt;/em&gt; versions (with the latter failing on most reading systems). Matt stressed the significance of the &lt;em&gt;switch&lt;/em&gt; element, adding that people have been using it, “if not extensively”.&lt;/p&gt;

&lt;h2 id=&quot;removal-of-the-trigger-element&quot;&gt;&lt;a href=&quot;http://www.idpf.org/epub/31/spec/epub-changes.html#sec-cdoc-trigger&quot;&gt;Removal of the trigger Element&lt;/a&gt;&lt;/h2&gt;

&lt;p&gt;The &lt;a href=&quot;http://www.idpf.org/epub/301/spec/epub-contentdocs.html#sec-xhtml-epub-trigger&quot;&gt;&lt;em&gt;trigger&lt;/em&gt;&lt;/a&gt; element in &lt;em&gt;EPUB&lt;/em&gt; 3.0.1 is used to define simple user interfaces for multimedia content. Since this can be done natively in &lt;em&gt;HTML&lt;/em&gt; 5, it is dropped from &lt;em&gt;EPUB&lt;/em&gt; 3.1. &lt;a href=&quot;https://github.com/IDPF/epub-revision/issues/638#issue-125825057&quot;&gt;Here&lt;/a&gt; editor Matt Garrish explains that the feature is both “sparsely used” (referring to a survey of publishers) and “poorly supported”.&lt;/p&gt;

&lt;h2 id=&quot;miscellaneous-changes&quot;&gt;Miscellaneous changes&lt;/h2&gt;

&lt;p&gt;Apart from the changes above (which all &lt;em&gt;remove&lt;/em&gt; features from the existing specification), the &lt;em&gt;EPUB&lt;/em&gt; 3.1 draft also adds a number of new features, and clarifies some existing ones. I won’t go over them in detail, but here’s a brief overview:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;http://www.idpf.org/epub/31/spec/epub-changes.html#sec-epub31-cmt&quot;&gt;New Core Media Types&lt;/a&gt; - adds support for &lt;a href=&quot;https://www.w3.org/TR/WOFF2/&quot;&gt;&lt;em&gt;WOFF&lt;/em&gt; 2.0&lt;/a&gt; and &lt;a href=&quot;https://en.wikipedia.org/wiki/SFNT&quot;&gt;&lt;em&gt;SFNT&lt;/em&gt;&lt;/a&gt; fonts.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;http://www.idpf.org/epub/31/spec/epub-changes.html#sec-cdoc-html&quot;&gt;Addition of Support for HTML Syntax of HTML5&lt;/a&gt; - adds support for the &lt;em&gt;HTML&lt;/em&gt; Syntax of &lt;em&gt;HTML5&lt;/em&gt; (currently only the &lt;em&gt;XHTML&lt;/em&gt; syntax is supported&lt;sup id=&quot;fnref:4&quot;&gt;&lt;a href=&quot;#fn:4&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;4&lt;/a&gt;&lt;/sup&gt;).&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;http://www.idpf.org/epub/31/spec/epub-changes.html#sec-cdoc-css&quot;&gt;Replacement of EPUB Style Sheets with CSS References&lt;/a&gt; - this replaces the current “&lt;em&gt;EPUB&lt;/em&gt; Style Sheets profile” by the “official definition” of &lt;em&gt;CSS&lt;/em&gt;.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;http://www.idpf.org/epub/31/spec/epub-changes.html#sec-cdoc-wcag&quot;&gt;Addition of WCAG Support&lt;/a&gt; - adds the recommendation that all &lt;em&gt;HTML&lt;/em&gt; Content Documents conform to the WCAG Guidelines to ensure they are accessible for people with disabilities.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Finally, the draft contains clarifications on &lt;a href=&quot;http://www.idpf.org/epub/31/spec/epub-changes.html#sec-epub31-fallbacks&quot;&gt;Foreign Resource Fallbacks&lt;/a&gt; and &lt;a href=&quot;http://www.idpf.org/epub/31/spec/epub-changes.html#sec-cdoc-scripting&quot;&gt;Scripting Support&lt;/a&gt;.&lt;/p&gt;

&lt;h2 id=&quot;epub-31-or-epub-40&quot;&gt;&lt;em&gt;EPUB&lt;/em&gt; 3.1 or &lt;em&gt;EPUB&lt;/em&gt; 4.0?&lt;/h2&gt;

&lt;p&gt;By now it should be clear that the aggressive removal of features in &lt;em&gt;EPUB&lt;/em&gt; 3.1 would have some far-reaching consequences. This is particularly true for the removal of the &lt;em&gt;NCX&lt;/em&gt;, which would make &lt;em&gt;EPUB&lt;/em&gt; 3.1 files incompatible with many existing E-ink readers. It would do this by ruling out the option to make backward-compatible “hybrid” files. As Gary McGath &lt;a href=&quot;https://fileformats.wordpress.com/2016/02/05/epub-3-1/&quot;&gt;pointed out earlier&lt;/a&gt;, introducing “radical changes” in what is essentially a minor version is pretty unusual practice for any standard. Nowadays, most software and file formats use some variation of &lt;a href=&quot;http://semver.org/&quot;&gt;semantic versioning&lt;/a&gt;, with version numbers that follow the general form &lt;em&gt;MAJOR.MINOR.PATCH&lt;/em&gt;. Here, each component of the version number has a well-defined meaning:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;&lt;em&gt;MAJOR&lt;/em&gt; version is increased in case of incompatible API changes,&lt;/li&gt;
  &lt;li&gt;&lt;em&gt;MINOR&lt;/em&gt; version is increased when functionality is added in a backwards-compatible manner, and&lt;/li&gt;
  &lt;li&gt;&lt;em&gt;PATCH&lt;/em&gt; version is increased in case of backwards-compatible bug fixes.&lt;/li&gt;
&lt;/ol&gt;
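
&lt;p&gt;As a rough illustration of these three rules, here is a minimal Python sketch that derives the next version number from the kind of change a release contains. The function names (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;required_bump&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;next_version&lt;/code&gt;) are made up for this example and are not part of any existing versioning tool.&lt;/p&gt;

```python
def required_bump(breaking, new_features):
    """Return which version component must be increased for a release."""
    if breaking:
        return "MAJOR"
    if new_features:
        return "MINOR"
    return "PATCH"


def next_version(version, breaking=False, new_features=False):
    """Compute the next MAJOR.MINOR.PATCH number, resetting lower components."""
    major, minor, patch = (int(part) for part in version.split("."))
    bump = required_bump(breaking, new_features)
    if bump == "MAJOR":
        return f"{major + 1}.0.0"
    if bump == "MINOR":
        return f"{major}.{minor + 1}.0"
    return f"{major}.{minor}.{patch + 1}"


print(next_version("3.0.1", breaking=True))
print(next_version("3.0.1", new_features=True))
```

&lt;p&gt;Applied to 3.0.1, a backward-incompatible release yields 4.0.0, whereas a purely additive one yields 3.1.0.&lt;/p&gt;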

&lt;p&gt;Since the current draft includes multiple backward-incompatible changes, this makes me wonder why the editors didn’t name it &lt;em&gt;EPUB&lt;/em&gt; 4.0 instead! Kovid Goyal, lead developer of the popular &lt;a href=&quot;https://calibre-ebook.com/&quot;&gt;Calibre&lt;/a&gt; software, made &lt;a href=&quot;https://github.com/IDPF/epub-revision/issues/642#issuecomment-182916894&quot;&gt;the following comment on this&lt;/a&gt;:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;[I]f you want to make backwards incompatible changes, please, dont do it in a point release. From glancing over your changes document, it seems to me that you want to make several breaking changes. That’s great, EPUB 3 could do with some serious breaking. But name it EPUB 4. I really dont want to have tell my users that calibre supports EPUB 3.1 but not EPUB 3.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I agree with Kovid here. Having multiple sub-versions of &lt;em&gt;EPUB&lt;/em&gt; 3, with &lt;em&gt;some&lt;/em&gt; of them being backward-compatible with &lt;em&gt;EPUB&lt;/em&gt; 2, while this backward compatibility is explicitly ruled out in another sub-version, is bound to create a situation that will be incomprehensible for most e-book buyers. Worse, it could even undermine overall confidence in the format. For memory institutions it would also make the management of &lt;em&gt;EPUB&lt;/em&gt; 3 publications unnecessarily complicated. Not only would some &lt;em&gt;EPUB&lt;/em&gt; 3.1 files not render correctly in an &lt;em&gt;EPUB&lt;/em&gt; 3.0.1 reader, the opposite would be true as well.&lt;/p&gt;

&lt;h2 id=&quot;flashback&quot;&gt;Flashback&lt;/h2&gt;

&lt;p&gt;In my 2012 &lt;a href=&quot;https://zenodo.org/record/839711&quot;&gt;report on &lt;em&gt;EPUB&lt;/em&gt; for archival preservation&lt;/a&gt; I already mentioned the stability of the &lt;em&gt;EPUB&lt;/em&gt; format as a concern:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;EPUB 3 shows quite major changes relative to version 2, which raises concerns about the 
format’s stability over time. These concerns are reinforced by the fact that EPUB 3 is heavily 
dependent on (X)HTML5 and CSS3, both of which are unfinished “works in progress”, which 
may undergo various changes before being finalised.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;These concerns are once more confirmed by the current &lt;em&gt;EPUB&lt;/em&gt; 3.1 draft. However, it remains to be seen how many of these changes will make it to the final version. The community review process is ongoing at this moment, so if you’re getting a little uneasy after reading this blog post, there’s still time to get involved and make your voice heard!&lt;/p&gt;

&lt;h2 id=&quot;acknowledgement&quot;&gt;Acknowledgement&lt;/h2&gt;

&lt;p&gt;Thanks to Matt Garrish for his prompt replies to my questions on Github.&lt;/p&gt;

&lt;hr /&gt;

&lt;p&gt;Originally published at the &lt;a href=&quot;http://blog.kbresearch.nl/2016/03/10/the-future-of-epub-a-first-look-at-the-epub-3-1-editors-draft/&quot;&gt;KB Research blog&lt;/a&gt;&lt;/p&gt;
&lt;div class=&quot;footnotes&quot; role=&quot;doc-endnotes&quot;&gt;
  &lt;ol&gt;
    &lt;li id=&quot;fn:5&quot;&gt;
      &lt;p&gt;The &lt;a href=&quot;https://github.com/IDPF/epub-revision/issues/642&quot;&gt;discussion thread on this topic in the issue tracker&lt;/a&gt; is worth checking out, as it contains some excellent additional observations. &lt;a href=&quot;#fnref:5&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:2&quot;&gt;
      &lt;p&gt;See here how &lt;a href=&quot;http://toc.oreilly.com/2013/02/oreillys-journey-to-epub-3.html&quot;&gt;O’Reilly keeps their &lt;em&gt;EPUB&lt;/em&gt; 3 books compatible with &lt;em&gt;EPUB&lt;/em&gt; 2 readers&lt;/a&gt; &lt;a href=&quot;#fnref:2&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:3&quot;&gt;
      &lt;p&gt;This figure includes reading systems for which support is unknown &lt;a href=&quot;#fnref:3&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:4&quot;&gt;
      &lt;p&gt;See the &lt;a href=&quot;https://dev.w3.org/html5/html-author/#the-html-and-xhtml-syntax&quot;&gt;&lt;em&gt;HTML5&lt;/em&gt; Reference&lt;/a&gt; for a discussion of the differences between both syntaxes &lt;a href=&quot;#fnref:4&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
  &lt;/ol&gt;
&lt;/div&gt;
</description>
                <link>https://bitsgalore.org/2016/03/10/the-future-of-epub-a-first-look-at-the-epub-3-1-editors-draft</link>
                <guid>https://bitsgalore.org/2016/03/10/the-future-of-epub-a-first-look-at-the-epub-3-1-editors-draft</guid>
                <pubDate>2016-03-10T00:00:00+01:00</pubDate>
        </item>

        <item>
                <title>Jpylyzer 2015 round-up</title>
                <description>&lt;p&gt;Yesterday (7 December)  we released &lt;a href=&quot;http://jpylyzer.openpreservation.org//2015/12/07/Release-of-jpylyzer-1-16-0&quot;&gt;version 1.16.0&lt;/a&gt; of the &lt;a href=&quot;http://jpylyzer.openpreservation.org/&quot;&gt;&lt;em&gt;jpylyzer&lt;/em&gt;&lt;/a&gt; tool, which is this year’s third release of the software (excluding bugfix releases). This blog post gives a brief overview of the main &lt;em&gt;jpylyzer&lt;/em&gt; improvements that have been implemented over this year.&lt;/p&gt;

&lt;!-- more --&gt;

&lt;h2 id=&quot;changes-in-xml-output&quot;&gt;Changes in XML output&lt;/h2&gt;

&lt;p&gt;The 1.14 release introduced two output improvements. Most importantly, an &lt;a href=&quot;https://en.wikipedia.org/wiki/XML_Schema_%28W3C%29&quot;&gt;XML Schema Definition&lt;/a&gt; (XSD) was created. The schema formally defines the output format, and it also makes it possible to validate output files. In addition, a namespace declaration was added. These changes make the post-processing of &lt;em&gt;jpylyzer&lt;/em&gt;’s output more straightforward.&lt;/p&gt;

&lt;p&gt;The 1.16 release added the &lt;em&gt;statusInfo&lt;/em&gt; element, which tells you whether the validation completed without any internal errors. It contains the following sub-elements:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;em&gt;success&lt;/em&gt;: a Boolean flag that indicates whether the validation attempt 
completed normally (“True”) or not (“False”). A value of “False” indicates
an internal error that prevented &lt;em&gt;jpylyzer&lt;/em&gt; from validating the file.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;em&gt;failureMessage&lt;/em&gt;: if the validation attempt failed (value of &lt;em&gt;success&lt;/em&gt; 
equals “False”), this field gives further details about the reason for the failure.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This means that the general structure of the output now looks like this:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/2015/12/outputStructure.png&quot; alt=&quot;Jpylyzer output structure&quot; /&gt;&lt;/p&gt;
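
&lt;p&gt;For post-processing, the &lt;em&gt;statusInfo&lt;/em&gt; element can be checked before looking at anything else in the output. The Python sketch below is only an illustration: the element names &lt;em&gt;statusInfo&lt;/em&gt;, &lt;em&gt;success&lt;/em&gt; and &lt;em&gt;failureMessage&lt;/em&gt; are the ones described above, but the helper functions, the root element name and the sample values are assumptions made for this example. It matches elements by local name, so it does not depend on the exact namespace URI.&lt;/p&gt;

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip a '{namespace}' prefix from an ElementTree tag, if present."""
    return tag.rsplit("}", 1)[-1]


def read_status(root):
    """Return (success, failure_message) from a parsed result document."""
    for elem in root.iter():
        if local_name(elem.tag) == "statusInfo":
            fields = {local_name(child.tag): (child.text or "") for child in elem}
            success = fields.get("success", "").strip() == "True"
            return success, fields.get("failureMessage", "").strip()
    # No statusInfo element found (for example, output from an older release)
    return None, ""


# Build a small stand-in document in memory (hypothetical values)
root = ET.Element("jpylyzer")
status = ET.SubElement(root, "statusInfo")
ET.SubElement(status, "success").text = "False"
ET.SubElement(status, "failureMessage").text = "memory error"

print(read_status(root))
```

&lt;p&gt;For a real output file, the same function could be applied to the root returned by &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ET.parse(path).getroot()&lt;/code&gt;.&lt;/p&gt;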

&lt;h2 id=&quot;recursive-traversal-of-directory-trees&quot;&gt;Recursive traversal of directory trees&lt;/h2&gt;

&lt;p&gt;Another feature that was introduced  with the 1.14 release is the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;--recurse&lt;/code&gt; option. This allows one to recursively traverse a directory tree. The code for this feature was created by Adam Retter, Jaishree Davey and Laura Damian of The National Archives (UK).&lt;/p&gt;

&lt;h2 id=&quot;memory-mapping&quot;&gt;Memory mapping&lt;/h2&gt;

&lt;p&gt;The 1.15 release introduced the use of &lt;a href=&quot;https://en.wikipedia.org/wiki/Memory-mapped_file&quot;&gt;memory mapping&lt;/a&gt; for reading input images. This results in better performance when processing (very) large files. Images that would cause a memory error in previous versions are now handled without any problem. Also, the processing of very large files can be significantly faster than in earlier releases, and is less prone to freezing other processes that are simultaneously running on the machine. This improvement was suggested by Stefan Weil of Mannheim University Library, and the changes are based on a patch he submitted.&lt;/p&gt;

&lt;p&gt;Two examples illustrate the benefits of this change:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;http://hirise-pds.lpl.arizona.edu/download/PDS/RDR/ESP/ORB_011200_011299/ESP_011265_1560/ESP_011265_1560_RED.JP2&quot;&gt;This 2 GB image&lt;/a&gt;
 resulted in a memory error with &lt;em&gt;jpylyzer&lt;/em&gt; 1.14.2 on a Windows machine with 4 GB RAM. The latest versions process the file without problems.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;On a Linux Mint machine with 8 GB RAM, &lt;a href=&quot;http://apollo.sese.asu.edu/data/pancam/AS16/jp2/AS16-P-4102.jp2&quot;&gt;this 6.7 GB image&lt;/a&gt;
 also resulted in a memory error. Again, the current version handles the file without any problem.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This doesn’t mean that memory errors are now a thing of the past entirely; they may still occur under some circumstances. For instance, a test with the 6.7 GB image failed on a Linux Mint machine with 4 GB RAM. So it seems prudent to make sure that the amount of available RAM always exceeds the maximum image size by a fairly wide safety margin. Also, chip architecture and operating system may put further constraints on the amount of memory that can be mapped at a time.&lt;/p&gt;

&lt;h2 id=&quot;improved-exception-handling&quot;&gt;Improved exception handling&lt;/h2&gt;

&lt;p&gt;Prior to release 1.16.0, an exception during the processing of an image could cause &lt;em&gt;jpylyzer&lt;/em&gt; to crash. For example, an extremely large image can result in an internal memory error, and this would grind &lt;em&gt;jpylyzer&lt;/em&gt; to a halt. This is particularly problematic when using the new &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;--recurse&lt;/code&gt; option: in this case a single &lt;em&gt;jpylyzer&lt;/em&gt; invocation may involve the processing of thousands of images at a time. One single (e.g. extremely large) image could then result in unusable output; moreover, it would be difficult to identify &lt;em&gt;which&lt;/em&gt; image caused the crash in the first place! Release 1.16.0 introduces improved exception handling that allows &lt;em&gt;jpylyzer&lt;/em&gt; to handle such situations more gracefully.&lt;/p&gt;

&lt;h2 id=&quot;robustness&quot;&gt;Robustness&lt;/h2&gt;

&lt;p&gt;The combined effect of the exception handling, memory mapping and status output should make &lt;em&gt;jpylyzer&lt;/em&gt; releases from 1.16.0 onwards significantly more robust than previous versions. As an example, here’s some (simplified) output for a 6.7 GB JP2 that caused a memory error:&lt;/p&gt;

&lt;div class=&quot;language-xml highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;cp&quot;&gt;&amp;lt;?xml version=&apos;1.0&apos; encoding=&apos;UTF-8&apos;?&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;jpylyzer&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;&amp;lt;toolInfo&amp;gt;&lt;/span&gt;
        &lt;span class=&quot;nt&quot;&gt;&amp;lt;toolName&amp;gt;&lt;/span&gt;jpylyzer.py&lt;span class=&quot;nt&quot;&gt;&amp;lt;/toolName&amp;gt;&lt;/span&gt;
        &lt;span class=&quot;nt&quot;&gt;&amp;lt;toolVersion&amp;gt;&lt;/span&gt;1.16.0&lt;span class=&quot;nt&quot;&gt;&amp;lt;/toolVersion&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;&amp;lt;/toolInfo&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;&amp;lt;fileInfo&amp;gt;&lt;/span&gt;
        &lt;span class=&quot;nt&quot;&gt;&amp;lt;fileName&amp;gt;&lt;/span&gt;AS16-P-4102.jp2&lt;span class=&quot;nt&quot;&gt;&amp;lt;/fileName&amp;gt;&lt;/span&gt;
        &lt;span class=&quot;nt&quot;&gt;&amp;lt;filePath&amp;gt;&lt;/span&gt;/home/johan/testJpylyzer/AS16-P-4102.jp2&lt;span class=&quot;nt&quot;&gt;&amp;lt;/filePath&amp;gt;&lt;/span&gt;
        &lt;span class=&quot;nt&quot;&gt;&amp;lt;fileSizeInBytes&amp;gt;&lt;/span&gt;6745365021&lt;span class=&quot;nt&quot;&gt;&amp;lt;/fileSizeInBytes&amp;gt;&lt;/span&gt;
        &lt;span class=&quot;nt&quot;&gt;&amp;lt;fileLastModified&amp;gt;&lt;/span&gt;Wed Dec  2 20:05:29 2015&lt;span class=&quot;nt&quot;&gt;&amp;lt;/fileLastModified&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;&amp;lt;/fileInfo&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;&amp;lt;statusInfo&amp;gt;&lt;/span&gt;
        &lt;span class=&quot;nt&quot;&gt;&amp;lt;success&amp;gt;&lt;/span&gt;False&lt;span class=&quot;nt&quot;&gt;&amp;lt;/success&amp;gt;&lt;/span&gt;
        &lt;span class=&quot;nt&quot;&gt;&amp;lt;failureMessage&amp;gt;&lt;/span&gt;memory error (file size too large)&lt;span class=&quot;nt&quot;&gt;&amp;lt;/failureMessage&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;&amp;lt;/statusInfo&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;&amp;lt;isValidJP2&amp;gt;&lt;/span&gt;False&lt;span class=&quot;nt&quot;&gt;&amp;lt;/isValidJP2&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;&amp;lt;tests/&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;&amp;lt;properties/&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;/jpylyzer&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Previous versions would simply crash in this situation. Now, automated workflows can simply check for the value of the &lt;em&gt;success&lt;/em&gt; field to verify the status of the validation. More importantly, if the &lt;em&gt;jpylyzer&lt;/em&gt; invocation involved multiple input files (e.g. through the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;--recurse&lt;/code&gt; option), errors like these will not stop the processing of the remaining files.&lt;/p&gt;
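
&lt;p&gt;As a rough illustration of such a workflow check, the sketch below counts the failed validation attempts recorded in an output file. The function name is hypothetical and not part of &lt;em&gt;jpylyzer&lt;/em&gt;, and it assumes pretty-printed output with one element per line, as in the example above. It is a crude text match; a proper XML parser would be more robust.&lt;/p&gt;

```shell
# Hypothetical helper (not part of jpylyzer): count the failed validation
# attempts recorded in a jpylyzer output file.
# Assumes pretty-printed output with one element per line.
# This is a plain text match on the success element, not real XML parsing.
count_failed_attempts() {
    grep -c 'success>False' "$1"
}
```

&lt;p&gt;Calling &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;count_failed_attempts output.xml&lt;/code&gt; (where &lt;em&gt;output.xml&lt;/em&gt; is a placeholder file name) prints the number of files for which the validation attempt failed; a value of 0 means every attempt completed.&lt;/p&gt;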

&lt;h2 id=&quot;64-bit-windows-binaries&quot;&gt;64-bit Windows binaries&lt;/h2&gt;

&lt;p&gt;Finally, from version 1.15.1 onwards we are providing 64-bit Windows binaries of &lt;em&gt;jpylyzer&lt;/em&gt; (previously only 32-bit binaries were available).&lt;/p&gt;

&lt;h2 id=&quot;links&quot;&gt;Links&lt;/h2&gt;

&lt;p&gt;&lt;a href=&quot;http://jpylyzer.openpreservation.org/&quot;&gt;Jpylyzer website&lt;/a&gt;&lt;/p&gt;
&lt;hr /&gt;

&lt;p&gt;Originally published at the &lt;a href=&quot;http://blog.kbresearch.nl/2015/12/08/jpylyzer-2015-round-up/&quot;&gt;KB Research blog&lt;/a&gt;&lt;/p&gt;
</description>
                <link>https://bitsgalore.org/2015/12/08/jpylyzer-2015-round-up</link>
                <guid>https://bitsgalore.org/2015/12/08/jpylyzer-2015-round-up</guid>
                <pubDate>2015-12-08T00:00:00+01:00</pubDate>
        </item>

        <item>
                <title>Preserving optical media from the command-line</title>
                <description>&lt;p&gt;The KB has quite a large collection of offline optical media, such as CD-ROMs, DVDs and audio CDs. We’re currently investigating how to stabilise the contents of these materials using disk imaging. During the initial phase of this work I did a number of tests with various open-source tools. It’s doubtful whether we’ll end up using these same tools in our actual workflows. The main reason for this is the sheer size of the collection, which we estimated at some 15,000 physical carriers; possibly even more. At those volumes we will need a solution that involves the use of a disk robot, and these often require dedicated software (we still need to investigate this more in-depth).&lt;/p&gt;

&lt;p&gt;Nevertheless, throughout the initial testing phase I was surprised at the number of useful tools that are available in the open source domain. Since this will probably be of interest to others as well, I decided to polish a selection from my &lt;a href=&quot;https://gist.github.com/bitsgalore/1bea8f015eca21a706e7#file-notescdimaging-md&quot;&gt;rough working notes&lt;/a&gt; into a somewhat more digestible form (or so I hope!). I edited my original notes down to the following topics:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;How to figure out the device path of the CD drive&lt;/li&gt;
  &lt;li&gt;How to create an ISO image from a CD-ROM or DVD&lt;/li&gt;
  &lt;li&gt;How to check the integrity of the created ISO image&lt;/li&gt;
  &lt;li&gt;How to extract audio from an audio CD&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In addition there’s a final section that covers my attempts at imaging a multisession / mixed mode CD. The result of this particular exercise wasn’t all that successful, but I included it anyway, as some may find it useful. All software mentioned here are open-source tools that are available for any modern Linux distribution (I’m using Linux Mint myself). Some can be used under Windows as well using &lt;a href=&quot;https://www.cygwin.com/&quot;&gt;Cygwin&lt;/a&gt;.&lt;/p&gt;

&lt;!-- more --&gt;

&lt;h2 id=&quot;2024-update-on-imaging-and-ripping-tools&quot;&gt;2024 update on imaging and ripping tools&lt;/h2&gt;

&lt;p&gt;The information in this post largely reflects the tool landscape at the time I originally wrote this in 2015. Since then, several new tools have emerged. For a more up to date (2024) discussion of imaging and ripping tools, my go-to reference would be Misty De Meo’s &lt;a href=&quot;https://www.mistys-internet.website/blog/blog/2024/09/13/the-working-archivists-guide-to-enthusiast-cd-rom-archiving-tools/&quot;&gt;The Working Archivist’s Guide to Enthusiast CD-ROM Archiving Tools&lt;/a&gt;. It recommends the &lt;a href=&quot;https://github.com/superg/redumper&quot;&gt;redumper&lt;/a&gt; tool as the first choice for archivists interested in a commandline tool. &lt;a href=&quot;https://github.com/aaru-dps/Aaru&quot;&gt;Aaru&lt;/a&gt; is another recent imaging tool. I haven’t personally done any elaborate testing with either of these tools, but if I get round to this I may include them in a future update.&lt;/p&gt;

&lt;h2 id=&quot;find-the-device-path-of-the-cd-drive-linux&quot;&gt;Find the device path of the CD drive (Linux)&lt;/h2&gt;

&lt;p&gt;The majority of the tools covered by this blog post need the device path of the CD drive as a command-line argument. Under Linux you can usually find this by inspecting the output of the following command (run this while a CD or DVD is inserted in your drive):&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;mount|grep ^&lt;span class=&quot;s1&quot;&gt;&apos;/dev&apos;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;If all goes well, the result will look similar to this:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-data&quot;&gt;/dev/sda1 on / type ext4 (rw,errors=remount-ro)
/dev/sr0 on /media/johan/REBELS_0 type iso9660
(ro,nosuid,nodev,uid=1000,gid=1000,iocharset=utf8,mode=0400,dmode=0500,uhelper=udisks2)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;So, in this case the path to the CD drive is &lt;em&gt;/dev/sr0&lt;/em&gt; (if you have multiple optical drives you may also see &lt;em&gt;/dev/sr1&lt;/em&gt;, and so on).&lt;/p&gt;
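
&lt;p&gt;Picking the device path out of the &lt;em&gt;mount&lt;/em&gt; output can also be scripted. The sketch below is only an illustration: the function name is made up, it assumes the &lt;em&gt;device on mountpoint type fstype (options)&lt;/em&gt; layout shown above, and it assumes that optical discs are mounted as either ISO 9660 or UDF and that mount points contain no spaces.&lt;/p&gt;

```shell
# Hypothetical helper: read `mount` output on stdin and print the device
# path of every mounted ISO 9660 or UDF file system.
# Assumes the "device on mountpoint type fstype (options)" layout, and
# mount points without spaces.
optical_devices() {
    awk '$4 == "type" { if ($5 == "iso9660" || $5 == "udf") print $1 }'
}
```

&lt;p&gt;Typical use would be &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;mount | optical_devices&lt;/code&gt;, which for the example output above would print &lt;em&gt;/dev/sr0&lt;/em&gt;.&lt;/p&gt;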

&lt;h2 id=&quot;finding-the-device-path-on-windows-cygwin&quot;&gt;Finding the device path on Windows (Cygwin)&lt;/h2&gt;

&lt;p&gt;For some reason the &lt;em&gt;mount&lt;/em&gt; command doesn’t print any device paths in Cygwin. Instead, try this:&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nb&quot;&gt;ls&lt;/span&gt; /dev/
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Which produces a list of all devices:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-data&quot;&gt;clipboard  dsp   mqueue  random  sda2  sdc1    stdin   ttyS2
conin      fd    null    scd0    sdb   shm     stdout  urandom
conout     full  ptmx    sda     sdb1  sr0     tty     windows
console    kmsg  pty0    sda1    sdc   stderr  ttyS0   zero
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;In the above output both &lt;em&gt;sr0&lt;/em&gt; and &lt;em&gt;scd0&lt;/em&gt; point to the CD drive, and either the full paths &lt;em&gt;/dev/sr0&lt;/em&gt; or &lt;em&gt;/dev/scd0&lt;/em&gt; will work (again in case of multiple drives you may be looking for &lt;em&gt;/dev/sr1&lt;/em&gt; or &lt;em&gt;/dev/scd1&lt;/em&gt;).&lt;/p&gt;
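
&lt;p&gt;To narrow such a listing down to likely optical drives, a filter along the following lines could be used. This is a sketch with a made-up function name, and it assumes that optical drives always show up as &lt;em&gt;srN&lt;/em&gt; or &lt;em&gt;scdN&lt;/em&gt;, as they do in the listing above.&lt;/p&gt;

```shell
# Hypothetical helper: read a device listing on stdin and print the
# entries that look like optical drives (srN or scdN), one per line.
# Works for both one-per-line and multi-column listings.
optical_device_names() {
    tr -s ' \t' '\n' | grep -E '^(sr|scd)[0-9]+$' | sort
}
```

&lt;p&gt;For example, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ls /dev/ | optical_device_names&lt;/code&gt; would reduce the listing above to &lt;em&gt;scd0&lt;/em&gt; and &lt;em&gt;sr0&lt;/em&gt;.&lt;/p&gt;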

&lt;p&gt;In all examples below I assumed that the device path is &lt;em&gt;/dev/sr0&lt;/em&gt;; substitute your own path if necessary.&lt;/p&gt;

&lt;h2 id=&quot;create-iso-image-of-a-cd-rom-or-dvd&quot;&gt;Create ISO image of a CD-ROM or DVD&lt;/h2&gt;

&lt;p&gt;A number of tools allow you to create an (ISO&lt;sup id=&quot;fnref:2&quot;&gt;&lt;a href=&quot;#fn:2&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;1&lt;/a&gt;&lt;/sup&gt;) image from a CD-ROM or DVD. Although generic Unix data copying and recovery tools like &lt;a href=&quot;http://linux.die.net/man/1/dd&quot;&gt;dd&lt;/a&gt; and &lt;a href=&quot;http://linux.die.net/man/1/ddrescue&quot;&gt;ddrescue&lt;/a&gt; are often used for this, various people have pointed out that the result may be unreliable because they only perform limited error checking. See for example the comments &lt;a href=&quot;http://www.commandlinefu.com/commands/view/10957/rip-a-cddvd-to-iso-format.0&quot;&gt;here&lt;/a&gt; and &lt;a href=&quot;https://pthree.org/2011/09/26/how-to-properly-create-and-burn-cddvd-iso-images-from-the-command-line/&quot;&gt;here&lt;/a&gt;; both recommend using the &lt;a href=&quot;http://linux.die.net/man/1/readom&quot;&gt;&lt;em&gt;readom&lt;/em&gt;&lt;/a&gt; tool, which is part of the &lt;a href=&quot;https://en.wikipedia.org/wiki/Cdrkit&quot;&gt;&lt;em&gt;cdrkit&lt;/em&gt;&lt;/a&gt; library. My own experience with &lt;em&gt;readom&lt;/em&gt; is that while it works great in most cases, it is less suitable for CD-ROMs that are damaged or otherwise degraded. In those cases &lt;em&gt;ddrescue&lt;/em&gt; is often a better choice. So below I’ll first show how to use &lt;em&gt;readom&lt;/em&gt;, followed by a &lt;em&gt;ddrescue&lt;/em&gt; example that specifically addresses the recovery of a CD-ROM that gives read errors in &lt;em&gt;readom&lt;/em&gt;.&lt;/p&gt;

&lt;h3 id=&quot;running-readom&quot;&gt;Running readom&lt;/h3&gt;

&lt;p&gt;The &lt;a href=&quot;http://linux.die.net/man/1/readom&quot;&gt;documentation&lt;/a&gt; recommends always running &lt;em&gt;readom&lt;/em&gt; as root. Also, before running &lt;em&gt;readom&lt;/em&gt;, the CD or DVD must be unmounted&lt;sup id=&quot;fnref:4&quot;&gt;&lt;a href=&quot;#fn:4&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;2&lt;/a&gt;&lt;/sup&gt;. So, after inserting the CD or DVD, first enter this:&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;umount /dev/sr0
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Then run &lt;em&gt;readom&lt;/em&gt; as root:&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nb&quot;&gt;sudo &lt;/span&gt;readom &lt;span class=&quot;nv&quot;&gt;retries&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;4 &lt;span class=&quot;nv&quot;&gt;dev&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;/dev/sr0 &lt;span class=&quot;nv&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;mydisk.iso
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Here the value of the &lt;em&gt;retries&lt;/em&gt; parameter defines the number of attempts that &lt;em&gt;readom&lt;/em&gt; will make at trying to recover unreadable sectors. The default value is 128, which can result in huge processing times for CDs that are seriously damaged. The &lt;em&gt;f&lt;/em&gt; parameter sets the name of the image file that is created. If all goes well the following output is printed to the screen at the end of the imaging process:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-data&quot;&gt;Read  speed:  4234 kB/s (CD  24x, DVD  3x).
Write speed:     0 kB/s (CD   0x, DVD  0x).
Capacity: 309104 Blocks = 618208 kBytes = 603 MBytes = 633 prMB
Sectorsize: 2048 Bytes
Copy from SCSI (10,0,0) disk to file &apos;mydisk.iso&apos;
end:    309104
addr:   309104 cnt: 44
Time total: 259.287sec
Read 618208.00 kB at 2384.3 kB/sec.
&lt;/code&gt;&lt;/pre&gt;
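
&lt;p&gt;The capacity and sector size in this output allow a simple plausibility check on the resulting file: 309104 blocks of 2048 bytes add up to 633044992 bytes. The sketch below compares that figure against the actual file size. The function names are hypothetical (they are not part of &lt;em&gt;readom&lt;/em&gt; or &lt;em&gt;cdrkit&lt;/em&gt;), and the check assumes that &lt;em&gt;readom&lt;/em&gt; writes every reported block to the image.&lt;/p&gt;

```shell
# Hypothetical helpers, not part of readom or cdrkit.

# Expected image size in bytes: number of blocks times the sector size.
expected_image_bytes() {
    echo $(( $1 * $2 ))
}

# Actual size of a file in bytes.
file_bytes() {
    wc -c "$1" | awk '{ print $1 }'
}

# Succeeds if the image file is exactly as large as the reported capacity.
# Arguments: image file, number of blocks, sector size in bytes.
image_size_matches() {
    [ "$(file_bytes "$1")" -eq "$(expected_image_bytes "$2" "$3")" ]
}
```

&lt;p&gt;For the example above this would be &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;image_size_matches mydisk.iso 309104 2048&lt;/code&gt;. A mismatch is a reason to take a closer look at the image, not proof that it is broken.&lt;/p&gt;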

&lt;h3 id=&quot;when-read-errors-happen-try-ddrescue&quot;&gt;When read errors happen: try ddrescue&lt;/h3&gt;

&lt;p&gt;If the source medium is in a bad condition or otherwise damaged, &lt;em&gt;readom&lt;/em&gt; will most likely terminate prematurely with read errors. If this happens, you may get better results with &lt;a href=&quot;http://linux.die.net/man/1/ddrescue&quot;&gt;&lt;em&gt;ddrescue&lt;/em&gt;&lt;/a&gt;. There are two reasons for this:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Unlike &lt;em&gt;readom&lt;/em&gt;, which usually gives up pretty soon after the first read error occurs, &lt;em&gt;ddrescue&lt;/em&gt; was specifically designed to deal with source media that contain errors. Consequently, it is much more persistent in such cases.&lt;/li&gt;
  &lt;li&gt;If you try to read a defective source medium using two different CD drives (let’s call them &lt;em&gt;A&lt;/em&gt; and &lt;em&gt;B&lt;/em&gt;), it is not uncommon to find that some sectors that result in read errors on drive &lt;em&gt;A&lt;/em&gt; are read correctly by drive &lt;em&gt;B&lt;/em&gt; (and vice versa). With &lt;em&gt;ddrescue&lt;/em&gt; it is possible to take advantage of this.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The &lt;a href=&quot;https://www.gnu.org/software/ddrescue/manual/ddrescue_manual.html#Optical-media&quot;&gt;&lt;em&gt;ddrescue&lt;/em&gt; Manual&lt;/a&gt; gives a (very concise) example of how this works. Based on this I created the following, more detailed example.&lt;/p&gt;

&lt;p&gt;First we run &lt;em&gt;ddrescue&lt;/em&gt; with the following command line:&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;ddrescue &lt;span class=&quot;nt&quot;&gt;-b&lt;/span&gt; 2048 &lt;span class=&quot;nt&quot;&gt;-r4&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-v&lt;/span&gt; /dev/sr0 mydisk.iso mydisk.log
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Here &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;-b&lt;/code&gt; sets the block size (which is typically 2048 bytes for a CD-ROM); &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;-r4&lt;/code&gt; sets the maximum number of retries in case of bad sectors to 4&lt;sup id=&quot;fnref:7&quot;&gt;&lt;a href=&quot;#fn:7&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;3&lt;/a&gt;&lt;/sup&gt;, and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;-v&lt;/code&gt; activates verbose output mode. File &lt;em&gt;mydisk.log&lt;/em&gt; is a so-called &lt;a href=&quot;https://www.gnu.org/software/ddrescue/manual/ddrescue_manual.html#Mapfile-structure&quot;&gt;&lt;em&gt;mapfile&lt;/em&gt;&lt;/a&gt; (known as &lt;em&gt;logfile&lt;/em&gt; in &lt;em&gt;ddrescue&lt;/em&gt; versions prior to 1.20). The &lt;em&gt;mapfile&lt;/em&gt; contains (among a few other things) information on the recovery status of blocks of data. After running the above command on a faulty CD-ROM, we end up with output that looks like this:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-data&quot;&gt;GNU ddrescue 1.17
About to copy 624918 kBytes from /dev/sr0 to mydisk.iso
    Starting positions: infile = 0 B,  outfile = 0 B
    Copy block size:  32 sectors       Initial skip size: 32 sectors
Sector size: 2048 Bytes

Press Ctrl-C to interrupt
rescued:   624871 kB,  errsize:   47104 B,  current rate:        0 B/s
    ipos:   508162 kB,   errors:       3,    average rate:     592 kB/s
    opos:   508162 kB,    time since last successful read:    12.3 m
Finished
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;From this we can see the following:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;The CD-ROM contains 624918 kBytes of data (2nd line from top).&lt;/li&gt;
  &lt;li&gt;Only 624871 kBytes were extracted (‘rescued’ field)&lt;/li&gt;
  &lt;li&gt;A total of 47104 bytes were &lt;em&gt;not&lt;/em&gt; rescued (‘errsize’ field)&lt;/li&gt;
  &lt;li&gt;3 errors occurred while reading the CD (‘errors’ field)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;However, it is often possible to improve the result by additional runs of &lt;em&gt;ddrescue&lt;/em&gt; using either different options, or other hardware. First we’ll see if we can improve things by re-running in &lt;a href=&quot;https://www.gnu.org/software/ddrescue/manual/ddrescue_manual.html#Direct-disc-access&quot;&gt;&lt;em&gt;direct disc access&lt;/em&gt;&lt;/a&gt; mode (this does not work on some systems, in which case &lt;em&gt;ddrescue&lt;/em&gt; will report a warning). So we use the following command&lt;sup id=&quot;fnref:6&quot;&gt;&lt;a href=&quot;#fn:6&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;4&lt;/a&gt;&lt;/sup&gt;:&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;ddrescue &lt;span class=&quot;nt&quot;&gt;-d&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-b&lt;/span&gt; 2048 &lt;span class=&quot;nt&quot;&gt;-r1&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-v&lt;/span&gt; /dev/sr0 mydisk.iso mydisk.log
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Here the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;-d&lt;/code&gt; switch activates direct disc access, which bypasses the kernel cache (note that the number of retries is set to 1 in the above example). Running the command causes &lt;em&gt;ddrescue&lt;/em&gt; to update both the ISO and the mapfile. The screen output now looks like this:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-data&quot;&gt;GNU ddrescue 1.17
About to copy 624918 kBytes from /dev/sr0 to mydisk.iso
    Starting positions: infile = 0 B,  outfile = 0 B
    Copy block size:  32 sectors       Initial skip size: 32 sectors
Sector size: 2048 Bytes

Press Ctrl-C to interrupt
Initial status (read from logfile)
rescued:   624871 kB,  errsize:   47104 B,  errors:       3
Current status
rescued:   624912 kB,  errsize:    6144 B,  current rate:        0 B/s
    ipos:   508162 kB,   errors:       3,    average rate:     1706 B/s
    opos:   508162 kB,    time since last successful read:       7 s
Finished
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;What we see here:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;624912 kBytes were extracted (previously this was 624871)&lt;/li&gt;
  &lt;li&gt;Consequently ‘errsize’ has gone down from 47104 to 6144 bytes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is better, but still not perfect. So let’s see if we can improve the results by using a different CD reader. At this point I hooked up an external USB CD drive, and moved my faulty CD-ROM from the internal reader to the external one. In this case my external drive is mapped under device path &lt;em&gt;/dev/sr2&lt;/em&gt; (re-run the aforementioned steps to find the device path if necessary). This gives the following command line:&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;ddrescue &lt;span class=&quot;nt&quot;&gt;-d&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-b&lt;/span&gt; 2048 &lt;span class=&quot;nt&quot;&gt;-r4&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-v&lt;/span&gt; /dev/sr2 mydisk.iso mydisk.log
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Now the output looks like this:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-data&quot;&gt;GNU ddrescue 1.17
About to copy 624918 kBytes from /dev/sr2 to mydisk.iso
    Starting positions: infile = 0 B,  outfile = 0 B
    Copy block size:  32 sectors       Initial skip size: 32 sectors
Sector size: 2048 Bytes

Press Ctrl-C to interrupt
Initial status (read from logfile)
rescued:   624916 kB,  errsize:    2048 B,  errors:       1
Current status
rescued:   624918 kB,  errsize:       0 B,  current rate:      682 B/s
    ipos:   106450 kB,   errors:       0,    average rate:      682 B/s
    opos:   106450 kB,    time since last successful read:       0 s
Finished
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;From the output we can see that after re-running &lt;em&gt;ddrescue&lt;/em&gt; with the external drive, both ‘errsize’ and the number of errors have gone down to 0. In other words: all of the contents of the CD have been rescued without any errors. Yay!&lt;/p&gt;

&lt;p&gt;In the above example I used two different CD readers that were connected to the same machine, but you could use as many readers as you like. It is also possible to do the first run on one machine, transfer the ISO image and the mapfile to another machine, and then re-run &lt;em&gt;ddrescue&lt;/em&gt; there (this even works across OS platforms).&lt;/p&gt;
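
&lt;p&gt;Since the mapfile records the recovery status of each block, it can also be used to check in a script whether any bad areas remain after a run. The sketch below sums the sizes of all blocks marked as bad sectors. The function name is hypothetical, and the code assumes the mapfile layout described in the &lt;em&gt;ddrescue&lt;/em&gt; manual: comment lines starting with &lt;em&gt;#&lt;/em&gt;, one status line, and then lines with a position, a size (both hexadecimal) and a status character, where &lt;em&gt;-&lt;/em&gt; denotes a bad-sector block.&lt;/p&gt;

```shell
# Hypothetical helper: sum the sizes of all blocks that a ddrescue mapfile
# marks as bad sectors (status "-"), and print the total in bytes.
# Assumes the layout described in the ddrescue manual: comment lines
# starting with "#", one status line, then "pos size status" lines with
# hexadecimal positions and sizes.
unrecovered_bytes() {
    grep -v '^#' "$1" | {
        total=0
        while read -r pos size status; do
            # The status line has no "-" in the third column, so it is skipped.
            if [ "$status" = "-" ]; then
                total=$(( total + $size ))
            fi
        done
        echo "$total"
    }
}
```

&lt;p&gt;Running &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;unrecovered_bytes mydisk.log&lt;/code&gt; after each pass gives a single number to watch; once it prints 0, no bad-sector blocks are left in the mapfile.&lt;/p&gt;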

&lt;h2 id=&quot;check-integrity-of-iso-image-against-physical-cd-rom-or-dvd&quot;&gt;Check integrity of ISO image against physical CD-ROM or DVD&lt;/h2&gt;

&lt;p&gt;In theory you could check the integrity of the created ISO image by computing a checksum on both the ISO file and the physical carrier, and then comparing both:&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nb&quot;&gt;md5sum &lt;/span&gt;mydisk.iso
&lt;span class=&quot;nb&quot;&gt;md5sum&lt;/span&gt; /dev/sr0
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;However, in practice this comparison is not all that useful. Using dedicated data recovery tools like &lt;em&gt;readom&lt;/em&gt; and particularly &lt;em&gt;ddrescue&lt;/em&gt; often results in a more accurate capture of the data on a disc than accessing the corresponding device directly. Because of this, computing a checksum on the device using something like &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;md5sum /dev/sr0&lt;/code&gt; can give unreliable results, resulting in a checksum mismatch that does not indicate any fault of the ISO image. It is worth noting that the aforementioned &lt;a href=&quot;https://pthree.org/2011/09/26/how-to-properly-create-and-burn-cddvd-iso-images-from-the-command-line/&quot;&gt;Aaron Toponce article&lt;/a&gt; claims that &lt;em&gt;readom&lt;/em&gt; already does a checksum check. If true, the additional check would be overkill (especially given that computing a checksum on a physical CD or DVD is time consuming). However, I couldn’t find any confirmation of this in either &lt;em&gt;readom&lt;/em&gt;’s documentation or its source code (although I found the source hard to read, so I may have simply overlooked it).&lt;/p&gt;
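
&lt;p&gt;For those who want to run the comparison anyway, the two &lt;em&gt;md5sum&lt;/em&gt; calls can be wrapped in a small function. This is a sketch with a made-up function name; it only automates the comparison and does not address the reliability caveat described above.&lt;/p&gt;

```shell
# Hypothetical helper: succeed if two paths (for instance an image file
# and a device) produce the same MD5 checksum.
same_md5() {
    sum_a=$(md5sum "$1" | cut -d ' ' -f 1)
    sum_b=$(md5sum "$2" | cut -d ' ' -f 1)
    # Treat an unreadable input as an error rather than as a match.
    if [ -z "$sum_a" ] || [ -z "$sum_b" ]; then
        return 2
    fi
    [ "$sum_a" = "$sum_b" ]
}
```

&lt;p&gt;Usage would be &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;same_md5 mydisk.iso /dev/sr0&lt;/code&gt;. The exit status is 0 for a match, 1 for a mismatch and 2 if either input could not be read.&lt;/p&gt;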

&lt;h2 id=&quot;verify-iso-image&quot;&gt;Verify ISO image&lt;/h2&gt;

&lt;p&gt;In theory, there shouldn’t be any need for additional quality checks on an ISO image once its integrity against the physical carrier is confirmed by the checksum. However, since &lt;em&gt;cdrkit&lt;/em&gt; includes an &lt;a href=&quot;http://linux.die.net/man/8/isoinfo&quot;&gt;&lt;em&gt;isovfy&lt;/em&gt;&lt;/a&gt; tool that claims to “verify the integrity of an iso9660 image”, I decided I might as well give it a try. It works by entering:&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;isovfy mydisk.iso
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Here’s some example output:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-data&quot;&gt;Root at extent 13, 2048 bytes
[0 0]
No errors found
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The documentation of the tool isn’t very clear about &lt;em&gt;what&lt;/em&gt; specific checks it performs. In one of my tests I fed it an ISO image that had its last 50 MB missing (truncated). This did not result in any error or warning message! Most of the reported &lt;em&gt;isovfy&lt;/em&gt; errors that I came across in my tests simply reflected the file system on the physical CD not conforming to ISO 9660 (this seems to be pretty common). Based on this it looks like &lt;em&gt;isovfy&lt;/em&gt; isn’t very useful after all.&lt;/p&gt;

&lt;h3 id=&quot;isolyzer&quot;&gt;Isolyzer&lt;/h3&gt;

&lt;p&gt;In response to the problems I encountered with &lt;em&gt;isovfy&lt;/em&gt;, I created the &lt;a href=&quot;https://github.com/KBNLresearch/isolyzer&quot;&gt;&lt;em&gt;isolyzer&lt;/em&gt;&lt;/a&gt; tool. &lt;em&gt;Isolyzer&lt;/em&gt; checks the file size of an ISO image against the size information in the file system headers. This can be used to identify damaged and incomplete ISO images. Currently supported file systems are ISO 9660, UDF, HFS, HFS+ and a number of hybrids of these file systems. More information on &lt;em&gt;Isolyzer&lt;/em&gt; can be found &lt;a href=&quot;/2017/01/13/detecting-broken-iso-images-introducing-isolyzer&quot;&gt;here&lt;/a&gt; and &lt;a href=&quot;/2017/07/12/update-on-isolyzer-udf-hfs-and-more&quot;&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;h2 id=&quot;get-information-about-an-iso-image&quot;&gt;Get information about an ISO image&lt;/h2&gt;

&lt;h3 id=&quot;isoinfo&quot;&gt;Isoinfo&lt;/h3&gt;

&lt;p&gt;The &lt;a href=&quot;http://wiki.osdev.org/ISO_9660#The_Primary_Volume_Descriptor&quot;&gt;&lt;em&gt;Primary Volume Descriptor&lt;/em&gt;&lt;/a&gt; (PVD) of an ISO 9660 file system contains general information about the CD or DVD. The &lt;a href=&quot;http://linux.die.net/man/8/isoinfo&quot;&gt;&lt;em&gt;isoinfo&lt;/em&gt;&lt;/a&gt; tool (which is also part of &lt;em&gt;cdrkit&lt;/em&gt;) is able to print the most important PVD fields to the screen:&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;isoinfo &lt;span class=&quot;nt&quot;&gt;-d&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-i&lt;/span&gt; mydisk.iso
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Result:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-data&quot;&gt;CD-ROM is in ISO 9660 format
System id: 
Volume id: REBELS_0
Volume set id: 
Publisher id: 
Data preparer id: 
Application id: NERO - BURNING ROM
Copyright File id: 
Abstract File id: 
Bibliographic File id: 
Volume set size is: 1
Volume set sequence number is: 1
Logical block size is: 2048
Volume size is: 333151
Joliet with UCS level 3 found
NO Rock Ridge present
&lt;/code&gt;&lt;/pre&gt;
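
&lt;p&gt;The logical block size and volume size fields can be multiplied to get the size that the file system claims to have: here 2048 × 333151 = 682293248 bytes. An image file that is much smaller than this is probably truncated, which is the kind of problem that the size check described in the Isolyzer section above is aimed at. The sketch below extracts both fields and does the multiplication; the function name is hypothetical, and the field labels are assumed to match the output shown above.&lt;/p&gt;

```shell
# Hypothetical helper: read `isoinfo -d` output on stdin and print the
# size in bytes implied by the Primary Volume Descriptor
# (logical block size times volume size in blocks).
# Assumes the field labels shown in the example output.
expected_iso_bytes() {
    awk -F': ' '
        /^Logical block size is/ { block_size = $2 }
        /^Volume size is/        { volume_blocks = $2 }
        END { printf "%s %s\n", block_size, volume_blocks }
    ' | {
        read -r block_size volume_blocks
        echo $(( block_size * volume_blocks ))
    }
}
```

&lt;p&gt;For example, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;isoinfo -d -i mydisk.iso | expected_iso_bytes&lt;/code&gt; prints the expected size, which can then be compared with the actual file size. Small differences are not necessarily a sign of damage, so treat the result as a hint and not as a verdict.&lt;/p&gt;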

&lt;p&gt;You can also run &lt;em&gt;isoinfo&lt;/em&gt; directly on the physical carrier:&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;isoinfo &lt;span class=&quot;nt&quot;&gt;-d&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-i&lt;/span&gt; /dev/sr0
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;To get a listing of all files and directories that are part of the filesystem, use this:&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;isoinfo &lt;span class=&quot;nt&quot;&gt;-f&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-i&lt;/span&gt; mydisk.iso
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Result:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-data&quot;&gt;/AUTORUN.EXE;1
/AUTORUN.INF;1
/DISK0
/LICENSE2.TXT;1
/LICENSEF.TXT;1
/LICENSEU.TXT;1
/SETUP.EXE;1
/DISK0/CONTROLS.CFG;1
/DISK0/DISK0;1
::
::
etc
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;It looks like all items that are followed by &lt;em&gt;;1&lt;/em&gt; are files, and those that aren’t are directories. Also, the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;-l&lt;/code&gt; option can be used for a detailed list that includes additional file attributes (size, date, etc.).&lt;/p&gt;
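
&lt;p&gt;If that observation holds, the listing can be split into files and directories with a simple filter. The function names below are made up, and the sketch rests entirely on the assumption that only files carry a version suffix such as &lt;em&gt;;1&lt;/em&gt;.&lt;/p&gt;

```shell
# Hypothetical helpers: split an `isoinfo -f` listing (read from stdin)
# into files and directories, assuming that only files carry a ";N"
# version suffix.
iso_files() {
    grep -E ';[0-9]+$'
}

iso_directories() {
    grep -v -E ';[0-9]+$'
}
```

&lt;p&gt;For instance, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;isoinfo -f -i mydisk.iso | iso_directories&lt;/code&gt; would list only &lt;em&gt;/DISK0&lt;/em&gt; for the entries shown above.&lt;/p&gt;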

&lt;h3 id=&quot;disktype&quot;&gt;Disktype&lt;/h3&gt;

&lt;p&gt;The &lt;a href=&quot;http://disktype.sourceforge.net/&quot;&gt;&lt;em&gt;disktype&lt;/em&gt;&lt;/a&gt; tool is particularly useful for identifying &lt;a href=&quot;https://en.wikipedia.org/wiki/Hybrid_disc&quot;&gt;hybrid disc&lt;/a&gt; images that combine multiple file systems. For example:&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;disktype bewaarmachine.iso
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This results in:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-data&quot;&gt;--- bewaarmachine.iso
Regular file, size 342.1 MiB (358727680 bytes)
Apple partition map, 2 entries
Partition 1: 1 KiB (1024 bytes, 2 sectors from 1)
    Type &quot;Apple_partition_map&quot;
Partition 2: 172.4 MiB (180773376 bytes, 353073 sectors from 346957)
    Type &quot;Apple_HFS&quot;
    HFS file system
    Volume name &quot;de bewaarmachine&quot;
    Volume size 172.4 MiB (180764672 bytes, 44132 blocks of 4 KiB)
ISO9660 file system
    Volume name &quot;BEWAARMACHINE_PC&quot;
    Application &quot;TOAST ISO 9660 BUILDER COPYRIGHT (C) 1993-1996 MILES SOFTWARE GMBH - HAVE A NICE DAY&quot;
    Data size 169.4 MiB (177641472 bytes, 86739 blocks of 2 KiB)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;In this example we have an image that contains both an ISO 9660 and an Apple HFS filesystem. &lt;em&gt;Disktype&lt;/em&gt; can also be run directly on the physical carrier, using:&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;disktype /dev/sr0
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h3 id=&quot;isolyzer-1&quot;&gt;Isolyzer&lt;/h3&gt;

&lt;p&gt;The  &lt;a href=&quot;https://github.com/KBNLresearch/isolyzer&quot;&gt;&lt;em&gt;isolyzer&lt;/em&gt;&lt;/a&gt; tool also gives detailed information about an ISO image, including the file systems it contains. As an example:&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;isolyzer bewaarmachine.iso &lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; bewaarmachine.xml
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This results &lt;a href=&quot;https://gist.github.com/bitsgalore/bf5f9fb8e936efb9c4bc06a04443ef4a&quot;&gt;in this output file&lt;/a&gt;.&lt;/p&gt;

&lt;h2 id=&quot;rip-audio-cd-with-cdparanoia&quot;&gt;Rip audio CD with cdparanoia&lt;/h2&gt;

&lt;p&gt;The data structure of an audio CD is fundamentally different from a CD-ROM or DVD, and because of this its content cannot be stored as an ISO image. The most widely-used approach is to extract (or “rip”) the audio tracks on a CD to separate &lt;a href=&quot;http://fileformats.archiveteam.org/wiki/WAV&quot;&gt;&lt;em&gt;WAVE&lt;/em&gt;&lt;/a&gt; files. A complicating factor here is that the way audio is encoded on a CD tends to obscure (small) read errors during playback. As a result, a single linear read will not result in a reliable transfer of the audio data. More details can be found in &lt;a href=&quot;http://journal.code4lib.org/articles/9581&quot;&gt;this excellent article by Alexander Duryee&lt;/a&gt;. Duryee recommends a number of extraction tools that overcome this problem using sophisticated verification and correction functionality. One of these tools is the &lt;a href=&quot;http://linux.die.net/man/1/cdparanoia&quot;&gt;&lt;em&gt;cdparanoia&lt;/em&gt;&lt;/a&gt; ripper. As an example, the following command can be used to rip a CD in batch mode, where each track is stored as a separate &lt;em&gt;WAVE&lt;/em&gt; file:&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;cdparanoia &lt;span class=&quot;nt&quot;&gt;-B&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-L&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;or:&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;cdparanoia &lt;span class=&quot;nt&quot;&gt;-B&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-l&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;-L&lt;/code&gt; switch results in the generation of a detailed log file; &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;-l&lt;/code&gt; produces a summary log (name:  &lt;em&gt;cdparanoia.log&lt;/em&gt;)&lt;sup id=&quot;fnref:5&quot;&gt;&lt;a href=&quot;#fn:5&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;5&lt;/a&gt;&lt;/sup&gt;. File names are generated automatically like this:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-data&quot;&gt;track01.cdda.wav
track02.cdda.wav
track03.cdda.wav
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;&lt;a href=&quot;https://gist.github.com/bitsgalore/1bea8f015eca21a706e7#file-cdparanoialogsummary-log&quot;&gt;Here is a link to an example log file&lt;/a&gt;. The output may look a little weird at first sight, which is because &lt;em&gt;cdparanoia&lt;/em&gt; reports all status and progress information as symbols and smilies, respectively. Their meaning is explained in the &lt;a href=&quot;http://linux.die.net/man/1/cdparanoia&quot;&gt;documentation&lt;/a&gt;.&lt;/p&gt;

&lt;h2 id=&quot;extract-dataaudio-from-mixed-mode-and-enhanced-cds&quot;&gt;Extract data/audio from mixed mode and ‘enhanced’ CDs&lt;/h2&gt;

&lt;p&gt;Some CDs combine data and audio tracks. There are essentially two ways to do this:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;https://en.wikipedia.org/wiki/Mixed_Mode_CD&quot;&gt;Mixed Mode&lt;/a&gt; CDs contain both audio and data, both of which are written into one single session. Mixed Mode was often used for ’90s video games.&lt;/li&gt;
  &lt;li&gt;The &lt;a href=&quot;https://en.wikipedia.org/wiki/Blue_Book_(CD_standard)&quot;&gt;Blue Book&lt;/a&gt; standard defines a way to combine audio and data tracks. Blue Book CDs contain two sessions, where the first one contains one or more audio tracks, and the second one a data track. Examples of such discs are “enhanced” audio CDs that include software or movies as bonus material. They are sometimes referred to as “CD-Extra” discs.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Even though the &lt;em&gt;data&lt;/em&gt; part of such CDs is typically compatible with an ISO 9660 (or HFS, HFS+) file system, the audio tracks are not. Since &lt;a href=&quot;http://anjackson.net/keeping-codes/practice/developing-a-robust-migration-workflow-for-preserving-and-curating-handheld-media.html&quot;&gt;there is no good, open and mature file format to describe the contents of a CD precisely&lt;/a&gt;, such CDs pose a particular challenge. In addition, tools such as &lt;em&gt;readom&lt;/em&gt; and &lt;em&gt;ddrescue&lt;/em&gt; typically only recognise the first session on a multisession CD, which means that they are not suitable for handling this type of disc.&lt;/p&gt;

&lt;p&gt;Based on some (relatively limited) testing, the &lt;a href=&quot;http://linux.die.net/man/1/cdrdao&quot;&gt;&lt;em&gt;cdrdao&lt;/em&gt;&lt;/a&gt; tool does a good job of imaging mixed mode CDs, and a reasonable (but less than ideal) job for Blue Book discs. Below are some brief notes on how to recognise both types of disc, and how to image them.&lt;/p&gt;

&lt;h3 id=&quot;identifying-mixed-mode-cds&quot;&gt;Identifying mixed-mode CDs&lt;/h3&gt;

&lt;p&gt;We can identify mixed-mode discs by running the &lt;em&gt;cd-info&lt;/em&gt; command, which is part of the &lt;a href=&quot;https://www.gnu.org/software/libcdio/libcdio.html&quot;&gt;&lt;em&gt;GNU libcdio&lt;/em&gt;&lt;/a&gt; package:&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;cd-info /dev/sr0
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Now pay attention to the “CD Analysis Report” at the bottom of &lt;em&gt;cd-info&lt;/em&gt;’s output:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-data&quot;&gt;CD Analysis Report

CD-TEXT for Disc:
CD-TEXT for Track  1:
CD-TEXT for Track  2:
mixed mode CD   
CD-ROM with ISO 9660 filesystem
ISO 9660: 235494 blocks, label `TNT_ROM                         &apos;
Application: TOAST ISO 9660 BUILDER COPYRIGHT (C) 1993 MILES SOFTWARE ENGINEERING - HAVE A NICE DAY
Preparer   : 
Publisher  : 
System     : APPLE COMPUTER, INC., TYPE: 0002
Volume     : TNT_ROM
Volume Set : 
mixed mode CD   XA sectors   
session #2 starts at track  2, LSN: 235719, ISO 9660 blocks: 235494
ISO 9660: 235494 blocks, label `TNT_ROM    
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The text &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;mixed mode CD&lt;/code&gt; appears twice in this report: on the line that directly follows the CD-TEXT entries, and on the 3rd line from the bottom.&lt;/p&gt;

&lt;h3 id=&quot;imaging-mixed-mode-cds&quot;&gt;Imaging mixed-mode CDs&lt;/h3&gt;

&lt;p&gt;&lt;a href=&quot;https://web.archive.org/web/20151202174511/http://linuxreviews.org/howtos/cdrecording/&quot;&gt;This article on the Linux Reviews site&lt;/a&gt; (archived link) contains instructions on how to rip a mixed-mode CD using &lt;em&gt;cdrdao&lt;/em&gt;. It involves a number of steps.&lt;/p&gt;

&lt;p&gt;First we have to unmount the disc:&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;umount /dev/sr0
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Then run &lt;em&gt;cdrdao&lt;/em&gt; with the following arguments:&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;cdrdao read-cd &lt;span class=&quot;nt&quot;&gt;--read-raw&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;--datafile&lt;/span&gt; toolstales.bin &lt;span class=&quot;nt&quot;&gt;--device&lt;/span&gt; /dev/sr0 &lt;span class=&quot;nt&quot;&gt;--driver&lt;/span&gt; generic-mmc-raw toolstales.toc
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The result of this is a disc image in &lt;em&gt;BIN/TOC&lt;/em&gt; format. The &lt;em&gt;.toc&lt;/em&gt; file looks like this:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-data&quot;&gt;CD_ROM_XA

CATALOG &quot;0000000000000&quot;

// Track 1
TRACK MODE2_RAW
NO COPY
DATAFILE &quot;toolstales.bin&quot; 52:20:69 // length in bytes: 554058288


// Track 2
TRACK AUDIO
NO COPY
NO PRE_EMPHASIS
TWO_CHANNEL_AUDIO
SILENCE 00:02:00
FILE &quot;toolstales.bin&quot; #554058288 0 14:58:22
START 00:02:00
&lt;/code&gt;&lt;/pre&gt;
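
&lt;p&gt;The length of the first track is given here as minutes:seconds:frames. As a quick consistency check, this value can be converted to bytes by hand. The sketch below assumes the usual 75 frames (sectors) per second, and 2352 bytes per raw sector (the image was read with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;--read-raw&lt;/code&gt;):&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# Track length from the .toc file, as minutes:seconds:frames
min=52; sec=20; frames=69
# 75 frames (sectors) per second
sectors=$(( (min * 60 + sec) * 75 + frames ))
# 2352 bytes per raw sector
echo $(( sectors * 2352 ))
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This prints 554058288, which matches the length in bytes in the comment that follows the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;DATAFILE&lt;/code&gt; entry of track 1.&lt;/p&gt;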

&lt;p&gt;The &lt;em&gt;BIN/TOC&lt;/em&gt; format is not easily accessible, so we need to do some additional post-processing to convert the image into a more accessible format.&lt;/p&gt;

&lt;p&gt;First we convert the &lt;em&gt;.toc&lt;/em&gt; file to the &lt;em&gt;.cue&lt;/em&gt; format (as defined in Appendix A of the &lt;a href=&quot;https://web.archive.org/web/20070614044112/http://www.goldenhawk.com/download/cdrwin.pdf&quot;&gt;&lt;em&gt;CDRWIN&lt;/em&gt; User Guide&lt;/a&gt;). For this we use the &lt;em&gt;toc2cue&lt;/em&gt; tool (which is part of &lt;em&gt;cdrdao&lt;/em&gt;):&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;toc2cue toolstales.toc toolstales.cue
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;We now have a &lt;em&gt;BIN/CUE&lt;/em&gt; image. On Linux we can mount this image with a virtual drive controller such as &lt;a href=&quot;https://cdemu.sourceforge.io/&quot;&gt;&lt;em&gt;cdemu&lt;/em&gt;&lt;/a&gt;. This way both the audio and the data are accessible in the same way they would be from the physical carrier.&lt;/p&gt;

&lt;p&gt;If needed, it is possible to extract the data track of the &lt;em&gt;BIN/CUE&lt;/em&gt; file to an ISO image, and any audio tracks to &lt;em&gt;WAVE&lt;/em&gt; files. For this we need the &lt;a href=&quot;http://linux.die.net/man/1/bchunk&quot;&gt;&lt;em&gt;bchunk&lt;/em&gt;&lt;/a&gt; tool, which we invoke with the following arguments:&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;bchunk &lt;span class=&quot;nt&quot;&gt;-s&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-w&lt;/span&gt; toolstales.bin toolstales.cue toolstales
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;In the example above, the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;-w&lt;/code&gt; option tells &lt;em&gt;bchunk&lt;/em&gt; to extract audio tracks to &lt;em&gt;WAVE&lt;/em&gt; files, and the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;-s&lt;/code&gt; option does a byte swap on the audio samples&lt;sup id=&quot;fnref:8&quot;&gt;&lt;a href=&quot;#fn:8&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;6&lt;/a&gt;&lt;/sup&gt;. The last argument defines the base name for all created output files. In this case the command results in two files: &lt;em&gt;toolstales01.iso&lt;/em&gt;, which is a mountable ISO image, and &lt;em&gt;toolstales02.wav&lt;/em&gt;, which is the audio track.&lt;/p&gt;
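
&lt;p&gt;To check that the extracted ISO image can indeed be mounted, something like the following could be used. This is a sketch only: the mount point &lt;em&gt;/tmp/toolstales&lt;/em&gt; is an arbitrary example, and mounting typically requires root privileges:&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# Create an (arbitrary) mount point
mkdir -p /tmp/toolstales
# Mount the ISO image read-only as a loop device
sudo mount -o loop,ro toolstales01.iso /tmp/toolstales
# List the top-level contents, then unmount again
ls /tmp/toolstales
sudo umount /tmp/toolstales
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;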

&lt;h3 id=&quot;identifying-enhanced-cds&quot;&gt;Identifying enhanced CDs&lt;/h3&gt;

&lt;p&gt;To identify enhanced (Blue Book) CDs, we again use &lt;em&gt;cd-info&lt;/em&gt;:&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;cd-info /dev/sr0
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Here’s the corresponding “CD Analysis Report” at the bottom of &lt;em&gt;cd-info&lt;/em&gt;’s output:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-data&quot;&gt;CD Analysis Report

CD-TEXT for Disc:
CD-TEXT for Track  1:
::  ::
CD-TEXT for Track 18:
CD-Plus/Extra   
session #2 starts at track 18, LSN: 163570, ISO 9660 blocks: 170006
ISO 9660: 170006 blocks, label `NO   
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Note &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;CD-Plus/Extra&lt;/code&gt;, which indicates this is a multi-session CD.&lt;/p&gt;
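
&lt;p&gt;Since both disc types are identified by a fixed string in &lt;em&gt;cd-info&lt;/em&gt;’s output, the check could be reduced to a one-liner. The sketch below assumes that the strings are printed exactly as in the two example reports shown in this post; if it produces no output, the disc is probably neither a mixed-mode nor an enhanced CD:&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# Print only the lines that identify a mixed-mode or enhanced CD
cd-info /dev/sr0 | grep -E &apos;mixed mode CD|CD-Plus/Extra&apos;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;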

&lt;h3 id=&quot;imaging-enhanced-cds&quot;&gt;Imaging enhanced CDs&lt;/h3&gt;

&lt;p&gt;The procedure for imaging enhanced CDs is largely identical to the one for mixed-mode CDs. However, a major limitation here is that &lt;em&gt;cdrdao&lt;/em&gt; is not able to combine the data/audio from both sessions into one disc image: running &lt;em&gt;cdrdao&lt;/em&gt; with the command-line arguments as shown in the previous section will only create an image of the first session! It is possible, though, to image both sessions separately into two image files. As an example, below are the steps I followed in an attempt to make a copy of They Might Be Giants’ &lt;a href=&quot;http://tmbw.net/wiki/No!&quot;&gt;&lt;em&gt;“No”&lt;/em&gt;&lt;/a&gt; album (which contains some video content). Again I first unmounted the disc:&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;umount /dev/sr0
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Then I used the below command to create an image of the first session (note the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;--session&lt;/code&gt; option):&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;cdrdao read-cd &lt;span class=&quot;nt&quot;&gt;--read-raw&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;--session&lt;/span&gt; 1 &lt;span class=&quot;nt&quot;&gt;--datafile&lt;/span&gt; no1.bin &lt;span class=&quot;nt&quot;&gt;--device&lt;/span&gt; /dev/sr0 &lt;span class=&quot;nt&quot;&gt;--driver&lt;/span&gt; generic-mmc-raw no1.toc
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;And then again for the second session:&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;cdrdao read-cd &lt;span class=&quot;nt&quot;&gt;--read-raw&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;--session&lt;/span&gt; 2 &lt;span class=&quot;nt&quot;&gt;--datafile&lt;/span&gt; no2.bin &lt;span class=&quot;nt&quot;&gt;--device&lt;/span&gt; /dev/sr0 &lt;span class=&quot;nt&quot;&gt;--driver&lt;/span&gt; generic-mmc-raw no2.toc
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Running &lt;em&gt;cdrdao&lt;/em&gt; twice like this, I was able to create two separate images with the audio and file system data, respectively.&lt;/p&gt;

&lt;p&gt;As in the mixed-mode example, both sessions are extracted as &lt;em&gt;BIN/TOC&lt;/em&gt; files, so again we use &lt;em&gt;toc2cue&lt;/em&gt; to convert to &lt;em&gt;BIN/CUE&lt;/em&gt;:&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;toc2cue no1.toc no1.cue
toc2cue no2.toc no2.cue
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;And then use &lt;em&gt;bchunk&lt;/em&gt; to extract ISO and &lt;em&gt;WAVE&lt;/em&gt; files from the &lt;em&gt;BIN/CUE&lt;/em&gt; images:&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;bchunk &lt;span class=&quot;nt&quot;&gt;-s&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-w&lt;/span&gt; no1.bin no1.cue no1
bchunk &lt;span class=&quot;nt&quot;&gt;-s&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-w&lt;/span&gt; no2.bin no2.cue no2
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
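
&lt;p&gt;For a disc with two sessions, the commands above can be wrapped in a simple loop. This is just a convenience sketch that repeats the exact same commands for each session; it assumes that the disc has already been unmounted:&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;for s in 1 2; do
    # Image session $s to BIN/TOC
    cdrdao read-cd --read-raw --session &quot;$s&quot; --datafile &quot;no$s.bin&quot; --device /dev/sr0 --driver generic-mmc-raw &quot;no$s.toc&quot;
    # Convert the .toc file to .cue
    toc2cue &quot;no$s.toc&quot; &quot;no$s.cue&quot;
    # Extract ISO and WAVE files from the BIN/CUE image
    bchunk -s -w &quot;no$s.bin&quot; &quot;no$s.cue&quot; &quot;no$s&quot;
done
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;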

&lt;p&gt;One thing to watch out for is that in most cases ISO images from an enhanced CD cannot directly be accessed or mounted. The reason for this is that the sector offsets that point to the files in the image are defined &lt;em&gt;relative to the beginning of the physical disc&lt;/em&gt;, and not &lt;em&gt;relative to the start of the image&lt;/em&gt;! More details on this can be found &lt;a href=&quot;/2017/04/25/imaging-cd-extra-blue-book-discs&quot;&gt;in this blog post&lt;/a&gt;, which also describes a workaround that allows one to access such images under Linux.&lt;/p&gt;

&lt;h2 id=&quot;additional-material&quot;&gt;Additional material&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;The rough, unedited notes on which this blog post is based can be found &lt;a href=&quot;https://gist.github.com/bitsgalore/1bea8f015eca21a706e7#file-notescdimaging-md&quot;&gt;here&lt;/a&gt; (they contain some additional material that I left out here for readability).&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;The &lt;a href=&quot;https://www.gnu.org/software/ddrescue/manual/ddrescue_manual.html#Optical-media&quot;&gt;User Manual of &lt;em&gt;ddrescue&lt;/em&gt;&lt;/a&gt; gives some useful additional examples of how this tool can be used to recover data from a faulty CD-ROM.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://www.mistys-internet.website/blog/blog/2024/09/13/the-working-archivists-guide-to-enthusiast-cd-rom-archiving-tools/&quot;&gt;The Working Archivist’s Guide to Enthusiast CD-ROM Archiving Tools&lt;/a&gt; provides an up to date (2024) overview of the imaging and ripping tools landscape.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;revision-history&quot;&gt;Revision history&lt;/h2&gt;

&lt;h3 id=&quot;june-2019&quot;&gt;June 2019&lt;/h3&gt;

&lt;ul&gt;
  &lt;li&gt;Revised section on checksum verification, and added explanation that in practice this is not very useful.&lt;/li&gt;
  &lt;li&gt;Added references to &lt;em&gt;Isolyzer&lt;/em&gt;.&lt;/li&gt;
  &lt;li&gt;The section on mixed-mode and multisession discs confusingly mixed up both types of carriers. I have clarified the distinction between them, and the instructions for imaging them are now in separate sections.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;november-2024&quot;&gt;November 2024&lt;/h3&gt;

&lt;ul&gt;
  &lt;li&gt;Added references to The Working Archivist’s Guide to Enthusiast CD-ROM Archiving Tools, Redumper and Aaru.&lt;/li&gt;
&lt;/ul&gt;

&lt;hr /&gt;

&lt;p&gt;Originally published at the &lt;a href=&quot;http://blog.kbresearch.nl/2015/11/13/preserving-optical-media-from-the-command-line/&quot;&gt;KB Research blog&lt;/a&gt;&lt;/p&gt;
&lt;div class=&quot;footnotes&quot; role=&quot;doc-endnotes&quot;&gt;
  &lt;ol&gt;
    &lt;li id=&quot;fn:2&quot;&gt;
      &lt;p&gt;Whether the resulting image will conform to ISO 9660 depends on the source medium, as the image is simply a byte-exact copy of the data on the physical carrier’s file system. So for a DVD that uses the &lt;a href=&quot;https://en.wikipedia.org/wiki/Universal_Disk_Format&quot;&gt;UDF&lt;/a&gt; format, the ISO image will be UDF as well. &lt;a href=&quot;#fnref:2&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:4&quot;&gt;
      &lt;p&gt;If you don’t do this you will end up with this error: &lt;em&gt;Error trying to open /dev/sr0 exclusively (Device or resource busy)… retrying in 1 second.&lt;/em&gt; &lt;a href=&quot;#fnref:4&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:7&quot;&gt;
      &lt;p&gt;This is a pretty arbitrary value, and you can use whatever value you like. &lt;a href=&quot;#fnref:7&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:6&quot;&gt;
      &lt;p&gt;It is important that the names of the ISO and mapping file are identical to those used in the previous &lt;em&gt;ddrescue&lt;/em&gt; run. This allows the tool to process &lt;em&gt;only&lt;/em&gt; the problematic sectors (and skip everything else). &lt;a href=&quot;#fnref:6&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:5&quot;&gt;
      &lt;p&gt;Strangely, in my tests a parse error occurred when I specified user-defined file names here. Also, it appeared that the summary log file resulted in more detailed output than the detailed one. This needs a more in-depth look! &lt;a href=&quot;#fnref:5&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:8&quot;&gt;
      &lt;p&gt;I initially omitted this, and ended up with &lt;em&gt;WAVE&lt;/em&gt; files that all played as static noise! This is a bit odd, since according to its &lt;a href=&quot;http://www-mmsp.ece.mcgill.ca/Documents/AudioFormats/WAVE/WAVE.html&quot;&gt;specification&lt;/a&gt; the &lt;em&gt;WAVE&lt;/em&gt; format is little-Endian by definition. &lt;a href=&quot;#fnref:8&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
  &lt;/ol&gt;
&lt;/div&gt;
</description>
                <link>https://bitsgalore.org/2015/11/13/preserving-optical-media-from-the-command-line</link>
                <guid>https://bitsgalore.org/2015/11/13/preserving-optical-media-from-the-command-line</guid>
                <pubDate>2015-11-13T00:00:00+01:00</pubDate>
        </item>

        <item>
                <title>Response to report on JPEG 2000 expert round table</title>
                <description>&lt;p&gt;Today my attention was caught by &lt;a href=&quot;https://www.townswebarchiving.com/2015/10/jpeg2000-and-digitisation-expert-round-table/&quot;&gt;this report of an “Expert round table” on JPEG2000 and Digitisation&lt;/a&gt;, which was published on the TownsWeb Archiving blog. Although the report as a whole is quite balanced, it’s unfortunate that it provides fuel to some long-running myths about JPEG 2000 not supporting fully lossless compression. Since I wasn’t able to leave a comment on the TownsWeb blog itself, I turned my response into this small blog post.&lt;/p&gt;

&lt;!-- more --&gt;

&lt;h2 id=&quot;visually-lossless-vs-mathematically-lossless&quot;&gt;Visually lossless vs mathematically lossless&lt;/h2&gt;

&lt;p&gt;For a start, &lt;strong&gt;Dave Thompson&lt;/strong&gt; says about JPEG 2000:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;Its wavelet based technology means that it can be used in a compressed format which is visually lossless.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;While this statement is true, it may confuse some people, since  it doesn’t mention that &lt;em&gt;mathematically&lt;/em&gt; lossless compression is supported as well (in which case decoding the image returns the &lt;em&gt;exact&lt;/em&gt; pixel values as they were prior to compression). Also, “visually lossless” compression is just a lossy compression that results in compression errors that are not detectable to the eye (see also the definition &lt;a href=&quot;http://www.digitizationguidelines.gov/term.php?term=compressionvisuallylossless&quot;&gt;here&lt;/a&gt;). This is not unique to JPEG 2000, and there’s nothing that stops you from implementing visually lossless compression with “ordinary” JPEG, even though this would be pretty inefficient when compared to JPEG 2000.&lt;/p&gt;

&lt;h2 id=&quot;suitability-for-preservation&quot;&gt;Suitability for preservation&lt;/h2&gt;

&lt;p&gt;More seriously, &lt;strong&gt;Paul Sugden&lt;/strong&gt; questions JPEG 2000’s suitability as a preservation format. The reason behind his concerns is a conference talk he attended:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;I attended a conference last year where an expert seemed to demonstrate that it was not possible to convert a JPEG2000 image precisely back to the original lossless TIFF file from which it was created. His example showed that after the retro conversion the file size of the newly created TIFF was different to the original TIFF as captured, 
and there were also visible differences between the images (albeit minor differences).&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;From this he concludes:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;In my opinion, a file format that does not offer accurate retro-conversion back to precisely the image that was originally captured certainly cannot be seen as a reliable preservation format.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This is quite a bold statement, especially given that it is based on only &lt;em&gt;one&lt;/em&gt; report (of which Sugden doesn’t provide any specific information). Having done some pretty extensive testing with different encoders and decoders, I suspect the &lt;em&gt;real&lt;/em&gt; problem here is just some bug in a specific encoder. I’ve encountered similar problems myself (e.g. see &lt;a href=&quot;https://github.com/bitsgalore/jpegToLosslessJP2&quot;&gt;here&lt;/a&gt;), but those are just software bugs, which say &lt;em&gt;nothing about the format itself&lt;/em&gt;, nor about its  general suitability as a preservation format.&lt;/p&gt;

&lt;p&gt;So, without any supporting evidence I don’t see much that justifies the sweeping generalisations that are made in the blog post (but I’m open to be proven wrong!).&lt;/p&gt;

&lt;h2 id=&quot;verifying-lossless-image-migrations&quot;&gt;Verifying lossless image migrations&lt;/h2&gt;

&lt;p&gt;To be clear: JPEG 2000 fully supports completely lossless compression. The “losslessness” is also easy to verify using pixel-wise comparisons between source and destination images (e.g. using ImageMagick’s “compare” tool, some examples &lt;a href=&quot;https://github.com/bitsgalore/jpegToLosslessJP2&quot;&gt;here&lt;/a&gt;). That said, not all encoders handle things like embedded metadata and ICC profiles equally well. Also, JPEG 2000’s baseline JP2 format has some restrictions on the embedding of ICC profiles, which can be a problem in very specific cases (e.g. when the ICC profile of the source TIFF you want to convert is not supported by JP2).&lt;/p&gt;
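
&lt;p&gt;As an illustration of such a pixel-wise comparison, the sketch below uses ImageMagick’s &lt;em&gt;compare&lt;/em&gt; tool with the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;AE&lt;/code&gt; (“absolute error”) metric, which reports the number of pixels that differ between both images; a value of 0 means that all pixel values are identical. The file names are placeholders, and this assumes an ImageMagick build with JPEG 2000 support (in ImageMagick 7 the tool is invoked as &lt;em&gt;magick compare&lt;/em&gt;):&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# Count the pixels that differ between source and destination image;
# &quot;null:&quot; discards the difference image that compare would otherwise write
compare -metric AE source.tif lossless.jp2 null:
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;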
</description>
                <link>https://bitsgalore.org/2015/10/19/Response-to-report-on-JPEG-2000-expert-round-table</link>
                <guid>https://bitsgalore.org/2015/10/19/Response-to-report-on-JPEG-2000-expert-round-table</guid>
                <pubDate>2015-10-19T00:00:00+02:00</pubDate>
        </item>

        <item>
                <title>Why PDF/A validation matters, even if you don't have PDF/A - Part 2</title>
                <description>&lt;p&gt;This is the second and final instalment of a 2-part blog on the use of PDF/A validators for identifying preservation risks in PDF. You can read the first part &lt;a href=&quot;/2015/07/07/why-pdfa-validation-matters-even-if-you-dont-have-pdfa&quot;&gt;here&lt;/a&gt;. In Part 1 I showed how PDF/A validators can be used to identify preservation risks in a PDF. I illustrated this with an example that uses the PDF/A validator component of Adobe Acrobat’s Preflight tool. Needless to say, Acrobat is not scalable to situations where you need to verify large volumes of PDFs. Luckily, several stand-alone PDF/A validators exist that are designed especially to do just that.&lt;/p&gt;

&lt;!-- more --&gt;

&lt;h2 id=&quot;apache-preflight&quot;&gt;Apache Preflight&lt;/h2&gt;

&lt;p&gt;During the &lt;a href=&quot;http://www.scape-project.eu/&quot;&gt;SCAPE&lt;/a&gt; project we did a number of experiments with the PDF/A validator that is part of the open-source &lt;a href=&quot;https://pdfbox.apache.org/&quot;&gt;Apache PDFBox&lt;/a&gt; library (incidentally it is also called Preflight). Throwing the PDF of our last example at Apache Preflight results in the following output&lt;sup id=&quot;fnref:1&quot;&gt;&lt;a href=&quot;#fn:1&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;1&lt;/a&gt;&lt;/sup&gt;:&lt;/p&gt;

&lt;div class=&quot;language-xml highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;cp&quot;&gt;&amp;lt;?xml version=&quot;1.0&quot; encoding=&quot;UTF-8&quot; standalone=&quot;no&quot;?&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;preflight&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;name=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;Jpeg_linked.pdf&quot;&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;&amp;gt;&lt;/span&gt;
  &lt;span class=&quot;nt&quot;&gt;&amp;lt;executionTimeMS&amp;gt;&lt;/span&gt;9792&lt;span class=&quot;nt&quot;&gt;&amp;lt;/executionTimeMS&amp;gt;&lt;/span&gt;
  &lt;span class=&quot;nt&quot;&gt;&amp;lt;isValid&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;type=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;PDF/A1-b&quot;&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;&amp;gt;&lt;/span&gt;false&lt;span class=&quot;nt&quot;&gt;&amp;lt;/isValid&amp;gt;&lt;/span&gt;
  &lt;span class=&quot;nt&quot;&gt;&amp;lt;errors&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;count=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;96&quot;&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;&amp;lt;error&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;count=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;1&quot;&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;&amp;gt;&lt;/span&gt;
      &lt;span class=&quot;nt&quot;&gt;&amp;lt;code&amp;gt;&lt;/span&gt;3.1.3&lt;span class=&quot;nt&quot;&gt;&amp;lt;/code&amp;gt;&lt;/span&gt;
      &lt;span class=&quot;nt&quot;&gt;&amp;lt;details&amp;gt;&lt;/span&gt;Invalid Font definition, CourierNewPSMT: FontFile entry is missing from FontDescriptor&lt;span class=&quot;nt&quot;&gt;&amp;lt;/details&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;&amp;lt;/error&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;&amp;lt;error&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;count=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;1&quot;&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;&amp;gt;&lt;/span&gt;
      &lt;span class=&quot;nt&quot;&gt;&amp;lt;code&amp;gt;&lt;/span&gt;7.11&lt;span class=&quot;nt&quot;&gt;&amp;lt;/code&amp;gt;&lt;/span&gt;
      &lt;span class=&quot;nt&quot;&gt;&amp;lt;details&amp;gt;&lt;/span&gt;Error on MetaData, PDF/A identification schema http://www.aiim.org/pdfa/ns/id/ is missing&lt;span class=&quot;nt&quot;&gt;&amp;lt;/details&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;&amp;lt;/error&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;&amp;lt;error&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;count=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;3&quot;&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;&amp;gt;&lt;/span&gt;
      &lt;span class=&quot;nt&quot;&gt;&amp;lt;code&amp;gt;&lt;/span&gt;6.2.1&lt;span class=&quot;nt&quot;&gt;&amp;lt;/code&amp;gt;&lt;/span&gt;
      &lt;span class=&quot;nt&quot;&gt;&amp;lt;details&amp;gt;&lt;/span&gt;Action is forbidden, GoToPage isn&apos;t authorized as named action&lt;span class=&quot;nt&quot;&gt;&amp;lt;/details&amp;gt;&lt;/span&gt;
      &lt;span class=&quot;nt&quot;&gt;&amp;lt;page&amp;gt;&lt;/span&gt;0&lt;span class=&quot;nt&quot;&gt;&amp;lt;/page&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;&amp;lt;/error&amp;gt;&lt;/span&gt;

    ::
    ::

    &lt;span class=&quot;nt&quot;&gt;&amp;lt;error&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;count=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;1&quot;&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;&amp;gt;&lt;/span&gt;
      &lt;span class=&quot;nt&quot;&gt;&amp;lt;code&amp;gt;&lt;/span&gt;1.4.2&lt;span class=&quot;nt&quot;&gt;&amp;lt;/code&amp;gt;&lt;/span&gt;
      &lt;span class=&quot;nt&quot;&gt;&amp;lt;details&amp;gt;&lt;/span&gt;Trailer Syntax error, The trailer dictionary contains Encrypt&lt;span class=&quot;nt&quot;&gt;&amp;lt;/details&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;&amp;lt;/error&amp;gt;&lt;/span&gt;
  &lt;span class=&quot;nt&quot;&gt;&amp;lt;/errors&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;/preflight&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h2 id=&quot;assessment-against-a-technical-profile--policy&quot;&gt;Assessment against a technical profile / policy&lt;/h2&gt;

&lt;p&gt;By post-processing Preflight’s XML output further, it is possible to automatically evaluate PDFs against a user-defined set of features (i.e. a technical profile, equivalent to what was known as a &lt;a href=&quot;http://openpreservation.org/blog/2013/09/04/control-policies-scape-project/&quot;&gt;&lt;em&gt;control policy&lt;/em&gt;&lt;/a&gt; in the SCAPE project). This is pretty straightforward if you express all features (or policy elements) as &lt;a href=&quot;https://en.wikipedia.org/wiki/Schematron&quot;&gt;Schematron&lt;/a&gt; rules. Here’s an example of a Schematron rule that checks for encryption:&lt;/p&gt;

&lt;div class=&quot;language-xml highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;cp&quot;&gt;&amp;lt;?xml version=&quot;1.0&quot;?&amp;gt;&lt;/span&gt;
&lt;span class=&quot;c&quot;&gt;&amp;lt;!--
Schematron rules for policy-based validation of PDF, based on output of Apache Preflight.
--&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;s:schema&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;xmlns:s=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;http://purl.oclc.org/dsdl/schematron&quot;&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;&amp;gt;&lt;/span&gt;
  
  &lt;span class=&quot;nt&quot;&gt;&amp;lt;s:pattern&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;name=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;Checks for encryption&quot;&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;&amp;gt;&lt;/span&gt;        
    &lt;span class=&quot;nt&quot;&gt;&amp;lt;s:rule&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;context=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;/preflight/errors/error&quot;&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;&amp;gt;&lt;/span&gt;
      &lt;span class=&quot;nt&quot;&gt;&amp;lt;s:assert&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;test=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;not(code = &apos;1.0&apos; and contains(details,&apos;password&apos;))&quot;&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;&amp;gt;&lt;/span&gt;Open password&lt;span class=&quot;nt&quot;&gt;&amp;lt;/s:assert&amp;gt;&lt;/span&gt;
      &lt;span class=&quot;nt&quot;&gt;&amp;lt;s:assert&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;test=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;not(code = &apos;1.4.2&apos;)&quot;&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;&amp;gt;&lt;/span&gt;Encryption&lt;span class=&quot;nt&quot;&gt;&amp;lt;/s:assert&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;&amp;lt;/s:rule&amp;gt;&lt;/span&gt;
  &lt;span class=&quot;nt&quot;&gt;&amp;lt;/s:pattern&amp;gt;&lt;/span&gt;

&lt;span class=&quot;nt&quot;&gt;&amp;lt;/s:schema&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Rules can be defined for other features as well (e.g. multimedia, fonts), which makes it possible to test against custom policies. The figure below illustrates the general procedure:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/2015/07/preflightflow.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;A simple shell-script demo that implements the above workflow can be found &lt;a href=&quot;https://github.com/openpreserve/pdfPolicyValidate&quot;&gt;here&lt;/a&gt;.&lt;/p&gt;
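
&lt;p&gt;As a minimal sketch of what a rule for another feature could look like, the pattern below flags the font error from the Preflight extract shown earlier (code &lt;em&gt;3.1.3&lt;/em&gt;, “FontFile entry is missing from FontDescriptor”). It assumes that this code and message text are what Preflight reports for a font that is not embedded; other font problems may well be reported under different codes, so a real policy would probably need additional assertions. The pattern is meant to sit inside the &lt;em&gt;s:schema&lt;/em&gt; element, next to the encryption pattern:&lt;/p&gt;

&lt;div class=&quot;language-xml highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&amp;lt;!--
Sketch only: the error code and message text are taken from the Preflight
extract above, and are assumed to identify fonts that are not embedded.
--&amp;gt;
&amp;lt;s:pattern name=&quot;Checks for fonts that are not embedded&quot;&amp;gt;
  &amp;lt;s:rule context=&quot;/preflight/errors/error&quot;&amp;gt;
    &amp;lt;s:assert test=&quot;not(code = &apos;3.1.3&apos; and contains(details,&apos;FontFile entry is missing&apos;))&quot;&amp;gt;Font not embedded&amp;lt;/s:assert&amp;gt;
  &amp;lt;/s:rule&amp;gt;
&amp;lt;/s:pattern&amp;gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Within a single pattern, Schematron applies only the first matching rule to any given node, so keeping the font check in a pattern of its own means it is evaluated independently of the encryption rule.&lt;/p&gt;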

&lt;h2 id=&quot;test-with-govdocs1-corpus&quot;&gt;Test with Govdocs1 corpus&lt;/h2&gt;

&lt;p&gt;As part of the SCAPE work, we tested whether we could use Preflight in this way to assess a large set of PDFs. For this we used about 15,000 PDFs from the &lt;a href=&quot;http://digitalcorpora.org/corpora/govdocs&quot;&gt;Govdocs1 corpus&lt;/a&gt;. We tried to assess these PDFs against a user-defined policy, which was made up of the following elements:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;No encryption or password protection&lt;/li&gt;
  &lt;li&gt;Fonts must be embedded and complete&lt;/li&gt;
  &lt;li&gt;No JavaScript&lt;/li&gt;
  &lt;li&gt;No &lt;a href=&quot;/2013/01/09/what-do-we-mean-embedded-files-pdf&quot;&gt;embedded files&lt;/a&gt; (i.e. file attachments)&lt;/li&gt;
  &lt;li&gt;No multimedia content (audio, video, 3-D objects)&lt;/li&gt;
  &lt;li&gt;File should be valid PDF&lt;sup id=&quot;fnref:2&quot;&gt;&lt;a href=&quot;#fn:2&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;2&lt;/a&gt;&lt;/sup&gt;.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The somewhat disappointing result of this exercise was that only 26% of all PDFs in the dataset satisfied all criteria in our test policy! Closer inspection of the Preflight output showed that the majority of the validation errors behind this result were related to fonts. Preflight is able to report on many different font-related errors, but their exact meaning is not always clear, and neither is their impact on the rendering process. This made it difficult to establish whether the results reflected the quality of the PDFs, or whether our assessment was simply too strict on font errors.&lt;/p&gt;

&lt;h2 id=&quot;way-forward-verapdf&quot;&gt;Way forward: VeraPDF&lt;/h2&gt;

&lt;p&gt;The &lt;a href=&quot;/2014/01/27/identification-pdf-preservation-risks-analysis-govdocs-selected-corpus&quot;&gt;original report on the Govdocs1 analysis&lt;/a&gt; ended with the following conclusions:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;These preliminary results show that policy-based assessment of PDF is possible using a combination of Apache Preflight and Schematron. However, dealing with &lt;strong&gt;font issues&lt;/strong&gt; appears to be a particular challenge. Also, the lack of reliable tools to test for &lt;strong&gt;overall conformity to PDF (e.g. ISO 32000)&lt;/strong&gt; is still a major limitation. Another limitation of this analysis is the lack of &lt;strong&gt;ground truth&lt;/strong&gt;, which makes it difficult to assess the accuracy of the results.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Earlier this year work started on &lt;a href=&quot;http://verapdf.org/&quot;&gt;VeraPDF&lt;/a&gt;, an open-source PDF/A validator that, like Preflight, will be part of the PDFBox library. Its development is funded by the EU &lt;a href=&quot;http://www.preforma-project.eu/&quot;&gt;PREFORMA&lt;/a&gt; project. The consortium that is behind the software includes the &lt;a href=&quot;http://www.pdfa.org/&quot;&gt;PDF Association&lt;/a&gt;, whose member base covers a wide spectrum of vendors that already implement PDF technology. Although the VeraPDF work is still in its early stages, it’s interesting to see how it could help in solving the issues that we identified as part of the SCAPE work.&lt;/p&gt;

&lt;h3 id=&quot;font-issues&quot;&gt;Font issues&lt;/h3&gt;

&lt;p&gt;As I’m writing this, font checks haven’t been implemented yet in the VeraPDF code. Nevertheless, &lt;a href=&quot;https://github.com/veraPDF/veraPDF-validation-profiles&quot;&gt;validation profiles&lt;/a&gt; already exist for a number of aspects of PDF/A. These profiles contain one or more validation rules, and each rule explicitly references its corresponding clause in the PDF/A standard. For example, have a look at &lt;a href=&quot;https://github.com/veraPDF/veraPDF-validation-profiles/blob/master/PDF_A/1b/6.2%20Graphics/6.2.4%20Images/verapdf-profile-6-2-4-t01.xml&quot;&gt;this rule on images&lt;/a&gt;:&lt;/p&gt;

&lt;div class=&quot;language-xml highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;cp&quot;&gt;&amp;lt;?xml version=&quot;1.0&quot; encoding=&quot;UTF-8&quot;?&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;profile&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;xmlns=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;http://www.verapdf.org/ValidationProfile&quot;&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;model=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;org.verapdf.model.PDFA1a&quot;&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;&amp;lt;name&amp;gt;&lt;/span&gt;ISO 19005-1:2005 - 6.2.4 Images - Alternates&lt;span class=&quot;nt&quot;&gt;&amp;lt;/name&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;&amp;lt;description&amp;gt;&amp;lt;/description&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;&amp;lt;creator&amp;gt;&lt;/span&gt;veraPDF Consortium&lt;span class=&quot;nt&quot;&gt;&amp;lt;/creator&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;&amp;lt;created&amp;gt;&lt;/span&gt;2015-06-16T22:22:45Z&lt;span class=&quot;nt&quot;&gt;&amp;lt;/created&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;&amp;lt;hash&amp;gt;&lt;/span&gt;sha-1 hash code&lt;span class=&quot;nt&quot;&gt;&amp;lt;/hash&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;&amp;lt;rules&amp;gt;&lt;/span&gt;
        &lt;span class=&quot;nt&quot;&gt;&amp;lt;rule&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;id=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;6-2-4-t01&quot;&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;object=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;PDXImage&quot;&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;&amp;gt;&lt;/span&gt;
            &lt;span class=&quot;nt&quot;&gt;&amp;lt;description&amp;gt;&lt;/span&gt;An Image dictionary shall not contain the Alternates key&lt;span class=&quot;nt&quot;&gt;&amp;lt;/description&amp;gt;&lt;/span&gt;
            &lt;span class=&quot;nt&quot;&gt;&amp;lt;test&amp;gt;&lt;/span&gt;Alternates_size == 0&lt;span class=&quot;nt&quot;&gt;&amp;lt;/test&amp;gt;&lt;/span&gt;
            &lt;span class=&quot;nt&quot;&gt;&amp;lt;error&amp;gt;&lt;/span&gt;
                &lt;span class=&quot;nt&quot;&gt;&amp;lt;message&amp;gt;&lt;/span&gt;Alternates key is present in the Image dictionary(&lt;span class=&quot;nt&quot;&gt;&amp;lt;/message&amp;gt;&lt;/span&gt;
            &lt;span class=&quot;nt&quot;&gt;&amp;lt;/error&amp;gt;&lt;/span&gt;
            &lt;span class=&quot;nt&quot;&gt;&amp;lt;reference&amp;gt;&lt;/span&gt;
                &lt;span class=&quot;nt&quot;&gt;&amp;lt;specification&amp;gt;&lt;/span&gt;ISO19005-1&lt;span class=&quot;nt&quot;&gt;&amp;lt;/specification&amp;gt;&lt;/span&gt;
                &lt;span class=&quot;nt&quot;&gt;&amp;lt;clause&amp;gt;&lt;/span&gt;6.2.4&lt;span class=&quot;nt&quot;&gt;&amp;lt;/clause&amp;gt;&lt;/span&gt;
            &lt;span class=&quot;nt&quot;&gt;&amp;lt;/reference&amp;gt;&lt;/span&gt;
        &lt;span class=&quot;nt&quot;&gt;&amp;lt;/rule&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;&amp;lt;/rules&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;/profile&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Here, the &lt;em&gt;clause&lt;/em&gt; field in the &lt;em&gt;reference&lt;/em&gt; element refers to a specific clause in the PDF/A-1 (ISO 19005-1) specification. This makes the errors much easier to interpret, since they are directly linked to the standard. I expect that this will make the interpretation of font-related errors much clearer as well.&lt;/p&gt;

&lt;h3 id=&quot;conformance-to-canonical-pdf&quot;&gt;Conformance to canonical PDF&lt;/h3&gt;

&lt;p&gt;A PDF may satisfy all requirements of PDF/A, and still be broken. An example is file &lt;a href=&quot;https://github.com/openpreserve/format-corpus/blob/master/pdfCabinetOfHorrors/veraPDFHiResWrongObjectID.pdf?raw=true&quot;&gt;veraPDFHiResWrongObjectID.pdf&lt;/a&gt;. If you open it in Acrobat you will see this:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/2015/07/wrongobjectid.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Nevertheless, Apache Preflight considers this to be “valid” PDF/A:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/2015/07/wrongobjectidpreflight.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;The reason for this is that the structure of this file is broken at a deeper level than the one addressed by the (relatively high-level) PDF/A profiles&lt;sup id=&quot;fnref:3&quot;&gt;&lt;a href=&quot;#fn:3&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;3&lt;/a&gt;&lt;/sup&gt;. Although the current funding for VeraPDF only addresses PDF/A, the &lt;a href=&quot;http://www.openpreservation.org/documents/public/veraPDF_FunctionalTechnicalSpecification_v1.0.pdf&quot;&gt;veraPDF Technical and Functional Specification&lt;/a&gt; stresses that its validation model is extensible, and this would ultimately allow more elaborate validation. From p. 16 of the document:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;The veraPDF model encourages plug-ins for parsing not only PDF/A-related third-party data structures (…),
but also for other features in ISO 32000, other ISO standards for PDF such as PDF/E or PRC, images, and for embedded content such as rich media or attachments (…)&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This suggests that ultimately, VeraPDF has the potential to develop into a full-fledged canonical (ISO 32000) PDF validator. Obviously this would be a huge task that would require substantial additional effort and funding, but it’s encouraging to see that the overall design already allows for such a move.&lt;/p&gt;

&lt;h3 id=&quot;ground-truth&quot;&gt;Ground truth&lt;/h3&gt;

&lt;p&gt;During the SCAPE project we often struggled to find suitable openly licensed test files. In fact, for much of the policy-based assessment work we relied on files on the &lt;a href=&quot;http://acroeng.adobe.com&quot;&gt;Adobe Acrobat Engineering website&lt;/a&gt;, which is a true treasure trove of PDFs with exotic features. Or rather &lt;em&gt;was&lt;/em&gt;, as the site’s been offline for at least several weeks now, and it’s unclear when (if?) it will be back&lt;sup id=&quot;fnref:4&quot;&gt;&lt;a href=&quot;#fn:4&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;4&lt;/a&gt;&lt;/sup&gt;. Back in 2013, the BL’s Andy Jackson already &lt;a href=&quot;https://forums.adobe.com/thread/1262403?tstart=0&quot;&gt;inquired about the license terms of those files&lt;/a&gt;, and Adobe’s response was that although the files were free to use, redistribution was not allowed. Fast-forward two years, and the files are gone! Internet Archive has &lt;a href=&quot;https://web.archive.org/web/*/http://acroeng.adobe.com&quot;&gt;several snapshots of the site&lt;/a&gt;, but they are incomplete and do not include all sample files.&lt;/p&gt;

&lt;p&gt;This poignantly illustrates the importance of test data that are available under a sufficiently open license that allows redistribution. I’m happy to see that the VeraPDF initiative includes work on the production of a number of openly-licensed test corpora (see also sections CE 3.2 and TS 6.2 of the &lt;a href=&quot;http://www.openpreservation.org/documents/public/veraPDF_FunctionalTechnicalSpecification_v1.0.pdf&quot;&gt;Technical and Functional Specification&lt;/a&gt;).&lt;/p&gt;

&lt;h2 id=&quot;conclusion&quot;&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;In this blog series I’ve given a brief overview of some preservation risks of the PDF format, and I showed how PDF/A validators can be used to identify such risks, &lt;em&gt;even in files that are not really PDF/A&lt;/em&gt;. I also explained the main problems we encountered while trying to use the open-source Apache Preflight PDF/A validator to identify preservation risks in a large collection of PDFs. The new VeraPDF initiative is still in its early stages, but it appears to be addressing most of these issues. Therefore it would be interesting to apply it to some of the datasets that we used for SCAPE, once the software is more fully developed.&lt;/p&gt;

&lt;h2 id=&quot;further-resources&quot;&gt;Further resources&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;/2015/07/07/why-pdfa-validation-matters-even-if-you-dont-have-pdfa&quot;&gt;Why PDF/A validation matters, even if you don’t have PDF/A (Part 1)&lt;/a&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;http://wiki.opf-labs.org/display/TR/Portable+Document+Format&quot;&gt;Portable Document Format in the &lt;em&gt;OPF File Format Risk Registry&lt;/em&gt;&lt;/a&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://github.com/openpreserve/format-corpus/tree/master/pdfCabinetOfHorrors&quot;&gt;The Archivist’s PDF Cabinet of Horrors (test corpus)&lt;/a&gt;&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;hr /&gt;

&lt;p&gt;Originally published at the &lt;a href=&quot;https://openpreservation.org/blog/2015/07/08/why-pdfa-validation-matters-part-2/&quot;&gt;Open Preservation Foundation blog&lt;/a&gt;&lt;/p&gt;
&lt;div class=&quot;footnotes&quot; role=&quot;doc-endnotes&quot;&gt;
  &lt;ol&gt;
    &lt;li id=&quot;fn:1&quot;&gt;
      &lt;p&gt;This is only an extract from the complete output file, which is much larger. &lt;a href=&quot;#fnref:1&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:2&quot;&gt;
      &lt;p&gt;Preflight does not perform canonical PDF validation, but it does do some additional checks beyond PDF/A, hence the “should” rather than “must”. &lt;a href=&quot;#fnref:2&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:3&quot;&gt;
      &lt;p&gt;More precisely, I deliberately changed the &lt;a href=&quot;https://raw.githubusercontent.com/corkami/pics/master/PDF.png&quot;&gt;object reference&lt;/a&gt; to an image to a nonsense value. Incidentally, Acrobat Preflight &lt;em&gt;does&lt;/em&gt; detect this error, which means that it checks at least &lt;em&gt;some&lt;/em&gt; aspects of canonical PDF. &lt;a href=&quot;#fnref:3&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:4&quot;&gt;
      &lt;p&gt;Adobe’s web team &lt;a href=&quot;https://twitter.com/bitsgalore/status/615485792375468032&quot;&gt;are aware of the issue&lt;/a&gt;, but it’s not clear when the site will be back (if at all). &lt;a href=&quot;#fnref:4&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
  &lt;/ol&gt;
&lt;/div&gt;
</description>
                <link>https://bitsgalore.org/2015/07/08/why-pdfa-validation-matters-part-2</link>
                <guid>https://bitsgalore.org/2015/07/08/why-pdfa-validation-matters-part-2</guid>
                <pubDate>2015-07-08T00:00:00+02:00</pubDate>
        </item>

        <item>
                <title>Why PDF/A validation matters, even if you don't have PDF/A</title>
                <description>&lt;p&gt;This is the first instalment of a 2-part blog. It was prompted by the upcoming Digital Preservation Coalition briefing &lt;a href=&quot;http://www.dpconline.org/events/details/95-preserving-pdfs-jul15&quot;&gt;&lt;em&gt;When is a PDF not a PDF?&lt;/em&gt;&lt;/a&gt;, for which I was asked to prepare a presentation. My initial idea was to give an overview of the work we did on PDF preservation risk assessment using a PDF/A validator in the &lt;a href=&quot;http://www.scape-project.eu/&quot;&gt;SCAPE&lt;/a&gt; project. Most of this has already been &lt;a href=&quot;/2012/12/19/identification-pdf-preservation-risks-apache-preflight-first-impression&quot;&gt;covered&lt;/a&gt; by a &lt;a href=&quot;/2013/07/25/identification-pdf-preservation-risks-sequel&quot;&gt;series&lt;/a&gt; of &lt;a href=&quot;/2014/01/27/identification-pdf-preservation-risks-analysis-govdocs-selected-corpus&quot;&gt;earlier blog posts&lt;/a&gt;. Those blogs very much represent different stages of a work in progress, and I think this makes them somewhat challenging for readers who are new to the subject.&lt;/p&gt;

&lt;!-- more --&gt;

&lt;p&gt;The purpose of this 2-part blog is twofold. First, it is an attempt to give an accessible overview of the earlier work on PDF preservation risks, stressing the importance of PDF/A validator tools in detecting these risks. Second, it provides some tentative suggestions of how the ongoing work on the new &lt;a href=&quot;http://verapdf.org/&quot;&gt;VeraPDF&lt;/a&gt; PDF/A validator could close some of the gaps and limitations of the SCAPE work.&lt;/p&gt;

&lt;h2 id=&quot;preservation-risks-of-pdf&quot;&gt;Preservation risks of PDF&lt;/h2&gt;

&lt;p&gt;The PDF format has a number of features that don’t sit well with the aims of long-term preservation and accessibility. This includes encryption and password protection, external dependencies (e.g. fonts that are not embedded in a document), and reliance on external software. More details can be found in the &lt;a href=&quot;http://wiki.opf-labs.org/display/TR/Portable+Document+Format&quot;&gt;PDF entry of the OPF File Format Risk Registry&lt;/a&gt;. Below are some examples; I included download links, so you can try them out for yourself.&lt;/p&gt;

&lt;h3 id=&quot;document-open-password&quot;&gt;Document Open password&lt;/h3&gt;

&lt;p&gt;If you try to open file &lt;a href=&quot;https://github.com/openpreserve/format-corpus/blob/master/pdfCabinetOfHorrors/encryption_openpassword.pdf?raw=true&quot;&gt;&lt;em&gt;encryption_openpassword.pdf&lt;/em&gt;&lt;/a&gt; in Adobe Acrobat, you end up with this dialog:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/2015/07/openpassword.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Without the password, the file cannot be opened at all.&lt;/p&gt;

&lt;h3 id=&quot;print-password&quot;&gt;Print password&lt;/h3&gt;

&lt;p&gt;File &lt;a href=&quot;https://github.com/openpreserve/format-corpus/blob/master/pdfCabinetOfHorrors/encryption_noprinting.pdf?raw=true&quot;&gt;&lt;em&gt;encryption_noprinting.pdf&lt;/em&gt;&lt;/a&gt; can be opened normally, but you cannot print it:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/2015/07/printpassword.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;h3 id=&quot;embedded-quicktime-movie&quot;&gt;Embedded Quicktime movie&lt;/h3&gt;

&lt;p&gt;File &lt;a href=&quot;https://github.com/openpreserve/format-corpus/blob/master/pdfCabinetOfHorrors/embedded_video_quicktime.pdf?raw=true&quot;&gt;&lt;em&gt;embedded_video_quicktime.pdf&lt;/em&gt;&lt;/a&gt; contains multimedia content in &lt;a href=&quot;http://fileformats.archiveteam.org/wiki/Quicktime&quot;&gt;Quicktime&lt;/a&gt; format. Acrobat cannot render this format natively, and relies on an external player. This is what happened when I opened the file on my PC:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/2015/07/embeddquicktime.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;After I clicked on &lt;em&gt;Get Media Player&lt;/em&gt;, I was taken &lt;a href=&quot;http://cgi.adobe.com/special/acrobat/mediaplayerfinder/mediaplayerfinder.cgi?&quot;&gt;here&lt;/a&gt;:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/2015/07/embeddquicktime2.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;I wasn’t able to configure Acrobat to use a media player that supports Quicktime &lt;sup id=&quot;fnref:1&quot;&gt;&lt;a href=&quot;#fn:1&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;1&lt;/a&gt;&lt;/sup&gt;.&lt;/p&gt;

&lt;h3 id=&quot;external-reference-to-multimedia-file&quot;&gt;External reference to multimedia file&lt;/h3&gt;

&lt;p&gt;The file &lt;a href=&quot;https://web.archive.org/web/20100714002808/http://acroeng.adobe.com/Test_Files/movie/movie.pdf&quot;&gt;&lt;em&gt;movie.pdf&lt;/em&gt;&lt;/a&gt; contains references to external multimedia files. If you click on any of them you get an error like this one:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/2015/07/movieexternal.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;h3 id=&quot;font-not-embedded&quot;&gt;Font not embedded&lt;/h3&gt;

&lt;p&gt;File &lt;a href=&quot;https://github.com/openpreserve/format-corpus/blob/master/pdfCabinetOfHorrors/calistoMTNoFontsEmbedded.pdf?raw=true&quot;&gt;&lt;em&gt;calistoMTNoFontsEmbedded.pdf&lt;/em&gt;&lt;/a&gt; uses &lt;em&gt;Calisto MT&lt;/em&gt;, but the font is not embedded. Since &lt;em&gt;Calisto MT&lt;/em&gt; is a Windows system font, the file looks fine on my Windows PC:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/2015/07/fontsorig.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;The font does not come pre-installed with common Linux distros, and as a result the file looks quite a bit different on my Linux machine:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/2015/07/fontslinux.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;h3 id=&quot;3d-content&quot;&gt;3D content&lt;/h3&gt;

&lt;p&gt;The file &lt;a href=&quot;https://github.com/openpreserve/format-corpus/blob/master/pdfCabinetOfHorrors/digitally_signed_3D_Portfolio.pdf?raw=true&quot;&gt;&lt;em&gt;digitally_signed_3D_Portfolio.pdf&lt;/em&gt;&lt;/a&gt; contains 3D artwork. Acrobat correctly renders the 3D content, which can be manipulated interactively by the user:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/2015/07/3dacrobat.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;However, Acrobat aside, the majority of PDF readers don’t support 3D content, with the result that in other readers you may end up with something like this:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/2015/07/3dsumatra.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;h2 id=&quot;detecting-risky-features&quot;&gt;Detecting risky features&lt;/h2&gt;

&lt;p&gt;Archives or libraries may want to check their PDFs for one or more features like those shown above. Reasons for doing so include:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;Pre-ingest checks against an institutional policy (e.g. an archive may not accept PDFs that are password protected)&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Profiling of existing collections for preservation risks (e.g. embedded multimedia content in hard-to-render formats)&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For this quite a few useful software tools are already available. For example, &lt;a href=&quot;https://github.com/qpdf/qpdf&quot;&gt;qpdf&lt;/a&gt; gives detailed information about encryption and password protection:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/2015/07/encryptqpdf.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Similarly, the &lt;a href=&quot;http://www.linuxcommand.org/man_pages/pdffonts1.html&quot;&gt;pdffonts&lt;/a&gt; tool that is part of &lt;a href=&quot;http://www.foolabs.com/xpdf/&quot;&gt;xpdf&lt;/a&gt; is useful for checking whether fonts in a PDF are embedded:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/2015/07/fontsxpdf.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;As the number of features you want to check for increases, this approach becomes increasingly cumbersome: most tools only cover &lt;em&gt;some&lt;/em&gt; features, so you rapidly end up having to deal with a multitude of software tools and output formats. So you may ask yourself if there’s a way to do this more efficiently.&lt;/p&gt;

&lt;h2 id=&quot;pdfa-validation&quot;&gt;PDF/A validation&lt;/h2&gt;

&lt;p&gt;This is where &lt;a href=&quot;https://en.wikipedia.org/wiki/PDF/A&quot;&gt;PDF/A&lt;/a&gt; enters the picture. The PDF/A standards are nothing more than a set of profiles that impose some restrictions on a PDF, ruling out features that are not well-suited to long-term accessibility. Unsurprisingly, these include the very same features that we are interested in here, such as encryption, non-embedded fonts, multimedia content, and so on. Several tools exist that compare a PDF against PDF/A and report any deviations. These PDF/A validators are typically used to verify “true” PDF/A files; however, they can also be used to detect user-specified risky features in regular PDFs.&lt;/p&gt;

&lt;p&gt;The professional version of Adobe Acrobat has a PDF/A validator built into its &lt;a href=&quot;http://help.adobe.com/en_US/acrobat/X/pro/using/WS58a04a822e3e50102bd615109794195ff-7b82.w.html&quot;&gt;Preflight&lt;/a&gt; tool. After opening a PDF in Acrobat, it allows you to verify its compliance with a number of profiles, including PDF/A (currently A-1, 2 and 3):&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/2015/07/acrobatpreflight1.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;This results in output as shown here:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/2015/07/acrobatpreflight2.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://acroeng.adobe.com/Test_Files/classic_multimedia//Jpeg_linked.pdf&quot;&gt;This PDF&lt;/a&gt;&lt;sup id=&quot;fnref:2&quot;&gt;&lt;a href=&quot;#fn:2&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;2&lt;/a&gt;&lt;/sup&gt; (which isn’t a PDF/A) violates the PDF/A-1a profile in several ways, but supposing we’re only interested in encryption and non-embedded fonts, the relevant information can be extracted from Preflight’s output quite easily. This example demonstrates the overall feasibility of identifying preservation risks with a PDF/A validator, but it does not scale to situations where you need to verify large volumes of PDFs. Scaling up will be the main focus of the &lt;a href=&quot;/2015/07/08/why-pdfa-validation-matters-part-2&quot;&gt;second part&lt;/a&gt; of this blog series.&lt;/p&gt;

&lt;hr /&gt;

&lt;p&gt;Originally published at the &lt;a href=&quot;https://openpreservation.org/blog/2015/07/07/why-pdfa-validation-matters-even-if-you-dont-have-pdfa/&quot;&gt;Open Preservation Foundation blog&lt;/a&gt;&lt;/p&gt;
&lt;div class=&quot;footnotes&quot; role=&quot;doc-endnotes&quot;&gt;
  &lt;ol&gt;
    &lt;li id=&quot;fn:1&quot;&gt;
      &lt;p&gt;Acrobat’s &lt;em&gt;Preferences&lt;/em&gt; do include some options for configuring behavior with multimedia content (explained &lt;a href=&quot;https://helpx.adobe.com/acrobat/using/playing-video-audio-multimedia-formats.html&quot;&gt;here&lt;/a&gt;), but the list of media players in the &lt;em&gt;Preferred Media Player&lt;/em&gt; dropdown list only included Windows Media Player and Adobe Flash Player. Neither of these supports QuickTime. VLC Media Player &lt;em&gt;does&lt;/em&gt; support QuickTime, but it is not included in the dropdown list, leaving me no way to configure it. Bummer! &lt;a href=&quot;#fnref:1&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:2&quot;&gt;
      &lt;p&gt;At the time of writing the Acrobat Engineering site was down, and this particular PDF is not included in any Wayback crawls either. Bummer again! &lt;a href=&quot;#fnref:2&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
  &lt;/ol&gt;
&lt;/div&gt;
</description>
                <link>https://bitsgalore.org/2015/07/07/why-pdfa-validation-matters-even-if-you-dont-have-pdfa</link>
                <guid>https://bitsgalore.org/2015/07/07/why-pdfa-validation-matters-even-if-you-dont-have-pdfa</guid>
                <pubDate>2015-07-07T00:00:00+02:00</pubDate>
        </item>

        <item>
                <title>Top 50 file formats in the KB e-Depot</title>
                <description>&lt;p&gt;The current version of the KB’s digital repository system (&lt;a href=&quot;https://www.kb.nl/en/organisation/research-expertise/long-term-usability-of-digital-resources/the-e-depot-project-cycle&quot;&gt;e-Depot&lt;/a&gt;) doesn’t include any tools for automated &lt;a href=&quot;http://www.forensicswiki.org/wiki/File_Format_Identification&quot;&gt;file format identification&lt;/a&gt; yet. Our previous &lt;a href=&quot;https://www.kb.nl/en/organisation/research-expertise/long-term-usability-of-digital-resources/history-the-kb-and-digital-preservation&quot;&gt;DIAS&lt;/a&gt; system didn’t have identification functionality either. As a result, information on file formats in our digital collections is largely based on publisher metadata and file extensions. Neither is necessarily correct. Moreover, previous analyses revealed a number of prevalent file extensions that could not be easily linked to a specific format. One result of this situation was that we couldn’t even reliably tell to what extent patrons were able to view e-Depot content on the PCs in our reading rooms (the obviously common formats aside).&lt;/p&gt;

&lt;p&gt;To get a better view of the formats in our collection, we did an analysis of the “top 50” most prevalent file extensions in our e-Depot: what are the corresponding formats, can these formats be automatically identified, and can we render them in our reading rooms? This blog post summarises the main findings of this work.&lt;/p&gt;

&lt;!-- more --&gt;

&lt;h2 id=&quot;extension-counts&quot;&gt;Extension counts&lt;/h2&gt;

&lt;p&gt;As a first step, we compiled a list with file counts for every unique file extension in our e-Depot. Importantly, we did this for &lt;em&gt;all&lt;/em&gt; files on the file system, including main files, supplemental content and (original) metadata files. The following chart shows the number of files for every extension, sorted in descending order (note that the vertical axis has a logarithmic scale):&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/2015/04/distributionFormats.png&quot; alt=&quot;Distribution of file formats&quot; /&gt;&lt;/p&gt;

&lt;p&gt;The total number of unique extensions is no less than 1163. Somewhat surprisingly, &lt;em&gt;.gif&lt;/em&gt; turned out to be the most prevalent extension at 34 million files&lt;sup id=&quot;fnref:1&quot;&gt;&lt;a href=&quot;#fn:1&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;1&lt;/a&gt;&lt;/sup&gt;. Altogether, the 10 most common extensions make up 99% of all files in the e-Depot. There is a long tail of extensions for which fewer than 10 file objects exist, and these account for over half of all unique extensions. In the remainder of this blog we will take a closer look at the “top 50” of all file extensions. The full list is too large to include in this blog post, but you can view it as a table at the following link:&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://gist.github.com/bitsgalore/21028de28b7f05066585#file-extensionskbdm-md&quot;&gt;50 most prevalent formats in KB e-Depot by file extension&lt;/a&gt;&lt;/p&gt;
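
&lt;p&gt;To give an idea of what such a count involves, here is a minimal Python sketch. It is not the procedure that was actually used for the e-Depot; the function names and the choice to lower-case extensions are assumptions made purely for illustration.&lt;/p&gt;

```python
import os
from collections import Counter


def count_extensions(root):
    """Return a Counter that maps lower-cased file extensions to file counts."""
    counts = Counter()
    for _dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            # splitext gives ("name", ".ext"); files without an extension yield ""
            ext = os.path.splitext(name)[1].lower()
            counts[ext] += 1
    return counts


def cumulative_share(counts, top_n):
    """Fraction of all files covered by the top_n most common extensions."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    top = sum(n for _ext, n in counts.most_common(top_n))
    return top / total
```

&lt;p&gt;Calling the hypothetical &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;cumulative_share(counts, 10)&lt;/code&gt; on the result would give the kind of “top 10” share mentioned above.&lt;/p&gt;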

&lt;h2 id=&quot;analysis-of-sample-dataset&quot;&gt;Analysis of sample dataset&lt;/h2&gt;

&lt;p&gt;For each extension we extracted about 20 sample files&lt;sup id=&quot;fnref:2&quot;&gt;&lt;a href=&quot;#fn:2&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;2&lt;/a&gt;&lt;/sup&gt;. We then tried to identify each file with &lt;a href=&quot;https://tika.apache.org/&quot;&gt;Apache Tika&lt;/a&gt; (version 1.4) in &lt;a href=&quot;https://tika.apache.org/1.8/detection.html&quot;&gt;detector&lt;/a&gt; mode. The third column of our &lt;a href=&quot;https://gist.github.com/bitsgalore/21028de28b7f05066585#file-extensionskbdm-md&quot;&gt;table&lt;/a&gt; shows the results for each extension. A manual inspection of selected samples revealed some further details, which are listed in the fourth column of the &lt;a href=&quot;https://gist.github.com/bitsgalore/21028de28b7f05066585#file-extensionskbdm-md&quot;&gt;table&lt;/a&gt; (you may need to use the scrollbar at the bottom to view it). One interesting finding was that &lt;a href=&quot;http://fileformats.archiveteam.org/wiki/Matlab_figure&quot;&gt;Matlab Figure&lt;/a&gt; files were misidentified by Apache Tika as either &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;application/x-xfig&lt;/code&gt; or &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;image/jpeg&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Further analysis of the contents of the 22 ZIP files in our test dataset yielded some additional formats:&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th style=&quot;text-align: left&quot;&gt;Extension&lt;/th&gt;
      &lt;th style=&quot;text-align: left&quot;&gt;ID Tika&lt;/th&gt;
      &lt;th style=&quot;text-align: left&quot;&gt;Remarks&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;cif&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;text/plain&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;&lt;a href=&quot;http://fileformats.archiveteam.org/wiki/CIF&quot;&gt;Crystallographic Information File&lt;/a&gt;&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;csv&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;text/csv&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;&lt;a href=&quot;http://fileformats.archiveteam.org/wiki/CSV&quot;&gt;Comma-separated values&lt;/a&gt;&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;mol&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;text/plain&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;&lt;a href=&quot;http://fileformats.archiveteam.org/wiki/MOL&quot;&gt;MDL Molfile&lt;/a&gt;&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;tdb&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;text/plain&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;&lt;a href=&quot;http://fileformats.archiveteam.org/wiki/TDB&quot;&gt;Thermo-Calc Database Format&lt;/a&gt;&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;r&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;text/x-rsrc&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;&lt;a href=&quot;http://fileformats.archiveteam.org/wiki/R&quot;&gt;R source code&lt;/a&gt;&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;m&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;text/x-objcsrc&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;&lt;a href=&quot;http://fileformats.archiveteam.org/wiki/Objective-C&quot;&gt;Objective-C source code&lt;/a&gt;&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;Because of the small sample size (and also the fact that the ZIP files were taken from similar batches), these results cannot be taken as representative. Nevertheless, they do show that the identification of scientific text-based data formats such as &lt;em&gt;mol&lt;/em&gt; or &lt;em&gt;cif&lt;/em&gt; often isn’t very informative. Automatic identification of such formats is difficult anyway, because they typically don’t have unique patterns or header fields.&lt;/p&gt;
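
&lt;p&gt;The following Python sketch illustrates why this is the case. It is a deliberately simplified, signature-based identifier and does not reflect how Apache Tika works internally; the signature table and the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;identify&lt;/code&gt; function are hypothetical. Formats with a fixed byte signature are recognised, whereas any text-based format without one can only be reported as generic plain text.&lt;/p&gt;

```python
# Illustrative signature table; real identification tools use far larger ones.
SIGNATURES = [
    (b"GIF87a", "image/gif"),
    (b"GIF89a", "image/gif"),
    (b"%PDF-", "application/pdf"),
    (b"PK\x03\x04", "application/zip"),
]


def identify(data):
    """Guess a media type from the leading bytes of a file."""
    for magic, media_type in SIGNATURES:
        if data.startswith(magic):
            return media_type
    # No known signature: all we can tell is whether the bytes decode as text.
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return "application/octet-stream"
    return "text/plain"
```

&lt;p&gt;With this sketch, the first bytes of a &lt;em&gt;cif&lt;/em&gt; or &lt;em&gt;mol&lt;/em&gt; file would simply come back as &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;text/plain&lt;/code&gt;, just like any other text file.&lt;/p&gt;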

&lt;h2 id=&quot;accessibility-in-reading-rooms&quot;&gt;Accessibility in reading rooms&lt;/h2&gt;

&lt;p&gt;Finally we wanted to know to what extent the PCs in our reading rooms support our most common formats. To find out, we simply plugged a thumb drive with our test dataset into one of these PCs, and tried to open sample files for each extension in our “top 50” (and those found in the ZIP files as well). To make the results of this exercise easier to digest, we grouped all extensions into &lt;a href=&quot;https://gist.github.com/bitsgalore/7a758505c0bbbae3db4e#file-formatcategories-md&quot;&gt;12 format categories&lt;/a&gt;. The table below shows the main results for each category:&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th style=&quot;text-align: left&quot;&gt;Category&lt;/th&gt;
      &lt;th style=&quot;text-align: left&quot;&gt;Rendering software in reading rooms&lt;/th&gt;
      &lt;th style=&quot;text-align: left&quot;&gt;Formats accessible in reading rooms?&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Image formats&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;MS Paint, Windows Photoviewer&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Yes&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;PDF&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Adobe Acrobat&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Yes&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Web formats&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Internet Explorer, Google Chrome&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Yes&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Office formats&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Microsoft Office&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Yes (support for old Office formats presently not clear)&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Audio&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Windows Media Player, VLC Media Player&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;No (hardware in reading rooms doesn’t support audio)&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Video&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Windows Media Player, VLC Media Player&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Partially (hardware in reading rooms doesn’t support audio)&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Metadata&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Internet Explorer, Notepad, Wordpad&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Yes&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Executables, installers, system files&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Not applicable&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;No&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Containers&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Windows Explorer, 7-Zip&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Yes&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Source code / scripts&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Notepad, Wordpad&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Partially: available software doesn’t support syntax highlighting&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;(Scientific) text-based data formats&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Notepad, Wordpad&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Partially: available software doesn’t support syntax highlighting; CSV files are not imported correctly by MS Excel&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;(Scientific) binary data formats&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;-&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;No&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;The main conclusion is that most formats in our “Top 50” are sufficiently accessible. Nevertheless, there is some room for improvement:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;The currently installed version of Microsoft Office (2010) does not support all previous versions of some of the Office formats. According to &lt;a href=&quot;https://technet.microsoft.com/en-us/library/dd797428%28v=office.14%29.aspx&quot;&gt;Microsoft’s documentation&lt;/a&gt; there’s no support for PowerPoint 95 presentations, and the documentation is not clear on Word 95 and earlier either. From the current analysis we cannot establish whether we have these old formats in our collection, so this may need further work in the future.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Comma-delimited text files are not read correctly by Excel. This is caused by region-specific settings of the PCs in the reading rooms, which cause Excel to expect a semicolon as a separator instead of a comma (the comma is used as a decimal separator in Dutch!). This could be improved by changing the configuration of the PCs (but a side-effect would be that semicolon-separated files would then go wrong instead!).&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;The applications that are currently available for the “plain” text formats are not that great for scripts, large data files and files that have non-Windows &lt;a href=&quot;http://en.wikipedia.org/wiki/Newline&quot;&gt;line endings&lt;/a&gt;. This could be easily solved by installing a more sophisticated text editor such as &lt;a href=&quot;http://notepad-plus-plus.org/&quot;&gt;Notepad++&lt;/a&gt;.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;As part of the scientific binary data category, we came across some 1800 &lt;a href=&quot;http://fileformats.archiveteam.org/wiki/Matlab_figure&quot;&gt;Matlab Figure&lt;/a&gt; files. This is a proprietary format that requires the &lt;a href=&quot;http://fileformats.archiveteam.org/wiki/Matlab&quot;&gt;Matlab&lt;/a&gt; software, which is not available in our reading rooms. So, essentially these files are not accessible to our users. Whether we will take any action on this is a different matter, since Matlab licences are expensive and the number of files is relatively small anyway.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;
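
&lt;p&gt;The separator problem with comma-delimited files is easy to reproduce outside Excel. The Python sketch below is only an analogy (Excel’s actual import logic is not shown here, and the sample data are made up): it reads the same comma-delimited text once with a comma and once with a semicolon as the separator.&lt;/p&gt;

```python
import csv
import io


def parse(text, delimiter):
    """Parse delimited text into a list of rows using the given separator."""
    return list(csv.reader(io.StringIO(text), delimiter=delimiter))


# Made-up sample data, for illustration only
sample = "name,size\nfigure1,3.5\n"

# Read with the separator the file actually uses: two fields per row
print(parse(sample, ","))

# Read with a semicolon, as a Dutch regional configuration would expect:
# each row collapses into a single field
print(parse(sample, ";"))
```

&lt;p&gt;In the second case every row ends up as one field that still contains the commas, which is essentially what we saw on the reading room PCs.&lt;/p&gt;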

&lt;h2 id=&quot;links&quot;&gt;Links&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://gist.github.com/bitsgalore/21028de28b7f05066585#file-extensionskbdm-md&quot;&gt;50 most prevalent formats in KB e-Depot by file extension&lt;/a&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://gist.github.com/bitsgalore/4326300f185eec3d6d48#file-edepotfextentions_v3-md&quot;&gt;All file extensions in e-Depot and corresponding file counts&lt;/a&gt;&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;acknowledgements&quot;&gt;Acknowledgements&lt;/h2&gt;

&lt;p&gt;Victor van der Wolf prepared the file extension counts; Danny Stephan prepared the database queries for the sample dataset. Barbara Sierman came up with the initial idea of a “file formats top 50”.&lt;/p&gt;

&lt;hr /&gt;

&lt;p&gt;Originally published at the &lt;a href=&quot;http://blog.kbresearch.nl/2015/04/29/top-50-file-formats-in-the-kb-e-depot/&quot;&gt;KB Research blog&lt;/a&gt;&lt;/p&gt;
&lt;div class=&quot;footnotes&quot; role=&quot;doc-endnotes&quot;&gt;
  &lt;ol&gt;
    &lt;li id=&quot;fn:1&quot;&gt;
      &lt;p&gt;Most of these are tiny images that are part of XML representations of scientific papers (mostly mathematical equations). &lt;a href=&quot;#fnref:1&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:2&quot;&gt;
      &lt;p&gt;The dataset is not representative of the collection as a whole because of its limited size, and the sub-optimal sampling procedure that was used. &lt;a href=&quot;#fnref:2&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
  &lt;/ol&gt;
&lt;/div&gt;
</description>
                <link>https://bitsgalore.org/2015/04/29/top-50-file-formats-in-the-kb-e-depot</link>
                <guid>https://bitsgalore.org/2015/04/29/top-50-file-formats-in-the-kb-e-depot</guid>
                <pubDate>2015-04-29T00:00:00+02:00</pubDate>
        </item>

        <item>
                <title>Policy-based assessment of EPUB with Epubcheck</title>
                <description>&lt;p&gt;Back in 2012 the KB conducted a first &lt;a href=&quot;https://zenodo.org/record/839711&quot;&gt;investigation&lt;/a&gt; of the suitability of the &lt;em&gt;EPUB&lt;/em&gt; format for long-term preservation. The KB will soon start receiving publications in this format, and in anticipation of this, our Collection Care department has formulated a policy on the minimum requirements an &lt;em&gt;EPUB&lt;/em&gt; must meet to ensure long-term accessibility. The policy largely follows the recommendations from the 2012 report. This blog explores to what extent it is possible to automatically assess the &lt;em&gt;EPUB&lt;/em&gt;s that we receive against our policy using a combination of the &lt;a href=&quot;https://github.com/idpf/epubcheck&quot;&gt;&lt;em&gt;Epubcheck&lt;/em&gt;&lt;/a&gt; tool and &lt;a href=&quot;http://www.schematron.com/&quot;&gt;&lt;em&gt;Schematron&lt;/em&gt;&lt;/a&gt; rules.&lt;/p&gt;

&lt;!-- more --&gt;

&lt;h2 id=&quot;kb-epub-policy&quot;&gt;KB &lt;em&gt;EPUB&lt;/em&gt; policy&lt;/h2&gt;

&lt;p&gt;The KB’s policy on &lt;em&gt;EPUB&lt;/em&gt;  is made up of the following objectives:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;
    &lt;p&gt;File must be valid &lt;em&gt;EPUB&lt;/em&gt; (either version 2 or 3)&lt;/p&gt;

    &lt;p&gt;&lt;em&gt;Rationale&lt;/em&gt;: this minimises the risk of interoperability problems.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;File may not contain DRM or encryption&lt;/p&gt;

    &lt;p&gt;&lt;em&gt;Rationale&lt;/em&gt;: this minimises the risk that files become inaccessible. An edge case here is &lt;a href=&quot;http://www.idpf.org/epub/30/spec/epub30-ocf.html#font-obfuscation&quot;&gt;font obfuscation&lt;/a&gt;, which mangles some leading bytes in embedded fonts. This technology is merely meant as a stumbling block to discourage third parties from re-using embedded fonts, and it doesn’t pose a serious threat to long-term accessibility.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;File may not contain foreign resources&lt;/p&gt;

    &lt;p&gt;&lt;em&gt;Rationale&lt;/em&gt;: the &lt;a href=&quot;http://www.idpf.org/epub/301/spec/epub-publications.html#sec-core-media-types&quot;&gt;Core Media Types&lt;/a&gt; define a set of file formats that must be supported by all conforming &lt;em&gt;EPUB&lt;/em&gt;  readers. &lt;a href=&quot;http://www.idpf.org/epub/301/spec/epub-publications.html#gloss-publication-resource-foreign&quot;&gt;Foreign resources&lt;/a&gt; are resources that are not part of this set, and the KB’s policy is to not accept them. This requirement minimises the risk of accepting files that contain content that may not be rendered correctly by some readers.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;File may not contain &lt;em&gt;DTBook&lt;/em&gt; content&lt;/p&gt;

    &lt;p&gt;&lt;em&gt;Rationale&lt;/em&gt;: &lt;em&gt;EPUB&lt;/em&gt; 2 offered the option to use the &lt;a href=&quot;http://www.niso.org/workrooms/daisy/Z39-86-2005.html&quot;&gt;&lt;em&gt;DTBook&lt;/em&gt;&lt;/a&gt; (&lt;em&gt;DAISY&lt;/em&gt; Digital Talking Book) format as an alternative to &lt;em&gt;XHTML&lt;/em&gt; 1.1. Support for &lt;em&gt;DTBook&lt;/em&gt; was &lt;a href=&quot;http://www.idpf.org/epub/30/spec/epub30-changes.html#sec-removals-dtbook&quot;&gt;dropped in &lt;em&gt;EPUB&lt;/em&gt; 3&lt;/a&gt;. Support is already limited with current &lt;em&gt;EPUB&lt;/em&gt; reading software: both the popular &lt;a href=&quot;http://calibre-ebook.com/&quot;&gt;&lt;em&gt;Calibre&lt;/em&gt;&lt;/a&gt; and &lt;a href=&quot;http://readium.org/&quot;&gt;&lt;em&gt;Readium&lt;/em&gt;&lt;/a&gt; viewers are unable to process &lt;em&gt;EPUB&lt;/em&gt;s with &lt;em&gt;DTBook&lt;/em&gt; content (although my &lt;a href=&quot;https://en.wikipedia.org/wiki/Sony_Reader#2012_Model_.28Discontinued_late_2013.29&quot;&gt;Sony Reader device&lt;/a&gt; handles them without problems). This does not bode well for the future.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ol&gt;

&lt;h2 id=&quot;automated-conformance-checking&quot;&gt;Automated conformance checking&lt;/h2&gt;

&lt;p&gt;To check if an &lt;em&gt;EPUB&lt;/em&gt; conforms to the above policy, we need to:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;test for validity against the format’s standard;&lt;/li&gt;
  &lt;li&gt;extract technical information that tells us something about DRM and file resources inside the &lt;em&gt;EPUB&lt;/em&gt;;&lt;/li&gt;
  &lt;li&gt;assess the results of steps 1 and 2 against our policy.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The &lt;a href=&quot;https://github.com/idpf/epubcheck&quot;&gt;&lt;em&gt;Epubcheck&lt;/em&gt;&lt;/a&gt; validator is the obvious candidate for steps 1 and 2. Since &lt;em&gt;Epubcheck&lt;/em&gt; is capable of reporting its results in XML format, we can use &lt;a href=&quot;http://www.schematron.com/&quot;&gt;&lt;em&gt;Schematron&lt;/em&gt;&lt;/a&gt; rules for the final assessment step. The general approach is similar to earlier work on the &lt;a href=&quot;/2012/09/04/automated-assessment-jp2-against-technical-profile/&quot;&gt;JP2&lt;/a&gt; and &lt;a href=&quot;/2014/01/27/identification-pdf-preservation-risks-analysis-govdocs-selected-corpus/&quot;&gt;PDF&lt;/a&gt; formats, as well as the British Library’s &lt;a href=&quot;https://github.com/openpreserve/flint&quot;&gt;Flint&lt;/a&gt; tool.&lt;/p&gt;

&lt;h2 id=&quot;test-data&quot;&gt;Test data&lt;/h2&gt;

&lt;p&gt;For testing, we first need a corpus of files that are known to violate one or more objectives of our policy. As finding such files turned out to be more difficult than expected, I created a small &lt;a href=&quot;https://github.com/KBNLresearch/epubPolicyTests&quot;&gt;set of test files&lt;/a&gt;. Some of the files in this dataset were created from scratch; others were taken directly or adapted from existing openly licensed datasets. The following table lists the main characteristics of the files in the dataset&lt;sup id=&quot;fnref:1&quot;&gt;&lt;a href=&quot;#fn:1&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;1&lt;/a&gt;&lt;/sup&gt;:&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th style=&quot;text-align: left&quot;&gt;Test&lt;/th&gt;
      &lt;th style=&quot;text-align: left&quot;&gt;Epub version&lt;/th&gt;
      &lt;th style=&quot;text-align: left&quot;&gt;Description&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;&lt;a href=&quot;https://github.com/KBNLresearch/epubPolicyTests/blob/master/build/epub20_minimal.epub?raw=true&quot;&gt;Minimal&lt;/a&gt;&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;2&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Basic file with one text resource and one image&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;&lt;a href=&quot;https://github.com/KBNLresearch/epubPolicyTests/blob/master/build/epub20_minimal_encryption.epub?raw=true&quot;&gt;Encryption&lt;/a&gt;&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;2&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Fake encrypted file that includes &lt;em&gt;encryption.xml&lt;/em&gt; resource in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;META-INF&lt;/code&gt;, indicating that main text resource is encrypted&lt;sup id=&quot;fnref:2&quot;&gt;&lt;a href=&quot;#fn:2&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;2&lt;/a&gt;&lt;/sup&gt;&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;&lt;a href=&quot;https://github.com/KBNLresearch/epubPolicyTests/blob/master/build/epub30_font_obfuscation.epub?raw=true&quot;&gt;Font obfuscation&lt;/a&gt;&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;3&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Includes fonts that are obfuscated (which results in &lt;em&gt;hasEncryption&lt;/em&gt; in epubcheck). Taken from &lt;a href=&quot;https://code.google.com/p/epub-samples/&quot;&gt;EPUB 3 Sample Documents&lt;/a&gt; (&lt;a href=&quot;https://code.google.com/p/epub-samples/downloads/detail?name=wasteland-otf-obf-20120118.epub&amp;amp;can=2&amp;amp;q=&quot;&gt;&lt;em&gt;wasteland with OTF fonts, obfuscated&lt;/em&gt;&lt;/a&gt;).&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;&lt;a href=&quot;https://github.com/KBNLresearch/epubPolicyTests/blob/master/build/epub20_foreign_resource_no_fallback.epub?raw=true&quot;&gt;Foreign resource without fallback&lt;/a&gt;&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;2&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Includes JP2 image, which is a format that is not on the list of Core Media Types&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;&lt;a href=&quot;https://github.com/KBNLresearch/epubPolicyTests/blob/master/build/epub20_foreign_resource_with_fallback.epub?raw=true&quot;&gt;Foreign resource with fallback 1&lt;/a&gt;&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;2&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Includes JP2 image, which is a format that is not on the list of Core Media Types; fallback defined in manifest, identifier in content document&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;&lt;a href=&quot;https://github.com/KBNLresearch/epubPolicyTests/blob/master/build/epub20_foreign_resource_with_fallback_noID.epub?raw=true&quot;&gt;Foreign resource with fallback 2&lt;/a&gt;&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;2&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Includes JP2 image, which is a format that is not on the list of Core Media Types; fallback defined in manifest, no identifier in content document&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;&lt;a href=&quot;https://github.com/KBNLresearch/epubPolicyTests/blob/master/build/epub20_dtbook.epub?raw=true&quot;&gt;DTBook&lt;/a&gt;&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;2&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Includes Digital Talking Book content. Adapted from &lt;a href=&quot;https://code.google.com/p/threepress/source/browse/branches/bookworm-caching/library/test-data/data/hauy.epub?r=583&quot;&gt;threepress&lt;/a&gt;, published under &lt;a href=&quot;http://opensource.org/licenses/BSD-3-Clause&quot;&gt;BSD 3&lt;/a&gt; license.&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;Apart from the above files, the dataset also includes:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;the &lt;a href=&quot;https://github.com/KBNLresearch/epubPolicyTests/tree/master/content&quot;&gt;full source&lt;/a&gt; of each test file (as a directory structure);&lt;/li&gt;
  &lt;li&gt;a &lt;a href=&quot;https://github.com/KBNLresearch/epubPolicyTests/blob/master/build.sh&quot;&gt;bash script&lt;/a&gt; that automatically builds (zips) all directories to &lt;em&gt;EPUB&lt;/em&gt;s;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://github.com/KBNLresearch/epubPolicyTests/blob/master/analyse.sh&quot;&gt;another bash script&lt;/a&gt; that analyses all &lt;em&gt;EPUB&lt;/em&gt;s with versions 3 and 4 of &lt;em&gt;Epubcheck&lt;/em&gt;;&lt;/li&gt;
  &lt;li&gt;the &lt;a href=&quot;https://github.com/KBNLresearch/epubPolicyTests/tree/master/epubcheckout&quot;&gt;full &lt;em&gt;Epubcheck&lt;/em&gt; output&lt;/a&gt; of each file.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All files are openly licensed, and by adapting the existing tests it is pretty straightforward to add new ones.&lt;/p&gt;

&lt;h2 id=&quot;analysis-with-epubcheck&quot;&gt;Analysis with Epubcheck&lt;/h2&gt;

&lt;p&gt;The first question that we need to answer here is whether &lt;em&gt;Epubcheck&lt;/em&gt;’s output is sufficiently detailed for our needs. So, as a first step I analysed all files in the dataset with &lt;a href=&quot;https://github.com/idpf/epubcheck&quot;&gt;&lt;em&gt;Epubcheck&lt;/em&gt;&lt;/a&gt;. I did this using both &lt;a href=&quot;https://github.com/IDPF/epubcheck/releases/tag/v3.0.1&quot;&gt;&lt;em&gt;Epubcheck&lt;/em&gt; 3.0.1&lt;/a&gt; (the current stable version) and &lt;a href=&quot;https://github.com/IDPF/epubcheck/releases/tag/v4.0.0-alpha11&quot;&gt;the alpha 11 release of &lt;em&gt;Epubcheck&lt;/em&gt; 4.0.0&lt;/a&gt;. The full output can be found &lt;a href=&quot;https://github.com/KBNLresearch/epubPolicyTests/tree/master/epubcheckout&quot;&gt;here&lt;/a&gt;. In the following sections I will address each of the objectives of the KB policy.&lt;/p&gt;

&lt;h2 id=&quot;encryption-objective&quot;&gt;Encryption objective&lt;/h2&gt;

&lt;p&gt;For the ‘fake’ encrypted file &lt;em&gt;Epubcheck&lt;/em&gt;’s output contains a &lt;em&gt;hasEncryption&lt;/em&gt; property. Moreover, the &lt;em&gt;messages&lt;/em&gt; element in the output contains an error message that refers to the encrypted resource. In &lt;em&gt;Epubcheck&lt;/em&gt; 3 this is:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-data&quot;&gt;ERROR: : OPS/XHTML file OEBPS/Text/pdfMigration.html cannot be decrypted
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;A double-check with a ‘real’ encrypted &lt;em&gt;EPUB&lt;/em&gt; (which is proprietary and could not be included in the dataset) confirmed that each encrypted resource produces an error message of the general form:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-data&quot;&gt;ERROR: : $fileType file $fileName cannot be decrypted
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Here, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;$fileType&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;$fileName&lt;/code&gt; refer to the file type and name of the affected resource. The ‘fake’ encrypted file also resulted in some additional error messages about undefined fragment identifiers, but these look like secondary errors that result from &lt;em&gt;Epubcheck&lt;/em&gt;’s inability to decrypt the encrypted resource.&lt;/p&gt;

&lt;p&gt;The behaviour of &lt;em&gt;Epubcheck&lt;/em&gt; 4 is similar, although the error message is slightly different:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-data&quot;&gt;RSC-004, ERROR, [File &apos;OEBPS/Text/pdfMigration.html&apos; could not be decrypted.],epub20_minimal_encryption.epub
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The file with the obfuscated fonts also results in a &lt;em&gt;hasEncryption&lt;/em&gt; entry in &lt;em&gt;Epubcheck&lt;/em&gt;’s output. &lt;em&gt;Epubcheck&lt;/em&gt; (both versions 3 and 4) doesn’t provide any direct clue that the encryption in this file is limited to some obfuscated fonts. For our policy-based assessment we can therefore ignore the &lt;em&gt;hasEncryption&lt;/em&gt; entry, and simply check for the presence of “cannot be decrypted” error messages (see above).&lt;/p&gt;
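
&lt;p&gt;The check described above can be sketched in a few lines of Python. This is only an illustration, not part of the actual assessment workflow: the function names are hypothetical, and the list of messages is assumed to have been extracted from &lt;em&gt;Epubcheck&lt;/em&gt;’s output already. The two phrases are taken from the &lt;em&gt;Epubcheck&lt;/em&gt; 3 and 4 messages quoted earlier in this section.&lt;/p&gt;

```python
# Phrases taken from the Epubcheck 3 and Epubcheck 4 messages quoted above.
DECRYPTION_PHRASES = ("cannot be decrypted", "could not be decrypted")


def encrypted_resource_messages(messages):
    """Return the messages that report a resource that could not be decrypted."""
    return [m for m in messages if any(p in m for p in DECRYPTION_PHRASES)]


def passes_encryption_objective(messages):
    """Return True if no message reports an undecryptable resource.

    The hasEncryption property is deliberately not consulted, so files that
    only contain obfuscated fonts are not rejected.
    """
    return not encrypted_resource_messages(messages)
```

&lt;p&gt;Matching on the message text rather than on &lt;em&gt;hasEncryption&lt;/em&gt; keeps files with obfuscated fonts from being rejected, at the cost of depending on the exact wording of the messages.&lt;/p&gt;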

&lt;h2 id=&quot;dtbook-objective&quot;&gt;DTBook objective&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Epubcheck&lt;/em&gt;’s output does not give any explicit clue to the presence of &lt;em&gt;DTBook&lt;/em&gt; content. However, &lt;em&gt;Epubcheck&lt;/em&gt; 3 does report a read error on the corresponding file resource:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-data&quot;&gt;ERROR: : I/O error reading OEBPS/hauy-2005-1.xml: Stream closed
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;&lt;em&gt;Epubcheck&lt;/em&gt; 4 does not report this error. A check of the &lt;em&gt;DTBook&lt;/em&gt; resource confirmed that it is valid against &lt;a href=&quot;http://www.daisy.org/z3986/2005/dtbook-2005-2.dtd&quot;&gt;version 2 of the &lt;em&gt;DTBook&lt;/em&gt; Document Type Definition&lt;/a&gt; (I checked this using both &lt;a href=&quot;http://jhove.sourceforge.net/&quot;&gt;JHOVE&lt;/a&gt; and an &lt;a href=&quot;http://www.validome.org/xml/validate/&quot;&gt;online XML validator&lt;/a&gt;). This suggests that &lt;em&gt;Epubcheck&lt;/em&gt; 3 doesn’t properly recognise (cannot parse?) &lt;em&gt;DTBook&lt;/em&gt; content, and incorrectly flags &lt;em&gt;EPUB&lt;/em&gt;s that hold this as “Not well-formed”. The behaviour of &lt;em&gt;Epubcheck&lt;/em&gt; 4 is correct (see also &lt;a href=&quot;https://github.com/IDPF/epubcheck/issues/518&quot;&gt;this issue report&lt;/a&gt;).&lt;/p&gt;

&lt;h2 id=&quot;foreign-resources-objective&quot;&gt;Foreign resources objective&lt;/h2&gt;

&lt;p&gt;The test dataset contains 3 files with foreign resources (resources that are not on the list of Core Media Types). In the first one I simply replaced a &lt;em&gt;PNG&lt;/em&gt; image by a &lt;em&gt;JP2&lt;/em&gt; (and updated the manifest and the reference in the text accordingly). This results in the following validation error (&lt;em&gt;Epubcheck&lt;/em&gt; 3):&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-data&quot;&gt;ERROR: /OEBPS/Text/pdfMigration.html(20): non-standard image resource &apos;OEBPS/Images/pdfVenn.jp2&apos; of type &apos;image/jp2&apos;
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;And in &lt;em&gt;Epubcheck&lt;/em&gt; 4:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-data&quot;&gt;MED-003, ERROR, [Non-standard image resource of type image/jp2 found.], OEBPS/Text/pdfMigration.html (20-63) 
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;This error also causes the validation to fail. The &lt;em&gt;EPUB&lt;/em&gt; specification allows the use of foreign resources, but &lt;a href=&quot;http://www.idpf.org/epub/20/spec/OPF_2.0.1_draft.htm#Section2.3.1.1&quot;&gt;only if they have a Core Media fallback&lt;/a&gt;. I created two additional test files that use the original &lt;em&gt;PNG&lt;/em&gt; image as a fallback&lt;sup id=&quot;fnref:3&quot;&gt;&lt;a href=&quot;#fn:3&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;3&lt;/a&gt;&lt;/sup&gt;; I then updated the manifest of these files accordingly. &lt;em&gt;Epubcheck&lt;/em&gt; validates both files as “Well-formed”, but gives no information whatsoever on the presence of foreign resources. Somewhat alarmingly, both &lt;em&gt;Calibre&lt;/em&gt; and &lt;em&gt;Readium&lt;/em&gt; failed to read &lt;a href=&quot;https://github.com/KBNLresearch/epubPolicyTests/blob/master/build/epub20_foreign_resource_with_fallback.epub?raw=true&quot;&gt;either&lt;/a&gt; of these &lt;a href=&quot;https://github.com/KBNLresearch/epubPolicyTests/blob/master/build/epub20_foreign_resource_with_fallback_noID.epub?raw=true&quot;&gt;files&lt;/a&gt; correctly: the fallback image was not shown in either case. As it turns out, &lt;a href=&quot;https://github.com/IDPF/epubcheck/issues/511&quot;&gt;very few &lt;em&gt;EPUB&lt;/em&gt; readers support manifest fallbacks&lt;/a&gt;, even though this feature has been part of the &lt;em&gt;EPUB&lt;/em&gt; specification for a long time (at least since &lt;em&gt;EPUB&lt;/em&gt; 2).&lt;/p&gt;

&lt;h2 id=&quot;translating-the-policy-to-schematron-rules&quot;&gt;Translating the policy to Schematron rules&lt;/h2&gt;

&lt;p&gt;If &lt;em&gt;Epubcheck&lt;/em&gt; were able to address all aspects of the KB’s policy, it would be possible to translate each of its objectives into a &lt;em&gt;Schematron&lt;/em&gt; rule. As &lt;em&gt;Epubcheck&lt;/em&gt; doesn’t yet provide the required information on foreign resources and &lt;em&gt;DTBook&lt;/em&gt; content, for now we can only do this for the validity and encryption objectives. The &lt;em&gt;Schematron&lt;/em&gt; rule for validity is:&lt;/p&gt;

&lt;div class=&quot;language-xml highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nt&quot;&gt;&amp;lt;s:pattern&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;name=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;wellFormed&quot;&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;&amp;lt;s:rule&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;context=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;/jh:jhove/jh:repInfo&quot;&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;&amp;lt;s:assert&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;test=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;(jh:status = &apos;Well-formed&apos;)&quot;&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;&amp;gt;&lt;/span&gt;Not well-formed epub&lt;span class=&quot;nt&quot;&gt;&amp;lt;/s:assert&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;&amp;lt;/s:rule&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;/s:pattern&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;For the encryption objective we have this:&lt;/p&gt;

&lt;div class=&quot;language-xml highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c&quot;&gt;&amp;lt;!-- This rule rules out encrypted content, but permits font obfuscation--&amp;gt;&lt;/span&gt;  
&lt;span class=&quot;nt&quot;&gt;&amp;lt;s:pattern&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;name=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;encryptedResources&quot;&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;&amp;lt;s:rule&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;context=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;/jh:jhove/jh:repInfo/jh:messages&quot;&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;&amp;lt;s:assert&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;test=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;count(jh:message[contains(.,&apos;cannot be decrypted&apos;)]) = 0&quot;&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;&amp;gt;&lt;/span&gt;Contains encrypted resources&lt;span class=&quot;nt&quot;&gt;&amp;lt;/s:assert&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;&amp;lt;/s:rule&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;/s:pattern&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Alternatively, we could have created a rule that uses the &lt;em&gt;hasEncryption&lt;/em&gt; property here, but that would cause any files with obfuscated fonts to fail the assessment. The corresponding schema (adapted from the BL’s &lt;a href=&quot;https://github.com/openpreserve/flint&quot;&gt;&lt;em&gt;Flint&lt;/em&gt;&lt;/a&gt; tool) can be found &lt;a href=&quot;https://github.com/KBNLresearch/epubPolicyValidate/blob/master/schemas/kbPolicy.sch&quot;&gt;here&lt;/a&gt;. It is designed to work with &lt;em&gt;Epubcheck&lt;/em&gt; 3 only&lt;sup id=&quot;fnref:4&quot;&gt;&lt;a href=&quot;#fn:4&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;4&lt;/a&gt;&lt;/sup&gt;.&lt;/p&gt;

&lt;h2 id=&quot;demo&quot;&gt;Demo&lt;/h2&gt;

&lt;p&gt;I created a simple &lt;a href=&quot;https://github.com/KBNLresearch/epubPolicyValidate&quot;&gt;&lt;em&gt;EPUB&lt;/em&gt; policy-based validation demo&lt;/a&gt;. It is a shell script that validates all &lt;em&gt;EPUB&lt;/em&gt; files in a user-defined directory with &lt;em&gt;Epubcheck&lt;/em&gt;, and subsequently assesses &lt;em&gt;Epubcheck&lt;/em&gt;’s output against a user-defined schema. Note that the purpose of the script is just to demonstrate the general procedure; it is not recommended for operational use.&lt;/p&gt;

&lt;h2 id=&quot;possible-epubcheck-enhancements&quot;&gt;Possible Epubcheck enhancements&lt;/h2&gt;

&lt;p&gt;The above tests demonstrate that currently &lt;em&gt;Epubcheck&lt;/em&gt; is able to cover two aspects of the KB’s policy on &lt;em&gt;EPUB&lt;/em&gt;: validity and encryption. However, its output doesn’t provide the information we need on the presence of foreign resources and &lt;em&gt;DTBook&lt;/em&gt; content. It would be useful if &lt;em&gt;Epubcheck&lt;/em&gt; could be extended with an option that reports all resources in an &lt;em&gt;EPUB&lt;/em&gt; with their corresponding media types. This information can be extracted from the &lt;em&gt;manifest&lt;/em&gt; element of an &lt;em&gt;EPUB&lt;/em&gt;’s &lt;a href=&quot;http://www.idpf.org/epub/301/spec/epub-publications.html#sec-package-documents&quot;&gt;Package Document&lt;/a&gt;. For a file with both &lt;em&gt;DTBook&lt;/em&gt; content and a foreign resource (a &lt;em&gt;JP2&lt;/em&gt; file) this looks something like this:&lt;/p&gt;

&lt;div class=&quot;language-xml highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nt&quot;&gt;&amp;lt;manifest&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;&amp;lt;item&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;href=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;toc.ncx&quot;&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;id=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;ncx&quot;&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;media-type=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;application/x-dtbncx+xml&quot;&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;/&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;&amp;lt;item&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;href=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;Images/pdfVenn.jp2&quot;&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;id=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;pdfVennJP2&quot;&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;media-type=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;image/jp2&quot;&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;fallback=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;pdfVennPNG&quot;&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;/&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;&amp;lt;item&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;href=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;Images/pdfVenn.png&quot;&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;id=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;pdfVennPNG&quot;&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;media-type=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;image/png&quot;&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;/&amp;gt;&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;&amp;lt;item&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;href=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;hauy-2005-1.xml&quot;&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;id=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;opf3&quot;&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;media-type=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;application/x-dtbook+xml&quot;&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;/&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;/manifest&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Some simple &lt;em&gt;Schematron&lt;/em&gt; rules on the &lt;em&gt;media-type&lt;/em&gt; attribute would make it possible to filter this for the presence of &lt;em&gt;DTBook&lt;/em&gt; content (where &lt;em&gt;media-type&lt;/em&gt; is &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;application/x-dtbook+xml&lt;/code&gt;) or foreign resources (which have a &lt;em&gt;media-type&lt;/em&gt; value that is not on the Core Media Types list). Another solution would be to add properties like &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;hasDTBook&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;hasForeignResources&lt;/code&gt; to &lt;em&gt;Epubcheck&lt;/em&gt;’s output. This solution is less generic, but possibly more user-friendly.&lt;/p&gt;
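
&lt;p&gt;Until &lt;em&gt;Epubcheck&lt;/em&gt; reports this information, the same filtering could be approximated outside of it. The Python sketch below is only an illustration under a few assumptions: the function names are hypothetical, the Package Document is passed in as a string, and the list of Core Media Types is assumed to match &lt;em&gt;EPUB&lt;/em&gt; 2 and should be checked against the specification before use.&lt;/p&gt;

```python
import xml.etree.ElementTree as ET

DTBOOK_TYPE = "application/x-dtbook+xml"

# Assumption: Core Media Types as understood for EPUB 2; verify against the
# specification (the list differs between EPUB versions).
CORE_MEDIA_TYPES = frozenset({
    "image/gif",
    "image/jpeg",
    "image/png",
    "image/svg+xml",
    "application/xhtml+xml",
    "application/x-dtbook+xml",
    "text/css",
    "application/xml",
    "text/x-oeb1-document",
    "text/x-oeb1-css",
    "application/x-dtbncx+xml",
})


def manifest_media_types(package_xml):
    """Return (href, media-type) pairs for every manifest item."""
    root = ET.fromstring(package_xml)
    pairs = []
    for element in root.iter():
        # Compare local names so that the OPF namespace, if present, is ignored.
        if element.tag.rsplit("}", 1)[-1] == "item":
            pairs.append((element.get("href"), element.get("media-type")))
    return pairs


def assess_manifest(package_xml, core_types=CORE_MEDIA_TYPES):
    """List DTBook resources and resources outside the given core types."""
    pairs = manifest_media_types(package_xml)
    return {
        "dtbook": [href for href, mtype in pairs if mtype == DTBOOK_TYPE],
        "foreign": [href for href, mtype in pairs if mtype not in core_types],
    }
```

&lt;p&gt;For the manifest shown above, this would list &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;hauy-2005-1.xml&lt;/code&gt; as &lt;em&gt;DTBook&lt;/em&gt; content and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Images/pdfVenn.jp2&lt;/code&gt; as a foreign resource. Note that the sketch does not look at the &lt;em&gt;fallback&lt;/em&gt; attribute, and that an item without a &lt;em&gt;media-type&lt;/em&gt; attribute is also counted as foreign.&lt;/p&gt;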

&lt;p&gt;Finally, the presence of &lt;em&gt;DTBook&lt;/em&gt; resources&lt;sup id=&quot;fnref:5&quot;&gt;&lt;a href=&quot;#fn:5&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;5&lt;/a&gt;&lt;/sup&gt; incorrectly causes the validation to fail in &lt;em&gt;Epubcheck&lt;/em&gt; 3; this has been fixed in &lt;em&gt;Epubcheck&lt;/em&gt; 4.&lt;/p&gt;

&lt;h2 id=&quot;links&quot;&gt;Links&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;https://github.com/idpf/epubcheck&quot;&gt;Epubcheck&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://github.com/KBNLresearch/epubPolicyTests&quot;&gt;EPUB KB policy testing dataset&lt;/a&gt;, includes full source, build and analysis scripts.&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://github.com/KBNLresearch/epubPolicyValidate&quot;&gt;Policy-based validation demo based on Epubcheck&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;http://epubtest.org/compare/&quot;&gt;EPUB 3 Support Grid - comparison of support of EPUB 3.0 features by different  reading systems&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;hr /&gt;

&lt;p&gt;Originally published at the &lt;a href=&quot;http://blog.kbresearch.nl/2015/03/13/policy-based-assessment-of-epub-with-epubcheck/&quot;&gt;KB Research blog&lt;/a&gt;&lt;/p&gt;
&lt;div class=&quot;footnotes&quot; role=&quot;doc-endnotes&quot;&gt;
  &lt;ol&gt;
    &lt;li id=&quot;fn:1&quot;&gt;
      &lt;p&gt;These files are all released under the Creative Commons 3.0 BY-SA license, unless stated otherwise. &lt;a href=&quot;#fnref:1&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:2&quot;&gt;
      &lt;p&gt;The ‘encryption’ in this file is actually fake: I merely replace the original text resource with a &lt;a href=&quot;http://linux.die.net/man/1/base64&quot;&gt;base64&lt;/a&gt; encoded representation of that file. &lt;a href=&quot;#fnref:2&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:3&quot;&gt;
      &lt;p&gt;They only differ in the way the JP2 image is referenced in the text, as the &lt;em&gt;EPUB&lt;/em&gt; specification is not completely clear on this. &lt;a href=&quot;#fnref:3&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:4&quot;&gt;
      &lt;p&gt;Doesn’t yet work with &lt;em&gt;Epubcheck&lt;/em&gt; 4 because it uses slightly different output messages (could be easily adapted). &lt;a href=&quot;#fnref:4&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:5&quot;&gt;
      &lt;p&gt;These are, by the way, pretty rare. &lt;a href=&quot;#fnref:5&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
  &lt;/ol&gt;
&lt;/div&gt;
</description>
                <link>https://bitsgalore.org/2015/03/13/policy-based-assessment-of-epub-with-epubcheck</link>
                <guid>https://bitsgalore.org/2015/03/13/policy-based-assessment-of-epub-with-epubcheck</guid>
                <pubDate>2015-03-13T00:00:00+01:00</pubDate>
        </item>

        <item>
                <title>Dutch newspaper wipes out articles citing fabricated sources - Internet Archive to the rescue!</title>
                <description>&lt;p&gt;Shortly before Christmas, Dutch daily newspaper &lt;em&gt;Trouw&lt;/em&gt; &lt;a href=&quot;http://www.nrc.nl/nieuws/2014/12/20/trouw-trekt-126-artikelen-van-perdiep-ramesar-in/&quot;&gt;removed 126 articles&lt;/a&gt; from its website. These articles were all authored by Perdiep Ramesar, a former journalist of the newspaper. Ramesar had been fired by &lt;em&gt;Trouw&lt;/em&gt; in November, after it turned out that many of the sources that are cited in his articles were &lt;a href=&quot;http://static1.trouw.nl/static/asset/2014/Onderzoeksrapport_bronnengebruik_Trouw_19122014_7707.pdf&quot;&gt;fabricated&lt;/a&gt;. The most notorious example was a series of pieces about the so-called “Sharia Triangle”, a neighbourhood in the city of The Hague, which Ramesar claimed was being ruled  by Sharia law. As it turned out, this story was largely based on fabricated sources. Nevertheless, it was taken at face value by most major Dutch news outlets at the time, and even prompted a &lt;a href=&quot;http://www.tweedekamer.nl/kamerstukken/detail?id=2013D34540&amp;amp;did=2013D34540&quot;&gt;parliamentary debate&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Trouw&lt;/em&gt;’s decision to remove the 126  articles overnight was met with considerable criticism. For example, historian Jan Dirk Snel &lt;a href=&quot;http://jandirksnel.wordpress.com/2014/12/24/geschiedvervalsing-het-echte-schandaal-bij-trouw-is-nu-pas-begonnen/&quot;&gt;noted&lt;/a&gt; that the removal of these articles makes it impossible to check &lt;em&gt;what&lt;/em&gt; was wrong with them in the first place. Various other critics accused &lt;em&gt;Trouw&lt;/em&gt; of trying to &lt;a href=&quot;http://www.journalismlab.nl/2014/12/perdiep-gewist-gaan-trouw-en-ad-gaan-voor-geschiedvervalsing/&quot;&gt;rewrite history&lt;/a&gt;.&lt;/p&gt;

&lt;!-- more --&gt;

&lt;h2 id=&quot;internet-archive-to-the-rescue&quot;&gt;Internet Archive to the rescue?&lt;/h2&gt;

&lt;p&gt;A quick check on a handful of Ramesar’s articles revealed that quite a few were still accessible from the &lt;a href=&quot;http://archive.org/web/&quot;&gt;Internet Archive’s Wayback Machine&lt;/a&gt;. This got me curious how many out of the 126 deleted articles would still be available there. Answering this question isn’t completely straightforward, because the Wayback Machine isn’t easily searchable. In order to locate any of the deleted articles, one first needs to know its original URL (i.e. the one at &lt;em&gt;Trouw&lt;/em&gt;’s website). A &lt;a href=&quot;http://static3.trouw.nl/static/asset/2014/Artikelen_met_niet_verifieerbare_bronnen_Ramesar_2007_2014_7708.pdf&quot;&gt;list of all deleted articles&lt;/a&gt; does exist, but this only provides each article’s &lt;em&gt;title&lt;/em&gt;, without listing the full URL.&lt;/p&gt;

&lt;p&gt;However, by entering each title into a search engine (I used a combination of &lt;em&gt;Google&lt;/em&gt; and &lt;em&gt;DuckDuckGo&lt;/em&gt;&lt;sup id=&quot;fnref:1&quot;&gt;&lt;a href=&quot;#fn:1&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;1&lt;/a&gt;&lt;/sup&gt;), I was able to recover the original URL of every article in the list. In many cases the URLs were still present in the cache of the search engine. In other cases URLs could be recovered from linking pages on the &lt;em&gt;Trouw&lt;/em&gt; website. I then wrote a simple &lt;a href=&quot;https://github.com/bitsgalore/trouwRamesarWayback/blob/master/scripts/checkLinksInWayback.py&quot;&gt;script&lt;/a&gt; to check the availability of each URL in Internet Archive’s Wayback Machine. The script is just a wrapper around Wayback’s &lt;a href=&quot;https://archive.org/help/wayback_api.php&quot;&gt;Availability JSON API&lt;/a&gt;, which is insanely handy (and really easy to use as well!). This yielded a &lt;a href=&quot;/images/2015/01/ramesarTrouwURLSWayback.csv&quot;&gt;list&lt;/a&gt; with -for each article- its status in Wayback (i.e. has it been archived), and, if so, the URL to the most recent capture.&lt;/p&gt;
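
&lt;p&gt;To give an idea of how little is needed to use the Availability API, here is a minimal Python sketch. It is not the script linked above; the function names are hypothetical, and the layout of the JSON response (an &lt;em&gt;archived_snapshots&lt;/em&gt; object with an optional &lt;em&gt;closest&lt;/em&gt; entry) is assumed to follow the API documentation.&lt;/p&gt;

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

API_ENDPOINT = "https://archive.org/wayback/available"


def availability_query(url):
    """Build the Availability API request URL for one original URL."""
    return API_ENDPOINT + "?" + urlencode({"url": url})


def closest_snapshot(response):
    """Return the URL of the closest capture, or None if nothing is archived.

    The argument is the decoded JSON response (a dictionary).
    """
    closest = response.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest.get("url")
    return None


def check_url(url, timeout=30):
    """Query the API over the network and return the capture URL or None."""
    with urlopen(availability_query(url), timeout=timeout) as reply:
        return closest_snapshot(json.load(reply))
```

&lt;p&gt;Calling &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;check_url&lt;/code&gt; for each recovered URL, and writing the results to a comma-separated file, would give a list similar to the one described above.&lt;/p&gt;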

&lt;h2 id=&quot;result&quot;&gt;Result&lt;/h2&gt;

&lt;p&gt;The results of the above exercise are summarised in &lt;a href=&quot;/images/2015/01/tabelRamesar.html&quot;&gt;this table&lt;/a&gt;. As it turned out, 53 out of the 126 deleted articles are still accessible from the Internet Archive. These are mostly pieces that were written from 2010 onward, and include the notorious “Sharia Triangle” ones. From the time period 2007-2009 very few articles could be found.&lt;/p&gt;

&lt;h2 id=&quot;possibly-more&quot;&gt;Possibly more?&lt;/h2&gt;

&lt;p&gt;It is possible that more of the removed articles are hidden in the Internet Archive. This is because of the way the &lt;em&gt;Trouw&lt;/em&gt; website handles news items. If I understand things correctly, an article in &lt;em&gt;Trouw&lt;/em&gt; is often first published under a &lt;em&gt;news&lt;/em&gt; URL; subsequently it is moved to the &lt;em&gt;archive&lt;/em&gt; section of the website, where it is published under a different URL. By way of illustration, a &lt;em&gt;DuckDuckGo&lt;/em&gt; search for the article &lt;em&gt;Ik kan mezelf niet veranderen in een witte man&lt;/em&gt; yielded the following URL:&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://www.trouw.nl/tr/nl/5009/Archief/archief/article/detail/3287592/2012/07/17/Ik-kan-mezelf-niet-veranderen-in-een-witte-man.dhtml&quot;&gt;http://www.trouw.nl/tr/nl/5009/Archief/archief/article/detail/3287592/2012/07/17/Ik-kan-mezelf-niet-veranderen-in-een-witte-man.dhtml&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This &lt;em&gt;archive&lt;/em&gt; link (recognisable from the word &lt;em&gt;archief&lt;/em&gt; in the URL) cannot be found anywhere in Internet Archive. By chance I encountered a different link to the same article on the website of historian Jan Dirk Snel:&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://www.trouw.nl/tr/nl/4504/Economie/article/detail/3287689/2012/07/17/Ik-kan-mezelf-niet-veranderen-in-een-witte-man.dhtml&quot;&gt;http://www.trouw.nl/tr/nl/4504/Economie/article/detail/3287689/2012/07/17/Ik-kan-mezelf-niet-veranderen-in-een-witte-man.dhtml&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is the &lt;em&gt;news&lt;/em&gt; link under which the article was first published, and a snapshot of it exists in the Internet Archive:&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://web.archive.org/web/20141224185904/http://www.trouw.nl/tr/nl/4504/Economie/article/detail/3287689/2012/07/17/Ik-kan-mezelf-niet-veranderen-in-een-witte-man.dhtml&quot;&gt;http://web.archive.org/web/20141224185904/http://www.trouw.nl/tr/nl/4504/Economie/article/detail/3287689/2012/07/17/Ik-kan-mezelf-niet-veranderen-in-een-witte-man.dhtml&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Likewise, I expect that some articles may have slipped through the net in a similar way. Nevertheless, I think the above results are pretty good as they are!&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href=&quot;/images/2015/01/tabelRamesar.html&quot;&gt;Click here for the full list of removed articles, includes both original URLs and URLs in Internet Archive (if available)&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;h2 id=&quot;links&quot;&gt;Links&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;/images/2015/01/ramesarTrouwURLSWayback.csv&quot;&gt;Data as comma-separated text file (UTF-8)&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://github.com/bitsgalore/trouwRamesarWayback&quot;&gt;Github repo with scripts and raw data files&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://github.com/bitsgalore/trouwRamesarWayback/archive/master.zip&quot;&gt;Github repo (as single ZIP file)&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;div class=&quot;footnotes&quot; role=&quot;doc-endnotes&quot;&gt;
  &lt;ol&gt;
    &lt;li id=&quot;fn:1&quot;&gt;
      &lt;p&gt;In order to get unscrambled links from Google, I used the following Firefox add-on: &lt;a href=&quot;https://palant.de/2011/11/28/google-yandex-search-link-fix&quot;&gt;https://palant.de/2011/11/28/google-yandex-search-link-fix&lt;/a&gt; &lt;a href=&quot;#fnref:1&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
  &lt;/ol&gt;
&lt;/div&gt;
</description>
                <link>https://bitsgalore.org/2015/01/06/Dutch-newspaper-wipes-articles-Internet-Archive-rescue</link>
                <guid>https://bitsgalore.org/2015/01/06/Dutch-newspaper-wipes-articles-Internet-Archive-rescue</guid>
                <pubDate>2015-01-06T00:00:00+01:00</pubDate>
        </item>

        <item>
                <title>Perdiep Ramesar in the Internet Archive</title>
                <description>&lt;p&gt;Earlier this week, &lt;a href=&quot;http://www.nrc.nl/nieuws/2014/12/20/trouw-trekt-126-artikelen-van-perdiep-ramesar-in/&quot;&gt;daily newspaper &lt;em&gt;Trouw&lt;/em&gt; removed 126&lt;/a&gt; articles from its website that were written by dismissed journalist Perdiep Ramesar. This followed the &lt;a href=&quot;http://static1.trouw.nl/static/asset/2014/Onderzoeksrapport_bronnengebruik_Trouw_19122014_7707.pdf&quot;&gt;investigation&lt;/a&gt; into the “untraceable” sources cited by Ramesar. &lt;em&gt;Trouw&lt;/em&gt;’s decision to take the unreliable articles off its site was met with considerable criticism. Some called it a &lt;a href=&quot;http://www.journalismlab.nl/2014/12/perdiep-gewist-gaan-trouw-en-ad-gaan-voor-geschiedvervalsing/&quot;&gt;falsification of history&lt;/a&gt;. Historian Jan Dirk Snel &lt;a href=&quot;http://jandirksnel.wordpress.com/2014/12/24/geschiedvervalsing-het-echte-schandaal-bij-trouw-is-nu-pas-begonnen/&quot;&gt;rightly pointed out&lt;/a&gt; that now that the pieces have been removed, nobody can check any more what may or may not be wrong with them.&lt;/p&gt;

&lt;!-- more --&gt;

&lt;h2 id=&quot;vindbaarheid-in-internet-archive&quot;&gt;Findability in the Internet Archive&lt;/h2&gt;

&lt;p&gt;Out of curiosity I checked for a number of the removed articles whether they could still be found in the &lt;a href=&quot;https://archive.org/&quot;&gt;Internet Archive&lt;/a&gt;. For some pieces this indeed turned out to be the case. This made me curious how many of the 126 removed articles would still be findable. The only problem is that the Internet Archive isn’t really easy to search. To track down an article, you essentially need its original URL (i.e. the one on the &lt;em&gt;Trouw&lt;/em&gt; website). &lt;em&gt;Trouw&lt;/em&gt; did publish a &lt;a href=&quot;http://static3.trouw.nl/static/asset/2014/Artikelen_met_niet_verifieerbare_bronnen_Ramesar_2007_2014_7708.pdf&quot;&gt;list of the removed articles&lt;/a&gt;, but this only gives each article’s &lt;em&gt;title&lt;/em&gt;, and not the full link.&lt;/p&gt;

&lt;p&gt;Because the articles were removed only recently, their URLs are still present in the cache of most search engines. By entering the titles from the list of removed articles into &lt;em&gt;Google&lt;/em&gt; and &lt;em&gt;DuckDuckGo&lt;/em&gt;, I managed to recover the original URLs of all 126 articles&lt;sup id=&quot;fnref:1&quot;&gt;&lt;a href=&quot;#fn:1&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;1&lt;/a&gt;&lt;/sup&gt;. Using a small self-written &lt;a href=&quot;https://github.com/bitsgalore/trouwRamesarWayback/blob/master/scripts/checkLinksInWayback.py&quot;&gt;script&lt;/a&gt;, I then looked up each URL in the Internet Archive. This gave me a &lt;a href=&quot;/images/2014/12/ramesarTrouwURLSWayback.csv&quot;&gt;list&lt;/a&gt; with, for each article, its status in the Internet Archive (has it been archived or not) and, if present, the most recently archived version.&lt;/p&gt;

&lt;h2 id=&quot;resultaat&quot;&gt;Result&lt;/h2&gt;

&lt;p&gt;I summarised the result of the analysis described above in &lt;a href=&quot;/images/2014/12/tabelRamesar.html&quot;&gt;this table&lt;/a&gt;. Of the 126 removed articles, 53 can still be retrieved from the Internet Archive. These are mainly articles from 2010 and later; very little can be found from the period 2007-2009.&lt;/p&gt;

&lt;h2 id=&quot;nog-meer-te-vinden&quot;&gt;More to be found?&lt;/h2&gt;

&lt;p&gt;Incidentally, I expect that a thorough search will turn up more: as far as I understand it, an article on the &lt;em&gt;Trouw&lt;/em&gt; website is first published under a &lt;em&gt;news link&lt;/em&gt;; it subsequently moves to the archive, after which it is available under an &lt;em&gt;archive link&lt;/em&gt;. An example is the article &lt;em&gt;Ik kan mezelf niet veranderen in een witte man&lt;/em&gt;. A &lt;em&gt;DuckDuckGo&lt;/em&gt; search gave me the following link for it:&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://www.trouw.nl/tr/nl/5009/Archief/archief/article/detail/3287592/2012/07/17/Ik-kan-mezelf-niet-veranderen-in-een-witte-man.dhtml&quot;&gt;http://www.trouw.nl/tr/nl/5009/Archief/archief/article/detail/3287592/2012/07/17/Ik-kan-mezelf-niet-veranderen-in-een-witte-man.dhtml&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This &lt;em&gt;archive link&lt;/em&gt; cannot be found in the Internet Archive. On Jan Dirk Snel’s site I came across the &lt;em&gt;news link&lt;/em&gt; of the same article:&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://www.trouw.nl/tr/nl/4504/Economie/article/detail/3287689/2012/07/17/Ik-kan-mezelf-niet-veranderen-in-een-witte-man.dhtml&quot;&gt;http://www.trouw.nl/tr/nl/4504/Economie/article/detail/3287689/2012/07/17/Ik-kan-mezelf-niet-veranderen-in-een-witte-man.dhtml&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And that one is in the Internet Archive:&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://web.archive.org/web/20141224185904/http://www.trouw.nl/tr/nl/4504/Economie/article/detail/3287689/2012/07/17/Ik-kan-mezelf-niet-veranderen-in-een-witte-man.dhtml&quot;&gt;http://web.archive.org/web/20141224185904/http://www.trouw.nl/tr/nl/4504/Economie/article/detail/3287689/2012/07/17/Ik-kan-mezelf-niet-veranderen-in-een-witte-man.dhtml&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Er zullen dus nog wel meer artikelen op vergelijkbare wijze door het net zijn geglipt. Als lezers nog aanvullingen of correcties hebben dan hoor ik dat natuurlijk graag!&lt;/p&gt;

&lt;h2 id=&quot;ad&quot;&gt;AD&lt;/h2&gt;

&lt;p&gt;Het &lt;em&gt;AD&lt;/em&gt; is nog veel verder gegaan dan &lt;em&gt;Trouw&lt;/em&gt;, en heeft gelijk &lt;em&gt;alle&lt;/em&gt; artikelen waarvan Ramesar auteur is verwijderd. Van veel van deze stukken &lt;a href=&quot;https://www.google.nl/?q=%22Perdiep+Ramesar%22+site:ad.nl#q=%22Perdiep+Ramesar%22+site:ad.nl&quot;&gt;zijn de originele URLs nog te achterhalen via de &lt;em&gt;Google&lt;/em&gt; cache&lt;/a&gt;. Maar niet lang meer! Omdat het hier om honderden artikelen gaat, is het geen doen om de URLs allemaal handmatig op te vragen. &lt;a href=&quot;https://developers.google.com/custom-search/json-api/v1/overview&quot;&gt;&lt;em&gt;Google&lt;/em&gt; biedt een &lt;em&gt;Search API&lt;/em&gt;&lt;/a&gt; aan, en daarmee zou het mogelijk moeten zijn om dit grotendeels te automatiseren. Die URLs kun je vervolgens weer proberen terug te zoeken in Internet Archive, net zoals ik dat voor de &lt;em&gt;Trouw&lt;/em&gt; artikelen heb gedaan. Ik ga daar zelf nu geen tijd in steken, maar misschien heeft iemand anders zin om hiermee aan de slag te gaan. Enige haast is hierbij wel geboden, want binnen een paar weken zullen de links uit &lt;em&gt;Google&lt;/em&gt;’s cache verdwenen zijn, en het is maar de vraag of je het dan nog ooit terug kunt vinden.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href=&quot;/images/2014/12/tabelRamesar.html&quot;&gt;Click here for the full list of articles, including original URLs and URLs in the Internet Archive (where available)&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;h2 id=&quot;links&quot;&gt;Links&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;/images/2014/12/ramesarTrouwURLSWayback.csv&quot;&gt;Data as a comma-separated text file (UTF-8)&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://github.com/bitsgalore/trouwRamesarWayback&quot;&gt;GitHub repository with the scripts and data files used&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://github.com/bitsgalore/trouwRamesarWayback/archive/master.zip&quot;&gt;ZIP file with all scripts and data files&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;div class=&quot;footnotes&quot; role=&quot;doc-endnotes&quot;&gt;
  &lt;ol&gt;
    &lt;li id=&quot;fn:1&quot;&gt;
      &lt;p&gt;Google has the annoying habit here of not giving the direct links to the search results. To work around this I used the following Firefox add-on: &lt;a href=&quot;https://palant.de/2011/11/28/google-yandex-search-link-fix&quot;&gt;https://palant.de/2011/11/28/google-yandex-search-link-fix&lt;/a&gt; &lt;a href=&quot;#fnref:1&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
  &lt;/ol&gt;
&lt;/div&gt;
</description>
                <link>https://bitsgalore.org/2014/12/28/Perdiep-Ramesar-Internet-Archive</link>
                <guid>https://bitsgalore.org/2014/12/28/Perdiep-Ramesar-Internet-Archive</guid>
                <pubDate>2014-12-28T00:00:00+01:00</pubDate>
        </item>

        <item>
                <title>Demise of the Dutch Blogosphere</title>
                <description>&lt;p&gt;Back in 2006, Dutch weblog &lt;a href=&quot;http://sargasso.nl/&quot;&gt;Sargasso&lt;/a&gt; started following the activity of about 260 Dutch blogs that were active at the time, mainly by looking at the frequency of new postings. &lt;!-- more --&gt; Earlier this week Sargasso &lt;a href=&quot;http://sargasso.nl/battle-blogs-war-lost/&quot;&gt;published an update of the status of these blogs in 2014&lt;/a&gt;. They summarised the results of their analysis with this fascinating visualisation&lt;sup id=&quot;fnref:1&quot;&gt;&lt;a href=&quot;#fn:1&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;1&lt;/a&gt;&lt;/sup&gt;:&lt;/p&gt;

&lt;iframe width=&quot;415&quot; scrolling=&quot;no&quot; height=&quot;620&quot; frameborder=&quot;0&quot; src=&quot;https://sargasso.nl/wp-content/uploads/2014/11/botb2014b.htm&quot;&gt;&lt;/iframe&gt;

&lt;p&gt;Each block in the figure represents one blog, where the color and the number of dots indicate how often new postings appear. Here’s the legend:&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th style=&quot;text-align: left&quot;&gt;Symbol&lt;/th&gt;
      &lt;th style=&quot;text-align: left&quot;&gt;Posting frequency&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;&lt;img src=&quot;http://www.sargasso.nl/wp-content/uploads/2006/03/bf1k1.png&quot; alt=&quot;&quot; /&gt;&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;High&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;&lt;img src=&quot;http://www.sargasso.nl/wp-content/uploads/2006/03/bf2k1.png&quot; alt=&quot;&quot; /&gt;&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Occasional gaps&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;&lt;img src=&quot;http://www.sargasso.nl/wp-content/uploads/2006/03/bf3k1.png&quot; alt=&quot;&quot; /&gt;&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Gaps of one or more days&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;&lt;img src=&quot;http://www.sargasso.nl/wp-content/uploads/2006/03/bf4k1.png&quot; alt=&quot;&quot; /&gt;&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Lots of large gaps&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;&lt;img src=&quot;http://www.sargasso.nl/wp-content/uploads/2006/03/bf5k1.png&quot; alt=&quot;&quot; /&gt;&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Almost dead&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;&lt;img src=&quot;http://www.sargasso.nl/wp-content/uploads/2006/03/bf6k1.png&quot; alt=&quot;&quot; /&gt;&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Casualty of war&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;The really outstanding feature here is that most of these blogs are now in the “casualty of war” category. They’re simply &lt;em&gt;gone&lt;/em&gt;; the URLs either don’t work, or they link to a completely different website. A pretty dramatic demonstration of the importance of web archiving if you ask me.&lt;/p&gt;

&lt;h2 id=&quot;link-to-original-feature-on-sargasso&quot;&gt;Link to original feature on Sargasso&lt;/h2&gt;

&lt;p&gt;&lt;a href=&quot;http://sargasso.nl/battle-blogs-war-lost/&quot;&gt;Battle of the Blogs - The war is lost&lt;/a&gt; (in Dutch)&lt;/p&gt;

&lt;div class=&quot;footnotes&quot; role=&quot;doc-endnotes&quot;&gt;
  &lt;ol&gt;
    &lt;li id=&quot;fn:1&quot;&gt;
      &lt;p&gt;If the visualisation doesn’t show, you may have a browser extension installed that blocks the “sargasso.nl” domain from which it is loaded. &lt;a href=&quot;#fnref:1&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
  &lt;/ol&gt;
&lt;/div&gt;
</description>
                <link>https://bitsgalore.org/2014/11/13/Demise-Of-Dutch-Blogosphere</link>
                <guid>https://bitsgalore.org/2014/11/13/Demise-Of-Dutch-Blogosphere</guid>
                <pubDate>2014-11-13T00:00:00+01:00</pubDate>
        </item>

        <item>
                <title>Quattro Pro for DOS&#58; an obsolete format at last?</title>
                <description>&lt;p&gt;While browsing &lt;a href=&quot;http://fileformats.archiveteam.org/wiki/Main_Page&quot;&gt;ArchiveTeam’s File Formats Wiki&lt;/a&gt; earlier this week, I came across some entries I created there on &lt;a href=&quot;http://fileformats.archiveteam.org/wiki/Quattro_Pro&quot;&gt;Quattro Pro spreadsheets&lt;/a&gt; two years ago. At the time I had also contributed some old Quattro Pro for DOS spreadsheets (&lt;a href=&quot;http://opf-labs.org/format-corpus/office/spreadsheet/wq1/&quot;&gt;here&lt;/a&gt; and &lt;a href=&quot;http://opf-labs.org/format-corpus/office/spreadsheet/wq2/&quot;&gt;here&lt;/a&gt;) from my personal archives to the &lt;a href=&quot;https://github.com/openplanets/format-corpus&quot;&gt;OPF format corpus&lt;/a&gt;. Seeing those files again, I decided to spend an afternoon trying to access them using modern-day software. This turned out to be more challenging than expected. It even made me wonder whether, at long last, I had finally run into a case of the much discussed (but rarely observed) phenomenon of &lt;a href=&quot;https://openpreservation.org/blog/2010/12/22/obsolescence-overrated/&quot;&gt;format obsolescence&lt;/a&gt;. Yes, big words indeed, and if anyone would like to prove me wrong, the comments section below is your friend!&lt;/p&gt;

&lt;!-- more --&gt;

&lt;h2 id=&quot;what-its-all-about&quot;&gt;What it’s all about&lt;/h2&gt;

&lt;p&gt;&lt;a href=&quot;http://en.wikipedia.org/wiki/Quattro_Pro&quot;&gt;Quattro Pro&lt;/a&gt; is a spreadsheet program that was first released in 1988. It’s still around today as part of the  &lt;a href=&quot;http://www.wordperfect.com/gb/product/office-suite/&quot;&gt;WordPerfect Office suite&lt;/a&gt;. &lt;a href=&quot;http://fileformats.archiveteam.org/wiki/Quattro_Pro&quot;&gt;A number of file formats&lt;/a&gt; have been associated with the software. This blog post covers the old Quattro Pro for DOS formats:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;http://fileformats.archiveteam.org/wiki/WQ1&quot;&gt;Quattro Pro for DOS, versions 1-4 (WQ1)&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;http://fileformats.archiveteam.org/wiki/WQ2&quot;&gt;Quattro Pro for DOS, versions 5.0 and 5.5 (WQ2)&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;First of all, let’s look at the extent to which contemporary spreadsheet software can handle these formats.&lt;/p&gt;

&lt;h2 id=&quot;ms-excel&quot;&gt;MS Excel&lt;/h2&gt;

&lt;p&gt;Support for Quattro Pro spreadsheets (including recent versions of the format!) was removed altogether from more recent versions of Excel, as shown by this &lt;a href=&quot;http://office.microsoft.com/en-us/excel-help/file-formats-that-are-supported-in-excel-HP010352464.aspx#BMunsupportedformats&quot;&gt;overview of file formats that are not supported in Excel 2010&lt;/a&gt;. On a side note, this list also includes all versions of &lt;a href=&quot;http://fileformats.archiveteam.org/wiki/Lotus_1-2-3&quot;&gt;Lotus 1-2-3&lt;/a&gt; (which was once widely used). Older versions of Excel did offer support for the format. According to Microsoft, &lt;a href=&quot;http://office.microsoft.com/en-us/excel-help/about-opening-and-saving-files-from-other-programs-HP005253843.aspx&quot;&gt;Excel 2003 supports Quattro Pro spreadsheets&lt;/a&gt; (albeit only after installing some converter add-ons from the Microsoft Office Web site&lt;sup id=&quot;fnref:1&quot;&gt;&lt;a href=&quot;#fn:1&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;1&lt;/a&gt;&lt;/sup&gt;). The website explicitly mentions Quattro Pro for DOS files, although it also says “there are some limitations to opening the worksheets”.&lt;/p&gt;

&lt;h2 id=&quot;libreoffice--openoffice&quot;&gt;LibreOffice / OpenOffice&lt;/h2&gt;

&lt;p&gt;According to &lt;a href=&quot;https://wiki.documentfoundation.org/Feature_Comparison:_LibreOffice_-_Microsoft_Office&quot;&gt;this overview&lt;/a&gt;, LibreOffice offers support for &lt;a href=&quot;http://fileformats.archiveteam.org/wiki/WB2&quot;&gt;Quattro Pro 6 (WB2)&lt;/a&gt; spreadsheets, but it cannot handle the older Quattro Pro for DOS formats. It doesn’t mention &lt;a href=&quot;http://fileformats.archiveteam.org/wiki/Quattro_Pro&quot;&gt;newer versions&lt;/a&gt; of the format either. The situation is the same for &lt;a href=&quot;https://wiki.openoffice.org/wiki/Documentation/OOo3_User_Guides/Getting_Started/File_formats&quot;&gt;OpenOffice&lt;/a&gt;.&lt;/p&gt;

&lt;h2 id=&quot;tests-with-quattro-pro-x7&quot;&gt;Tests with Quattro Pro X7&lt;/h2&gt;

&lt;p&gt;With none of Excel, LibreOffice or OpenOffice able to open my spreadsheets, I went over to the WordPerfect website and &lt;a href=&quot;http://www.wordperfect.com/gb/free-trials/&quot;&gt;grabbed a trial version of the WordPerfect Office suite&lt;/a&gt; (which includes Quattro Pro X7). I then tried opening some files, all of which are available from the &lt;a href=&quot;https://github.com/openplanets/format-corpus/tree/master/office/spreadsheet&quot;&gt;spreadsheet&lt;/a&gt; section of the OPF Format Corpus.&lt;/p&gt;

&lt;h3 id=&quot;simple-numerical--text-data&quot;&gt;Simple numerical / text data&lt;/h3&gt;

&lt;p&gt;I started out by opening two versions of a spreadsheet that contains simple numerical and text data. The &lt;a href=&quot;https://github.com/openpreserve/format-corpus/blob/master/office/spreadsheet/wq1/KSBASE.WQ1&quot;&gt;first version&lt;/a&gt; has the &lt;a href=&quot;http://fileformats.archiveteam.org/wiki/WQ1&quot;&gt;WQ1 (Quattro Pro for DOS version 1-4)&lt;/a&gt; format. The file opens without problems:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/2014/10/ksbase_wq1.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;I also had &lt;a href=&quot;https://github.com/openpreserve/format-corpus/blob/master/office/spreadsheet/wq2/KSBASE.WQ2&quot;&gt;another version of that spreadsheet&lt;/a&gt; in &lt;a href=&quot;http://fileformats.archiveteam.org/wiki/WQ2&quot;&gt;WQ2 (Quattro Pro for DOS version 5)&lt;/a&gt; format. Opening this file produced the following result:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/2014/10/ksbase_wq2.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;For some reason the numbers in some columns (&lt;em&gt;A&lt;/em&gt;, &lt;em&gt;C&lt;/em&gt;, &lt;em&gt;D&lt;/em&gt;, &lt;em&gt;G&lt;/em&gt;, &lt;em&gt;H&lt;/em&gt;, &lt;em&gt;I&lt;/em&gt;) aren’t displayed, but clicking on any of those cells reveals they are actually still there. Changing the formatting properties also makes them visible again, as shown below:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/2014/10/ksbase_wq2_fixedformatting.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;So it looks like this is only a formatting issue.&lt;/p&gt;

&lt;h3 id=&quot;simple-formulas-charts&quot;&gt;Simple formulas, charts&lt;/h3&gt;

&lt;p&gt;Next I moved on to two other spreadsheets, which are a bit more interesting because they do some simple calculations&lt;sup id=&quot;fnref:2&quot;&gt;&lt;a href=&quot;#fn:2&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;2&lt;/a&gt;&lt;/sup&gt; and also contain charts. First I opened &lt;a href=&quot;https://github.com/openpreserve/format-corpus/blob/master/office/spreadsheet/wq2/KS4001.WQ2&quot;&gt;KS4001.WQ2&lt;/a&gt;; the screenshot below shows how it is rendered by Quattro Pro:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/2014/10/ks4001_wq2.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;The main calculation results are in cells &lt;em&gt;H17&lt;/em&gt; and &lt;em&gt;H18&lt;/em&gt;. The blue arrows highlight the cells from which these values are calculated. The calculated results are also correct. As before, two columns &lt;em&gt;appear&lt;/em&gt; to be empty, but again, clicking these cells reveals that the underlying data (numbers in column &lt;em&gt;C&lt;/em&gt;, and a calculation result in column &lt;em&gt;D&lt;/em&gt;) are still present. I really don’t remember what the chart originally looked like, but I was pleasantly surprised to see it’s still displayed at all!&lt;/p&gt;

&lt;h3 id=&quot;external-links&quot;&gt;External links&lt;/h3&gt;

&lt;p&gt;Things got really interesting when I tried opening &lt;a href=&quot;https://github.com/openpreserve/format-corpus/blob/master/office/spreadsheet/wq2/KS4000.WQ2&quot;&gt;this WQ2 spreadsheet&lt;/a&gt;. Upon opening, Quattro Pro comes up with a preview of the file, and a &lt;em&gt;Hotlinks&lt;/em&gt; dialogue box:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/2014/10/ks4000_wq2_hotlinks.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;I highlighted some areas in red; we’ll get back to that in a second. I first selected &lt;em&gt;Open Supporting&lt;/em&gt; in the dialogue box, and pressed &lt;em&gt;OK&lt;/em&gt;. The result was that both this spreadsheet and &lt;a href=&quot;https://github.com/openpreserve/format-corpus/blob/master/office/spreadsheet/wq2/KSBASE.WQ2&quot;&gt;another one (our earlier &lt;em&gt;KSBASE.WQ2&lt;/em&gt;)&lt;/a&gt; were loaded, so apparently it contains a link to that file. After loading, the spreadsheet displays as follows:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/2014/10/ks4000_wq2_opensupporting.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Now, pay special attention to the highlighted cells and compare them against the initial preview. This reveals some pretty dramatic changes:
some of the preview values in rows 4 and 5 are replaced by &lt;em&gt;Evaluator Stack Error&lt;/em&gt; after the file is fully loaded. Clicking on those cells also results in odd sequences of Unicode characters in the formula bar:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/2014/10/stackError.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;My best guess is that these cells contain external references that -for whatever reason- aren’t resolved correctly. This in turn also influences the calculation results in cells &lt;em&gt;H17&lt;/em&gt; and &lt;em&gt;H18&lt;/em&gt;. I also tried opening the file using the &lt;em&gt;Update References&lt;/em&gt; and &lt;em&gt;None&lt;/em&gt; options; in both cases the results are the same as described above.&lt;/p&gt;

&lt;p&gt;With no access to the software that originally created the files, it is impossible to tell why the external references aren’t working. It could be a bug in Quattro Pro, but I’m not ruling out that the spreadsheet may simply be faulty (e.g. the original referenced spreadsheet may have been replaced by an identically-named file at some point). Nevertheless, the fact that the correct cell values are displayed in the preview shows that the original data &lt;em&gt;are&lt;/em&gt; present, and it’s rather worrying that Quattro Pro doesn’t offer an option to fully load the files without updating/overruling them.&lt;/p&gt;

&lt;h2 id=&quot;implications-for-long-term-access&quot;&gt;Implications for long-term access&lt;/h2&gt;

&lt;p&gt;Apart from Quattro Pro, modern spreadsheet programs offer no support for Quattro Pro for DOS spreadsheets. The most recent version of Quattro Pro still reads both DOS era formats, although there are some problems. Some of these are formatting-related (e.g. cells that contain data showing up as blank), and can be easily remedied. The behaviour of one spreadsheet with an external dependency is much more problematic, especially because Quattro Pro updates the original values (which are stored in the file) after fully loading the spreadsheet. Migrating this spreadsheet to another format would result in the loss of some of the original data. So, based on this (admittedly cursory) analysis it looks like no modern-day software is able to correctly handle the Quattro Pro for DOS formats. Add to this that the Quattro Pro for DOS formats are proprietary with (as far as I’m aware) no publicly available specifications, and I think we have a pretty strong candidate for a format that may be (nearly) obsolete.&lt;/p&gt;

&lt;h2 id=&quot;solutions&quot;&gt;Solutions&lt;/h2&gt;

&lt;p&gt;Although I haven’t explored any concrete solutions for accessing Quattro Pro for DOS spreadsheets, some obvious routes would be:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Run an old copy of Quattro Pro for DOS (e.g. in a virtual machine) and export the spreadsheet to e.g. the &lt;a href=&quot;http://fileformats.archiveteam.org/wiki/Lotus_1-2-3&quot;&gt;Lotus 1-2-3&lt;/a&gt; format (which is still reasonably well supported today).&lt;/li&gt;
  &lt;li&gt;Run an old version of MS Excel (2003 or earlier) and export the spreadsheet to the &lt;a href=&quot;http://fileformats.archiveteam.org/wiki/XLS&quot;&gt;XLS&lt;/a&gt; format.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If anyone decides to have a go at this, I’d be very interested to see the results!&lt;/p&gt;

&lt;h2 id=&quot;update-analysis-by-euan-cochrane-lotus-1-2-3-problematic-as-well&quot;&gt;Update: analysis by Euan Cochrane; Lotus 1-2-3 problematic as well?&lt;/h2&gt;

&lt;p&gt;In response to this blog post, Euan Cochrane has done &lt;a href=&quot;https://www.webarchive.org.uk/wayback/en/archive/20160105203022/http://openpreservation.org/blog/2014/10/29/opening-johans-quattro-pro-files-quattro-pro-6-win-311/&quot;&gt;some  additional tests with my Quattro Pro files&lt;/a&gt; using Quattro Pro 6 running in an emulated environment. Euan’s analysis is highly recommended for any readers of this post. Moreover, trying to open a Lotus 1-2-3 file that Euan created as part of his analysis made me realise that Lotus 1-2-3 spreadsheets may also be more problematic than I initially thought. See my &lt;a href=&quot;https://www.webarchive.org.uk/wayback/en/archive/20160105203022mp_/http://openpreservation.org/blog/2014/10/29/opening-johans-quattro-pro-files-quattro-pro-6-win-311/#comment-354&quot;&gt;comment&lt;/a&gt; under Euan’s blog post.&lt;/p&gt;

&lt;h2 id=&quot;post-script-february-2019&quot;&gt;Post script February 2019&lt;/h2&gt;

&lt;p&gt;In the folder from which I recovered this blog post, I also found a note with what looks like a comment to either this post, or Euan’s follow-up post (as these comments have disappeared from the OPF site there’s no way to tell). As it contains some relevant additional information on Lotus 1-2-3, I’ve included it below:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;After I posted the above comment I also found out that IBM officially ceased its support for Lotus 123 on 30 September 2014. See the announcement here:&lt;/p&gt;

  &lt;p&gt;&lt;a href=&quot;http://www-01.ibm.com/common/ssi/cgi-bin/ssialias?subtype=ca&amp;amp;infotype=an&amp;amp;appname=iSource&amp;amp;supplier=897&amp;amp;letternum=ENUS913-091&quot;&gt;http://www-01.ibm.com/common/ssi/cgi-bin/ssialias?subtype=ca&amp;amp;infotype=an&amp;amp;appname=iSource&amp;amp;supplier=897&amp;amp;letternum=ENUS913-091&lt;/a&gt;&lt;/p&gt;

  &lt;p&gt;Here’s a recent feature on this from the Register:&lt;/p&gt;

  &lt;p&gt;&lt;a href=&quot;https://www.theregister.co.uk/2014/10/02/so_long_lotus_123_ibm_ceases_support_after_over_30_years_of_code/&quot;&gt;https://www.theregister.co.uk/2014/10/02/so_long_lotus_123_ibm_ceases_support_after_over_30_years_of_code/&lt;/a&gt;&lt;/p&gt;

  &lt;p&gt;This also applies to the Lotus SmartSuite and Organizer products (I’m not familiar with those products, so I have no idea if this has any additional format-related implications).&lt;/p&gt;

  &lt;h3 id=&quot;what-will-happen-to-the-lotus-1-2-3-codebase&quot;&gt;What will happen to the Lotus 1-2-3 codebase?&lt;/h3&gt;

  &lt;p&gt;This really makes me wonder what will happen to the old Lotus 1-2-3 codebase. Interestingly, in 2012 IBM discontinued its &lt;a href=&quot;http://en.wikipedia.org/wiki/IBM_Lotus_Symphony&quot;&gt;IBM Lotus Symphony&lt;/a&gt; suite, after which they donated the codebase of that product to the Apache Software Foundation, who then merged it into Apache OpenOffice.&lt;/p&gt;

  &lt;p&gt;I’m not aware of any efforts to save the Lotus 1-2-3 codebase, but I think this would be immensely helpful to keep those old spreadsheet formats accessible. I don’t know if this is something IBM would be willing to do (e.g. by releasing it as open source). This could also be of interest to initiatives like the &lt;a href=&quot;http://www.documentliberation.org/&quot;&gt;Document Liberation Project&lt;/a&gt;. If anyone has any info/additional thoughts on this please leave a comment!&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2 id=&quot;post-script-may-2025&quot;&gt;Post script May 2025&lt;/h2&gt;

&lt;p&gt;Over ten years after I wrote this post, I just &lt;a href=&quot;/2025/05/28/quattro-pro-for-dos-revisited-an-obsolete-format-no-more&quot;&gt;posted this follow-up&lt;/a&gt;. This shows that LibreOffice Calc is now able to read my old Quattro Pro for DOS spreadsheets, although there are some issues.&lt;/p&gt;

&lt;hr /&gt;

&lt;p&gt;Originally published at the &lt;a href=&quot;https://openpreservation.org/blog/2014/10/29/quattro-pro-dos-obsolete-format-last/&quot;&gt;Open Preservation Foundation blog&lt;/a&gt;&lt;/p&gt;
&lt;div class=&quot;footnotes&quot; role=&quot;doc-endnotes&quot;&gt;
  &lt;ol&gt;
    &lt;li id=&quot;fn:1&quot;&gt;
      &lt;p&gt;I wasn’t able to locate these converters anywhere on Microsoft’s website. &lt;a href=&quot;#fnref:1&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:2&quot;&gt;
      &lt;p&gt;Just in case anyone’s wondering: these spreadsheets calculate a soil’s saturated &lt;a href=&quot;http://en.wikipedia.org/wiki/Hydraulic_conductivity&quot;&gt;hydraulic conductivity&lt;/a&gt; from field measurement data using the &lt;a href=&quot;http://www.samsamwater.com/library/DETERMINING_HYDRAULIC_CONDUCTIVITY_WITH_THE_INVERSED_AUGER_HOLE_AND_INFILTROMETER_METHODS.pdf&quot;&gt;inverse auger hole method&lt;/a&gt;. &lt;a href=&quot;#fnref:2&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
  &lt;/ol&gt;
&lt;/div&gt;
</description>
                <link>https://bitsgalore.org/2014/10/29/quattro-pro-dos-obsolete-format-last</link>
                <guid>https://bitsgalore.org/2014/10/29/quattro-pro-dos-obsolete-format-last</guid>
                <pubDate>2014-10-29T00:00:00+01:00</pubDate>
        </item>

        <item>
                <title>Running archived Android apps on a PC&#58; first impressions</title>
                <description>&lt;p&gt;Earlier this week I had a discussion with some colleagues about the archiving of mobile phone and tablet apps (iPhone/Android), and, equally important, ways to provide long-term access. The immediate incentive for this was an announcement by a Dutch publisher, who recently published a children’s book that is accompanied by its own app. Also, there are already several examples of Ebooks that are published exclusively as mobile apps. So, even though we’re not receiving any apps in our collections yet, we’ll have to address this at some point, and it’s useful to have an initial idea of the challenges that may lie ahead.&lt;/p&gt;

&lt;!-- more --&gt;

&lt;p&gt;The scope of this blog is &lt;em&gt;not&lt;/em&gt; to provide any in-depth coverage of the long-term preservation of mobile apps. Instead, I was just curious about two specific aspects:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Is it possible to run a phone app on a regular PC, and, if yes, how?&lt;/li&gt;
  &lt;li&gt;How can you use this to run an archived copy of an app?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I spent a few afternoons running some preliminary tests. Since I’m pretty sure other institutions must be looking into this as well, I thought I might as well share the results, as well as some useful resources I came across along the way. For now I limited myself to the &lt;a href=&quot;http://www.android.com/&quot;&gt;Android&lt;/a&gt; platform (&lt;a href=&quot;http://en.wikipedia.org/wiki/IOS&quot;&gt;iOS&lt;/a&gt; presents additional challenges because of its restrictive license).&lt;/p&gt;

&lt;h2 id=&quot;the-android-app-format&quot;&gt;The Android app format&lt;/h2&gt;

&lt;p&gt;First of all it is helpful to know a bit more about the Android app format. Basically, an app (&lt;em&gt;.apk&lt;/em&gt;) file is just a ZIP archive with a specific file and directory structure. It is based on Java’s &lt;a href=&quot;http://fileformats.archiveteam.org/wiki/Jar&quot;&gt;Jar&lt;/a&gt; format. For more information see this entry on Archive Team’s file format wiki:&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://fileformats.archiveteam.org/wiki/APK&quot;&gt;http://fileformats.archiveteam.org/wiki/APK&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here you can also find how to download local copies of Android app files (e.g. to a PC), which is something that is not possible directly from the  &lt;a href=&quot;https://play.google.com/store/apps?hl=en&quot;&gt;Google Play store&lt;/a&gt;.&lt;/p&gt;
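
&lt;p&gt;Because an APK is a ZIP container, an archived copy can be given a first sanity check with any ZIP library before it goes anywhere near an emulator. The following is a minimal Python sketch that was not part of my tests: the function names &lt;code&gt;summarise_apk&lt;/code&gt; and &lt;code&gt;make_demo_apk&lt;/code&gt; are made up for illustration, and the demo archive is a stand-in that only mimics the entry names found in a real APK (&lt;code&gt;AndroidManifest.xml&lt;/code&gt;, &lt;code&gt;classes.dex&lt;/code&gt;).&lt;/p&gt;

```python
import io
import zipfile


def summarise_apk(apk_file):
    """Summarise the ZIP container of an APK (file path or file object)."""
    with zipfile.ZipFile(apk_file) as archive:
        names = archive.namelist()
        return {
            "entries": len(names),
            # Every APK is expected to carry a manifest at the top level
            "has_manifest": "AndroidManifest.xml" in names,
            # Compiled application code lives in one or more .dex files
            "dex_files": sorted(n for n in names if n.endswith(".dex")),
            # testzip() returns the first entry with a bad CRC, or None
            "corrupt_entry": archive.testzip(),
        }


def make_demo_apk():
    """Build a tiny in-memory ZIP that imitates an APK's layout (not a real app)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("AndroidManifest.xml", b"placeholder")
        archive.writestr("classes.dex", b"placeholder")
        archive.writestr("res/layout/main.xml", b"placeholder")
    buffer.seek(0)
    return buffer


summary = summarise_apk(make_demo_apk())
print(summary)
```

&lt;p&gt;For a real file, pass its path instead, e.g. &lt;code&gt;summarise_apk(&quot;some-app.apk&quot;)&lt;/code&gt;. A readable ZIP container says nothing about whether the app will actually run, but it does catch truncated or corrupted copies early.&lt;/p&gt;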

&lt;h2 id=&quot;running-android-on-a-pc&quot;&gt;Running Android on a PC&lt;/h2&gt;

&lt;p&gt;If you want to run Android on a regular (Linux or Windows) PC, several options exist. &lt;a href=&quot;http://www.extremetech.com/computing/83812-run-android-apps-on-your-windows-pc-2&quot;&gt;This article&lt;/a&gt; gives a good general overview. The “best” solution according to its author is a &lt;a href=&quot;http://www.bluestacks.com/app-player.html&quot;&gt;third-party developed app player&lt;/a&gt;. However, that player only works under Windows, it is proprietary, and non-free: to use it, you either pay a monthly fee, or put up with “sponsored apps”. Google’s Android SDK also includes an &lt;a href=&quot;http://developer.android.com/tools/help/emulator.html&quot;&gt;emulator&lt;/a&gt;, which is mainly targeted at app developers. I didn’t look into that now, mainly because of its alleged poor performance. Instead, I went for a third option.&lt;/p&gt;

&lt;h2 id=&quot;android-on-virtualbox&quot;&gt;Android on VirtualBox&lt;/h2&gt;

&lt;p&gt;The &lt;a href=&quot;http://www.android-x86.org/&quot;&gt;Android-x86 Project&lt;/a&gt; has created a port of the &lt;a href=&quot;http://source.android.com/&quot;&gt;Android Open Source Project&lt;/a&gt; that runs on &lt;a href=&quot;http://en.wikipedia.org/wiki/X86&quot;&gt;X86-based&lt;/a&gt; architectures. This opens up the possibility to run Android on an ordinary PC, either as the main operating system, or in a virtual machine. So, I decided to take the latter route and installed Android on a virtual machine using &lt;a href=&quot;https://www.virtualbox.org/&quot;&gt;VirtualBox&lt;/a&gt;. This is relatively straightforward, and several excellent step-by-step descriptions on how to do this exist, for instance:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;http://www.howtogeek.com/164570/how-to-install-android-in-virtualbox/&quot;&gt;How to Install Android in VirtualBox&lt;/a&gt; (&lt;a href=&quot;http://web.archive.org/web/20141023111241/http://www.howtogeek.com/164570/how-to-install-android-in-virtualbox/&quot;&gt;archived link&lt;/a&gt;); and&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;http://www.fixedbyvonnie.com/2014/02/install-android-4-4-kitkat-windows-using-virtualbox/&quot;&gt;How to install Android 4.4 KitKat in Windows using VirtualBox&lt;/a&gt; (&lt;a href=&quot;http://web.archive.org/web/20141023111321/http://www.fixedbyvonnie.com/2014/02/install-android-4-4-kitkat-windows-using-virtualbox/&quot;&gt;archived link&lt;/a&gt;).&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The latest ISO images from Android-x86 can be found here (for this test I used version 4.4):&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://sourceforge.net/projects/android-x86/files/&quot;&gt;http://sourceforge.net/projects/android-x86/files/&lt;/a&gt;&lt;/p&gt;

&lt;h2 id=&quot;running-and-waking-up-android&quot;&gt;Running and waking up Android&lt;/h2&gt;

&lt;p&gt;Setting up the virtual machine was easy enough, and Android appeared to work well straight away. One thing that can be confusing for first-time users is the way VirtualBox deals with mouse input: once you click your mouse in the screen area that is occupied by the virtual machine, VirtualBox shows a dialog asking whether it should “capture” the mouse. Once you click &lt;em&gt;Capture&lt;/em&gt;, the mouse can only be used inside the virtual machine (and not for any other applications that are running on the host machine). You can &lt;em&gt;uncapture&lt;/em&gt; the mouse at any time by pressing the right-hand &lt;em&gt;Ctrl&lt;/em&gt; key. Another thing that initially puzzled me is that Android enters sleep mode after several minutes of inactivity, resulting in a black screen. Once in sleep mode, it is not very obvious how to wake it up again (even rebooting the VM didn’t do the trick). After some searching I found that &lt;a href=&quot;http://www.sysads.co.uk/2014/01/install-android-4-3-virtualbox-screenshots&quot;&gt;the solution&lt;/a&gt; here is to press the &lt;a href=&quot;http://en.wikipedia.org/wiki/Menu_key&quot;&gt;Menu key&lt;/a&gt; on the keyboard (located next to the right-hand &lt;em&gt;Ctrl&lt;/em&gt; key on most keyboards), which instantly brings the machine back to life.&lt;/p&gt;

&lt;h2 id=&quot;moving-an-app-to-the-virtual-machine&quot;&gt;Moving an app to the virtual machine&lt;/h2&gt;

&lt;p&gt;To install an app in Android, you would normally go to the &lt;a href=&quot;https://play.google.com/store/apps?hl=en&quot;&gt;Google Play store&lt;/a&gt;. In an archival setting it is more likely that you already have an archived copy stored somewhere, so what we need here is the ability to install from a local &lt;a href=&quot;http://fileformats.archiveteam.org/wiki/APK&quot;&gt;APK file&lt;/a&gt;. This is also known as “side loading”, and &lt;a href=&quot;http://www.cnet.com/how-to/how-to-install-apps-outside-of-google-play/&quot;&gt;this article&lt;/a&gt; gives general instructions on how to do this with a physical device. Since we’re running Android on a virtual machine here, things are a bit different, and ideally we should be able to share a folder between the host machine and the (virtual) guest device. In theory this is all possible in VirtualBox, but as it turns out &lt;a href=&quot;http://superuser.com/questions/665696/shared-folder-in-virtualbox-with-android-not-working&quot;&gt;it doesn’t work&lt;/a&gt; because &lt;a href=&quot;http://stackoverflow.com/questions/8235165/getting-vbox-guest-addtions-for-android-x86&quot;&gt;Android-x86 doesn’t support VirtualBox Guest Additions&lt;/a&gt;. As a workaround, I ended up uploading the APK to Dropbox, and then opened Dropbox in Android’s web browser to download the file.&lt;/p&gt;

&lt;h2 id=&quot;installing-the-app&quot;&gt;Installing the app&lt;/h2&gt;

&lt;p&gt;The downloaded APK is now located in the &lt;em&gt;Download&lt;/em&gt; folder, which is accessible using Android’s file browser. After clicking on it, the following security warning popped up:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/2014/10/installBlocked.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;This is because, by default, apps from unknown sources (i.e. other than the &lt;a href=&quot;https://play.google.com/store/apps?hl=en&quot;&gt;Google Play store&lt;/a&gt;) are blocked by Android. The solution here is to click on &lt;em&gt;Settings&lt;/em&gt;, which opens up the security settings dialog:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/2014/10/tickUnknownSources.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Here, check the &lt;em&gt;Unknown sources&lt;/em&gt; option &lt;sup id=&quot;fnref:1&quot;&gt;&lt;a href=&quot;#fn:1&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;1&lt;/a&gt;&lt;/sup&gt;. Then go back to the &lt;em&gt;Downloads&lt;/em&gt; folder and click on the APK file again. It will now install the app.&lt;/p&gt;

&lt;h2 id=&quot;final-thoughts&quot;&gt;Final thoughts&lt;/h2&gt;

&lt;p&gt;In this blog post I provided some basic information about Android’s APK format, how to run Android in VirtualBox, and how to install an archived app. I tested this myself with a handful of apps. One thing I noticed was that some apps didn’t quite work as expected on my virtual Android machine, but as I didn’t have access to a ‘real’ (physical) device it’s impossible to tell whether this was due to the virtualisation or just a shortcoming of those apps. This would obviously need more work. Nevertheless, considering that I only spent a few odd afternoons on this, this approach looks quite promising.&lt;/p&gt;

&lt;hr /&gt;

&lt;p&gt;Originally published at the &lt;a href=&quot;https://openpreservation.org/blog/2014/10/23/running-archived-android-apps-pc-first-impressions/&quot;&gt;Open Preservation Foundation blog&lt;/a&gt;&lt;/p&gt;
&lt;div class=&quot;footnotes&quot; role=&quot;doc-endnotes&quot;&gt;
  &lt;ol&gt;
    &lt;li id=&quot;fn:1&quot;&gt;
      &lt;p&gt;Note that this will also enable you to install apps that are possibly harmful; use with care! &lt;a href=&quot;#fnref:1&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
  &lt;/ol&gt;
&lt;/div&gt;
</description>
                <link>https://bitsgalore.org/2014/10/23/running-archived-android-apps-pc-first-impressions</link>
                <guid>https://bitsgalore.org/2014/10/23/running-archived-android-apps-pc-first-impressions</guid>
                <pubDate>2014-10-23T00:00:00+02:00</pubDate>
        </item>

        <item>
                <title>Six ways to decode a lossy JP2</title>
                <description>&lt;p&gt;Some time ago Will Palmer, Peter May and Peter Cliff of the British Library published a really interesting &lt;a href=&quot;http://www.scape-project.eu/publication/palmer-ipres2013&quot;&gt;paper that investigated three different JPEG 2000 codecs&lt;/a&gt;, and their effects on image quality in response to lossy compression. Most remarkably, their analysis revealed differences not only in the way these codecs &lt;em&gt;encode&lt;/em&gt; (compress) an image, but also in the &lt;em&gt;decoding&lt;/em&gt; phase. In other words: reading the same lossy JP2 produced different results depending on which implementation was used to decode it.&lt;/p&gt;

&lt;p&gt;A limitation of the paper’s methodology is that it obscures the individual effects of the encoding and decoding components, since both are essentially lumped together in the analysis. Thus, it’s not clear how much of the observed degradation in image quality is caused by the compression, and how much by the decoding. This made me wonder how similar the &lt;em&gt;decode&lt;/em&gt; results of different codecs really are.&lt;/p&gt;

&lt;!-- more --&gt;

&lt;h2 id=&quot;an-experiment&quot;&gt;An experiment&lt;/h2&gt;

&lt;p&gt;To find out, I ran a simple experiment:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Encode a TIFF image to JP2.&lt;/li&gt;
  &lt;li&gt;Decode the JP2 back to TIFF using different decoders.&lt;/li&gt;
  &lt;li&gt;Compare the decode results using some similarity measure.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2 id=&quot;codecs-used&quot;&gt;Codecs used&lt;/h2&gt;

&lt;p&gt;I used the following codecs:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;http://www.kakadusoftware.com/&quot;&gt;Kakadu&lt;/a&gt; v7.2.2 (kakadu)&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;http://www.openjpeg.org/&quot;&gt;OpenJPEG&lt;/a&gt; 2.0 (opj20)&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;http://www.imagemagick.org/&quot;&gt;ImageMagick&lt;/a&gt; 6.8.9-8 (im)&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;http://www.graphicsmagick.org/&quot;&gt;GraphicsMagick&lt;/a&gt; 1.3.18 (gm)&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;http://www.irfanview.com/&quot;&gt;IrfanView&lt;/a&gt; 4.35 with JPEG2000 plugin 4.33 (irfan)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Note that GraphicsMagick still uses the &lt;a href=&quot;http://www.ece.uvic.ca/~frodo/jasper/&quot;&gt;JasPer&lt;/a&gt; library for JPEG 2000. ImageMagick now uses OpenJPEG (older versions used JasPer). IrfanView’s JPEG 2000 plugin is made by &lt;a href=&quot;http://www.luratech.com/en/products/luratech-jp2-irfanview-plug-in/&quot;&gt;LuraTech&lt;/a&gt;.&lt;/p&gt;

&lt;h2 id=&quot;creating-the-jp2&quot;&gt;Creating the JP2&lt;/h2&gt;

&lt;p&gt;First I compressed my source TIFF (a grayscale newspaper page) to a lossy JP2 with a compression ratio of about 4:1. For this example I used OpenJPEG, with the following command line:&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;opj_compress &lt;span class=&quot;nt&quot;&gt;-i&lt;/span&gt; krant.tif &lt;span class=&quot;nt&quot;&gt;-o&lt;/span&gt; krant_oj_4.jp2 &lt;span class=&quot;nt&quot;&gt;-r&lt;/span&gt; 4 &lt;span class=&quot;nt&quot;&gt;-I&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-p&lt;/span&gt; RPCL &lt;span class=&quot;nt&quot;&gt;-n&lt;/span&gt; 7 &lt;span class=&quot;nt&quot;&gt;-c&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;[&lt;/span&gt;256,256],[256,256],[256,256],[256,256],[256,256],[256,256],[256,256] &lt;span class=&quot;nt&quot;&gt;-b&lt;/span&gt; 64,64
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h2 id=&quot;decoding-the-jp2&quot;&gt;Decoding the JP2&lt;/h2&gt;

&lt;p&gt;Next I decoded this image back to TIFF using the aforementioned codecs. I used the following command lines:&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th style=&quot;text-align: left&quot;&gt;Codec&lt;/th&gt;
      &lt;th style=&quot;text-align: left&quot;&gt;Command line&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;&lt;strong&gt;opj20&lt;/strong&gt;&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;opj_decompress -i krant_oj_4.jp2 -o krant_oj_4_oj.tif&lt;/code&gt;&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;&lt;strong&gt;kakadu&lt;/strong&gt;&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;kdu_expand -i krant_oj_4.jp2 -o krant_oj_4_kdu.tif&lt;/code&gt;&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;&lt;strong&gt;kakadu-precise&lt;/strong&gt;&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;kdu_expand -i krant_oj_4.jp2 -o krant_oj_4_kdu_precise.tif -precise&lt;/code&gt;&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;&lt;strong&gt;irfan&lt;/strong&gt;&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Used GUI&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;&lt;strong&gt;im&lt;/strong&gt;&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;convert krant_oj_4.jp2 krant_oj_4_im.tif&lt;/code&gt;&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;&lt;strong&gt;gm&lt;/strong&gt;&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;gm convert krant_oj_4.jp2 krant_oj_4_gm.tif&lt;/code&gt;&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;This resulted in 6 images. Note that I ran Kakadu twice: once using the default settings, and also with the &lt;em&gt;-precise&lt;/em&gt; switch, which “forces the use of 32-bit representations”.&lt;/p&gt;

&lt;h2 id=&quot;overall-image-quality&quot;&gt;Overall image quality&lt;/h2&gt;

&lt;p&gt;As a first analysis step I computed the overall peak signal to noise ratio (PSNR) for each decoded image, relative to the source TIFF:&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th style=&quot;text-align: left&quot;&gt;Decoder&lt;/th&gt;
      &lt;th style=&quot;text-align: left&quot;&gt;PSNR&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;&lt;strong&gt;opj20&lt;/strong&gt;&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;48.08&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;&lt;strong&gt;kakadu&lt;/strong&gt;&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;48.01&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;&lt;strong&gt;kakadu-precise&lt;/strong&gt;&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;48.08&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;&lt;strong&gt;irfan&lt;/strong&gt;&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;48.08&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;&lt;strong&gt;im&lt;/strong&gt;&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;48.08&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;&lt;strong&gt;gm&lt;/strong&gt;&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;48.07&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;So &lt;em&gt;relative to the source image&lt;/em&gt; these results are only marginally different.&lt;/p&gt;
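
&lt;p&gt;As a reminder of what this figure measures: PSNR follows from the mean squared error (MSE) between two images as 10·log10(MAX²/MSE), where MAX is the largest possible pixel value (255 for 8-bit data). The plain-Python sketch below only illustrates that standard definition; it is not the tool that produced the numbers above, and the function name and toy pixel lists are assumptions made for this example.&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;import math

# Sketch only: psnr() is an illustrative helper, not part of any tool named in this post.
# Assumes two equally sized greyscale images, given as flat sequences of 8-bit values.
def psnr(reference, test, max_value=255):
    if len(reference) != len(test) or len(reference) == 0:
        raise ValueError
    # Mean squared error over all pixel pairs
    mse = sum((a - b) ** 2 for a, b in zip(reference, test)) / len(reference)
    if mse == 0:
        # Identical images: PSNR is unbounded
        return math.inf
    # PSNR in dB: 10 * log10(MAX^2 / MSE)
    return 10 * math.log10(max_value ** 2 / mse)

# Toy example: one pixel out of four is off by one level
print(round(psnr([10, 20, 30, 40], [10, 20, 30, 41]), 2))
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;With one pixel in four off by a single level, the MSE is 0.25 and the PSNR comes out at roughly 54 dB, which gives some feel for the scale of the values in the table.&lt;/p&gt;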

&lt;h2 id=&quot;similarity-of-decoded-images&quot;&gt;Similarity of decoded images&lt;/h2&gt;

&lt;p&gt;But let’s have a closer look at how similar the different decoded images are. I did this by computing PSNR values of all possible decoder pairs. This produced the following matrix:&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th style=&quot;text-align: left&quot;&gt;Decoder&lt;/th&gt;
      &lt;th style=&quot;text-align: left&quot;&gt;opj20&lt;/th&gt;
      &lt;th style=&quot;text-align: left&quot;&gt;kakadu&lt;/th&gt;
      &lt;th style=&quot;text-align: left&quot;&gt;kakadu-precise&lt;/th&gt;
      &lt;th style=&quot;text-align: left&quot;&gt;irfan&lt;/th&gt;
      &lt;th&gt;im&lt;/th&gt;
      &lt;th&gt;gm&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;&lt;strong&gt;opj20&lt;/strong&gt;&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;-&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;57.52&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;78.53&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;79.17&lt;/td&gt;
      &lt;td&gt;96.35&lt;/td&gt;
      &lt;td&gt;64.43&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;&lt;strong&gt;kakadu&lt;/strong&gt;&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;57.52&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;-&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;57.51&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;57.52&lt;/td&gt;
      &lt;td&gt;57.52&lt;/td&gt;
      &lt;td&gt;57.23&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;&lt;strong&gt;kakadu-precise&lt;/strong&gt;&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;78.53&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;57.51&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;-&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;79.00&lt;/td&gt;
      &lt;td&gt;78.53&lt;/td&gt;
      &lt;td&gt;64.52&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;&lt;strong&gt;irfan&lt;/strong&gt;&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;79.17&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;57.52&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;79.00&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;-&lt;/td&gt;
      &lt;td&gt;79.18&lt;/td&gt;
      &lt;td&gt;64.44&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;&lt;strong&gt;im&lt;/strong&gt;&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;96.35&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;57.52&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;78.53&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;79.18&lt;/td&gt;
      &lt;td&gt;-&lt;/td&gt;
      &lt;td&gt;64.43&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;&lt;strong&gt;gm&lt;/strong&gt;&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;64.43&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;57.23&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;64.52&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;64.44&lt;/td&gt;
      &lt;td&gt;64.43&lt;/td&gt;
      &lt;td&gt;-&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;Note that, unlike the table in the previous section, these PSNR values are only a measure of the &lt;em&gt;similarity&lt;/em&gt; between the different decoder results. They don’t directly say anything about &lt;em&gt;quality&lt;/em&gt; (since we’re not comparing against the source image). Interestingly, the PSNR values in the matrix show two clear groups:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Group A&lt;/strong&gt;: all combinations of OpenJPEG, IrfanView, ImageMagick and Kakadu in &lt;em&gt;precise&lt;/em&gt; mode, all with a PSNR of &amp;gt; 78 dB.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Group B&lt;/strong&gt;: all remaining decoder combinations, with a PSNR of &amp;lt; 65 dB.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;What this means is that OpenJPEG, IrfanView, ImageMagick and Kakadu in &lt;em&gt;precise&lt;/em&gt; mode all decode the image in a similar way, whereas Kakadu (default mode) and GraphicsMagick behave differently. Another way of looking at this is to count the pixels that have different values for each combination. This yields up to 2 % different pixels for all combinations in group &lt;em&gt;A&lt;/em&gt;, and about 12 % in group &lt;em&gt;B&lt;/em&gt;. Finally, we can look at the peak absolute error value (PAE) of each combination, which is the maximum value difference for any pixel in the image. This figure was 1 pixel level (0.4 % of the full range) in both groups.&lt;/p&gt;
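
&lt;p&gt;Both measures are simple to compute once the decoded images are available as pixel arrays. A minimal plain-Python sketch, assuming equally sized 8-bit greyscale images given as flat lists (the helper names and the toy data are made up for this illustration, and this is not necessarily how the figures above were produced):&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# Sketch only: helper names are illustrative and not taken from any tool named in this post.
# Both images are assumed to be equally sized, flat sequences of 8-bit values.
def differing_fraction(image_a, image_b):
    # Share of pixel positions whose values are not equal
    if len(image_a) != len(image_b) or len(image_a) == 0:
        raise ValueError
    changed = sum(1 for a, b in zip(image_a, image_b) if a != b)
    return changed / len(image_a)

def peak_absolute_error(image_a, image_b):
    # Largest absolute difference found at any pixel position
    if len(image_a) != len(image_b) or len(image_a) == 0:
        raise ValueError
    return max(abs(a - b) for a, b in zip(image_a, image_b))

a = [10, 20, 30, 40, 50, 60, 70, 80]
b = [10, 21, 30, 40, 50, 60, 70, 80]
print(differing_fraction(a, b))          # 1 of 8 pixels differs: 0.125
pae = peak_absolute_error(a, b)
print(pae, round(100 * pae / 255, 1))    # 1 level, 0.4 % of the 0-255 range
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;A PAE of 1 on a 0-255 scale corresponds to 1/255, or about 0.4 % of the full range, which is the figure quoted above.&lt;/p&gt;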

&lt;p&gt;I also repeated the above procedure for a small RGB image. In this case I used Kakadu as the encoder. The decoding results of that experiment showed the same overall pattern, although the differences between groups &lt;em&gt;A&lt;/em&gt; and &lt;em&gt;B&lt;/em&gt; were even more pronounced, with PAE values in group &lt;em&gt;B&lt;/em&gt; reaching up to 3 pixel values (1.2 % of full range) for some decoder combinations.&lt;/p&gt;

&lt;h2 id=&quot;what-does-this-say-about-decoding-quality&quot;&gt;What does this say about decoding quality?&lt;/h2&gt;

&lt;p&gt;It would be tempting to conclude from this that the codecs that make up group &lt;em&gt;A&lt;/em&gt; provide better quality decoding than the others (GraphicsMagick, Kakadu in default mode). If this were true, one would expect that the overall PSNR values &lt;em&gt;relative to the source TIFF&lt;/em&gt; (see previous table) would be higher for those codecs. But the values in the table are only marginally different. Also, in the test on the small RGB image, running Kakadu in &lt;em&gt;precise&lt;/em&gt; mode &lt;em&gt;lowered&lt;/em&gt; the overall PSNR value (although by a tiny amount). Such small effects could be due to chance, and for a conclusive answer one would need to repeat the experiment for a large number of images, and test the PSNR differences for statistical significance (as was done in the BL analysis).&lt;/p&gt;

&lt;p&gt;I’m still somewhat surprised that even in group &lt;em&gt;A&lt;/em&gt; the decoding results aren’t &lt;em&gt;identical&lt;/em&gt;, but I suspect this has something to do with small rounding errors that arise during the decode process (maybe someone with a better understanding of the mathematical intricacies of JPEG 2000 decoding can comment on this). Overall, these results suggest that the errors that are introduced by the decode step are very small when compared against the encode errors.&lt;/p&gt;

&lt;h2 id=&quot;conclusions&quot;&gt;Conclusions&lt;/h2&gt;

&lt;p&gt;OpenJPEG, (recent versions of) ImageMagick, IrfanView and Kakadu in &lt;em&gt;precise&lt;/em&gt; mode all produce similar results when decoding lossily compressed JP2s, whereas Kakadu in default mode and GraphicsMagick (which uses the JasPer library) behave differently. These differences are very small when compared to the errors that are introduced by the encoding step, but for critical decode applications (migrate lossy JP2 to something else) they may still be significant. As both ImageMagick and GraphicsMagick are often used for calculating image (quality) statistics, the observed differences also affect the outcome of such analyses: calculating PSNR for a JP2 with ImageMagick and GraphicsMagick results in two different outcomes!&lt;/p&gt;

&lt;p&gt;For &lt;em&gt;losslessly&lt;/em&gt; compressed JP2s, the decode results for all tested codecs are 100% identical&lt;sup id=&quot;fnref:2&quot;&gt;&lt;a href=&quot;#fn:2&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;1&lt;/a&gt;&lt;/sup&gt;.&lt;/p&gt;

&lt;p&gt;This tentative analysis does not support any conclusions on which decoders are ‘better’. That would need additional tests with more images. I don’t have time for that myself, but I’d be happy to see others have a go at this!&lt;/p&gt;

&lt;h2 id=&quot;link&quot;&gt;Link&lt;/h2&gt;

&lt;p&gt;&lt;a href=&quot;http://www.scape-project.eu/publication/palmer-ipres2013&quot;&gt;William Palmer, Peter May and Peter Cliff: An Analysis of Contemporary JPEG2000 Codecs for Image Format Migration (Proceedings, iPres 2013)&lt;/a&gt;&lt;/p&gt;

&lt;hr /&gt;

&lt;p&gt;Originally published at the &lt;a href=&quot;https://openpreservation.org/blog/2014/09/26/six-ways-decode-lossy-jp2/&quot;&gt;Open Preservation Foundation blog&lt;/a&gt;&lt;/p&gt;
&lt;div class=&quot;footnotes&quot; role=&quot;doc-endnotes&quot;&gt;
  &lt;ol&gt;
    &lt;li id=&quot;fn:2&quot;&gt;
      &lt;p&gt;Identical in terms of pixel values; for this analysis I didn’t look at things such as embedded ICC profiles, &lt;a href=&quot;http://wiki.opf-labs.org/display/TR/Handling+of+ICC+profiles&quot;&gt;which not all encoders/decoders handle well&lt;/a&gt;. &lt;a href=&quot;#fnref:2&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
  &lt;/ol&gt;
&lt;/div&gt;
</description>
                <link>https://bitsgalore.org/2014/09/26/six-ways-decode-lossy-jp2</link>
                <guid>https://bitsgalore.org/2014/09/26/six-ways-decode-lossy-jp2</guid>
                <pubDate>2014-09-26T00:00:00+02:00</pubDate>
        </item>

        <item>
                <title>Jpylyzer software finalist for digital preservation award</title>
                <description>&lt;p&gt;Today the British &lt;a href=&quot;http://www.dpconline.org/about&quot;&gt;&lt;em&gt;Digital Preservation
Coalition&lt;/em&gt;&lt;/a&gt; announced the finalists that are in
the running for the &lt;a href=&quot;http://www.dpconline.org/advocacy/awards/digital-preservation-awards-2014&quot;&gt;&lt;em&gt;Digital Preservation Awards
2014&lt;/em&gt;&lt;/a&gt;.
This award was established in 2004 to draw attention to
initiatives that make an important contribution to keeping
digital heritage accessible.&lt;/p&gt;

&lt;!-- more --&gt;

&lt;p&gt;In the &lt;a href=&quot;http://www.dpconline.org/newsroom/latest-news/1272-dpa-2014-award-researchandinnovation-finalists&quot;&gt;&lt;em&gt;Research and
Innovation&lt;/em&gt;&lt;/a&gt;
category, a software tool developed at the KB by the Research department
has been nominated: &lt;a href=&quot;http://jpylyzer.openpreservation.org/&quot;&gt;&lt;em&gt;jpylyzer&lt;/em&gt;&lt;/a&gt;.
&lt;em&gt;Jpylyzer&lt;/em&gt; offers a simple way to check whether &lt;em&gt;JP2&lt;/em&gt; (JPEG
2000) image files are technically sound. Within the KB the tool
is used, among other things, for the quality control of digitised
books, newspapers and magazines. &lt;em&gt;Jpylyzer&lt;/em&gt; is also used by
various international peer institutions.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Jpylyzer&lt;/em&gt; was partly developed within the European project
&lt;a href=&quot;http://www.scape-project.eu/&quot;&gt;&lt;em&gt;SCAPE&lt;/em&gt;&lt;/a&gt;, in which the KB is a project partner.
The eventual winners will be announced on 17 November.&lt;/p&gt;

&lt;p&gt;More information about the nomination of &lt;em&gt;jpylyzer&lt;/em&gt; can be found on the
website of the &lt;em&gt;Digital Preservation Coalition&lt;/em&gt;:&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://www.dpconline.org/newsroom/latest-news/1271-dpa-2014finalists&quot;&gt;http://www.dpconline.org/newsroom/latest-news/1271-dpa-2014finalists&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The following article (in Dutch) is of interest to anyone who wants to know more about
&lt;em&gt;jpylyzer&lt;/em&gt;, and why we actually need such a tool:&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://www.kb.nl/organisatie/onderzoek-expertise/onderzoek-digitalisering-en-digitale-duurzaamheid/afgesloten-projecten/jpylyzer-jp2-validator-and-extractor&quot;&gt;Miljoenen digitale bestanden controleer je niet even met de hand&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Finally, here is the &lt;em&gt;jpylyzer&lt;/em&gt; homepage:&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://jpylyzer.openpreservation.org/&quot;&gt;http://jpylyzer.openpreservation.org/&lt;/a&gt;&lt;/p&gt;

&lt;hr /&gt;

&lt;p&gt;Originally published at the &lt;a href=&quot;http://blog.kbresearch.nl/2014/09/11/jpylyzer-software-finalist-voor-digitale-duurzaamheidsprijs/&quot;&gt;KB Research blog&lt;/a&gt;&lt;/p&gt;
</description>
                <link>https://bitsgalore.org/2014/09/11/jpylyzer-software-finalist-voor-digitale-duurzaamheidsprijs</link>
                <guid>https://bitsgalore.org/2014/09/11/jpylyzer-software-finalist-voor-digitale-duurzaamheidsprijs</guid>
                <pubDate>2014-09-11T00:00:00+02:00</pubDate>
        </item>

        <item>
                <title>When (not) to migrate a PDF to PDF/A</title>
                <description>&lt;p&gt;It is well-known that PDF documents can contain features that are preservation risks (e.g. see &lt;a href=&quot;https://web.archive.org/web/20130515073645/http://libraries.stackexchange.com/questions/964/what-preservation-risks-are-associated-with-the-pdf-file-format&quot;&gt;here&lt;/a&gt; and &lt;a href=&quot;http://wiki.opf-labs.org/display/TR/Portable+Document+Format&quot;&gt;here&lt;/a&gt;). Migration of existing &lt;em&gt;PDF&lt;/em&gt;s to &lt;em&gt;PDF/A&lt;/em&gt; is sometimes advocated as a strategy for mitigating these risks. However, the benefits of this approach are often questionable, and the migration process can also be quite risky in itself. As I often get questions on this subject, I thought it might be worthwhile to do a short write-up on this.&lt;/p&gt;

&lt;!-- more --&gt;

&lt;h2 id=&quot;pdfa-is-a-profile&quot;&gt;&lt;em&gt;PDF/A&lt;/em&gt; is a profile&lt;/h2&gt;

&lt;p&gt;First, it’s important to stress that each of the &lt;em&gt;PDF/A&lt;/em&gt; standards (&lt;em&gt;A-1&lt;/em&gt;, &lt;em&gt;A-2&lt;/em&gt; and &lt;em&gt;A-3&lt;/em&gt;) is really just a &lt;em&gt;profile&lt;/em&gt; within the &lt;em&gt;PDF&lt;/em&gt; format. More specifically, &lt;em&gt;PDF/A-1&lt;/em&gt; offers a subset of &lt;a href=&quot;http://acroeng.adobe.com/PDFReference/PDF_1.4/PDF%20Reference%201.4.pdf&quot;&gt;&lt;em&gt;PDF 1.4&lt;/em&gt;&lt;/a&gt;, whereas &lt;em&gt;PDF/A-2&lt;/em&gt; and &lt;em&gt;PDF/A-3&lt;/em&gt; are based on &lt;a href=&quot;http://acroeng.adobe.com/PDFReference/ISO32000/PDF32000-Adobe.pdf&quot;&gt;the ISO 32000 version of &lt;em&gt;PDF 1.7&lt;/em&gt;&lt;/a&gt;. What these profiles have in common is that they prohibit some features (e.g. multimedia, encryption, interactive content) that are allowed in ‘regular’ &lt;em&gt;PDF&lt;/em&gt;. Also, they narrow down the way other features are implemented, for example by requiring that all fonts are embedded in the document. This can be illustrated with the simple Venn diagram below, which shows the feature sets of the aforementioned &lt;em&gt;PDF&lt;/em&gt; flavours:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/2014/08/pdfVenn.png&quot; alt=&quot;PDF Venn diagram&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Here we see how &lt;em&gt;PDF/A-1&lt;/em&gt; is a subset of &lt;em&gt;PDF 1.4&lt;/em&gt;, which in turn is a subset of &lt;em&gt;PDF 1.7&lt;/em&gt;. &lt;em&gt;PDF/A-2&lt;/em&gt; and &lt;em&gt;PDF/A-3&lt;/em&gt; (aggregated here as one entity for the sake of readability) are subsets of &lt;em&gt;PDF 1.7&lt;/em&gt;, and include all the features of &lt;em&gt;PDF/A-1&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Keeping this in mind, it’s easy to see that migrating an arbitrary &lt;em&gt;PDF&lt;/em&gt; to &lt;em&gt;PDF/A&lt;/em&gt; can result in problems.&lt;/p&gt;

&lt;h2 id=&quot;loss-alteration-during-migration&quot;&gt;Loss, alteration during migration&lt;/h2&gt;

&lt;p&gt;Suppose, as an example, that we have a &lt;em&gt;PDF&lt;/em&gt; that contains a movie. This is prohibited in &lt;em&gt;PDF/A&lt;/em&gt;, so migrating to &lt;em&gt;PDF/A&lt;/em&gt; will simply result in the loss of the multimedia content. Fonts are another example: all fonts in a &lt;em&gt;PDF/A&lt;/em&gt; document must be embedded. But what happens if the source &lt;em&gt;PDF&lt;/em&gt; uses non-embedded fonts that are not available on the machine on which the migration is run? Will the migration tool exit with a warning, or will it silently use some alternative, perhaps similar font? And how do you check for this?&lt;/p&gt;

&lt;h2 id=&quot;complexity-and-effect-of-errors&quot;&gt;Complexity and effect of errors&lt;/h2&gt;

&lt;p&gt;Also, migrations like these typically involve a complete re-processing of the &lt;em&gt;PDF&lt;/em&gt;’s internal structure. The format’s complexity implies that there’s a lot of potential for things to go wrong in this process. This is particularly true if the source &lt;em&gt;PDF&lt;/em&gt; contains subtle errors, in which case the risk of losing information is very real (even though the original document may be perfectly readable in a viewer). Since we don’t really have any tools for detecting such errors (i.e. a &lt;a href=&quot;http://duff-johnson.com/wp-content/uploads/2014/01/PDFValidationDreamOrYawn.pdf&quot;&gt;sufficiently reliable &lt;em&gt;PDF&lt;/em&gt; validator&lt;/a&gt;), these cases can be difficult to deal with. Some further considerations can be found &lt;a href=&quot;http://web.archive.org/web/20130605142355/http://libraries.stackexchange.com/questions/1117/converting-invalid-pdfs-or-not-for-digital-preservation&quot;&gt;here&lt;/a&gt; (the context there is slightly different, but the risks are similar).&lt;/p&gt;

&lt;h2 id=&quot;digitised-vs-born-digital&quot;&gt;Digitised vs born-digital&lt;/h2&gt;

&lt;p&gt;The origin of the source &lt;em&gt;PDF&lt;/em&gt;s may be another thing to take into account. If &lt;em&gt;PDF&lt;/em&gt;s were originally created as part of a digitisation project (e.g. scanned books), the &lt;em&gt;PDF&lt;/em&gt; is usually little more than a wrapper around a bunch of images, perhaps augmented by an OCR layer. Migrating such &lt;em&gt;PDF&lt;/em&gt;s to &lt;em&gt;PDF/A&lt;/em&gt; is pretty straightforward, since the source files are unlikely to contain any features that are not allowed in &lt;em&gt;PDF/A&lt;/em&gt;. At the same time, this also means that the benefits of migrating such files to &lt;em&gt;PDF/A&lt;/em&gt; are pretty limited, since the source &lt;em&gt;PDF&lt;/em&gt;s weren’t problematic to begin with!&lt;/p&gt;

&lt;p&gt;The potential benefits of &lt;em&gt;PDF/A&lt;/em&gt; may be more obvious for a lot of born-digital content; however, for the reasons listed in the previous section, the migration is more complex, and there’s just a lot more that can go wrong (see also &lt;a href=&quot;http://qanda.digipres.org/19/what-are-the-benefits-and-risks-of-using-the-pdf-a-file-format?show=21#a21&quot;&gt;here&lt;/a&gt; for some additional considerations).&lt;/p&gt;

&lt;h2 id=&quot;conclusions&quot;&gt;Conclusions&lt;/h2&gt;

&lt;p&gt;Although migrating &lt;em&gt;PDF&lt;/em&gt; documents to &lt;em&gt;PDF/A&lt;/em&gt; may look superficially attractive, it is actually quite risky in practice, and it may easily result in unintentional data loss. Moreover, the risks increase with the number of preservation-unfriendly features, meaning that the migration is most likely to be successful for source &lt;em&gt;PDF&lt;/em&gt;s that weren’t problematic to begin with, which defeats the very purpose of migrating to &lt;em&gt;PDF/A&lt;/em&gt;. For specific cases, migration to &lt;em&gt;PDF/A&lt;/em&gt; may still be a sensible approach, but the expected benefits should be weighed carefully against the risks. In the absence of stable, generally accepted tools for assessing the quality of &lt;em&gt;PDF&lt;/em&gt;s (both source &lt;em&gt;and&lt;/em&gt; destination!), it would also seem prudent to always keep the originals.&lt;/p&gt;

&lt;hr /&gt;

&lt;p&gt;Originally published at the &lt;a href=&quot;https://openpreservation.org/blog/2014/08/27/when-not-migrate-pdf-pdfa/&quot;&gt;Open Preservation Foundation blog&lt;/a&gt;&lt;/p&gt;
</description>
                <link>https://bitsgalore.org/2014/08/27/when-not-migrate-pdf-pdfa</link>
                <guid>https://bitsgalore.org/2014/08/27/when-not-migrate-pdf-pdfa</guid>
                <pubDate>2014-08-27T00:00:00+02:00</pubDate>
        </item>

        <item>
                <title>How to save a web page to the Internet Archive</title>
                <description>&lt;p&gt;This short tutorial shows how to take a snapshot of a web page, and save it to the Internet Archive’s &lt;a href=&quot;http://en.wikipedia.org/wiki/Wayback_Machine&quot;&gt;Wayback Machine&lt;/a&gt;.&lt;/p&gt;

&lt;!-- more --&gt;

&lt;h2 id=&quot;method-1-web-interface&quot;&gt;Method 1: web interface&lt;/h2&gt;

&lt;ol&gt;
  &lt;li&gt;Go to the Wayback website: &lt;a href=&quot;https://archive.org/web/&quot;&gt;https://archive.org/web/&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;Paste the URL of the page you want to archive into the &lt;em&gt;Save Page Now&lt;/em&gt; box (at the bottom-right).&lt;/li&gt;
  &lt;li&gt;Click on the &lt;em&gt;Save Page&lt;/em&gt; button (or press &lt;em&gt;enter&lt;/em&gt;).&lt;/li&gt;
  &lt;li&gt;Wait while the page is being crawled. Once the archiving process is complete, the URL of the archived page appears.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2 id=&quot;method-2-bookmarklet&quot;&gt;Method 2: bookmarklet&lt;/h2&gt;

&lt;p&gt;This method is faster than using the web interface, but you will first need to install a &lt;a href=&quot;http://en.wikipedia.org/wiki/Bookmarklet&quot;&gt;bookmarklet&lt;/a&gt; (which is just a browser bookmark that contains some JavaScript).&lt;/p&gt;

&lt;h3 id=&quot;installation&quot;&gt;Installation&lt;/h3&gt;

&lt;ol&gt;
  &lt;li&gt;
    &lt;p&gt;Go to the &lt;em&gt;Save Page to Wayback Machine Bookmarklet&lt;/em&gt; link here:
 &lt;a href=&quot;http://marklets.com/Save%20Page%20to%20Wayback%20Machine.aspx&quot;&gt;http://marklets.com/Save%20Page%20to%20Wayback%20Machine.aspx&lt;/a&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Click on the left-hand side of the URL bar, and drag it to the bookmarks toolbar of your browser. The figure below shows how this works in Firefox:&lt;/p&gt;

    &lt;p&gt;&lt;img src=&quot;/images/2014/08/wbmarkletInstall.png&quot; alt=&quot;Installation of bookmarklet&quot; /&gt;&lt;/p&gt;

    &lt;p&gt;Alternatively you can also use &lt;em&gt;Add Bookmark&lt;/em&gt; in the &lt;em&gt;Bookmarks&lt;/em&gt; menu.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ol&gt;

&lt;h3 id=&quot;using-the-bookmarklet&quot;&gt;Using the bookmarklet&lt;/h3&gt;

&lt;ol&gt;
  &lt;li&gt;Open the web page that you want to save in your browser.&lt;/li&gt;
  &lt;li&gt;Click on &lt;em&gt;Save Page to Wayback Machine&lt;/em&gt; in the bookmarks toolbar.&lt;/li&gt;
  &lt;li&gt;Wait while the page is being crawled. Once the archiving process is complete, the URL of the archived page appears.&lt;/li&gt;
&lt;/ol&gt;
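
&lt;p&gt;For scripted use, the same idea can be sketched in a few lines of &lt;em&gt;Python&lt;/em&gt;. This is a minimal sketch that assumes the Wayback Machine accepts save requests at &lt;em&gt;https://web.archive.org/save/&lt;/em&gt; followed by the page URL. The function name is made up for this example, and the function only builds the address: opening it (in a browser or from a script) is what would trigger the crawl.&lt;/p&gt;

```python
# Assumption: save requests are addressed as this prefix plus the page URL.
SAVE_PREFIX = "https://web.archive.org/save/"


def build_save_url(page_url):
    """Return the address that asks the Wayback Machine to save page_url."""
    page_url = page_url.strip()
    # Only absolute web addresses make sense here.
    if not page_url.startswith(("http://", "https://")):
        raise ValueError("expected an absolute http(s) URL: " + repr(page_url))
    return SAVE_PREFIX + page_url


print(build_save_url("http://example.com/some/page.html"))
```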

&lt;h2 id=&quot;method-3-chrome-extension&quot;&gt;Method 3: Chrome extension&lt;/h2&gt;

&lt;p&gt;If you’re using the Google Chrome browser, you may want to check out Jimmy Lin’s “Save a Page” extension. Once installed, it allows you to save a page by simply right-clicking on it. The extension can be found here:&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://github.com/lintool/chrome-archive-this-page&quot;&gt;https://github.com/lintool/chrome-archive-this-page&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Just follow the installation instructions on that page.&lt;/p&gt;

&lt;h2 id=&quot;limitations&quot;&gt;Limitations&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;Webmasters can use &lt;a href=&quot;http://en.wikipedia.org/wiki/Robots_exclusion_standard&quot;&gt;&lt;em&gt;robots.txt&lt;/em&gt;&lt;/a&gt; to prevent web crawlers from crawling/saving anything on their website.&lt;/li&gt;
  &lt;li&gt;If a webmaster decides to change the &lt;em&gt;robots.txt&lt;/em&gt; permissions at some point in the future, a saved page may be removed from the Wayback Machine. For details see: &lt;a href=&quot;https://archive.org/about/exclude.php&quot;&gt;https://archive.org/about/exclude.php&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;acknowledgement&quot;&gt;Acknowledgement&lt;/h2&gt;

&lt;p&gt;This tutorial partially draws from a &lt;a href=&quot;http://searchengineland.com/save-urls-wayback-machine-demand-191150&quot;&gt;blog post&lt;/a&gt; by Gary Price on &lt;a href=&quot;http://searchengineland.com/&quot;&gt;&lt;em&gt;Search Engine Land&lt;/em&gt;&lt;/a&gt;.&lt;/p&gt;
</description>
                <link>https://bitsgalore.org/2014/08/02/How-to-save-a-web-page-to-the-Internet-Archive</link>
                <guid>https://bitsgalore.org/2014/08/02/How-to-save-a-web-page-to-the-Internet-Archive</guid>
                <pubDate>2014-08-02T00:00:00+02:00</pubDate>
        </item>

        <item>
                <title>Why can't we have digital preservation tools that just work?</title>
                <description>&lt;p&gt;One of my first blogs here covered an &lt;a href=&quot;/2011/09/21/evaluation-identification-tools-first-results-scape&quot;&gt;evaluation of a number of format identification tools&lt;/a&gt;. One of the more surprising results of that work was that out of the five tools that were tested, no less than four of them (&lt;em&gt;FITS&lt;/em&gt;, &lt;em&gt;DROID&lt;/em&gt;, &lt;em&gt;Fido&lt;/em&gt; and &lt;em&gt;JHOVE2&lt;/em&gt;) failed to even &lt;em&gt;run&lt;/em&gt; when executed with their associated launcher script. In many cases the &lt;em&gt;Windows&lt;/em&gt; launcher scripts (batch files) only worked when executed from the installation folder. Apart from making things unnecessarily difficult for the user, this also completely flies in the face of all existing conventions on command-line interface design. Around the time of this work (summer 2011) I had been in contact with the developers of all the evaluated tools, and until last week I thought those issues were a thing of the past. Well, was I wrong!&lt;/p&gt;

&lt;!-- more --&gt;

&lt;h2 id=&quot;fits-08&quot;&gt;&lt;a href=&quot;http://projects.iq.harvard.edu/files/fits/files/fits-0.8.0.zip&quot;&gt;FITS 0.8&lt;/a&gt;&lt;/h2&gt;

&lt;p&gt;Fast-forward 2.5 years: this week I saw the announcement of the latest &lt;a href=&quot;http://projects.iq.harvard.edu/fits&quot;&gt;FITS&lt;/a&gt; release. This got me curious, also because of the recent work on this tool as part of the &lt;a href=&quot;http://www.openplanetsfoundation.org/blogs/2013-11-06-fits-blitz&quot;&gt;FITS Blitz&lt;/a&gt;. So I downloaded &lt;a href=&quot;http://projects.iq.harvard.edu/files/fits/files/fits-0.8.0.zip&quot;&gt;FITS 0.8&lt;/a&gt;, installed it in a directory called &lt;em&gt;c:\fits\&lt;/em&gt; on my &lt;em&gt;Windows&lt;/em&gt; PC, and then typed (from the directory &lt;em&gt;f:\myData\&lt;/em&gt;):&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;f:&lt;span class=&quot;se&quot;&gt;\m&lt;/span&gt;yData&amp;gt;c:&lt;span class=&quot;se&quot;&gt;\f&lt;/span&gt;its&lt;span class=&quot;se&quot;&gt;\f&lt;/span&gt;its
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;Instead of the expected help message I ended up with this:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-data&quot;&gt;The system cannot find the path specified.
Error: Could not find or load main class edu.harvard.hul.ois.fits.Fits
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Hang on, I’ve seen this before … don’t tell me this is the same bug that I already reported 2.5 years ago? Well, it turns out &lt;a href=&quot;https://github.com/harvard-lts/fits/issues/10&quot;&gt;it is&lt;/a&gt; after all!&lt;/p&gt;
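
&lt;p&gt;A launcher that only works from the installation folder usually resolves its paths against the current working directory. I haven’t traced the batch file here line by line, so take the following as a generic illustration rather than a fix for &lt;em&gt;FITS&lt;/em&gt;: a minimal &lt;em&gt;Python&lt;/em&gt; sketch of a launcher that derives every path from its own location instead. The main class name and the &lt;em&gt;lib&lt;/em&gt; and &lt;em&gt;xml&lt;/em&gt; subdirectories are hypothetical.&lt;/p&gt;

```python
import os


def build_java_command(launcher_file, args, main_class="org.example.Tool"):
    """Build a java command whose classpath does not depend on the working directory."""
    # Directory that holds the launcher itself, wherever the user happens to be.
    install_dir = os.path.dirname(os.path.abspath(launcher_file))
    # Hypothetical layout: jars in "lib", configuration files in "xml".
    classpath = os.pathsep.join([
        os.path.join(install_dir, "lib", "*"),
        os.path.join(install_dir, "xml"),
    ])
    return ["java", "-classpath", classpath, main_class] + list(args)
```

&lt;p&gt;A real launcher would call this with its own &lt;em&gt;__file__&lt;/em&gt; and the command-line arguments, and hand the resulting list to &lt;em&gt;subprocess.call&lt;/em&gt;.&lt;/p&gt;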

&lt;p&gt;This got me curious about the status of the other tools that had similar problems in 2011, so I started downloading the latest versions of &lt;a href=&quot;http://www.nationalarchives.gov.uk/information-management/our-services/dc-file-profiling-tool.htm&quot;&gt;&lt;em&gt;DROID&lt;/em&gt;&lt;/a&gt;, &lt;a href=&quot;https://bitbucket.org/jhove2/main/wiki/Home&quot;&gt;&lt;em&gt;JHOVE2&lt;/em&gt;&lt;/a&gt; and &lt;a href=&quot;https://github.com/openplanets/fido&quot;&gt;&lt;em&gt;Fido&lt;/em&gt;&lt;/a&gt;. As I was on a roll anyway, I gave &lt;a href=&quot;http://jhove.sourceforge.net/&quot;&gt;JHOVE&lt;/a&gt; a try as well (even though it was not part of the 2011 evaluation). The objective of the test was simply to &lt;em&gt;run&lt;/em&gt; each tool and get some screen output (e.g. a help message), nothing more. I did these tests on a PC running &lt;em&gt;Windows&lt;/em&gt; 7 with &lt;em&gt;Java&lt;/em&gt; version 1.7.0_25. Here are the results.&lt;/p&gt;

&lt;h2 id=&quot;droid-613&quot;&gt;&lt;a href=&quot;http://www.nationalarchives.gov.uk/documents/information-management/droid-binary-6.1.3-bin.zip&quot;&gt;DROID 6.1.3&lt;/a&gt;&lt;/h2&gt;

&lt;p&gt;First I installed &lt;em&gt;DROID&lt;/em&gt; in a directory &lt;em&gt;C:\droid\&lt;/em&gt;. Then I executed it using:&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;f:&lt;span class=&quot;se&quot;&gt;\m&lt;/span&gt;yData&amp;gt;c:&lt;span class=&quot;se&quot;&gt;\d&lt;/span&gt;roid&lt;span class=&quot;se&quot;&gt;\d&lt;/span&gt;roid
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This started up a &lt;em&gt;Java Virtual Machine Launcher&lt;/em&gt; that showed this message box:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/2014/01/droidError.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;The &lt;em&gt;Running DROID&lt;/em&gt; text document that comes with &lt;em&gt;DROID&lt;/em&gt; says:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;To run DROID on Windows, use the “droid.bat” file.  You can either double-click on this file, or run it from the command-line console, by typing “droid” &lt;strong&gt;when you are in the droid installation folder&lt;/strong&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;So, no progress on this for &lt;em&gt;DROID&lt;/em&gt; either, then. I &lt;em&gt;was&lt;/em&gt; able to get &lt;em&gt;DROID&lt;/em&gt; running by circumventing the launcher script like this:&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;java &lt;span class=&quot;nt&quot;&gt;-jar&lt;/span&gt; c:&lt;span class=&quot;se&quot;&gt;\d&lt;/span&gt;roid&lt;span class=&quot;se&quot;&gt;\d&lt;/span&gt;roid-command-line-6.1.3.jar
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This resulted in the following output:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-data&quot;&gt;No command line options specified
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;This isn’t particularly helpful. There &lt;em&gt;is&lt;/em&gt; a help message, but to see it you have to give the &lt;em&gt;-h&lt;/em&gt; flag on the command line, and nothing tells you that this flag exists until you have seen the help message. Catch-22, anyone?&lt;/p&gt;

&lt;h2 id=&quot;jhove2-210&quot;&gt;&lt;a href=&quot;http://bitbucket.org/jhove2/main/downloads/jhove2-2.1.0.zip&quot;&gt;JHOVE2-2.1.0&lt;/a&gt;&lt;/h2&gt;

&lt;p&gt;After installing &lt;em&gt;JHOVE2&lt;/em&gt; in &lt;em&gt;c:\jhove2\&lt;/em&gt;, I typed:&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;f:&lt;span class=&quot;se&quot;&gt;\m&lt;/span&gt;yData&amp;gt;c:&lt;span class=&quot;se&quot;&gt;\j&lt;/span&gt;hove2&lt;span class=&quot;se&quot;&gt;\j&lt;/span&gt;hove2
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This gave me &lt;strong&gt;1393&lt;/strong&gt; (yes, you read that right: 1393!) &lt;em&gt;Java&lt;/em&gt; deprecation warnings, each along the lines of:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-data&quot;&gt;16:51:02,702 [main] WARN  TypeConverterDelegate : PropertyEditor [com.sun.beans.editors.EnumEditor]
found through deprecated global PropertyEditorManager fallback - consider using a more isolated
form of registration, e.g. on the BeanWrapper/BeanFactory!
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;This was eventually followed by the (expected) &lt;em&gt;JHOVE2&lt;/em&gt; help message, and a quick test on some actual files confirmed that &lt;em&gt;JHOVE2&lt;/em&gt; &lt;em&gt;does&lt;/em&gt; actually work. Nevertheless, by the time the tsunami of warning messages is over, many first-time users will have started running for the bunkers!&lt;/p&gt;

&lt;h2 id=&quot;fido-131&quot;&gt;&lt;a href=&quot;https://github.com/openplanets/fido/releases/tag/1.3.1-70&quot;&gt;Fido 1.3.1&lt;/a&gt;&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Fido&lt;/em&gt; doesn’t make use of any launcher scripts any more, and the default way to run it is to use the &lt;em&gt;Python&lt;/em&gt; script directly. After installing in &lt;em&gt;c:\fido\&lt;/em&gt; I typed:&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;f:&lt;span class=&quot;se&quot;&gt;\m&lt;/span&gt;yData&amp;gt;c:&lt;span class=&quot;se&quot;&gt;\f&lt;/span&gt;ido&lt;span class=&quot;se&quot;&gt;\f&lt;/span&gt;ido.py
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Which resulted in ….. (drum roll) … a nicely formatted &lt;em&gt;Fido&lt;/em&gt; help message, which is exactly what I was hoping for. Beautiful!&lt;/p&gt;

&lt;h2 id=&quot;jhove-111&quot;&gt;&lt;a href=&quot;http://sourceforge.net/projects/jhove/files/latest/download&quot;&gt;JHOVE 1.11&lt;/a&gt;&lt;/h2&gt;

&lt;p&gt;I installed &lt;em&gt;JHOVE&lt;/em&gt; in &lt;em&gt;c:\jhove\&lt;/em&gt; and then typed:&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;f:&lt;span class=&quot;se&quot;&gt;\m&lt;/span&gt;yData&amp;gt;c:&lt;span class=&quot;se&quot;&gt;\j&lt;/span&gt;hove&lt;span class=&quot;se&quot;&gt;\j&lt;/span&gt;hove 
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Which resulted in this:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-data&quot;&gt;Exception in thread &quot;main&quot; java.lang.NoClassDefFoundError: edu/harvard/hul/ois/j
hove/viewer/ConfigWindow
        at edu.harvard.hul.ois.jhove.DefaultConfigurationBuilder.writeDefaultCon
figFile(Unknown Source)
        at edu.harvard.hul.ois.jhove.JhoveBase.init(Unknown Source)
        at Jhove.main(Unknown Source)
Caused by: java.lang.ClassNotFoundException: edu.harvard.hul.ois.jhove.viewer.Co
nfigWindow
        at java.net.URLClassLoader$1.run(Unknown Source)
        at java.net.URLClassLoader$1.run(Unknown Source)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findClass(Unknown Source)
        at java.lang.ClassLoader.loadClass(Unknown Source)
        at sun.misc.Launcher$AppClassLoader.loadClass(Unknown Source)
        at java.lang.ClassLoader.loadClass(Unknown Source)
        ... 3 more
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Ouch!&lt;/p&gt;

&lt;h2 id=&quot;final-remarks&quot;&gt;Final remarks&lt;/h2&gt;

&lt;p&gt;I limited my tests to a &lt;em&gt;Windows&lt;/em&gt; environment only, and results may well be better under &lt;em&gt;Linux&lt;/em&gt; for some of these tools. Nevertheless, I find it nothing less than astounding that so many of these (often widely cited) preservation tools fail to even &lt;em&gt;execute&lt;/em&gt; on today’s &lt;a href=&quot;http://en.wikipedia.org/wiki/Usage_share_of_operating_systems&quot;&gt;most widespread operating system&lt;/a&gt;. Granted, in some cases there are workarounds, such as tweaking the launcher scripts, or circumventing them altogether. However, this is not an option for less tech-savvy users, who will simply conclude “&lt;em&gt;Hey, this tool doesn’t work&lt;/em&gt;”, give up, and move on to other things. Moreover, this means that much of the (often huge) development effort that went into these tools will simply fail to reach its potential audience, and I think this is a tremendous waste. I’m also wondering why there’s been so little progress on this over the past 2.5 years. Is it really that difficult to develop preservation tools with command-line interfaces that follow basic design conventions that have been ubiquitous elsewhere for more than 30 years? Tools that &lt;em&gt;just work&lt;/em&gt;?&lt;/p&gt;

&lt;hr /&gt;

&lt;p&gt;Originally published at the &lt;a href=&quot;https://openpreservation.org/blog/2014/01/31/why-cant-we-have-digital-preservation-tools-just-work/&quot;&gt;Open Preservation Foundation blog&lt;/a&gt;&lt;/p&gt;
</description>
                <link>https://bitsgalore.org/2014/01/31/why-cant-we-have-digital-preservation-tools-just-work</link>
                <guid>https://bitsgalore.org/2014/01/31/why-cant-we-have-digital-preservation-tools-just-work</guid>
                <pubDate>2014-01-31T00:00:00+01:00</pubDate>
        </item>

        <item>
                <title>Identification of PDF preservation risks&#58; analysis of Govdocs selected corpus</title>
                <description>&lt;p&gt;This blog follows up on three earlier posts about detecting  preservation risks in &lt;em&gt;PDF&lt;/em&gt; files. In  &lt;a href=&quot;/2012/12/19/identification-pdf-preservation-risks-apache-preflight-first-impression&quot;&gt;part 1&lt;/a&gt; I explored to what extent the &lt;a href=&quot;http://pdfbox.apache.org/cookbook/pdfavalidation.html&quot;&gt;&lt;em&gt;Preflight&lt;/em&gt;&lt;/a&gt; component of the &lt;a href=&quot;http://pdfbox.apache.org/&quot;&gt;&lt;em&gt;Apache PDFBox&lt;/em&gt;&lt;/a&gt; library can be used to detect specific preservation risks in &lt;em&gt;PDF&lt;/em&gt; documents. This was followed up by some work during the &lt;em&gt;SPRUCE&lt;/em&gt; Hackathon in Leeds, which is covered by &lt;a href=&quot;http://www.openplanetsfoundation.org/blogs/2013-03-15-pdf-eh-another-hackathon-tale&quot;&gt;this blog post by Peter Cliff&lt;/a&gt;. Then last summer I did a series of &lt;a href=&quot;/2013/07/25/identification-pdf-preservation-risks-sequel&quot;&gt;additional tests&lt;/a&gt; using files from the &lt;a href=&quot;http://acroeng.adobe.com/wp/&quot;&gt;&lt;em&gt;Adobe Acrobat Engineering&lt;/em&gt; website&lt;/a&gt;. The main outcome of this more recent work was that, although showing great promise, &lt;em&gt;Preflight&lt;/em&gt; was struggling with many more complex &lt;em&gt;PDF&lt;/em&gt;s. Fast-forward another six months and, thanks to the excellent response of the &lt;em&gt;Preflight&lt;/em&gt; developers to our bug reports, the most serious of these problems are now largely solved&lt;sup id=&quot;fnref:1&quot;&gt;&lt;a href=&quot;#fn:1&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;1&lt;/a&gt;&lt;/sup&gt;. So, time to move on to the next step!&lt;/p&gt;

&lt;!-- more --&gt;

&lt;h2 id=&quot;govdocs-selected&quot;&gt;Govdocs Selected&lt;/h2&gt;

&lt;p&gt;Ultimately, the aim of this work is to be able to profile large &lt;em&gt;PDF&lt;/em&gt; collections for specific preservation risks, or to verify that a &lt;em&gt;PDF&lt;/em&gt; conforms to an institute-specific policy before ingest. To get a better idea of how that might work in practice, I decided to do some tests with the &lt;a href=&quot;http://www.openplanetsfoundation.org/blogs/2012-07-26-1-million-21000-reducing-govdocs-significantly&quot;&gt;&lt;em&gt;Govdocs Selected&lt;/em&gt;&lt;/a&gt; dataset, which is a subset of the &lt;a href=&quot;http://digitalcorpora.org/corpora/files&quot;&gt;Govdocs1&lt;/a&gt; corpus. As a first step I ran the latest version of &lt;em&gt;Preflight&lt;/em&gt; on every &lt;em&gt;PDF&lt;/em&gt; in the corpus (about 15 thousand)&lt;sup id=&quot;fnref:2&quot;&gt;&lt;a href=&quot;#fn:2&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;2&lt;/a&gt;&lt;/sup&gt;.&lt;/p&gt;

&lt;h2 id=&quot;validation-errors&quot;&gt;Validation errors&lt;/h2&gt;

&lt;p&gt;As I was curious about the most common validation errors (or, more correctly, violations of the &lt;em&gt;PDF/A-1b&lt;/em&gt; profile), I ran a little post-processing script on the output files to calculate error occurrences. The following table lists the results. For each &lt;em&gt;Preflight&lt;/em&gt; error (which is represented as an error code), the table shows the number of &lt;em&gt;PDF&lt;/em&gt;s for which the error was reported (expressed as a percentage)&lt;sup id=&quot;fnref:3&quot;&gt;&lt;a href=&quot;#fn:3&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;3&lt;/a&gt;&lt;/sup&gt;.&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th style=&quot;text-align: left&quot;&gt;Error code&lt;/th&gt;
      &lt;th style=&quot;text-align: left&quot;&gt;% of PDFs with error&lt;/th&gt;
      &lt;th style=&quot;text-align: left&quot;&gt;Description (from &lt;a href=&quot;http://svn.apache.org/repos/asf/pdfbox/trunk/preflight/src/main/java/org/apache/pdfbox/preflight/PreflightConstants.java&quot;&gt;Preflight source code&lt;/a&gt;)&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;2.4.3&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;79.5&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;color space used in the PDF file but the DestOutputProfile is missing&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;7.1&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;52.5&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Invalid metadata found&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;2.4.1&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;39.1&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;RGB color space used in the PDF file but the DestOutputProfile isn’t RGB&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;1.2.1&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;38.8&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Error on the object delimiters (obj / endobj)&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;1.4.6&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;34.3&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;ID in 1st trailer and the last is different&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;1.2.5&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;32.1&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;The length of the stream dictionary and the stream length is inconsistent&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;7.11&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;31.9&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;PDF/A Identification Schema not found&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;3.1.2&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;31.6&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Some mandatory fields are missing from the FONT Descriptor Dictionary&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;3.1.3&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;29.4&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Error on the “Font File x” in the Font Descriptor &lt;em&gt;(ed.:font not embedded?)&lt;/em&gt;&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;3.1.1&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;27.2&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Some mandatory fields are missing from the FONT Dictionary&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;3.1.6&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;17.1&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Width array and Font program Width are inconsistent&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;5.2.2&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;13&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;The annotation uses a flag which is forbidden&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;2.4.2&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;12.8&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;CMYK color space used in the PDF file but the DestOutputProfile isn’t CMYK&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;1.2.2&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;12&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Error on the stream delimiters (stream / endstream)&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;1.2.12&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;9.5&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;The stream uses a filter which isn’t defined in the PDF Reference document&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;1.4.1&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;9.3&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;ID is missing from the trailer&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;3.1.11&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;8.4&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;The CIDSet entry i mandatory from a subset of composite font&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;1.1&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;8.3&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Header syntax error&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;1.2.7&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;7.5&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;The stream uses an invalid filter (The LZW)&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;3.1.5&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;7.3&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Encoding is inconsistent with the Font&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;2.3&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;6.7&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;A XObject has an unexpected key defined&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Exception&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;6.6&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Preflight raised an exception&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;3.1.9&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;6.1&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;The CIDToGID is invalid&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;3.1.4&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;5.7&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Charset declaration is missing in a Type 1 Subset&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;7.2&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;5&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Metadata mismatch between PDF Dictionnary and xmp&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;7.3&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;4.3&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Description schema required not embedded&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;2.3.2&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;4.2&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;A XObject has an unexpected value for a defined key&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;7.1.1&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;3.3&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Unknown metadata&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;3.3.1&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;3.1&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;a glyph is missing&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;1.4.8&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;2.6&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Optional content is forbidden&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;2.2.2&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;2.4&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;A XObject SMask value isn’t None&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;1.0.14&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;2.1&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;An object has an invalid offset&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;1.4.10&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;1.6&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Last %%EOF sequence is followed by data&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;2.2.1&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;1.6&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;A Group entry with S = Transparency is used or the S = Null&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;1&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;1.6&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Syntax error&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;5.2.3&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;1.5&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Annotation uses a Color profile which isn’t the same than the profile contained by the OutputIntent&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;1.0.6&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;1.2&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;The number is out of Range&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;5.3.1&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;1.1&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;The AP dictionary of the annotation contains forbidden/invalid entries (only the N entry is authorized)&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;6.2.5&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;1&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;An explicitly forbidden action is used in the PDF file&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;1.4.7&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;1&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;EmbeddedFile entry is present in the Names dictionary&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;This table does look a bit intimidating (but see &lt;a href=&quot;http://wiki.opf-labs.org/display/TR/Summary+of+Apache+Preflight+errors&quot;&gt;this summary of &lt;em&gt;Preflight&lt;/em&gt; errors&lt;/a&gt;); nevertheless it is useful to point out a couple of general observations:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Some errors are &lt;em&gt;really&lt;/em&gt; common; for instance, error &lt;em&gt;2.4.3&lt;/em&gt; is reported for nearly 80% of all &lt;em&gt;PDF&lt;/em&gt;s in the corpus!&lt;/li&gt;
  &lt;li&gt;Errors related to color spaces, metadata and fonts are particularly common.&lt;/li&gt;
  &lt;li&gt;File structure errors (1.x range) are reported quite a lot as well. Although I haven’t looked at this in any detail, I expect that for some files these errors truly reflect a deviation from the &lt;em&gt;PDF/A-1&lt;/em&gt; profile, whereas in other cases these files may simply not be valid &lt;em&gt;PDF&lt;/em&gt; (which would be more serious).&lt;/li&gt;
  &lt;li&gt;About 6.5% of all analysed files raised an exception in &lt;em&gt;Preflight&lt;/em&gt;, which could either mean that something is seriously wrong with them, or alternatively it may point to bugs in &lt;em&gt;Preflight&lt;/em&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;policy-based-assessment&quot;&gt;Policy-based assessment&lt;/h2&gt;

&lt;p&gt;Although it’s easy to get overwhelmed by the &lt;em&gt;Preflight&lt;/em&gt; output above, we should keep in mind here that the ultimate aim of this work is &lt;em&gt;not&lt;/em&gt; to validate against &lt;em&gt;PDF/A-1&lt;/em&gt;, but to assess arbitrary &lt;em&gt;PDF&lt;/em&gt;s against a pre-defined technical profile. This profile may reflect an institution’s low-level preservation policies on the requirements a &lt;em&gt;PDF&lt;/em&gt; must meet to be deemed suitable for long-term preservation. In &lt;em&gt;SCAPE&lt;/em&gt; such low-level policies are called &lt;em&gt;control policies&lt;/em&gt;, and you can find more information on them &lt;a href=&quot;http://www.openplanetsfoundation.org/blogs/2013-07-29-scape-creating-machine-understandable-policy-human-readable-policy&quot;&gt;here&lt;/a&gt; and &lt;a href=&quot;http://www.openplanetsfoundation.org/blogs/2013-09-04-control-policies-scape-project&quot;&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;To illustrate this, I’ll be using a hypothetical control policy for &lt;em&gt;PDF&lt;/em&gt; that is defined by the following objectives:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;File must not be encrypted or password protected&lt;/li&gt;
  &lt;li&gt;Fonts must be embedded and complete&lt;/li&gt;
  &lt;li&gt;File must not contain &lt;em&gt;JavaScript&lt;/em&gt;&lt;/li&gt;
  &lt;li&gt;File must not contain embedded files (i.e. file attachments)&lt;/li&gt;
  &lt;li&gt;File must not contain multimedia content (audio, video, 3-D objects)&lt;/li&gt;
  &lt;li&gt;File should be valid &lt;em&gt;PDF&lt;/em&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;em&gt;Preflight&lt;/em&gt;’s output contains all the information that is needed to establish whether each objective is met (except objective 6, which would need a &lt;a href=&quot;http://duff-johnson.com/2014/01/24/are-your-documents-readable-how-would-you-know/&quot;&gt;full-fledged &lt;em&gt;PDF&lt;/em&gt; validator&lt;/a&gt;). By translating the above objectives into a set of &lt;a href=&quot;http://en.wikipedia.org/wiki/Schematron&quot;&gt;&lt;em&gt;Schematron&lt;/em&gt;&lt;/a&gt; rules, it is pretty straightforward to assess each &lt;em&gt;PDF&lt;/em&gt; in our dataset against the control policy. If that sounds familiar: this is the same approach that we used earlier for &lt;a href=&quot;/2012/09/04/automated-assessment-jp2-against-technical-profile&quot;&gt;assessing &lt;em&gt;JP2&lt;/em&gt; images against a technical profile&lt;/a&gt;. A schema that represents our control policy can be found &lt;a href=&quot;https://github.com/openplanets/pdfPolicyValidate/blob/master/schemas/pdf_policy_preflight_test.sch&quot;&gt;here&lt;/a&gt;. Note that this is only a first attempt, and it may well need some further fine-tuning (more about that later).&lt;/p&gt;
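
&lt;p&gt;The logic of such a policy check can be sketched in a few lines of Python. This is only a minimal illustration, not the &lt;em&gt;Schematron&lt;/em&gt; schema used for the assessment: it works on a plain list of error codes rather than on &lt;em&gt;Preflight&lt;/em&gt;’s XML output, and the function names and the policy mapping are hypothetical. Code 1.4.7 is taken from the table above; treating every code that starts with "3." as a font error is an assumption.&lt;/p&gt;

```python
# Illustrative policy: each objective maps to Preflight error-code prefixes
# that count as a violation. Code 1.4.7 comes from the table above; treating
# every code that starts with "3." as a font error is an assumption.
POLICY = {
    "no embedded files": ["1.4.7"],
    "fonts embedded and complete": ["3."],
}


def failed_objectives(error_codes, policy=POLICY):
    """Return the objectives violated by the error codes reported for one file."""
    failed = []
    for objective, prefixes in policy.items():
        if any(code.startswith(prefix) for code in error_codes for prefix in prefixes):
            failed.append(objective)
    return failed


def assess(error_codes, policy=POLICY):
    """A file passes only if none of the policy objectives is violated."""
    return "Fail" if failed_objectives(error_codes, policy) else "Pass"


print(assess(["2.4.3"]))           # Pass: this code is not covered by the policy
print(assess(["2.4.3", "1.4.7"]))  # Fail: an embedded file was reported
```

&lt;p&gt;The point of the sketch is that an error which matters for &lt;em&gt;PDF/A-1&lt;/em&gt; validation (such as 2.4.3) does not cause a failure unless the policy explicitly lists it.&lt;/p&gt;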

&lt;h2 id=&quot;results-of-assessment&quot;&gt;Results of assessment&lt;/h2&gt;

&lt;p&gt;As a first step I validated all &lt;em&gt;Preflight&lt;/em&gt; output files against &lt;a href=&quot;https://github.com/openplanets/pdfPolicyValidate/blob/master/schemas/pdf_policy_preflight_test.sch&quot;&gt;this schema&lt;/a&gt;. The result is rather disappointing:&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th style=&quot;text-align: left&quot;&gt;Outcome&lt;/th&gt;
      &lt;th style=&quot;text-align: left&quot;&gt;Number of files&lt;/th&gt;
      &lt;th style=&quot;text-align: left&quot;&gt;%&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Pass&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;3973&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;26&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Fail&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;11120&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;74&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;So, only 26% of all &lt;em&gt;PDF&lt;/em&gt;s in &lt;em&gt;Govdocs Selected&lt;/em&gt; meet the requirements of our control policy! The figure below gives us some further clues as to why this is happening:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/2014/01/failedAssertions_small.png&quot; alt=&quot;Failed assertions graph&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Here each bar represents the occurrences of individual failed tests in our &lt;a href=&quot;https://github.com/openplanets/pdfPolicyValidate/blob/master/schemas/pdf_policy_preflight_test.sch&quot;&gt;schema&lt;/a&gt;.&lt;/p&gt;

&lt;h2 id=&quot;font-errors-galore&quot;&gt;Font errors galore&lt;/h2&gt;

&lt;p&gt;What is clear here is that the majority of failed tests are font-related. The &lt;em&gt;Schematron&lt;/em&gt; rules that I used for the assessment currently include &lt;em&gt;all&lt;/em&gt; font errors that are reported by &lt;em&gt;Preflight&lt;/em&gt;. Perhaps this is too strict on objective 2 (“&lt;em&gt;Fonts must be embedded and complete&lt;/em&gt;”). A particular difficulty here is that it is often hard to envisage the impact of particular font errors on the rendering process. On the other hand, the results are consistent with the outcome of a 2013 survey by the &lt;a href=&quot;http://www.pdfa.org/&quot;&gt;PDF Association&lt;/a&gt;, which showed that its members see fonts as the most challenging aspect of &lt;em&gt;PDF&lt;/em&gt;, both for processing and writing (source: &lt;a href=&quot;http://duff-johnson.com/wp-content/uploads/2014/01/PDFValidationDreamOrYawn.pdf&quot;&gt;this presentation&lt;/a&gt; by Duff Johnson). So, the assessment results may simply reflect that font problems are widespread&lt;sup id=&quot;fnref:4&quot;&gt;&lt;a href=&quot;#fn:4&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;4&lt;/a&gt;&lt;/sup&gt;. One should also keep in mind that &lt;em&gt;Govdocs selected&lt;/em&gt; was created by selecting on unique combinations of file properties from files in &lt;a href=&quot;http://digitalcorpora.org/corpora/files&quot;&gt;Govdocs1&lt;/a&gt;. As a result, one would expect this dataset to be more heterogeneous than most ‘typical’ &lt;em&gt;PDF&lt;/em&gt; collections, and this would also influence the results. For instance, the &lt;em&gt;Creating Program&lt;/em&gt; selection property could result in a relative over-representation of files that were produced by some crappy creation tool. Whether this is really the case could be easily tested by repeating this analysis for other collections.&lt;/p&gt;

&lt;h2 id=&quot;other-errors&quot;&gt;Other errors&lt;/h2&gt;

&lt;p&gt;Only a small number of &lt;em&gt;PDF&lt;/em&gt;s with encryption, &lt;em&gt;JavaScript&lt;/em&gt;, embedded files and multimedia content were detected. I should add here that the occurrence of &lt;em&gt;JavaScript&lt;/em&gt; is probably underestimated due to a &lt;a href=&quot;https://issues.apache.org/jira/browse/PDFBOX-1754&quot;&gt;pending &lt;em&gt;Preflight&lt;/em&gt; bug&lt;/a&gt;. A major limitation is that there are currently no reliable tools that are able to test overall conformity to &lt;em&gt;PDF&lt;/em&gt;. This problem (and a hint at a solution) is also the subject of a recent &lt;a href=&quot;http://duff-johnson.com/2014/01/24/are-your-documents-readable-how-would-you-know/&quot;&gt;blog post by Duff Johnson&lt;/a&gt;. In the current assessment I’ve taken the occurrence of &lt;em&gt;Preflight&lt;/em&gt; exceptions (and general processing errors) as an indicator for non-validity. This is a pretty crude approximation, because some of these exceptions may simply indicate a bug in &lt;em&gt;Preflight&lt;/em&gt; (rather than a faulty &lt;em&gt;PDF&lt;/em&gt;). One of the next steps will therefore be a more in-depth look at some of the &lt;em&gt;PDF&lt;/em&gt;s that caused an exception.&lt;/p&gt;

&lt;h2 id=&quot;conclusions&quot;&gt;Conclusions&lt;/h2&gt;

&lt;p&gt;These preliminary results show that policy-based assessment of &lt;em&gt;PDF&lt;/em&gt; is possible using a combination of &lt;em&gt;Apache Preflight&lt;/em&gt; and &lt;em&gt;Schematron&lt;/em&gt;. However, dealing with font issues appears to be a particular challenge. Also, the lack of reliable tools to test for overall conformity to &lt;em&gt;PDF&lt;/em&gt; (e.g. &lt;a href=&quot;http://acroeng.adobe.com/PDFReference/ISO32000/PDF32000-Adobe.pdf&quot;&gt;ISO 32000&lt;/a&gt;) is still a major limitation. Another limitation of this analysis is the lack of ground truth, which makes it difficult to assess the accuracy of the results.&lt;/p&gt;

&lt;h2 id=&quot;demo-script-and-data-downloads&quot;&gt;Demo script and data downloads&lt;/h2&gt;

&lt;p&gt;For those who want to have a go at the analyses that I’ve presented here, I’ve created a simple &lt;a href=&quot;https://github.com/openplanets/pdfPolicyValidate&quot;&gt;demo script here&lt;/a&gt;. The raw output data of the &lt;em&gt;Govdocs selected&lt;/em&gt; corpus can be found &lt;a href=&quot;https://github.com/openplanets/preflightGovdocsSelected&quot;&gt;here&lt;/a&gt;. This includes all &lt;em&gt;Preflight&lt;/em&gt; files, the &lt;em&gt;Schematron&lt;/em&gt; output and the error counts. A download link for the &lt;em&gt;Govdocs selected&lt;/em&gt; corpus can be found at the bottom of &lt;a href=&quot;http://www.openplanetsfoundation.org/blogs/2012-07-26-1-million-21000-reducing-govdocs-significantly&quot;&gt;this blog post&lt;/a&gt;.&lt;/p&gt;

&lt;h2 id=&quot;acknowledgements&quot;&gt;Acknowledgements&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Apache Preflight&lt;/em&gt; developers Eric Leleu, Andreas Lehmkühler and Guillaume Bailleul are thanked for their support and prompt response to my  questions and bug reports.&lt;/p&gt;

&lt;h2 id=&quot;related-blog-posts&quot;&gt;Related blog posts&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;/2012/12/19/identification-pdf-preservation-risks-apache-preflight-first-impression&quot;&gt;Identification of PDF preservation risks with Apache Preflight: a first impression&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;/2013/07/25/identification-pdf-preservation-risks-sequel&quot;&gt;Identification of PDF preservation risks: the sequel&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;http://duff-johnson.com/2014/01/24/are-your-documents-readable-how-would-you-know/&quot;&gt;Are your documents readable? How would you know? (Duff Johnson)&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;http://www.openplanetsfoundation.org/blogs/2012-07-26-1-million-21000-reducing-govdocs-significantly&quot;&gt;From 1 Million to 21,000: Reducing Govdocs Significantly (Dave Tarrant)&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;http://www.openplanetsfoundation.org/blogs/2013-07-29-scape-creating-machine-understandable-policy-human-readable-policy&quot;&gt;Creating machine understandable policy from human readable policy (Catherine Jones)&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;http://www.openplanetsfoundation.org/blogs/2013-09-04-control-policies-scape-project&quot;&gt;Control Policies in the SCAPE Project (Sean Bechhofer)&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;http://wiki.opf-labs.org/display/TR/Portable+Document+Format&quot;&gt;PDF on the  OPF File Format Risk Registry&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;hr /&gt;

&lt;p&gt;Originally published at the &lt;a href=&quot;https://openpreservation.org/blog/2014/01/27/identification-pdf-preservation-risks-analysis-govdocs-selected-corpus/&quot;&gt;Open Preservation Foundation blog&lt;/a&gt;&lt;/p&gt;
&lt;div class=&quot;footnotes&quot; role=&quot;doc-endnotes&quot;&gt;
  &lt;ol&gt;
    &lt;li id=&quot;fn:1&quot;&gt;
      &lt;p&gt;This was already suggested by &lt;a href=&quot;http://wiki.opf-labs.org/display/TR/Analysis+of+Acrobat+Engineering+PDFs+with+Acrobat+Preflight+and+Apache+Preflight&quot;&gt;this re-analysis of the &lt;em&gt;Acrobat Engineering&lt;/em&gt; files &lt;/a&gt; that I did in November. &lt;a href=&quot;#fnref:1&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:2&quot;&gt;
      &lt;p&gt;This selection was only based on file extension, which introduces the possibility that some of these files aren’t really &lt;em&gt;PDF&lt;/em&gt;s. &lt;a href=&quot;#fnref:2&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:3&quot;&gt;
      &lt;p&gt;Errors that were reported for less than 1% of all analysed &lt;em&gt;PDF&lt;/em&gt;s are not included in the table. &lt;a href=&quot;#fnref:3&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
    &lt;li id=&quot;fn:4&quot;&gt;
      &lt;p&gt;In addition to this, it seems that &lt;em&gt;Preflight&lt;/em&gt; &lt;a href=&quot;https://issues.apache.org/jira/browse/PDFBOX-1864&quot;&gt;sometimes fails to detect fonts that are not embedded&lt;/a&gt;, so the number of &lt;em&gt;PDF&lt;/em&gt;s with font issues may be even greater than this test suggests. &lt;a href=&quot;#fnref:4&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
  &lt;/ol&gt;
&lt;/div&gt;
</description>
                <link>https://bitsgalore.org/2014/01/27/identification-pdf-preservation-risks-analysis-govdocs-selected-corpus</link>
                <guid>https://bitsgalore.org/2014/01/27/identification-pdf-preservation-risks-analysis-govdocs-selected-corpus</guid>
                <pubDate>2014-01-27T00:00:00+01:00</pubDate>
        </item>

        <item>
                <title>Measuring Bigfoot</title>
                <description>&lt;p&gt;My previous blog post &lt;a href=&quot;/2013/09/30/assessing-file-format-risks-searching-bigfoot&quot;&gt;&lt;em&gt;Assessing file format risks: searching for Bigfoot?&lt;/em&gt;&lt;/a&gt; resulted in some interesting feedback from a number of people. There was a particularly elaborate response from Ross Spencer, and I originally wanted to reply to that directly using the comment fields. However, my reply turned out to be a bit lengthier than I intended, so I decided to turn it into a separate blog entry.&lt;/p&gt;

&lt;!-- more --&gt;

&lt;h2 id=&quot;numbers-first&quot;&gt;Numbers first?&lt;/h2&gt;

&lt;p&gt;Ross’s overall point is that &lt;a href=&quot;http://www.openplanetsfoundation.org/comment/511#comment-511&quot;&gt;we need the numbers first&lt;/a&gt;; he makes a plea for collecting more format-related data, and adding numbers to these. Although these data do not directly translate into risks, Ross argues that it might be possible to use them to address format risks at a later stage. This may look like a sensible approach at first glance, but on closer inspection there’s a pretty fundamental problem, which I’ll try to explain below. To avoid any confusion, I will be speaking of “format risk” here in the sense used by &lt;a href=&quot;http://purl.pt/24107/1/iPres2013_PDF/A%20Risk%20Analysis%20of%20File%20Formats%20for%20Preservation%20Planning.pdf&quot;&gt;Graf &amp;amp; Gordea&lt;/a&gt;, which follows from the idea of “institutional obsolescence” (which is probably worth a blog post by itself, but I won’t go into this here).&lt;/p&gt;

&lt;h2 id=&quot;the-risk-model&quot;&gt;The risk model&lt;/h2&gt;

&lt;p&gt;Graf &amp;amp; Gordea define institutional obsolescence in terms of “the additional effort required to render a file beyond the capability of a regular PC setup in particular institution”. Let’s call this effort &lt;em&gt;E&lt;/em&gt;. Now the aim is to arrive at an index that has some predictive power of &lt;em&gt;E&lt;/em&gt;. Let’s call this index &lt;em&gt;R&lt;sub&gt;E&lt;/sub&gt;&lt;/em&gt;. For the sake of the argument it doesn’t matter how &lt;em&gt;R&lt;sub&gt;E&lt;/sub&gt;&lt;/em&gt; is defined precisely, but it’s reasonable to assume it will be proportional to &lt;em&gt;E&lt;/em&gt; (i.e. as the effort to render a file increases, so does the risk):&lt;/p&gt;

&lt;p&gt;&lt;em&gt;R&lt;sub&gt;E&lt;/sub&gt;&lt;/em&gt; ∝ &lt;em&gt;E&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The next step is to find a way to estimate &lt;em&gt;R&lt;sub&gt;E&lt;/sub&gt;&lt;/em&gt; (the dependent variable) as a function of a set of potential predictor variables:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;R&lt;sub&gt;E&lt;/sub&gt;&lt;/em&gt; = &lt;em&gt;f&lt;/em&gt;(&lt;em&gt;S&lt;/em&gt;, &lt;em&gt;P&lt;/em&gt;, &lt;em&gt;C&lt;/em&gt;, … )&lt;/p&gt;

&lt;p&gt;where &lt;em&gt;S&lt;/em&gt; = software count, &lt;em&gt;P&lt;/em&gt; = popularity, &lt;em&gt;C&lt;/em&gt; = complexity, and so on. To establish the predictor function we have two possibilities:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;use a statistical approach (e.g. multiple regression or something more sophisticated);&lt;/li&gt;
  &lt;li&gt;use a conceptual model that is based on prior knowledge of how the predictor variables affect &lt;em&gt;R&lt;sub&gt;E&lt;/sub&gt;&lt;/em&gt;.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The first case (statistical approach) is only feasible if we have actual data on &lt;em&gt;E&lt;/em&gt;. For the second case we also need observations on &lt;em&gt;E&lt;/em&gt;, if only to be able to say anything about the model’s ability to predict &lt;em&gt;R&lt;sub&gt;E&lt;/sub&gt;&lt;/em&gt; (verification).&lt;/p&gt;

&lt;h2 id=&quot;no-observed-data-on-e&quot;&gt;No observed data on &lt;em&gt;E&lt;/em&gt;!&lt;/h2&gt;

&lt;p&gt;Either way, the problem here is that there’s an almost complete lack of any data on &lt;em&gt;E&lt;/em&gt;. Although we may have a handful of isolated ‘war stories’, these don’t even come close to the amount of data that would be needed to support any risk model, no matter whether it is purely statistical or based on an underlying conceptual model&lt;sup id=&quot;fnref:1&quot;&gt;&lt;a href=&quot;#fn:1&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;1&lt;/a&gt;&lt;/sup&gt;. So how are we going to model a quantity for which we do not have any observed data in the first place? Or am I overlooking something here?&lt;/p&gt;

&lt;p&gt;Looking at Ross’s suggestions for collecting more data, all of the examples he provides fall into the &lt;em&gt;potential&lt;/em&gt; (!) predictor variables category. For instance, prompted by my observation on compression in &lt;em&gt;PDF&lt;/em&gt;, Ross suggests analysing large collections of &lt;em&gt;PDF&lt;/em&gt;s to establish patterns on the occurrence of various types of compression (and other features), and attaching numbers to them. Ross acknowledges that such numbers by themselves don’t tell you if &lt;em&gt;PDF&lt;/em&gt; is “riskier” than another format, but he argues that:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;once we’ve got them  [the numbers], subject matter experts and maybe some of those mathematical types with far greater statistics capability than my own might be able to work with us to do something just a little bit clever with them.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Aside from the fact that it’s debatable whether, in practical terms, the use of compression is really a risk (is there any evidence to back up this claim?), there’s a more fundamental issue here. Bearing in mind that, ultimately, the thing we’re &lt;em&gt;really&lt;/em&gt; interested in here is &lt;em&gt;E&lt;/em&gt;, how could collecting more data on potential predictor variables of &lt;em&gt;E&lt;/em&gt; ever help here &lt;em&gt;in the near absence of any actual data&lt;/em&gt; on &lt;em&gt;E&lt;/em&gt;? No amount of clever maths or statistics  can compensate for that! Meanwhile, ongoing work on the prediction of &lt;em&gt;E&lt;/em&gt; mainly seems to be focused on the collection, aggregation and analysis of potential predictor variables (which is also illustrated by Ross’s suggestions), even though the purpose of these efforts remains largely unclear.&lt;/p&gt;

&lt;p&gt;Within this context I was quite intrigued by the grant proposal mentioned by &lt;a href=&quot;http://www.openplanetsfoundation.org/comment/513#comment-513&quot;&gt;Andrea Goethals&lt;/a&gt; which, from the description, looks like an actual (and quite possibly the first) attempt at the systematic collection of data on &lt;em&gt;E&lt;/em&gt; (although, as &lt;a href=&quot;http://www.openplanetsfoundation.org/comment/513#comment-513&quot;&gt;Andy Jackson said here&lt;/a&gt;, I’m also wondering whether this may be too ambitious).&lt;/p&gt;

&lt;h2 id=&quot;obsolescence-related-risks-versus-format-instance-risks&quot;&gt;Obsolescence-related risks versus format instance risks&lt;/h2&gt;

&lt;p&gt;On a final note, Ross makes the following remark about the role of tools:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;[W]ith tools such as Jpylyzer we have such powerful ways of measuring formats - and more and more should appear over time.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This is true to some extent, but a tool like &lt;em&gt;jpylyzer&lt;/em&gt; only provides information on format &lt;em&gt;instances&lt;/em&gt; (i.e. features of &lt;em&gt;individual files&lt;/em&gt;); it doesn’t say anything about preservation risks of the JP2 format &lt;em&gt;in general&lt;/em&gt;. The same applies to tools that are able to &lt;a href=&quot;/2013/07/25/identification-pdf-preservation-risks-sequel&quot;&gt;detect features in individual PDF files&lt;/a&gt; that are risky from a long-term preservation point of view. Such risks affect file instances of &lt;em&gt;current&lt;/em&gt; formats, and this is an area that is covered by the &lt;a href=&quot;http://wiki.opf-labs.org/display/TR/OPF+File+Format+Risk+Registry&quot;&gt;OPF File Format Risk Registry&lt;/a&gt; that is being developed within SCAPE (it only covers a limited number of formats). They are largely unrelated to (institutional) format obsolescence, which is the domain that is being addressed by &lt;a href=&quot;http://ffma.ait.ac.at:8080/preservation-riskmanagement/&quot;&gt;&lt;em&gt;FFMA&lt;/em&gt;&lt;/a&gt;. This distinction is important, because both types of risks need to be tackled in fundamentally different ways, using different tools, methods and data. Also, by not being clear about which risks are being addressed, we may end up not using our data in the best possible way. For example, Ross’s suggestion on compression in &lt;em&gt;PDF&lt;/em&gt; entails (if I’m understanding him correctly) the analysis of large volumes of &lt;em&gt;PDF&lt;/em&gt;s in order to gather statistics on the use of different compression types. Since such statistics say little about individual file instances, a more practically useful approach might be to profile individual file instances for ‘risky’ features.&lt;/p&gt;

&lt;hr /&gt;

&lt;p&gt;Originally published at the &lt;a href=&quot;https://openpreservation.org/blog/2013/10/08/measuring-bigfoot/&quot;&gt;Open Preservation Foundation blog&lt;/a&gt;&lt;/p&gt;
&lt;div class=&quot;footnotes&quot; role=&quot;doc-endnotes&quot;&gt;
  &lt;ol&gt;
    &lt;li id=&quot;fn:1&quot;&gt;
      &lt;p&gt;On a side note even conceptual models often need to be fine-tuned against observed data, which can make them pretty similar to statistically-derived models. &lt;a href=&quot;#fnref:1&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
  &lt;/ol&gt;
&lt;/div&gt;
</description>
                <link>https://bitsgalore.org/2013/10/08/measuring-bigfoot</link>
                <guid>https://bitsgalore.org/2013/10/08/measuring-bigfoot</guid>
                <pubDate>2013-10-08T00:00:00+02:00</pubDate>
        </item>

        <item>
                <title>Assessing file format risks&#58; searching for Bigfoot?</title>
                <description>&lt;p&gt;Last week someone drew my attention to a recent &lt;em&gt;iPres&lt;/em&gt; paper by Roman Graf and Sergiu Gordea titled “&lt;a href=&quot;http://purl.pt/24107/1/iPres2013_PDF/A%20Risk%20Analysis%20of%20File%20Formats%20for%20Preservation%20Planning.pdf&quot;&gt;A Risk Analysis of File Formats for Preservation Planning&lt;/a&gt;”. The authors propose a methodology for assessing preservation risks for file formats using publicly available information sources. In short, their approach involves two stages:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;
    &lt;p&gt;Collect and aggregate information on file formats from data sources such as &lt;a href=&quot;http://www.nationalarchives.gov.uk/PRONOM&quot;&gt;PRONOM&lt;/a&gt;, &lt;a href=&quot;http://www.freebase.com/&quot;&gt;Freebase&lt;/a&gt; and &lt;a href=&quot;http://dbpedia.org&quot;&gt;DBPedia&lt;/a&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Use this information to compute scores for a number of pre-defined risk factors (e.g. the number of software applications that support the format, the format’s complexity, its popularity, and so on). A weighted average of these individual scores then gives an overall risk score.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This has resulted in the “File Format Metadata Aggregator” (&lt;em&gt;FFMA&lt;/em&gt;), which is an expert system aimed at establishing a “&lt;em&gt;well structured knowledge base with defined rules and scored metrics that is intended to provide decision making support for preservation experts&lt;/em&gt;”.&lt;/p&gt;

&lt;!-- more --&gt;

&lt;p&gt;The paper caught my attention for two reasons: first, a number of years ago some colleagues at the &lt;em&gt;KB&lt;/em&gt; developed a &lt;a href=&quot;http://www.kb.nl/hrd/dd/dd_links_en_publicaties/publicaties/KB_file_format_evaluation_method_27022008.pdf&quot;&gt;method for evaluating file formats&lt;/a&gt; that is based on a similar way of looking at preservation risks. Second, just a few weeks ago I found out that the University of North Carolina is also working on &lt;a href=&quot;http://www.ils.unc.edu/digccurr/ct_poster/Ryan.pdf&quot;&gt;a method for assessing “File Format Endangerment”&lt;/a&gt; which seems to be following a similar approach. Now let me start by saying that I’m extremely uneasy about assessing preservation risks in this way. To a large extent this is based on experiences with the &lt;em&gt;KB&lt;/em&gt;-developed method, which is similar to the assessment method behind &lt;em&gt;FFMA&lt;/em&gt;. I will use the remainder of this blog post to explain my reservations.&lt;/p&gt;

&lt;h2 id=&quot;criteria-are-largely-theoretical&quot;&gt;Criteria are largely theoretical&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;FFMA&lt;/em&gt; implicitly assumes that it is possible to assess format-specific preservation risks by evaluating formats against a list of pre-defined criteria. In this regard it is similar to (and builds on) the logic behind, to name but two examples, &lt;a href=&quot;http://www.digitalpreservation.gov/formats/sustain/sustain.shtml&quot;&gt;Library of Congress’ Sustainability Factors&lt;/a&gt; and &lt;a href=&quot;http://www.nationalarchives.gov.uk/documents/selecting-file-formats.pdf&quot;&gt;UK National Archives’ format selection criteria&lt;/a&gt;. However, these criteria are largely based on theoretical considerations, without being backed up by any empirical data. As a result, their predictive value is largely unknown.&lt;/p&gt;

&lt;h2 id=&quot;appropriateness-of-measures&quot;&gt;Appropriateness of measures&lt;/h2&gt;

&lt;p&gt;Even if we agree that criteria such as software support and the existence of migration paths to some alternative format are important, how exactly do we measure this? It is pretty straightforward to simply count the number of supporting software products or migration paths, but this says nothing about their &lt;em&gt;quality&lt;/em&gt; or suitability for a specific task. For example, &lt;em&gt;PDF&lt;/em&gt; is supported by a plethora of software tools, yet it is well known that few of them support &lt;em&gt;every&lt;/em&gt; feature of the format (possibly even none, with the exception of Adobe’s implementation). Here’s another example: quite a few (open-source) software tools support the &lt;em&gt;JP2&lt;/em&gt; format, but for this many of them (including &lt;em&gt;ImageMagick&lt;/em&gt; and &lt;em&gt;GraphicsMagick&lt;/em&gt;) rely on &lt;a href=&quot;http://www.ece.uvic.ca/~frodo/jasper/&quot;&gt;&lt;em&gt;JasPer&lt;/em&gt;&lt;/a&gt;, a JPEG 2000 library that is notorious for its poor performance and stability. So even if a format is supported by lots of tools, this will be of little use if the quality of those tools is poor.&lt;/p&gt;

&lt;h2 id=&quot;risk-model-and-weighting-of-scores&quot;&gt;Risk model and weighting of scores&lt;/h2&gt;

&lt;p&gt;Just as the employed criteria are largely theoretical, so is the computation of the risk scores, the weights that are assigned to each risk factor, and the way the individual scores are aggregated into an overall score. The latter is computed as the weighted sum of all individual scores, which means that a poor score on, for example, &lt;em&gt;Software Count&lt;/em&gt; can be compensated by a high score on other factors. This doesn’t strike me as very realistic, and it is also at odds with e.g. &lt;a href=&quot;http://blog.dshr.org/2009/01/are-format-specifications-important-for.html&quot;&gt;David Rosenthal’s view&lt;/a&gt; of formats with open source renderers being immune from format obsolescence.&lt;/p&gt;
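
&lt;p&gt;A small Python sketch may make the compensation effect concrete. All factor names, scores and weights below are made up for illustration and are not taken from &lt;em&gt;FFMA&lt;/em&gt;; scores are assumed to run from 0 to 1, with higher meaning lower risk. The "weakest link" function is just one conceivable alternative, included for contrast.&lt;/p&gt;

```python
# Hypothetical risk factors with equal weights; names, scores and weights
# are made up for illustration and are not taken from FFMA.
WEIGHTS = {
    "software_count": 0.25,
    "popularity": 0.25,
    "complexity": 0.25,
    "openness": 0.25,
}


def weighted_score(scores, weights=WEIGHTS):
    """Weighted average of the individual factor scores (1 = lowest risk)."""
    total_weight = sum(weights.values())
    return sum(weights[name] * scores[name] for name in weights) / total_weight


def weakest_link_score(scores):
    """Alternative aggregation: the overall score equals the worst factor score."""
    return min(scores.values())


# Format A has no supporting software at all, but scores well on everything else.
format_a = {"software_count": 0.0, "popularity": 1.0, "complexity": 1.0, "openness": 1.0}
# Format B is mediocre on every factor.
format_b = {"software_count": 0.5, "popularity": 0.5, "complexity": 0.5, "openness": 0.5}

print(weighted_score(format_a))      # 0.75
print(weighted_score(format_b))      # 0.5
print(weakest_link_score(format_a))  # 0.0
```

&lt;p&gt;Under the weighted average, the format without any supporting software still comes out ahead of the mediocre one, whereas a weakest-link aggregation would flag it immediately.&lt;/p&gt;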

&lt;h2 id=&quot;accuracy-of-underlying-data&quot;&gt;Accuracy of underlying data&lt;/h2&gt;

&lt;p&gt;A cursory look at the &lt;a href=&quot;http://ffma.ait.ac.at:8080/preservation-riskmanagement/&quot;&gt;web service implementation of &lt;em&gt;FFMA&lt;/em&gt;&lt;/a&gt; revealed some results that make me wonder about the data that are used for the risk assessment. According to &lt;em&gt;FFMA&lt;/em&gt;:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;http://ffma.ait.ac.at:8080/preservation-riskmanagement/rest/loddataanalysis/html/riskscorereport?name=png&amp;amp;configName=&amp;amp;classificationName=&quot;&gt;&lt;em&gt;PNG&lt;/em&gt;&lt;/a&gt;, &lt;a href=&quot;http://ffma.ait.ac.at:8080/preservation-riskmanagement/rest/loddataanalysis/html/riskscorereport?name=jpg&amp;amp;configName=&amp;amp;classificationName=&quot;&gt;&lt;em&gt;JPG&lt;/em&gt;&lt;/a&gt; and &lt;a href=&quot;http://ffma.ait.ac.at:8080/preservation-riskmanagement/rest/loddataanalysis/html/riskscorereport?name=gif&amp;amp;configName=&amp;amp;classificationName=&quot;&gt;GIF&lt;/a&gt; are uncompressed formats (they’re not!);&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;http://ffma.ait.ac.at:8080/preservation-riskmanagement/rest/loddataanalysis/html/riskscorereport?name=pdf&amp;amp;configName=&amp;amp;classificationName=&quot;&gt;&lt;em&gt;PDF&lt;/em&gt;&lt;/a&gt; is &lt;em&gt;not&lt;/em&gt; a compressed format (in reality text in &lt;em&gt;PDF&lt;/em&gt; nearly always uses &lt;a href=&quot;http://en.wikipedia.org/wiki/DEFLATE&quot;&gt;Flate compression&lt;/a&gt;, whereas a &lt;a href=&quot;http://www.prepressure.com/pdf/basics/compression&quot;&gt;whole array of compression methods&lt;/a&gt; may be used for images);&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;http://ffma.ait.ac.at:8080/preservation-riskmanagement/rest/loddataanalysis/html/riskscorereport?name=jp2&amp;amp;configName=&amp;amp;classificationName=&quot;&gt;&lt;em&gt;JP2&lt;/em&gt;&lt;/a&gt; is not supported by &lt;em&gt;any&lt;/em&gt; software (Software Count=0!), it doesn’t have a &lt;em&gt;MIME&lt;/em&gt; type, it is frequently used, and it is supported by web browsers (all wrong, although arguably &lt;em&gt;some&lt;/em&gt; browser support exists if you account for external plugins);&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;http://ffma.ait.ac.at:8080/preservation-riskmanagement/rest/loddataanalysis/html/riskscorereport?name=jpf&amp;amp;configName=&amp;amp;classificationName=&quot;&gt;&lt;em&gt;JPX&lt;/em&gt;&lt;/a&gt; is &lt;em&gt;not&lt;/em&gt; a compressed format and it is less complex than &lt;em&gt;JP2&lt;/em&gt; (in reality it is an extension of &lt;em&gt;JP2&lt;/em&gt; with added complexity).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To some extent this may also explain the peculiar ranking of formats in Figure 6 of the paper, which marks down &lt;em&gt;PDF&lt;/em&gt; and &lt;em&gt;MS Word&lt;/em&gt; (!) as formats with a lower risk than &lt;em&gt;TIFF&lt;/em&gt; (&lt;em&gt;GIF&lt;/em&gt; has the overall lowest score).&lt;/p&gt;

&lt;h2 id=&quot;what-risks&quot;&gt;What risks?&lt;/h2&gt;

&lt;p&gt;It is important to note that the concept of ‘preservation risk’ as addressed by &lt;em&gt;FFMA&lt;/em&gt; is closely related to (and has its origins in) the idea of formats becoming obsolete over time. This idea is controversial, and the authors do acknowledge this by defining preservation risks in terms of the “&lt;em&gt;additional effort required to render a file beyond the capability of a regular PC setup in [a] particular institution&lt;/em&gt;”. However, in its current form &lt;em&gt;FFMA&lt;/em&gt; only provides generalized information about formats, without addressing specific risks &lt;em&gt;within&lt;/em&gt; formats. A good example of this is &lt;em&gt;PDF&lt;/em&gt;, which may contain various features that are &lt;a href=&quot;/2012/07/26/pdf-inventory-long-term-preservation-risks&quot;&gt;problematic&lt;/a&gt; for long-term preservation. Also note how &lt;em&gt;PDF&lt;/em&gt; is marked as a &lt;a href=&quot;http://ffma.ait.ac.at:8080/preservation-riskmanagement/rest/loddataanalysis/html/riskscorereport?name=pdf&amp;amp;configName=&amp;amp;classificationName=&quot;&gt;&lt;em&gt;low-risk&lt;/em&gt; format&lt;/a&gt;, despite the fact that it can be a container for &lt;em&gt;JP2&lt;/em&gt;, which is considered &lt;a href=&quot;http://ffma.ait.ac.at:8080/preservation-riskmanagement/rest/loddataanalysis/html/riskscorereport?name=jp2&amp;amp;configName=&amp;amp;classificationName=&quot;&gt;&lt;em&gt;high-risk&lt;/em&gt;&lt;/a&gt;. So doesn’t that imply that a &lt;em&gt;PDF&lt;/em&gt; that contains &lt;em&gt;JPEG 2000&lt;/em&gt; compressed images is at a higher risk?&lt;/p&gt;

&lt;h2 id=&quot;encyclopedia-replacing-expertise&quot;&gt;Encyclopedia replacing expertise?&lt;/h2&gt;

&lt;p&gt;A possible response to the objections above would be to refine &lt;em&gt;FFMA&lt;/em&gt;: adjust the criteria, modify the way the individual risk scores are computed, tweak the weights, change the way the overall score is computed from the individual scores, and improve the underlying data. Even though I’m sure this could lead to some improvement, I’m eerily reminded here of &lt;a href=&quot;http://www.openplanetsfoundation.org/blogs/2013-09-13-registries-we-need&quot;&gt;this recent &lt;strike&gt;rant&lt;/strike&gt; blog post&lt;/a&gt; by Andy Jackson, in which he shares his concerns about the archival community’s preoccupation with format, software, and hardware registries. Apart from the question whether the existing registries are actually helpful in solving real-world problems, Jackson suggests that:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;Maybe we don’t know what information we need?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;and:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;Maybe we don’t even know who or what we are building registries for?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;He also wonders:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;Are we trying to replace imagination and expertise with an encyclopedia?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I think these comments apply equally well to the recurring attempts at reducing format-specific preservation risks to numerical risk factors, scores and indices. This approach simply doesn’t do justice to the subtleties of practical digital preservation. Worse still, I see a potential danger of non-experts taking the results from such expert systems at face value, which can easily lead to ill-judged decisions. Here’s an example.&lt;/p&gt;

&lt;h2 id=&quot;kb-example&quot;&gt;KB example&lt;/h2&gt;

&lt;p&gt;About five years ago some colleagues at the &lt;em&gt;KB&lt;/em&gt; developed a “quantifiable file format risk assessment method”, which is described in &lt;a href=&quot;http://www.kb.nl/hrd/dd/dd_links_en_publicaties/publicaties/KB_file_format_evaluation_method_27022008.pdf&quot;&gt;this report&lt;/a&gt;. This method was &lt;a href=&quot;http://www.kb.nl/hrd/dd/dd_links_en_publicaties/publicaties/Alternative_File_Formats_for_Storing_Masters_2_1.pdf&quot;&gt;applied&lt;/a&gt; to decide which still image format was the best candidate to replace the then-current format for digitisation masters. The outcome of this was used to justify a change from uncompressed &lt;em&gt;TIFF&lt;/em&gt; to &lt;em&gt;JP2&lt;/em&gt;. It was only much later that we found out about a host of practical and standard-related problems with the format, some of which are discussed &lt;a href=&quot;http://jpeg2000wellcomelibrary.blogspot.nl/2010/12/guest-post-ensuring-suitability-of-jpeg.html&quot;&gt;here&lt;/a&gt; and &lt;a href=&quot;http://www.dlib.org/dlib/may11/vanderknijff/05vanderknijff.html&quot;&gt;here&lt;/a&gt;. &lt;em&gt;None&lt;/em&gt; of these problems were accounted for by the earlier risk assessment method (and I have a hard time seeing how they ever could be)! The risk factor approach of &lt;em&gt;FFMA&lt;/em&gt; covers similar ground, and this adds to my scepticism about addressing preservation risks in this manner.&lt;/p&gt;

&lt;h2 id=&quot;final-thoughts&quot;&gt;Final thoughts&lt;/h2&gt;

&lt;p&gt;Taking into account the problems mentioned in this blog post, I have a hard time seeing how scoring models such as the one used by &lt;em&gt;FFMA&lt;/em&gt; would help in solving practical digital preservation issues. It also makes me wonder why this idea keeps on being revisited over and over again. Similar to the format registry situation, is this perhaps another manifestation of the “&lt;em&gt;trying to replace imagination and expertise with an encyclopedia&lt;/em&gt;” phenomenon? What exactly is the point of classifying or ranking formats according to perceived preservation “risks” if these “risks” are largely based on theoretical considerations, and are so general that they say next to nothing about individual file (format) instances? Isn’t this all a bit like &lt;a href=&quot;http://www.searchingforbigfoot.com/&quot;&gt;searching for Bigfoot&lt;/a&gt;? Wouldn’t the time and effort involved in these activities be better spent on trying to solve, document and publish concrete format-related problems and their solutions? Some examples can be found &lt;a href=&quot;http://unsustainableideas.wordpress.com/2012/10/15/ppt-4-adventure-learning/&quot;&gt;here&lt;/a&gt; (accessing old PowerPoint 4 files), &lt;a href=&quot;http://notepad.benfinoradin.info/2013/09/12/it-takes-a-village-to-save-a-hard-drive/&quot;&gt;here&lt;/a&gt; (recovering the contents of an old Commodore Amiga hard disk), &lt;a href=&quot;http://anjackson.github.io/keeping-codes/experiments/BBC%20Micro%20Data%20Recovery.html&quot;&gt;here&lt;/a&gt; (BBC Micro Data Recovery), or even &lt;a href=&quot;http://wiki.opf-labs.org/display/TR/OPF+File+Format+Risk+Registry&quot;&gt;here&lt;/a&gt; (problems with contemporary formats)?&lt;/p&gt;

&lt;p&gt;I think there could also be a valuable role for some of the &lt;em&gt;FFMA&lt;/em&gt;-related work in all this: the aggregation component of &lt;em&gt;FFMA&lt;/em&gt; looks really useful for the automatic discovery of, for example, software applications that are able to read a specific format, and this could be hugely helpful in solving real-world preservation problems.&lt;/p&gt;

&lt;hr /&gt;

&lt;p&gt;Originally published at the &lt;a href=&quot;https://openpreservation.org/blog/2013/09/30/assessing-file-format-risks-searching-bigfoot/&quot;&gt;Open Preservation Foundation blog&lt;/a&gt;&lt;/p&gt;
</description>
                <link>https://bitsgalore.org/2013/09/30/assessing-file-format-risks-searching-bigfoot</link>
                <guid>https://bitsgalore.org/2013/09/30/assessing-file-format-risks-searching-bigfoot</guid>
                <pubDate>2013-09-30T00:00:00+02:00</pubDate>
        </item>

        <item>
                <title>Optimising archival JP2s for the derivation of access copies</title>
                <description>&lt;p&gt;Like many other organisations that are using JPEG 2000, the KB produces two representations of most of its digitised content (newspapers, books, periodicals):&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;a high-quality, losslessly compressed JP2 that is the archival master;&lt;/li&gt;
  &lt;li&gt;a lesser-quality, lossily compressed JP2 that is used as an access image (this is used for e.g. our &lt;a href=&quot;https://resolver.kb.nl/resolve?urn=ddd:010620674:mpeg21:a0201&quot;&gt;newspapers website&lt;/a&gt;).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The majority of our digitisation work is contracted out to external suppliers, and both master and access images are typically derived from a parent (TIFF) image, which is converted to JP2 using the settings for master and access images, respectively. This means that we’re not currently using the archival masters for producing derived images. However, there may be a need for this at some point in the future. For instance, we may need higher quality access images, or access images that give better performance in our access environment. Because of this, I was asked to take a further look into ways to derive access JP2s directly from our archival masters.&lt;/p&gt;

&lt;p&gt;In this blog post I’ll be sharing some preliminary findings of this work, which may be of interest to other JPEG 2000 practitioners as well. All images and test results that I’ll be showing along the way are available from this &lt;a href=&quot;https://github.com/bitsgalore/JP2AccessGeneration&quot;&gt;Github repository&lt;/a&gt;, so you can have a go at these data yourself, if you’re so inclined.&lt;/p&gt;

&lt;!-- more --&gt;

&lt;h2 id=&quot;masters-vs-access-images&quot;&gt;Masters vs access images&lt;/h2&gt;

&lt;p&gt;To better understand the remainder of this blog post it is helpful to outline the differences between our masters and our access images. The tables below list the encoding-related specifications of both.&lt;/p&gt;

&lt;h3 id=&quot;specifications-master&quot;&gt;Specifications master&lt;/h3&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th style=&quot;text-align: left&quot;&gt;Parameter&lt;/th&gt;
      &lt;th style=&quot;text-align: left&quot;&gt;Value&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;File format&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;JP2 (JPEG 2000 Part 1)&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Compression type&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Lossless (reversible 5-3 wavelet filter)&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Colour transform&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Yes (only for colour images)&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Number of decomposition levels&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;5&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Progression order&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;RPCL&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Tile size&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;1024 x 1024&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Code block size&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;64 x 64 (2&lt;sup&gt;6&lt;/sup&gt; x 2&lt;sup&gt;6&lt;/sup&gt;)&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Number of quality layers&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;1&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Error resilience&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Start-of-packet headers; end-of-packet headers; segmentation symbols&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;h3 id=&quot;specifications-access&quot;&gt;Specifications access&lt;/h3&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th style=&quot;text-align: left&quot;&gt;Parameter&lt;/th&gt;
      &lt;th style=&quot;text-align: left&quot;&gt;Value&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;File format&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;JP2 (JPEG 2000 Part 1)&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Compression type&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Lossy (irreversible 9-7 wavelet filter)&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Colour transform&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Yes (only for colour images)&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Number of decomposition levels&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;5&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Progression order&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;RPCL&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Tile size&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;1024 x 1024&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Code block size&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;64 x 64 (2&lt;sup&gt;6&lt;/sup&gt; x 2&lt;sup&gt;6&lt;/sup&gt;)&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Precinct size&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;256 x 256 (2&lt;sup&gt;8&lt;/sup&gt;) for 2 highest resolution levels; 128 x 128 (2&lt;sup&gt;7&lt;/sup&gt;) for remaining resolution levels&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Number of quality layers&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;8&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Target compression ratio layers&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;2560:1 [1] ; 1280:1 [2] ;  640:1 [3] ; 320:1 [4] ; 160:1 [5] ; 80:1 [6] ; 40:1 [7] ; 20:1 [8]&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Error resilience&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Start-of-packet headers; end-of-packet headers; segmentation symbols&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;The main differences between the two are:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;access images are compressed lossily (reduced file size), whereas lossless compression is used for the masters;&lt;/li&gt;
  &lt;li&gt;access images contain quality layers (enables progressive decoding), whereas the masters don’t;&lt;/li&gt;
  &lt;li&gt;access images use precincts (optimises performance while panning across zoomed-in regions), which aren’t used in the masters either.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2 id=&quot;methods&quot;&gt;Methods&lt;/h2&gt;

&lt;p&gt;So, the central question here is: if we have an image that was encoded according to the master specifications, how can we derive an image from this that conforms to our access specifications? To find out, I did a number of tests on the image &lt;em&gt;balloon_master.jp2&lt;/em&gt;, which was created according to the KB’s master specifications. It looks like this (surprise, surprise!):&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/2013/08/balloon_lossless.png&quot; alt=&quot;Balloon, master&quot; /&gt;&lt;/p&gt;

&lt;p&gt;I tried to derive an access image from this master using 2 popular JPEG 2000 software toolkits:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;http://www.kakadusoftware.com/&quot;&gt;Kakadu&lt;/a&gt; (v 7.2.2)&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;http://www.aware.com/imaging/jpeg2000sdk.html&quot;&gt;Aware JPEG 2000 SDK&lt;/a&gt; (v 3.19.0.0)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For both software packages I limited myself to using only the pre-compiled binaries (i.e. the &lt;em&gt;kdu_..&lt;/em&gt; demo tools for Kakadu, and the &lt;em&gt;j2kdriver&lt;/em&gt; command-line tool for Aware).&lt;/p&gt;

&lt;h2 id=&quot;kakadu&quot;&gt;Kakadu&lt;/h2&gt;

&lt;p&gt;Kakadu’s &lt;em&gt;kdu_compress&lt;/em&gt; tool doesn’t accept any of the JPEG 2000 formats as &lt;em&gt;input&lt;/em&gt;; however, it does include a &lt;em&gt;kdu_transcode&lt;/em&gt; tool which is capable of a wide array of reformatting operations. I should add here that &lt;em&gt;kdu_transcode&lt;/em&gt; is primarily intended as a demo tool that showcases Kakadu’s codestream reformatting capabilities, and it doesn’t produce output in the JP2 format (for a detailed explanation by Kakadu’s author look &lt;a href=&quot;http://tech.groups.yahoo.com/group/kakadu_jpeg2000/message/6777&quot;&gt;here&lt;/a&gt;)&lt;sup id=&quot;fnref:1&quot;&gt;&lt;a href=&quot;#fn:1&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;1&lt;/a&gt;&lt;/sup&gt;.&lt;/p&gt;

&lt;p&gt;However, &lt;em&gt;kdu_transcode&lt;/em&gt; is capable of wrapping output in a &lt;em&gt;JPX&lt;/em&gt; container (which can be made &lt;em&gt;JP2&lt;/em&gt;-compatible), so this is what I used for these tests. To keep things simple, I started out by instructing the tool to create an output image with a 20:1 compression ratio (ignoring any of the layer / precinct requirements). For an RGB image with 8 bits/component (24 bits per pixel uncompressed) this corresponds to an equivalent bitrate of 24 / 20 = 1.2 bits per pixel, so I ended up with the following command line:&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;kdu_transcode &lt;span class=&quot;nt&quot;&gt;-i&lt;/span&gt; balloon_master.jp2 &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;-o&lt;/span&gt; balloon_access_kdu.jpf &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;-jpx_layers&lt;/span&gt; sRGB,0,1,2 &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    &lt;span class=&quot;nv&quot;&gt;Sprofile&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;PROFILE2 &lt;span class=&quot;nt&quot;&gt;-rate&lt;/span&gt; 1.2
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The resulting output image did have the expected size, but opening it in an image viewer revealed a problem:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/2013/08/balloon_access_kdu.png&quot; alt=&quot;Balloon, transcoded by Kakadu&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Compared to the source image, most of the colour information has gone, resulting in a representation that is largely grayscale. The reason behind this seemingly unexpected result is fairly simple: when &lt;em&gt;kdu_transcode&lt;/em&gt; creates the derived (lower quality) image, it does so by discarding some of the information that makes up the source image. In other words, it doesn’t decode and recompress the image, but instead re-arranges the compressed image data (which is a very fast process). For a source image with multiple quality layers, the result would be largely equivalent to discarding some of the highest quality layers. However, our source image only has one single quality layer, so this isn’t possible here. Instead, we end up with a result in which most of the colour information is missing (my guess is that the exact behaviour in such cases also depends on the progression order that was used for encoding the source image). Importantly, this is &lt;em&gt;not&lt;/em&gt; a flaw of the tool, but simply a consequence of the way the source image was formatted upon its creation.&lt;/p&gt;

&lt;h2 id=&quot;aware&quot;&gt;Aware&lt;/h2&gt;

&lt;p&gt;Aware’s &lt;em&gt;j2kdriver&lt;/em&gt; tool supports encoding, decoding and reformatting of JP2 images. I used the following command line in an attempt to create a lossy access image (note that the &lt;em&gt;-w&lt;/em&gt; switch sets the transformation to irreversible 9-7 wavelet, and the &lt;em&gt;-R&lt;/em&gt; switch sets the target compression ratio to 20:1):&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;j2kdriver &lt;span class=&quot;nt&quot;&gt;-i&lt;/span&gt; balloon_master.jp2 &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;-R&lt;/span&gt; 20 &lt;span class=&quot;nt&quot;&gt;-w&lt;/span&gt; &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    I97 &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;-t&lt;/span&gt; JP2 &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;-o&lt;/span&gt; balloon_access_aw.jp2
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This produced an output image that has the same size as the master! Similar to Kakadu’s &lt;em&gt;kdu_transcode&lt;/em&gt; tool, &lt;em&gt;j2kdriver&lt;/em&gt; makes no attempt at decoding and recompressing the source image in this case. However, the Aware tool does have a number of reformatting options, including one that allows you to discard quality layers. Needless to say, as the source image contains only one quality layer, this isn’t of much use in this case.&lt;/p&gt;

&lt;h2 id=&quot;optimising-the-archival-masters-for-access-generation&quot;&gt;Optimising the archival masters for access generation&lt;/h2&gt;

&lt;p&gt;In order to produce access images from our current archival masters, we would need to fully decode the source images and then recompress them. Even though this is perfectly possible (e.g. we could simply convert each JP2 to TIFF and then compress that back to lossy JP2), this is both awkward and computationally expensive. A more elegant approach would be to take advantage of JPEG 2000’s ability to include multiple quality layers. We’re already using quality layers in our existing access images, but this is mainly to optimise performance for access. However, we can also define quality layers in the preservation masters, and we can do this in such a way that a subset of all the quality layers in the master become equivalent to the access image. Access images can then be generated by simply discarding one or more quality layers in the preservation master, without any need for re-compressing the whole image. Visually, this results in the following situation:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/2013/08/layers.png&quot; alt=&quot;Layers, master vs access&quot; /&gt;&lt;/p&gt;

&lt;p&gt;This is also the approach that Rob Buckley suggested in this &lt;a href=&quot;http://wellcomelibrary.org/assets/wtx056572.pdf&quot;&gt;2009 report for the Wellcome Library&lt;/a&gt;. 
In this case we have a losslessly compressed master with 11 quality layers. Access images at a 20:1 compression ratio can then be derived by simply discarding the highest 3 quality layers.&lt;/p&gt;

&lt;h2 id=&quot;making-it-work&quot;&gt;Making it work&lt;/h2&gt;

&lt;p&gt;To make this all work, I first optimised the specifications of the preservation masters by incorporating the quality layer definitions from our access specifications, adding 3 further quality layers to accommodate for the higher quality that is produced by lossless compression. I also added precinct definitions, since we’re using those for access as well. This resulted in the following profile:&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th style=&quot;text-align: left&quot;&gt;Parameter&lt;/th&gt;
      &lt;th style=&quot;text-align: left&quot;&gt;Value&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;File format&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;JP2 (JPEG 2000 Part 1)&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Compression type&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Lossless (reversible 5-3 wavelet filter)&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Colour transform&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Yes (only for colour images)&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Number of decomposition levels&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;5&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Progression order&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;RPCL&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Tile size&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;1024 x 1024&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Code block size&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;64 x 64 (2&lt;sup&gt;6&lt;/sup&gt; x 2&lt;sup&gt;6&lt;/sup&gt;)&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Precinct size&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;256 x 256 (2&lt;sup&gt;8&lt;/sup&gt;) for 2 highest resolution levels; 128 x 128 (2&lt;sup&gt;7&lt;/sup&gt;) for remaining resolution levels&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Number of quality layers&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;11&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Target compression ratio layers&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;2560:1 [1] ; 1280:1 [2] ;  640:1 [3] ; 320:1 [4] ; 160:1 [5] ; 80:1 [6] ; 40:1 [7] ; 20:1 [8] ; 10:1 [9] ; 5:1 [10] ; 2.5:1 [11]*&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Error resilience&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Start-of-packet headers; end-of-packet headers; segmentation symbols&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
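
&lt;p&gt;As a rough aid for reading the layer definitions in this table, the sketch below converts each target compression ratio to an equivalent bitrate in bits per pixel. It assumes an RGB image with 8 bits/component (24 bits per pixel uncompressed); the function name &lt;em&gt;ratio_to_bitrate&lt;/em&gt; is made up for this illustration and is not part of any JPEG 2000 toolkit. The resulting figures are nominal targets only.&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;#!/usr/bin/env bash
# Convert a target compression ratio to a bitrate (bits per pixel).
# Assumption: 24 bits/pixel uncompressed (RGB, 8 bits/component),
# unless a different depth is passed as the second argument.
ratio_to_bitrate() {
    local ratio="$1"
    local bits_per_pixel="${2:-24}"
    awk -v r="$ratio" -v b="$bits_per_pixel" 'BEGIN { printf "%g\n", b / r }'
}

# Layer ratios from the table, from lowest to highest quality
for ratio in 2560 1280 640 320 160 80 40 20 10 5 2.5; do
    printf "%s:1  %s bits/pixel\n" "$ratio" "$(ratio_to_bitrate "$ratio")"
done
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;For the 20:1 layer this gives 1.2 bits per pixel. For a grayscale image with 8 bits per pixel, the depth can be passed as a second argument (e.g. &lt;em&gt;ratio_to_bitrate 20 8&lt;/em&gt;).&lt;/p&gt;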

&lt;p&gt;Then I went back to that dreaded balloon image TIFF, and created a new lossless master that follows the optimised specifications (&lt;em&gt;balloon_master_layers_precincts.jp2&lt;/em&gt;).&lt;/p&gt;

&lt;h2 id=&quot;generating-the-access-image&quot;&gt;Generating the access image&lt;/h2&gt;

&lt;p&gt;Kakadu’s &lt;em&gt;kdu_transcode&lt;/em&gt; doesn’t allow you to explicitly discard quality layers, but the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;-rate&lt;/code&gt; switch can be used to select an output bitrate, which has pretty much the same effect. So we can simply set all parameters to the same values as in our earlier example (remember that the 1.2 bitrate is equivalent to a compression ratio of 20:1 for an RGB image):&lt;/p&gt;

&lt;h3 id=&quot;kakadu-1&quot;&gt;Kakadu&lt;/h3&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;kdu_transcode &lt;span class=&quot;nt&quot;&gt;-i&lt;/span&gt; balloon_master_layers_precincts.jp2 &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;-o&lt;/span&gt; balloon_access_precincts_kdu.jpf &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;-jpx_layers&lt;/span&gt; sRGB,0,1,2 &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    &lt;span class=&quot;nv&quot;&gt;Sprofile&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;PROFILE2 &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;-rate&lt;/span&gt; 1.2
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;In contrast to our earlier test, the resulting image has a very good quality. Note that using &lt;em&gt;kdu_transcode&lt;/em&gt; in this way produces output images that have the same number of quality layers as the source image (here: 11). However, in this case the 3 highest layers are actually empty (i.e. the 4 highest quality layers are effectively identical). This is not a problem at all; it just means that progressive decoding of the image will result in an improved quality up to (and including) layer 8, with layers 9, 10 and 11 not adding anything on top of that.&lt;/p&gt;

&lt;h3 id=&quot;aware-1&quot;&gt;Aware&lt;/h3&gt;

&lt;p&gt;Aware works differently in that it allows you to define explicitly which quality layers must be included in the output image. To get an access image with 20:1 compression ratio, we need to include the 4th best quality layer and anything below it (i.e. discard the 3 highest quality layers), which is done with the following command:&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;j2kdriver &lt;span class=&quot;nt&quot;&gt;-i&lt;/span&gt; balloon_master_layers_precincts.jp2 &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;-ql&lt;/span&gt; 4 &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;-t&lt;/span&gt; JP2 &lt;span class=&quot;se&quot;&gt;\&lt;/span&gt;
    &lt;span class=&quot;nt&quot;&gt;-o&lt;/span&gt; balloon_access_layers_precincts_aw.jp2
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Note that instead of decoding images by quality layer, it is also possible to do this by resolution level. This can be useful if derived images at a lower resolution are needed. Both Kakadu’s &lt;em&gt;kdu_transcode&lt;/em&gt; and Aware’s &lt;em&gt;j2kdriver&lt;/em&gt; application are capable of this, provided that the master images contain a sufficient number of resolution levels (which is controlled by the number of decomposition levels at the encoding stage).&lt;/p&gt;
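
&lt;p&gt;To give an idea of what decoding by resolution level yields, here is a minimal sketch that computes the pixel dimensions at each level. It relies on the fact that every discarded resolution level halves the width and height (rounded up), and it assumes an image without an image offset. The 5000 x 7000 pixel size and the function name &lt;em&gt;reduced_size&lt;/em&gt; are hypothetical, and chosen for illustration only.&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;#!/usr/bin/env bash
# Size of an image decoded at a reduced resolution level.
# Each discarded level halves width and height, rounded up.
# Assumption: the image has no image offset.
reduced_size() {
    local width="$1" height="$2" discard="$3"
    local factor=$(( 2 ** discard ))
    echo "$(( (width + factor - 1) / factor )) x $(( (height + factor - 1) / factor ))"
}

# Hypothetical 5000 x 7000 pixel master with 5 decomposition levels,
# so between 0 and 5 resolution levels can be discarded
for discard in 0 1 2 3 4 5; do
    echo "discard $discard: $(reduced_size 5000 7000 "$discard")"
done
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;With 5 decomposition levels, the smallest version of this hypothetical image would measure 157 x 219 pixels.&lt;/p&gt;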

&lt;h2 id=&quot;conclusions&quot;&gt;Conclusions&lt;/h2&gt;

&lt;p&gt;Careful selection of how JP2 preservation masters are generated can greatly facilitate the derivation of access images at a later stage. Tests with images that follow the KB’s current master specifications showed that lossy access images could only be derived by fully decoding and re-compressing them. Though not necessarily a problem, a more efficient approach would be to make better use of quality layers. This allows access images to be derived by simply extracting a subset of the master, without the need to decode or re-compress the source data. Tests with two widely used JPEG 2000 software toolkits (Kakadu and Aware) show that using this approach the process of deriving access images is both simple and efficient.&lt;/p&gt;

&lt;h2 id=&quot;acknowledgements&quot;&gt;Acknowledgements&lt;/h2&gt;

&lt;p&gt;Thanks go out to David Taubman, whose &lt;a href=&quot;http://tech.groups.yahoo.com/group/kakadu_jpeg2000/message/6777&quot;&gt;reply to some of my questions on Kakadu’s transcode tool&lt;/a&gt; was largely the impetus for this blog post.&lt;/p&gt;

&lt;h2 id=&quot;useful-links&quot;&gt;Useful links&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;https://github.com/bitsgalore/JP2AccessGeneration&quot;&gt;Dataset with all test images (Github link)&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;http://wellcomelibrary.org/assets/wtx056572.pdf&quot;&gt;Buckley &amp;amp; Tanner (2009): JPEG 2000 as a Preservation and Access Format for the Wellcome Trust Digital Library&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;hr /&gt;

&lt;p&gt;Originally published at the &lt;a href=&quot;https://openpreservation.org/blog/2013/08/19/optimising-archival-jp2s-derivation-access-copies/&quot;&gt;Open Preservation Foundation blog&lt;/a&gt;&lt;/p&gt;
&lt;div class=&quot;footnotes&quot; role=&quot;doc-endnotes&quot;&gt;
  &lt;ol&gt;
    &lt;li id=&quot;fn:1&quot;&gt;
      &lt;p&gt;For most operational uses you would need to create a custom application using the full SDK. &lt;a href=&quot;#fnref:1&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
  &lt;/ol&gt;
&lt;/div&gt;
</description>
                <link>https://bitsgalore.org/2013/08/19/optimising-archival-jp2s-derivation-access-copies</link>
                <guid>https://bitsgalore.org/2013/08/19/optimising-archival-jp2s-derivation-access-copies</guid>
                <pubDate>2013-08-19T00:00:00+02:00</pubDate>
        </item>

        <item>
                <title>Identification of PDF preservation risks with Apache Preflight&#58; the sequel</title>
                <description>&lt;p&gt;Last winter I started a first attempt at &lt;a href=&quot;/2012/12/19/identification-pdf-preservation-risks-apache-preflight-first-impression&quot;&gt;identifying preservation risks&lt;/a&gt; in PDF files using the &lt;em&gt;Apache Preflight&lt;/em&gt; PDF/A validator. This work was later followed up by others in two SPRUCE hackathons in Leeds (see &lt;a href=&quot;http://www.openplanetsfoundation.org/blogs/2013-03-15-pdf-eh-another-hackathon-tale&quot;&gt;this blog post&lt;/a&gt; by Peter Cliff) and London (described &lt;a href=&quot;http://wiki.opf-labs.org/display/SPR/PDFA+Validation+tools+give+different+results&quot;&gt;here&lt;/a&gt;). Much of this later work tacitly assumes that &lt;em&gt;Apache Preflight&lt;/em&gt; is able to successfully identify features in PDF that are a potential risk for long-term access. This &lt;a href=&quot;http://wiki.opf-labs.org/display/SPR/PDFBox+Preflight+2++-+Uses+and+Abuses&quot;&gt;Wiki page on uses and abuses of Preflight&lt;/a&gt; (created as part of the final SPRUCE hackathon) even goes as far as stating that “&lt;em&gt;Preflight is thorough and unforgiving (as it should be)&lt;/em&gt;”. But what evidence do we have to support such claims? The only evidence that I’m aware of, are the results obtained from a &lt;a href=&quot;https://github.com/openplanets/format-corpus/tree/master/pdfCabinetOfHorrors&quot;&gt;small test corpus of custom-created PDFs&lt;/a&gt;. Each PDF in this corpus was created in such a way that it includes only one specific feature that is a potential preservation risk (e.g. encryption, non-embedded fonts, and so on). However, PDFs that exist ‘in the wild’ are usually more complex. Also, the PDF specification often allows you to implement similar features in subtly different ways. 
For these reasons, it is essential to obtain additional evidence of &lt;em&gt;Preflight&lt;/em&gt;’s ability to detect ‘risky’ features before relying on this tool in any operational setting.&lt;/p&gt;

&lt;!-- more --&gt;

&lt;h2 id=&quot;adobe-acrobat-engineering-test-files&quot;&gt;Adobe Acrobat Engineering test files&lt;/h2&gt;

&lt;p&gt;Shortly after I completed my initial tests, Adobe released the &lt;a href=&quot;http://acroeng.adobe.com/wp/&quot;&gt;Acrobat Engineering website&lt;/a&gt;, which contains a large volume of test documents that are used by Adobe for testing their products. Although the test documents are not fully annotated, they are subdivided into categories such as &lt;em&gt;Multimedia &amp;amp; 3D Tests&lt;/em&gt; and &lt;em&gt;Font tests&lt;/em&gt;. This makes these files particularly useful for additional tests on &lt;em&gt;Preflight&lt;/em&gt;.&lt;/p&gt;

&lt;h2 id=&quot;methodology&quot;&gt;Methodology&lt;/h2&gt;

&lt;p&gt;The general methodology I used to analyse these files is identical to what I did in my &lt;a href=&quot;https://zenodo.org/record/2556637&quot;&gt;2012 report&lt;/a&gt;: first, each PDF was validated using &lt;em&gt;Apache Preflight&lt;/em&gt;. As a control I also validated the PDFs with the &lt;em&gt;Preflight&lt;/em&gt; component of Adobe Acrobat, using the &lt;em&gt;PDF/A-1b&lt;/em&gt; profile. The table below lists the software versions used:&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th style=&quot;text-align: left&quot;&gt;Software&lt;/th&gt;
      &lt;th style=&quot;text-align: left&quot;&gt;Version&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Apache Preflight&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;2.0.0&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Adobe Acrobat&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;10.14&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Acrobat Preflight&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;10.1.3 (090)&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;h2 id=&quot;re-analysis-of-pdf-cabinet-of-horrors-corpus&quot;&gt;Re-analysis of PDF Cabinet of Horrors corpus&lt;/h2&gt;

&lt;p&gt;Because the current analysis is based on a more recent version of &lt;em&gt;Apache Preflight&lt;/em&gt; than the one used in the 2012 report (which was 1.8.0), I first re-ran the analysis of the PDFs in the &lt;a href=&quot;https://github.com/openplanets/format-corpus/tree/master/pdfCabinetOfHorrors&quot;&gt;PDF Cabinet of Horrors corpus&lt;/a&gt;. The main results are reproduced &lt;a href=&quot;http://wiki.opf-labs.org/display/TR/Portable+Document+Format&quot;&gt;here&lt;/a&gt;. The main differences with respect to that earlier version are:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;&lt;em&gt;Apache Preflight&lt;/em&gt; now has an option to produce output in &lt;em&gt;XML&lt;/em&gt; format (as &lt;a href=&quot;https://issues.apache.org/jira/browse/PDFBOX-1540&quot;&gt;suggested by William Palmer&lt;/a&gt; following the Leeds SPRUCE hackathon)&lt;/li&gt;
  &lt;li&gt;Better reporting of non-embedded fonts (see also &lt;a href=&quot;https://issues.apache.org/jira/browse/PDFBOX-1449&quot;&gt;this issue&lt;/a&gt;)&lt;/li&gt;
  &lt;li&gt;Unlike the earlier version, &lt;em&gt;Preflight&lt;/em&gt; 2.0.0 does not give any meaningful output in case of encrypted and password-protected PDFs! This is probably a bug, for which I submitted a report &lt;a href=&quot;https://issues.apache.org/jira/browse/PDFBOX-1659&quot;&gt;here&lt;/a&gt;.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2 id=&quot;analysis-acrobat-engineering-pdfs&quot;&gt;Analysis of Acrobat Engineering PDFs&lt;/h2&gt;

&lt;p&gt;Since the Acrobat Engineering site hosts a &lt;em&gt;lot&lt;/em&gt; of PDFs, I only focused on a limited subset for the current analysis:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;all files in the &lt;em&gt;General&lt;/em&gt; section of the &lt;a href=&quot;http://acroeng.adobe.com/wp/?page_id=101&quot;&gt;Font Testing&lt;/a&gt; category;&lt;/li&gt;
  &lt;li&gt;all files in the &lt;em&gt;Classic Multimedia&lt;/em&gt; section of the &lt;a href=&quot;http://acroeng.adobe.com/wp/?page_id=61&quot;&gt;Multimedia &amp;amp; 3D Tests&lt;/a&gt; category.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The results are summarized in two tables (see next sections). For each analysed PDF, the tables list:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;the error(s) reported by Adobe Acrobat Preflight;&lt;/li&gt;
  &lt;li&gt;the error code(s) reported by Apache Preflight (see Preflight’s &lt;a href=&quot;http://svn.apache.org/repos/asf/pdfbox/trunk/preflight/src/main/java/org/apache/pdfbox/preflight/PreflightConstants.java&quot;&gt;source code&lt;/a&gt; for a listing of all possible error codes);&lt;/li&gt;
  &lt;li&gt;the error description(s) reported by Apache Preflight in the &lt;em&gt;details&lt;/em&gt; output element.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For the sake of readability, the tables only list those error messages/codes that are directly related to font problems, multimedia, encryption and JavaScript. The full output for all tested files can be found &lt;a href=&quot;https://github.com/bitsgalore/apachePreflightAcroEng&quot;&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;h2 id=&quot;fonts&quot;&gt;Fonts&lt;/h2&gt;

&lt;p&gt;The table below summarizes the results of the PDFs in the &lt;a href=&quot;http://acroeng.adobe.com/wp/?page_id=101&quot;&gt;Font Testing&lt;/a&gt; category:&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th style=&quot;text-align: left&quot;&gt;Test file&lt;/th&gt;
      &lt;th style=&quot;text-align: left&quot;&gt;Acrobat Preflight error(s)&lt;/th&gt;
      &lt;th style=&quot;text-align: left&quot;&gt;Apache Preflight Error Code(s)&lt;/th&gt;
      &lt;th style=&quot;text-align: left&quot;&gt;Apache Preflight Details&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;&lt;a href=&quot;http://acroeng.adobe.com/Test_Files/fonts//EmbeddedCmap.pdf&quot;&gt;EmbeddedCmap.pdf&lt;/a&gt;&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Font not embedded (and text rendering mode not 3); Glyphs missing in embedded font&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;3.1.3&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Invalid Font definition, FontFile entry is missing from FontDescriptor for HeiseiKakuGo-W5&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;&lt;a href=&quot;http://acroeng.adobe.com/Test_Files/fonts//TEXT.pdf&quot;&gt;TEXT.pdf&lt;/a&gt;&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Font not embedded (and text rendering mode not 3); Glyphs missing in embedded font; TrueType font has differences to standard encodings but is not a symbolic font; Wrong encoding for non-symbolic TrueType font&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;3.1.5; 3.1.1; 3.1.2; 3.1.3; 3.2.4&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Invalid Font definition, The Encoding is invalid for the NonSymbolic TTF; Invalid Font definition, Some required fields are missing from the Font dictionary; Invalid Font definition, FontDescriptor is null or is a AFM Descriptor; Invalid Font definition, FontFile entry is missing from FontDescriptor for Arial,Italic &lt;em&gt;(repeated for other fonts)&lt;/em&gt;; Font damaged, The CharProcs references an element which can’t be read&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;&lt;a href=&quot;http://acroeng.adobe.com/Test_Files/fonts//Type3_WWW-HTML.PDF&quot;&gt;Type3_WWW-HTML.PDF&lt;/a&gt;&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;-&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;3.1.6&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Invalid Font definition, The character with CID”58” should have a width equals to 15.56599 &lt;em&gt;(repeated for other fonts)&lt;/em&gt;&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;&lt;a href=&quot;http://acroeng.adobe.com/Test_Files/fonts//embedded_fonts.pdf&quot;&gt;embedded_fonts.pdf&lt;/a&gt;&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Font not embedded (and text rendering mode not 3); Type 2 CID font: CIDToGIDMap invalid or missing&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;3.1.9; 3.1.11&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Invalid Font definition; Invalid Font definition, The CIDSet entry is missing for the Composite Subset&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;&lt;a href=&quot;http://acroeng.adobe.com/Test_Files/fonts//embedded_pm65.pdf&quot;&gt;embedded_pm65.pdf&lt;/a&gt;&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;-&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;3.1.6&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Invalid Font definition, Width of the character “110” in the font program “HKPLIB+AdobeCorpID-MyriadRg”is inconsistent with the width in the PDF dictionary &lt;em&gt;(repeated for other font)&lt;/em&gt;&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;&lt;a href=&quot;http://acroeng.adobe.com/Test_Files/fonts//notembedded_pm65.pdf&quot;&gt;notembedded_pm65.pdf&lt;/a&gt;&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Font not embedded (and text rendering mode not 3); Glyphs missing in embedded font&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;3.1.3&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Invalid Font definition, FontFile entry is missing from FontDescriptor for TimesNewRoman &lt;em&gt;(repeated for other fonts)&lt;/em&gt;&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;&lt;a href=&quot;http://acroeng.adobe.com/Test_Files/fonts//printtestfont_nonopt.pdf&quot;&gt;printtestfont_nonopt.pdf&lt;/a&gt;*&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;ICC profile is not valid; ICC profile is version 4.0 or newer; ICC profile uses invalid color space; ICC profile uses invalid type&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;-&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;&lt;em&gt;Preflight throws exception (exceptionThrown), exits with message ‘Invalid ICC Profile Data’&lt;/em&gt;&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;&lt;a href=&quot;http://acroeng.adobe.com/Test_Files/fonts//printtestfont_opt.pdf&quot;&gt;printtestfont_opt.pdf&lt;/a&gt;*&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;ICC profile is not valid; ICC profile is version 4.0 or newer; ICC profile uses invalid color space; ICC profile uses invalid type&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;-&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;&lt;em&gt;Preflight throws exception (exceptionThrown), exits with message ‘Invalid ICC Profile Data’&lt;/em&gt;&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;&lt;a href=&quot;http://acroeng.adobe.com/Test_Files/fonts//substitution_fonts.pdf&quot;&gt;substitution_fonts.pdf&lt;/a&gt;&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Font not embedded (and text rendering mode not 3)&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;3.1.1; 3.1.2; 3.1.3&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Invalid Font definition, Some required fields are missing from the Font dictionary; Invalid Font definition, FontDescriptor is null or is a AFM Descriptor; Invalid Font definition, FontFile entry is missing from FontDescriptor for Souvenir-Light &lt;em&gt;(repeated for other fonts)&lt;/em&gt;&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;&lt;a href=&quot;http://acroeng.adobe.com/Test_Files/fonts//text_images_pdf1.2.pdf&quot;&gt;text_images_pdf1.2.pdf&lt;/a&gt;&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Font not embedded (and text rendering mode not 3); Glyphs missing in embedded font; Width information for rendered glyphs is inconsistent&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;3.1.1; 3.1.2&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Invalid Font definition, Some required fields are missing from the Font dictionary; Invalid Font definition, FontDescriptor is null or is a AFM Descriptor&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;* &lt;em&gt;As these documents don’t appear to have any font-related issues, it’s unclear why they are in the Font Testing category. The errors related to ICC profiles are reproduced here because of their relevance to the Apache Preflight exception.&lt;/em&gt;&lt;/p&gt;

&lt;h2 id=&quot;general-observations&quot;&gt;General observations&lt;/h2&gt;

&lt;p&gt;A comparison of the results of Acrobat Preflight and Apache Preflight shows that Apache Preflight’s output may vary in case of non-embedded fonts. In most cases it produces error code 3.1.3 (as was the case with the &lt;em&gt;PDF Cabinet of Horrors&lt;/em&gt; dataset), but other errors in the 3.1.x range may occur as well. The 3.1.6 “character width” error is something that was also encountered during the &lt;a href=&quot;http://wiki.opf-labs.org/display/SPR/PDFA+Validation+tools+give+different+results&quot;&gt;London SPRUCE Hackathon&lt;/a&gt;, and according to the information &lt;a href=&quot;https://groups.google.com/forum/#!topic/pdfnet-sdk/L2osfwaap98&quot;&gt;here&lt;/a&gt; this is most likely the result of the PDF/A specification not being particularly clear. So, this looks like a non-serious error that can be safely ignored in most cases.&lt;/p&gt;

&lt;h2 id=&quot;multimedia&quot;&gt;Multimedia&lt;/h2&gt;

&lt;p&gt;The next table shows the results for the &lt;a href=&quot;http://acroeng.adobe.com/wp/?page_id=61&quot;&gt;Multimedia &amp;amp; 3D Tests&lt;/a&gt; category:&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th style=&quot;text-align: left&quot;&gt;Test file&lt;/th&gt;
      &lt;th style=&quot;text-align: left&quot;&gt;Acrobat Preflight error(s)&lt;/th&gt;
      &lt;th style=&quot;text-align: left&quot;&gt;Apache Preflight Error Code(s)&lt;/th&gt;
      &lt;th style=&quot;text-align: left&quot;&gt;Apache Preflight Details&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;&lt;a href=&quot;http://acroeng.adobe.com/Test_Files/classic_multimedia//20020402_CALOS.pdf&quot;&gt;20020402_CALOS.pdf&lt;/a&gt;&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;-&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;1.0; 1.2.1&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;&lt;em&gt;No multimedia, font or encryption-related errors; Preflight did report syntax and body syntax error&lt;/em&gt;&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;&lt;a href=&quot;http://acroeng.adobe.com/Test_Files/classic_multimedia//Disney-Flash.pdf&quot;&gt;Disney-Flash.pdf&lt;/a&gt;&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Contains action of type JavaScript; Document contains JavaScripts; Document contains additional actions (AA); Font not embedded (and text rendering mode not 3); Form field does not have appearance dict; Form field has actions; Incorrect annotation type used (not allowed in PDF/A); PDF contains EF (embedded file) entry&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;1.0; 1.2.1&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;&lt;em&gt;No multimedia-related errors; Preflight did report syntax and body syntax error&lt;/em&gt;&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;&lt;a href=&quot;http://acroeng.adobe.com/Test_Files/classic_multimedia//Jpeg_linked.pdf&quot;&gt;Jpeg_linked.pdf&lt;/a&gt;&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Document is encrypted; Encrypt key present in file trailer; Named action with a value other than standard page navigation used; Incorrect annotation type used (not allowed in PDF/A); Font not embedded (and text rendering mode not 3)&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;1.0; 1.2.1&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;&lt;em&gt;No multimedia, font or encryption-related errors; Preflight did report syntax and body syntax error&lt;/em&gt;&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;&lt;a href=&quot;http://acroeng.adobe.com/Test_Files/classic_multimedia//MultiMedia_Acro6.pdf&quot;&gt;MultiMedia_Acro6.pdf&lt;/a&gt;&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Document is encrypted; EmbeddedFiles entry in Names dictionary; Encrypt key present in file trailer; PDF contains EF (embedded file) entry; Incorrect annotation type used (not allowed in PDF/A)&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;1.0; 1.2.1&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;&lt;em&gt;No multimedia, font or encryption-related errors; Preflight did report syntax and body syntax error&lt;/em&gt;&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;&lt;a href=&quot;http://acroeng.adobe.com/Test_Files/classic_multimedia//MusicalScore.pdf&quot;&gt;MusicalScore.pdf&lt;/a&gt;&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;CIDset in subset font is incomplete; CIDset in subset font missing; Contains action of type JavaScript; Document contains JavaScripts; Document contains additional actions (AA); Font not embedded (and text rendering mode not 3); Form field has actions; Incorrect annotation type used (not allowed in PDF/A); PDF contains EF (embedded file) entry; Type 2 CID font: CIDToGIDMap invalid or missing&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;1.0; 1.2.1&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;&lt;em&gt;No multimedia, font or encryption-related errors; Preflight did report syntax and body syntax error&lt;/em&gt;&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;&lt;a href=&quot;http://acroeng.adobe.com/Test_Files/classic_multimedia//SVG-AnnotAnim.pdf&quot;&gt;SVG-AnnotAnim.pdf&lt;/a&gt;&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Incorrect annotation type used (not allowed in PDF/A); PDF contains EF (embedded file) entry&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;5.2.1; 1.2.9&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Forbidden field in an annotation definition, The subtype isn’t authorized : SVG; Body Syntax error, EmbeddedFile entry is present in a FileSpecification dictionary&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;&lt;a href=&quot;http://acroeng.adobe.com/Test_Files/classic_multimedia//SVG.pdf&quot;&gt;SVG.pdf&lt;/a&gt;&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Contains action of type JavaScript; Document contains JavaScripts; Font not embedded (and text rendering mode not 3); Form field has actions; PDF contains EF (embedded file) entry&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;1.0; 1.2.1&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;&lt;em&gt;No multimedia, font or encryption-related errors; Preflight did report syntax and body syntax error&lt;/em&gt;&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;&lt;a href=&quot;http://acroeng.adobe.com/Test_Files/classic_multimedia//ScriptEvents.pdf&quot;&gt;ScriptEvents.pdf&lt;/a&gt;&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Contains action of type JavaScript; Document contains JavaScripts; Font not embedded (and text rendering mode not 3); Form field has actions; Incorrect annotation type used (not allowed in PDF/A); PDF contains EF (embedded file) entry&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;1.0; 1.2.1&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;&lt;em&gt;No multimedia, font or encryption-related errors; Preflight did report syntax and body syntax error&lt;/em&gt;&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;&lt;a href=&quot;http://acroeng.adobe.com/Test_Files/classic_multimedia//Service%20Form_media.pdf&quot;&gt;Service Form_media.pdf&lt;/a&gt;&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Contains action of type JavaScript; Contains action of type ResetForm; Document contains JavaScripts; Document contains additional actions (AA); Font not embedded (and text rendering mode not 3); Glyphs missing in embedded font; Incorrect annotation type used (not allowed in PDF/A); Named action with a value other than standard page navigation used; PDF contains EF (embedded file) entry&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;1.0; 1.2.1&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;&lt;em&gt;No multimedia, font or encryption-related errors; Preflight did report syntax and body syntax error&lt;/em&gt;&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;&lt;a href=&quot;http://acroeng.adobe.com/Test_Files/classic_multimedia//Trophy.pdf&quot;&gt;Trophy.pdf&lt;/a&gt;&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Contains action of type JavaScript; Document contains JavaScripts; Font not embedded (and text rendering mode not 3); Form field has actions; Incorrect annotation type used (not allowed in PDF/A); PDF contains EF (embedded file) entry&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;1.0; 1.2.1&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;&lt;em&gt;No multimedia, font or encryption-related errors; Preflight did report syntax and body syntax error&lt;/em&gt;&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;&lt;a href=&quot;http://acroeng.adobe.com/Test_Files/classic_multimedia//VolvoS40V50-Full.pdf&quot;&gt;VolvoS40V50-Full.pdf&lt;/a&gt;&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Preflight exits with: “An error occurred  while parsing a contents stream. Unable to analyze the PDF file”&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;1.0; 1.2.1&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;&lt;em&gt;No multimedia, font or encryption-related errors; Preflight did report syntax and body syntax error&lt;/em&gt;&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;&lt;a href=&quot;http://acroeng.adobe.com/Test_Files/classic_multimedia//gXsummer2004-stream.pdf&quot;&gt;gXsummer2004-stream.pdf&lt;/a&gt;&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;File cannot be loaded in Acrobat (&lt;em&gt;damaged file&lt;/em&gt;)&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;1.0; 1.1&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;&lt;em&gt;No multimedia, font or encryption-related errors; Preflight did report syntax and body syntax error&lt;/em&gt;&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;&lt;a href=&quot;http://acroeng.adobe.com/Test_Files/classic_multimedia//phlmapbeta7.pdf&quot;&gt;phlmapbeta7.pdf&lt;/a&gt;&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Document contains additional actions (AA); Font not embedded (and text rendering mode not 3); Incorrect annotation type used (not allowed in PDF/A); PDF contains EF (embedded file) entry&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;1.0; 1.2.1&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;&lt;em&gt;No multimedia, font or encryption-related errors; Preflight did report syntax and body syntax error&lt;/em&gt;&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;&lt;a href=&quot;http://acroeng.adobe.com/Test_Files/classic_multimedia//us_population.pdf&quot;&gt;us_population.pdf&lt;/a&gt;&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Preflight exits with: “An error occurred  while parsing a contents stream. Unable to analyze the PDF file”&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;1.0; 1.2.1&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;&lt;em&gt;No multimedia, font or encryption-related errors; Preflight did report syntax and body syntax error&lt;/em&gt;&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;&lt;a href=&quot;http://acroeng.adobe.com/Test_Files/movie//movie.pdf&quot;&gt;movie.pdf&lt;/a&gt;&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Incorrect annotation type used (not allowed in PDF/A)&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;5.2.1&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Forbidden field in an annotation definition, The subtype isn’t authorized : Movie&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;&lt;a href=&quot;http://acroeng.adobe.com/Test_Files/movie//movie_down1.pdf&quot;&gt;movie_down1.pdf&lt;/a&gt;&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Incorrect annotation type used (not allowed in PDF/A)&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;5.2.1&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Forbidden field in an annotation definition, The subtype isn’t authorized : Movie&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;&lt;a href=&quot;http://acroeng.adobe.com/Test_Files/movie//remotemovieurl.pdf&quot;&gt;remotemovieurl.pdf&lt;/a&gt;&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Font not embedded (and text rendering mode not 3); Incorrect annotation type used (not allowed in PDF/A)&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;5.2.1; 3.1.1; 3.1.2; 3.1.3&lt;/td&gt;
      &lt;td style=&quot;text-align: left&quot;&gt;Forbidden field in an annotation definition, The subtype isn’t authorized : Movie; Invalid Font definition, Some required fields are missing from the Font dictionary; Invalid Font definition, FontDescriptor is null or is a AFM Descriptor; Invalid Font definition, FontFile entry is missing from FontDescriptor for Arial&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;h2 id=&quot;general-observations-1&quot;&gt;General observations&lt;/h2&gt;

&lt;p&gt;The results from the &lt;em&gt;Multimedia&lt;/em&gt; PDFs are interesting for several reasons. First of all, these files include a wide variety of ‘risky’ features, such as multimedia content, embedded files, JavaScript, non-embedded fonts and encryption. These were successfully identified by &lt;em&gt;Acrobat Preflight&lt;/em&gt; in most cases. &lt;em&gt;Apache Preflight&lt;/em&gt;, on the other hand, only reported non-specific and fairly uninformative errors (1.0 + 1.2.1) for 12 out of 17 files. Even though &lt;em&gt;Preflight&lt;/em&gt; was correct in establishing that these files were not valid PDF/A-1b, it wasn’t able to drill down to the level of specific features for the majority of these files.&lt;/p&gt;

&lt;p&gt;Looking in more detail at those 1.0 and 1.2.1 errors, the detailed description of most of them is:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-data&quot;&gt; Syntax error, Expected pattern &apos;obj but missed at character &apos;o&apos;
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;To me it looks like &lt;em&gt;Preflight&lt;/em&gt; doesn’t correctly parse the binary structure of the PDF. Opening a few of the problematic PDFs revealed that the object identifiers in these files were followed &lt;em&gt;immediately&lt;/em&gt; by the object contents, e.g.:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-data&quot;&gt;32 0 obj&amp;lt;&amp;lt;/Kids[33 0 R]&amp;gt;&amp;gt;
endobj
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;whereas more commonly they are separated by a line terminator, like this:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-data&quot;&gt;32 0 obj
&amp;lt;&amp;lt;/Kids[33 0 R]&amp;gt;&amp;gt;
endobj
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;As far as I’m aware neither the PDF specification nor PDF/A have anything to say about line endings in this case, so my best guess is that this is simply a bug that results in the file not being fully parsed. I submitted a bug report for this issue &lt;a href=&quot;https://issues.apache.org/jira/browse/PDFBOX-1674&quot;&gt;here&lt;/a&gt;.&lt;/p&gt;
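
&lt;p&gt;To check quickly whether a given PDF uses this tightly packed notation, something like the rough Python sketch below could be used. It is not part of the original tests: the function name and the sample bytes are made up for illustration, and because it is a plain pattern search on raw bytes it may also produce false hits inside binary stream data.&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;import re

# Matches an indirect object header (e.g. 32 0 obj) that is immediately
# followed by a non-whitespace byte, so without a space or line terminator.
TIGHT_OBJ = re.compile(rb&quot;(\d+)\s+(\d+)\s+obj(?=[^\s])&quot;)

def find_tight_objects(data):
    # Return (object number, generation number) pairs of all such headers
    return [(int(m.group(1)), int(m.group(2))) for m in TIGHT_OBJ.finditer(data)]

# Hypothetical sample: object 32 is tightly packed, object 33 is not
sample = b&quot;32 0 obj&amp;lt;&amp;lt;/Kids[33 0 R]&amp;gt;&amp;gt;\nendobj\n33 0 obj\n&amp;lt;&amp;lt;/Type/Page&amp;gt;&amp;gt;\nendobj\n&quot;
print(find_tight_objects(sample))
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;For the sample above this reports only object 32. For a real file, the bytes would be read with something like &lt;em&gt;open(path, &quot;rb&quot;).read()&lt;/em&gt;.&lt;/p&gt;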

&lt;h2 id=&quot;summary-and-conclusions&quot;&gt;Summary and conclusions&lt;/h2&gt;

&lt;p&gt;The re-analysis of the PDF Cabinet of Horrors corpus, and the subsequent analysis of a sub-set of the Adobe Acrobat Engineering PDFs, show a number of things. First, &lt;em&gt;Apache Preflight&lt;/em&gt; 2.0.0 does not properly identify encryption and password-protection. This looks like a bug that is probably easily fixed. Second, the analysis of the &lt;em&gt;Font Testing&lt;/em&gt; PDFs from the Acrobat Engineering site revealed that non-embedded fonts may result in a variety of error codes in &lt;em&gt;Apache Preflight&lt;/em&gt; (assuming here that the &lt;em&gt;Acrobat Preflight&lt;/em&gt; results are accurate). So, when using &lt;em&gt;Apache Preflight&lt;/em&gt; to check font embedding, it’s probably a good idea to treat all font-related errors (perhaps with the exception of character width errors) as a potential risk. Third, the more complex PDFs in the &lt;em&gt;Multimedia&lt;/em&gt; category proved to be quite challenging for &lt;em&gt;Apache Preflight&lt;/em&gt;: for most files here, it was not able to identify &lt;em&gt;specific features&lt;/em&gt; such as multimedia content, embedded files, JavaScript and non-embedded fonts. A cursory analysis of some of the failed files suggests that this is probably a bug that results in &lt;em&gt;Apache Preflight&lt;/em&gt; not being able to parse the file structure correctly. Keeping in mind that the specificity of &lt;em&gt;Preflight&lt;/em&gt;’s validation output has already improved considerably since version 1.8.0, a fix of both this issue and the encryption problem would probably result in another significant improvement. In the meantime, it’s important to keep expectations about the tool’s capabilities realistic, in order to avoid unintended misuse.&lt;/p&gt;

&lt;h2 id=&quot;links&quot;&gt;Links&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://github.com/bitsgalore/apachePreflightAcroEng&quot;&gt;Full Acrobat Preflight and Apache Preflight output for all tested files (GitHub)&lt;/a&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;http://wiki.opf-labs.org/display/TR/Portable+Document+Format&quot;&gt;Portable Document Format on OPF File Format Risk Registry&lt;/a&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;http://acroeng.adobe.com/wp/&quot;&gt;Adobe Acrobat Engineering website&lt;/a&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://zenodo.org/record/2556637&quot;&gt;Identification of PDF preservation risks with Apache Preflight: a first impression&lt;/a&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;http://wiki.opf-labs.org/display/SPR/PDFBox+Preflight+2++-+Uses+and+Abuses&quot;&gt;PDFBox Preflight 2 - Uses and Abuses (OPF Wiki)&lt;/a&gt;&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;hr /&gt;

&lt;p&gt;Originally published at the &lt;a href=&quot;https://openpreservation.org/blog/2013/07/25/identification-pdf-preservation-risks-sequel/&quot;&gt;Open Preservation Foundation blog&lt;/a&gt;&lt;/p&gt;
</description>
                <link>https://bitsgalore.org/2013/07/25/identification-pdf-preservation-risks-sequel</link>
                <guid>https://bitsgalore.org/2013/07/25/identification-pdf-preservation-risks-sequel</guid>
                <pubDate>2013-07-25T00:00:00+02:00</pubDate>
        </item>

        <item>
                <title>ICC profiles and resolution in JP2&#58; update on 2011 D-Lib paper</title>
                <description>&lt;p&gt;It’s been more than two years now since I wrote my D-Lib paper &lt;a href=&quot;http://www.dlib.org/dlib/may11/vanderknijff/05vanderknijff.html&quot;&gt;&lt;em&gt;JPEG 2000 for Long-term Preservation: JP2 as a Preservation Format&lt;/em&gt;&lt;/a&gt;. From time to time people ask me about the status of the issues that are mentioned in that paper, so here’s a long overdue update.&lt;/p&gt;

&lt;!-- more --&gt;

&lt;h2 id=&quot;issues-addressed-in-the-2011-paper&quot;&gt;Issues addressed in the 2011 paper&lt;/h2&gt;

&lt;p&gt;The &lt;a href=&quot;http://www.dlib.org/dlib/may11/vanderknijff/05vanderknijff.html&quot;&gt;D-Lib paper&lt;/a&gt; mainly focused on two problems with the (then-current version of the) &lt;a href=&quot;http://www.jpeg.org/public/15444-1annexi.pdf&quot;&gt;JP2 format specification&lt;/a&gt;:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;The specification was overly restrictive on the embedding of ICC profiles. By only allowing &lt;em&gt;input&lt;/em&gt; profiles, this ruled out the use of &lt;em&gt;display&lt;/em&gt; profiles. In practice this meant that widely-used working colour spaces such as &lt;a href=&quot;http://www.adobe.com/digitalimag/pdfs/AdobeRGB1998.pdf&quot;&gt;&lt;em&gt;Adobe RGB&lt;/em&gt;&lt;/a&gt; and &lt;a href=&quot;http://www.eci.org/doku.php?id=en:colourstandards:workingcolorspaces&quot;&gt;&lt;em&gt;eciRGB 2&lt;/em&gt;&lt;/a&gt; could not be used in &lt;em&gt;JP2&lt;/em&gt; without violating the standard.&lt;/li&gt;
  &lt;li&gt;&lt;em&gt;JP2&lt;/em&gt; makes a distinction between &lt;em&gt;capture&lt;/em&gt; resolution and &lt;em&gt;default display&lt;/em&gt; resolution, which are stored in two designated sets of header fields. However, the specification was not clear in which case either set of fields should be used.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;As a result, software products did not all interpret the specification in the same way. For instance, some encoders would (silently) produce files in &lt;a href=&quot;http://fileformats.archiveteam.org/wiki/JPX&quot;&gt;&lt;em&gt;JPX&lt;/em&gt;&lt;/a&gt; format whenever they encountered an input image with an embedded &lt;em&gt;display&lt;/em&gt; ICC profile. Other encoders would embed the profile, changing the profile class in the process, while yet others would ignore the limitation altogether and embed the profile without complaining. Similarly, some software products would only write (and read) the &lt;em&gt;capture&lt;/em&gt; resolution fields (while ignoring any &lt;em&gt;default display&lt;/em&gt; ones), whereas the opposite was true for other products. This in turn raised various interoperability issues, many of which are potential risks in a long-term preservation context.&lt;/p&gt;
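
&lt;p&gt;For readers who want to check which kind of profile an image carries: as far as I know from the ICC specification, the profile class is stored as a four-character signature at byte offset 12 of the 128-byte profile header (&lt;em&gt;scnr&lt;/em&gt; for input, &lt;em&gt;mntr&lt;/em&gt; for display). The Python sketch below reads that field. The function names are hypothetical, the sample header is synthetic rather than a real profile, and the check covers only the profile class rule, not the other restrictions of the &lt;em&gt;Restricted ICC profile&lt;/em&gt; method.&lt;/p&gt;

```python
# Profile class signatures as defined by the ICC specification
# (only the three device classes are listed here).
PROFILE_CLASSES = {
    b"scnr": "input",
    b"mntr": "display",
    b"prtr": "output",
}


def profile_class(icc_bytes):
    """Return the profile class, read from the 128-byte ICC header."""
    header = icc_bytes[:128]
    if len(header) != 128:
        raise ValueError("ICC profile header is truncated")
    # Bytes 12..15 hold the profile class signature.
    return PROFILE_CLASSES.get(bytes(header[12:16]), "other")


def violates_original_jp2_rule(icc_bytes):
    """True if the profile is not an input profile (class rule only)."""
    return profile_class(icc_bytes) != "input"


# Synthetic header for illustration: all zeros except the class field.
display_header = bytearray(128)
display_header[12:16] = b"mntr"

print(profile_class(display_header))
print(violates_original_jp2_rule(display_header))
```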

&lt;p&gt;The 2011 paper concluded that:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;[t]hese issues could be remedied by some small adjustments of JP2’s format specification, which would create minimal backward compatibility problems, if any at all.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2 id=&quot;amendment-to-the-standard&quot;&gt;Amendment to the standard&lt;/h2&gt;

&lt;p&gt;So, enter the &lt;a href=&quot;http://www.iso.org/iso/home/store/catalogue_tc/catalogue_detail.htm?csnumber=59863&quot;&gt;amendment “&lt;em&gt;Updated ICC profile support and resolution clarification&lt;/em&gt;”&lt;/a&gt;, which was published by ISO earlier this year. This amendment remedies the above issues by applying the following changes to the &lt;a href=&quot;http://www.jpeg.org/public/15444-1annexi.pdf&quot;&gt;existing JP2 format specification&lt;/a&gt;:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;The &lt;em&gt;Restricted ICC profile&lt;/em&gt; method now permits the use of &lt;em&gt;display&lt;/em&gt; profiles (previously only &lt;em&gt;input&lt;/em&gt; profiles were allowed). The other restrictions (e.g. that ICC profiles should be of either the Monochrome or the Three-Component Matrix-Based type) remain unchanged.&lt;/li&gt;
  &lt;li&gt;It is more specific about the intended uses of the &lt;em&gt;capture&lt;/em&gt; and the &lt;em&gt;default display&lt;/em&gt; resolution boxes. Of particular interest here is that &lt;em&gt;capture&lt;/em&gt; resolution now reflects the resolution at which the image samples were “captured or created”. Previously the word “digitized” was used, which ruled out the case of born-digital materials. The use of the &lt;em&gt;default display&lt;/em&gt; resolution box is also further clarified. In practice this means that the &lt;em&gt;capture&lt;/em&gt; resolution is pretty much equivalent to the &lt;em&gt;XResolution&lt;/em&gt;/&lt;em&gt;YResolution&lt;/em&gt; fields in TIFF.&lt;/li&gt;
&lt;/ol&gt;
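
&lt;p&gt;To make the comparison with TIFF concrete, here is a small Python sketch that converts the contents of a resolution box to pixels per inch. It assumes the box layout as I understand it from the JP2 specification: a 10-byte payload with vertical and horizontal numerators and denominators (unsigned 16-bit, big-endian) followed by two signed 8-bit exponents, giving a resolution in grid points per metre. The function name and the sample payload are made up for this example.&lt;/p&gt;

```python
import struct

METRES_PER_INCH = 0.0254


def resolution_ppi(box_payload):
    """Convert a 10-byte resolution box payload to (vertical, horizontal) ppi."""
    # Assumed field order: VRcN, VRcD, HRcN, HRcD, VRcE, HRcE.
    v_num, v_den, h_num, h_den, v_exp, h_exp = struct.unpack(">HHHHbb", box_payload)
    # Stored values are grid points per metre: numerator / denominator * 10^exponent.
    v_per_metre = v_num / v_den * 10 ** v_exp
    h_per_metre = h_num / h_den * 10 ** h_exp
    return v_per_metre * METRES_PER_INCH, h_per_metre * METRES_PER_INCH


# Hypothetical payload: 30000 / 254 * 10^2 grid points per metre in both directions.
payload = struct.pack(">HHHHbb", 30000, 254, 30000, 254, 2, 2)
vertical, horizontal = resolution_ppi(payload)
print(round(vertical, 2), round(horizontal, 2))
```

&lt;p&gt;The sample payload corresponds to 300 pixels per inch, which is the value a TIFF would carry in &lt;em&gt;XResolution&lt;/em&gt; and &lt;em&gt;YResolution&lt;/em&gt; with the resolution unit set to inches.&lt;/p&gt;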

&lt;p&gt;Note that the full amendment text is only available after purchase at ISO (previously an earlier draft was available for free, but apparently it was taken down recently).&lt;/p&gt;

&lt;h2 id=&quot;implementation-of-changes&quot;&gt;Implementation of changes&lt;/h2&gt;

&lt;p&gt;Since a standard isn’t worth much unless it is used, let’s have a quick look at the three most popular JPEG 2000 implementations (a more elaborate overview is available &lt;a href=&quot;http://wiki.opf-labs.org/display/TR/Handling+of+ICC+profiles&quot;&gt;here&lt;/a&gt; and &lt;a href=&quot;http://wiki.opf-labs.org/display/TR/Resolution+not+in+expected+header+fields&quot;&gt;here&lt;/a&gt; at the &lt;a href=&quot;http://wiki.opf-labs.org/display/TR/OPF+File+Format+Risk+Registry&quot;&gt;OPF File Format Risk Registry&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;Recent versions of &lt;strong&gt;Kakadu&lt;/strong&gt;’s &lt;em&gt;kdu_compress&lt;/em&gt; are now able to correctly handle &lt;em&gt;ICC&lt;/em&gt; profiles. Compressing a &lt;em&gt;TIFF&lt;/em&gt; that contains an &lt;em&gt;ICC&lt;/em&gt; profile that meets the (updated) &lt;em&gt;restricted ICC profile&lt;/em&gt; definition now produces a &lt;em&gt;JP2&lt;/em&gt; with the profile correctly embedded. Moreover, &lt;em&gt;kdu_compress&lt;/em&gt; now uses the &lt;em&gt;capture&lt;/em&gt; resolution fields to store the image’s resolution (as derived from the &lt;em&gt;TIFF&lt;/em&gt;). Previously, the &lt;em&gt;Kakadu&lt;/em&gt; demo applications were using the &lt;em&gt;default display&lt;/em&gt; fields instead, which resulted in various interoperability issues, because most other decoders/encoders were using the &lt;em&gt;capture&lt;/em&gt; fields. This is all solved in the latest version.&lt;/p&gt;

&lt;p&gt;By default &lt;strong&gt;Aware&lt;/strong&gt; and &lt;strong&gt;Luratech&lt;/strong&gt; already used the &lt;em&gt;capture&lt;/em&gt; resolution fields back in 2011, and this behaviour is now consistent with the updated standard. As for &lt;em&gt;ICC&lt;/em&gt; profiles, &lt;em&gt;Aware&lt;/em&gt; accepted &lt;em&gt;display&lt;/em&gt; profiles without complaining in my 2011 tests, and with the amendment in effect, these images are now also valid &lt;em&gt;JP2&lt;/em&gt;. &lt;em&gt;Luratech&lt;/em&gt; used to handle &lt;em&gt;display&lt;/em&gt; profiles by changing the profile class field to &lt;em&gt;input&lt;/em&gt;. That was in 2011, and I don’t know if anything has changed since then, but then again this behaviour never caused any problems in the first place.&lt;/p&gt;

&lt;h2 id=&quot;round-up-and-conclusions&quot;&gt;Round-up and conclusions&lt;/h2&gt;

&lt;p&gt;The amendment to JP2 fixes the previous shortcomings that were mentioned in my 2011 D-Lib paper. Moreover, the behaviour of the three most popular (commercial) JPEG 2000 implementations now closely follows the updated specification, which should minimise any interoperability problems related to ICC profiles and resolution.&lt;/p&gt;

&lt;h2 id=&quot;links&quot;&gt;Links&lt;/h2&gt;

&lt;p&gt;&lt;a href=&quot;http://www.dlib.org/dlib/may11/vanderknijff/05vanderknijff.html&quot;&gt;JPEG 2000 for Long-term Preservation: JP2 as a Preservation Format (D-Lib)&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://www.jpeg.org/public/15444-1annexi.pdf&quot;&gt;JP2 format specification (2004 version)&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://www.iso.org/iso/home/store/catalogue_tc/catalogue_detail.htm?csnumber=59863&quot;&gt;ISO/IEC 15444-1:2004/Amd 6:2013. Updated ICC profile support and resolution clarification&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://wiki.opf-labs.org/display/TR/Handling+of+ICC+profiles&quot;&gt;Handling of ICC Profiles by JPEG 2000 encoders (OPF File Format Risk Registry)&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://wiki.opf-labs.org/display/TR/Resolution+not+in+expected+header+fields&quot;&gt;Handling of resolution fields by JPEG 2000 encoders (OPF File Format Risk Registry)&lt;/a&gt;&lt;/p&gt;

&lt;hr /&gt;

&lt;p&gt;Originally published at the &lt;a href=&quot;https://openpreservation.org/blog/2013/07/01/icc-profiles-and-resolution-jp2-update-2011-d-lib-paper/&quot;&gt;Open Preservation Foundation blog&lt;/a&gt;&lt;/p&gt;
</description>
                <link>https://bitsgalore.org/2013/07/01/icc-profiles-and-resolution-jp2-update-2011-d-lib-paper</link>
                <guid>https://bitsgalore.org/2013/07/01/icc-profiles-and-resolution-jp2-update-2011-d-lib-paper</guid>
                <pubDate>2013-07-01T00:00:00+02:00</pubDate>
        </item>

        <item>
                <title>EPUB for archival preservation&#58; an update</title>
                <description>&lt;h2 id=&quot;introduction&quot;&gt;Introduction&lt;/h2&gt;

&lt;p&gt;Last year (2012) the KB released a &lt;a href=&quot;https://zenodo.org/record/839711&quot;&gt;report on the suitability of the &lt;em&gt;EPUB&lt;/em&gt; format for archival preservation&lt;/a&gt;. A substantial number of &lt;em&gt;EPUB&lt;/em&gt;-related developments have happened since then, and as a result some of the report’s findings and conclusions have become outdated. This applies in particular to the observations on &lt;em&gt;EPUB&lt;/em&gt; 3, and the support of &lt;em&gt;EPUB&lt;/em&gt; by characterisation tools. This blog post provides an update to those findings. It addresses the following topics in particular:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Use of &lt;em&gt;EPUB&lt;/em&gt; in scholarly publishing&lt;/li&gt;
  &lt;li&gt;Adoption and use of &lt;em&gt;EPUB&lt;/em&gt; 3&lt;/li&gt;
  &lt;li&gt;&lt;em&gt;EPUB&lt;/em&gt; 3 reader support&lt;/li&gt;
  &lt;li&gt;Support of &lt;em&gt;EPUB&lt;/em&gt; by characterisation tools&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In the following sections I will briefly summarise the main developments in each of these areas, after which I will wrap up things in a concluding section.&lt;/p&gt;

&lt;!-- more --&gt;

&lt;h2 id=&quot;use-of-epub-in-scholarly-publishing&quot;&gt;Use of &lt;em&gt;EPUB&lt;/em&gt; in scholarly publishing&lt;/h2&gt;

&lt;p&gt;Although scholarly publishing is still dominated by &lt;em&gt;PDF&lt;/em&gt;, the use of &lt;em&gt;EPUB&lt;/em&gt; in this sector is on the rise. &lt;a href=&quot;http://scholarlykitchen.sspnet.org/2013/03/19/is-it-time-for-scholarly-journal-publishers-to-begin-distributing-articles-using-epub-3/&quot;&gt;This blog post by Todd Carpenter&lt;/a&gt; gives the following examples:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;BioMed Central is one of the first publishers that started offering their scholarly publications in &lt;em&gt;EPUB&lt;/em&gt; format (as explained in this &lt;a href=&quot;http://blogs.biomedcentral.com/bmcblog/2012/12/11/biomed-central-now-publishes-in-epub-format/&quot;&gt;December 2012 blog post&lt;/a&gt;). One of the journals that are available in &lt;em&gt;EPUB&lt;/em&gt; format is the  &lt;a href=&quot;http://www.jneuroinflammation.com/content&quot;&gt;Journal of Neuroinflammation&lt;/a&gt;.&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;http://www.hindawi.com/epub/&quot;&gt;Hindawi Publishing Corporation recently added &lt;em&gt;EPUB&lt;/em&gt;&lt;/a&gt; as one of the available formats for all of their journal and book publications.&lt;/li&gt;
  &lt;li&gt;Lippincott Williams &amp;amp; Wilkins (which is part of Kluwer) also offers &lt;a href=&quot;http://journals.lww.com/pages/results.aspx?txtKeywords=epub&quot;&gt;some of their titles&lt;/a&gt; in &lt;em&gt;EPUB&lt;/em&gt; format.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;At the time of writing, the above publishers are all using &lt;em&gt;EPUB&lt;/em&gt; 2.&lt;/p&gt;

&lt;h2 id=&quot;adoption-and-use-of-epub-3&quot;&gt;Adoption and use of &lt;em&gt;EPUB&lt;/em&gt; 3&lt;/h2&gt;

&lt;p&gt;Over the last year a number of organisations that are representing the publishing industry have expressed their support of &lt;em&gt;EPUB&lt;/em&gt; 3.&lt;/p&gt;

&lt;p&gt;The &lt;a href=&quot;http://www.bisg.org&quot;&gt;Book Industry Study Group (BISG)&lt;/a&gt; is a trade association for &lt;a href=&quot;http://www.bisg.org/directory/&quot;&gt;companies in  the publishing industry&lt;/a&gt;. Last year (August 2012) BISG released a &lt;a href=&quot;http://www.bisg.org/what-we-do-4-155-pol-1201-endorsement-of-epub-3.php&quot;&gt;policy statement&lt;/a&gt; in which it endorsed “&lt;em&gt;EPUB 3 as the accepted and preferred standard for representing, packaging, and encoding structured and semantically enhanced Web content — including XHTML, CSS, SVG, images, and other resources — for distribution in a single-file format&lt;/em&gt;”.&lt;/p&gt;

&lt;p&gt;Early this year (March 2013) the &lt;a href=&quot;http://www.internationalpublishers.org/&quot;&gt;International Publishers Association (IPA)&lt;/a&gt; issued a &lt;a href=&quot;http://www.internationalpublishers.org/images/stories/PR/2013/epub3pr_final.pdf&quot;&gt;press release&lt;/a&gt; that also endorsed &lt;em&gt;EPUB&lt;/em&gt; 3 as a “&lt;em&gt;preferred standard format for representing HTML and other web content for distribution as single-file publications&lt;/em&gt;”. IPA represents over 60 national publishing organisations from more than 50 countries.&lt;/p&gt;

&lt;p&gt;Finally, the &lt;a href=&quot;http://www.europeanbooksellers.eu/&quot;&gt;European Booksellers Federation&lt;/a&gt; recently released a &lt;a href=&quot;http://www.europeanbooksellers.eu/positionpaper/interoperability-e-books-formats&quot;&gt;report on the interoperability of eBook Formats&lt;/a&gt;. Its authors did a comparison of the features and functionality provided by &lt;em&gt;EPUB&lt;/em&gt; 3, Amazon’s &lt;a href=&quot;https://en.wikipedia.org/wiki/Amazon_Kindle#Proprietary_formats_.28AZW.2C_KF8.29&quot;&gt;KF8 (Kindle)&lt;/a&gt; and Apple’s e-book formats. They concluded that &lt;em&gt;EPUB&lt;/em&gt; 3 “&lt;em&gt;clearly covers the superset of the expressive abilities of all the formats&lt;/em&gt;”, and that there is “&lt;em&gt;no technical or functional reason not to use and establish EPUB 3 as an/the interoperable (open) ebook format standard&lt;/em&gt;”. This all suggests that &lt;em&gt;EPUB&lt;/em&gt; 3 is widely supported by the publishing industry.&lt;/p&gt;

&lt;p&gt;Having said that, the actual use of &lt;em&gt;EPUB&lt;/em&gt; 3 is still limited at this stage, even though some publishers have already started using the format. Earlier this year technical publisher O’Reilly started releasing &lt;a href=&quot;http://toc.oreilly.com/2013/02/oreillys-journey-to-epub-3.html&quot;&gt;all their new eBook bundles in &lt;em&gt;EPUB&lt;/em&gt; 3 format&lt;/a&gt;. The announcement mentions that their backlist will be updated as well. Interestingly, they decided to create “hybrid” &lt;em&gt;EPUB&lt;/em&gt;s that are backward-compatible with &lt;em&gt;EPUB&lt;/em&gt; 2. In November 2012 publisher Hachette also &lt;a href=&quot;http://www.digitalbookworld.com/2012/hachette-launches-epub3-program-committed-to-the-format/&quot;&gt;announced the launch of their &lt;em&gt;EPUB 3&lt;/em&gt; program&lt;/a&gt;.&lt;/p&gt;

&lt;h2 id=&quot;epub-3-reader-support&quot;&gt;&lt;em&gt;EPUB&lt;/em&gt; 3 reader support&lt;/h2&gt;

&lt;p&gt;At this time reader support for &lt;em&gt;EPUB&lt;/em&gt; 3 is still limited, but there have been a number of significant developments since the second half of 2012:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;http://www.apple.com/apps/ibooks/&quot;&gt;Apple iBooks&lt;/a&gt; started support of &lt;em&gt;EPUB&lt;/em&gt; 3 during the course of 2012, which means that the format can be read on Apple mobile devices (e.g. &lt;em&gt;iPad&lt;/em&gt;).&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;http://azardi.infogridpacific.com/&quot;&gt;Azardi Desktop&lt;/a&gt; is a viewer that supports most features of &lt;em&gt;EPUB&lt;/em&gt; 3. It is available for Windows, Mac and Linux.&lt;/li&gt;
  &lt;li&gt;Helicon is an ebook production house that &lt;a href=&quot;http://www.digitalbookworld.com/2012/helicon-books-claims-first-e-reader-for-android-with-full-epub3-support/&quot;&gt;claimed to have developed the first reading application for Android that fully supports &lt;em&gt;EPUB&lt;/em&gt; 3&lt;/a&gt; in December 2012.&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;http://readium.org/&quot;&gt;Readium&lt;/a&gt; is an open-source set of libraries for viewing and creating &lt;em&gt;EPUB&lt;/em&gt; 2 and &lt;em&gt;EPUB&lt;/em&gt; 3 content. The project has already produced a &lt;a href=&quot;https://chrome.google.com/webstore/detail/empty-title/fepbnnnkkadjhjahcafoaglimekefifl?hl=en&quot;&gt;Google Chrome extension&lt;/a&gt; that allows you to view &lt;em&gt;EPUB&lt;/em&gt; content in the browser. Since March 2013, Readium is backed by the &lt;a href=&quot;http://readium.org/readium-foundation-announced&quot;&gt;Readium Foundation&lt;/a&gt;, which is a consortium of (currently 27) &lt;a href=&quot;http://readium.org/membership&quot;&gt;companies&lt;/a&gt; that are active in digital publishing. Current work focuses on the development of &lt;a href=&quot;http://readium.org/projects/readium-sdk&quot;&gt;Readium SDK&lt;/a&gt;, a high-performance Software Development Kit that is optimised for mobile devices.&lt;/li&gt;
  &lt;li&gt;E-Reader manufacturer Kobo &lt;a href=&quot;http://www.digitalbookworld.com/2012/kobo-to-fully-support-epub-3-by-third-quarter-2013/&quot;&gt;announced&lt;/a&gt; that it aims to offer full support of &lt;em&gt;EPUB&lt;/em&gt; 3 by the third quarter of 2013.&lt;/li&gt;
  &lt;li&gt;According to &lt;a href=&quot;http://goodereader.com/blog/electronic-readers/the-conundrum-of-digital-publishing-html5-or-epub-3/&quot;&gt;this piece&lt;/a&gt;, Barnes and Noble also said they will support &lt;em&gt;EPUB&lt;/em&gt; 3 sometime in  2013.&lt;/li&gt;
  &lt;li&gt;In February Sony &lt;a href=&quot;http://goodereader.com/blog/electronic-readers/sony-reader-for-android-updated-to-support-epub-3/&quot;&gt;added support for &lt;em&gt;EPUB&lt;/em&gt; 3 to its Reader app for Google Android&lt;/a&gt;. Since Sony is a member of the &lt;a href=&quot;http://readium.org/readium-foundation-announced&quot;&gt;Readium Foundation&lt;/a&gt;, it is to be expected that &lt;em&gt;EPUB&lt;/em&gt; 3 support for their e-readers will follow as well (although the company hasn’t made any official statement on this).&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;support-of-epub-by-characterisation-tools&quot;&gt;Support of &lt;em&gt;EPUB&lt;/em&gt; by characterisation tools&lt;/h2&gt;

&lt;p&gt;The 2012 report concluded that &lt;em&gt;EPUB&lt;/em&gt; was not optimally supported by characterisation tools. This situation has improved quite a lot since that time.&lt;/p&gt;

&lt;h3 id=&quot;identification&quot;&gt;Identification&lt;/h3&gt;

&lt;p&gt;&lt;em&gt;EPUB&lt;/em&gt; is now &lt;a href=&quot;http://www.nationalarchives.gov.uk/PRONOM/Format/proFormatSearch.aspx?status=detailReport&amp;amp;id=1270&quot;&gt;included in &lt;em&gt;PRONOM&lt;/em&gt;&lt;/a&gt;, and has a corresponding &lt;a href=&quot;http://www.nationalarchives.gov.uk/information-management/projects-and-work/droid.htm&quot;&gt;DROID&lt;/a&gt; signature. This means that &lt;a href=&quot;http://fido.openpreservation.org/&quot;&gt;&lt;em&gt;Fido&lt;/em&gt;&lt;/a&gt; should now be able to identify the format as well. On a side note, PRONOM doesn’t differentiate between &lt;em&gt;EPUB&lt;/em&gt; 2 and 3, and it appears that the current record (which is only an outline record anyway) either combines both versions, or only refers to &lt;em&gt;EPUB&lt;/em&gt; 2. PRONOM should probably be more specific on this.&lt;/p&gt;
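
&lt;p&gt;For a rough idea of what identifying an &lt;em&gt;EPUB&lt;/em&gt; involves: an &lt;em&gt;EPUB&lt;/em&gt; is a ZIP container whose first entry is an uncompressed file named &lt;em&gt;mimetype&lt;/em&gt; that contains the string &lt;em&gt;application/epub+zip&lt;/em&gt;. The Python sketch below checks just that. It is not the PRONOM signature, the function names are hypothetical, the samples are built in memory rather than taken from real publications, and like PRONOM it makes no attempt to tell &lt;em&gt;EPUB&lt;/em&gt; 2 from &lt;em&gt;EPUB&lt;/em&gt; 3.&lt;/p&gt;

```python
import io
import zipfile

EPUB_MIMETYPE = b"application/epub+zip"


def looks_like_epub(fileobj):
    """Rough check: ZIP whose first entry is an uncompressed 'mimetype' file."""
    if not zipfile.is_zipfile(fileobj):
        return False
    fileobj.seek(0)
    with zipfile.ZipFile(fileobj) as archive:
        entries = archive.infolist()
        if not entries:
            return False
        # Assumes the order of the listing reflects the order in the archive.
        first = entries[0]
        return (
            first.filename == "mimetype"
            and first.compress_type == zipfile.ZIP_STORED
            and archive.read(first).strip() == EPUB_MIMETYPE
        )


def make_sample(first_name):
    """Build a tiny in-memory ZIP; only the first entry matters for this demo."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr(first_name, EPUB_MIMETYPE, compress_type=zipfile.ZIP_STORED)
        archive.writestr("META-INF/container.xml", "placeholder")
    buffer.seek(0)
    return buffer


print(looks_like_epub(make_sample("mimetype")))
print(looks_like_epub(make_sample("content.opf")))
```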

&lt;h3 id=&quot;validation-and-feature-extraction&quot;&gt;Validation and feature extraction&lt;/h3&gt;

&lt;p&gt;The 2012 report included tests of two &lt;em&gt;EPUB&lt;/em&gt; validator tools: &lt;a href=&quot;http://code.google.com/p/epubcheck/&quot;&gt;&lt;em&gt;epubcheck&lt;/em&gt;&lt;/a&gt; and &lt;a href=&quot;http://code.google.com/p/flightcrew/&quot;&gt;flightcrew&lt;/a&gt;. While testing &lt;em&gt;epubcheck&lt;/em&gt; in 2012, I wasn’t entirely happy with the rather unstructured output that the tool produced. Also, I couldn’t find &lt;em&gt;any&lt;/em&gt; tool that was capable of extracting technical meta-information about an &lt;em&gt;EPUB&lt;/em&gt;, like the presence of encryption or other digital rights management technology (feature extraction). Happily, starting with version 3.0 &lt;em&gt;epubcheck&lt;/em&gt; is capable of extracting this kind of information. Moreover, it added an option to &lt;a href=&quot;http://code.google.com/p/epubcheck/wiki/Extraction&quot;&gt;report its output in structured &lt;em&gt;XML&lt;/em&gt; format&lt;/a&gt; that follows the &lt;a href=&quot;http://sourceforge.net/projects/jhove/&quot;&gt;&lt;em&gt;JHOVE&lt;/em&gt;&lt;/a&gt; schema. I haven’t done any elaborate testing, but a quick run on some of &lt;a href=&quot;http://code.google.com/p/epub-samples/&quot;&gt;these &lt;em&gt;EPUB&lt;/em&gt; 3 samples&lt;/a&gt; showed that &lt;em&gt;epubcheck&lt;/em&gt; was able to identify font obfuscation, in which case a property &lt;em&gt;hasEncryption&lt;/em&gt; (value &lt;em&gt;true&lt;/em&gt;) is reported. I wasn’t able to find any &lt;em&gt;EPUB&lt;/em&gt; files with &lt;em&gt;DRM&lt;/em&gt;, so I cannot confirm if &lt;em&gt;epubcheck&lt;/em&gt; detects this as well.&lt;/p&gt;

&lt;h3 id=&quot;flightcrew&quot;&gt;Flightcrew&lt;/h3&gt;

&lt;p&gt;As for &lt;em&gt;flightcrew&lt;/em&gt;, no new versions of that tool have been released since August 2011, and it looks like it is not under any active development.&lt;/p&gt;

&lt;h2 id=&quot;discussion-and-conclusions&quot;&gt;Discussion and conclusions&lt;/h2&gt;

&lt;p&gt;Since the release of the &lt;a href=&quot;https://zenodo.org/record/839711&quot;&gt;KB report on the suitability of &lt;em&gt;EPUB&lt;/em&gt; for archival preservation&lt;/a&gt; the &lt;em&gt;EPUB&lt;/em&gt; landscape has changed rather a lot. First, a number of academic publishers have started to offer scholarly content in this format. Although &lt;em&gt;EPUB&lt;/em&gt; 3 is still in its early stages, various organisations representing the publishing industry have explicitly expressed their support of &lt;em&gt;EPUB&lt;/em&gt; 3. A number of software applications now exist that are able to read the format, and work on a high-performance open source &lt;em&gt;EPUB&lt;/em&gt; 3 Software Development Kit is backed by major players in the digital publishing industry (including e-reader manufacturers such as Kobo and Sony). &lt;em&gt;EPUB&lt;/em&gt; support by characterisation tools has improved as well, mostly thanks to a number of recent enhancements of &lt;em&gt;epubcheck&lt;/em&gt;. So, overall, &lt;em&gt;EPUB&lt;/em&gt;’s credentials as a preservation format appear to have improved quite a bit over the last year. In the case of &lt;em&gt;EPUB&lt;/em&gt; 3 it’s still too early to say anything about actual adoption, but the conditions for adoption to happen look pretty favourable. This is something I will get back to in my next update, perhaps in another year from now.&lt;/p&gt;

&lt;h2 id=&quot;useful-links&quot;&gt;Useful links&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;http://idpf.org/&quot;&gt;International Digital Publishing Forum&lt;/a&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;http://code.google.com/p/epub-samples/&quot;&gt;EPUB 3 Sample Documents&lt;/a&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;http://azardi.infogridpacific.com/resources.html&quot;&gt;Fixed Layout (FLO) Demonstration ePub3 Documents by Azardi&lt;/a&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;http://code.google.com/p/epubcheck/&quot;&gt;epubcheck&lt;/a&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;http://readium.org/&quot;&gt;Readium Foundation&lt;/a&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://zenodo.org/record/839711&quot;&gt;EPUB for archival preservation (2012 report)&lt;/a&gt;&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;http://www.europeanbooksellers.eu/positionpaper/interoperability-e-books-formats&quot;&gt;On the Interoperability of eBook Formats - report by the European Booksellers Federation&lt;/a&gt;&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;hr /&gt;

&lt;p&gt;Originally published at the &lt;a href=&quot;https://openpreservation.org/blog/2013/05/23/epub-archival-preservation-update/&quot;&gt;Open Preservation Foundation blog&lt;/a&gt;&lt;/p&gt;
</description>
                <link>https://bitsgalore.org/2013/05/23/epub-archival-preservation-update</link>
                <guid>https://bitsgalore.org/2013/05/23/epub-archival-preservation-update</guid>
                <pubDate>2013-05-23T00:00:00+02:00</pubDate>
        </item>

        <item>
                <title>Adventures in Debian packaging</title>
                <description>&lt;p&gt;About a year ago, work started on &lt;a href=&quot;http://www.openplanetsfoundation.org/blogs/2012-02-20-summary-outputs-and-roadmap-feb-2012&quot;&gt;packaging SCAPE tools&lt;/a&gt;. &lt;em&gt;Jpylyzer&lt;/em&gt; was the first SCAPE tool that was &lt;a href=&quot;http://www.openplanetsfoundation.org/blogs/2012-02-15-sustainability-and-adoption-preservation-tools&quot;&gt;turned into a Debian package&lt;/a&gt;. Some time later, the OPF set up a couple of machine images at Amazon Web Services, which can be used to &lt;a href=&quot;http://www.openplanetsfoundation.org/blogs/2012-03-08-turning-github-code-debian-packages-opf-way&quot;&gt;create packages repeatedly using a virtual machine&lt;/a&gt;. Even though I’ve used the Amazon service a couple of times myself, I really know next to nothing about Debian packages, and it’s safe to say that the underlying build process has been more or less a complete mystery to me.&lt;/p&gt;

&lt;p&gt;To get a better understanding of the process for building Debian packages, I had a try at packaging &lt;em&gt;jpylyzer&lt;/em&gt; on my local machine (which runs on &lt;a href=&quot;http://blog.linuxmint.com/?p=2216&quot;&gt;Linux Mint 14&lt;/a&gt;). Some time ago Dave Tarrant and Rui Castro wrote &lt;a href=&quot;http://wiki.opf-labs.org/display/SP/Building+Your+Debian+Package&quot;&gt;a nice step-by-step guide on building Debian packages&lt;/a&gt; on the OPF Wiki, so I tried to follow the instructions there. While working on this, I made some notes, mainly to remind myself of what I was doing. Then I realised that some of this might be useful to others as well, so I decided to turn it into a blog post.&lt;/p&gt;

&lt;!-- more --&gt;

&lt;h2 id=&quot;objectives&quot;&gt;Objectives&lt;/h2&gt;

&lt;p&gt;The objectives of this exercise were:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;to get more familiar with the packaging process myself;&lt;/li&gt;
  &lt;li&gt;to provide some input on how useful the &lt;a href=&quot;http://wiki.opf-labs.org/display/SP/Building+Your+Debian+Package&quot;&gt;guide&lt;/a&gt; on the OPF Wiki is from the perspective of someone who is largely ignorant of the packaging procedure;&lt;/li&gt;
  &lt;li&gt;to identify any problems in &lt;em&gt;jpylyzer&lt;/em&gt;’s packaging procedure.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I did two experiments: first, I did a &lt;em&gt;very&lt;/em&gt; limited test where I tried to create a template directory structure using &lt;em&gt;debhelper&lt;/em&gt;, which would be the first step when starting from scratch. Since for &lt;em&gt;jpylyzer&lt;/em&gt; all the files in the &lt;em&gt;debian&lt;/em&gt; directory already exist, I then moved on to building &lt;em&gt;jpylyzer&lt;/em&gt; using the existing files.&lt;/p&gt;

&lt;h2 id=&quot;test-1-creating-the-directory-structure-from-scratch&quot;&gt;Test 1: creating the directory structure from scratch&lt;/h2&gt;

&lt;p&gt;For this, I first installed all the required packages listed in the &lt;em&gt;Pre-Requisites&lt;/em&gt; section of the &lt;a href=&quot;http://wiki.opf-labs.org/display/SP/Building+Your+Debian+Package&quot;&gt;guide&lt;/a&gt; using:&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nb&quot;&gt;sudo &lt;/span&gt;apt-get &lt;span class=&quot;nb&quot;&gt;install &lt;/span&gt;build-essential dh-make devscripts debhelper lintian
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Subsequently I followed the instructions in the &lt;em&gt;Getting Started&lt;/em&gt; section. For this I simply created an empty directory:&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nb&quot;&gt;mkdir &lt;/span&gt;debtest_1.0.0
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;And then:&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nb&quot;&gt;cd &lt;/span&gt;debtest_1.0.0
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Then I ran &lt;em&gt;dh_make&lt;/em&gt;:&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;dh_make
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This resulted in an error message, telling me that the package name and its version number should be separated by a dash (‘-’) instead of an underscore (‘_’), or, alternatively, that the -p flag should be used. So I changed the directory name:&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nb&quot;&gt;mv &lt;/span&gt;debtest_1.0.0 debtest-1.0.0
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Re-running &lt;em&gt;dh_make&lt;/em&gt;, it now accepted the directory name, but it complained about a missing tarball (which I purposefully didn’t make in this test). However, as &lt;em&gt;dh_make&lt;/em&gt; offered the suggestion to use the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;--createorig&lt;/code&gt; option (which creates a tarball) I tried this:&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;dh_make &lt;span class=&quot;nt&quot;&gt;--createorig&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This resulted in the creation of a &lt;em&gt;debian&lt;/em&gt; directory with file templates, and an (empty) tarball &lt;em&gt;debtest_1.0.0.orig.tar.gz&lt;/em&gt; which was created in the parent (&lt;em&gt;debtest&lt;/em&gt;) directory.&lt;/p&gt;

&lt;p&gt;So, apart from the dash/underscore mix-up this is all pretty straightforward.&lt;/p&gt;
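
&lt;p&gt;The naming rule that &lt;em&gt;dh_make&lt;/em&gt; tripped over can be checked up front. The following Python snippet is only a rough sketch: the function name &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;looks_like_dh_make_dir&lt;/code&gt; is made up for this illustration, and the pattern merely approximates the &lt;em&gt;name-version&lt;/em&gt; convention using the Debian Policy rules for package names. It is not the check that &lt;em&gt;dh_make&lt;/em&gt; itself performs.&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;import re

# Approximation (assumption): lower-case package name, a dash, then a
# version that starts with a digit. Underscores are not allowed in the name.
DIR_PATTERN = re.compile(r&quot;[a-z0-9][a-z0-9+.-]+-[0-9][A-Za-z0-9.+~]*&quot;)

def looks_like_dh_make_dir(dirname):
    # True if dirname has the form packagename-version
    return DIR_PATTERN.fullmatch(dirname) is not None
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;With this pattern &lt;em&gt;debtest-1.0.0&lt;/em&gt; is accepted, whereas &lt;em&gt;debtest_1.0.0&lt;/em&gt; is not.&lt;/p&gt;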

&lt;h2 id=&quot;test-2-building-jpylyzer&quot;&gt;Test 2: building &lt;em&gt;jpylyzer&lt;/em&gt;&lt;/h2&gt;

&lt;p&gt;In this second test I tried to build &lt;em&gt;jpylyzer&lt;/em&gt; using the &lt;a href=&quot;https://github.com/openplanets/jpylyzer/tree/master/debian&quot;&gt;already existing files in the &lt;em&gt;debian&lt;/em&gt; folder of  &lt;em&gt;jpylyzer&lt;/em&gt;’s Git repository&lt;/a&gt;. First I cloned the repository to my local machine:&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;git clone https://github.com/openplanets/jpylyzer.git
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Then I went into the &lt;em&gt;jpylyzer&lt;/em&gt; directory:&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nb&quot;&gt;cd &lt;/span&gt;jpylyzer
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;From there I tried to build &lt;em&gt;jpylyzer&lt;/em&gt; directly, using the command given in the &lt;a href=&quot;http://wiki.opf-labs.org/display/SP/Building+Your+Debian+Package&quot;&gt;guide’s&lt;/a&gt; &lt;em&gt;Building your package&lt;/em&gt; section:&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;dpkg-buildpackage &lt;span class=&quot;nt&quot;&gt;-tc&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h2 id=&quot;missing-changelog&quot;&gt;Missing changelog&lt;/h2&gt;

&lt;p&gt;The above command resulted in an error message about a missing &lt;em&gt;changelog&lt;/em&gt; file in the &lt;em&gt;debian&lt;/em&gt; folder. The &lt;em&gt;changelog&lt;/em&gt; section in the &lt;a href=&quot;http://wiki.opf-labs.org/display/SP/Building+Your+Debian+Package&quot;&gt;OPF guide&lt;/a&gt; does mention an OPF-hosted GitHub 2 Changelog service, which is supposed to be callable from the rules file. But I don’t see any reference to it in jpylyzer’s &lt;a href=&quot;https://github.com/openplanets/jpylyzer/blob/master/debian/rules&quot;&gt;&lt;em&gt;rules&lt;/em&gt;&lt;/a&gt; file, so I don’t really know how this is supposed to work! To keep going I simply grabbed the default changelog that was created by &lt;em&gt;debhelper&lt;/em&gt; in an earlier experiment. After this I ran the command again.&lt;/p&gt;

&lt;h2 id=&quot;unknown-commands-in-makefile&quot;&gt;Unknown commands in makefile&lt;/h2&gt;

&lt;p&gt;This time, &lt;em&gt;dpkg-buildpackage&lt;/em&gt; exited with the following errors:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-data&quot;&gt;pymakespec --onefile jpylyzer.py
make[1]: pymakespec: Command not found
make[1]: *** [build] Error 127
make[1]: Leaving directory `/home/johan/debtest/jpylyzer&apos;
make: *** [build] Error 2
dpkg-buildpackage: error: debian/rules build gave error exit status
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;These errors arise from the following lines in &lt;em&gt;jpylyzer&lt;/em&gt;’s &lt;a href=&quot;https://github.com/openplanets/jpylyzer/blob/master/Makefile&quot;&gt;makefile&lt;/a&gt;:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-data&quot;&gt;build:
    pymakespec --onefile jpylyzer.py
    pyinstaller jpylyzer.spec
    @echo &quot;Built in dist/jpylyzer&quot;
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The &lt;em&gt;pymakespec&lt;/em&gt; and &lt;em&gt;pyinstaller&lt;/em&gt; commands above are most likely shell scripts that launch the &lt;em&gt;Makespec.py&lt;/em&gt; and &lt;em&gt;pyinstaller.py&lt;/em&gt; scripts that are both part of &lt;a href=&quot;http://www.pyinstaller.org/&quot;&gt;&lt;em&gt;PyInstaller&lt;/em&gt;&lt;/a&gt; (these are used for building an executable from the source code). However, neither the shell scripts nor any references to them are included in &lt;em&gt;jpylyzer&lt;/em&gt;’s repository (my best guess is that they exist only on a specific machine instance - perhaps the Amazon virtual machines?), so the makefile simply won’t work.&lt;/p&gt;

&lt;p&gt;I was able to fix this by changing the references to the shell scripts to this (using &lt;em&gt;PyInstaller 1.5&lt;/em&gt;):&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;python /home/johan/pyinstall1.5/Makespec.py &lt;span class=&quot;nt&quot;&gt;--onefile&lt;/span&gt; jpylyzer.py
python /home/johan/pyinstall1.5/pyinstaller.py jpylyzer.spec
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;For &lt;em&gt;PyInstaller 2&lt;/em&gt; these two lines should be substituted by:&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;python /home/johan/pyinstall/pyinstaller.py &lt;span class=&quot;nt&quot;&gt;--onefile&lt;/span&gt; jpylyzer.py
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Note here that &lt;em&gt;PyInstaller&lt;/em&gt; has no default installation location, and the file paths will vary from machine to machine!&lt;/p&gt;
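
&lt;p&gt;One way to keep machine-specific paths out of the makefile would be to pass the &lt;em&gt;PyInstaller&lt;/em&gt; location in from outside. The Python sketch below only assembles the command lines shown above for either version. The function name &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pyinstaller_commands&lt;/code&gt; and the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;PYINSTALLER_HOME&lt;/code&gt; environment variable are hypothetical; nothing like this exists in &lt;em&gt;jpylyzer&lt;/em&gt;’s repository.&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;import os

def pyinstaller_commands(install_dir, major_version, script=&quot;jpylyzer.py&quot;):
    # Build the argument lists for the commands shown above
    pyinstaller = os.path.join(install_dir, &quot;pyinstaller.py&quot;)
    if major_version == 1:
        # PyInstaller 1.5: create a spec file first, then build from it
        makespec = os.path.join(install_dir, &quot;Makespec.py&quot;)
        spec = os.path.splitext(script)[0] + &quot;.spec&quot;
        return [
            [&quot;python&quot;, makespec, &quot;--onefile&quot;, script],
            [&quot;python&quot;, pyinstaller, spec],
        ]
    # PyInstaller 2: a single call covers both steps
    return [[&quot;python&quot;, pyinstaller, &quot;--onefile&quot;, script]]

# Hypothetical usage: read the location from an environment variable
# install_dir = os.environ[&quot;PYINSTALLER_HOME&quot;]
# for command in pyinstaller_commands(install_dir, 2):
#     subprocess.check_call(command)   # requires: import subprocess
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;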

&lt;p&gt;After making these changes I was able to run &lt;em&gt;dpkg-buildpackage&lt;/em&gt; without any problems:&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;dpkg-buildpackage &lt;span class=&quot;nt&quot;&gt;-tc&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Result: the following files were created in the repo’s parent directory:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;jpylyzer_1.9.0_amd64.changes&lt;/li&gt;
  &lt;li&gt;jpylyzer_1.9.0_amd64.deb&lt;/li&gt;
  &lt;li&gt;jpylyzer_1.9.0.dsc&lt;/li&gt;
  &lt;li&gt;jpylyzer_1.9.0.tar.gz&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;tarball-schmarball&quot;&gt;Tarball schmarball&lt;/h2&gt;

&lt;p&gt;One thing that confused me at first: the &lt;em&gt;Getting Started&lt;/em&gt; section in the &lt;a href=&quot;http://wiki.opf-labs.org/display/SP/Building+Your+Debian+Package&quot;&gt;OPF guide&lt;/a&gt; mentions the need for building a &lt;em&gt;native&lt;/em&gt; package before starting the Debian packaging:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;If you have got here and you don’t have any already packaged code (a tar ball with makefile etc) then you will need to build a native package.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;So, I initially thought I would need to create a tarball of my repo first. As it turns out this is not the case: the tarball is created automatically once you run &lt;em&gt;dpkg-buildpackage&lt;/em&gt;. So this is one thing less to worry about!&lt;/p&gt;

&lt;h2 id=&quot;verifying-the-package-with-lintian&quot;&gt;Verifying the package with &lt;em&gt;lintian&lt;/em&gt;&lt;/h2&gt;

&lt;p&gt;As a final step I used &lt;em&gt;lintian&lt;/em&gt; to verify my package:&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;lintian jpylyzer_1.9.0_amd64.deb
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This resulted in the following output (using &lt;em&gt;PyInstaller&lt;/em&gt; 1.5):&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-data&quot;&gt;E: jpylyzer: unstripped-binary-or-object usr/bin/jpylyzer
W: jpylyzer: hardening-no-fortify-functions usr/bin/jpylyzer
W: jpylyzer: wrong-bug-number-in-closes l3:#nnnn
E: jpylyzer: debian-changelog-file-contains-invalid-email-address johan@unknown
E: jpylyzer: helper-templates-in-copyright
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;With &lt;em&gt;PyInstaller 2&lt;/em&gt; I got this additional warning:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-data&quot;&gt;W: jpylyzer: hardening-no-relro usr/bin/jpylyzer
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;I still need to give these errors and warnings an in-depth look. At least one error is related to the bogus &lt;em&gt;changelog&lt;/em&gt; file I used. Some others (e.g. &lt;em&gt;unstripped-binary-or-object&lt;/em&gt;) appear to be related to the build process of the binaries.&lt;/p&gt;
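
&lt;p&gt;To keep track of such findings across builds, the &lt;em&gt;lintian&lt;/em&gt; output can be grouped by severity. This is a minimal Python sketch that assumes the “severity: package: tag” line layout shown above; the function name &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;summarize_lintian&lt;/code&gt; is made up for this illustration.&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;from collections import defaultdict

def summarize_lintian(output):
    # Group lintian tags by severity letter (E = error, W = warning)
    tags = defaultdict(list)
    for line in output.splitlines():
        # Expected layout: severity, package, then the tag plus optional details
        parts = line.split(&quot;: &quot;, 2)
        if len(parts) != 3 or len(parts[0]) != 1:
            continue
        words = parts[2].split()
        if words:
            tags[parts[0]].append(words[0])
    return dict(tags)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Applied to the &lt;em&gt;PyInstaller&lt;/em&gt; 1.5 output above, this yields three tags under &lt;em&gt;E&lt;/em&gt; and two under &lt;em&gt;W&lt;/em&gt;.&lt;/p&gt;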

&lt;h2 id=&quot;conclusions&quot;&gt;Conclusions&lt;/h2&gt;

&lt;p&gt;Using the &lt;a href=&quot;http://wiki.opf-labs.org/display/SP/Building+Your+Debian+Package&quot;&gt;Building Your Debian Package&lt;/a&gt; guide on the OPF Wiki I was able to create a rudimentary skeleton structure for Debian packaging. I was also able to build a Debian package for &lt;em&gt;jpylyzer&lt;/em&gt;. The exercise revealed some problems with the Debian setup for &lt;em&gt;jpylyzer&lt;/em&gt;. The most important ones are:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;It’s unclear how &lt;em&gt;jpylyzer&lt;/em&gt;’s &lt;em&gt;changelog&lt;/em&gt; file is supposed to be generated. Perhaps there’s a dependency on some external service (the OPF Github 2 Changelog service?), but I cannot find any documentation on how to make this work!&lt;/li&gt;
  &lt;li&gt;The makefile calls &lt;em&gt;PyInstaller&lt;/em&gt; in a non-standard and undocumented way. This is easy to fix locally if you are familiar with &lt;em&gt;PyInstaller&lt;/em&gt;, but not so otherwise. Also, the interfaces of versions 1.5 and 2 of &lt;em&gt;PyInstaller&lt;/em&gt; are different, and depending on which version you are running this may require additional changes to the makefile.&lt;/li&gt;
  &lt;li&gt;Even though I was able to build a Debian package for &lt;em&gt;jpylyzer&lt;/em&gt;, it still ended up with some &lt;em&gt;lintian&lt;/em&gt; errors.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I also came across a few minor errors in the OPF guide. I left a short comment on this &lt;a href=&quot;http://wiki.opf-labs.org/display/SP/Building+Your+Debian+Package&quot;&gt;here&lt;/a&gt; (scroll to bottom). Overall, I found the guide really helpful, and it provides an accessible and relatively painless introduction to the packaging process.&lt;/p&gt;

&lt;h2 id=&quot;reference&quot;&gt;Reference&lt;/h2&gt;

&lt;p&gt;&lt;a href=&quot;http://wiki.opf-labs.org/display/SP/Building+Your+Debian+Package&quot;&gt;Building Your Debian Package (OPF Wiki)&lt;/a&gt;&lt;/p&gt;

&lt;h2 id=&quot;post-scriptum&quot;&gt;Post scriptum&lt;/h2&gt;

&lt;p&gt;Proof again that it’s always a bad idea to come up with a clever title for a blog post without &lt;a href=&quot;https://www.google.com/#hl=en&amp;amp;gs_rn=11&amp;amp;gs_ri=psy-ab&amp;amp;cp=2&amp;amp;gs_id=zz&amp;amp;xhr=t&amp;amp;q=%22adventures+in+Debian+Packaging%22&amp;amp;es_nrs=true&amp;amp;pf=p&amp;amp;sclient=psy-ab&amp;amp;oq=%22adventures+in+Debian+Packaging%22&amp;amp;gs_l=&amp;amp;pbx=1&amp;amp;bav=on.2,or.r_qf.&amp;amp;bvm=bv.45580626,d.d2k&amp;amp;fp=6b211d71a5d752ed&amp;amp;biw=1140&amp;amp;bih=553&quot;&gt;Googling&lt;/a&gt; it first: after writing this post I found out that the &lt;a href=&quot;http://mhvlug.org/&quot;&gt;Mid Hudson Valley Linux and Open Source Users Group&lt;/a&gt; will be organising a meeting called &lt;a href=&quot;http://mhvlug.org/meetings/2013/adventures-in-debian-packaging&quot;&gt;&lt;em&gt;Adventures in Debian Packaging&lt;/em&gt;&lt;/a&gt; later this year in Poughkeepsie, NY. Completely unrelated to this blog, of course, but it’s only fair to give it a mention. Well, there you go.&lt;/p&gt;

&lt;hr /&gt;

&lt;p&gt;Originally published at the &lt;a href=&quot;https://openpreservation.org/blog/2013/04/23/adventures-debian-packaging/&quot;&gt;Open Preservation Foundation blog&lt;/a&gt;&lt;/p&gt;
</description>
                <link>https://bitsgalore.org/2013/04/23/adventures-debian-packaging</link>
                <guid>https://bitsgalore.org/2013/04/23/adventures-debian-packaging</guid>
                <pubDate>2013-04-23T00:00:00+02:00</pubDate>
        </item>

        <item>
                <title>What do we mean by "embedded" files in PDF?</title>
                <description>&lt;p&gt;The most important new feature of the recently released &lt;a href=&quot;http://www.iso.org/iso/catalogue_detail.htm?csnumber=57229&quot;&gt;PDF/A-3&lt;/a&gt; standard is that, unlike &lt;em&gt;PDF/A-2&lt;/em&gt; and &lt;em&gt;PDF/A-1&lt;/em&gt;, it allows you to embed &lt;em&gt;any&lt;/em&gt; file you like. Whether this is a good thing or not is the subject of some &lt;a href=&quot;http://blogs.loc.gov/digitalpreservation/2012/11/all-in-embedded-files-in-pdfa/&quot;&gt;heated on-line discussions&lt;/a&gt;. But what do we actually mean by &lt;em&gt;embedded files&lt;/em&gt;? As it turns out, the answer to this question isn’t as straightforward as you might think. One of the reasons for this is that in colloquial use we often talk about “embedded files” to describe the inclusion of &lt;em&gt;any&lt;/em&gt; “non-text” element in a &lt;em&gt;PDF&lt;/em&gt; (e.g. an image, a video or a file attachment). On the other hand, the word “embedded files” in the &lt;em&gt;PDF&lt;/em&gt; standards (including &lt;em&gt;PDF/A&lt;/em&gt;) refers to something much more specific, which is closely tied to &lt;em&gt;PDF&lt;/em&gt;’s internal structure.&lt;/p&gt;

&lt;!-- more --&gt;

&lt;h2 id=&quot;embedded-files-and-embedded-file-streams&quot;&gt;Embedded files and embedded file streams&lt;/h2&gt;

&lt;p&gt;When the &lt;em&gt;PDF&lt;/em&gt; standard mentions “embedded files”, what it really refers to is a specific data structure. &lt;em&gt;PDF&lt;/em&gt; has a &lt;em&gt;File Specification Dictionary&lt;/em&gt; object, which in its simplest form is a table that contains a reference to some external file. &lt;em&gt;PDF 1.3&lt;/em&gt; extended this, making it possible to embed the contents of referenced files directly within the body of the &lt;em&gt;PDF&lt;/em&gt; using &lt;em&gt;Embedded File Streams&lt;/em&gt;. They are described in detail in Section 7.11.4 of  the &lt;a href=&quot;http://www.adobe.com/content/dam/Adobe/en/devnet/acrobat/pdfs/PDF32000_2008.pdf&quot;&gt;&lt;em&gt;PDF&lt;/em&gt; Specification (ISO 32000)&lt;/a&gt;. A &lt;em&gt;File Specification Dictionary&lt;/em&gt; that refers to an embedded file can be identified by the presence of an &lt;em&gt;EF&lt;/em&gt; entry.&lt;/p&gt;

&lt;p&gt;Here’s an example (source: &lt;a href=&quot;http://www.adobe.com/content/dam/Adobe/en/devnet/acrobat/pdfs/PDF32000_2008.pdf&quot;&gt;ISO 32000&lt;/a&gt;). First, here’s a file specification dictionary:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-data&quot;&gt;31 0 obj   
&amp;lt;&amp;lt;/Type /Filespec /F (mysvg.svg) /EF &amp;lt;&amp;lt;/F 32 0 R&amp;gt;&amp;gt; &amp;gt;&amp;gt;    
endobj
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Note the  &lt;em&gt;EF&lt;/em&gt; entry, which references another &lt;em&gt;PDF&lt;/em&gt; object. This is the actual embedded file stream. Here it is:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-data&quot;&gt;32 0 obj   
&amp;lt;&amp;lt;/Type /EmbeddedFile /Subtype /image#2Fsvg+xml /Length 72&amp;gt;&amp;gt;   
stream  
…SVG Data…  
endstream  
endobj
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Note that the part between the &lt;em&gt;stream&lt;/em&gt; and &lt;em&gt;endstream&lt;/em&gt; keywords holds the actual file data, here an &lt;em&gt;SVG&lt;/em&gt; image, but this could really be anything!&lt;/p&gt;

&lt;p&gt;So, in short, when the &lt;em&gt;PDF&lt;/em&gt; standard mentions “embedded files”, this really means &lt;em&gt;Embedded File Streams&lt;/em&gt;.&lt;/p&gt;

&lt;h2 id=&quot;so-what-about-embedded-images&quot;&gt;So what about “embedded” images?&lt;/h2&gt;

&lt;p&gt;Here’s the first source of confusion: if a &lt;em&gt;PDF&lt;/em&gt; contains images, we often colloquially call these “embedded”. However, internally they are not represented as &lt;em&gt;Embedded File Streams&lt;/em&gt;, but as so-called &lt;em&gt;Image XObjects&lt;/em&gt;. (In fact the &lt;em&gt;PDF&lt;/em&gt; standard also includes yet another structure called &lt;em&gt;inline images&lt;/em&gt;, but let’s forget about those just to avoid making things even more complicated.)&lt;/p&gt;

&lt;p&gt;Here’s an example of an &lt;em&gt;Image XObject&lt;/em&gt; (again taken from  &lt;a href=&quot;http://www.adobe.com/content/dam/Adobe/en/devnet/acrobat/pdfs/PDF32000_2008.pdf&quot;&gt;ISO 32000&lt;/a&gt;):&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-data&quot;&gt;10 0 obj            % Image XObject
&amp;lt;&amp;lt; 
    /Type /XObject
    /Subtype /Image
    /Width 100
    /Height 200
    /ColorSpace /DeviceGray
    /BitsPerComponent 8
    /Length 2167
    /Filter /DCTDecode
&amp;gt;&amp;gt;
stream
…Image data…
endstream  
endobj
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Similar to embedded file streams, the part between the &lt;em&gt;stream&lt;/em&gt; and &lt;em&gt;endstream&lt;/em&gt; keywords holds the actual image data. The difference is that only a limited set of pre-defined formats is allowed. These are defined by the &lt;em&gt;Filter&lt;/em&gt; entry (see Section 7.4 in &lt;a href=&quot;http://www.adobe.com/content/dam/Adobe/en/devnet/acrobat/pdfs/PDF32000_2008.pdf&quot;&gt;ISO 32000&lt;/a&gt;). In the example above, the value of &lt;em&gt;Filter&lt;/em&gt; is &lt;em&gt;DCTDecode&lt;/em&gt;, which means we are dealing with &lt;em&gt;JPEG&lt;/em&gt; encoded image data.&lt;/p&gt;

&lt;h2 id=&quot;embedded-file-streams-and-file-attachments&quot;&gt;Embedded file streams and file attachments&lt;/h2&gt;

&lt;p&gt;Going back to embedded file streams, you may now start wondering what they are used for. According to Section 7.11.4.1 of &lt;a href=&quot;http://www.adobe.com/content/dam/Adobe/en/devnet/acrobat/pdfs/PDF32000_2008.pdf&quot;&gt;ISO 32000&lt;/a&gt;, they are primarily intended as a mechanism to ensure that external references in a &lt;em&gt;PDF&lt;/em&gt; (i.e. references to other files) remain valid. It also states:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;The embedded files are included purely for convenience and need not be directly processed by any conforming reader.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This suggests that the usage of embedded file streams is simply restricted to file attachments (through a &lt;em&gt;File Attachment Annotation&lt;/em&gt; or an &lt;em&gt;EmbeddedFiles&lt;/em&gt; entry in the document’s name dictionary).&lt;/p&gt;

&lt;p&gt;Here’s a sample file (created in &lt;em&gt;Adobe Acrobat&lt;/em&gt; 9) that illustrates this:&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://www.opf-labs.org/format-corpus/pdfCabinetOfHorrors/fileAttachment.pdf&quot;&gt;http://www.opf-labs.org/format-corpus/pdfCabinetOfHorrors/fileAttachment.pdf&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Looking at the underlying code we can see the &lt;em&gt;File Specification Dictionary&lt;/em&gt;:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-data&quot;&gt;37 0 obj    
&amp;lt;&amp;lt;
    /Desc()
    /EF&amp;lt;&amp;lt;/F 38 0 R&amp;gt;&amp;gt;
    /F(KSBASE.WQ2)
    /Type/Filespec/UF(KSBASE.WQ2)&amp;gt;&amp;gt;
endobj
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Note the &lt;em&gt;/EF&lt;/em&gt; entry, which means the referenced file  is embedded (the actual file data are in a separate stream object).&lt;/p&gt;

&lt;p&gt;Further digging also reveals an &lt;em&gt;EmbeddedFiles&lt;/em&gt; entry:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-data&quot;&gt;33 0 obj   
&amp;lt;&amp;lt;
    /EmbeddedFiles 34 0 R
    /JavaScript 35 0 R
&amp;gt;&amp;gt;   
endobj
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;However, careful inspection of &lt;a href=&quot;http://www.adobe.com/content/dam/Adobe/en/devnet/acrobat/pdfs/PDF32000_2008.pdf&quot;&gt;ISO 32000&lt;/a&gt; reveals that embedded file streams can also be used for  multimedia! We’ll have a look at that in the next section…&lt;/p&gt;

&lt;h2 id=&quot;embedded-file-streams-and-multimedia&quot;&gt;Embedded file streams and multimedia&lt;/h2&gt;

&lt;p&gt;Section 13.2.1 (Multimedia) of the
&lt;a href=&quot;http://www.adobe.com/content/dam/Adobe/en/devnet/acrobat/pdfs/PDF32000_2008.pdf&quot;&gt;PDF Specification (ISO 32000)&lt;/a&gt; describes how multimedia content is represented in &lt;em&gt;PDF&lt;/em&gt; (emphases added by me):&lt;/p&gt;

&lt;blockquote&gt;
  &lt;ul&gt;
    &lt;li&gt;
      &lt;p&gt;&lt;strong&gt;Rendition actions&lt;/strong&gt; (…) shall be used to begin the playing of multimedia content.&lt;/p&gt;
    &lt;/li&gt;
    &lt;li&gt;A rendition action associates a &lt;strong&gt;screen annotation&lt;/strong&gt; (…) with a &lt;strong&gt;rendition&lt;/strong&gt; (…)&lt;/li&gt;
    &lt;li&gt;&lt;strong&gt;Renditions&lt;/strong&gt; are of two varieties: &lt;strong&gt;media renditions&lt;/strong&gt; (…) that define the characteristics of the media to be played, and selector renditions (…) that enables choosing which of a set of media renditions should be played.&lt;/li&gt;
    &lt;li&gt;&lt;strong&gt;Media renditions&lt;/strong&gt; contain entries that specify what should be played (…), how it should be played (…), and where it should be played (…)&lt;/li&gt;
  &lt;/ul&gt;
&lt;/blockquote&gt;

&lt;p&gt;The actual data for a media object are defined by &lt;em&gt;Media Clip Objects&lt;/em&gt;, and more specifically by the &lt;em&gt;media clip data dictionary&lt;/em&gt;. Its description (Section 13.2.4.2) contains a note, saying that this dictionary “may reference a URL to a streaming video presentation or a movie &lt;em&gt;embedded in the PDF file&lt;/em&gt;”. The description of the media clip data dictionary (Table 274) also states that the actual media data are “either a full file specification or a form XObject”.&lt;/p&gt;

&lt;p&gt;In plain English, this means that multimedia content in &lt;em&gt;PDF&lt;/em&gt; (e.g. movies that are meant to be rendered by the viewer) may be represented internally as an embedded file stream.&lt;/p&gt;

&lt;p&gt;The following sample file illustrates this:&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://www.opf-labs.org/format-corpus/pdfCabinetOfHorrors/embedded_video_quicktime.pdf&quot;&gt;http://www.opf-labs.org/format-corpus/pdfCabinetOfHorrors/embedded_video_quicktime.pdf&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This &lt;em&gt;PDF&lt;/em&gt; 1.7 file was created in &lt;em&gt;Acrobat&lt;/em&gt; 9, and if you open it you will see a short &lt;em&gt;Quicktime&lt;/em&gt; movie that plays upon clicking on it.&lt;/p&gt;

&lt;p&gt;Digging through the underlying &lt;em&gt;PDF&lt;/em&gt; code reveals a &lt;em&gt;Screen Annotation&lt;/em&gt;, a &lt;em&gt;Rendition Action&lt;/em&gt; and a &lt;em&gt;Media clip data dictionary&lt;/em&gt;. The latter looks like this:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-data&quot;&gt;41 0 obj
&amp;lt;&amp;lt;
    /CT(video/quicktime)
    /D 42 0 R
    /N(Media clip from animation.mov)
    /P&amp;lt;&amp;lt;/TF(TEMPACCESS)&amp;gt;&amp;gt;
    /S/MCD
&amp;gt;&amp;gt;
endobj
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;It contains a reference to another object (42 0), which turns out to be a &lt;em&gt;File Specification Dictionary&lt;/em&gt;:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-data&quot;&gt;42 0 obj
&amp;lt;&amp;lt;
    /EF&amp;lt;&amp;lt;/F 43 0 R&amp;gt;&amp;gt;
    /F(&amp;lt;embedded file&amp;gt;)
    /Type/Filespec
    /UF(&amp;lt;embedded file&amp;gt;)
&amp;gt;&amp;gt;
endobj
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;What’s particularly interesting here is the &lt;em&gt;/EF&lt;/em&gt; entry, which means we’re dealing with an embedded file stream here. (The actual movie data are in a stream object (43 0) that is referenced by the file specification dictionary.)&lt;/p&gt;

&lt;p&gt;So, the analysis of this sample file confirms that embedded file streams are actually used by &lt;em&gt;Adobe Acrobat&lt;/em&gt; for multimedia content.&lt;/p&gt;

&lt;h2 id=&quot;what-does-pdfa-say-on-embedded-file-streams&quot;&gt;What does &lt;em&gt;PDF/A&lt;/em&gt; say on embedded file streams?&lt;/h2&gt;

&lt;p&gt;In &lt;a href=&quot;http://www.iso.org/iso/iso_catalogue/catalogue_tc/catalogue_detail.htm?csnumber=38920&quot;&gt;&lt;strong&gt;PDF/A-1&lt;/strong&gt;&lt;/a&gt;, embedded file streams are not allowed at all:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;A file specification dictionary (…) shall not contain the &lt;strong&gt;EF&lt;/strong&gt; key. A file’s name dictionary shall not contain the &lt;strong&gt;EmbeddedFiles&lt;/strong&gt; key&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In &lt;a href=&quot;http://www.iso.org/iso/home/store/catalogue_tc/catalogue_detail.htm?csnumber=50655&quot;&gt;&lt;strong&gt;PDF/A-2&lt;/strong&gt;&lt;/a&gt;, embedded file streams &lt;em&gt;are&lt;/em&gt; allowed, but only if the embedded file itself is &lt;em&gt;PDF/A&lt;/em&gt; (1 or 2) as well:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;A file specification dictionary, as defined in ISO 32000-1:2008, 7.11.3, may contain the &lt;strong&gt;EF&lt;/strong&gt; key, provided that the embedded file is compliant with either ISO 19005-1 or this part of ISO 19005.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Finally, in &lt;a href=&quot;http://www.iso.org/iso/home/store/catalogue_tc/catalogue_detail.htm?csnumber=57229&quot;&gt;&lt;strong&gt;PDF/A-3&lt;/strong&gt;&lt;/a&gt; this last limitation was dropped, which means that &lt;em&gt;any&lt;/em&gt; file may be embedded&lt;sup id=&quot;fnref:1&quot;&gt;&lt;a href=&quot;#fn:1&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;1&lt;/a&gt;&lt;/sup&gt;.&lt;/p&gt;

&lt;h2 id=&quot;does-this-mean-pdfa-3-supports-multimedia&quot;&gt;Does this mean &lt;em&gt;PDF/A-3&lt;/em&gt; supports multimedia?&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;No, not at all!&lt;/strong&gt; Even though nothing stops you from embedding multimedia content (e.g. a &lt;em&gt;Quicktime&lt;/em&gt; movie), you wouldn’t be able to use it as a renderable object inside a &lt;em&gt;PDF/A-3&lt;/em&gt; document. The reason is that the &lt;em&gt;annotations&lt;/em&gt; and &lt;em&gt;actions&lt;/em&gt; that are needed for this (e.g. &lt;em&gt;Screen&lt;/em&gt; annotations and &lt;em&gt;Rendition&lt;/em&gt; actions, to name but a few) are not allowed in &lt;em&gt;PDF/A-3&lt;/em&gt;. So effectively you are only able to use embedded file streams as attachments.&lt;/p&gt;

&lt;h2 id=&quot;adobe-adding-to-the-confusion&quot;&gt;Adobe adding to the confusion&lt;/h2&gt;

&lt;p&gt;A few weeks ago the embedding issue came up again in a &lt;a href=&quot;http://fileformats.wordpress.com/2012/12/22/not-pdf/&quot;&gt;blog post by Gary McGath&lt;/a&gt;. One of the comments there is from Adobe’s Leonard Rosenthol (who is also the Project Leader for &lt;em&gt;PDF/A&lt;/em&gt;). After correctly pointing out some mistakes in both the original blog post and in an earlier comment by me, he nevertheless added to the confusion by stating that objects that are rendered by the viewer (movies, etc.) all use &lt;em&gt;Annotations&lt;/em&gt;, and that embedded files (which he apparently uses as a synonym for attachments) are handled in a completely different manner. This doesn’t appear to be completely accurate: at least one class of renderable objects (screen annotations/rendition actions) may be using embedded file streams. Also, embedded files that are used as attachments may be associated with a &lt;em&gt;File Attachment Annotation&lt;/em&gt;, which means that “under the hood” both cases are actually more similar than first meets the eye (which is confirmed by the analysis of the two sample files in the preceding sections). Contributing to this confusion is also the fact that Section 7.11.4 of &lt;a href=&quot;http://www.adobe.com/content/dam/Adobe/en/devnet/acrobat/pdfs/PDF32000_2008.pdf&quot;&gt;ISO 32000&lt;/a&gt; erroneously states that embedded file streams are only used for non-renderable objects like file attachments, which is contradicted by their allowed use for multimedia content.&lt;/p&gt;

&lt;h2 id=&quot;does-any-of-this-matter-really&quot;&gt;Does any of this matter, really?&lt;/h2&gt;

&lt;p&gt;Some might argue that the above discussion is nothing but semantic nitpicking. However, details like these &lt;em&gt;do&lt;/em&gt; matter if we want to do a proper assessment of preservation risks in &lt;em&gt;PDF&lt;/em&gt; documents. As an example, &lt;a href=&quot;/2012/12/19/identification-pdf-preservation-risks-apache-preflight-first-impression&quot;&gt;in this previous blog post&lt;/a&gt; I demonstrated how a &lt;em&gt;PDF/A&lt;/em&gt; validator tool can be used to profile &lt;em&gt;PDF&lt;/em&gt;s for “risky” features. Such tools typically give you a list of features. It is then largely up to the user to further interpret this information.&lt;/p&gt;

&lt;p&gt;Now suppose we have a pre-ingest workflow that is meant to accept &lt;em&gt;PDF&lt;/em&gt;s with multimedia content, while at the same time rejecting file attachments. By only using the presence of an embedded file stream (reported by both &lt;em&gt;Apache&lt;/em&gt;’s and &lt;em&gt;Acrobat&lt;/em&gt;’s &lt;em&gt;Preflight&lt;/em&gt; tools) as a rejection criterion, we could end up unjustly rejecting files with multimedia content as well. To avoid this, we also need to take into account what the embedded file stream is used for, and for this we need to look at what annotation types are used, and the presence of any &lt;em&gt;EmbeddedFiles&lt;/em&gt; entry in the document’s name dictionary. However, if we don’t know precisely &lt;em&gt;which&lt;/em&gt; features we are looking for, we may well arrive at the wrong conclusions!&lt;/p&gt;
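
&lt;p&gt;The decision logic described above could be sketched in Python as follows. This is a deliberately naive illustration and not a validator: it searches the raw bytes for name tokens, so it misses anything stored in compressed object streams and may be fooled by matching bytes inside stream data. All function and key names are made up for this example.&lt;/p&gt;

&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;import re

# Name tokens of interest, matched on the raw bytes of the file
PATTERNS = {
    &quot;embedded_file_stream&quot;: re.compile(rb&quot;/EF\b&quot;),
    &quot;embedded_files_entry&quot;: re.compile(rb&quot;/EmbeddedFiles\b&quot;),
    &quot;file_attachment_annotation&quot;: re.compile(rb&quot;/Subtype\s*/FileAttachment\b&quot;),
    &quot;screen_annotation&quot;: re.compile(rb&quot;/Subtype\s*/Screen\b&quot;),
    &quot;rendition_action&quot;: re.compile(rb&quot;/S\s*/Rendition\b&quot;),
}

def profile_pdf(data):
    # Report which of the features occur at least once
    return {name: bool(p.search(data)) for name, p in PATTERNS.items()}

def classify_usage(profile):
    # Interpret the profile: what are the embedded file streams used for?
    if not profile[&quot;embedded_file_stream&quot;]:
        return &quot;no embedded file streams&quot;
    attachment = (profile[&quot;embedded_files_entry&quot;]
                  or profile[&quot;file_attachment_annotation&quot;])
    multimedia = profile[&quot;screen_annotation&quot;] or profile[&quot;rendition_action&quot;]
    if attachment and multimedia:
        return &quot;attachments and multimedia&quot;
    if attachment:
        return &quot;attachments&quot;
    if multimedia:
        return &quot;multimedia&quot;
    return &quot;embedded file stream with unknown use&quot;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;A workflow that rejects attachments but accepts multimedia would then reject only files for which the outcome mentions attachments, instead of rejecting every file that contains an &lt;em&gt;EF&lt;/em&gt; entry.&lt;/p&gt;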

&lt;p&gt;This is made all the worse by the fact that preservation issues are often formulated in vague and non-specific ways. An example is &lt;a href=&quot;http://wiki.opf-labs.org/display/AQuA/Embedded+objects+in+PDFs&quot;&gt;this issue on the OPF Wiki on the detection of “embedded objects”&lt;/a&gt;. The issue’s description suggests that images and tables are the main concern (both of which aren’t strictly speaking embedded objects). The &lt;a href=&quot;http://wiki.opf-labs.org/display/AQuA/Detect%2C+extract+and+analyse+embedded+objects+in+PDFs&quot;&gt;corresponding solution page&lt;/a&gt; subsequently complicates things further by also throwing file attachments in the mix. In order to solve issues like these, it is helpful to know that images are (mostly) represented as &lt;em&gt;Image XObjects&lt;/em&gt; in &lt;em&gt;PDF&lt;/em&gt;. The solution should then be a method for detecting &lt;em&gt;Image XObjects&lt;/em&gt;. However, without some background knowledge of &lt;em&gt;PDF&lt;/em&gt;’s internal data structure, solving issues like these becomes a daunting, if not impossible task.&lt;/p&gt;

&lt;h2 id=&quot;final-note&quot;&gt;Final note&lt;/h2&gt;

&lt;p&gt;In this blog post I have tried to shed some light on a number of common misconceptions about embedded content in &lt;em&gt;PDF&lt;/em&gt;. I might have inadvertently created some new ones in the process, so feel free to contribute any corrections or additions using the comment fields below.&lt;/p&gt;

&lt;p&gt;The &lt;em&gt;PDF&lt;/em&gt; specification is vast and complex, and I have only addressed a limited number of its features here. For instance, one might argue that a discussion of embedding-related features should also include fonts, metadata, &lt;em&gt;ICC&lt;/em&gt; profiles, and so on. The coverage of multimedia features here is also incomplete, as I didn’t include &lt;em&gt;Movie Annotations&lt;/em&gt; or &lt;em&gt;Sound Annotations&lt;/em&gt; (which preceded the &lt;em&gt;Screen Annotations&lt;/em&gt;, which are now more commonly used). These things were all left out here because of time and space constraints.  This also means that further surprises may well be lurking ahead!&lt;/p&gt;

&lt;hr /&gt;

&lt;p&gt;Originally published at the &lt;a href=&quot;https://openpreservation.org/blog/2013/01/09/what-do-we-mean-embedded-files-pdf/&quot;&gt;Open Preservation Foundation blog&lt;/a&gt;&lt;/p&gt;
&lt;div class=&quot;footnotes&quot; role=&quot;doc-endnotes&quot;&gt;
  &lt;ol&gt;
    &lt;li id=&quot;fn:1&quot;&gt;
      &lt;p&gt;Source: &lt;a href=&quot;http://www.pdfa.org/2012/10/pdf-association-newsletter-issue-26/&quot;&gt;this unofficial newsletter item&lt;/a&gt;, as at this moment I don’t have access to the full specification of &lt;em&gt;PDF/A-3&lt;/em&gt;. &lt;a href=&quot;#fnref:1&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
  &lt;/ol&gt;
&lt;/div&gt;
</description>
                <link>https://bitsgalore.org/2013/01/09/what-do-we-mean-embedded-files-pdf</link>
                <guid>https://bitsgalore.org/2013/01/09/what-do-we-mean-embedded-files-pdf</guid>
                <pubDate>2013-01-09T00:00:00+01:00</pubDate>
        </item>

        <item>
                <title>Identification of PDF preservation risks with Apache Preflight&#58; a first impression</title>
                <description>&lt;p&gt;The &lt;em&gt;PDF&lt;/em&gt; format contains various features that may make it difficult to
access content that is stored in this format in the long term. Examples
include (but are not limited to):&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Encryption features, which may either restrict some functionality
(copying, printing) or make files inaccessible altogether.&lt;/li&gt;
  &lt;li&gt;Multimedia features (embedded multimedia objects may be subject to
format obsolescence)&lt;/li&gt;
  &lt;li&gt;Reliance on external features (e.g. non-embedded fonts, or
references to external documents)&lt;/li&gt;
&lt;/ul&gt;

&lt;!-- more --&gt;

&lt;p&gt;A more exhaustive overview is given here:&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://zenodo.org/record/801661&quot;&gt;Adobe Portable Document Format - Inventory of long-term preservation risks&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;and also here:&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://web.archive.org/web/20130515073645/http://libraries.stackexchange.com/questions/964/what-preservation-risks-are-associated-with-the-pdf-file-format&quot;&gt;https://web.archive.org/web/20130515073645/http://libraries.stackexchange.com/questions/964/what-preservation-risks-are-associated-with-the-pdf-file-format&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When &lt;em&gt;creating&lt;/em&gt; a &lt;em&gt;PDF&lt;/em&gt;, it is possible to minimise these risks by using
one of the &lt;em&gt;PDF/A&lt;/em&gt; standards, which delineate a number of &lt;em&gt;PDF&lt;/em&gt; feature
profiles that are unlikely to result in any long-term accessibility
problems. However, the simple fact is that most &lt;em&gt;PDF&lt;/em&gt;s that are out
there are not &lt;em&gt;PDF/A&lt;/em&gt;.&lt;/p&gt;

&lt;h2 id=&quot;pdf-profiling&quot;&gt;PDF Profiling&lt;/h2&gt;

&lt;p&gt;For assessing risks in existing collections, it would be helpful to be
able to screen or profile &lt;em&gt;PDF&lt;/em&gt;s for specific ‘risky’ features, such as
encryption or font embedding. Since &lt;em&gt;PDF/A&lt;/em&gt; was specifically designed to
eliminate these ‘risky’ features, one would expect that &lt;em&gt;PDF/A&lt;/em&gt;
validators (i.e. software tools that check the conformance of a &lt;em&gt;PDF&lt;/em&gt;
file against the &lt;em&gt;PDF/A&lt;/em&gt; specification) would be able to provide some
useful information on this.&lt;/p&gt;

&lt;p&gt;In a first attempt to test whether this approach is feasible at all, I
did some tests with &lt;em&gt;Apache Preflight&lt;/em&gt;, an open-source &lt;em&gt;PDF/A-1&lt;/em&gt;
validator that is part of the &lt;a href=&quot;http://pdfbox.apache.org/&quot;&gt;&lt;em&gt;Apache
PDFBox&lt;/em&gt;&lt;/a&gt; library. The specific
objectives of this work were:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;To get a first impression of the &lt;em&gt;Apache Preflight&lt;/em&gt; (part of
&lt;em&gt;PDFBox&lt;/em&gt;) &lt;em&gt;PDF/A-1b&lt;/em&gt; validator.&lt;/li&gt;
  &lt;li&gt;To investigate if &lt;em&gt;Apache Preflight&lt;/em&gt; is able to detect unwanted
(from a preservation point of view) features in &lt;em&gt;PDF&lt;/em&gt; files (i.e.
&lt;em&gt;PDF&lt;/em&gt;s that are not necessarily of the &lt;em&gt;PDF/A&lt;/em&gt; sub-type) such as
password protection, encryption and non-embedded fonts.&lt;/li&gt;
  &lt;li&gt;To provide a comparison with the &lt;em&gt;Preflight&lt;/em&gt; module of &lt;em&gt;Adobe
Acrobat&lt;/em&gt; 9.5.&lt;/li&gt;
  &lt;li&gt;To decide if doing more work on &lt;em&gt;Apache Preflight&lt;/em&gt; (more elaborate
testing, possible involvement in its development) is worthwhile.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The results can be found in the report &lt;a href=&quot;https://zenodo.org/record/2556637&quot;&gt;Identification of preservation
risks in PDF with Apache Preflight: a first
impression&lt;/a&gt;&lt;/p&gt;

&lt;h2 id=&quot;the-archivists-pdf-cabinet-of-horrors&quot;&gt;The Archivist’s PDF Cabinet of Horrors&lt;/h2&gt;

&lt;p&gt;The report’s findings are to a large extent based on a suite of small,
simple test files that were created especially for this work. Each file
contains one ‘risky’ feature, with focus on the following feature
classes:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Encryption&lt;/li&gt;
  &lt;li&gt;Multimedia&lt;/li&gt;
  &lt;li&gt;Scripts&lt;/li&gt;
  &lt;li&gt;Fonts&lt;/li&gt;
  &lt;li&gt;File attachments&lt;/li&gt;
  &lt;li&gt;External references&lt;/li&gt;
  &lt;li&gt;Byte corruption&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The dataset can be found here:&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://www.opf-labs.org/format-corpus/pdfCabinetOfHorrors/&quot;&gt;http://www.opf-labs.org/format-corpus/pdfCabinetOfHorrors/&lt;/a&gt;&lt;/p&gt;

&lt;h2 id=&quot;link-to-full-report&quot;&gt;Link to full report&lt;/h2&gt;

&lt;p&gt;&lt;a href=&quot;https://zenodo.org/record/2556637&quot;&gt;Identification of preservation risks in PDF with Apache Preflight: a
first impression&lt;/a&gt;&lt;/p&gt;

&lt;h2 id=&quot;update-january-2013&quot;&gt;Update (January 2013)&lt;/h2&gt;

&lt;p&gt;Since the report was published, a number of improvements have been made
to Apache Preflight which should fix some of the reported issues. I
haven’t tested the latest version yet, but will try doing this some time
soon.&lt;/p&gt;

&lt;h2 id=&quot;update-march-2013&quot;&gt;Update (March 2013)&lt;/h2&gt;

&lt;p&gt;During the SPRUCE &lt;a href=&quot;http://wiki.opf-labs.org/display/SPR/SPRUCE+Hackathon+Leeds%2C+Unified+Characterisation&quot;&gt;hackathon on unified
characterisation&lt;/a&gt;
(Leeds, 11-12 March, 2013) additional work was done on this. See this
&lt;a href=&quot;https://openpreservation.org/blogs/2013-03-15-pdf-eh-another-hackathon-tale&quot;&gt;blog post by Pete
Cliff&lt;/a&gt;
for more details. Importantly, the tests done during the hackathon
showed that &lt;em&gt;Apache Preflight&lt;/em&gt;’s ability to identify ‘risky’ features
has improved significantly since I published my report back in December,
and the issues that are mentioned in the report appear to have been
largely resolved!&lt;/p&gt;

&lt;hr /&gt;

&lt;p&gt;Originally published at the &lt;a href=&quot;https://openpreservation.org/blog/2012/12/19/identification-pdf-preservation-risks-apache-preflight-first-impression/&quot;&gt;Open Preservation Foundation blog&lt;/a&gt;&lt;/p&gt;
</description>
                <link>https://bitsgalore.org/2012/12/19/identification-pdf-preservation-risks-apache-preflight-first-impression</link>
                <guid>https://bitsgalore.org/2012/12/19/identification-pdf-preservation-risks-apache-preflight-first-impression</guid>
                <pubDate>2012-12-19T00:00:00+01:00</pubDate>
        </item>

        <item>
                <title>Automated assessment of JP2 against a technical profile</title>
                <description>&lt;p&gt;I’ve already written a number of blog posts on format validation of &lt;em&gt;JP2&lt;/em&gt; files. Format validation is only one aspect of a quality assessment workflow. Digitisation guidelines typically impose various constraints on the technical characteristics of preservation and access images. For example, they may state that a preservation master must be losslessly compressed, and that its progression order must be &lt;em&gt;RPCL&lt;/em&gt;. A &lt;em&gt;format profile&lt;/em&gt; is a set of such technical constraints. The process that compares the technical characteristics of a file against a format profile is sometimes called &lt;a href=&quot;http://wiki.opf-labs.org/pages/viewpage.action?pageId=6062098&quot;&gt;&lt;em&gt;Policy Driven Validation&lt;/em&gt;&lt;/a&gt;. This corresponds to what &lt;em&gt;JHOVE2&lt;/em&gt; refers to as &lt;a href=&quot;https://bitbucket.org/jhove2/main/wiki/Glossary&quot;&gt;&lt;em&gt;Assessment&lt;/em&gt;&lt;/a&gt; (which I think is a better description).&lt;/p&gt;

&lt;p&gt;This blog post describes a simple method for doing a rule-based assessment of &lt;em&gt;JP2&lt;/em&gt; images. It uses &lt;a href=&quot;http://en.wikipedia.org/wiki/Schematron&quot;&gt;Schematron&lt;/a&gt;, which is a rule-based validation language, to ‘validate’ the output of &lt;em&gt;jpylyzer&lt;/em&gt; against a profile. Before getting into any technical details, let’s first have a look at an example of a format profile.&lt;/p&gt;

&lt;!-- more --&gt;

&lt;h2 id=&quot;example-format-profile&quot;&gt;Example format profile&lt;/h2&gt;

&lt;p&gt;The table below shows the format profile that we’ll be using throughout this blog post, which is a typical ‘access’-oriented profile using lossy compression. Note that it is provided here for illustrative purposes only!&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;&lt;strong&gt;Parameter&lt;/strong&gt;&lt;/th&gt;
      &lt;th&gt;&lt;strong&gt;Value&lt;/strong&gt;&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;File format&lt;/td&gt;
      &lt;td&gt;JP2 (JPEG 2000 Part 1)&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Compression type&lt;/td&gt;
      &lt;td&gt;Lossy (irreversible 9-7 wavelet filter)&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Colour transform&lt;/td&gt;
      &lt;td&gt;Yes (only for colour images)&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Number of decomposition levels&lt;/td&gt;
      &lt;td&gt;5&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Progression order&lt;/td&gt;
      &lt;td&gt;RPCL&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Tile size&lt;/td&gt;
      &lt;td&gt;1024 x 1024&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Code block size&lt;/td&gt;
      &lt;td&gt;64 x 64&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Precinct size&lt;/td&gt;
      &lt;td&gt;256 x 256 for 2 highest resolution levels; 128 x 128 for remaining resolution levels&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Number of quality layers&lt;/td&gt;
      &lt;td&gt;8&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Target compression ratio&lt;/td&gt;
      &lt;td&gt;20:1&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Error resilience&lt;/td&gt;
      &lt;td&gt;Start-of-packet headers; end-of-packet headers; segmentation symbols&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Grid resolution&lt;/td&gt;
      &lt;td&gt;Stored in “Capture Resolution” fields&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;ICC profiles&lt;/td&gt;
      &lt;td&gt;Embedded using “Restricted ICC” method&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Capture metadata&lt;/td&gt;
      &lt;td&gt;Embedded in XML box&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;h2 id=&quot;corresponding-properties-in-jpylyzer-output&quot;&gt;Corresponding properties in &lt;em&gt;jpylyzer&lt;/em&gt; output&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Jpylyzer&lt;/em&gt; provides information on &lt;em&gt;all&lt;/em&gt; of the technical characteristics that are listed in the table. You can check this yourself by running &lt;em&gt;jpylyzer&lt;/em&gt; on any &lt;em&gt;JP2&lt;/em&gt; file and looking at the resulting output. A few examples:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;Compression type - value of &lt;em&gt;transformation&lt;/em&gt; field:&lt;/p&gt;

    &lt;pre&gt;&lt;code class=&quot;language-data&quot;&gt;/jpylyzer/properties/contiguousCodestreamBox/cod/transformation
&lt;/code&gt;&lt;/pre&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Progression order - value of &lt;em&gt;order&lt;/em&gt; field:&lt;/p&gt;

    &lt;pre&gt;&lt;code class=&quot;language-data&quot;&gt;/jpylyzer/properties/contiguousCodestreamBox/cod/order
&lt;/code&gt;&lt;/pre&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;ICC profiles - value of &lt;em&gt;meth&lt;/em&gt; field:&lt;/p&gt;

    &lt;pre&gt;&lt;code class=&quot;language-data&quot;&gt;/jpylyzer/properties/jp2HeaderBox/colourSpecificationBox/meth
&lt;/code&gt;&lt;/pre&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Grid resolution - presence of &lt;em&gt;captureResolutionBox&lt;/em&gt; element:&lt;/p&gt;

    &lt;pre&gt;&lt;code class=&quot;language-data&quot;&gt;/jpylyzer/properties/jp2HeaderBox/resolutionBox/captureResolutionBox
&lt;/code&gt;&lt;/pre&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;expressing-the-profile-as-a-set-of-assessable-rules&quot;&gt;Expressing the profile as a set of assessable rules&lt;/h2&gt;

&lt;p&gt;In order to assess &lt;em&gt;jpylyzer&lt;/em&gt;’s output against the profile, we first need to translate the profile to a set of assessable rules. This is where Schematron comes in. Look at parameter ‘Compression type’ in the table. In the previous section we saw that it corresponds to the &lt;em&gt;transformation&lt;/em&gt; field in &lt;em&gt;jpylyzer&lt;/em&gt;’s output. Below is a Schematron rule that asserts if &lt;em&gt;transformation&lt;/em&gt; has the required value:&lt;/p&gt;

&lt;div class=&quot;language-xml highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nt&quot;&gt;&amp;lt;s:rule&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;context=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;/jpylyzer/properties/contiguousCodestreamBox/cod&quot;&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;s:assert&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;test=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;transformation = &apos;9-7 irreversible&apos;&quot;&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;&amp;gt;&lt;/span&gt;wrong transformation&lt;span class=&quot;nt&quot;&gt;&amp;lt;/s:assert&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;/s:rule&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;In words, the rule asserts that the value of &lt;em&gt;transformation&lt;/em&gt; (which is located in &lt;em&gt;/jpylyzer/properties/contiguousCodestreamBox/cod&lt;/em&gt;) equals &lt;em&gt;9-7 irreversible&lt;/em&gt;. If the assertion fails, this will result in the error message “wrong transformation”.&lt;/p&gt;

&lt;p&gt;Both the location (&lt;em&gt;context&lt;/em&gt;) and the test statement are expressed using &lt;a href=&quot;http://www.w3schools.com/xpath/xpath_syntax.asp&quot;&gt;XPath syntax&lt;/a&gt;, which allows more complex tests as well.&lt;/p&gt;
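&lt;p&gt;Rules like the one above cannot be used on their own: Schematron expects them inside a &lt;em&gt;pattern&lt;/em&gt; element, which in turn sits inside a &lt;em&gt;schema&lt;/em&gt; element. Below is a minimal sketch of such a wrapper, with the ISO Schematron namespace bound to the &lt;em&gt;s&lt;/em&gt; prefix that is used throughout this post. The &lt;em&gt;id&lt;/em&gt; value of the pattern is made up for illustration:&lt;/p&gt;

&lt;div class=&quot;language-xml highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&amp;lt;?xml version=&quot;1.0&quot;?&amp;gt;
&amp;lt;s:schema xmlns:s=&quot;http://purl.oclc.org/dsdl/schematron&quot;&amp;gt;
  &amp;lt;!-- &quot;compressionType&quot; is an arbitrary, illustrative identifier --&amp;gt;
  &amp;lt;s:pattern id=&quot;compressionType&quot;&amp;gt;
    &amp;lt;s:rule context=&quot;/jpylyzer/properties/contiguousCodestreamBox/cod&quot;&amp;gt;
      &amp;lt;s:assert test=&quot;transformation = &apos;9-7 irreversible&apos;&quot;&amp;gt;wrong transformation&amp;lt;/s:assert&amp;gt;
    &amp;lt;/s:rule&amp;gt;
  &amp;lt;/s:pattern&amp;gt;
&amp;lt;/s:schema&amp;gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;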

&lt;h2 id=&quot;check-that-value-doesnt-exceed-threshold&quot;&gt;Check that value doesn’t exceed threshold&lt;/h2&gt;

&lt;p&gt;The following rule checks if the compression ratio doesn’t exceed a threshold value (this is actually a bit tricky, as for images that don’t contain much information very high compression ratios may be obtained without losing quality):&lt;/p&gt;

&lt;div class=&quot;language-xml highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nt&quot;&gt;&amp;lt;s:rule&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;context=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;/jpylyzer/properties&quot;&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;s:assert&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;test=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;compressionRatio &amp;amp;lt; 35&quot;&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;&amp;gt;&lt;/span&gt;Too much compression&lt;span class=&quot;nt&quot;&gt;&amp;lt;/s:assert&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;/s:rule&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;(Note that the entity reference “&amp;amp;lt;” represents “&amp;lt;”, which cannot appear literally inside an &lt;em&gt;XML&lt;/em&gt; attribute value.)&lt;/p&gt;

&lt;h2 id=&quot;check-if-element-exists&quot;&gt;Check if element exists&lt;/h2&gt;

&lt;p&gt;The following Schematron rule checks if the &lt;em&gt;captureResolutionBox&lt;/em&gt; element exists:&lt;/p&gt;

&lt;div class=&quot;language-xml highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nt&quot;&gt;&amp;lt;s:rule&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;context=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;/jpylyzer/properties/jp2HeaderBox/resolutionBox&quot;&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;s:assert&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;test=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;captureResolutionBox&quot;&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;&amp;gt;&lt;/span&gt;no capture resolution box&lt;span class=&quot;nt&quot;&gt;&amp;lt;/s:assert&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;/s:rule&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
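
&lt;p&gt;One caveat with the rule above: a Schematron rule only fires if its context matches a node in the document. If a &lt;em&gt;JP2&lt;/em&gt; has no resolution box at all, the context never matches, the assertion is never evaluated, and the file passes silently. A possible way around this (a sketch, not necessarily how the demo schema handles it) is to move the context one level up, to an element that should be present in any valid &lt;em&gt;JP2&lt;/em&gt;, and to put the full relative path in the test:&lt;/p&gt;

&lt;div class=&quot;language-xml highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&amp;lt;s:rule context=&quot;/jpylyzer/properties/jp2HeaderBox&quot;&amp;gt;
&amp;lt;!-- fails both when resolutionBox is absent and when it lacks a captureResolutionBox --&amp;gt;
&amp;lt;s:assert test=&quot;resolutionBox/captureResolutionBox&quot;&amp;gt;no capture resolution box&amp;lt;/s:assert&amp;gt;
&amp;lt;/s:rule&amp;gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;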

&lt;h2 id=&quot;outcome-depends-on-values-of-multiple-elements&quot;&gt;Outcome depends on values of multiple elements&lt;/h2&gt;

&lt;p&gt;Here’s a more complex rule that checks whether a colour transformation (&lt;em&gt;multipleComponentTransformation&lt;/em&gt;) was used while creating the image. A colour transformation is only possible for colour images, so in order to make this work for grayscale images as well, the rule must take into account that &lt;em&gt;multipleComponentTransformation&lt;/em&gt; will be ‘no’ in that case (&lt;em&gt;nC&lt;/em&gt; represents the number of image components):&lt;/p&gt;

&lt;div class=&quot;language-xml highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nt&quot;&gt;&amp;lt;s:rule&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;context=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;/jpylyzer/properties/contiguousCodestreamBox/cod&quot;&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;s:assert&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;test=&lt;/span&gt;
    &lt;span class=&quot;s&quot;&gt;&quot;(multipleComponentTransformation = &apos;yes&apos;) and 
        (../../jp2HeaderBox/imageHeaderBox/nC = &apos;3&apos;) 
    or (multipleComponentTransformation = &apos;no&apos;) and 
        (../../jp2HeaderBox/imageHeaderBox/nC = &apos;1&apos;)&quot;&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;&amp;gt;&lt;/span&gt;
    no colour transformation&lt;span class=&quot;nt&quot;&gt;&amp;lt;/s:assert&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;/s:rule&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h2 id=&quot;multiple-element-instances&quot;&gt;Multiple element instances&lt;/h2&gt;

&lt;p&gt;Our profile states that the precinct size must be 256 x 256 for the 2 highest resolution levels, and 128 x 128 for the remaining ones. These occur as multiple instances of the &lt;em&gt;precinctSizeX&lt;/em&gt; and &lt;em&gt;precinctSizeY&lt;/em&gt; elements in &lt;em&gt;jpylyzer&lt;/em&gt;’s output, which we can handle as follows (shown here for &lt;em&gt;precinctSizeY&lt;/em&gt; only; note that for 5 decomposition levels we will have 6 resolution levels):&lt;/p&gt;

&lt;div class=&quot;language-xml highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nt&quot;&gt;&amp;lt;s:rule&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;context=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;/jpylyzer/properties/contiguousCodestreamBox/cod&quot;&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;s:assert&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;test=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;precinctSizeY[1] = &apos;128&apos;&quot;&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;&amp;gt;&lt;/span&gt;precinctSizeY doesn&apos;t match profile&lt;span class=&quot;nt&quot;&gt;&amp;lt;/s:assert&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;s:assert&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;test=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;precinctSizeY[2] = &apos;128&apos;&quot;&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;&amp;gt;&lt;/span&gt;precinctSizeY doesn&apos;t match profile&lt;span class=&quot;nt&quot;&gt;&amp;lt;/s:assert&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;s:assert&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;test=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;precinctSizeY[3] = &apos;128&apos;&quot;&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;&amp;gt;&lt;/span&gt;precinctSizeY doesn&apos;t match profile&lt;span class=&quot;nt&quot;&gt;&amp;lt;/s:assert&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;s:assert&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;test=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;precinctSizeY[4] = &apos;128&apos;&quot;&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;&amp;gt;&lt;/span&gt;precinctSizeY doesn&apos;t match profile&lt;span class=&quot;nt&quot;&gt;&amp;lt;/s:assert&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;s:assert&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;test=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;precinctSizeY[5] = &apos;256&apos;&quot;&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;&amp;gt;&lt;/span&gt;precinctSizeY doesn&apos;t match profile&lt;span class=&quot;nt&quot;&gt;&amp;lt;/s:assert&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;s:assert&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;test=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;precinctSizeY[6] = &apos;256&apos;&quot;&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;&amp;gt;&lt;/span&gt;precinctSizeY doesn&apos;t match profile&lt;span class=&quot;nt&quot;&gt;&amp;lt;/s:assert&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;/s:rule&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
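
&lt;p&gt;The rule above only looks at the vertical precinct size. As a sketch, the asserts below could be added to that same rule to also check the number of instances and the horizontal sizes. They are meant to go into the existing rule rather than into a new one, because within a single pattern only the first rule whose context matches a node is applied to it. The sketch assumes that the horizontal values are reported in an element named &lt;em&gt;precinctSizeX&lt;/em&gt;, and that instances are listed from the lowest to the highest resolution level, as the rule above implies:&lt;/p&gt;

&lt;div class=&quot;language-xml highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&amp;lt;!-- exactly 6 instances expected for 5 decomposition levels --&amp;gt;
&amp;lt;s:assert test=&quot;count(precinctSizeX) = 6 and count(precinctSizeY) = 6&quot;&amp;gt;unexpected number of precinct size values&amp;lt;/s:assert&amp;gt;
&amp;lt;!-- first 4 instances (lowest resolution levels) must all be 128 --&amp;gt;
&amp;lt;s:assert test=&quot;count(precinctSizeX[position() &amp;amp;lt;= 4][. = &apos;128&apos;]) = 4&quot;&amp;gt;precinctSizeX doesn&apos;t match profile&amp;lt;/s:assert&amp;gt;
&amp;lt;!-- remaining 2 instances (highest resolution levels) must both be 256 --&amp;gt;
&amp;lt;s:assert test=&quot;count(precinctSizeX[position() &amp;amp;gt; 4][. = &apos;256&apos;]) = 2&quot;&amp;gt;precinctSizeX doesn&apos;t match profile&amp;lt;/s:assert&amp;gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;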

&lt;h2 id=&quot;the-full-profile-as-a-schema&quot;&gt;The full profile as a schema&lt;/h2&gt;

&lt;p&gt;A sample schema that covers all aspects of the example format profile is available &lt;a href=&quot;https://github.com/bitsgalore/jpylyzerProfileDemo/blob/master/demoAccessLossy.sch&quot;&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;h2 id=&quot;assessment-of-jpylyzer-output-against-the-schema&quot;&gt;Assessment of &lt;em&gt;jpylyzer&lt;/em&gt; output against the schema&lt;/h2&gt;

&lt;p&gt;For the actual assessment (or validation) of &lt;em&gt;jpylyzer&lt;/em&gt; output against the schema a couple of options exist. Probably the most widely-used one is the &lt;a href=&quot;http://www.schematron.com/implementation.html&quot;&gt;ISO Schematron reference implementation&lt;/a&gt;. Validation using that software involves a number of successive &lt;em&gt;XSLT&lt;/em&gt; stylesheet transformations. A more accessible (but probably less performant) alternative is the &lt;a href=&quot;http://www.probatron.org/probatron4j.html&quot;&gt;&lt;em&gt;Probatron&lt;/em&gt; command-line executable&lt;/a&gt;. Using &lt;em&gt;Probatron&lt;/em&gt;, assessment of a &lt;em&gt;JP2&lt;/em&gt; would typically involve the following two steps:&lt;/p&gt;

&lt;h3 id=&quot;1-run-jpylyzer&quot;&gt;1. Run &lt;em&gt;jpylyzer&lt;/em&gt;&lt;/h3&gt;

&lt;p&gt;For example:&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;jpylyzer balloon.jp2 &lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; balloon_jp2.xml
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h3 id=&quot;2-validate-jpylyzers-output-against-the-schema&quot;&gt;2. Validate &lt;em&gt;jpylyzer&lt;/em&gt;’s output against the schema&lt;/h3&gt;

&lt;p&gt;Example:&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;java &lt;span class=&quot;nt&quot;&gt;-jar&lt;/span&gt; probatron.jar balloon_jp2.xml profile.sch &lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; balloon_jp2_assessment.xml
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h2 id=&quot;example-output&quot;&gt;Example output&lt;/h2&gt;

&lt;p&gt;The above procedure produces an &lt;em&gt;XML&lt;/em&gt; file that contains a &lt;em&gt;failed-assert&lt;/em&gt; element for each test that failed. For example, the output below is generated if the number of layers is wrong:&lt;/p&gt;

&lt;div class=&quot;language-xml highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nt&quot;&gt;&amp;lt;svrl:failed-assert&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;test=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;layers = &apos;8&apos;&quot;&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;location=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;/jpylyzer[1]/properties[1]/contiguousCodestreamBox[1]/cod[1]&quot;&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;line=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;45&quot;&lt;/span&gt; &lt;span class=&quot;na&quot;&gt;col=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;550&quot;&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;&amp;gt;&lt;/span&gt;
&lt;span class=&quot;nt&quot;&gt;&amp;lt;svrl:text&amp;gt;&lt;/span&gt;wrong number of layers&lt;span class=&quot;nt&quot;&gt;&amp;lt;/svrl:text&amp;gt;&lt;/span&gt;  
&lt;span class=&quot;nt&quot;&gt;&amp;lt;/svrl:failed-assert&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h2 id=&quot;demo&quot;&gt;Demo&lt;/h2&gt;

&lt;p&gt;I created a small &lt;a href=&quot;https://github.com/bitsgalore/jpylyzerProfileDemo&quot;&gt;demo&lt;/a&gt; that illustrates the assessment procedure. It includes two &lt;em&gt;JP2&lt;/em&gt; images, the full schema of the example profile of this blog post, and a Windows batch file. For the moment it is located in my personal GitHub account, but the schemas will probably be included in upcoming &lt;em&gt;jpylyzer&lt;/em&gt; releases. To use the demo, just &lt;a href=&quot;https://github.com/bitsgalore/jpylyzerProfileDemo/zipball/master&quot;&gt;download the ZIP file&lt;/a&gt;, unzip it, open the batch file in a text editor and follow the instructions at the top of the file.&lt;/p&gt;

&lt;h2 id=&quot;final-note&quot;&gt;Final note&lt;/h2&gt;

&lt;p&gt;Two final remarks. First, although this blog post only covers the assessment of &lt;em&gt;JP2&lt;/em&gt; images using &lt;em&gt;jpylyzer&lt;/em&gt;, the same procedure can be used for other formats and tools (provided that the tools are capable of producing &lt;em&gt;XML&lt;/em&gt; output). Second, knowing that a &lt;em&gt;JP2&lt;/em&gt; is valid and conforms to a technical profile is certainly important, but it doesn’t say anything about the (quality of the) actual image content. So in an operational setting this will often require additional checks (e.g. a pixel-wise comparison between source and destination images).&lt;/p&gt;

&lt;h2 id=&quot;acknowledgement&quot;&gt;Acknowledgement&lt;/h2&gt;

&lt;p&gt;Big thanks go out to Adam Retter (The National Archives) for his suggestion to use &lt;em&gt;Schematron&lt;/em&gt;, &lt;a href=&quot;https://twitter.com/bitsgalore/status/232829808608948224&quot;&gt;just as I was struggling to make this work in &lt;em&gt;XSD&lt;/em&gt;&lt;/a&gt;. Adam also shared some of his own &lt;em&gt;Schematron&lt;/em&gt; schemas with me, which were a starting point for the work presented here.&lt;/p&gt;

&lt;h2 id=&quot;useful-links&quot;&gt;Useful links&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;http://www.schematron.com/&quot;&gt;Schematron&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;http://www.w3schools.com/xpath/xpath_syntax.asp&quot;&gt;XPath syntax&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;http://jpylyzer.openpreservation.org/&quot;&gt;Jpylyzer&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;http://www.probatron.org/probatron4j.html&quot;&gt;Probatron&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://github.com/bitsgalore/jpylyzerProfileDemo&quot;&gt;Demo: check if JP2 file matches a technical profile&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://github.com/bitsgalore/jpylyzerProfileDemo/zipball/master&quot;&gt;Demo (download as ZIP)&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;post-script-february-2019&quot;&gt;Post script, February 2019&lt;/h2&gt;

&lt;p&gt;Since this post was originally published, &lt;em&gt;jpylyzer&lt;/em&gt;’s output format has changed slightly: from version 1.14.0 onward, all output elements have an associated namespace. This means that the Schematron rules must be adapted accordingly. A &lt;a href=&quot;https://github.com/KBNLresearch/jprofile/tree/master/jprofile/schemas&quot;&gt;set of example Schematron definitions that work with current versions of &lt;em&gt;jpylyzer&lt;/em&gt; can be found here&lt;/a&gt;. They are part of &lt;a href=&quot;https://github.com/KBNLresearch/jprofile&quot;&gt;&lt;em&gt;jprofile&lt;/em&gt;&lt;/a&gt;, a simple tool that we use at the KB to assess JP2s from external suppliers. The source code of &lt;em&gt;jprofile&lt;/em&gt; also demonstrates how to do this type of assessment in Python.&lt;/p&gt;
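
&lt;p&gt;As a rough sketch of what this adaptation looks like: the schema declares the namespace with an &lt;em&gt;s:ns&lt;/em&gt; element, and every element name in the context and test expressions gets the corresponding prefix. The namespace URI and the &lt;em&gt;j&lt;/em&gt; prefix below are assumptions for illustration only. Check the root element of your own &lt;em&gt;jpylyzer&lt;/em&gt; output for the actual URI, and note that the element paths may differ between &lt;em&gt;jpylyzer&lt;/em&gt; versions as well:&lt;/p&gt;

&lt;div class=&quot;language-xml highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&amp;lt;s:schema xmlns:s=&quot;http://purl.oclc.org/dsdl/schematron&quot;&amp;gt;
  &amp;lt;!-- URI is an assumption: copy the xmlns value from your own jpylyzer output --&amp;gt;
  &amp;lt;s:ns uri=&quot;http://openpreservation.org/ns/jpylyzer/&quot; prefix=&quot;j&quot;/&amp;gt;
  &amp;lt;s:pattern id=&quot;compressionType&quot;&amp;gt;
    &amp;lt;s:rule context=&quot;/j:jpylyzer/j:properties/j:contiguousCodestreamBox/j:cod&quot;&amp;gt;
      &amp;lt;s:assert test=&quot;j:transformation = &apos;9-7 irreversible&apos;&quot;&amp;gt;wrong transformation&amp;lt;/s:assert&amp;gt;
    &amp;lt;/s:rule&amp;gt;
  &amp;lt;/s:pattern&amp;gt;
&amp;lt;/s:schema&amp;gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;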

&lt;hr /&gt;

&lt;p&gt;Originally published at the &lt;a href=&quot;https://openpreservation.org/blog/2012/09/04/automated-assessment-jp2-against-technical-profile/&quot;&gt;Open Preservation Foundation blog&lt;/a&gt;&lt;/p&gt;
</description>
                <link>https://bitsgalore.org/2012/09/04/automated-assessment-jp2-against-technical-profile</link>
                <guid>https://bitsgalore.org/2012/09/04/automated-assessment-jp2-against-technical-profile</guid>
                <pubDate>2012-09-04T00:00:00+02:00</pubDate>
        </item>

        <item>
                <title>Magic editing and creation&#58; a primer</title>
                <description>&lt;p&gt;The purpose of this post is to give a brief introduction to creating, editing and submitting format signatures (or ‘&lt;em&gt;magic&lt;/em&gt;’ entries) for the well-known &lt;em&gt;File&lt;/em&gt; tool. The occasion for this was some work I did last week on improving &lt;em&gt;File&lt;/em&gt;’s identification of the &lt;em&gt;JPEG 2000&lt;/em&gt; formats. I had some difficulty finding any easy-to-follow documentation that describes how to do this. The information is all out there, but it’s pretty fragmented. So, I wrote this brief tutorial, which is intended as an accessible introduction to &lt;em&gt;magic&lt;/em&gt; editing. It only covers the very basics, but hopefully this is enough to overcome some initial stumbling blocks.&lt;/p&gt;

&lt;!-- more --&gt;

&lt;h2 id=&quot;how-to-get-file&quot;&gt;How to get File&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;File&lt;/em&gt; is part of most &lt;em&gt;Unix&lt;/em&gt;/&lt;em&gt;Linux&lt;/em&gt; distributions. If you’re a &lt;em&gt;Windows&lt;/em&gt; user I would recommend getting &lt;a href=&quot;http://www.cygwin.com/&quot;&gt;Cygwin&lt;/a&gt;, which includes the latest version of &lt;em&gt;File&lt;/em&gt;. A stand-alone &lt;em&gt;Windows&lt;/em&gt; port of &lt;em&gt;File&lt;/em&gt; does exist, but apparently it is not maintained anymore.&lt;/p&gt;

&lt;h2 id=&quot;how-it-works&quot;&gt;How it works&lt;/h2&gt;

&lt;p&gt;Just like tools such as &lt;em&gt;DROID&lt;/em&gt;, &lt;em&gt;Fido&lt;/em&gt; and &lt;em&gt;Apache Tika&lt;/em&gt;, &lt;em&gt;File&lt;/em&gt;’s identification is based on format signatures (which in the case of &lt;em&gt;File&lt;/em&gt; are usually called &lt;em&gt;magic numbers&lt;/em&gt;). These are stored in a &lt;em&gt;magic&lt;/em&gt; file, which is located in the &lt;em&gt;magic&lt;/em&gt; directory. Typical locations of this &lt;em&gt;magic&lt;/em&gt; directory are:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-data&quot;&gt;/usr/share/misc
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Or:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-data&quot;&gt;/usr/share/file
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Or, on Cygwin:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-data&quot;&gt;/cygwin/usr/share/misc/
&lt;/code&gt;&lt;/pre&gt;

&lt;h2 id=&quot;files-in-the-magic-directory-compiled-versus-source&quot;&gt;Files in the magic directory: compiled versus source&lt;/h2&gt;

&lt;p&gt;Inside the &lt;em&gt;magic&lt;/em&gt; directory you will see a file called &lt;em&gt;magic.mgc&lt;/em&gt;. This is the &lt;em&gt;compiled&lt;/em&gt; &lt;em&gt;magic&lt;/em&gt; file, which &lt;em&gt;File&lt;/em&gt; uses by default (although you can override this behaviour, as we’ll see in a moment).&lt;/p&gt;

&lt;p&gt;Depending on your system, you may also see another file that is simply called &lt;em&gt;magic&lt;/em&gt;. This is the uncompiled (i.e. human-readable) source of the &lt;em&gt;magic&lt;/em&gt; file.&lt;/p&gt;

&lt;p&gt;Note that in the source distribution of &lt;em&gt;File&lt;/em&gt;, the source &lt;em&gt;magic&lt;/em&gt; is organised in a directory structure. You can see this in the &lt;a href=&quot;https://github.com/glensc/file/tree/master/magic/Magdir&quot;&gt;Github mirror of File’s source repository&lt;/a&gt;.&lt;/p&gt;

&lt;h2 id=&quot;compiling-a-magic-source-file&quot;&gt;Compiling a magic source file&lt;/h2&gt;

&lt;p&gt;So how do we get from a source file to a compiled &lt;em&gt;magic&lt;/em&gt; file? Easy: just run &lt;em&gt;File&lt;/em&gt; with the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;-C&lt;/code&gt; (compile) switch, and use the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;-m&lt;/code&gt; switch to specify the source file. For example, supposing we have a source file called &lt;em&gt;myMagic&lt;/em&gt;, compile it using:&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;file &lt;span class=&quot;nt&quot;&gt;-C&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-m&lt;/span&gt; myMagic
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This produces the compiled &lt;em&gt;magic&lt;/em&gt; file &lt;em&gt;myMagic.mgc&lt;/em&gt;. You can then use the compiled file by specifying its name using (again) the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;-m&lt;/code&gt; switch, e.g.:&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;file &lt;span class=&quot;nt&quot;&gt;-i&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-m&lt;/span&gt; myMagic &lt;span class=&quot;k&quot;&gt;*&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Note that in the above case &lt;em&gt;File&lt;/em&gt; actually uses &lt;em&gt;myMagic.mgc&lt;/em&gt; (apparently it expects this extension); the following command line:&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;file &lt;span class=&quot;nt&quot;&gt;-i&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-m&lt;/span&gt; myMagic.mgc &lt;span class=&quot;k&quot;&gt;*&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;produces identical results.&lt;/p&gt;

&lt;h2 id=&quot;format-of-the-magic-source-file&quot;&gt;Format of the magic source file&lt;/h2&gt;

&lt;p&gt;So what does the &lt;em&gt;magic&lt;/em&gt; source file look like? I will only stick to the basics here, as an exhaustive description of the &lt;em&gt;magic&lt;/em&gt; source format is given in the  &lt;a href=&quot;http://manpages.ubuntu.com/manpages/precise/en/man5/magic.5.html&quot;&gt;man pages&lt;/a&gt; of &lt;em&gt;magic&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;The most important thing to remember is that each line in the file specifies a test. Each test is made up of 4 items, which are separated by one or more whitespace characters (usually tabs, but spaces appear to work as well):&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;em&gt;offset&lt;/em&gt; - specifies the offset, in bytes, into the file of the data which is to be tested.&lt;/li&gt;
  &lt;li&gt;&lt;em&gt;type&lt;/em&gt; - the type of the data to be tested (see &lt;a href=&quot;http://manpages.ubuntu.com/manpages/precise/en/man5/magic.5.html&quot;&gt;here&lt;/a&gt; for a list of all possible values).&lt;/li&gt;
  &lt;li&gt;&lt;em&gt;test&lt;/em&gt; - the value to be compared with the value from the file.&lt;/li&gt;
  &lt;li&gt;&lt;em&gt;message&lt;/em&gt; - the message to be printed if the comparison succeeds.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Here’s an example (actually this is the &lt;em&gt;JPEG 2000&lt;/em&gt; &lt;em&gt;magic&lt;/em&gt; that is currently used in &lt;em&gt;File&lt;/em&gt; 5.11):&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-data&quot;&gt;0	string	\x00\x00\x00\x0C\x6A\x50\x20\x20\x0D\x0A\x87\x0A	JPEG 2000
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Here we have a test that compares the start of a file object (byte offset 0) against a 12-byte string (here represented as hexadecimal codes). If the pattern is found, the text &lt;em&gt;JPEG 2000&lt;/em&gt; will be printed to screen. Note that for most formats we will need further &lt;em&gt;sublevel tests&lt;/em&gt;, which I will illustrate in the next section. Also, even though this introduction only describes tests at fixed byte positions, more sophisticated tests are possible as well. See the &lt;a href=&quot;http://manpages.ubuntu.com/manpages/precise/en/man5/magic.5.html&quot;&gt;man pages&lt;/a&gt; for details on this.&lt;/p&gt;

&lt;h2 id=&quot;improving-the-jpeg-2000-magic&quot;&gt;Improving the JPEG 2000 magic&lt;/h2&gt;

&lt;p&gt;The main problem with the above &lt;em&gt;JPEG 2000&lt;/em&gt; entry is that it only checks the first 12 bytes of a file. This is enough for establishing that a file is part of the &lt;em&gt;JPEG 2000&lt;/em&gt; ‘family’ of file formats, but it doesn’t tell you the exact sub-format, which can be any of the following:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;em&gt;JP2&lt;/em&gt; (basic still image format)&lt;/li&gt;
  &lt;li&gt;&lt;em&gt;JPX&lt;/em&gt; (extended still image format)&lt;/li&gt;
  &lt;li&gt;&lt;em&gt;JPM&lt;/em&gt; (compound format)&lt;/li&gt;
  &lt;li&gt;&lt;em&gt;MJ2&lt;/em&gt; (Motion &lt;em&gt;JPEG 2000&lt;/em&gt;)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In other words: although &lt;em&gt;File&lt;/em&gt; may be telling us that a file object matches &lt;em&gt;JPEG 2000&lt;/em&gt;, this gives us zero information on whether it contains simple image data (&lt;em&gt;JP2&lt;/em&gt;) or video content (&lt;em&gt;MJ2&lt;/em&gt;)! Fortunately, the headers of each of the above formats include a &lt;em&gt;Brand&lt;/em&gt; field, which is a 4-byte string of characters that uniquely identifies each format. This string starts at byte 20, and for &lt;em&gt;JP2&lt;/em&gt; it equals ‘\x6a\x70\x32\x20’, so we can simply add this as a second test to the existing &lt;em&gt;magic&lt;/em&gt; entry:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-data&quot;&gt;&amp;gt;20	string	\x6a\x70\x32\x20	Part 1 (JP2)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Note the “&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;&amp;gt;&lt;/code&gt;” character, which indicates that this is a higher-level test on top of the previous one: it is only evaluated if the preceding test at offset 0 succeeds.&lt;/p&gt;
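
&lt;p&gt;Levels can be nested further by stacking “&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;&amp;gt;&lt;/code&gt;” characters. As a minimal, untested sketch (this is a hypothetical alternative for illustration, &lt;em&gt;not&lt;/em&gt; the entry that &lt;em&gt;File&lt;/em&gt; uses), the 12-byte signature test could also be written as three nested tests on 4-byte big-endian integers (type &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;belong&lt;/code&gt;). Each line is only evaluated if the line one level above it matches, and only the last line carries a message. Lines that start with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;#&lt;/code&gt; are comments:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-data&quot;&gt;# Hypothetical alternative, for illustration only
# offset	type	test	message
0	belong	0x0000000C
&amp;gt;4	belong	0x6A502020
&amp;gt;&amp;gt;8	belong	0x0D0A870A	JPEG 2000
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;For a fixed signature like this one the single &lt;em&gt;string&lt;/em&gt; test is simpler; the nested form is mainly useful when later tests depend on the outcome of earlier ones.&lt;/p&gt;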

&lt;h2 id=&quot;adding-mimetype-information&quot;&gt;Adding mimetype information&lt;/h2&gt;

&lt;p&gt;Optionally, tests may also be associated with a &lt;em&gt;mimetype&lt;/em&gt;. In that case the line that represents the test is followed by a second line, which contains the following items:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;em&gt;!:mime&lt;/em&gt; - indicates that this line is a &lt;em&gt;mimetype&lt;/em&gt; declaration&lt;/li&gt;
  &lt;li&gt;&lt;em&gt;MIME Type&lt;/em&gt; - the actual &lt;em&gt;mimetype&lt;/em&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For &lt;em&gt;JP2&lt;/em&gt; this is:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-data&quot;&gt;!:mime	image/jp2
&lt;/code&gt;&lt;/pre&gt;

&lt;h2 id=&quot;adding-the-other-formats&quot;&gt;Adding the other formats&lt;/h2&gt;

&lt;p&gt;Repeating the above procedure for the full set of &lt;em&gt;JPEG 2000&lt;/em&gt; formats we end up with this:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-data&quot;&gt;0	string	\x00\x00\x00\x0C\x6A\x50\x20\x20\x0D\x0A\x87\x0A	JPEG 2000
&amp;gt;20	string	\x6a\x70\x32\x20	Part 1 (JP2)
!:mime	image/jp2
&amp;gt;20	string	\x6a\x70\x78\x20	Part 2 (JPX)
!:mime	image/jpx
&amp;gt;20	string	\x6a\x70\x6d\x20	Part 6 (JPM)
!:mime	image/jpm
&amp;gt;20	string	\x6d\x6a\x70\x32	Part 3 (MJ2)
!:mime video/mj2
&lt;/code&gt;&lt;/pre&gt;

&lt;h2 id=&quot;compiling-the-file&quot;&gt;Compiling the file&lt;/h2&gt;

&lt;p&gt;To create the compiled file, follow the simple steps below:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;
    &lt;p&gt;Save your &lt;em&gt;magic&lt;/em&gt; entry to a file (e.g. &lt;em&gt;jpeg2000Magic&lt;/em&gt;). See also the &lt;a href=&quot;https://github.com/bitsgalore/jp2kMagic/blob/master/magic/jpeg2000Magic&quot;&gt;complete entry on my personal Github repo&lt;/a&gt;.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;If you’re using &lt;em&gt;Windows&lt;/em&gt;, you may need to convert &lt;em&gt;Windows&lt;/em&gt;-style linebreaks to &lt;em&gt;Unix&lt;/em&gt; linebreaks, for which you can use the &lt;em&gt;dos2unix&lt;/em&gt; tool (which is included in &lt;em&gt;Cygwin&lt;/em&gt;):&lt;/p&gt;
  &lt;/li&gt;
&lt;/ol&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;dos2unix jpeg2000Magic
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;ol start=&quot;3&quot;&gt;
  &lt;li&gt;Now compile the file:&lt;/li&gt;
&lt;/ol&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;file &lt;span class=&quot;nt&quot;&gt;-C&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-m&lt;/span&gt; jpeg2000Magic
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;&lt;em&gt;Et voilà&lt;/em&gt;, our compiled &lt;em&gt;magic&lt;/em&gt; file is ready for use!&lt;/p&gt;

&lt;h2 id=&quot;testing-it&quot;&gt;Testing it&lt;/h2&gt;

&lt;p&gt;For testing, run &lt;em&gt;File&lt;/em&gt; on any number of files, e.g.:&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;file &lt;span class=&quot;nt&quot;&gt;-i&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-m&lt;/span&gt; jpeg2000Magic &lt;span class=&quot;k&quot;&gt;*&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Here’s an example of the output you may get:&lt;/p&gt;

&lt;pre&gt;&lt;code class=&quot;language-data&quot;&gt;balloon.jp2:  image/jp2; charset=binary
balloon.jpf:  image/jpx; charset=binary
balloon.jpm:  image/jpm; charset=binary
Speedway.mj2: video/mj2; charset=binary
&lt;/code&gt;&lt;/pre&gt;

&lt;h2 id=&quot;submitting-magic&quot;&gt;Submitting magic&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;File&lt;/em&gt;’s &lt;a href=&quot;https://github.com/glensc/file&quot;&gt;source repository on Github&lt;/a&gt; gives a number of guidelines for submitting &lt;em&gt;magic&lt;/em&gt;. &lt;em&gt;Don’t&lt;/em&gt; use the Github repo for submitting stuff or commenting, as it is just a read-only mirror! Instead, submit your &lt;em&gt;magic&lt;/em&gt; through the &lt;a href=&quot;http://mx.gw.com/mailman/listinfo/file&quot;&gt;&lt;em&gt;File&lt;/em&gt; mailing list&lt;/a&gt;, or use the &lt;a href=&quot;http://bugs.gw.com/my_view_page.php&quot;&gt;bug tracker&lt;/a&gt;.&lt;/p&gt;

&lt;h2 id=&quot;useful-links&quot;&gt;Useful links&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;http://www.darwinsys.com/file/&quot;&gt;Fine Free File Command&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;http://manpages.ubuntu.com/manpages/precise/en/man5/magic.5.html&quot;&gt;Documentation of the &lt;em&gt;magic&lt;/em&gt; file&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://github.com/glensc/file&quot;&gt;Read-only mirror of &lt;em&gt;File&lt;/em&gt; CVS repository, updated nightly&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;http://mx.gw.com/mailman/listinfo/file&quot;&gt;&lt;em&gt;File&lt;/em&gt; mailing list&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;http://www.cygwin.com/&quot;&gt;Cygwin&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;http://www.iana.org/assignments/media-types/index.html&quot;&gt;MIME Media Types at the Internet Assigned Numbers Authority&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://github.com/bitsgalore/jp2kMagic/blob/master/magic/jpeg2000Magic&quot;&gt;Link to updated &lt;em&gt;JPEG 2000&lt;/em&gt; &lt;em&gt;magic&lt;/em&gt;&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;hr /&gt;

&lt;p&gt;Originally published at the &lt;a href=&quot;https://openpreservation.org/blog/2012/08/09/magic-editing-and-creation-primer/&quot;&gt;Open Preservation Foundation blog&lt;/a&gt;&lt;/p&gt;
</description>
                <link>https://bitsgalore.org/2012/08/09/magic-editing-and-creation-primer</link>
                <guid>https://bitsgalore.org/2012/08/09/magic-editing-and-creation-primer</guid>
                <pubDate>2012-08-09T00:00:00+02:00</pubDate>
        </item>

        <item>
                <title>PDF – Inventory of long-term preservation risks</title>
                <description>&lt;p&gt;In this blog post I’ll be dusting off some old stuff for a change. The occasion for this is the following question,  posted by Paul Wheatley on the &lt;a href=&quot;http://libraries.stackexchange.com/&quot;&gt;Libraries and Information Science Stack Exchange website&lt;/a&gt; a few days ago:&lt;/p&gt;
&lt;blockquote&gt;
  &lt;p&gt;&lt;a href=&quot;http://libraries.stackexchange.com/questions/964/what-preservation-risks-are-associated-with-the-pdf-file-format&quot;&gt;What preservation risks are associated with the PDF file format?&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;!-- more --&gt;

&lt;h2 id=&quot;report&quot;&gt;Report&lt;/h2&gt;

&lt;p&gt;This reminded me of a &lt;a href=&quot;https://zenodo.org/record/801661&quot;&gt;report I wrote on this very subject&lt;/a&gt; back in 2009. (Incidentally this was my very first foray into the wacky world of digital preservation, but that’s another story.) Originally this document was intended for internal use at the &lt;em&gt;KB&lt;/em&gt;, but looking at it again, I think it may be of interest to a wider audience. It also aligns quite nicely with the upcoming work on a knowledge base of file-format related risks that will be done as part of the &lt;a href=&quot;http://www.scape-project.eu/&quot;&gt;SCAPE&lt;/a&gt; project. The main idea here is to take a file format, identify its main (preservation-related) risks, and describe how “risky” features can be detected by existing (characterisation) tools. In fact I was envisaging something along these lines when I wrote the &lt;em&gt;PDF&lt;/em&gt; report in 2009, but other things got in the way, and I never got round to the final step. The SCAPE work should finally make this happen.&lt;/p&gt;

&lt;p&gt;Although the work on the knowledge base is still in its early stages, some very first results can be found &lt;a href=&quot;http://wiki.opf-labs.org/display/TR/Formats&quot;&gt;here&lt;/a&gt;. The initial focus will be on &lt;em&gt;JPEG 2000&lt;/em&gt; (&lt;em&gt;JP2&lt;/em&gt;/&lt;em&gt;JPX&lt;/em&gt;) and &lt;em&gt;PDF&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;As for the report, I should add that some of it is a little rough around the edges, and you may note some gaps and not-quite-finished bits. This is also why we never released it the first time around. Also, one aspect that is not well covered is &lt;em&gt;PDF&lt;/em&gt;’s potential for transmitting viruses and other malware. Nevertheless, as a general introduction to the format and an overview of its main risks I think it’s not too shabby, but I’ll let you be the judge of that! As always, feel free to use the comment fields for your feedback and suggestions.&lt;/p&gt;

&lt;h2 id=&quot;link-to-report&quot;&gt;Link to report&lt;/h2&gt;

&lt;p&gt;&lt;a href=&quot;https://zenodo.org/record/801661&quot;&gt;&lt;em&gt;Adobe Portable Document Format - Inventory of long-term preservation risks&lt;/em&gt;, KB/ National Library of the Netherlands&lt;/a&gt;&lt;/p&gt;

&lt;hr /&gt;

&lt;p&gt;Originally published at the &lt;a href=&quot;https://openpreservation.org/blog/2012/07/26/pdf-inventory-long-term-preservation-risks/&quot;&gt;Open Preservation Foundation blog&lt;/a&gt;&lt;/p&gt;
</description>
                <link>https://bitsgalore.org/2012/07/26/pdf-inventory-long-term-preservation-risks</link>
                <guid>https://bitsgalore.org/2012/07/26/pdf-inventory-long-term-preservation-risks</guid>
                <pubDate>2012-07-26T00:00:00+02:00</pubDate>
        </item>

        <item>
                <title>EPUB for archival preservation</title>
                <description>&lt;p&gt;Over the last few years, the &lt;a href=&quot;http://idpf.org/epub&quot;&gt;&lt;em&gt;EPUB&lt;/em&gt;&lt;/a&gt; format has gained widespread popularity in the consumer market. The &lt;a href=&quot;http://www.kb.nl/index-en.html&quot;&gt;KB&lt;/a&gt; has been approached by a number of publishers that wish to use &lt;em&gt;EPUB&lt;/em&gt; for delivering some of  their electronic publications. Surprisingly little information is available on the format’s suitability for archival preservation, apart from Library of Congress’ &lt;a href=&quot;http://www.digitalpreservation.gov/formats/&quot;&gt;&lt;em&gt;Sustainability of Digital Formats&lt;/em&gt;&lt;/a&gt; web pages, which contain entries on &lt;a href=&quot;http://www.digitalpreservation.gov/formats/fdd/fdd000278.shtml&quot;&gt;&lt;em&gt;EPUB&lt;/em&gt; 2&lt;/a&gt; and &lt;a href=&quot;http://www.digitalpreservation.gov/formats/fdd/fdd000308.shtml&quot;&gt;&lt;em&gt;EPUB&lt;/em&gt; 3&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;So, the KB’s Departments of Collection and Collection Care requested a more detailed investigation of &lt;em&gt;EPUB&lt;/em&gt;’s preservation credentials. More specifically, answers were needed to the following questions:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;What are the main characteristics of EPUB?&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;What functionality does EPUB provide, and is this sufficient for representing e.g. content with sophisticated layout and typography requirements?&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;How well is &lt;em&gt;EPUB&lt;/em&gt; supported by software tools that are used in (pre-)ingest workflows?&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;How suitable is &lt;em&gt;EPUB&lt;/em&gt;  for archival preservation? What are the main risks?&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;!-- more --&gt;

&lt;h2 id=&quot;epub-for-archival-preservation&quot;&gt;EPUB for archival preservation&lt;/h2&gt;

&lt;p&gt;The &lt;a href=&quot;https://zenodo.org/record/839711&quot;&gt;report &lt;em&gt;EPUB for archival preservation&lt;/em&gt;&lt;/a&gt; is a first attempt at answering these questions as well as possible. It starts out with a simple example that illustrates the general structure of an EPUB file, followed by a more in-depth discussion of specific aspects of the format. It then covers functionality-related aspects such as layout, appearance and multimedia support, and the main differences between &lt;em&gt;EPUB&lt;/em&gt; 2 and &lt;em&gt;EPUB&lt;/em&gt; 3.&lt;/p&gt;

&lt;p&gt;Support by characterisation tools is important for processing &lt;em&gt;EPUB&lt;/em&gt; files in an operational workflow, so a brief review (and some preliminary tests) of relevant identification, validation and feature extraction tools is included as well.&lt;/p&gt;

&lt;p&gt;To assess the overall suitability of &lt;em&gt;EPUB&lt;/em&gt; for preservation, the format was evaluated against a set of widely used criteria (mainly from The National Archives and Library of Congress). The final chapter wraps up the main conclusions, and suggests a number of recommendations.&lt;/p&gt;

&lt;h2 id=&quot;community-input&quot;&gt;Community input&lt;/h2&gt;

&lt;p&gt;Since it appears that not much has been published on &lt;em&gt;EPUB&lt;/em&gt; within an archival preservation context so far, we would really appreciate hearing your thoughts on the report. Is anything important missing? Did I overlook any relevant tools? Is there anything in particular that you strongly disagree with? Please use the comment fields below to let us know!&lt;/p&gt;

&lt;p&gt;In addition, the final chapter contains two subsections with &lt;em&gt;Community Recommendations&lt;/em&gt; and &lt;em&gt;Tool Recommendations&lt;/em&gt;. These are all things we can do as a community to simplify the use of &lt;em&gt;EPUB&lt;/em&gt; in archival settings. Please consider getting involved if you feel you could make a contribution.&lt;/p&gt;

&lt;h2 id=&quot;link-to-report&quot;&gt;Link to report&lt;/h2&gt;

&lt;p&gt;&lt;a href=&quot;https://zenodo.org/record/839711&quot;&gt;&lt;em&gt;EPUB for archival preservation&lt;/em&gt;, KB/ National Library of the Netherlands&lt;/a&gt;&lt;/p&gt;

&lt;hr /&gt;

&lt;p&gt;Originally published at the &lt;a href=&quot;https://openpreservation.org/blog/2012/06/18/epub-archival-preservation/&quot;&gt;Open Preservation Foundation blog&lt;/a&gt;&lt;/p&gt;
</description>
                <link>https://bitsgalore.org/2012/06/18/epub-archival-preservation</link>
                <guid>https://bitsgalore.org/2012/06/18/epub-archival-preservation</guid>
                <pubDate>2012-06-18T00:00:00+02:00</pubDate>
        </item>

        <item>
                <title>Update on jpylyzer</title>
                <description>&lt;h2 id=&quot;introduction&quot;&gt;Introduction&lt;/h2&gt;

&lt;p&gt;In this blog post I will give a brief update of the latest &lt;a href=&quot;http://jpylyzer.openpreservation.org/&quot;&gt;&lt;em&gt;jpylyzer&lt;/em&gt;&lt;/a&gt; developments. &lt;em&gt;Jpylyzer&lt;/em&gt; is a validation and feature extraction tool for the &lt;a href=&quot;http://www.jpeg.org/public/15444-1annexi.pdf&quot;&gt;JP2 (JPEG 2000 Part 1)&lt;/a&gt; still image format.&lt;/p&gt;

&lt;!-- more --&gt;

&lt;h2 id=&quot;history-of-jpylyzer&quot;&gt;History of jpylyzer&lt;/h2&gt;

&lt;p&gt;Around mid-summer 2011, the &lt;a href=&quot;http://www.kb.nl/index-en.html&quot;&gt;KB&lt;/a&gt; started initial preparations for migrating 146 TB of TIFF images from the Dutch &lt;a href=&quot;http://www.metamorfoze.nl/english/home&quot;&gt;Metamorfoze&lt;/a&gt; program to JP2. We realised that the possibility of hardware failure (e.g. short network interruptions) during the migration process would pose a major risk of creating malformed and damaged files. Around the same time, we received some rather worrying reports from the &lt;a href=&quot;http://www.bl.uk/&quot;&gt;British Library&lt;/a&gt;, who were confronted with JP2 images that contained damage that couldn’t be detected with existing tools such as &lt;a href=&quot;http://hul.harvard.edu/jhove/&quot;&gt;JHOVE&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;This prompted me to have a go at writing a rudimentary software tool that was able to detect some simple forms of file corruption in JP2. A &lt;a href=&quot;/2011/09/01/simple-jp2-file-structure-checker&quot;&gt;blog post&lt;/a&gt; I wrote on this resulted in quite a bit of feedback, and several people asked about the possibility of extending the tool’s functionality to a full-fledged JP2 validator and feature extractor. Since this fitted in nicely with some &lt;a href=&quot;http://www.scape-project.eu/&quot;&gt;SCAPE&lt;/a&gt; work that was envisaged on quality assurance in imaging workflows, I started work on a &lt;a href=&quot;/2011/12/14/prototype-jp2-validator-and-properties-extractor&quot;&gt;first prototype of &lt;em&gt;jpylyzer&lt;/em&gt;&lt;/a&gt;, which saw the light of day in December.&lt;/p&gt;

&lt;p&gt;In the remainder of this blog post I will outline the main developments that have happened since then.&lt;/p&gt;

&lt;h2 id=&quot;refactoring-of-existing-code&quot;&gt;Refactoring of existing code&lt;/h2&gt;

&lt;p&gt;Shortly after the release of the first prototype, my KB colleague René van der Ark spontaneously offered to do a refactoring job on my original code, which was clumsy and unnecessarily lengthy in places. This has resulted in code that is more modular, and which adheres more closely to established programming practices. As a result, the refactored code is significantly more maintainable than the original, which makes it easier for other programmers to contribute to &lt;em&gt;jpylyzer&lt;/em&gt;. This should also contribute to the long-term sustainability of the software.&lt;/p&gt;

&lt;h2 id=&quot;new-features&quot;&gt;New features&lt;/h2&gt;

&lt;p&gt;Since the first prototype the following functionality was added to &lt;em&gt;jpylyzer&lt;/em&gt;:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;New validator functions were added for &lt;em&gt;XML&lt;/em&gt; boxes, &lt;em&gt;UUID&lt;/em&gt; boxes, &lt;em&gt;UUID Info&lt;/em&gt; boxes, &lt;em&gt;Palette&lt;/em&gt; boxes and &lt;em&gt;Component Mapping&lt;/em&gt; boxes.&lt;/li&gt;
  &lt;li&gt;A check was added that verifies whether the number of tiles in an image matches the image- and tile size information in the codestream header.&lt;/li&gt;
  &lt;li&gt;Another check was added that verifies whether all tile-parts within each tile exist.&lt;/li&gt;
  &lt;li&gt;The ICC profile feature extraction function has been given an overhaul, and it now extracts all ICC header items. In addition, the output is now reported in a more user-friendly format.&lt;/li&gt;
  &lt;li&gt;For codestream comments, all characters from the Latin character set are now supported.&lt;/li&gt;
  &lt;li&gt;The reporting of the validation results has been made more concise. By default, &lt;em&gt;jpylyzer&lt;/em&gt; now only reports the results of validation tests that &lt;em&gt;failed&lt;/em&gt; (previously &lt;em&gt;all&lt;/em&gt; test results were reported). This behaviour can be overruled with a new &lt;em&gt;--verbose&lt;/em&gt; switch.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In addition to the above, various bugs and minor issues have been addressed as well.&lt;/p&gt;

&lt;h2 id=&quot;debian-packages&quot;&gt;Debian packages&lt;/h2&gt;

&lt;p&gt;During the SCAPE Braga meeting in February, work started on the creation of &lt;a href=&quot;http://en.wikipedia.org/wiki/Deb_%28file_format%29&quot;&gt;Debian packages&lt;/a&gt; for &lt;em&gt;jpylyzer&lt;/em&gt;. The availability of Debian packages greatly simplifies &lt;em&gt;jpylyzer&lt;/em&gt;’s installation on Linux-based systems. &lt;a href=&quot;http://www.openplanetsfoundation.org/blogs/2012-02-15-sustainability-and-adoption-preservation-tools&quot;&gt;This work&lt;/a&gt; was done by Dave Tarrant &lt;a href=&quot;http://www.southampton.ac.uk/&quot;&gt;(University of Southampton)&lt;/a&gt;, Miguel Ferreira, Rui Castro, Hélder Silva &lt;a href=&quot;http://www.keep.pt/en&quot;&gt;(KEEP Solutions)&lt;/a&gt; and Rainer Schmidt &lt;a href=&quot;http://www.ait.ac.at/&quot;&gt;(AIT)&lt;/a&gt;.&lt;/p&gt;

&lt;h2 id=&quot;jpylyzer-now-hosted-by-opf&quot;&gt;Jpylyzer now hosted by OPF&lt;/h2&gt;

&lt;p&gt;In order to make a tool sustainable, it is important that its maintenance and development are not solely dependent on one single institution or person. Because of this, &lt;em&gt;jpylyzer&lt;/em&gt; is now hosted by the Open Planets Foundation, which ensures the involvement of a wider community. &lt;em&gt;Jpylyzer&lt;/em&gt; also has &lt;a href=&quot;http://jpylyzer.openpreservation.org/&quot;&gt;its own home page on the OPF site&lt;/a&gt;. It contains links to the source code, Windows executables, Debian packages and the User Manual.&lt;/p&gt;

&lt;h2 id=&quot;jpylyzer-home-page&quot;&gt;Jpylyzer home page&lt;/h2&gt;

&lt;p&gt;&lt;a href=&quot;http://jpylyzer.openpreservation.org/&quot;&gt;http://jpylyzer.openpreservation.org/&lt;/a&gt;&lt;/p&gt;

&lt;hr /&gt;

&lt;p&gt;Originally published at the &lt;a href=&quot;https://openpreservation.org/blog/2012/04/23/update-jpylyzer/&quot;&gt;Open Preservation Foundation blog&lt;/a&gt;&lt;/p&gt;
</description>
                <link>https://bitsgalore.org/2012/04/23/update-jpylyzer</link>
                <guid>https://bitsgalore.org/2012/04/23/update-jpylyzer</guid>
                <pubDate>2012-04-23T00:00:00+02:00</pubDate>
        </item>

        <item>
                <title>Jpylyzer documentation</title>
                <description>&lt;p&gt;This will be my shortest blog post ever. Following up on &lt;a href=&quot;/2011/12/14/prototype-jp2-validator-and-properties-extractor&quot;&gt;my previous
blog post on a prototype JP2 validator and properties
extractor&lt;/a&gt;
(jpylyzer), there is now a comprehensive User Manual of the tool. Just
follow the link below:&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://jpylyzer.openpreservation.org/userManual.html&quot;&gt;http://jpylyzer.openpreservation.org/userManual.html&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Link to &lt;em&gt;jpylyzer&lt;/em&gt; home page:&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://jpylyzer.openpreservation.org/&quot;&gt;http://jpylyzer.openpreservation.org/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Meanwhile, work on &lt;em&gt;jpylyzer&lt;/em&gt; is ongoing, so watch this space for
updates.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Update February 2019&lt;/strong&gt;: updated links in original blog post&lt;/p&gt;

&lt;hr /&gt;

&lt;p&gt;Originally published at the &lt;a href=&quot;https://openpreservation.org/blog/2012/01/10/jpylyzer-documentation/&quot;&gt;Open Preservation Foundation blog&lt;/a&gt;&lt;/p&gt;
</description>
                <link>https://bitsgalore.org/2012/01/10/jpylyzer-documentation</link>
                <guid>https://bitsgalore.org/2012/01/10/jpylyzer-documentation</guid>
                <pubDate>2012-01-10T00:00:00+01:00</pubDate>
        </item>

        <item>
                <title>A prototype JP2 validator and properties extractor</title>
                <description>&lt;p&gt;A few months ago I wrote a &lt;a href=&quot;/2011/09/01/simple-jp2-file-structure-checker&quot;&gt;blog
post&lt;/a&gt;
on a simple JP2 file structure checker. This led to some interesting
online discussions on JP2 validation. Some people asked me about the
feasibility of expanding the tool to a full-fledged JP2 validator.
Despite some initial reservations, I eventually decided to dedicate a
couple of weeks to writing a rough prototype. The first results of this
work are now ready in the form of the &lt;em&gt;jpylyzer&lt;/em&gt; tool. Although I
initially intended to limit its functionality to validation (i.e.
verification against the format specifications), I quickly realised that
since validation would require the tool to extract and verify all header
properties anyway, it would make little sense not to include this
information in its output. As a result, &lt;em&gt;jpylyzer&lt;/em&gt; is both a validator
and a properties extractor.&lt;/p&gt;

&lt;!-- more --&gt;

&lt;h2 id=&quot;validation-in-a-nutshell&quot;&gt;Validation in a nutshell&lt;/h2&gt;

&lt;p&gt;It is beyond the scope of this blog post to provide an in-depth
description of how &lt;em&gt;jpylyzer&lt;/em&gt; validates a JP2 file. This will all be
covered in detail by a comprehensive user manual, which I will try to
write over the following weeks. For now I will restrict myself to a very
brief overview. First of all, it is helpful here to know that internally
a JP2 file is made up of a number of building blocks that are called
‘boxes’. This is illustrated by the figure below:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/2011/12/jp2Boxes.png&quot; alt=&quot;JP2 Boxes diagram&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Some of these boxes are required (indicated by solid lines in the
figure), whereas others (depicted with dashed lines) are optional. Some
boxes are ‘superboxes’ that contain other boxes. A number of boxes can
have multiple instances, whereas others are always unique. In addition,
the order in which the boxes may appear in a JP2 file is subject to
certain restrictions. This is all defined by the
&lt;a href=&quot;http://www.jpeg.org/public/15444-1annexi.pdf&quot;&gt;standard&lt;/a&gt;. At the highest
level, &lt;em&gt;jpylyzer&lt;/em&gt; parses the box structure of a file and checks whether
it follows the standard. At a lower level, the information that is
contained within the boxes is often subject to restrictions as well. For
instance, the header field that defines how the colour space of an image
is specified only has two legal values; any other value is meaningless
and would therefore invalidate the file. Finally, there are a number of
interdependencies between property values. For instance, if the value of
the ‘Bits Per Component’ field of an image equals 255, this implies that
the JP2 Header box contains a ‘Bits Per Component’ box. There are
numerous other examples; the important thing here is that I have tried
to make &lt;em&gt;jpylyzer&lt;/em&gt; as exhaustive as possible in this regard.&lt;/p&gt;
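
&lt;p&gt;To make these levels of checks a bit more concrete, here is a minimal Python sketch. It is not taken from the &lt;em&gt;jpylyzer&lt;/em&gt; source code: the names &lt;code&gt;iter_boxes&lt;/code&gt;, &lt;code&gt;check_jp2_header&lt;/code&gt; and &lt;code&gt;make_box&lt;/code&gt; are made up for illustration. The byte offsets and the legal values 1 and 2 for the colour specification method are assumptions based on a reading of the standard. For simplicity the sketch only looks at the last box of each type.&lt;/p&gt;

```python
def iter_boxes(data):
    """Yield (box_type, payload) for each box found in a byte string."""
    pos = 0
    while len(data) - pos >= 8:
        length = int.from_bytes(data[pos:pos + 4], "big")
        box_type = data[pos + 4:pos + 8]
        header = 8
        if length == 1:
            # Extended length: the real length follows in 8 extra bytes
            length = int.from_bytes(data[pos + 8:pos + 16], "big")
            header = 16
        elif length == 0:
            # Box extends to the end of the data
            length = len(data) - pos
        if header > length or length > len(data) - pos:
            break  # Implausible length, stop parsing
        yield box_type, data[pos + header:pos + length]
        pos += length


def check_jp2_header(jp2h_payload):
    """Return a list of problems found in the payload of a JP2 Header box."""
    problems = []
    children = dict(iter_boxes(jp2h_payload))
    ihdr = children.get(b"ihdr")
    if ihdr is None or len(ihdr) != 14:
        return ["missing or malformed Image Header box"]
    # Interdependency: a Bits Per Component value of 255 (byte 11 of the
    # Image Header payload) implies that a Bits Per Component box exists
    if ihdr[10] == 255 and b"bpcc" not in children:
        problems.append("BPC equals 255, but no Bits Per Component box")
    # Restricted value: the colour specification method must be 1 or 2
    colr = children.get(b"colr")
    if not colr:
        problems.append("missing Colour Specification box")
    elif colr[0] not in (1, 2):
        problems.append("illegal colour specification method")
    return problems


def make_box(box_type, payload):
    """Build a synthetic box (hypothetical helper, only for trying things out)."""
    return (8 + len(payload)).to_bytes(4, "big") + box_type + payload
```

&lt;p&gt;Feeding &lt;code&gt;check_jp2_header&lt;/code&gt; a few synthetic boxes made with &lt;code&gt;make_box&lt;/code&gt; is an easy way to see each check trigger. A real validator needs many more checks than these.&lt;/p&gt;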

&lt;p&gt;It is also worth pointing out that &lt;em&gt;jpylyzer&lt;/em&gt; checks whether any
embedded ICC profiles are actually allowed, as JP2 has a number of
restrictions in this regard. There is a slight (intentional) deviation
from the standard here, as an
&lt;a href=&quot;http://jpeg2000wellcomelibrary.blogspot.com/2011/04/guest-post-color-in-jp2.html&quot;&gt;amendment&lt;/a&gt;
to the standard is currently in preparation that will allow the use of
“display device” profiles in JP2. The current version of &lt;em&gt;jpylyzer&lt;/em&gt; is
already anticipating this change, and will consider JP2s that contain
such ICC profiles valid (provided that they do not contain any other
errors of course).&lt;/p&gt;

&lt;h2 id=&quot;whats-not-included-yet&quot;&gt;What’s not included yet&lt;/h2&gt;

&lt;p&gt;As this is a first prototype, &lt;em&gt;jpylyzer&lt;/em&gt; is still a work in progress.
Although most aspects of the JP2 file format are covered, a few things
are still missing at this stage:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Support of the Palette and Component Mapping boxes (which are
optional sub-boxes in the JP2 Header Box) is not included yet. The
current version of &lt;em&gt;jpylyzer&lt;/em&gt; recognises these boxes, but doesn’t
perform any analyses on them. This will change in upcoming versions.&lt;/li&gt;
  &lt;li&gt;The IPR, XML, UUID and UUID Info boxes are not yet supported either.&lt;/li&gt;
  &lt;li&gt;The analysis and validation of the image codestream is still
somewhat limited. Currently &lt;em&gt;jpylyzer&lt;/em&gt; reads and validates the
required parts of the main codestream header (for those who are in
the know on this: the SIZ, COD and QCD markers). It also checks if
the information in the codestream header is consistent with the JP2
image header (the information in both headers is partially
redundant). Finally, it loops through all tile parts in an image,
and checks if the length (in bytes) of each tile-part is consistent
with the markers that delineate the start and end of each tile-part
in the codestream. This is particularly useful for detecting certain
types of image corruption where one or more bytes are missing from
the codestream (either at the end or in the middle).&lt;/li&gt;
  &lt;li&gt;For now only codestream comments that consist solely of ASCII
characters are reported. As the standard permits the use of
non-ASCII characters of the Latin (ISO/IEC 8859-15) character set,
this means that codestream comments that contain e.g. accented
characters are currently not reported by &lt;em&gt;jpylyzer&lt;/em&gt;. This will
change in upcoming versions.&lt;/li&gt;
&lt;/ul&gt;
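
&lt;p&gt;The tile-part length check mentioned in the list above can be sketched as follows. This is an illustration under stated assumptions, not the actual &lt;em&gt;jpylyzer&lt;/em&gt; code: the function names are hypothetical, and the marker codes and the position of the tile-part length field (Psot) follow a reading of the codestream syntax in the standard. The idea is to hop from one tile-part to the next using the length field, and to verify that each hop lands on either a new start-of-tile-part marker or the end-of-codestream marker.&lt;/p&gt;

```python
SOC = b"\xff\x4f"  # start of codestream
SOT = b"\xff\x90"  # start of tile-part
SOD = b"\xff\x93"  # start of data
EOC = b"\xff\xd9"  # end of codestream


def tile_part_lengths_ok(cs):
    """Check that each tile-part length (Psot) leads to the next marker."""
    if cs[0:2] != SOC:
        return False
    pos = 2
    # Skip the marker segments of the main header (SIZ, COD, QCD, ...)
    while cs[pos:pos + 2] != SOT:
        if cs[pos:pos + 1] != b"\xff":
            return False  # Not a marker: header is damaged or truncated
        seg_len = int.from_bytes(cs[pos + 2:pos + 4], "big")
        if 2 > seg_len:
            return False
        pos += 2 + seg_len
    # Hop from tile-part to tile-part using the Psot field
    while cs[pos:pos + 2] == SOT:
        psot = int.from_bytes(cs[pos + 6:pos + 10], "big")
        if psot == 0:
            # Psot of 0: the tile-part runs up to the EOC marker
            return cs.endswith(EOC)
        if 14 > psot:
            return False  # Shorter than an SOT segment plus SOD marker
        pos += psot
    return cs[pos:pos + 2] == EOC


def make_tile_part(index, data):
    """Build a synthetic tile-part (hypothetical helper for experiments)."""
    psot = 12 + 2 + len(data)  # SOT segment + SOD marker + data
    sot = (SOT + (10).to_bytes(2, "big") + index.to_bytes(2, "big")
           + psot.to_bytes(4, "big") + bytes([0, 1]))
    return sot + SOD + data
```

&lt;p&gt;If bytes are removed from the middle or the end of a codestream that was built with &lt;code&gt;make_tile_part&lt;/code&gt;, the hop no longer lands on a marker, and &lt;code&gt;tile_part_lengths_ok&lt;/code&gt; returns &lt;code&gt;False&lt;/code&gt;.&lt;/p&gt;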

&lt;h2 id=&quot;downloads&quot;&gt;Downloads&lt;/h2&gt;

&lt;p&gt;You can download the source code of &lt;em&gt;jpylyzer&lt;/em&gt; from the following
location&lt;sup id=&quot;fnref:1&quot;&gt;&lt;a href=&quot;#fn:1&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot; role=&quot;doc-noteref&quot;&gt;1&lt;/a&gt;&lt;/sup&gt;:&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://github.com/bitsgalore/jpylyzer/&quot;&gt;https://github.com/bitsgalore/jpylyzer/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This requires &lt;a href=&quot;http://www.python.org/download/releases/2.7.2/&quot;&gt;Python
2.7&lt;/a&gt;, or &lt;a href=&quot;http://www.python.org/getit/releases/3.2/&quot;&gt;Python
3.2&lt;/a&gt; or more recent. A word
of warning though: for a number of reasons the source code ended up
somewhat clumsy and unnecessarily verbose (one colleague even remarked
that looking at it induced nostalgic memories of good old GW-BASIC!).
With some major refactoring the overall length of the code could
probably be reduced to half its current size, and I may have a go at
this at some later point. For now it’ll have to do as it is!&lt;/p&gt;

&lt;p&gt;I also created some Windows binaries for those who do not want to
install Python. Just follow the link below, download the ZIP file and
extract it to an empty directory. Then simply use ‘jpylyzer.exe’
directly on the command line:&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://github.com/bitsgalore/jpylyzer/downloads&quot;&gt;https://github.com/bitsgalore/jpylyzer/downloads&lt;/a&gt;&lt;/p&gt;

&lt;h2 id=&quot;a-final-word&quot;&gt;A final word&lt;/h2&gt;

&lt;p&gt;Keep in mind that the current version of &lt;em&gt;jpylyzer&lt;/em&gt; is still a
prototype. There may (and probably will) be unresolved bugs, and it
really shouldn’t be used in any operational workflows at this stage. So
far I have tested it with a range of JPEG 2000 images (mostly JP2, but
also some JPX), including some images that I deliberately corrupted. No
matter how corrupt an image is, this should never cause &lt;em&gt;jpylyzer&lt;/em&gt; to
crash. Therefore, it would be extremely useful if people could test the
tool on their worst and weirdest images, and report back in case of any
unexpected results. For early 2012 I’m also planning to write a
comprehensive user guide that gives some more details on the validation
process, as well as an explanation of the reported properties. Support
of the JP2 boxes that are currently missing will also follow around that
time. Meanwhile, any feedback on the current prototype is highly
appreciated!&lt;/p&gt;

&lt;hr /&gt;

&lt;p&gt;Originally published at the &lt;a href=&quot;https://openpreservation.org/blog/2011/12/14/prototype-jp2-validator-and-properties-extractor/&quot;&gt;Open Preservation Foundation blog&lt;/a&gt;&lt;/p&gt;
&lt;div class=&quot;footnotes&quot; role=&quot;doc-endnotes&quot;&gt;
  &lt;ol&gt;
    &lt;li id=&quot;fn:1&quot;&gt;
      &lt;p&gt;Postscript, February 2019 - all links in this section point to an insanely outdated version of &lt;em&gt;jpylyzer&lt;/em&gt;! Don’t use them, but go to the &lt;em&gt;jpylyzer&lt;/em&gt; web site at &lt;a href=&quot;http://jpylyzer.openpreservation.org/&quot;&gt;http://jpylyzer.openpreservation.org/&lt;/a&gt; instead (unless you’re some future software historian with an unlikely interest in the history of &lt;em&gt;jpylyzer&lt;/em&gt;). &lt;a href=&quot;#fnref:1&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
    &lt;/li&gt;
  &lt;/ol&gt;
&lt;/div&gt;
</description>
                <link>https://bitsgalore.org/2011/12/14/prototype-jp2-validator-and-properties-extractor</link>
                <guid>https://bitsgalore.org/2011/12/14/prototype-jp2-validator-and-properties-extractor</guid>
                <pubDate>2011-12-14T00:00:00+01:00</pubDate>
        </item>

        <item>
                <title>Evaluation of identification tools&#58; first results from SCAPE</title>
                <description>&lt;p&gt;As I already briefly mentioned in a &lt;a href=&quot;/2011/07/11/improved-identification-xml-python-experiment&quot;&gt;previous blog
post&lt;/a&gt;,
one of the objectives of the
&lt;a href=&quot;http://www.scape-project.eu&quot;&gt;SCAPE&lt;/a&gt; project is to develop an
architecture that will enable large scale characterisation of digital
file objects. As a first step, we are evaluating existing
characterisation tools. The overall aim of this work is twofold. First,
we want to establish which tools are suitable candidates for inclusion
in the SCAPE architecture. As the enhancement of existing tools is
another goal of SCAPE, the evaluation is also aimed at getting a better
idea of the specific strengths and weaknesses of each individual tool.
The outcome of this will be helpful for deciding what modifications and
improvements are needed. Also, many of these tools are widely used
outside of the SCAPE project, which means that the results will most
likely be relevant to a wider audience (including the original tool
developers).&lt;/p&gt;

&lt;!-- more --&gt;

&lt;h2 id=&quot;evaluation-of-identification-tools&quot;&gt;Evaluation of identification tools&lt;/h2&gt;

&lt;p&gt;Over the last months, work on this has focused on format identification
tools. This has resulted in a
&lt;a href=&quot;https://zenodo.org/record/840345&quot;&gt;report&lt;/a&gt;
which is attached to this blog post. We have evaluated the following
tools:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;http://sourceforge.net/apps/mediawiki/droid/index.php?title=Main_Page&quot;&gt;DROID&lt;/a&gt;
6.0&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://github.com/openplanets/fido&quot;&gt;FIDO&lt;/a&gt; 0.9&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;http://www.darwinsys.com/file/&quot;&gt;Unix File Utility&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;http://code.google.com/p/fits/&quot;&gt;FITS&lt;/a&gt; 0.5&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;https://bitbucket.org/jhove2/main/wiki/Home&quot;&gt;JHOVE2&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All tools were evaluated against a set of 22 criteria. Extensive testing
using real data has been a key part of the work. One area which, I
think, we haven’t been able to tackle sufficiently so far is the
accuracy of the tools. This is problematic, since it would require a
test corpus where the format of each file object is known &lt;em&gt;a priori&lt;/em&gt;. In
most large data sets this information will be derived from the very same
tools that we are trying to test, so we need to see if we can say
anything meaningful about this in a follow-up.&lt;/p&gt;

&lt;h2 id=&quot;involvement-of-tool-developers&quot;&gt;Involvement of tool developers&lt;/h2&gt;

&lt;p&gt;Over the past months we have sent earlier drafts of this report to
the developers of DROID, FIDO, FITS and JHOVE2, and we have received a
lot of feedback in response. In the case of FIDO, a new version
is underway, and this should correct most (if not all) of the problems
that are mentioned in the report. For the other tools we have also
received confirmation that some of the issues we found will be fixed in
upcoming releases.&lt;/p&gt;

&lt;h2 id=&quot;status-of-the-report-and-future-work&quot;&gt;Status of the report and future work&lt;/h2&gt;

&lt;p&gt;The attached report should be seen as a living document. There will
probably be one or more updates at some later point, and we may decide
to include more tests using additional data. Meanwhile, as always, we
appreciate any of your feedback on this!&lt;/p&gt;

&lt;h2 id=&quot;link-to-report&quot;&gt;Link to report&lt;/h2&gt;

&lt;p&gt;&lt;a href=&quot;https://zenodo.org/record/840345&quot;&gt;Evaluation of characterisation tools – Part 1:
Identification&lt;/a&gt;&lt;/p&gt;

&lt;hr /&gt;

&lt;p&gt;Originally published at the &lt;a href=&quot;https://openpreservation.org/blog/2011/09/21/evaluation-identification-tools-first-results-scape/&quot;&gt;Open Preservation Foundation blog&lt;/a&gt;&lt;/p&gt;
</description>
                <link>https://bitsgalore.org/2011/09/21/evaluation-identification-tools-first-results-scape</link>
                <guid>https://bitsgalore.org/2011/09/21/evaluation-identification-tools-first-results-scape</guid>
                <pubDate>2011-09-21T00:00:00+02:00</pubDate>
        </item>

        <item>
                <title>A simple JP2 file structure checker</title>
                <description>&lt;p&gt;Over the last few weeks I’ve been working on the design of a workflow
that the KB is planning to use for the migration of a collection of
(mostly old) TIFF images to JP2. One major risk of such a migration is
that hardware failures during the migration process may result in
corrupted images. For instance, one could imagine a brief network or
power interruption that occurs while an image is being written to disk.
In that case data may be missing from the written file. Ideally we would
be able to detect such errors using format validation tools such as
&lt;a href=&quot;http://hul.harvard.edu/jhove/&quot;&gt;JHOVE&lt;/a&gt;. Some time ago Paul Wheatley
reported that the BL at some point were dealing with corrupted,
incomplete JP2 files that were nevertheless deemed “well-formed and
valid” by JHOVE. So I started doing some experiments in which I
deliberately butchered up some images, and subsequently checked to what
extent existing tools would detect this.&lt;/p&gt;

&lt;!-- more --&gt;

&lt;p&gt;I started out with removing some trailing bytes from a lossily
compressed JP2 image. As it turned out, I could remove most of the image
code stream (reducing the original 2 MB image to a mere 4 kilobytes!),
but JHOVE would still say the file was “well-formed and valid”. I was
also able to open and render these files with viewer applications such
as Adobe Photoshop, Kakadu’s viewer and Irfanview. The behaviour of the
viewer apps isn’t really a surprise, since the ability to render an
image without having to load the entire code stream is actually one of
the features that make JPEG 2000 so interesting for many access
applications. JHOVE’s behaviour was a bit more surprising, and perhaps
slightly worrying.&lt;/p&gt;

&lt;h2 id=&quot;jp2structcheck-tool&quot;&gt;jp2StructCheck tool&lt;/h2&gt;

&lt;p&gt;This made me wonder about a way to detect incomplete code streams in JP2
files. A quick glance at the
&lt;a href=&quot;http://www.jpeg.org/public/15444-1annexi.pdf&quot;&gt;standard&lt;/a&gt; revealed that
image code streams should always be terminated by a two-byte ‘end of
codestream marker’. As this is something that is straightforward to
check, I fired up &lt;a href=&quot;http://www.python.org/&quot;&gt;Python&lt;/a&gt; and ended up writing
a very simple &lt;a href=&quot;https://github.com/bitsgalore/jp2StructCheck&quot;&gt;JP2 file structure
checker&lt;/a&gt;. Since the image
code stream in JP2 does not have to be located at the end of the file
(even though it usually is), it is necessary to do a superficial parsing
of JP2’s ‘box’ structure (which is documented
&lt;a href=&quot;http://www.jpeg.org/public/15444-1annexi.pdf&quot;&gt;here&lt;/a&gt;). So I thought I
might as well include an additional check that verifies if the JP2
contains all required boxes.&lt;/p&gt;

&lt;p&gt;In brief, when &lt;em&gt;jp2StructCheck&lt;/em&gt; analyses a file, it first parses the
top-level box structure, and collects the unique identifiers (or marker
codes) of all boxes. If it encounters the box that contains the code
stream, it checks if the code stream is terminated by a valid
end-of-codestream marker. Finally, it checks if the file contains all
the compulsory/required top-level boxes. These are:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;JPEG 2000 signature box&lt;/li&gt;
  &lt;li&gt;File Type box&lt;/li&gt;
  &lt;li&gt;JP2 Header box&lt;/li&gt;
  &lt;li&gt;Contiguous Codestream box&lt;/li&gt;
&lt;/ul&gt;
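
&lt;p&gt;The steps above can be sketched in a few lines of Python. This is a simplified illustration and not the actual &lt;em&gt;jp2StructCheck&lt;/em&gt; code: the function names and the returned values are hypothetical. The four-character box codes and the end-of-codestream marker value (0xFFD9) are taken from the standard.&lt;/p&gt;

```python
REQUIRED_BOXES = {
    b"jP  ": "JPEG 2000 signature box",
    b"ftyp": "File Type box",
    b"jp2h": "JP2 Header box",
    b"jp2c": "Contiguous Codestream box",
}
END_OF_CODESTREAM = b"\xff\xd9"


def top_level_boxes(data):
    """Yield (box_type, payload) for each top-level box in a JP2 byte string."""
    pos = 0
    while len(data) - pos >= 8:
        length = int.from_bytes(data[pos:pos + 4], "big")
        box_type = data[pos + 4:pos + 8]
        header = 8
        if length == 1:
            # Extended length: the real length follows in 8 extra bytes
            length = int.from_bytes(data[pos + 8:pos + 16], "big")
            header = 16
        elif length == 0:
            # Box extends to the end of the file
            length = len(data) - pos
        if header > length or length > len(data) - pos:
            break  # Length does not fit in the file, stop parsing
        yield box_type, data[pos + header:pos + length]
        pos += length


def check_structure(data):
    """Return (names of missing required boxes, end-of-codestream marker found)."""
    found = set()
    eoc_found = False
    for box_type, payload in top_level_boxes(data):
        found.add(box_type)
        if box_type == b"jp2c":
            eoc_found = payload.endswith(END_OF_CODESTREAM)
    missing = [name for box_type, name in REQUIRED_BOXES.items()
               if box_type not in found]
    return missing, eoc_found
```

&lt;p&gt;Note that in this sketch a truncated codestream box with an explicit length shows up as a missing box, because its length no longer fits in the file. Only a box that is declared to run to the end of the file reaches the marker check.&lt;/p&gt;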

&lt;p&gt;In order to test the box checking mechanism I did some additional image
butchering, where I deliberately changed the tags of existing boxes so
that they wouldn’t be recognised. When I subsequently ran these images
through JHOVE, this revealed some additional surprises. For instance,
after changing the markers of the Contiguous Codestream box or even the
JP2 Header box (which effectively makes them unrecognisable), JHOVE
would still report these images as “well-formed and valid” (although in
the case of the missing JP2 Header box JHOVE did report an error).&lt;/p&gt;

&lt;h2 id=&quot;limitations-of-jp2structcheck&quot;&gt;Limitations of jp2StructCheck&lt;/h2&gt;

&lt;p&gt;It is important to note here that &lt;em&gt;jp2StructCheck&lt;/em&gt; only checks the
top-level boxes. In case of a superbox (which is a box that contains
child boxes), it does not recurse into its child boxes. For example, it
does not check if a JP2 Header box (which is a superbox) contains an
Image Header box (which is required by the standard). So the scope of
the tool is limited to a rather superficial check of the general file
structure. It is &lt;em&gt;not&lt;/em&gt; a JP2 validator, and it is certainly not a
replacement for JHOVE (which performs a more in-depth analysis)! Its
main purpose is to detect certain types of file corruption that
may occur as a result of hardware failure (e.g. network interruptions)
during the creation of an image.&lt;/p&gt;

&lt;p&gt;In addition, the fact that a code stream is terminated by an
end-of-codestream marker is no guarantee that the code stream is
complete. For instance, if due to some hardware failure some part of the
middle of the codestream is not written, &lt;em&gt;jp2StructCheck&lt;/em&gt; will not
detect this! It may be possible to improve the level of error detection
by including additional codestream markers. I might look into this at
some later point.&lt;/p&gt;

&lt;h2 id=&quot;downloads&quot;&gt;Downloads&lt;/h2&gt;

&lt;p&gt;I created a &lt;a href=&quot;https://github.com/bitsgalore/jp2StructCheck&quot;&gt;Github
repository&lt;/a&gt; that contains
the source code of &lt;em&gt;jp2StructCheck&lt;/em&gt;, some documentation, and a small
data set with some test images.&lt;/p&gt;

&lt;p&gt;As some people may not want to install Python on their system, I also
created a &lt;a href=&quot;https://github.com/downloads/bitsgalore/jp2StructCheck/jp2StructCheck31082011distWin32.zip&quot;&gt;binary distribution that should work on most Windows
systems&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The documentation (in PDF format) is
&lt;a href=&quot;https://github.com/downloads/bitsgalore/jp2StructCheck/jp2StructCheck.pdf&quot;&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Finally, use &lt;a href=&quot;https://github.com/downloads/bitsgalore/jp2StructCheck/testImages.zip&quot;&gt;this
link&lt;/a&gt;
to download the test images.&lt;/p&gt;

&lt;h2 id=&quot;final-notes&quot;&gt;Final notes&lt;/h2&gt;

&lt;p&gt;I’m curious to hear if anyone finds &lt;em&gt;jp2StructCheck&lt;/em&gt; useful at all, so
please feel free to use the comment fields below for your feedback
(including reports on any bugs that may exist).&lt;/p&gt;

&lt;h2 id=&quot;post-script-february-2019&quot;&gt;Post script, February 2019&lt;/h2&gt;

&lt;p&gt;The &lt;em&gt;jp2StructCheck&lt;/em&gt; tool is superseded by &lt;a href=&quot;http://jpylyzer.openpreservation.org/&quot;&gt;&lt;em&gt;jpylyzer&lt;/em&gt;&lt;/a&gt;
(of which &lt;em&gt;jp2StructCheck&lt;/em&gt; was an early precursor). Unlike &lt;em&gt;jp2StructCheck&lt;/em&gt;, &lt;em&gt;jpylyzer&lt;/em&gt; is a
full-fledged validator for the &lt;em&gt;JP2&lt;/em&gt; format.&lt;/p&gt;

&lt;hr /&gt;

&lt;p&gt;Originally published at the &lt;a href=&quot;https://openpreservation.org/blog/2011/09/01/simple-jp2-file-structure-checker/&quot;&gt;Open Preservation Foundation blog&lt;/a&gt;&lt;/p&gt;
</description>
                <link>https://bitsgalore.org/2011/09/01/simple-jp2-file-structure-checker</link>
                <guid>https://bitsgalore.org/2011/09/01/simple-jp2-file-structure-checker</guid>
                <pubDate>2011-09-01T00:00:00+02:00</pubDate>
        </item>

        <item>
                <title>Improved identification of XML&#58; a Python experiment</title>
                <description>&lt;p&gt;As a part of the &lt;a href=&quot;http://www.scape-project.eu&quot;&gt;SCAPE&lt;/a&gt; project, I’m
currently heavily involved in the evaluation of various file format
identification tools. The overall aim of this work is to determine which
tools are suitable candidates for inclusion in the SCAPE architecture.
In addition, we’re also trying to get a better idea of each tool’s
specific strengths and weaknesses, which will hopefully serve as useful
input to the developers community. We’re actually planning to publish
the first results of this work on the OPF blog some time soon, so you
may want to keep your eyes peeled for that.&lt;/p&gt;

&lt;!-- more --&gt;

&lt;h2 id=&quot;identification-using-byte-signatures&quot;&gt;Identification using byte signatures&lt;/h2&gt;

&lt;p&gt;In this blog entry I will focus on one particular area in which most
identification tools appear to be struggling: the identification of XML
files. Most identification tools try to establish a file’s format by
looking for characteristic byte sequences, or ‘signatures’. Examples of
tools that use this approach are
&lt;a href=&quot;http://sourceforge.net/projects/droid/&quot;&gt;DROID&lt;/a&gt;,
&lt;a href=&quot;https://github.com/openplanets/fido&quot;&gt;Fido&lt;/a&gt; and the &lt;a href=&quot;http://darwinsys.com/file/&quot;&gt;Unix File
tool&lt;/a&gt;. Signature-based identification works
well for most binary formats, but for text-based formats the results are
often less reliable. This also applies to XML. Signature-based tools
typically identify XML by the presence of an XML declaration, which, in
its simplest form, looks like this:&lt;/p&gt;

&lt;div class=&quot;language-xml highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;cp&quot;&gt;&amp;lt;?xml version=&quot;1.0&quot;?&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The problem is that not all XML files actually contain an XML
declaration, as its use is not mandatory.
The &lt;a href=&quot;http://www.w3.org/TR/xml/&quot;&gt;XML specification&lt;/a&gt; states that XML
documents &lt;em&gt;should&lt;/em&gt; begin with an XML declaration. This
merely means that the use of the declaration is recommended (which
follows from the use of the word &lt;em&gt;“should”&lt;/em&gt; and not &lt;em&gt;“must”&lt;/em&gt;). XML files
that don’t contain the declaration may therefore still be
“well-formed”.&lt;/p&gt;

&lt;p&gt;However, if (part of) the declaration is used as a signature, this means
that any files that don’t have the declaration will not be identified as
XML by any of the above tools. This is exactly what happened in our
tests for DROID, Fido and the Unix File tool. DROID and Fido simply
leave such files unidentified, whereas the Unix File tool identifies
them as ‘plain text’ (which, of course, is correct at a lower level, but
not very helpful). Unfortunately, such files are pretty common in
practice.&lt;/p&gt;

&lt;h2 id=&quot;using-an-xml-parser-to-identify-xml&quot;&gt;Using an XML parser to identify XML&lt;/h2&gt;

&lt;p&gt;A different approach to identify these files would be to run them
through an XML parser. If a parser can make sense of a file’s contents
this means it is well-formed (but not necessarily valid!) XML. In all
other cases, it’s something else.&lt;/p&gt;

&lt;p&gt;I ended up writing some &lt;a href=&quot;http://www.python.org/&quot;&gt;Python&lt;/a&gt; code to see how
this would work in practice. I first created two re-usable Python
functions that check any given file for well-formedness using Python’s
highly performant ‘expat’ parser (based on &lt;a href=&quot;http://code.activestate.com/recipes/52256-check-xml-well-formedness/&quot;&gt;original code by Farhad
Fouladi&lt;/a&gt;).
I then wrote a simple command-line application around it, which is
called “isXMLDemo.py”. The demo can be used to analyse one file at a
time, or, alternatively, all files in a directory tree. The output is a
formatted text file that contains, for each analysed file, the
identification result (which is either “isXML” or “noXML”).&lt;/p&gt;
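
&lt;p&gt;The core of this approach fits in a few lines. The sketch below is not the code of “isXMLDemo.py” itself: the function names are assumptions made for illustration, and only the two result labels are taken from the description above. It uses Python’s standard &lt;code&gt;xml.parsers.expat&lt;/code&gt; module and treats any parse error as “not XML”.&lt;/p&gt;

```python
import xml.parsers.expat


def is_well_formed(data):
    """Return True if the bytes in data parse as well-formed XML."""
    parser = xml.parsers.expat.ParserCreate()
    try:
        # The second argument tells expat that this is the final chunk
        parser.Parse(data, True)
    except xml.parsers.expat.ExpatError:
        return False
    return True


def identify(path):
    """Label a file as 'isXML' or 'noXML' (hypothetical wrapper function)."""
    with open(path, "rb") as f:
        return "isXML" if is_well_formed(f.read()) else "noXML"
```

&lt;p&gt;Because the parser does not depend on the XML declaration, a well-formed file without a declaration is still reported as “isXML”, whereas plain text and binary files end up as “noXML”.&lt;/p&gt;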

&lt;p&gt;I was surprised at how fast the XML parsing actually is. To give an
indication, I used “isXMLDemo.py” to analyse a 1.15 GB dataset that
contains 11,892 file objects. I ran this experiment under Microsoft
Windows XP Professional using a PC with a 3 GHz GenuineIntel processor
and 1 GB RAM. The total time needed to analyse all files was about 90
seconds, which corresponds to an average throughput of about 131 files
per second.&lt;/p&gt;

&lt;h2 id=&quot;xml-parsing-in-fido&quot;&gt;XML parsing in Fido?&lt;/h2&gt;

&lt;p&gt;Since the core functions that do the actual XML parsing are completely
reusable, it would probably be fairly easy to incorporate this kind of
identification into Fido. This would obviously have some impact on
Fido’s performance, but not by very much. XML parsing could also be
offered as an option. In that case, the decision on whether to parse or
not to parse is up to the user.&lt;/p&gt;

&lt;p&gt;An obvious limitation of this approach is that it will not identify XML
that is not well-formed. Also, it makes the line between identification
and validation somewhat blurry, but in practical terms that shouldn’t be
a real problem. Finally, one could argue that knowing that a file
contains XML is not very informative at all, since it is merely a
container for something else.  This was the subject of an &lt;a href=&quot;http://www.openplanetsfoundation.org/blogs/2011-02-17-new-direction-file-characterisation&quot;&gt;earlier blog
post by Asger
Blekinge&lt;/a&gt;.
However, even then, identifying the container is a necessary first step,
and one that the current tools don’t seem to be too good at yet.&lt;/p&gt;

&lt;h2 id=&quot;demo&quot;&gt;Demo&lt;/h2&gt;

&lt;p&gt;For those who want to do some tests for themselves, I have &lt;strike&gt;attached the
demo script to this post&lt;/strike&gt; &lt;a href=&quot;https://github.com/bitsgalore/isXMLDemo&quot;&gt;uploaded the demo script to GitHub&lt;/a&gt;. The &lt;strike&gt;ZIP file&lt;/strike&gt; repository contains the Python script with its  documentation in PDF format. If you end up with any interesting
results, or if you have any other thoughts on this: please report back
in the comments!&lt;/p&gt;

&lt;hr /&gt;

&lt;p&gt;Originally published at the &lt;a href=&quot;https://openpreservation.org/blog/2011/07/11/improved-identification-xml-python-experiment/&quot;&gt;Open Preservation Foundation blog&lt;/a&gt;&lt;/p&gt;
</description>
                <link>https://bitsgalore.org/2011/07/11/improved-identification-xml-python-experiment</link>
                <guid>https://bitsgalore.org/2011/07/11/improved-identification-xml-python-experiment</guid>
                <pubDate>2011-07-11T00:00:00+02:00</pubDate>
        </item>

        <item>
                <title>Paper on JPEG 2000 for preservation</title>
                <description>&lt;p&gt;The JPEG 2000 compression standard is steadily becoming more and more
popular in the archival community. Several large (national) libraries
are now using the &lt;a href=&quot;http://www.jpeg.org/public/15444-1annexi.pdf&quot;&gt;JP2
format&lt;/a&gt; (which
corresponds to Part 1 of the standard) as the master format in mass
digitisation projects. However, some aspects of the JP2 file format are
defined in ways that are open to multiple interpretations. This applies
to the embedding of &lt;a href=&quot;http://en.wikipedia.org/wiki/ICC_profile&quot;&gt;ICC
profiles&lt;/a&gt; (which
are used to define colour space information), and the definition of grid
resolution. This situation has led to a number of interoperability
issues that are potential risks for long-term preservation.&lt;/p&gt;

&lt;!-- more --&gt;

&lt;h2 id=&quot;paper&quot;&gt;Paper&lt;/h2&gt;

&lt;p&gt;I recently addressed this in a
&lt;a href=&quot;http://www.dlib.org/dlib/may11/vanderknijff/05vanderknijff.html&quot;&gt;paper&lt;/a&gt;
that has just been published in &lt;a href=&quot;http://www.dlib.org/&quot;&gt;D-Lib
Magazine&lt;/a&gt;. An earlier version of the
paper was used as a ‘defect report’ by the &lt;a href=&quot;http://www.jpeg.org/&quot;&gt;JPEG
committee&lt;/a&gt;. The paper gives a detailed
description of the problems, and shows to what extent the most
widely-used JPEG 2000 encoders are affected by these issues.&lt;/p&gt;

&lt;h2 id=&quot;solutions&quot;&gt;Solutions&lt;/h2&gt;

&lt;p&gt;The paper also suggests some possible solutions. Importantly, none of
the found problems require any changes to the actual file format;
rather, some features should simply be defined slightly differently. In
the case of the ICC profile issue this boils down to allowing a widely
used class of ICC profiles that are currently prohibited in JPEG 2000.
The resolution issue could be fixed by a more specific definition of the
existing resolution fields.&lt;/p&gt;

&lt;h2 id=&quot;amendment-to-the-standard&quot;&gt;Amendment to the standard&lt;/h2&gt;

&lt;p&gt;Both issues will be addressed in an amendment to the standard. Rob
Buckley provides more details on this (along with some interesting
background information on colour space support in JP2) in a recent &lt;a href=&quot;http://jpeg2000wellcomelibrary.blogspot.com/2011/04/guest-post-color-in-jp2.html&quot;&gt;blog
entry on the Wellcome Library’s JPEG 2000
blog&lt;/a&gt;.
As Rob puts it:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;The final outcome of all this will be a JP2 file format standard that
aligns with current practice; supports RGB spaces such as Adobe RGB
1998, ProPhoto RGB and eci RGB v2; and provides a smooth migration path
from TIFF masters as JP2 increasingly becomes used as an image
preservation format.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;So, some relatively small adjustments to the standard could result in a
significant improvement of the suitability of JP2 for preservation
purposes.&lt;/p&gt;

&lt;p&gt;Since various institutions are using JPEG 2000 &lt;em&gt;now&lt;/em&gt;, the paper also
provides some practical recommendations that may help in mitigating the
risks for existing collections.&lt;/p&gt;

&lt;h2 id=&quot;link-to-paper&quot;&gt;Link to paper&lt;/h2&gt;
&lt;p&gt; 
&lt;a href=&quot;http://www.dlib.org/dlib/may11/vanderknijff/05vanderknijff.html&quot;&gt;JPEG 2000 for Long-term Preservation: JP2 as a Preservation
Format&lt;/a&gt;&lt;/p&gt;

&lt;hr /&gt;

&lt;p&gt;Originally published at the &lt;a href=&quot;https://openpreservation.org/blog/2011/06/06/paper-jpeg-2000-preservation-9/&quot;&gt;Open Preservation Foundation blog&lt;/a&gt;&lt;/p&gt;
</description>
                <link>https://bitsgalore.org/2011/06/06/paper-jpeg-2000-preservation-9</link>
                <guid>https://bitsgalore.org/2011/06/06/paper-jpeg-2000-preservation-9</guid>
                <pubDate>2011-06-06T00:00:00+02:00</pubDate>
        </item>

        <item>
                <title>Ensuring the suitability of JPEG 2000 for preservation</title>
                <description>&lt;p&gt;In my &lt;a href=&quot;http://www.dpconline.org/component/docman/doc_download/526-jp2knov2010vanderkniff&quot;&gt;presentation&lt;/a&gt;
 during the &lt;a href=&quot;http://blog.wellcomelibrary.org/2010/11/wellcome-trust-hosts-jpeg-2000-seminar/&quot;&gt;Wellcome Trust’s JPEG 2000 seminar&lt;/a&gt; I discussed the suitability of JPEG 2000
(and more specifically its JP2 format) for long-term preservation. I
highlighted the erroneous restriction in the JP2 (and JPX) format
specification that only allows ICC profiles of the ‘input’ class to be
used. This effectively prohibits the use of all working colour spaces
such as Adobe RGB, which are defined using ‘display device’ profiles. I
also showed how different software vendors interpret the format
specification in subtly different ways, and how such issues can create
problems in the long term, such as the loss of colour space and
resolution information after some future migration.&lt;/p&gt;

&lt;!-- more --&gt;

&lt;p&gt;This leads us to the question of to what extent we can predict a specific
file format’s suitability for long-term preservation. The answer is not
that straightforward. The Library of Congress (LoC) assesses file formats
against 7 &lt;a href=&quot;http://www.digitalpreservation.gov/formats/sustain/sustain.shtml&quot;&gt;‘sustainability factors’&lt;/a&gt;,
whereas the UK National Archives (TNA) have formulated &lt;a href=&quot;http://www.nationalarchives.gov.uk/documents/selecting-file-formats.pdf&quot;&gt;a list of 12 criteria&lt;/a&gt;.
It is beyond the scope of this blog post to present a detailed analysis
of the extent to which JP2 lives up to either set of criteria. However,
it is interesting to have a look at whether these criteria could have
been helpful in identifying the issues covered by my presentation.&lt;/p&gt;

&lt;h2 id=&quot;format-specifications&quot;&gt;Format specifications&lt;/h2&gt;

&lt;p&gt;First, both the LoC’s ‘sustainability factors’ and the TNA criteria
acknowledge the importance of having published specifications of a file
format. The LoC uses a ‘Disclosure’ factor, which refers to “the
existence of complete documentation, preferably subject to external
expert evaluation”. TNA take this one step further by also defining a
‘Documentation Quality’ criterion, which expresses the degree to which
documentation is comprehensive, accurate and comprehensible. This last
criterion largely covers the JPEG 2000 ICC issue, although it’s
questionable how useful this would have been to identify it &lt;em&gt;a priori&lt;/em&gt;.
A problem with errors and ambiguities in format specifications is that
they can be incredibly easy to overlook, and you may only become aware
of them after discovering that different software products interpret the
specifications in slightly different ways.&lt;/p&gt;

&lt;h2 id=&quot;adoption&quot;&gt;Adoption&lt;/h2&gt;

&lt;p&gt;Formats that are widely used are typically well supported by an array of
software tools, and such formats are unlikely to disappear into
obsolescence. TNA expresses this through a ‘Ubiquity’ criterion, which
essentially reflects a file format’s overall popularity. The definition
of the LoC’s ‘Adoption’ factor includes a list of criteria that can be
used as “evidence of adoption”. The first set of criteria here includes
“bundling of tools with personal computers, native support in Web
browsers or market-leading content creation tools, and the existence of
many competing products for creation, manipulation, or rendering of
digital objects in the format”. Note that JP2 isn’t doing particularly
well when measured against any of these criteria. However, the LoC list
adds that “a format that has been reviewed by other archival
institutions and accepted as a preferred or supported archival format
also provides evidence of adoption”. This certainly seems to be the case
for JP2. But how relevant is this, really? Going back to the ICC
profiles issue: the JP2 file format has been around for about 10 years
now, and its acceptance by the archival community has been growing
steadily over the last 5 years or so. Yet, this whole issue seems to
have gone unnoticed in the archival community for all those years, and I
think this is slightly worrying.&lt;/p&gt;

&lt;p&gt;Now let’s imagine for a moment that JP2 had been picked up by the
digital photography and graphic design communities. For such uses the
ability to do proper colour management is a basic prerequisite, and
limiting the support of ICC profiles to the ‘input’ class would have
made the format virtually useless to these user communities. My guess is
that in this (entirely fictional) scenario, either the format specification
would have improved quickly (based on feedback from the user
community), or the respective user communities would have simply stopped
using the format altogether. The problem here seems to be that very few
people in the archiving community are even aware of such things as
colour spaces and colour management, let alone their importance within
the context of preservation. With more established formats such as TIFF
this may not be as much of a problem, if only because TIFF has been
‘road tested’ for decades by the photography and graphic design
communities. As an archiving community we cannot fall back on any
similar ‘road testing’ in the case of JP2. And this brings me to my next
point.&lt;/p&gt;

&lt;h2 id=&quot;importance-of-hands-on-experience&quot;&gt;Importance of hands-on experience&lt;/h2&gt;

&lt;p&gt;Preservation criteria such as those of the LoC or TNA are invaluable for
assessing the suitability of a format for preservation, but I believe it
is equally important to have actual hands-on experience with the tools
that are used for creating, modifying, and reading the format. For
instance, the TNA criteria use the &lt;em&gt;number&lt;/em&gt; of software tools that
support a given format as an indicator of the extent of current
software support of that format. But knowing the &lt;em&gt;number&lt;/em&gt; of tools says
nothing about how good or useful these tools actually are! In the case
of JP2, quite a large number of (mostly free or open-source) tools exist
that, under the hood, are using the open &lt;a href=&quot;http://www.ece.uvic.ca/~mdadams/jasper/&quot;&gt;JasPer&lt;/a&gt;
library. JasPer is known to have performance and stability issues that
make it unsuitable for most professional applications (for which, I
should emphasise, it was never developed in the first place!). These
issues affect &lt;em&gt;all&lt;/em&gt; software tools that are using JasPer. So, merely
counting the number of available tools, without incorporating any additional
quality criteria, may simply miss the point. But how would you
define these?&lt;/p&gt;

&lt;p&gt;Part of the answer, I think, is that assessing a format’s suitability
for long-term preservation is not a purely top-down process. Most of the
software-related issues that I showed in my presentation were found by
simply experimenting with actual files, encoders and characterisation
tools: convert a TIFF to JP2; convert it back to TIFF; use existing
metadata-extraction and characterisation tools such as &lt;a href=&quot;http://www.sno.phy.queensu.ca/~phil/exiftool/&quot;&gt;ExifTool&lt;/a&gt;
and &lt;a href=&quot;http://hul.harvard.edu/jhove/&quot;&gt;JHOVE&lt;/a&gt; to analyse the
in- and output files; try to understand the output of these tools;
compare the output before and after the conversion, and so on. Such
experiments are extremely useful for getting a feel for the strengths
and weaknesses of specific software tools, and they can reveal problems
that are not readily captured by pre-defined criteria. In some cases,
their results may be used to refine existing criteria, or even add new
ones.&lt;/p&gt;
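
&lt;p&gt;The round-trip experiment described above can be sketched in a few lines of Python. This is only a minimal illustration, not the procedure I actually used: it assumes that the &lt;code&gt;exiftool&lt;/code&gt; executable is on the PATH (its &lt;code&gt;-j&lt;/code&gt; option prints metadata as JSON), the function names are made up for this example, and the sample values at the bottom are hypothetical rather than real tool output.&lt;/p&gt;

```python
import json
import subprocess


def read_metadata(path):
    """Return the metadata of one file as a dict, using ExifTool's JSON output.

    Assumes the exiftool executable is available on the PATH.
    """
    result = subprocess.run(
        ["exiftool", "-j", path],
        capture_output=True, text=True, check=True,
    )
    # exiftool -j prints a JSON array with one object per input file
    return json.loads(result.stdout)[0]


def compare_metadata(before, after, fields):
    """Report which of the given fields were lost, changed or kept."""
    report = {"lost": [], "changed": [], "kept": []}
    for field in fields:
        if field not in before:
            continue  # the source file never had this field
        if field not in after:
            report["lost"].append(field)
        elif before[field] != after[field]:
            report["changed"].append(field)
        else:
            report["kept"].append(field)
    return report


# Hypothetical values, for illustration only (not real tool output)
tiff = {"XResolution": 300, "YResolution": 300,
        "ProfileDescription": "Adobe RGB (1998)"}
roundtrip = {"XResolution": 72, "YResolution": 72}

fields = ["XResolution", "YResolution", "ProfileDescription"]
print(compare_metadata(tiff, roundtrip, fields))
```

&lt;p&gt;With real files, the two dictionaries would come from calling &lt;code&gt;read_metadata&lt;/code&gt; on the original TIFF and on the TIFF that results from the conversion to JP2 and back. Any field that ends up under ‘lost’ or ‘changed’ is a candidate for closer inspection.&lt;/p&gt;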

&lt;h2 id=&quot;final-notes-on-preservation-criteria&quot;&gt;Final notes on preservation criteria&lt;/h2&gt;

&lt;p&gt;Although I wouldn’t downplay the importance of preservation criteria
such as those used by the LoC or TNA, I think it’s important to realise
that such criteria are largely based on theoretical considerations. In
most cases they are not based on any empirical data, and as a result
their predictive value is largely unknown. For example, &lt;a href=&quot;http://blog.dshr.org/2009/01/are-format-specifications-important-for.html&quot;&gt;an interesting
blog post by David Rosenthal&lt;/a&gt;
argues that preserving the specifications of a file
format doesn’t contribute anything to practical digital preservation.
According to Rosenthal, the availability of working open-source
rendering software is much more important, and he explains how “formats
with open source renderers are, for all practical purposes, immune from
format obsolescence”.&lt;/p&gt;

&lt;p&gt;This takes us directly to the lack of JPEG 2000-related activity in the
open source community, which I also referred to in my presentation.
Perhaps the best way to ensure sustainability of JPEG 2000 and the JP2
format would be to invest in a truly open JP2 software library, and
release this under a free software license. This could either take the
form of the development of a completely new library, or investing in the
improvement and further development of an existing one, such as &lt;a href=&quot;http://www.openjpeg.org/&quot;&gt;OpenJPEG&lt;/a&gt;.
This would require an investment from the archival community, but the
payoff may be well worth it.&lt;/p&gt;

&lt;h2 id=&quot;acknowledgements&quot;&gt;Acknowledgements&lt;/h2&gt;

&lt;p&gt;This blog entry was largely inspired by an e-mail discussion that was
started by Richard Clark, and in particular by a contribution to this
discussion by William Kilbride.&lt;/p&gt;

&lt;hr /&gt;

&lt;p&gt;Originally published at the &lt;a href=&quot;http://blog.wellcomelibrary.org/2010/12/guest-post-ensuring-the-suitability-of-jpeg-2000-for-preservation/&quot;&gt;Wellcome Library Blog&lt;/a&gt;&lt;/p&gt;
</description>
                <link>https://bitsgalore.org/2010/12/02/ensuring-suitability-of-jpeg</link>
                <guid>https://bitsgalore.org/2010/12/02/ensuring-suitability-of-jpeg</guid>
                <pubDate>2010-12-02T00:00:00+01:00</pubDate>
        </item>


</channel>
</rss>
