Escape from the phantom of the PDF

14 November 2024
Aquatint showing a graveyard scene in front of a church. At the center is a man in armour with a distressed expression on his face. To his left is a skeleton, and to his right a ghost.
A man in armour is confronted by a ghost and a skeleton. Aquatint. Wellcome Collection, Public Domain.

In a recent blog post, colleagues at the National Digital Preservation Services in Finland addressed an issue with PDF files that contain strings with octal escape sequences. These are not parsed correctly by JHOVE, and the resulting parse errors ultimately lead to (seemingly unrelated) validation errors. The authors argue that octal escape sequences present a preservation risk, as they may confuse other software besides JHOVE. Since this claim is not backed up by any evidence, here I put this to the test using 8 different PDF processing tools and libraries.

The Phantom of a PDF File

A few weeks ago I came across The Phantom of a PDF File, a blog post by Juha Lehtonen & Johan Kylander. In this post, they address an issue with a PDF file that gives an unexpected validation error in JHOVE. Digging into the PDF structure, they were able to track this down to the presence of octal escape sequences in the “Producer” field of the file’s Document Information Dictionary:

9 0 obj
<<
/Title (Boo)
/CreationDate (D:20241029134330Z00'00') 
/Producer (\376\377\000P\000D\000F\000 \000P\000h\000a\000n\000t\000o\000m\000\000)
>>
endobj

JHOVE cannot handle octal escape sequences properly, which leads to a parse error that ultimately results in JHOVE reporting a validation error that is completely unrelated to the “Producer” field. In their original post, the authors claimed that the way the octal escape sequences are used in this file is not allowed by the PDF specification. They also advised against the use of octal escape sequences, and to restrict the values in PDF metadata fields to plain ASCII, or otherwise UTF-16BE. Finally they argue that the presence of octal escape sequences1 “raises the risk of causing problems in the future”, and that “JHOVE probably is not the only software that will get confused” by this.

Is JHOVE the only software that gets confused by octal escape sequences?

After reading this post, I posted the link to this existing ticket on PDF parse issues. This immediately resulted in a response by Peter Wyatt of the PDF Association, who pointed out that, contrary to the claims made in the initial version of the post, the use of octal escape sequences in the “problematic” file is actually perfectly valid according to ISO 32000-1. In response to this comment, the authors corrected this in an update to their post. However, they seem to stick by their claim that “JHOVE probably is not the only software that will get confused” by octal escape sequences. Out of curiosity, I decided to put this to the test.

Enter the OPF Phantom

The original file contains two “Producer” fields: one in the Document Information Dictionary (which is the one we’re interested in here), and another one that is part of the XMP metadata. Since both have the same value (“PDF Phantom”), I changed the XMP value to “OPF Phantom” in a Hex editor. This way, we can easily distiguish between both fields. The modified file is available here.

Next, I ran the modified file through 8 different PDF tools and libraries, and inspected the result to see how the “Producer” field is handled. Below are the commands and results.

ExifTool

Command:

exiftool -X phantom_modified_xmp.pdf

Result:

<?xml version='1.0' encoding='UTF-8'?>
<rdf:RDF xmlns:rdf='http://www.w3.org/1999/02/22-rdf-syntax-ns#'>

<rdf:Description rdf:about='phantom_modified_xmp.pdf'
  xmlns:et='http://ns.exiftool.org/1.0/' et:toolkit='Image::ExifTool 12.60'
  xmlns:ExifTool='http://ns.exiftool.org/ExifTool/1.0/'
  xmlns:System='http://ns.exiftool.org/File/System/1.0/'
  xmlns:File='http://ns.exiftool.org/File/1.0/'
  xmlns:PDF='http://ns.exiftool.org/PDF/PDF/1.0/'
  xmlns:XMP-x='http://ns.exiftool.org/XMP/XMP-x/1.0/'
  xmlns:XMP-pdf='http://ns.exiftool.org/XMP/XMP-pdf/1.0/'>
 <ExifTool:ExifToolVersion>12.60</ExifTool:ExifToolVersion>
 <System:FileName>phantom_modified_xmp.pdf</System:FileName>
 <System:Directory>.</System:Directory>
 <System:FileSize>5.9 kB</System:FileSize>
 <System:FileModifyDate>2024:11:09 00:22:05+00:00</System:FileModifyDate>
 <System:FileAccessDate>2024:11:09 00:22:46+00:00</System:FileAccessDate>
 <System:FileInodeChangeDate>2024:11:09 00:22:05+00:00</System:FileInodeChangeDate>
 <System:FilePermissions>-rw-rw-r--</System:FilePermissions>
 <File:FileType>PDF</File:FileType>
 <File:FileTypeExtension>pdf</File:FileTypeExtension>
 <File:MIMEType>application/pdf</File:MIMEType>
 <PDF:PDFVersion>1.4</PDF:PDFVersion>
 <PDF:Linearized>No</PDF:Linearized>
 <PDF:PageCount>1</PDF:PageCount>
 <PDF:Title>Boo</PDF:Title>
 <PDF:CreateDate>2024:10:29 13:43:30Z</PDF:CreateDate>
 <PDF:Producer>PDF Phantom</PDF:Producer>
 <XMP-x:XMPToolkit>Image::ExifTool 12.71</XMP-x:XMPToolkit>
 <XMP-pdf:Producer>OPF Phantom</XMP-pdf:Producer>
</rdf:Description>
</rdf:RDF>

ExifTool correctly decodes the octal escape sequences (PDF:Producer), and also extracts the XMP value (XMP-pdf:Producer).

Pdfcpu

Command:

pdfcpu info phantom_modified_xmp.pdf

Result:

         PDF version: 1.4
          Page count: 1
           Page size: 595.28 x 841.89 points
............................................
               Title: Boo
              Author: 
             Subject: 
        PDF Producer: PDF Phantom
     Content creator: 
       Creation date: D:20241029134330Z00'00'
   Modification date: 
............................................
              Tagged: No
              Hybrid: No
          Linearized: No
  Using XRef streams: No
Using object streams: No
         Watermarked: No
............................................
           Encrypted: No
         Permissions: Full access

Pdfcpu correctly decodes the octal escape sequences (PDF Producer).

pdfinfo (Poppler)

Command:

pdfinfo phantom_modified_xmp.pdf

Result:

Title:          Boo
Producer:       PDF Phantom
CreationDate:   Tue Oct 29 14:43:30 2024 CET
Tagged:         no
UserProperties: no
Suspects:       no
Form:           none
JavaScript:     no
Pages:          1
Encrypted:      no
Page size:      595.276 x 841.89 pts (A4)
Page rot:       0
File size:      5906 bytes
Optimized:      no
PDF version:    1.4

Poppler correctly decodes the octal escape sequences (Producer).

VeraPDF

Command:

verapdf --off --extract phantom_modified_xmp.pdf

The result includes:

<informationDict>
  <entry key="Title">Boo</entry>
  <entry key="Producer">PDF Phantom#x000000</entry>
  <entry key="CreationDate">2024-10-29T13:43:30.000Z</entry>
</informationDict>

VeraPDF correctly decodes the octal escape sequences. Note the null character at the end (which is actually part of the object string).

Apache Tika

Command:

java -jar ~/tika/tika-app-2.9.2.jar phantom_modified_xmp.pdf

Result:

<?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="pdf:PDFVersion" content="1.4"/>
<meta name="pdf:docinfo:title" content="Boo"/>
<meta name="pdf:hasXFA" content="false"/>
<meta name="access_permission:modify_annotations" content="true"/>
<meta name="access_permission:can_print_degraded" content="true"/>
<meta name="dcterms:created" content="2024-10-29T13:43:30Z"/>
<meta name="dc:format" content="application/pdf; version=1.4"/>
<meta name="access_permission:fill_in_form" content="true"/>
<meta name="pdf:hasCollection" content="false"/>
<meta name="pdf:encrypted" content="false"/>
<meta name="dc:title" content="Boo"/>
<meta name="Content-Length" content="5906"/>
<meta name="pdf:hasMarkedContent" content="false"/>
<meta name="Content-Type" content="application/pdf"/>
<meta name="pdf:producer" content="OPF Phantom"/>
<meta name="access_permission:extract_for_accessibility" content="true"/>
<meta name="access_permission:assemble_document" content="true"/>
<meta name="xmpTPg:NPages" content="1"/>
<meta name="resourceName" content="phantom_modified_xmp.pdf"/>
<meta name="pdf:hasXMP" content="true"/>
<meta name="access_permission:extract_content" content="true"/>
<meta name="access_permission:can_print" content="true"/>
<meta name="X-TIKA:Parsed-By" content="org.apache.tika.parser.DefaultParser"/>
<meta name="X-TIKA:Parsed-By" content="org.apache.tika.parser.pdf.PDFParser"/>
<meta name="access_permission:can_modify" content="true"/>
<meta name="pdf:docinfo:producer" content="PDF Phantom�"/>
<meta name="pdf:docinfo:created" content="2024-10-29T13:43:30Z"/>
<title>Boo</title>
</head>
<body><div class="page"><p/>
</div>
</body></html>

Tika correctly decodes the octal escape sequences (field pdf:docinfo:producer); like VeraPDF it shows the null character at the end of the string. Note that Tika also reports the XMP Producer value (field pdf:producer).

Qpdf

Command:

qpdf --json phantom_modified_xmp.pdf

Output contains:

"9 0 R": {
  "/CreationDate": "D:20241029134330Z00'00'",
  "/Producer": "PDF Phantom\u0000",
  "/Title": "Boo"
}

Qpdf correctly decodes the octal escape sequences, including the trailing null character.

Pdftk

Command:

pdftk phantom_modified_xmp.pdf dump_data

Result:

InfoBegin
InfoKey: CreationDate
InfoValue: D:20241029134330Z00&apos;00&apos;
InfoBegin
InfoKey: Producer
InfoValue: PDF Phantom
InfoBegin
InfoKey: Title
InfoValue: Boo
PdfID0: 71a810587639eb130aefddee35e3c49d
PdfID1: 71a810587639eb130aefddee35e3c49d
NumberOfPages: 1
PageMediaBegin
PageMediaNumber: 1
PageMediaRotation: 0
PageMediaRect: 0 0 595.276 841.89
PageMediaDimensions: 595.276 841.89

Pdftk correctly decodes the octal escape sequences.

PyMuPDF

I tested PyMuPDF with this simple test script:

import pprint
import pymupdf

myPDF = "phantom_modified_xmp.pdf"

doc = pymupdf.open(myPDF)
metadata = doc.metadata
pprint.pp(metadata)

Result:

{'format': 'PDF 1.4',
 'title': 'Boo',
 'author': '',
 'subject': '',
 'keywords': '',
 'creator': '',
 'producer': 'PDF Phantom',
 'creationDate': "D:20241029134330Z00'00'",
 'modDate': '',
 'trapped': '',
 'encryption': None}

PyMUPDF correctly decodes the octal escape sequences.

Conclusion

All the above tools and libraries were able to decode the octal escape sequences without any problems. So JHOVE’s behaviour really seems to be the exception here, rather than the rule. Octal escape sequences are both allowed by the standard, and widely supported by PDF processing tools. Concluding, I would argue that they don’t pose a significant (if at all) preservation risk.

Acknowledgments

Thanks are due to Peter Wyatt (PDF Association) for his helpful comments in this Github thread.

  1. Actually they refer to this as a “dual encoding”, but this term is confusing because the octal escape sequences aren’t really an “encoding” at all, but rather a part of the lexical definition of PDF literal string objects. See this follow-up comment by Peter Wyatt




Search

Tags

Archive

2024

November

October

March

2023

June

May

March

February

January

2022

November

June

April

March

2021

September

February

2020

September

June

April

March

February

2019

September

April

March

January

2018

July

April

2017

July

June

April

January

2016

December

April

March

2015

December

November

October

July

April

March

January

2014

December

November

October

September

August

January

2013

October

September

August

July

May

April

January

2012

December

September

August

July

June

April

January

2011

December

September

July

June

2010

December

Feeds

RSS

ATOM