Escape from the phantom of the PDF
In a recent blog post, colleagues at the National Digital Preservation Services in Finland addressed an issue with PDF files that contain strings with octal escape sequences. These are not parsed correctly by JHOVE, and the resulting parse errors ultimately lead to (seemingly unrelated) validation errors. The authors argue that octal escape sequences present a preservation risk, as they may confuse other software besides JHOVE. Since this claim is not backed up by any evidence, here I put this to the test using 8 different PDF processing tools and libraries.
-
Apache-tika
-
ExifTool
-
JHOVE
- Escape from the phantom of the PDF
- Multi-image TIFFs, subfiles and image file directories
- VeraPDF parse status as a proxy for PDF rendering: experiments with the Synthetic PDF Testset
- Identification of PDF preservation risks with VeraPDF and JHOVE
- PDF processing and analysis with open-source tools
- Breaking WAVEs (and some FLACs too)
- Why can't we have digital preservation tools that just work?
- A simple JP2 file structure checker
-
PDF
- Escape from the phantom of the PDF
- VeraPDF parse status as a proxy for PDF rendering: experiments with the Synthetic PDF Testset
- Identification of PDF preservation risks with VeraPDF and JHOVE
- On The Significant Properties of Spreadsheets
- PDF processing and analysis with open-source tools
- Policy-based assessment with VeraPDF - a first impression
- PDF/A as a preferred, sustainable format for spreadsheets?
- Why PDF/A validation matters, even if you don't have PDF/A - Part 2
- Why PDF/A validation matters, even if you don't have PDF/A
- When (not) to migrate a PDF to PDF/A
- Identification of PDF preservation risks: analysis of Govdocs selected corpus
- Identification of PDF preservation risks with Apache Preflight: the sequel
- What do we mean by "embedded" files in PDF?
- Identification of PDF preservation risks with Apache Preflight: a first impression
- PDF – Inventory of long-term preservation risks
-
preservation-risks
- Escape from the phantom of the PDF
- Multi-image TIFFs, subfiles and image file directories
- Identification of PDF preservation risks with VeraPDF and JHOVE
- On The Significant Properties of Spreadsheets
- PDF processing and analysis with open-source tools
- ISO/IEC TS 22424 standard on EPUB3 preservation
- Does Microsoft OneDrive export large ZIP files that are corrupt?
- Why PDF/A validation matters, even if you don't have PDF/A - Part 2
- Why PDF/A validation matters, even if you don't have PDF/A
- Measuring Bigfoot
- Assessing file format risks: searching for Bigfoot?
- PDF – Inventory of long-term preservation risks
- EPUB for archival preservation
-
VeraPDF
- Escape from the phantom of the PDF
- VeraPDF parse status as a proxy for PDF rendering: experiments with the Synthetic PDF Testset
- Identification of PDF preservation risks with VeraPDF and JHOVE
- PDF processing and analysis with open-source tools
- Policy-based assessment with VeraPDF - a first impression
- Why PDF/A validation matters, even if you don't have PDF/A - Part 2
- Why PDF/A validation matters, even if you don't have PDF/A