PDF – Inventory of long-term preservation risks
In this blog post I’ll be dusting off some old stuff for a change. The occasion for this is the following question, posted by Paul Wheatley on the Libraries and Information Science Stack Exchange website a few days ago:
What preservation risks are associated with the PDF file format?
Report
This reminded me of a report I wrote on this very subject back in 2009. (Incidentally this was my very first foray into the wacky world of digital preservation, but that’s another story.) Originally this document was intended for internal use at the KB, but looking at it again, I think it may be of interest to a wider audience. It also aligns quite nicely with the upcoming work on a knowledge base of file-format related risks that will be done as part of the SCAPE project. The main idea here is to take a file format, identify its main (preservation-related) risks, and describe how “risky” features can be detected by existing (characterisation) tools. In fact I was envisaging something along these lines when I wrote PDF report in 2009, but other things got in the way, and I never got round to the final step. The SCAPE work should finally make this happen.
Although the work on the knowledge base is still in its early stages, some very first results can be found here. The initial focus will be on JPEG 2000 (JP2/JPX) and PDF.
As for the report, I should add that some of it is a little rough around the edges, and you may note some gaps and not-quite-finished bits. This is also why we never released this first time around. Also, one aspect that is not well covered is PDF’s potential for transmitting viruses and other malware. Nevertheless, as a general introduction to the format and an overview of its main risks I think it’s not too shabby, but I’ll let you be the judge of that! As always, feel free to use the comment fields for you feedback and suggestions.
Link to report
Originally published at the Open Preservation Foundation blog
-
PDF
- PDF Quality assessment for digitisation batches with Python, PyMuPDF and Pillow
- Escape from the phantom of the PDF
- VeraPDF parse status as a proxy for PDF rendering: experiments with the Synthetic PDF Testset
- Identification of PDF preservation risks with VeraPDF and JHOVE
- On The Significant Properties of Spreadsheets
- PDF processing and analysis with open-source tools
- Policy-based assessment with VeraPDF - a first impression
- PDF/A as a preferred, sustainable format for spreadsheets?
- Why PDF/A validation matters, even if you don't have PDF/A - Part 2
- Why PDF/A validation matters, even if you don't have PDF/A
- When (not) to migrate a PDF to PDF/A
- Identification of PDF preservation risks: analysis of Govdocs selected corpus
- Identification of PDF preservation risks with Apache Preflight: the sequel
- What do we mean by "embedded" files in PDF?
- Identification of PDF preservation risks with Apache Preflight: a first impression
- PDF – Inventory of long-term preservation risks
-
preservation-risks
- Escape from the phantom of the PDF
- Multi-image TIFFs, subfiles and image file directories
- Identification of PDF preservation risks with VeraPDF and JHOVE
- On The Significant Properties of Spreadsheets
- PDF processing and analysis with open-source tools
- ISO/IEC TS 22424 standard on EPUB3 preservation
- Does Microsoft OneDrive export large ZIP files that are corrupt?
- Why PDF/A validation matters, even if you don't have PDF/A - Part 2
- Why PDF/A validation matters, even if you don't have PDF/A
- Measuring Bigfoot
- Assessing file format risks: searching for Bigfoot?
- PDF – Inventory of long-term preservation risks
- EPUB for archival preservation
Comments