Policy-based assessment with VeraPDF - a first impression
Some four years ago I wrote a blog post that demonstrated how Apache Preflight (the PDF/A validator tool that is part of Apache PDFBox) can be used to detect features in a PDF that are potential preservation risks. A follow-up blog applied Schematron rules to the Preflight output in an attempt at doing policy-based assessments. The results of that work were quite promising, but dealing with Preflight’s multitude of (especially font-related) validation errors proved to be a challenge.
The idea of using a PDF/A validator for policy-based assessments of “regular” PDF files (i.e. PDFs that are not necessarily PDF/A) was explicitly addressed as a use case for veraPDF. With VeraPDF now having entered its “final testing phase”, I thought this was a good time for a small test-drive of veraPDF’s capabilities in this area. All test results are based on VeraPDF 1.4.7.
Test data
For this test I used PDFs from the Adobe Acrobat Engineering website (sadly gone since 2015). As in my 2013 blog post, I limited the analysis to:
- all files in the General section of the Font Testing category;
- all files in the Classic Multimedia section of the Multimedia & 3D Tests category.
The dataset is quite small, but it contains many complex and otherwise challenging PDFs, which makes it an interesting dataset for testing.
Policy
The policy is similar to the one used in my 2014 blog post, and it is defined by the following objectives:
- No encryption / password protection
- All fonts are embedded
- No embedded files
- No file attachments
- No multimedia content (audio, video, 3-D objects)
- No PDFs that raise an exception or result in a processing error in VeraPDF (PDF validity proxy)
(Note that the 2014 blog post also mentioned the absence of JavaScript as an additional objective. However, it turned out that the necessary output for this is not currently reported by VeraPDF.)
Subsequently I ‘translated’ each of these objectives into Schematron rules. For a basic how-to see the veraPDF Policy Checking documentation.
The full Schematron file can be found here.
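To give an idea of what a single rule boils down to, here is a minimal sketch of the font embedding objective, written in plain Python rather than Schematron. It assumes that the features report describes each font with a `fontDescriptor` element that has an `embedded` child holding `true` or `false`. Those element names, the `SAMPLE_FEATURES` snippet and the function name `font_policy_failures` are assumptions made for illustration, and should be checked against real VeraPDF output.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified features report (element names are assumptions).
SAMPLE_FEATURES = """<featuresReport>
  <documentResources>
    <fonts>
      <font id="f1"><fontDescriptor><embedded>true</embedded></fontDescriptor></font>
      <font id="f2"><fontDescriptor><embedded>false</embedded></fontDescriptor></font>
    </fonts>
  </documentResources>
</featuresReport>"""


def font_policy_failures(features_xml):
    """Return one failure message for each font that is not flagged as embedded."""
    root = ET.fromstring(features_xml)
    failures = []
    for descriptor in root.iter("fontDescriptor"):
        # A missing <embedded> element is treated as a failure as well.
        if descriptor.findtext("embedded", default="").strip() != "true":
            failures.append("Font is not embedded")
    return failures


if __name__ == "__main__":
    print(font_policy_failures(SAMPLE_FEATURES))
```

In the Schematron file the same idea is expressed as an assert on the corresponding element, with the message text as its content.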
VeraPDF configuration
It is important to note that, unlike in my earlier Apache Preflight experiments, the Schematron rules do not rely on the PDF/A validation output! Instead, VeraPDF can be instructed to include a ‘features report’ in its output, which directly points to technical features such as font properties, annotation types, security features, and so on. Most of the features that are needed for a policy-based assessment are disabled by default. So, we first need to activate these in the configuration (file features.xml in VeraPDF’s config directory). I edited it as below:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<featuresConfig>
  <enabledFeatures>
    <feature>ANNOTATION</feature>
    <feature>DOCUMENT_SECURITY</feature>
    <feature>EMBEDDED_FILE</feature>
    <feature>FONT</feature>
    <feature>INFORMATION_DICTIONARY</feature>
  </enabledFeatures>
</featuresConfig>
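If you prefer to generate this file rather than edit it by hand, something like the following Python sketch should do. The element names are the ones shown in the configuration above. The function name `build_features_config` and the `POLICY_FEATURES` list are my own choices for illustration, and the sketch leaves out the XML declaration.

```python
import xml.etree.ElementTree as ET

# Feature names as listed in the configuration file above.
POLICY_FEATURES = [
    "ANNOTATION",
    "DOCUMENT_SECURITY",
    "EMBEDDED_FILE",
    "FONT",
    "INFORMATION_DICTIONARY",
]


def build_features_config(features):
    """Return a featuresConfig XML string that enables the given features."""
    root = ET.Element("featuresConfig")
    enabled = ET.SubElement(root, "enabledFeatures")
    for name in features:
        ET.SubElement(enabled, "feature").text = name
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(build_features_config(POLICY_FEATURES))
```

The resulting string can then be written to features.xml in VeraPDF’s config directory.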
Basic operation
Supposing that the PDFs we want to analyze are in directory ~/myPdfs, and that the Schematron rules that represent our policy are in the file demo-policy.sch, we can do a policy-based validation of all these files with one single command:
verapdf -x --policyfile demo-policy.sch ~/myPdfs/* > myPdfsOut.xml
Here the -x switch activates feature extraction. The output file myPdfsOut.xml contains, for each PDF, an element with PDF/A validation output, an element with the features report, and an element with the policy report.
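The same call can also be made from a script. Below is a minimal Python sketch, assuming that the `verapdf` launcher is on the PATH. It only uses the `-x` and `--policyfile` options from the command above. The helper names `build_command` and `run_verapdf` are hypothetical.

```python
import subprocess
from pathlib import Path


def build_command(pdf_dir, policy_file):
    """Build the verapdf argument list for all non-hidden entries in pdf_dir."""
    directory = Path(pdf_dir).expanduser()
    # Skip dot-files, which the shell wildcard * would not match either.
    pdfs = sorted(str(p) for p in directory.iterdir() if not p.name.startswith("."))
    return ["verapdf", "-x", "--policyfile", str(policy_file), *pdfs]


def run_verapdf(pdf_dir, policy_file, out_file):
    """Run verapdf, write its standard output to out_file, return the exit code."""
    with open(out_file, "wb") as out:
        return subprocess.run(build_command(pdf_dir, policy_file), stdout=out).returncode
```

A call such as `run_verapdf("~/myPdfs", "demo-policy.sch", "myPdfsOut.xml")` would then correspond to the command line shown above.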
Analysis script
Typically the VeraPDF output is rather unwieldy. To facilitate things I wrote a custom analysis script, which does the following things:
- It runs VeraPDF
- It creates a trimmed-down version of the output file that only contains the policy report. Also, for each PDF, it removes duplicate instances of failed (policy) checks (e.g. if a check on font embedding fails for 10 different fonts, only one reference to the failed check is kept)
- It creates a comma-delimited summary file which lists for each PDF its path/name, followed by the description of each unique failed validation rule (taken from the message element in VeraPDF’s output).
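The de-duplication and summary steps listed above can be sketched in a few lines of Python. This is not the actual analysis script, only an illustration of the idea. Apart from the `message` element, the element names in `SAMPLE_REPORT` (`job`, `item`, `name`, `policyReport`, `failedChecks`, `check`) are assumptions about the output layout, and the function names are hypothetical.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical, simplified output file (layout is an assumption).
SAMPLE_REPORT = """<report>
  <jobs>
    <job>
      <item><name>/data/a.pdf</name></item>
      <policyReport>
        <failedChecks>
          <check><message>Font is not embedded</message></check>
          <check><message>Font is not embedded</message></check>
          <check><message>Movie annotation</message></check>
        </failedChecks>
      </policyReport>
    </job>
    <job>
      <item><name>/data/b.pdf</name></item>
      <policyReport><failedChecks/></policyReport>
    </job>
  </jobs>
</report>"""


def unique_failures(report_xml):
    """Return (file name, unique failed-check messages) for each job."""
    rows = []
    for job in ET.fromstring(report_xml).iter("job"):
        name = job.findtext("item/name", default="")
        messages = []
        for failed in job.iter("failedChecks"):
            for msg in failed.iter("message"):
                text = (msg.text or "").strip()
                # Keep only the first occurrence of each message.
                if text and text not in messages:
                    messages.append(text)
        rows.append((name, messages))
    return rows


def to_csv(rows):
    """Write one line per PDF: its name, then the messages joined by ';'."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    for name, messages in rows:
        writer.writerow([name, ";".join(messages)])
    return buf.getvalue()


if __name__ == "__main__":
    print(to_csv(unique_failures(SAMPLE_REPORT)), end="")
```

Joining the messages with a semicolon keeps each PDF on a single line, even when it fails several checks.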
Running the analysis
For this analysis I ran the above script for both the fonts and multimedia files, using the following command line (here for the fonts files):
~/pdfPolicyVeraPDF/policyValidate.sh /home/johan/pdfAcrobatEngineering/fonts /home/johan/pdfPolicyVeraPDF/schemas/demo-policy.sch fonts
Results, fonts category
The following table lists, for each PDF in the fonts category, the corresponding (unique) validation errors (taken from the summary CSV file). Note that the text strings in the right column correspond to text values in the assert elements of the policy file.
Test file | Failed assert(s) |
---|---|
EmbeddedCmap.pdf | Font is not embedded |
embedded_fonts.pdf | Font is not embedded |
embedded_pm65.pdf | |
notembedded_pm65.pdf | Font is not embedded |
printtestfont_nonopt.pdf | |
printtestfont_opt.pdf | |
substitution_fonts.pdf | Font is not embedded |
text_images_pdf1.2.pdf | Font is not embedded |
TEXT.pdf | Font is not embedded |
Type3_WWW-HTML.PDF | Font is not embedded |
These results show that most of the PDFs fail our policy on the font embedding objective.
Results, multimedia category
Similarly, below are the results for the multimedia category:
Test file | Failed assert(s) |
---|---|
20020402_CALOS.pdf | Font is not embedded;Movie annotation |
3-D_PDF.pdf | 3D annotation |
AdobeChassisDemo-commented.pdf | 3D annotation |
AdobeChassisDemo-commented_Review.pdf | 3D annotation |
AVI+Transitions Demo.pdf | Document not parsable |
Binder_6-3DPages.pdf | 3D annotation |
Disney-Flash.pdf | Font is not embedded;Screen annotation |
drape_raster_contour_sample.pdf | Font is not embedded;3D annotation |
gXsummer2004-stream.pdf | Document not parsable |
Jpeg_linked.pdf | Encrypted document;Document not parsable |
LabelExample.pdf | Encrypted document;Document not parsable |
movie_down1.pdf | Movie annotation |
movie.pdf | Movie annotation |
MultiMedia_Acro6.pdf | Encrypted document;Document not parsable |
MusicalScore.pdf | Font is not embedded;Screen annotation |
phlmapbeta7.pdf | Font is not embedded;Screen annotation |
remotemovieurl.pdf | Font is not embedded;Movie annotation |
ScriptEvents.pdf | Font is not embedded;Screen annotation |
Service Form_media.pdf | Font is not embedded;Screen annotation |
SVG-AnnotAnim.pdf | Font is not embedded |
SVG.pdf | Font is not embedded |
Trophy.pdf | Font is not embedded;Screen annotation |
us_population.pdf | |
Here the reasons for failing the policy are more diverse. Many of these PDFs contain Screen, Movie or 3D annotations. Non-embedded fonts are common as well. Three PDFs were not parsable because of encryption. This turns out to be a bug that is fixed in newer versions of VeraPDF. Two files (AVI+Transitions Demo.pdf and gXsummer2004-stream.pdf) were not parsable at all. These files could not be opened in Adobe Acrobat either. Finally, one 49 MB file (which is not listed in the table) resulted in an out-of-memory error that crashed VeraPDF altogether. I reported this as a bug.
General observations
First of all I was impressed with the amount of detailed information that VeraPDF can provide about a PDF file. I was also pleasantly surprised at the relative ease of doing policy-based assessments. This is mainly thanks to VeraPDF’s features report, which allows one to address features such as specific annotation types directly. During my earlier attempts at policy-based assessment with Apache Preflight, the detection of non-embedded fonts was particularly difficult (have a look at the Schematron file to see what I mean). With VeraPDF this only needs one single line (though admittedly this probably means that errors related to damaged or malformed fonts won’t be reported). Thanks to VeraPDF’s built-in functionality to do the Schematron validation, it is no longer necessary to use an external Schematron validator (though this is still possible).
Actions missing in action?
One thing I missed is the reporting of Actions. Without this, it is not possible to identify PDFs that contain JavaScript (and some other features as well). An option to include Actions in the ‘Feature Report’ would make a welcome addition. As the PDF/A validation profiles already include checks on Actions, this is probably pretty straightforward (see also this issue).
Writing a policy file
Not having worked on PDF-related things for a while myself, it took me some time to figure out how to put together the (Schematron) policy file. The VeraPDF documentation gives some guidance, but I couldn’t find an exhaustive description of every possible feature in the features report. This meant I first had to run VeraPDF (with feature extraction enabled) on a number of files that I knew to contain certain features I wanted to include in my policy (e.g. embedded fonts, multimedia), inspect the XML output, and then write my Schematron rules based on that output. As I have a pretty good knowledge of the specific PDF data structures involved I was able to do this, but it did make me wonder about users who don’t have that technical knowledge. Possible solutions would be:
- Additional documentation of all possible output elements in the features report. This seems to be in the works already (though not complete yet)
- Inclusion of some example policy files. Actually veraPDF’s Github repo contains a number of these already, but they are not (yet) referenced by the documentation, and I only found out about them after I ran my tests.
It would also help if users of veraPDF would publish and share their policy files.
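The "inspect the XML output" step described above can be made a bit less tedious with a small helper that lists every distinct element path in a file, which gives a quick overview of what is available to write rules against. The Python sketch below makes no assumptions about VeraPDF’s element names. The function name `element_paths` is hypothetical, and namespaced tags will show up in ElementTree’s `{namespace}tag` notation.

```python
import xml.etree.ElementTree as ET


def element_paths(xml_text):
    """Return the sorted set of distinct element paths in an XML document."""
    paths = set()

    def walk(elem, prefix):
        path = f"{prefix}/{elem.tag}"
        paths.add(path)
        for child in elem:
            walk(child, path)

    walk(ET.fromstring(xml_text), "")
    return sorted(paths)


if __name__ == "__main__":
    for path in element_paths("<a><b><c/></b><b/><d/></a>"):
        print(path)
```

Running this on the output for a file that is known to contain a given feature shows which paths appear, and these are the candidates for the context of a Schematron rule.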
Finally it just occurred to me this is a good occasion to give one more bump to this 2009 report I wrote on long-term preservation risks of PDF. It explicitly lists the data structures (e.g. annotations, actions) that are associated with specific (risky) features, which might provide users some guidance as to what features are potentially interesting for inclusion in a policy.
Links
- PDF policy-based validation demo, veraPDF - Github repo with scripts, Schematron policy file and all output files
- VeraPDF
- Adobe Portable Document Format - Inventory of long-term preservation risks
Originally published at the Open Preservation Foundation blog