PDF Validation

Posted on 2023-11-27 at 10:19:00 by Phil
Last update on 2023-11-29 at 11:19:00 by Phil

Something I’ve requested on the Adobe Community Forums (Adobe Acrobat Reader support) is a good PDF (Portable Document Format) debugger or analyzer. This goes beyond simply dumping a PDF file in a readable format — it would look through the file and examine it for (at least) the following items:

  • Make a first pass through the file to pick up the declared PDF version (both the header line and any further declarations). If a different version declaration is encountered later, that would be a warning.
  • Check that appropriate objects exist for all object references. A reference to an object should always point to an existing object of the right type (e.g., a Page). Beware of object streams that may conceal objects!
  • Check that the overall structure seems to be OK (including no duplicate object numbers) and perhaps that there are no circular references (i.e., a parent points to its child, but the child must not in turn be the parent of something above it). Also check that a child declares the correct parent.
  • Check that values are within allowable ranges, which may vary according to the PDF version in use.
  • Check that no required parameters are missing from an object.
  • Check that optional parameters are appropriate and not conflicting with other optional, or mandatory, parameters.
  • Check that streams contain only the appropriate operators (e.g., no q save or Q restore in a text context). Watch out for disallowed graphics stream content once path output has started.
  • At all times, keep an eye out for objects or parameters for a higher PDF version than what has been declared.
  • A warning could be given for dead code in a stream (such as a font declaration that is overridden by another font declaration before any text is output).
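
A few of the checks above (version declarations, duplicate object numbers, references to missing objects) can be roughed out with a plain text scan. What follows is only a minimal sketch in Python, not a real validator. The function name scan_pdf and the wording of the warnings are made up for illustration. It does not look inside object streams or compressed streams, and binary stream data can produce false matches. A PDF with incremental updates may also legitimately define the same object number more than once, which is why a duplicate is reported as a warning and nothing stronger.

```python
import re
from collections import Counter

HEADER_RE = re.compile(rb"%PDF-(\d)\.(\d)")
VERSION_RE = re.compile(rb"/Version\s*/(\d)\.(\d)")
OBJ_DEF_RE = re.compile(rb"(?<!\d)(\d+)\s+(\d+)\s+obj\b")
OBJ_REF_RE = re.compile(rb"(?<!\d)(\d+)\s+(\d+)\s+R\b")


def scan_pdf(data):
    """Naive text scan of PDF bytes; returns a list of warning strings."""
    warnings = []

    # 1. Declared version: header line first, then any later /Version entry.
    header = HEADER_RE.match(data)
    if header is None:
        warnings.append("no %PDF-x.y header at start of file")
    else:
        declared = (header.group(1) + b"." + header.group(2)).decode()
        for m in VERSION_RE.finditer(data):
            later = (m.group(1) + b"." + m.group(2)).decode()
            if later != declared:
                warnings.append(
                    "header declares %s but /Version declares %s" % (declared, later)
                )

    # 2. Duplicate object definitions ("n g obj").
    defined = Counter(
        (int(m.group(1)), int(m.group(2))) for m in OBJ_DEF_RE.finditer(data)
    )
    for (num, gen), count in sorted(defined.items()):
        if count > 1:
            warnings.append("object %d %d defined %d times" % (num, gen, count))

    # 3. References ("n g R") that point at no visible definition.
    referenced = {
        (int(m.group(1)), int(m.group(2))) for m in OBJ_REF_RE.finditer(data)
    }
    for num, gen in sorted(referenced - set(defined)):
        warnings.append("reference to missing object %d %d" % (num, gen))

    return warnings
```

Checking that a referenced object is of the right type (e.g., a Page) would need a real parser that follows each reference and reads the target's dictionary. The scan above only confirms that something with that number exists.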

That ought to take care of a lot of common errors. More edge and corner cases could be added over time.
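
As an illustration of the stream operator check in the list above (no q save or Q restore in a text context), here is a small Python sketch. It is an assumption-laden toy: check_text_context is a hypothetical name, the stream is assumed to be already decompressed and decoded to text, operators are assumed to be separated by whitespace, and literal strings are blanked out with a simple pattern that does not handle nested parentheses, hex strings, or inline images.

```python
import re

# Literal strings "( ... )" are blanked so that text such as "(a q b)"
# is not mistaken for operators. Nested parentheses are not handled.
LITERAL_STRING_RE = re.compile(r"\((?:\\.|[^\\()])*\)", re.DOTALL)


def check_text_context(content):
    """Return a list of problems found in a decoded content stream."""
    problems = []
    in_text = False   # True between BT and ET
    depth = 0         # current q/Q nesting depth

    for token in LITERAL_STRING_RE.sub("()", content).split():
        if token == "BT":
            if in_text:
                problems.append("BT inside an open text object")
            in_text = True
        elif token == "ET":
            if not in_text:
                problems.append("ET without a matching BT")
            in_text = False
        elif token in ("q", "Q"):
            if in_text:
                problems.append("%s inside a BT/ET text object" % token)
            depth += 1 if token == "q" else -1
            if depth < 0:
                problems.append("Q without a matching q")
                depth = 0

    if in_text:
        problems.append("BT without a closing ET")
    if depth > 0:
        problems.append("%d q without a matching Q" % depth)
    return problems
```

For example, check_text_context("BT q (x) Tj Q ET") reports both the q and the Q, while the same operators placed outside the BT/ET pair pass cleanly.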

Should anyone implement such a utility, whether free/open source or for pay, please let me know so I can put a pointer to it in here!

Further discussion (appendable) in https://github.com/PhilterPaper/Perl-PDF-Builder/issues/199 .


Posted on 2023-11-27 at 11:22:00 by Phil

Even before implementing a debugger/analyzer, a useful set of tools would be utilities to “dump” and “undump” a PDF file to and from a text (flat) file. This would permit a user to easily view the innards of a PDF, modify it with a normal text editor, and write it back out into a usable PDF file.

The text file would not necessarily have any analysis or debugging done; just a clean dump of the contents, without any specific labeling or indication of what something is. Such extra material might be added by the user, and removed during output.

  • Input (dump) would produce numeric object numbers <n>, with optionally named <root> and <info> objects. Output (undump) would permit the user to give a unique name to any object, and refer to it by that name anywhere in the file. A preferred object number could optionally be given for each named object.
  • All stream lengths, whether inline or in an object of their own, would be replaced by length(<obj>) during input (dump) and would be recalculated during output (undump).
  • The cross reference table or stream would be recalculated upon output (undump).
  • All data would be presented in non-binary human-readable form (possibly a simplified XML-like format), and binary data would be restored upon output (undump). Binary data (such as a font or image) would probably be escaped, and we would have to decide how to present UTF-8 etc. text (that would probably be a command-line user option).
  • Stream compression would be undone upon input (dump) and, unless overridden, restored upon output (undump). Likewise for password protection?
  • Permit comments and other such notations in the text file, that will likely be stripped out upon output (undump).
  • A user should be able to freely reformat an object’s stream, to break it up into easily comprehensible lines, and have it go back together upon output (undump), although spaces may end up being added or removed, compared to the original.
  • Optionally, appended/replaced objects and cross reference tables could be consolidated to reduce the size of the PDF file.
  • Possibly, cross-reference streams and object streams could be converted into cross-reference tables and ordinary objects. It would be up to the user to know whether it’s safe to lower the PDF version to 1.4.
  • Integers with leading 0’s and only octal digits may apparently be misinterpreted as octal constants by some Readers, and should be cleaned up (remove leading zeros).
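
To make the stream length and compression items above concrete, here is a rough Python sketch of what undump might do for a single stream object. Everything about the flat-file side is an assumption: the placeholder is spelled length(<obj>) as in the list above, the dictionary text is assumed to arrive without a /Filter entry (dump having removed it), only Flate compression is handled, and build_stream_object is a hypothetical name.

```python
import zlib

LENGTH_PLACEHOLDER = b"length(<obj>)"


def build_stream_object(obj_num, dict_body, content, compress=True):
    """Assemble 'n 0 obj ... endobj' bytes for one stream object.

    dict_body is the text between << and >>, still containing the
    length placeholder. content is the uncompressed stream data.
    """
    if compress:
        # Restore the compression that dump removed.
        content = zlib.compress(content)
        dict_body = dict_body + b" /Filter /FlateDecode"

    # The length is computed only now, from the bytes actually written.
    dict_body = dict_body.replace(
        LENGTH_PLACEHOLDER, str(len(content)).encode("ascii")
    )

    return (
        b"%d 0 obj\n<< " % obj_num
        + dict_body
        + b" >>\nstream\n"
        + content
        + b"\nendstream\nendobj\n"
    )
```

Because the length is derived from the final bytes, the user can reflow or edit the stream text in the flat file without ever touching a /Length value.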

Note that even if you “round trip” a PDF file, there’s no guarantee that the resulting PDF will be byte-for-byte identical to the original PDF! It should be functionally identical, but may be slightly different internally. All stream lengths and object offsets will be recalculated and may change.
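
The reason offsets change is easy to see in a sketch of how undump might rebuild a classic cross-reference table: every offset is simply wherever the object happens to land in the newly assembled bytes. The Python below is a simplified illustration under several assumptions: assemble_pdf is a hypothetical name, objects are numbered 1..N with no gaps, all generation numbers are 0, and only a plain table (not a cross-reference stream) is written.

```python
def assemble_pdf(objects, root_num, version="1.4"):
    """Concatenate objects and write a fresh xref table and trailer.

    objects maps object number -> complete b'n 0 obj ... endobj\\n' bytes.
    """
    numbers = sorted(objects)
    if numbers != list(range(1, len(numbers) + 1)):
        raise ValueError("this sketch needs objects numbered 1..N without gaps")

    out = bytearray(b"%PDF-" + version.encode("ascii") + b"\n")

    # Record where each object starts in the new file.
    offsets = {}
    for num in numbers:
        offsets[num] = len(out)
        out += objects[num]

    xref_pos = len(out)
    size = len(numbers) + 1          # entry 0 plus one entry per object
    out += b"xref\n0 %d\n" % size
    out += b"0000000000 65535 f \n"  # object 0: head of the free list
    for num in numbers:
        # Each entry is exactly 20 bytes: offset, generation, 'n', EOL.
        out += b"%010d 00000 n \n" % offsets[num]

    out += b"trailer\n<< /Size %d /Root %d 0 R >>\n" % (size, root_num)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_pos
    return bytes(out)
```

Any edit that changes the size of an earlier object shifts every later offset and the startxref value, which is why a byte-for-byte identical round trip cannot be expected.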

Should anyone implement such a utility, whether free/open source or for pay, please let me know so I can put a pointer to it in here!


Posted on 2023-12-01 at 14:31:00 by Phil

Another item of interest in debugging a PDF is finding out why certain images display correctly on some Readers, but not on others. The error may be a blank screen, or a message that there is insufficient data.

There are many image formats, each with numerous variants, and different compression methods. It may not be possible for the image-checking section of a debugger to handle everything, but perhaps we can make a start. At least, we can output a warning that something was found that is known to work on some Readers, but may not on others.

Somewhere in the header for an image object should be information on what compression method is being used. That could be checked for compatibility with a range of Readers. The data stream length can also be checked, but if it’s the expanded (decoded) data, the debugger would have to be able to uncompress the data, after which it could confirm that there is a correct amount of data. This should be possible, but is a lot of work.
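
For the simplest case, the amount-of-data check described above might look like the following Python sketch. It makes several assumptions that a real tool could not: the stream uses Flate compression only, no predictor is in effect, and the color space is one of the three basic device spaces. The names expected_image_bytes and check_flate_image are made up for illustration.

```python
import zlib

# Samples per pixel for the basic device color spaces only.
COMPONENTS = {"DeviceGray": 1, "DeviceRGB": 3, "DeviceCMYK": 4}


def expected_image_bytes(width, height, bits_per_component, color_space):
    """Size of the decoded sample data, with each row padded to a whole byte."""
    row_bits = width * COMPONENTS[color_space] * bits_per_component
    row_bytes = (row_bits + 7) // 8
    return row_bytes * height


def check_flate_image(stream, width, height, bits_per_component, color_space):
    """Return a warning string, or None if the decoded size matches."""
    try:
        decoded = zlib.decompress(stream)
    except zlib.error as exc:
        return "stream does not inflate: %s" % exc

    expected = expected_image_bytes(width, height, bits_per_component, color_space)
    if len(decoded) < expected:
        return "insufficient data: %d bytes, expected %d" % (len(decoded), expected)
    if len(decoded) > expected:
        return "extra data: %d bytes, expected %d" % (len(decoded), expected)
    return None
```

Other filters (DCT, CCITT fax, JBIG2, JPEG 2000) each need their own decoder before the same size comparison can be made, which is where most of the work would lie.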

Different image formats have different ways of arranging data, which not all Readers may support. For example, TIFF’s CCITT Group 4 fax has some data arrangements that apparently few Readers support. If this data is hidden within the data stream itself, and not in the object header, it may not be possible for a debugger to spot the possible problem (informing the user that a certain format is in use, which may cause problems).
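
What the object header does reveal can at least be reported to the user. The Python sketch below pulls the /Filter names out of an image dictionary given as text and, for CCITT fax data, reads the /K parameter, which in PDF selects the encoding scheme (negative for Group 4, zero for Group 3 one-dimensional, positive for Group 3 mixed, with zero as the default). The function name describe_image_encoding is hypothetical, the regex parsing is deliberately naive, and the sketch says nothing about which Readers accept which variant. Anything recorded only inside the data stream stays invisible to it.

```python
import re

# /Filter may be a single name or an array of names.
FILTER_RE = re.compile(r"/Filter\s*(?:\[([^\]]*)\]|/(\w+))")
K_RE = re.compile(r"/K\s+(-?\d+)")


def describe_image_encoding(dict_text):
    """Return informational notes about how an image stream is encoded."""
    m = FILTER_RE.search(dict_text)
    if m is None:
        return ["no /Filter entry: stream is stored unfiltered"]

    if m.group(1) is not None:
        filters = re.findall(r"/(\w+)", m.group(1))   # array form
    else:
        filters = [m.group(2)]                        # single name
    notes = ["filters in use: " + ", ".join(filters)]

    if "CCITTFaxDecode" in filters:
        k_match = K_RE.search(dict_text)
        k = int(k_match.group(1)) if k_match else 0   # /K defaults to 0
        if k < 0:
            scheme = "Group 4 (pure two-dimensional)"
        elif k == 0:
            scheme = "Group 3 one-dimensional"
        else:
            scheme = "Group 3 mixed one- and two-dimensional"
        notes.append("CCITT fax encoding: " + scheme)

    return notes
```

A debugger could attach such notes to its report so that the user at least knows a less common format is in use, even when it cannot tell whether a given Reader will display it.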

 

All content © copyright 2005 – 2025 by Catskill Technology Services, LLC.
All rights reserved.
Note that Third Party software (whether Open Source or proprietary) on this site remains under the copyright and license of its owners. Catskill Technology Services, LLC does not claim copyright over such software.

 

This page is https://www.catskilltech.com/utils/show.php?link=pdf-validation

Last updated Wed, 03 Jan 2024 at 9:32 AM
