
RT 106020 - PDF validation (was Bug with recognizing PDF files via open_scalar)

Phil (Global Moderator)
Wed Jul 22 09:56:37 2015 dearly [...] scenariolearning.com - Ticket created
Subject:    Bug with recognizing PDF files via open_scalar
Date:    Wed, 22 Jul 2015 09:56:23 -0400
To:    bug-PDF-API2 [...] rt.cpan.org
From:    Douglas Early <dearly [...] scenariolearning.com>

I ran into this bug when attempting to import a large(ish) number of PDFs stored as scalar data in a database, about 72 PDFs in all.  Most of them work fine, but a few are not recognized as valid PDFs despite rendering just fine in browsers and in Acrobat.

The error message is as follows:
GLOB(0xd837530) not a PDF file version 1.x at /home/dearly/git-working/document/Document/script/../local/lib/perl5/PDF/API2/Basic/PDF/File.pm at line 241

The head of the file (retrieved in the variable buffer) looks like this:
Code: [Select]
%PDF-1.4 ▒P2 0 obj  <</Length 3 0 R /Filter /FlateDecode >>  stream
Q0T0BC3c#c3▒▒\▒>y
endstream endobj
3 0 obj 31 endobj
4 0 obj  <</Width 2544 /Height 3300 /BitsPerComponent 1 /Subtype /Image /Type /XObject /ColorSpace/DeviceGray /Lengf32b8e','8867a55c-5513-4bce-b2dd-700950cee8cb'

I noticed that removing the $cr variable from the regex on line 240, which tests for validity, allows the file to pass.  Perhaps $cr needs to be amended or simply removed from the regex pattern?

Cheers,

--
Doug Early
Software Developer
Scenario Learning

#
Wed Jul 22 12:26:44 2015 steve [...] deefs.net - Correspondence added

Hi Doug,

Interesting.  In this case, the error message is correct: according to the spec (section 7.5.2), the first line of the file may contain only the header (%PDF-1.#), and the second line needs to be a comment line if the file contains any characters outside the 7-bit ASCII set (I'm pretty sure PDF::API2 doesn't check for that).  So the file isn't a valid PDF.

If changing the regex works for you, feel free to keep the change, but I would expect the file to have issues (perhaps not obvious ones) in other readers as well, given what you've shown me.

Steve
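
For readers following along, here is a minimal sketch of the kind of header test being discussed. It is written in Python for illustration and is not the PDF::API2 code: the two patterns only approximate a check with and without a trailing end-of-line requirement (the role $cr appears to play in the ticket), and the name pdf_minor_version is hypothetical.

```python
import re

# Assumption: an end-of-line is CR, LF, or CRLF, roughly what the ticket calls $cr.
EOL = rb"(?:\r\n|\r|\n)"

# Strict: the header must be the only thing on the first line.
STRICT_HEADER = re.compile(rb"%PDF-1\.(\d)[ \t]*" + EOL)

# Lenient: only the "%PDF-1.x" prefix is required (the effect of dropping the EOL test).
LENIENT_HEADER = re.compile(rb"%PDF-1\.(\d)")


def pdf_minor_version(head: bytes, strict: bool = True):
    """Return the minor version digit from the header, or None if it does not match."""
    pattern = STRICT_HEADER if strict else LENIENT_HEADER
    match = pattern.match(head)
    return int(match.group(1)) if match else None


# A header on its own line, followed by a binary comment line.
good = b"%PDF-1.4\n%\xe2\xe3\xcf\xd3\n1 0 obj\n"
# Modelled on the reported file: binary data and an object on the header line.
bad = b"%PDF-1.4 \xb5P2 0 obj  <</Length 3 0 R"

print(pdf_minor_version(good))               # 4
print(pdf_minor_version(bad))                # None
print(pdf_minor_version(bad, strict=False))  # 4
```

The second sample fails the strict test because something other than an end-of-line follows the version number, which matches the behaviour Doug reports.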
#
Wed Jul 22 12:26:44 2015 The RT System itself - Status changed from 'new' to 'open'
#
Wed Jul 22 12:26:48 2015 steve [...] deefs.net - Status changed from 'open' to 'rejected'
#
Thu Jul 23 09:06:14 2015 dearly [...] scenariolearning.com - Correspondence added
Subject:    [rt.cpan.org #106020]
Date:    Thu, 23 Jul 2015 09:05:47 -0400
To:    bug-pdf-api2 [...] rt.cpan.org
From:    Douglas Early <dearly [...] scenariolearning.com>

Good Morning!

Thanks for the swift reply.  I think for now I will simply leave in the modification I mentioned previously.  For our use case we are more concerned with being able to stitch and display the PDFs we've been given (managing an inventory of chemical safety data sheets) than with strict conformance.

Would there be any interest in a submitted patch that would allow the PDF::API2 object to be instantiated with a validation setting?  Perhaps something akin to

Code: [Select]
validation => 'strict'
or

Code: [Select]
validation => 'compatibility'
That would control how strictly the PDF is validated?

I ask because the language in the PDF spec regarding the second line:
Quote
If a PDF file contains binary data, as most do (see Section 3.1, “Lexical Conventions”), it is recommended that the header line be immediately followed by a comment line containing at least four binary characters

makes it sound like the second line is a suggestion rather than a hard requirement (unlike the first line, which is fairly explicit in being non-optional).

I would be willing to code and submit a patch for that if you feel that it would contribute something to the project.

Cheers,

--
Doug Early
Software Developer
Scenario Learning
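
To make the proposal concrete, here is a rough sketch of how a validation setting might behave. It is in Python for illustration and is not a PDF::API2 patch: the function check_header, the HeaderError class, and the exact behaviour of each mode are assumptions, and only the two option values come from the message above.

```python
import re

# Option values taken from the proposal; everything else here is hypothetical.
VALID_MODES = ("strict", "compatibility")


class HeaderError(ValueError):
    """Raised when the header is unacceptable under the chosen mode."""


def check_header(data: bytes, validation: str = "strict") -> list:
    """Return a list of warnings; raise HeaderError for fatal problems."""
    if validation not in VALID_MODES:
        raise ValueError(f"unknown validation mode: {validation!r}")

    warnings = []
    match = re.match(rb"%PDF-1\.\d", data)
    if match is None:
        # A missing header is treated as fatal in both modes.
        raise HeaderError("not a PDF file version 1.x")

    # Anything other than optional blanks and an end-of-line after the header
    # is an error in strict mode and only a warning in compatibility mode.
    rest = data[match.end():]
    if not re.match(rb"[ \t]*(?:\r\n|\r|\n)", rest):
        message = "header is not alone on the first line"
        if validation == "strict":
            raise HeaderError(message)
        warnings.append(message)
    return warnings


print(check_header(b"%PDF-1.4\n%\xe2\xe3\xcf\xd3\n"))
print(check_header(b"%PDF-1.4 \xb5P2 0 obj", validation="compatibility"))
```

Returning warnings instead of raising lets a caller in compatibility mode still log which files were irregular, which may matter when stitching documents from mixed sources.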

<formatting cleanup - Mod.>
« Last Edit: May 01, 2017, 10:29:05 AM by Phil »

Phil (Global Moderator)
See also RT 117210 — a request to repair damaged or corrupt PDF files, which seems to parallel this one. Possibly related might be RT 120397.