[RT 133131] readval() failure

Offline Phil

  • Global Moderator
  • Hero Member
  • *****
  • 706
    • View Profile
[RT 133131] readval() failure
« August 08, 2020, 12:46:30 PM »
Sat Aug 08 11:48:56 2020 Christopher.Papademetrious@synopsys.com - Ticket created
Subject:    how do I file an issue for PDF::API2?
Date:    Sat, 8 Aug 2020 15:42:53 +0000
To:    "bug-PDF-API2@rt.cpan.org" <bug-PDF-API2@rt.cpan.org>
From:    Chris Papademetrious <Christopher.Papademetrious@synopsys.com>

Hi Steve,

I hope this email finds you at all! We're using PDF::API2 at our company, and I ran into a PDF that it doesn't like:

Reading 'chrispy/fcdm.pdf'...
Can't parse `' near 2316292767724077056 length 0. at /u/doc/perl5/lib/perl5/PDF/API2/Basic/PDF/File.pm line 694.

The parser hits a zero-length string and doesn't know what to do with it. I was going to file an issue in the GitHub repo, but there's no "Issues" tab like I see for other repos.

How do you suggest that I proceed?

And thank you for taking on ownership of this repo! It's a tremendously powerful library, and there's not an alternative that does everything that it does.

-----
Chris Papademetrious
Tech Writer, Implementation Group
(610) 628-9718 home office
(570) 460-6078 cell

Sat Aug 08 11:51:54 2020 chrispitude@gmail.com - Correspondence added

Ugh, I didn't realize this was going to create a ticket. I guess you're using rt.cpan.org for issue tracking but GitHub for revision control?

Re: [RT 133131] readval() failure
« Reply #1: August 08, 2020, 12:49:39 PM »
Sat Aug 08 12:43:44 2020 PMPERRY@cpan.org - Correspondence added

An empty string $str was fed to the readval() method, and it doesn't know how to handle that. Can you tell us what you were trying to do (read in an existing PDF?) and maybe show a small Perl test case? Feeding an empty or nonsense string (PDF file content) into this routine will produce this error, but we need to figure out what happened "upstream" that triggered it.

A file offset of about 2.3 quintillion? That's bigger than any PDF file I've ever heard of, and possibly bigger than any filesystem can handle. Something seems to have gone very wrong there, but without knowing what you were doing, it's hard to diagnose.
Quote
there's not an alternative that does everything that it does

If you think the PDF you're reading in is OK (loads OK in all the readers you try, AND NO READER ASKS PERMISSION TO SAVE THE FILE WHEN YOU EXIT IT), you might give PDF::Builder a try and see if that works any better. Maybe something upstream is different enough to get past whatever is causing the problem. It includes instructions on porting from PDF::API2, but usually just changing "PDF::API2" to "PDF::Builder" in your Perl code will do the job.

By the way, if your PDF is PDF version 1.5 or higher, that can blow up PDF::API2. Builder might handle it a little better, but no promises.

Re: [RT 133131] readval() failure
« Reply #2: August 08, 2020, 07:36:02 PM »
Sat Aug 08 14:26:24 2020 chrispitude@gmail.com - Correspondence added
Subject:    PDF::API2 unable to open a compressed-stream PDF file

Thanks Phil! Given the history described in your docs and your desire for more aggressive issue resolution, it appears I should indeed be using PDF::Builder instead.

I'm trying to open a PDF file written out by an authoring/publishing tool:

Code:
#!/usr/bin/perl
use PDF::Builder;
my $pdf = PDF::Builder->open('bad.pdf');

I've isolated the issue to a single-page 7k PDF that fails to load in PDF::API2 and PDF::Builder, yet reads into Ghostscript (and every other tool I've tried) without complaint.

I tried decompressing the streams with

qpdf --qdf --object-streams=disable bad.pdf bad_decompressed.pdf

and both PDF::API2 and PDF::Builder can read the stream-decompressed version of the file, so the failure only shows up with the original. I'm attaching the problematic PDF file for your thoughts.

Re: [RT 133131] readval() failure
« Reply #3: August 08, 2020, 07:39:45 PM »
Sat Aug 08 16:15:08 2020 chrispitude@gmail.com - Correspondence added

The bad.pdf attachment in the previous message is 7164 bytes:

% ls -l bad.pdf
-rwxrwxrwx 1 chrispy chrispy 7164 Aug  8 14:21 bad.pdf


It fails with an error about parsing an endstream construct:

====
Reading 'bad.pdf'...
Quote
Can't parse `endstream
endobj

startxref
6737
%%EOF

' near 7165 length 40. at /usr/local/share/perl/5.26.1/PDF/Builder/Basic/PDF/File.pm line 726.
====

I have another PDF file, bad2.pdf (attached to this message) that is 7406 bytes, but fails with an error about parsing an empty string:

====
Reading 'bad2.pdf'...
Quote
Can't parse `' near 6419318318863220736 length 0. at /usr/local/share/perl/5.26.1/PDF/Builder/Basic/PDF/File.pm line 726.
====

If I add or remove a few characters in the authoring tool, the resulting PDF fails with either the endstream message or the empty-string message. I suspect they're both variants of the same boundary behavior in the parsing.

Re: [RT 133131] readval() failure
« Reply #4: August 08, 2020, 07:41:42 PM »
Sat Aug 08 19:31:23 2020 PMPERRY@cpan.org - Correspondence added

I can see a major problem right away. A PDF-1.4 file should have 'startxref' pointing to the cross-reference TABLE headed by 'xref' and the starting object and length. Instead, your 'bad.pdf' startxref is pointing to an object of type XRef, which appears to be a cross-reference STREAM. The minimum PDF level for a cross-reference stream is 1.5. Any idea how a PDF with 1.5 level features was labeled as 1.4? I don't think it's going to work.
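
The check described above can be sketched in a few lines of Perl. This is only a rough diagnostic under stated assumptions, not code from PDF::API2 or PDF::Builder: the sub name xref_kind is made up for illustration, it scans the whole file for the last 'startxref' rather than just the tail, and it only looks at the first few bytes at the target offset.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of PDF::API2 or PDF::Builder).
# Given the raw bytes of a PDF, report what the last 'startxref' points at:
# a classic cross-reference table ('xref') or an indirect object, which
# for a well-formed file would be a cross-reference stream.
sub xref_kind {
    my ($data) = @_;

    # Keep the offset from the last 'startxref' found in the file.
    my $offset;
    while ($data =~ /startxref\s+(\d+)/g) {
        $offset = $1;
    }
    return 'no startxref' unless defined $offset;
    return 'offset past end of file' if $offset >= length($data);

    # Look at the first few bytes at that offset.
    my $at = substr($data, $offset, 32);
    return 'table'  if $at =~ /^xref\b/;
    return 'stream' if $at =~ /^\d+\s+\d+\s+obj\b/;
    return 'unknown';
}

# Usage: perl xref_kind.pl some.pdf
if (@ARGV) {
    open my $fh, '<:raw', $ARGV[0] or die "Cannot open $ARGV[0]: $!";
    my $data = do { local $/; <$fh> };
    close $fh;
    print "$ARGV[0]: ", xref_kind($data), "\n";
}
```

A result of 'stream' on a file whose header says %PDF-1.4 would match the mismatch described in this post.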

Re: [RT 133131] readval() failure
« Reply #5: August 08, 2020, 07:47:05 PM »
Sat Aug 08 19:47:14 2020 PMPERRY@cpan.org - Correspondence added

When I uncompressed the content streams using PDFtk, the result also worked with the 'open' call. I see that PDFtk gave me a proper PDF-1.4 cross-reference table structure when it created the new PDF, and the same may have happened for you with qpdf.

Re: [RT 133131] readval() failure
« Reply #6: August 09, 2020, 10:12:09 AM »
Sun Aug 09 08:19:09 2020 chrispitude@gmail.com - Correspondence added

Hi Phil,

Thanks for debugging the problem! Now I know the issue is with the PDF itself, not the parsing code. I'll take this up with the software vendor. And I'm going to have a closer look at PDF::Builder today - thanks for your efforts on this too!

Admins, feel free to close this ticket. (I don't have permissions to do so.)

Sun Aug 09 10:08:17 2020 PMPERRY@cpan.org - Correspondence added

Chris,

I can't guarantee that is the problem, but using a cross-reference stream in a PDF-1.4 document looks very suspicious to me. Certainly you should take it up with the vendor and see if they have an explanation for that (and why they feel it's OK to do). Please get back to us with whatever you find.

Since you opened this ticket by mail, I'm not sure you can close it yourself. If you can't, only Steve (the owner) can.

If you have any issues with PDF::Builder, please use its GitHub issues area to discuss them. Please don't clutter up PDF::API2's CPAN RT area with other products' issues.

Re: [RT 133131] readval() failure
« Reply #7: August 10, 2020, 09:54:14 PM »
Mon Aug 10 07:01:01 2020 futuramedium@yandex.ru - Correspondence added

Quote
Now I know the issue is with the PDF itself, not the parsing code.

Actually, the issue is with the parsing code: it's a bug in PDF::API2. In this line:

https://metacpan.org/release/PDF-API2/source/lib/PDF/API2/Basic/PDF/File.pm#L1146

the template should be 'Q>' (a bare 'Q' unpacks in the machine's native byte order, while the fields in a cross-reference stream are big-endian). Moreover, limiting the possible widths to 1, 2, 3, 4, or 8 bytes in the enclosing subroutine is arbitrary, but (1) at least there's a provision to die noisily, and (2) the chance of needing any width above 4 is extremely low. There's probably not much need to rewrite the sub beyond inserting the '>', but the PDF::Tiny and CAM::PDF sources are there for inspiration if you decide otherwise. Also, the guys who use 8 bytes to encode offsets in their PDF lib are lazy indeed.

The version in the header is overridden by the document catalogue entry, so it doesn't matter.
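
To illustrate the byte-order point, here is a small standalone sketch. It is not the code from File.pm. It assumes a Perl built with 64-bit integers, uses 'Q<' to stand in for what a bare 'Q' does on a little-endian machine, and the helper be_uint is a hypothetical name. The offset 5721 is an inference rather than something confirmed in this thread: it is the value whose byte-swapped form equals the number in the bad2.pdf error message.

```perl
use strict;
use warnings;

# An 8-byte big-endian field holding the offset 5721, as it might appear
# in one row of a cross-reference stream (illustrative value).
my $field = pack('Q>', 5721);

# 'Q<' mimics a bare 'Q' on a little-endian machine; 'Q>' forces big-endian.
my $wrong = unpack('Q<', $field);
my $right = unpack('Q>', $field);

print "little-endian read: $wrong\n";   # 6419318318863220736
print "big-endian read:    $right\n";   # 5721

# Hypothetical width-independent alternative: fold the bytes in,
# most significant first, so any field width up to 8 bytes works.
sub be_uint {
    my ($bytes) = @_;
    my $value = 0;
    $value = ($value << 8) | $_ for unpack('C*', $bytes);
    return $value;
}

print "3-byte field:       ", be_uint("\x00\x16\x59"), "\n";   # 5721
```

The same arithmetic applied to the number in the original fcdm.pdf report, 2316292767724077056, gives 2106656, which is a plausible offset inside a file of a couple of megabytes.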

Mon Aug 10 19:42:46 2020 chrispitude@gmail.com - Correspondence added

futuramedium, THANK YOU!!

Out of 186 PDFs that exhibited the problem, your suggested code change fixed all of them. And the same code change worked equally well for both PDF::API2 and PDF::Builder.

We just installed the latest release of our publishing software, and it uses Apache FOP for publishing, so it's possible that this problematic output construct might occur outside my organization too.

Phil, do you want me to file a PDF::Builder issue for this code change?

And it looks like this ticket should indeed stay open for the change to be made in PDF::API2!

Thanks again to everyone who jumped in and quickly bashed this one out.

Mon Aug 10 21:48:54 2020 PMPERRY@cpan.org - Correspondence added

No need to open a PDF::Builder bug ticket... I have this one on file.

I will consider putting the patch in once I have a chance to carefully examine it and determine whether it's really a useful fix or is just papering over a PDF bug. I'm still very concerned about a cross-reference stream (PDF-1.5) apparently being put into a PDF-1.4 document. It would be nice to hear your vendor's explanation of why they did it that way. I'm reluctant to allow a PDF-1.5 feature when reading in a PDF-1.4 document. If I can detect that it's a cross-reference stream, I might be able to bump up the version to 1.5 on the fly, but I have to carefully look at it first.

PDF::API2 (and PDF::Builder) had some code added recently to handle cross-reference streams without blowing up, but I want to make sure I understand the full picture before I start slapping in ad-hoc fixes.