
[RT 133131] readval() failure

Offline Phil

  • Global Moderator
  • Hero Member
  • *****
  • 749
    • View Profile
[RT 133131] readval() failure
« August 08, 2020, 12:46:30 PM »
Sat Aug 08 11:48:56 2020 Christopher.Papademetrious@synopsys.com - Ticket created
Subject:    how do I file an issue for PDF::API2?
Date:    Sat, 8 Aug 2020 15:42:53 +0000
To:    "bug-PDF-API2@rt.cpan.org" <bug-PDF-API2@rt.cpan.org>
From:    Chris Papademetrious <Christopher.Papademetrious@synopsys.com>

Hi Steve,

I hope this email finds you at all! We're using PDF::API2 at our company, and I ran into a PDF that it doesn't like:

Reading 'chrispy/fcdm.pdf'...
Can't parse `' near 2316292767724077056 length 0. at /u/doc/perl5/lib/perl5/PDF/API2/Basic/PDF/File.pm line 694.

The parser hits a zero-length string and doesn't know what to do with it. I was going to file an issue in the GitHub repo, but there's no "Issues" tab like I see for other repos.

How do you suggest that I proceed?

And thank you for taking on ownership of this repo! It's a tremendously powerful library, and there's not an alternative that does everything that it does.

-----
Chris Papademetrious
Tech Writer, Implementation Group
(610) 628-9718 home office
(570) 460-6078 cell

Sat Aug 08 11:51:54 2020 chrispitude@gmail.com - Correspondence added

Ugh, I didn't realize this was going to create a ticket. I guess you're using rt.cpan.org for issue tracking but GitHub for revision control?

Re: [RT 133131] readval() failure
« Reply #1: August 08, 2020, 12:49:39 PM »
Sat Aug 08 12:43:44 2020 PMPERRY@cpan.org - Correspondence added

An empty string $str was fed to the readval() method, so it doesn't know what to do (how to handle it). Can you tell us what you were trying to do (read in an existing PDF?) and maybe show a small test case Perl code? Feeding an empty (or nonsense format) string (PDF file content) into this routine will produce this error, but we need to figure out what happened "upstream" that triggered it.

A file offset of 2.3 quintillion? That's bigger than any PDF file I've ever heard of, and possibly bigger than any filesystem can handle. Something seems to have gone very wrong there, but without knowing what you were doing, it's hard to diagnose.
Quote
there's not an alternative that does everything that it does

If you think that your PDF you're reading in is OK (loads OK in all the readers you try, AND NO READER ASKS PERMISSION TO SAVE THE FILE WHEN YOU EXIT IT), you might give PDF::Builder a try and see if that works any better. Maybe something upstream is different enough to get past whatever is causing the problem. It includes instructions on porting from PDF::API2, but usually just changing "PDF::API2" to "PDF::Builder" in your Perl code will do the job.

By the way, if your PDF is PDF version 1.5 or higher, that can blow up PDF::API2. Builder might handle it a little better, but no promises.

Re: [RT 133131] readval() failure
« Reply #2: August 08, 2020, 07:36:02 PM »
Sat Aug 08 14:26:24 2020 chrispitude@gmail.com - Correspondence added
Subject:    PDF::API2 unable to open a compressed-stream PDF file

Thanks Phil! Given the history described in your docs and your desire for more aggressive issue resolution, it appears I should indeed be using PDF::Builder instead.

I'm trying to open a PDF file written out by an authoring/publishing tool:

Code:
#!/usr/bin/perl
use strict;
use warnings;
use PDF::Builder;

# open() dies with a "Can't parse ..." error on the problem file
my $pdf = PDF::Builder->open('bad.pdf');

I've isolated the issue to a single-page 7k PDF that fails to load in PDF::API2 and PDF::Builder, yet reads into Ghostscript (and every other tool I've tried) without complaint.

I tried decompressing the streams with

qpdf --qdf --object-streams=disable bad.pdf bad_decompressed.pdf

and both PDF::API2 and PDF::Builder can read the stream-decompressed version of the file. I'm attaching the problematic PDF file for your thoughts.

Re: [RT 133131] readval() failure
« Reply #3: August 08, 2020, 07:39:45 PM »
Sat Aug 08 16:15:08 2020 chrispitude@gmail.com - Correspondence added

The bad.pdf attachment in the previous message is 7164 bytes:

% ls -l bad.pdf
-rwxrwxrwx 1 chrispy chrispy 7164 Aug  8 14:21 bad.pdf


It fails with an error about parsing an endstream construct:

====
Reading 'bad.pdf'...
Quote
Can't parse `endstream
endobj

startxref
6737
%%EOF

' near 7165 length 40. at /usr/local/share/perl/5.26.1/PDF/Builder/Basic/PDF/File.pm line 726.
====

I have another PDF file, bad2.pdf (attached to this message) that is 7406 bytes, but fails with an error about parsing an empty string:

====
Reading 'bad2.pdf'...
Quote
Can't parse `' near 6419318318863220736 length 0. at /usr/local/share/perl/5.26.1/PDF/Builder/Basic/PDF/File.pm line 726.
====

If I add or remove a few characters in the authoring tool, the resulting PDF fails with either the endstream message or the empty-string message. I suspect they're both variants of the same boundary behavior in the parsing.

Re: [RT 133131] readval() failure
« Reply #4: August 08, 2020, 07:41:42 PM »
Sat Aug 08 19:31:23 2020 PMPERRY@cpan.org - Correspondence added

I can see a major problem right away. A PDF-1.4 file should have 'startxref' pointing to the cross-reference TABLE headed by 'xref' and the starting object and length. Instead, your 'bad.pdf' startxref is pointing to an object of type XRef, which appears to be a cross-reference STREAM. The minimum PDF level for a cross-reference stream is 1.5. Any idea how a PDF with 1.5 level features was labeled as 1.4? I don't think it's going to work.
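
The check described above (does 'startxref' point at a classic table or at an object?) can be sketched roughly as follows. This is only an illustration: xref_kind is a hypothetical helper, not part of PDF::API2 or PDF::Builder, and the two "files" are made-up strings with just enough structure to exercise it.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of PDF::API2 or PDF::Builder.
# Returns 'table', 'stream', or 'unknown' for whatever the final
# startxref offset points at.
sub xref_kind {
    my ($pdf_bytes) = @_;
    my ($offset) = $pdf_bytes =~ /startxref\s+(\d+)\s+%%EOF\s*\z/
        or return 'unknown';
    return 'unknown' if $offset >= length $pdf_bytes;
    my $at = substr($pdf_bytes, $offset, 32);
    return 'table'  if $at =~ /\Axref\b/;             # classic cross-reference table
    return 'stream' if $at =~ /\A\d+\s+\d+\s+obj\b/;  # an object, presumably /Type /XRef
    return 'unknown';
}

# Two tiny made-up files (assumptions for illustration only).
my $head    = "%PDF-1.4\n";
my $xref_at = length $head;
my $table   = $head . "xref\n0 1\n0000000000 65535 f \ntrailer\n<< /Size 1 >>\nstartxref\n$xref_at\n%%EOF\n";
my $stream  = $head . "7 0 obj\n<< /Type /XRef >>\nstream\nendstream\nendobj\nstartxref\n$xref_at\n%%EOF\n";

print xref_kind($table),  "\n";   # table
print xref_kind($stream), "\n";   # stream
```

A real reader would go on to parse the object's dictionary to confirm /Type /XRef; the sketch only looks at the first few bytes.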

Re: [RT 133131] readval() failure
« Reply #5: August 08, 2020, 07:47:05 PM »
Sat Aug 08 19:47:14 2020 PMPERRY@cpan.org - Correspondence added

When I uncompressed the content streams, using PDFtk, it also worked with the 'open' call. I see that PDFtk gave me a proper PDF-1.4 cross-reference table structure when it created the new PDF, and this may have happened to you with qpdf.

Re: [RT 133131] readval() failure
« Reply #6: August 09, 2020, 10:12:09 AM »
Sun Aug 09 08:19:09 2020 chrispitude@gmail.com - Correspondence added

Hi Phil,

Thanks for debugging the problem! Now I know the issue is with the PDF itself, not the parsing code. I'll take this up with the software vendor. And I'm going to have a closer look at PDF::Builder today - thanks for your efforts on this too!

Admins, feel free to close this ticket. (I don't have permissions to do so.)

Sun Aug 09 10:08:17 2020 PMPERRY@cpan.org - Correspondence added

Chris,

I can't guarantee that is the problem, but using a cross-reference stream in a PDF-1.4 document looks very suspicious to me. Certainly you should take it up with the vendor and see if they have an explanation for that (and why they feel it's OK to do). Please get back to us with whatever you find.

Since you opened this ticket by mail, I'm not sure you can close it yourself. If you can't, only Steve (the owner) can.

If you have any issues with PDF::Builder, please use its GitHub issues area to discuss them. Please don't clutter up PDF::API2's CPAN RT area with other products' issues.

Re: [RT 133131] readval() failure
« Reply #7: August 10, 2020, 09:54:14 PM »
Mon Aug 10 07:01:01 2020 futuramedium@yandex.ru - Correspondence added

Quote
Now I know the issue is with the PDF itself, not the parsing code.

Actually, it is an issue with the parsing code, a bug in PDF::API2. In this line:

https://metacpan.org/release/PDF-API2/source/lib/PDF/API2/Basic/PDF/File.pm#L1146

the template should be 'Q>'. Limiting the possible widths to 1, 2, 3, 4, or 8 bytes in the enclosing subroutine is arbitrary, but (1) at least there's provision to die noisily, and (2) the likelihood of needing any width above 4 is extremely low. There's probably not much need to rewrite the sub beyond inserting the '>', but the PDF::Tiny and CAM::PDF sources are there for inspiration if you decide otherwise. (The guys who use 8 bytes to encode offsets in their PDF lib are lazy indeed.)

The version in the header is overridden by the document catalog entry, so it doesn't matter.
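
To see why a missing '>' can produce offsets like the ones in the error messages, here is a small sketch. It assumes (this is a guess, not something confirmed in the ticket) that the real offset in bad2.pdf was 5721, stored as an 8-byte big-endian field; reading those same bytes little-endian gives exactly the number reported for bad2.pdf. It needs a Perl built with 64-bit integers.

```perl
use strict;
use warnings;

# Assumption for illustration: the cross-reference stream stores the
# offset 5721 in an 8-byte field, high-order byte first.
my $field = pack('Q>', 5721);

# A bare 'Q' uses the machine's native byte order. 'Q<' forces
# little-endian, which is what a bare 'Q' amounts to on an Intel CPU.
my $wrong = unpack('Q<', $field);
my $right = unpack('Q>', $field);

print "little-endian read: $wrong\n";   # 6419318318863220736
print "big-endian read:    $right\n";   # 5721
```

The same arithmetic applied to the first report (2316292767724077056) gives 2106656, which would be a plausible offset in a file of a couple of megabytes.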

Mon Aug 10 19:42:46 2020 chrispitude@gmail.com - Correspondence added

futuramedium, THANK YOU!!

Out of 186 PDFs that exhibited the problem, your suggested code change fixed all of them. And the same code change worked equally well for both PDF::API2 and PDF::Builder.

We just installed the latest release of our publishing software, and it uses Apache FOP for publishing, so it's possible that this problematic output construct might occur outside my organization too.

Phil, do you want me to file a PDF::Builder issue for this code change?

And it looks like this ticket should indeed stay open for the change to be made in PDF::API2!

Thanks again to everyone who jumped in and quickly bashed this one out.

Mon Aug 10 21:48:54 2020 PMPERRY@cpan.org - Correspondence added

No need to open a PDF::Builder bug ticket... I have this one on file.

I will consider putting the patch in once I have a chance to carefully examine it and determine if it's really a useful fix, or is just papering over a PDF bug. I'm still very concerned over apparently putting a cross-reference stream (PDF-1.5) into a PDF-1.4 document. It would be nice to hear your vendor's explanation of why they did it that way. I'm reluctant to allow a PDF-1.5 feature in reading in a PDF-1.4 document. If I can detect that it's a cross-reference stream, I might be able to bump up the version to 1.5 on the fly, but I have to carefully look at it first.

PDF::API2 (and PDF::Builder) had some code added recently to handle cross-reference streams without blowing up, but I want to make sure I understand the full picture before I start slapping in ad-hoc fixes.

Re: [RT 133131] readval() failure
« Reply #8: November 09, 2020, 10:04:37 PM »
Mon Nov 09 21:29:41 2020 PMPERRY@cpan.org - Correspondence added

Christopher, did you ever get a chance to ask your vendor why they have a cross-reference stream in a file claimed to be level 1.4? With Vadim's patch, code that Steve put in earlier to handle this PDF 1.5 feature may be getting it through OK, but I'm still worried about what else may be waiting to go wrong. I still say that the PDF is wrong.

Vadim, if I read the cross-reference stream documentation correctly, it allows widths of 1, 2, 3, 4, and 8. 3 is actually treated as 4, with a 00 byte shoved in front. It seems to say that these fields should be Big-Endian (MSB). Are we in agreement? Then why (without the > flag) does it properly handle 2, 3, and 4 byte lengths, but treat 8 byte widths as Little-Endian? In other words, why don't 2, 3, and 4 widths require '>' too? Did the PDF writer create the value Little-Endian, and the '>' turns it around (flips and then reads it Big-Endian), and if so, why don't 2, 3, and 4 need this treatment? I'm just uncomfortable with putting this patch in until I understand why '8' is so different -- or was the PDF created incorrectly, with 64-bit integers flipped around?

I have been unable to read the stream directly, as it's flate compressed, and when uncompressing it, PDFtk changes it to a cross-reference TABLE even if I first change the PDF version to 1.5. So, I can't see how the original value was stored (as Big-Endian or Little-Endian).

Mon Nov 09 22:00:32 2020 PMPERRY@cpan.org - Correspondence added

After I sent off the last post, I figured out how to look at the cross-reference stream data (dumping it in File.pm). Only widths 2 and 8 are used, and in both, the data appears to be Big-Endian (MSB). So this aspect of the PDF, anyway, appears to have been written correctly. 'n' and 'N' codes don't allow '>', so that's a moot point. So the question remains, why does 'Q' require an explicit '>' to be read correctly, and will this change on machines which are natively Big-Endian (non-Intel chips)? Is it that 'Q' allows either way, and if you're possibly transferring data across chip types, you'd better specify explicitly that the PDF data you're unpacking with 'Q' is Big-Endian? Also, if writing (pack) with 'Q', will it write Little-Endian on an Intel chip?
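
A quick way to answer the pack question on any given machine is to dump the bytes each template writes. The sketch below uses only core pack/unpack and the Config module; the sample values are arbitrary, and the 'Q' lines need a Perl built with 64-bit integers.

```perl
use strict;
use warnings;
use Config;

# 'n' and 'N' are defined as big-endian ("network order") everywhere.
my $n = unpack('H*', pack('n', 0x1234));       # 1234
my $N = unpack('H*', pack('N', 0x12345678));   # 12345678

# A bare 'Q' follows the native byte order; '>' and '<' pin it down.
my $q_native = unpack('H*', pack('Q',  5721)); # machine dependent
my $q_big    = unpack('H*', pack('Q>', 5721)); # 0000000000001659
my $q_little = unpack('H*', pack('Q<', 5721)); # 5916000000000000

print "n  : $n\nN  : $N\n";
print "Q  : $q_native\nQ> : $q_big\nQ< : $q_little\n";

# On little-endian builds $Config{byteorder} starts with "1234",
# on big-endian builds with the digits reversed.
print "native byteorder: $Config{byteorder}\n";
```

On an Intel machine the 'Q' line should match the 'Q<' line, so a bare 'Q' does write little-endian there; on a natively big-endian machine it should match 'Q>' instead.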

Re: [RT 133131] readval() failure
« Reply #9: November 10, 2020, 01:05:19 PM »
The 'Q>' fix seems to work without breaking anything else, so I went ahead and put it in. I also added a warning to be output if a cross-reference stream was encountered for a PDF declared to be 1.4 or lower (and the output version is bumped up to 1.5). Closing.

Re: [RT 133131] readval() failure
« Reply #10: November 15, 2020, 07:34:21 PM »
Sat Nov 14 17:19:17 2020 futuramedium@yandex.ru - Correspondence added

Hi,

Quote
I read the cross-reference stream documentation correctly, it allows widths of 1, 2, 3, 4, and 8

No, width can be any, and is not limited, btw

Quote
So the question remains, why does 'Q' require an explicit '>' to be read correctly, and will this change on machines which are natively Big-Endian (non-Intel chips)?

It won't. To quote the Reference: "Fields [in a cross-reference stream] requiring more than one byte are stored with the high-order byte first."

Quote
file claimed to be level 1.4

It didn't. The Version ("Optional; PDF 1.4") entry in catalog dictionary takes precedence "if later than the version specified in the file’s header". So, 1.4-compliant consumer must consult this entry first, before making final decision about version. It follows, that 1.5-compliant consumer (which PDF::API2 is) must try to read cross-reference stream if required, to check that "Version" entry, even if header says "1.4". It's a bit of a conundrum, I agree, but it's how things are.

Re: [RT 133131] readval() failure
« Reply #11: November 15, 2020, 08:09:38 PM »
Sun Nov 15 20:00:43 2020 PMPERRY@cpan.org - Correspondence added

On Sat Nov 14 17:19:17 2020, vadimr wrote:
Quote
Hi,
 
Quote
I read the cross-reference stream documentation correctly, it allows widths of 1, 2, 3, 4, and 8

No, width can be any, and is not limited, btw

I'm not following you. The width field is one of those widths (integer byte size), isn't it? The resulting width integer can of course be any legitimate positive integer that a field of that size can hold.
 
Quote
Quote
So the question remains, why does 'Q' require an explicit '>' to be read correctly, and will this change on machines which are natively Big-Endian (non-Intel chips)?

It won't. To quote the Reference: "Fields [in a cross-reference stream] requiring more than one byte are stored with the high-order byte first."

Let me clarify what I was asking. I can see that the data *was* high-order byte first ("network order"/Big Endian) in the file, which is correct. What I was asking was what happens for just 'Q' (as the original code was), as opposed to 'Q>'. If using just 'Q', it appears that my Intel CPU reads (and writes) in low-order byte (Little Endian). Why is 'Q' treated differently, and I need to give the byte order explicitly? This isn't a PDF::API2/Builder issue; it's a Perl issue.

This also brings up the problem that if your Perl isn't compiled for 64 bit integers, supposedly it's going to blow up on a 'Q' (or 'Q>') unpack. If this particular PDF was being processed on a 32 bit Perl, it's likely to fall over dead. I wonder if we should use the Config package to query if Perl supports 64 bit integers, and if not, see if we can treat it as a 32 bit (unsigned) integer by just unpacking the bottom 32 bits (first checking that the first 33 bits, including the bottom's sign, are 0)? Or, is 32 bit Perl so rare these days that we shouldn't bother? This appears to be the only place that 64 bits are baked into the library.
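
The Config-based fallback floated above might look something like this. It is a sketch only: unpack_be64 is a hypothetical helper, not code from either library, and it uses $Config{ivsize} (the size in bytes of Perl's integer type) as the test for 64-bit support.

```perl
use strict;
use warnings;
use Config;

# Hypothetical helper, not part of PDF::API2 or PDF::Builder.
# Decodes an 8-byte big-endian field, even on a Perl without 64-bit
# integers, as long as the value fits in 32 bits there.
sub unpack_be64 {
    my ($bytes) = @_;
    die "need exactly 8 bytes\n" unless length($bytes) == 8;

    # 64-bit Perl: 'Q>' is available and reads the field directly.
    return unpack('Q>', $bytes) if $Config{ivsize} >= 8;

    # 32-bit Perl: read two unsigned 32-bit halves and insist that
    # the upper half is zero.
    my ($high, $low) = unpack('NN', $bytes);
    die "offset does not fit in 32 bits\n" if $high != 0;
    return $low;
}

my $offset = unpack_be64("\x00\x00\x00\x00\x00\x00\x16\x59");
print "$offset\n";   # 5721
```

Because 'N' unpacks as unsigned, only the upper four bytes need to be checked for zero in the 32-bit branch.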

Quote
Quote
file claimed to be level 1.4

It didn't. The Version ("Optional; PDF 1.4") entry in catalog dictionary takes precedence "if later than the version specified in the file’s header". So, 1.4-compliant consumer must consult this entry first, before making final decision about version. It follows, that 1.5-compliant consumer (which PDF::API2 is) must try to read cross-reference stream if required, to check that "Version" entry, even if  header says "1.4". It's a bit of a conundrum, I agree, but it's how things are.

Ah, poking through the PDF I can see a Catalog entry for /Version /1.5. Still, that's pretty sloppy to declare 1.4 in the header and then 1.5 deep down inside. Anyway, it looks like I'm going to have to figure out how to read this Catalog(s) for an overriding Version entry, before doing any checking for version-dependent features.
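
The precedence rule quoted above (the catalog's Version wins only "if later than the version specified in the file's header") is simple to express once both values are in hand. A minimal sketch, with effective_version as a hypothetical helper and the versions passed as plain "major.minor" strings:

```perl
use strict;
use warnings;

# Hypothetical helper, not part of PDF::API2 or PDF::Builder.
# $header  : version from the %PDF-x.y header line, e.g. "1.4"
# $catalog : value of the catalog's /Version entry, or undef if absent
sub effective_version {
    my ($header, $catalog) = @_;
    return $header unless defined $catalog;

    my ($h_major, $h_minor) = split /\./, $header;
    my ($c_major, $c_minor) = split /\./, $catalog;

    # The catalog entry only counts when it is later than the header.
    my $later = $c_major > $h_major
             || ($c_major == $h_major && $c_minor > $h_minor);
    return $later ? $catalog : $header;
}

print effective_version('1.4', '1.5'), "\n";   # 1.5
print effective_version('1.6', '1.5'), "\n";   # 1.6
print effective_version('1.4', undef), "\n";   # 1.4
```

Comparing the two parts as integers avoids treating a version such as "1.10" as a decimal fraction.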

Re: [RT 133131] readval() failure
« Reply #12: November 16, 2020, 07:05:31 PM »
Mon Nov 16 09:33:23 2020 chrispitude [...] gmail.com - Correspondence added

On Mon Nov 09 21:29:41 2020, PMPERRY wrote:
Quote
Christopher, did you ever get a chance to ask your vendor why they have a cross-reference stream in a file claimed to be level 1.4? With Vadim's patch, code that Steve put in earlier to handle this PDF 1.5 feature may be getting it through OK, but I'm still worried about what else may be waiting to go wrong. I still say that the PDF is wrong.

Hi Phil,

The product is PDF Chemistry, a DITA publishing tool from Syncrosoft. PDF Chemistry uses Apache FOP internally for PDF creation, but I did not learn any specifics beyond that.

On Sat Nov 14 17:19:17 2020, vadimr wrote:
Quote
It didn't. The Version ("Optional; PDF 1.4") entry in catalog dictionary takes precedence "if later than the version specified in the file’s header". So, 1.4-compliant consumer must consult this entry first, before making final decision about version. It follows, that 1.5-compliant consumer (which PDF::API2 is) must try to read cross-reference stream if required, to check that "Version" entry, even if header says "1.4". It's a bit of a conundrum, I agree, but it's how things are.

Hi Vadim,

Possibly related, possibly not... Any utility that uses a Poppler/Cairo version from around 2009 fails with the following error:

Quote
Error: PDF file is damaged - attempting to reconstruct xref table...
Error: Couldn't find trailer dictionary
Error: Couldn't read xref table

However, later versions read the PDF successfully. I wonder if they fixed a similar bug to what you describe?

Re: [RT 133131] readval() failure
« Reply #13: November 20, 2020, 01:03:18 PM »
Fri Nov 20 13:01:21 2020 PMPERRY@cpan.org - Correspondence added

A little bit of a side trip -- I have added code to PDF::Builder to validate the structure of the PDF being read, and in the process, pick up any Version override (so, for example, bad.pdf is recognized as PDF-1.5 and I don't unnecessarily flag the cross-reference stream as non-1.4).

While I'm parsing the PDF, I'm looking at, among other things, Parent entries. I notice that in bad.pdf, object 24 lists objects 19-23 as its children (/Kids), but none of them list a /Parent (presumably back to 24). Is a Parent entry mandatory for a Kid (and possibly some other parent-child relationships), or is it optional? Does it always have to point back to the object who claims this object as its Kid, or is it legal to point somewhere else? I'm thinking of ticket 130722's afhacked2.pdf's object 4 declaring object 6 to be its child (/Kid) but so does object 9, and 6 points back to 9 as its Parent. That sounds fishy to me. The PDF 1.7 Reference sometimes calls a Parent mandatory, but then often omits a Parent from the example.
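
For trees where a Parent entry is required, the back-pointer check described above can be sketched on a toy data structure. Everything here is an assumption for illustration: the hash layout, the orphaned_kids helper, and the sample objects, which simply mirror the afhacked2.pdf situation as described (objects 4 and 9 both list 6, and 6 points back to 9).

```perl
use strict;
use warnings;

# Hypothetical in-memory model: object number => { Kids => [...], Parent => n }.
# Returns one message per kid whose Parent is missing or points elsewhere.
sub orphaned_kids {
    my ($objects) = @_;
    my @problems;
    for my $num (sort { $a <=> $b } keys %$objects) {
        for my $kid (@{ $objects->{$num}{Kids} || [] }) {
            my $parent = exists $objects->{$kid} ? $objects->{$kid}{Parent} : undef;
            if (!defined $parent) {
                push @problems, "$kid (listed by $num) has no Parent";
            }
            elsif ($parent != $num) {
                push @problems, "$kid (listed by $num) points to $parent";
            }
        }
    }
    return @problems;
}

my %tree = (
    4 => { Kids => [6] },
    9 => { Kids => [6] },
    6 => { Parent => 9 },
);
my @found = orphaned_kids(\%tree);
print "$_\n" for @found;   # 6 (listed by 4) points to 9
```

Whether a missing Parent is actually an error depends on which kind of tree the objects belong to, so a real validator would apply this only where the Reference requires the entry.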

Re: [RT 133131] readval() failure
« Reply #14: November 22, 2020, 03:03:10 PM »
Sun Nov 22 07:36:50 2020 futuramedium@yandex.ru - Correspondence added

Christopher, more likely older versions of Poppler/Cairo didn't support 1.5 features.

Phil, "any" is literally ANY, width can be 1-2-3-4-8, but also 5-6-7-9-10-...1000-..., etc. to PDF architectural integer limit (2**31 - 1). What's an integer (byte offset of an object, in particular) so many bytes long -- it's beyond comprehension and hardware capabilities and practical requirements. See the H.21 end-note in "PDF Reference, sixth edition", which explicitly states that the Reference, itself, does not impose ANY limit on offset byte width. I don't know what you mean by "reading documentation correctly" and finding there allowed widths of 1-2-3-4-8.

So, a Reader, theoretically and nominally, must cope with any width, but practically -- see what I said in August about a fix, just a character insertion.

However, you raised a valid concern about 32-bit Perls compiled without "USE_64_BIT_INT", regardless of them being worth any effort. Then, again, I'm repeating myself, have a look at sister packages, how they handle the issues -- quite differently from each other (and PDF::API2), but BOTH can cope with ANY widths of arbitrary size, not just 1-2-3-4-8, and regardless of Perl being 32bit/64bit (of course, as long as integer to be decoded fits 32/64 bits, as applicable, -- i.e. byte string may have leading zeroes).
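
One generic way to cope with any field width, independent of pack templates, is to accumulate the bytes one at a time. The sketch below is not taken from PDF::Tiny or CAM::PDF; decode_be_field, the /W array, and the sample row are all made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: decode a big-endian unsigned field of any width.
# Leading zero bytes are harmless; values too large for a native
# integer silently become floating point and lose precision.
sub decode_be_field {
    my ($bytes) = @_;
    my $value = 0;
    $value = $value * 256 + $_ for unpack('C*', $bytes);
    return $value;
}

# Made-up cross-reference stream row: widths 1, 3, 1 give
# type 1, offset 5721, generation 0.
my @w   = (1, 3, 1);
my $row = "\x01\x00\x16\x59\x00";

my @fields;
my $pos = 0;
for my $width (@w) {
    push @fields, decode_be_field(substr($row, $pos, $width));
    $pos += $width;
}
print "@fields\n";   # 1 5721 0
```

The same helper decodes an 8-byte field holding a small offset on a 32-bit Perl, since the leading zero bytes contribute nothing to the result.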

What follows is rather off-topic.

Quote
what happens for just 'Q' (as the original code was)

The original code was tested, if ever, on a big-endian CPU.

Quote
Why is 'Q' treated differently, and I need to give the byte order explicitly?

?? Because it's documented so. By design. How is that "Perl issue"? It's matter of POV -- the N/V (n/v) pairs are peculiar exception, all other relevant templates require explicit byte order modifier to work in portable manner.

Quote
Is a Parent entry mandatory for a Kid

The Reference has comprehensive Index, I don't think there are any ambiguities where and which entries are required. There are trees of slightly different breeds. E.g., items of Pages Tree, Name Tree (your example), Outlines tree(-like structure) require (1) both Kids/Parent, (2) Kids only, (3) Parent only entries, respectively. Evolving standard (as PDF was) can finish eclectic, which is OK as long as everything is clearly documented.

The "afhacked2.pdf", IIRC, was shown to be horribly broken EXACTLY w.r.t. parental relationship in a tree, so why would you pick it as an example to investigate?