
[RT 133131] readval() failure

Re: [RT 133131] readval() failure
« Reply #15: November 22, 2020, 03:16:34 PM »
Sun Nov 22 15:00:02 2020 PMPERRY@cpan.org - Correspondence added

On Sun Nov 22 07:36:50 2020, vadimr wrote:

Quote
Phil, "any" is literally ANY: width can be 1-2-3-4-8, but also 5-6-7-9-10-...1000-..., etc., up to the PDF architectural integer limit (2**32 - 1). What an integer so many bytes long would even mean (a byte offset of an object, in particular) is beyond comprehension, hardware capabilities, and practical requirements. See the H.21 end-note in "PDF Reference, sixth edition", which explicitly states that the Reference itself does not impose ANY limit on offset byte width. I don't know what you mean by "reading documentation correctly" and finding there allowed widths of 1-2-3-4-8.

I was probably thrown by the PDF::API2 implementation, which only allows 1, 2, 3, 4, and 8 byte widths. You're saying that any width up to some enormous number of bytes is theoretically possible? As very little hardware out there probably handles >64 bit integers, a maximum width of 8 is probably adequate.

Quote
So, a Reader, theoretically and nominally, must cope with any width, but practically -- see what I said in August about a fix, just a character insertion.

So, if some joker decides to provide a PDF with a cross-reference stream field width of 5 bytes (40 bit integer), PDF::API2 (and until I extend it, PDF::Builder) will choke on it? Even though it's legitimate? It shouldn't be too much trouble to handle 5, 6, and 7 byte width fields by left-padding with x00 bytes (in the manner of width 3). >8 bytes is probably unreasonable for the next few years, until hardware (and Perl) catches up (i.e., a generation of 128 bit chips).

Quote
However, you raised a valid concern about 32-bit Perls compiled without "USE_64_BIT_INT", regardless of them being worth any effort. Then, again, I'm repeating myself, have a look at sister packages, how they handle the issues -- quite differently from each other (and PDF::API2), but BOTH can cope with ANY widths of arbitrary size, not just 1-2-3-4-8, and regardless of Perl being 32bit/64bit (of course, as long as integer to be decoded fits 32/64 bits, as applicable, -- i.e. byte string may have leading zeroes).

As I've said before, I can check whether the 32 leading bits of a 64 bit integer field (after x00 padding of 5, 6, or 7 byte fields) are 0, and if so just decode the low 32 bits with the 'N' format. If it's not an unsigned 32 bit value, we'll just have to throw it to unpack('Q>') and hope that it's a 64 bit Perl. I understand that it will produce a smoking hole in the ground if it's a 32 bit Perl. I suppose I could use some sort of "extended math" package to handle the field value as two 32 bit ints or four 16 bit ints, but I'm not sure it's worth the effort. Do you know how (in general terms) these "sister packages" handle 64 bit integers -- perhaps some sort of extended math?
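
A minimal sketch of the pad-then-check approach described above. The helper name decode_field is hypothetical, and this is not the actual PDF::API2 or PDF::Builder code; it assumes a field of 1 to 8 bytes in big-endian order.

```perl
use strict;
use warnings;

# Hypothetical helper: decode one big-endian cross-reference stream
# field that is 1 to 8 bytes wide.
sub decode_field {
    my ($bytes) = @_;
    my $width = length $bytes;
    die "unsupported field width $width\n" if $width < 1 || $width > 8;

    # Left-pad with NUL bytes to a full 8-byte (64-bit) big-endian value.
    my $padded = ("\x00" x (8 - $width)) . $bytes;

    # If the top 32 bits are all zero, the value fits an unsigned 32-bit
    # integer, and 'N' decodes it on any Perl build.
    if (substr($padded, 0, 4) eq "\x00\x00\x00\x00") {
        return unpack 'N', substr($padded, 4, 4);
    }

    # Otherwise 64-bit support is needed; 'Q' raises an exception on a
    # Perl built without 64-bit integers.
    return unpack 'Q>', $padded;
}

print decode_field("\x01\x00\x00"), "\n";          # 3-byte field: 65536
print decode_field("\x00\x00\x00\x00\x2a"), "\n";  # 5-byte field: 42
```

Only values above 2**32 - 1 ever reach the 'Q>' branch, so a 32-bit Perl fails only on files that actually need the extra range.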

Quote
Quote
what happens for just 'Q' (as the original code was)

The original code was tested, if ever, using a big-endian CPU

Very likely. A (forgivable) testing flaw in PDF::API2 (but how many people have both Big-Endian and Little-Endian machines available?).

Quote
Quote
Why is 'Q' treated differently, and I need to give the byte order explicitly?

?? Because it's documented so. By design. How is that "Perl issue"?

I just found it odd that unpack's Q is treated differently than N/V/n/v. Yes, that's a Perl issue that it was implemented that way (dependent upon the machine architecture unless explicitly overridden), and not a PDF issue (everything is Network (Big-Endian) order). Why didn't Perl give Q=unsigned Big-Endian 64 bit, q=Little-Endian, R=signed Big-Endian, and r=signed Little-Endian (or something similar)? q is signed, but Endian-ness still has to be explicitly given.
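
The difference can be seen in a small sketch. It assumes Perl 5.10 or later (for the < and > modifiers) built with 64-bit integer support; the variable names are illustrative only.

```perl
use strict;
use warnings;

my $bytes = "\x00\x00\x00\x00\x00\x00\x01\x02";   # 8 bytes, network order

# 'N' and 'n' are always big-endian; 'V' and 'v' are always little-endian.
my $n = unpack 'N', substr($bytes, 4);

# 'Q' alone uses the native byte order of the machine running Perl, so
# its result differs between big- and little-endian hosts. An explicit
# modifier makes the template portable.
my $be = unpack 'Q>', $bytes;                  # big-endian, as PDF requires
my $le = unpack 'Q<', scalar reverse $bytes;   # same value, bytes reversed

print "$n $be $le\n";                          # 258 258 258
```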

Quote
It's a matter of POV -- the N/V (n/v) pairs are a peculiar exception; all other relevant templates require an explicit byte-order modifier to work in a portable manner.

Quote
Is a Parent entry mandatory for a Kid

The Reference has a comprehensive Index; I don't think there are any ambiguities about where and which entries are required. There are trees of slightly different breeds. E.g., items of the Pages Tree, Name Tree (your example), and Outlines tree(-like structure) require (1) both Kids/Parent, (2) Kids only, (3) Parent only entries, respectively. An evolving standard (as PDF was) can end up eclectic, which is OK as long as everything is clearly documented.

I've been using the ISO/Adobe final reference for PDF 1.7, 32000_2008.pdf. It has no index. Can you recommend a better PDF reference?

Quote
The "afhacked2.pdf", IIRC, was shown to be horribly broken EXACTLY w.r.t. parental relationship in a tree, why would you pick it up as example to investigate.

I wanted an example of what appeared to be real-life, in-the-wild scrambled Parent/Kid relationships, for testing my new validation code. It does flag the file as an error, but does not attempt to correct or fix up anything. The validation code is meant to flag suspicious PDFs so that we don't waste so much time trying to fix PDF::API2/Builder "bugs" which are actually bad PDFs in the first place.
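
A rough sketch of what such a report-only check could look like for a Pages tree, where every kid is expected to name its parent. The function check_page_tree and the hash layout (object number => node) are hypothetical, not the PDF::Builder validation code.

```perl
use strict;
use warnings;

# Hypothetical in-memory model: object number => { Kids => [...], Parent => n }.
# Returns a list of problem descriptions; it reports only and repairs nothing.
sub check_page_tree {
    my ($nodes, $root) = @_;
    my (@problems, %seen);
    my @todo = ($root);
    while (defined(my $num = shift @todo)) {
        if ($seen{$num}++) {
            push @problems, "object $num reached more than once";
            next;
        }
        my $node = $nodes->{$num} or next;
        for my $kid (@{ $node->{Kids} || [] }) {
            my $child = $nodes->{$kid};
            if (!$child) {
                push @problems, "object $num lists missing kid $kid";
                next;
            }
            if (!defined $child->{Parent} || $child->{Parent} != $num) {
                push @problems, "kid $kid does not name $num as its Parent";
            }
            push @todo, $kid;
        }
    }
    return @problems;
}

my %tree = (
    1 => { Kids => [2, 3] },
    2 => { Parent => 1 },
    3 => { Parent => 2 },    # scrambled: should name object 1
);
print "$_\n" for check_page_tree(\%tree, 1);
```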

Speaking of which, I have several hundred PDFs that I've accumulated over the years, and which I tested against. Many PDFs refer to objects (e.g., /Font 9 0 R) but there is no such object (9 0 obj) in the file. Many, if not all, of those cases appear to have the missing object on the Free List. I take it that it's OK to refer to an object that's on the Free List, and it will just be ignored?
« Last Edit: November 22, 2020, 03:27:39 PM by Phil »

Re: [RT 133131] readval() failure
« Reply #16: November 24, 2020, 08:39:34 AM »
Tue Nov 24 06:41:56 2020 futuramedium@yandex.ru - Correspondence added

Quote
"sister packages"

In general terms:
Code:
$_ = 'bytes';
$i = hex unpack 'H'.(2*length), $_; # PDF::Tiny
@b = unpack 'C*', $_; $i = 0; ($i <<= 8) += shift @b while @b; # CAM::PDF

Quote
...index. Can you recommend a better PDF reference?

https://en.wikipedia.org/wiki/PDF leads to https://www.adobe.com/content/dam/acom/en/devnet/acrobat/pdfs/pdf_reference_1-7.pdf

Quote
I take it that it's OK to refer to an object that's on the Free List, and it will just be ignored?

See page 64, link above. If an object definition is missing (I think it doesn't matter whether the number is in the Free list), then the reference refers to the null object. If a null object is allowed in that particular place, the missing definition appears to be "ignored". E.g., if the /Font entry in a graphics state dict is null, then no font is set by the gs operator invocation; it just waits until a Tf operator. Otherwise, I guess it depends on severity: either "ignored" so as not to disturb the user, or reported as an error.
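
As a sketch of that rule, a resolver can return Perl's undef in place of the PDF null object when a definition is missing, and record a note rather than fail. The %objects table, @notes list, and resolve function are hypothetical names chosen for illustration.

```perl
use strict;
use warnings;

# Hypothetical object table: "objnum gen" => parsed object.
my %objects = ('4 0' => { Type => 'Page' });
my @notes;

# Resolve an indirect reference. A reference to an object with no
# definition resolves to null (undef here) instead of raising an error.
sub resolve {
    my ($num, $gen) = @_;
    my $key = "$num $gen";
    return $objects{$key} if exists $objects{$key};
    push @notes, "note: $key R has no definition; treated as null";
    return undef;
}

my $font = resolve(9, 0);    # no "9 0 obj" in the table
print defined $font ? "found\n" : "null\n";
print "$_\n" for @notes;
```

Whether a null is acceptable at the place where the reference appears is a separate question that the caller has to decide.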

Re: [RT 133131] readval() failure
« Reply #17: November 24, 2020, 08:45:24 AM »
Tue Nov 24 08:35:26 2020 PMPERRY@cpan.org - Correspondence added

On Tue Nov 24 06:41:56 2020, vadimr wrote:
Quote
Quote
"sister packages"

In general terms:
Code:
$_ = 'bytes';
$i = hex unpack 'H'.(2*length), $_; # PDF::Tiny
@b = unpack 'C*', $_; $i = 0; ($i <<= 8) += shift @b while @b; # CAM::PDF
It looks like they're just doing the same thing as unpack('Q>'), but at a more primitive level. This still doesn't address the problem of what to do if the result doesn't fit in a 32-bit (4-byte) integer (i.e., it overflows -- does Perl switch to double without losing precision?). BTW, I went ahead and 1) handled 5, 6, and 7 byte integers by left-padding with x00, and 2) added a check for whether the top 32 bits are x00; if so, I use unpack('N') on the lower half, only calling unpack('Q>') as a last resort (likely to blow up if not Perl-64). I think it will be very rare (for a few years still) to encounter a field that's actually more than 4 billion in value. If 32-bit Perl just handles that as a double, I'll have to revisit this conversion.
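
On the overflow question, two alternatives can be sketched. Both helper names are hypothetical. The first relies on Perl promoting an arithmetic result to a floating-point value when it no longer fits an integer, which stays exact only up to 2**53 when that value is an ordinary double. The second uses the core Math::BigInt module and is exact for any width.

```perl
use strict;
use warnings;
use Math::BigInt;

# Plain arithmetic, any field width. Multiplication (unlike a shift) lets
# Perl fall back to a floating-point value once the integer range of the
# build is exceeded.
sub decode_arith {
    my ($bytes) = @_;
    my $value = 0;
    $value = $value * 256 + $_ for unpack 'C*', $bytes;
    return $value;
}

# Arbitrary precision: exact for any width of 1 byte or more, at some
# cost in speed.
sub decode_big {
    my ($bytes) = @_;
    return Math::BigInt->new('0x' . unpack('H*', $bytes));
}

print decode_arith("\x01\x00\x00\x00\x00"), "\n";   # 4294967296
print decode_big("\x01" . "\x00" x 16), "\n";       # 2**128, exact
```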

Quote
Quote
...index. Can you recommend a better PDF reference?

https://en.wikipedia.org/wiki/PDF leads to https://www.adobe.com/content/dam/acom/en/devnet/acrobat/pdfs/pdf_reference_1-7.pdf
Thanks, that looks useful (at least, it has an index!).

Quote
Quote
I take it that it's OK to refer to an object that's on the Free List, and it will just be ignored?

See page 64, link above. If an object definition is missing (I think it doesn't matter whether the number is in the Free list), then the reference refers to the null object. If a null object is allowed in that particular place, the missing definition appears to be "ignored". E.g., if the /Font entry in a graphics state dict is null, then no font is set by the gs operator invocation; it just waits until a Tf operator. Otherwise, I guess it depends on severity: either "ignored" so as not to disturb the user, or reported as an error.
OK, a reference to an undefined object is usually just ignored, unless it leads to the gears getting jammed. I'll make sure that at worst, it's flagged as a 'note' (informational) message, not an error or warning (and thus normally will not be seen).

Thanks for all the info!

Re: [RT 133131] readval() failure
« Reply #18: November 24, 2020, 12:34:27 PM »
Tue Nov 24 10:04:23 2020 futuramedium [...] yandex.ru - Correspondence added

Quote
This still doesn't address the problem of what to do if the result doesn't fit in a 32-bit (4-byte) integer (i.e., it overflows -- does Perl switch to double without losing precision?)

I think a person who tries to open a 4 GB PDF file with a 32-bit build of Perl is a problem himself. Anyway, PDF::API2 slurps its input and will fail long before dealing with a hypothetical overflow, but the latter is easily investigated with a one-liner.
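
One way to run that investigation, written out as a short script rather than a one-liner. It only uses the core Config module and assumes Perl 5.10 or later for the // operator.

```perl
use strict;
use warnings;
use Config;

# Report how this Perl was built: ivsize is 8 with 64-bit integers,
# 4 without.
printf "ivsize=%d bytes, use64bitint=%s\n",
    $Config{ivsize}, $Config{use64bitint} // 'undef';

# One past the largest unsigned 32-bit value. With 64-bit integers this
# is held as an integer; otherwise Perl falls back to a floating-point
# value, which is still exact at this magnitude.
my $n = 4294967295 + 1;
print "$n\n";    # 4294967296
```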