
RT 117184 - Unable to write an opened PDF containing cross-reference streams

*

Offline Phil

  • Global Moderator
  • Hero Member
  • *****
  • 551
    • View Profile
Thu Apr 25 07:02:10 2019 futuramedium [...] yandex.ru - Correspondence added

Quote
use warnings;
But I always do, I swear! It's File.pm that doesn't! And a lexically scoped "use warnings;" in my .pl can't help. OTOH, the "-w" switch on the command (shebang) line sets the global $^W. A bit further off topic: CAM::PDF does "use warnings;", and when I modified it some years ago, in a quite similar way, for internal use, to write XRef streams, the case with 65535 was caught. "Transferring" that patch to PDF::API2, I just forgot to zero the gennum of the 0th object. Sorry. If only I hadn't forgotten, I wouldn't have had to write all of the following :)
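The scoping point can be reproduced in isolation. This is a minimal sketch, assuming a stand-in package (Legacy::Packer is a hypothetical name, not File.pm) that is compiled without any warnings pragma; it packs 65535 with the 'C' template three times and counts the warnings caught.

```perl
use strict;

# Hypothetical stand-in for a module compiled without "use warnings".
{
    package Legacy::Packer;
    sub pack_gen { my ($gen) = @_; return pack 'C', $gen }
}

use warnings;    # lexical: only affects code compiled from here down

my @caught;
local $SIG{__WARN__} = sub { push @caught, $_[0] };

my $gen = 65535;

# The caller's "use warnings" does not reach into Legacy::Packer.
# ($^W is forced off here so the result does not depend on a -w switch.)
my $byte = do { local $^W = 0; Legacy::Packer::pack_gen($gen) };
my $after_legacy = @caught;

# The same pack in this warnings-enabled scope does warn.
my $same = pack 'C', $gen;
my $after_lexical = @caught;

# The global $^W (what -w sets) reaches code with no lexical warnings setting.
{ local $^W = 1; Legacy::Packer::pack_gen($gen) }
my $after_global = @caught;

printf "byte=%d, warnings so far: %d, %d, %d\n",
    ord($byte), $after_legacy, $after_lexical, $after_global;
```

In all three cases the value silently wraps to a single byte; only the reporting differs.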

Quote
Let me rephrase my question, then -- is the correct result to output the original, unchanged (almost) PDF, and then tack on these new and replacement objects, and maybe a cross reference stream, after the original %%EOF? You seem to have said "yes".
Yes. I'll try to clarify further; sorry if it sounds primitive. For performance reasons, many file formats allow incremental updates, with changes appended to the intact original. (Not "almost", but 100% intact.) In GUIs, it's usually "Save" for an incremental update and "SaveAs" for a clean rewrite, not necessarily to another filename. It's the same difference as between CAM::PDF's methods "output" and "cleanoutput".

With "cleanoutput", all objects are renumbered consecutively, getting a fresh gennum of "0", and "holes" (ranges of free objects) are eliminated. PDF::API2 simply can't do "cleanoutput". If a file is opened and then saved, it's always an incremental update, however confusing the method name "saveas" is. Even if all objects were changed and the original content has become useless, it's stored intact as it was, and the new content is appended.
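The renumbering half of a clean rewrite can be pictured with a toy model. This is only a sketch of the idea described above, not CAM::PDF's code; renumber_clean is a hypothetical name and objects are reduced to [objnum, gennum] pairs.

```perl
use strict;
use warnings;

# Toy model of a clean rewrite: live objects get consecutive numbers and
# generation 0; free slots ("holes") simply disappear.
# Input:  list of [objnum, gennum] pairs for the live objects.
# Output: hash mapping "objnum gennum" to the new [objnum, gennum].
sub renumber_clean {
    my @live = sort { $a->[0] <=> $b->[0] } @_;
    my %map;
    my $next = 1;
    $map{"$_->[0] $_->[1]"} = [ $next++, 0 ] for @live;
    return \%map;
}

my $map = renumber_clean( [ 1, 0 ], [ 4, 2 ], [ 9, 1 ] );
print "$_ -> @{ $map->{$_} }\n" for sort keys %$map;
```

An incremental update does none of this: old numbers and generations stay as they are, and only new or replaced objects are appended.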

My patch in #121832 was an attempt to teach PDF::API2 to "cleanoutput". Now I think it's not worth it: it can only output an old-fashioned 1.4 classical XRef table, it tries complex, possibly fragile and insufficiently tested things (as opposed to the simple patch discussed in this thread), and nobody seems interested.

The new patch serves only one purpose: users can now modify "modern" PDF files without having to downgrade them to 1.4 as a preliminary step, and without worrying why Reader can't read their files and whether PDF::API2 works at all. But it's the same old incremental update.

Quote
Is that the only place that a value greater than xFF is ever going to show up?
OK, consider this. For a gennum to become > 255, a PDF file has to be updated at least 510 times. This minimum of 510 is possible only if the pattern of updates is strictly that each odd save removes an object (objnum marked free on save) and each even save adds some info (a completely new indirect object; objnum re-used, gennum increased). If the pattern is not strict, the gennum gets to > 255 only after more, possibly many more, updates. Multiply the probability of such a scenario by the chance that the people torturing this file never get worried about file size bloat, so that they never reset the progression by issuing a "SaveAs" command somewhere in the middle.
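The arithmetic can be checked with a small simulation. It is a sketch under a simplified model (an assumption, not taken from any PDF library): a single object slot, freed on odd saves and re-used on even saves, with each re-use carrying a generation number one higher.

```perl
use strict;
use warnings;

# Simplified model (an assumption): one object slot; odd-numbered saves free
# it, even-numbered saves re-use it, and each re-use carries a generation
# number one higher than before.
sub saves_until_gennum {
    my ($target) = @_;
    my ( $gennum, $saves, $in_use ) = ( 0, 0, 1 );
    while ( $gennum < $target ) {
        $saves++;
        if ($in_use) { $in_use = 0 }               # odd save: objnum freed
        else         { $in_use = 1; $gennum++ }    # even save: objnum re-used
    }
    return $saves;
}

printf "gennum 255 after %d saves, 256 after %d saves\n",
    saves_until_gennum(255), saves_until_gennum(256);
```

Under this model the in-use object carries gennum 255 after 510 saves and first exceeds it two saves later. A different accounting (for example, counting the already incremented number stored in the free entry) may shift the figure by a save or so, which does not change the argument.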

Note, all of the above still happens in the 1.4 era, with a classical XRef table. If a file gets to PDF::API2 in this state, with any gennum > 255 -- fine, it's not an issue for the patch discussed.

If this file is updated to 1.5 before getting to PDF::API2 -- fine too: it was a clean "SaveAs", with all gennums reset and never touched again by Acrobat/Reader.

That's why my suggestion was to set the gennum to 0 instead of 255 -- it would be the same "0" as in files saved with Reader. But in the end it's probably not so important.

Quote
Should we consider treating it as 16-bits, if the standard permits
Is there any software that still tracks the free list, re-uses objnums and increases gennums, even in XRef streams, regardless of Adobe's own stance? I don't know! I have never heard of any. I'd solve problems when (and if) they come: if anyone files a bug report and we see that it's because we assumed (wrongly, as it will then appear) that gennums are always zero (well, no more than 255) in XRef streams -- fine, we'll know how to fix it, i.e. use a 'Cnn' ('CNn') template.
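The difference between a one-byte and a two-byte generation field is easy to see with pack and unpack. This is a sketch only: the field widths (1/4/1 for 'CNC', 1/4/2 for 'CNn') and the sample values are assumptions for illustration, not the patch's actual templates.

```perl
use strict;
use warnings;

# One "in use" entry of a cross-reference stream: type, byte offset, gennum.
# Field widths 1/4/1 ('CNC') and 1/4/2 ('CNn') are assumed for illustration.
my ( $type, $offset, $gennum ) = ( 1, 123_456, 300 );

my $narrow = do {
    no warnings 'pack';    # 300 does not fit in one byte; silence the wrap warning
    pack 'CNC', $type, $offset, $gennum;
};
my $wide = pack 'CNn', $type, $offset, $gennum;

my @from_narrow = unpack 'CNC', $narrow;
my @from_wide   = unpack 'CNn', $wide;

printf "CNC: %d bytes, gennum read back as %d\n", length($narrow), $from_narrow[2];
printf "CNn: %d bytes, gennum read back as %d\n", length($wide),   $from_wide[2];
```

If the wider template were ever adopted, the field widths declared in the stream dictionary's /W array would have to change to match.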

*

Both set special case '0 65535 f' to '0 0 f', and added warning and reduction of any generation number in excess of 255 (because it is packed with 'C' code). Closing RT 117184 for PDF::Builder, and fix will appear in today's 3.014 release. Again, thank you Vadim for your work on this. Please consider issuing a Pull Request for PDF::API2.
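The two rules can be sketched as one small function. This is not the PDF::Builder code: gennum_for_stream is a hypothetical name, and clamping to 255 (rather than some other reduced value) is an assumption.

```perl
use strict;
use warnings;

# Hypothetical helper: the generation number to store in a one-byte field.
sub gennum_for_stream {
    my ( $objnum, $gennum ) = @_;
    return 0 if $objnum == 0 && $gennum == 65535;    # '0 65535 f' becomes '0 0 f'
    if ( $gennum > 255 ) {
        warn "object $objnum: generation $gennum reduced to 255\n";
        return 255;                                  # assumed clamp value
    }
    return $gennum;
}

my @warnings;
local $SIG{__WARN__} = sub { push @warnings, @_ };

my @packed = map { gennum_for_stream(@$_) } [ 0, 65535 ], [ 7, 3 ], [ 8, 300 ];
print "@packed\n";
```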

*

Sat May 11 17:56:56 2019 david-dot-warring [...] gmail.org - Correspondence added

I've attached a hybrid PDF, since they've been mentioned. These AREN'T breaking under update and don't need special handling. - david
Quote
Phil,

I have a better alternative to the patch (hack) from #121832. To please Acrobat/Reader, an incremental update can append either a classical XRef table or a compressed XRef stream. The new patch seems to be working.

The test PDF file is from this thread.
  • Producing "hybrid files" to ensure "compatibility with older applications" is not implemented (it was not even contemplated; I don't think it's important anymore).
  • No support for files > ~4 GB (with this patch; it would not be difficult in general).
  • Somewhat lousy compression (because there is no prediction) if someone updates an unusually large number of objects (i.e., generally unlikely).
  • Of course, updated objects are not stuffed into streams, and furthermore this patch does nothing to "use modern compression" when a file is clean-output (IIRC, PDF::API2 can't do that anyway).
  • Important -- this patch also applies changes (the 2 topmost changes) as per #121911.
In fact, the fixes are very minimal: existing code is mostly re-used to collect updates made to the XRef table (instead of writing them as they come) and then apply them appropriately in either of the 2 modes.
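For the classical mode, "collect first, write later" might look like the sketch below. It is not the patch itself: classic_xref is a hypothetical name, and the collected updates are modelled as a plain hash of objnum => [offset, gennum] holding in-use entries only.

```perl
use strict;
use warnings;

# Sketch: turn collected updates (objnum => [offset, gennum]) into a classic
# xref section, one subsection per run of consecutive object numbers.
sub classic_xref {
    my (%updates) = @_;
    my @nums = sort { $a <=> $b } keys %updates;
    my $out  = "xref\n";
    while (@nums) {
        my @run = ( shift @nums );
        push @run, shift @nums while @nums && $nums[0] == $run[-1] + 1;
        $out .= sprintf "%d %d\n", $run[0], scalar @run;
        # Each entry is exactly 20 bytes: 10-digit offset, 5-digit gennum, "n".
        $out .= sprintf "%010d %05d n \n", @{ $updates{$_} } for @run;
    }
    return $out;
}

print classic_xref( 3 => [ 17, 0 ], 4 => [ 120, 0 ], 9 => [ 300, 0 ] );
```

The stream mode would take the same collected hash and pack each entry into binary fields instead of formatting text lines.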

 +  One (minor) digression: the documentation could be clearer that after calling "saveas" an instance becomes unusable, to prevent someone from writing scripts such as the one with the commented fragment below.

Code:
use warnings;
use strict;
use feature 'say';
 
use PDF::API2;
 
my $pdf = PDF::API2-> open( "test.pdf" );
$pdf-> page;
$pdf-> page;
$pdf-> page;
 
$pdf-> saveas( "test-mod.pdf" );
 
# After "saveas" the instance is unusable, so this fragment would not work:
# $pdf-> page;
# $pdf-> page;
# $pdf-> saveas( "test-mod++.pdf" );
 
__END__

*

Mon May 13 07:40:17 2019 futuramedium@yandex.ru - Correspondence added

Thanks for testing, David. To handle hybrid files, this patch needs yet another tweak :-(.
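Code:
use strict;
use warnings;
use PDF::API2;

my $pdf = PDF::API2-> open( 'hybrid.pdf' );

#delete $pdf-> { pdf }{ XRefStm };

# First incremental update: rotate page 1.
$pdf-> openpage( 1 )-> rotate( 180 );
$pdf-> saveas( 'hybrid+.pdf' );

# Second incremental update: re-open the result and append a page.
$pdf = PDF::API2-> open( 'hybrid+.pdf' );
$pdf-> page;
$pdf-> saveas( 'hybrid++.pdf' );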

The 1st page reverts to the unrotated state because, according to the Reference, "XRefStm" must be consulted first, before descending the chain of "Prev" entries (alas, Chrome is broken). So this entry should be deleted from the trailers of appended sections, i.e.
Code:
delete $tdict-> { XRefStm };
inserted into the "else" clause of the above patch.
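Why a copied "XRefStm" produces stale results can be shown with a toy lookup. Everything here is a hypothetical model rather than PDF::API2 code: sections are plain hashes, and the lookup order (own entries, then XRefStm, then Prev) is the one described above.

```perl
use strict;
use warnings;

# Toy model (all names hypothetical): each xref section is a hash with its
# own entries, an optional XRefStm pointer and an optional Prev pointer.
sub lookup {
    my ( $sections, $name, $objnum ) = @_;
    while ( defined $name ) {
        my $sec = $sections->{$name};
        return $sec->{entries}{$objnum} if exists $sec->{entries}{$objnum};
        if ( defined( my $stm = $sec->{XRefStm} ) ) {    # checked before Prev
            my $entries = $sections->{$stm}{entries};
            return $entries->{$objnum} if exists $entries->{$objnum};
        }
        $name = $sec->{Prev};
    }
    return;
}

my %sections = (
    stream => { entries => { 3 => 'page, unrotated' } },
    orig   => { entries => { 1 => 'catalog' }, XRefStm => 'stream' },
    save1  => { entries => { 3 => 'page, rotated' }, Prev => 'orig',  XRefStm => 'stream' },
    save2  => { entries => { 9 => 'new page' },      Prev => 'save1', XRefStm => 'stream' },
);

my $stale = lookup( \%sections, 'save2', 3 );    # the copied XRefStm wins

delete $sections{$_}{XRefStm} for qw(save1 save2);
my $fixed = lookup( \%sections, 'save2', 3 );    # now found via Prev in save1

print "copied XRefStm: $stale\nXRefStm deleted: $fixed\n";
```

A reader that skips the XRefStm step would go straight to Prev and find the rotated page, which would explain why a non-conforming viewer appears to "cancel out" the problem.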

Further (NOT related to the patch discussed, but revealed by "hybrid.pdf"): PDF::API2 appends new content quite literally: "%%EOF3 0 obj << /Type /Page ..." etc. Though the offset for object 3 is correct and no applications seem to complain, it's ugly, I doubt it's really valid syntax, and it had better be fixed, i.e., ensure a newline before appending.

These changes aren't urgent; they are documented for the future.

Mon May 13 16:52:14 2019 PMPERRY@cpan.org - Correspondence added

Hi Vadim,

Yeah, I caught the run-on %%EOF problem and fixed it yesterday (in PDF::Builder) by ensuring that an opened PDF ends with an EOL beyond the original final %%EOF (since new code will be appended).
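The check itself is tiny. A minimal sketch on an in-memory string, with with_trailing_eol as a hypothetical name (the real fix works on the opened file, and its details are not shown in this thread):

```perl
use strict;
use warnings;

# Hypothetical helper: make sure appended objects start on their own line.
sub with_trailing_eol {
    my ($bytes) = @_;
    $bytes .= "\n" unless $bytes =~ /[\r\n]\z/;
    return $bytes;
}

my $original = "%PDF-1.5\n% (body omitted)\nstartxref\n1234\n%%EOF";    # no EOL at end
my $updated  = with_trailing_eol($original) . "3 0 obj\n<< /Type /Page >>\nendobj\n";
print $updated;
```

Offsets of appended objects have to be computed after the EOL is added, since it shifts everything that follows by one byte.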

As for the rest of this stuff, I'm a bit confused. Do you anticipate having to patch PDF::API2 & Builder to do something with XRefStm in new trailers? How critical is this -- should I delay 3.015 release until the new patch?

Is this only something that affects the Chrome PDF reader, or does it affect Acrobat Reader (and many other readers) too?

*

Tue May 14 06:08:12 2019 futuramedium@yandex.ru - Correspondence added

Phil, actually it's a new and unrelated issue that can affect hybrid files. Perhaps not very critical, as it was always there. As the example shows, it is easy to modify a file so that XRefStm points to outdated information. This key simply must not be preserved. To fix, a single line of code can be added immediately after the "else" line. I mentioned Chrome just as a fun fact: it does not follow the specification strictly, which appears to "cancel out" the issue and possibly adds to the confusion.

Tue May 14 10:45:59 2019 PMPERRY@cpan.org - Correspondence added

OK, so
Code:
+ else {
+     $fh->print("xref\n", @out, "trailer\n");
+     $tdict->outobjdeep($fh, $self);
+     $fh->print("\n");
+ }
+ $fh->print("startxref\n$tloc\n%%EOF\n");
}

should become
Code:
+ else {
+     delete $tdict->{'XRefStm'};
+     $fh->print("xref\n", @out, "trailer\n");
+     $tdict->outobjdeep($fh, $self);
+     $fh->print("\n");
+ }
+ $fh->print("startxref\n$tloc\n%%EOF\n");
}

? Does this assume there is already an XRefStm entry in the existing PDF (that we want to use)? Should there be a check added that there is, before deleting the new one, or is it safe to assume there always is an existing one?

*

Tue May 14 18:15:28 2019 futuramedium@yandex.ru - Correspondence added

The code change is correct. *Maybe* (I now think) it would be better to move this line into sub's caller, where $tdict is created by copying the existing trailer dictionary, and just get rid of XRefStm unconditionally.

Of course not every PDF contains "XRefStm"! Deleting non-existent hash elements is safe and a no-op. If (from maintenance POV?) you'd prefer "delete something{foo} if exists something{foo}" -- OK, write it so. For me it's more effort to read and tautology, in a sense.
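A quick way to confirm the no-op behaviour, using plain hashes as stand-ins for trailer dictionaries (the keys and values are made up for illustration):

```perl
use strict;
use warnings;

my %with    = ( Size => 10, Prev => 1234, XRefStm => 5678 );
my %without = ( Size => 10, Prev => 1234 );

my $removed = delete $with{XRefStm};       # returns the removed value
my $nothing = delete $without{XRefStm};    # key absent: returns undef, no warning

print join( ',', sort keys %with ),    "\n";
print join( ',', sort keys %without ), "\n";
```

Both hashes end up with the same keys, so the unconditional form and the "if exists" form behave identically here.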

Wed May 15 09:17:26 2019 PMPERRY@cpan.org - Correspondence added

Quote
The code change is correct. *Maybe* (I now think) it would be better to move this line into sub's caller, where $tdict is created by copying the existing trailer dictionary, and just get rid of XRefStm unconditionally.
Code efficiency improvement, or change of behavior?

Quote
Of course not every PDF contains "XRefStm"! Deleting non-existent hash elements is safe and a no-op. If (from maintenance POV?) you'd prefer "delete something{foo} if exists something{foo}" -- OK, write it so. For me it's more effort to read and tautology, in a sense.
My point (which I guess I didn't make clearly enough) was that if the existing PDF we're appending to did NOT already have an XRefStm, what is lost (or gained) by unconditionally NOT putting in a new one? If the existing PDF did not have one, and we add a cross reference stream but no XRefStm, what are the consequences? If it DID have one, what are the consequences of adding a second one? I just want to be clear in my mind on all these angles before I make your code change.

*

Thu May 16 08:08:04 2019 futuramedium [...] yandex.ru - Correspondence added

Quote
Code efficiency improvement, or change of behavior?

Neither, just keeping related things close together, for coherence/ease of maintenance. Decisions about content of trailer of new section, including what entries from existing trailer to keep, are made very near $tdict creation at line 341 (https://metacpan.org/release/PDF-API2/source/lib/PDF/API2/Basic/PDF/File.pm#L341). No reason to delete XRefStm 1000+ LOCs away. But it's not very important.

Reference:

Quote
Note: Table 3.17 defines an additional entry, XRefStm, that appears only in the trailer of hybrid-reference files, described in “Compatibility with Applications That Do Not Support PDF 1.5” on page 109.

A 1.4-compatible consumer doesn't know about "only" (nor, of course, what to do with XRefStm), but:

Quote
The added trailer contains all the entries (perhaps modified) from the previous trailer, as well as a Prev entry giving the location of the previous cross-reference section...

PDF::API2 is a 1.5 consumer, and it can't save hybrid files; therefore neither is the "old" XRefStm kept in appended sections (see the rotated page example), nor is a "new" one added. That "hybrid" was a bad design idea on Adobe's side anyway, but that's too far off topic. I hope the above quotes answer your other questions. Just consider how a PDF reader follows the chain of xref sections, looking for /Prev in trailers and (since it's 1.5-compatible) for /XRefStm, too: /XRefStm is checked first, before descending further, if the object wasn't found yet.

The original section remains "hybrid". What we are appending is "classic". If the original was not "hybrid" but a pure "1.5 xref stream", then we are appending a pure "xref stream" section and XRefStm is not required.
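Put together, building the trailer for an appended section could look like the sketch below. It is a toy model, not the code at File.pm line 341: appended_trailer and its arguments are hypothetical, and a trailer is reduced to a plain hash of strings and numbers.

```perl
use strict;
use warnings;

# Toy model: a trailer is a plain hash; real trailers hold PDF objects.
# appended_trailer() and its arguments are hypothetical names.
sub appended_trailer {
    my ( $old_trailer, $old_startxref, $new_size ) = @_;
    my %new = %$old_trailer;        # keep the entries of the previous trailer...
    delete $new{XRefStm};           # ...except the hybrid-only pointer
    $new{Prev} = $old_startxref;    # chain back to the previous section
    $new{Size} = $new_size;
    return \%new;
}

my %hybrid = ( Size => 12, Root => '1 0 R', XRefStm => 5678 );
my $next = appended_trailer( \%hybrid, 9000, 13 );
print join( ' ', map { sprintf '/%s %s', $_, $next->{$_} } sort keys %$next ), "\n";
```

The original trailer is left untouched, which matches the point that the original section stays "hybrid" while the appended one does not.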
« Last Edit: May 16, 2019, 10:42:14 AM by Phil »

*

delete has been added for release 3.015.