[RT 136648] a simple open() + save() adds extra content past the EOF...

Offline Phil

  • Global Moderator
  • Hero Member
  • *****
  • 810
    • View Profile
Subject: a simple open() + save() adds extra content past the EOF, causing Adobe Reader to repair the file
To: bug-PDF-API2@rt.cpan.org
From: chrispy@synopsys.com
Date: Fri, 28 May 2021 21:13:13 -0400

If I do a simple open/save:
Code: [Select]
#!/usr/bin/perl
use PDF::API2;
my $pdf = PDF::API2->open('orig.pdf');
$pdf->saveas('rewritten.pdf');
the output file is identical to the input file, except for new content added after the original %%EOF (note the double %%EOF now):
Code: [Select]
6737
%%EOF
xref
0 1
0000000000 65535 f
trailer
<< /Type /XRef /Filter /FlateDecode /ID [ <a2e78ef36ff1bc1c88233f0a2a324a39> <a2e78ef36ff1bc1c88233f0a2a324a39> ] /Info 1 0 R /Length 183 /Prev 6737 /Root 25 0 R /Size 64 /W [ 1 8 2 ] >>
startxref
7164
%%EOF
When the resulting file is opened in Adobe Reader, it is repaired (and a repair dialog appears/disappears very quickly). When the file is closed, Adobe Reader prompts to save the repaired/updated file.
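
A quick way to confirm the doubled %%EOF without opening the file in an editor is to count the markers in the raw bytes. The following is only a minimal sketch in plain Perl; the helper names `count_eof_markers` and `slurp_binary` are made up for illustration and are not part of PDF::API2. It assumes that the byte sequence %%EOF does not also occur by chance inside a compressed stream, so treat the count as a hint rather than proof.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Count %%EOF markers in raw PDF bytes. More than one usually means that at
# least one incremental-update section was appended after the original file.
sub count_eof_markers {
    my ($bytes) = @_;
    my $count = () = $bytes =~ /%%EOF/g;
    return $count;
}

# Read a file in binary mode so that compressed streams are not altered.
sub slurp_binary {
    my ($path) = @_;
    open(my $fh, '<:raw', $path) or die "Cannot open $path: $!";
    local $/;    # slurp mode: read the whole file at once
    my $bytes = <$fh>;
    close($fh);
    return $bytes;
}

# Report on every file named on the command line, if any.
if (@ARGV) {
    for my $path (@ARGV) {
        printf "%s: %d %%%%EOF marker(s)\n", $path,
            count_eof_markers(slurp_binary($path));
    }
}
```

Running it as `perl count_eof.pl orig.pdf rewritten.pdf` (the script name is arbitrary) should show whether the rewritten file gained a marker relative to the original.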

*

Sat May 29 10:59:45 2021 PMPERRY@cpan.org - Correspondence added

Interesting. I tried this with the latest (2.040) PDF::API2, and it gave me the error

Error opening 'orig.pdf': Permission denied at C:/Strawberry/perl/site/lib/PDF/API2/Basic/PDF/File.pm line 231.

The orig.pdf file is Read/Write (I checked with the DOS attrib command, on Windows 10), so I don't know what it's complaining about, or why it's different from your run. Recent PDF::API2 releases (at least as far back as 2.039) check if you're trying to update a Read/Only PDF, so I don't know why it's unhappy. What version are you running?

I tried it with PDF::Builder 3.023-beta, and it ran OK. It did give warnings that objects 19-23, and 30, are children (Kids) of objects 24 and 26, but do not declare their Parent. That may or may not cause problems. Also, I see that orig.pdf, while declared to be version 1.4 (with 1.5 Version override), uses a cross reference stream (PDF-1.5) rather than a table. That might have something to do with the new object 64 (cross reference stream) appended to the end of the file. The rewritten.pdf appeared to be clean -- Adobe Acrobat Reader did not ask to save a "fixed" copy.

*

Sat May 29 12:00:14 2021 chrispitude@gmail.com - Correspondence added

Well that is strange! File permissions shouldn't be preserved through site attachments, so something else must be going on. I'm currently on Ubuntu 20.04, using the latest PDF::API2 (2.040 to test the fix for "133131: Fix incorrect endianness of 64-bit XRef stream entry widths"). I'll try installing Strawberry Perl in Windows 10 to see if I can reproduce the behavior.

PDF::Builder also writes extra content at the end, but (1) it's slightly different:
Code: [Select]
6737
%%EOF

64 0 obj << /Type /XRef /DecodeParms << /Columns 4 /Predictor 12 >> /Filter /FlateDecode /ID [ <a2e78ef36ff1bc1c88233f0a2a324a39> <a2e78ef36ff1bc1c88233f0a2a324a39> ] /Index [ 0 1 64 1 ] /Info 1 0 R /Length 16 /Prev 6737 /Root 25 0 R /Size 65 /W [ 1 2 1 ] >>
stream
xÚcb
and (2) it doesn't provoke Adobe Reader to repair the file.

I don't know enough about PDF to understand the difference, but I am attaching both output files if you're curious to have a look.

PDF::Builder also issues the following messages:
Quote
    PDF Integrity Check: object 24.0 claims 19.0 as a child (/Kids), but 19.0 claims no Parent!
    PDF Integrity Check: object 24.0 claims 20.0 as a child (/Kids), but 20.0 claims no Parent!
    PDF Integrity Check: object 24.0 claims 21.0 as a child (/Kids), but 21.0 claims no Parent!
    PDF Integrity Check: object 24.0 claims 22.0 as a child (/Kids), but 22.0 claims no Parent!
    PDF Integrity Check: object 24.0 claims 23.0 as a child (/Kids), but 23.0 claims no Parent!
    PDF Integrity Check: object 26.0 claims 30.0 as a child (/Kids), but 30.0 claims no Parent!
Are these messages something I should relay back to the software developer? (The application is Oxygen XML Author, which uses Apache FOP internally for the actual publishing.)

*

Sat May 29 13:24:02 2021 PMPERRY@cpan.org - Correspondence added

Regarding the "no parent" messages, I would treat that as a "slightly suspicious" point that MIGHT give a clue where to look if nothing else pans out. You might bring it to the attention of the application developer, that it's generally good practice for a /Kid to declare their /Parent, although I don't think it's strictly required.

If the application is going to output a PDF with 1.5 features (such as a cross reference stream), it's desirable to make the version at the top 1.5 rather than 1.4, although setting the /Version in the Root object is legal. The orig.pdf is perfectly legal; I don't know if changing to 1.5 up top will make any difference to PDF::API2 (since I can't seem to test it). PDF::API2 is supposed to accept XRef Streams.

I find it suspicious that PDF::API2 created a new XRef Stream, but unlike PDF::Builder, it didn't seem to provide a stream for the object! Builder, like the original file's object 63 XRef Stream, provided a data stream for object 64 (the updated XRef Stream) AND made it an object, while API2 didn't do either. That may be a bug in API2; Steve will have to look at it.
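
The mismatch described here can be spotted mechanically: a classic `trailer` dictionary normally carries keys such as /Size, /Root, /Info, /ID and /Prev, whereas /Type /XRef, /W, /Filter and /Length belong to the dictionary of a cross-reference stream object. The sketch below, in plain Perl, flags a `trailer` dictionary that carries the stream-style keys. The function name `inspect_trailers` is hypothetical, the sample tail is abridged from the output quoted in this ticket (the /ID entry is left out), and the regular expressions are a rough text scan, not a real PDF parser.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return one record per classic "trailer" dictionary found in $text, noting
# whether it carries keys that belong to a cross-reference stream instead.
sub inspect_trailers {
    my ($text) = @_;
    my @found;
    while ($text =~ /\btrailer\s*<<(.*?)\bstartxref\b/gs) {
        my $dict = $1;
        push @found, {
            claims_xref_stream => ($dict =~ m{/Type\s*/XRef\b}         ? 1 : 0),
            has_stream_keys    => ($dict =~ m{/(?:W|Filter|Length)\b} ? 1 : 0),
        };
    }
    return @found;
}

# Abridged copy of the section PDF::API2 appended (the /ID entry is omitted).
my $api2_tail = <<'END';
xref
0 1
0000000000 65535 f
trailer
<< /Type /XRef /Filter /FlateDecode /Info 1 0 R /Length 183 /Prev 6737 /Root 25 0 R /Size 64 /W [ 1 8 2 ] >>
startxref
7164
%%EOF
END

for my $t (inspect_trailers($api2_tail)) {
    print "trailer dictionary claims /Type /XRef\n" if $t->{claims_xref_stream};
    print "trailer dictionary carries stream-only keys\n" if $t->{has_stream_keys};
}
```

The PDF::Builder output quoted earlier would produce no records at all, because its /Type /XRef dictionary sits inside `64 0 obj` followed by `stream` rather than after a `trailer` keyword.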

*

Sat May 29 13:54:51 2021 chrispitude@gmail.com - Correspondence added

Phil, thanks for your suggestions! I'll forward this to the application developer and let them feed it into the Apache FOP support machine (if they choose to).

Just curious - why did PDF::Builder append a new object at all, if we're just opening and rewriting the same content?

*

Sat May 29 14:10:40 2021 PMPERRY@cpan.org - Correspondence added
Quote
Just curious - why did PDF::Builder append a new object at all, if we're just opening and rewriting the same content?
That's behavior inherited from PDF::API2. I presume it has something to do with orig.pdf being declared PDF 1.4 and then a 1.5 feature (cross reference stream) being found, triggering it to add something (a new XRef Stream) at the end, even though nothing was changed. As I said earlier, I think PDF::API2 got it wrong and should have added an object (with new data stream) rather than the non-object with no stream, but Steve will have to determine that.

*

Sat May 29 14:35:55 2021 chrispitude@gmail.com - Correspondence added

Hi Phil,

My publishing software has a knob to specify the output PDF version, so I set it to 1.5 (new orig1.5.pdf file attached). I get the same behavior:
  • Both PDF::API2 and PDF::Builder append content to the end.
  • rewritten_from_API2_1.5.pdf causes Adobe Reader to prompt to save the repaired file when closing.
  • rewritten_from_Builder_1.4.pdf opens and closes in Adobe Reader with no prompt.
Hi Steve,

There might be something in this content at the end that needs to be written differently so that Adobe Reader doesn't feel the need to repair the file.

Also, what is the purpose of this new content appended at the end?
Code: [Select]
%%EOF
xref
0 1
0000000000 65535 f
trailer
<< /Type /XRef /Filter /FlateDecode /ID [ <ca19b24f9cf4829de2abd3299bbf3130> <ca19b24f9cf4829de2abd3299bbf3130> ] /Info 1 0 R /Length 144 /Prev 5884 /Root 19 0 R /Size 44 /W [ 1 8 2 ] >>
startxref
6272
%%EOF
Interestingly, if I load the rewritten file and rewrite it again:
Code: [Select]
my $pdf = PDF::API2->open('rewritten_from_API2_1.5.pdf');
$pdf->saveas('rewritten_from_API2_twice_1.5.pdf');
...an additional content chunk is appended at the end:
Code: [Select]
%%EOF
xref
0 1
0000000000 65535 f
trailer
<< /Type /XRef /Filter /FlateDecode /ID [ <ca19b24f9cf4829de2abd3299bbf3130> <ca19b24f9cf4829de2abd3299bbf3130> ] /Info 1 0 R /Length 144 /Prev 5884 /Root 19 0 R /Size 44 /W [ 1 8 2 ] >>
startxref
6272
%%EOF
xref
0 1
0000000000 65535 f
trailer
<< /Type /XRef /Filter /FlateDecode /ID [ <ca19b24f9cf4829de2abd3299bbf3130> <ca19b24f9cf4829de2abd3299bbf3130> ] /Info 1 0 R /Length 144 /Prev 6272 /Root 19 0 R /Size 44 /W [ 1 8 2 ] >>
startxref
6517
%%EOF
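
The two appended sections above appear to be chained: the second section's /Prev (6272) equals the first section's startxref value. A small plain-Perl sketch can pull those numbers out and check the linkage. The function names `update_chain` and `chain_is_linked` are invented for this illustration, the sample text is abridged from the output above, and the regex scan assumes that each /Prev is followed by its own startxref, as in these tails.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Pull (Prev, startxref) pairs out of the appended sections, in file order.
sub update_chain {
    my ($text) = @_;
    my @sections;
    while ($text =~ m{/Prev\s+(\d+)\b.*?\bstartxref\s+(\d+)}gs) {
        push @sections, { prev => $1, startxref => $2 };
    }
    return @sections;
}

# True when every section's /Prev equals the startxref of the section before it.
sub chain_is_linked {
    my (@sections) = @_;
    for my $i (1 .. $#sections) {
        return 0 if $sections[$i]{prev} != $sections[$i - 1]{startxref};
    }
    return 1;
}

# Abridged copy of the twice-rewritten tail (only the keys needed here).
my $twice = <<'END';
trailer
<< /Type /XRef /Length 144 /Prev 5884 /Root 19 0 R /Size 44 /W [ 1 8 2 ] >>
startxref
6272
%%EOF
xref
0 1
0000000000 65535 f
trailer
<< /Type /XRef /Length 144 /Prev 6272 /Root 19 0 R /Size 44 /W [ 1 8 2 ] >>
startxref
6517
%%EOF
END

my @chain = update_chain($twice);
printf "Prev %d -> startxref %d\n", $_->{prev}, $_->{startxref} for @chain;
print chain_is_linked(@chain) ? "sections are chained\n" : "chain is broken\n";
```

A consistent chain only shows that each save appended one more incremental section on top of the last; it says nothing about whether the content of each section is valid.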

*

Wed Jun 16 14:00:06 2021 PMPERRY@cpan.org - Correspondence added

I certainly agree that if the original PDF is correct, there should be no reason to add more content at the end in either PDF::API2 or PDF::Builder. Note that the original should be absolutely correct: if any reader asks permission to "save" it after opening, that means it thinks it found an error. If Builder is otherwise acting correctly (it produces a legitimate, working PDF), I'll consider that to be a very minor bug, and probably won't get to it for a long time. I see that, like API2, Builder adds more content (another xref stream) on a second open, even though the line 1 version was "1.5" this time around. In contrast, it looks like API2 may be botching its "repair", so Steve will have to address that one.