Post without Account — your post will be reviewed, and if appropriate, posted under Anonymous. You can also use this link to report any problems registering or logging in.

RT 113293 - Corrupted PDF produced by importPageIntoForm

  • 0 Replies
  • 904 Views
*

Offline Phil

  • Global Moderator
  • Sr. Member
  • *****
  • 437
    • View Profile
RT 113293 - Corrupted PDF produced by importPageIntoForm
« October 21, 2016, 02:03:31 PM »
Wed Mar 23 11:09:38 2016 melmothx [...] gmail.com - Ticket created
Subject:    Corrupted PDF produced by importPageIntoForm
Date:    Wed, 23 Mar 2016 16:09:15 +0100
To:    bug-PDF-API2 [...] rt.cpan.org
From:    Marco Pessotto <melmothx [...] gmail.com>

Hi there!

It looks like I've hit a bug, but honestly I don't know where to start looking.

Attached you can find two (apparently) identical PDFs, one produced by xetex and one by luatex. When I import the one produced by luatex, the second page get lost.

Evince just displays an empty page, while mupdf says:

error: stack overflow
warning: Ignoring errors during rendering
mupdf: warning: Errors found on page

Code to reproduce:

Code: [Select]
#!perl
use strict;
use warnings;
use PDF::API2;

for my $file (qw/xetex luatex/) {
    my $in = PDF::API2->open("t/resources/$file.pdf");
    my $out = PDF::API2->new;
    my $page = $out->page;
    my $gfx = $page->gfx;
    for my $p (1, 2) {
        if (my $included = $out->importPageIntoForm($in, $p)) {
            print "Including $p\n";
            $gfx->formimage($included, 100 * $p, 0)
        }
    }
    $out->saveas("t/resources/$file.out.pdf");
    $out->end;
    $in->end;
}

Please let know if I can be of any help here.

Best wishes

--
Marco
#
Subject:    [rt.cpan.org #113293]
Date:    Thu, 24 Mar 2016 10:48:00 -0400
To:    bug-PDF-API2 [...] rt.cpan.org
From:    Phil M Perry

Do you happen to know what PDF version is being output by XeTeX and LuaTeX? If they're greater than 1.3 (or maybe 1.4), they may have content that breaks PDF::API2. If you can throw a switch somewhere to output PDF 1.3 on these programs, could you see if the problem persists? One thing PDF::API2 needs is work to bring it up to PDF 1.7 compatibility, so it can safely import/load PDFs produced by other programs.
#
Thu Mar 24 10:47:50 2016 The RT System itself - Status changed from 'new' to 'open'
#
Thu Mar 24 11:12:31 2016 melmothx [...] gmail.com - Correspondence added
Subject:    Re: [rt.cpan.org #113293]
Date:    Thu, 24 Mar 2016 16:12:17 +0100
To:    bug-PDF-API2 [...] rt.cpan.org
From:    Marco Pessotto <melmothx [...] gmail.com>

XeTeX and LuaTeX usually give 1.5 output. With 1.4 output, things are fine. The files I sent are 1.5. The puzzling thing is that 1.5 produced by XeTeX is fine as well (as you can see from the tests), so yes, apparently luatex is including some content (which one is a good question) the module doesn't like (but doesn't throw an exception either).

--
Marco
#
Sat Mar 26 10:45:57 2016 steve [...] deefs.net - Correspondence added

Here's what I've found so far:

Using Perl 5.8.5 (the minimum supported version), your test script seg faults while trying to create luatex.out.pdf.  It works fine with Perl 5.14.2 (my system Perl).

Looking through both PDFs using contrib/pdf-debug.pl, they're both basically identical until we get to the XObject (form image) on page 2.  For luatex.out.pdf, instead of including object 13 (/Contents of page 2), it's including object 2 (an /ObjStm).  That explains why the resulting PDF is broken.

Looking at luatex.pdf, object 13 is the second object in the file, so I'm wondering if the object stream contents are getting saved in the wrong slot and overwriting the page contents.  It could just be coincidence, but it's where I'll be looking next.
#
Thu Jun 02 09:55:08 2016 steve [...] deefs.net - Correspondence added

Here's a brief update on this ticket: I spent a large chunk of yesterday trying to isolate the problem, but haven't gotten there yet.

Observations:

- The Page content stream in the source PDF is getting replaced by an object stream during an unrelated operation (copying a different object from the source PDF to the target PDF).

- With the provided luatex.pdf, the problem always happens on the second page to be included (reversing the order from 1, 2 to 2, 1 causes page 1 to be corrupt).

- importPageIntoForm calls walk_obj (see the $ssk foreach loop) which calls itself recursively.  It calls the realise method on the source object if it's an indirect object.  On the second page to be included, inside the first time importPageIntoForm calls walk_obj, but not necessarily the outer walk_obj call, this realise call is the one that causes the object stream to get embedded in the wrong place in the source PDF.

- The realise method just calls read_obj from .../Basic/PDF/FIle.pm

- read_obj calls read_objnum and merge.  The merge function doesn't appear to be the problem, and it would make sense for read_objnum to have the bug since it's dealing with object streams.

That's as far as I got before running out of time yesterday.  It's possible I'll be able to work on it some more next week, but after that I'll be away until July.  There will be another PDF::API2 release as soon as this issue gets fixed, so if anyone reading this wants to take the above and run with it, I'll happily give you full credit for a patch fixing the bug.  :-)
#
Thu Jun 02 11:33:06 2016 melmothx [...] gmail.com - Correspondence added
Subject:    Re: [rt.cpan.org #113293] Corrupted PDF produced by importPageIntoForm
Date:    Thu, 02 Jun 2016 17:32:54 +0200
To:    bug-PDF-API2 [...] rt.cpan.org
From:    Marco Pessotto <melmothx [...] gmail.com>

Hello Steve,

I'll try to give a shoot at it as well when I'll grab some brain-fresh time (probably over the weekend).

Best wishes

--
Marco
#
Fri Jun 03 15:08:04 2016 melmothx [...] gmail.com - Correspondence added
Subject:    Re: [rt.cpan.org #113293] Corrupted PDF produced by importPageIntoForm
Date:    Fri, 03 Jun 2016 21:07:53 +0200
To:    bug-PDF-API2 [...] rt.cpan.org
From:    Marco Pessotto <melmothx [...] gmail.com>

Some notes, hoping it will ring any bell:

In the $ssk loop, it looks like the contents of the page changes, even if it's not directly passed to the function:

Code: [Select]
                print "Including resource <$s_pdf $k $sk $ssk> $xo->resource($sk,$ssk,walk_obj($self->{apiimportcache}->{$s_pdf},$s_pdf->{pdf},$self->{pdf},$s_page->{$k}->{$sk}->{$ssk})\n";
                {
                    my  ($content) = $s_page->{Contents}->elementsof;
                    print "Page content (before embedding $ssk $s_page->{$k}->{$sk}->{$ssk}) is " . ref($content) . "\n";
                }

                my $walked = walk_obj($self->{apiimportcache}->{$s_pdf},
                                      $s_pdf->{pdf},
                                      $self->{pdf},
                                      $s_page->{$k}->{$sk}->{$ssk});
                {
                    my  ($content) = $s_page->{Contents}->elementsof;
                    print "Page content (embedding $ssk) is " . ref($content) . "\n";
                }
                $xo->resource($sk,$ssk,$walked);

And yields:
Code: [Select]
Page content (before embedding Im1 PDF::API2::Basic::PDF::Objind=HASH(0x192e8b0)) is PDF::API2::Content
Realising objind 10
Realising objind 16
Realising objind 15
Page content (embedding Im1) is PDF::API2::Content

Page content (before embedding F25 PDF::API2::Basic::PDF::Objind=HASH(0x192df50)) is PDF::API2::Content
Realising objind 8
Realising objind 17
Realising objind 20
Realising objind 18
Realising objind 19
Realising objind 21
Page content (embedding F25) is PDF::API2::Basic::PDF::Dict
Swapping the pages:
Code: [Select]
Page content (before embedding F25 PDF::API2::Basic::PDF::Objind=HASH(0x36407e8)) is PDF::API2::Content
Realising objind 8
Realising objind 21
Realising objind 17
Realising objind 20
Realising objind 19
Realising objind 18
Page content (embedding F25) is PDF::API2::Content

Page content (before embedding Im1 PDF::API2::Basic::PDF::Objind=HASH(0x3601a80)) is PDF::API2::Content
Realising objind 10
Realising objind 15
Realising objind 16
Page content (embedding Im1) is PDF::API2::Basic::PDF::Dict

With the xetex file, the page contents is always a Content object. That
Dict is the object stream which causes the corruption.

So much for today

--
Marco
#
Tue Jun 07 13:31:25 2016 melmothx [...] gmail.com - Correspondence added
Subject:    Re: [rt.cpan.org #113293] Corrupted PDF produced by importPageIntoForm
Date:    Tue, 07 Jun 2016 19:31:08 +0200
To:    bug-PDF-API2 [...] rt.cpan.org
From:    Marco Pessotto <melmothx [...] gmail.com>
Download (untitled) / with headers

Ok, I think at least I found where the problem is:

Code: [Select]
diff --git a/lib/PDF/API2/Basic/PDF/File.pm b/lib/PDF/API2/Basic/PDF/File.pm
index 0544aca..bd9fca7 100644
--- a/lib/PDF/API2/Basic/PDF/File.pm
+++ b/lib/PDF/API2/Basic/PDF/File.pm
@@ -545,14 +545,9 @@ sub readval {
         $value = $2;
         $str =~ s/^([0-9]+)$ws_char+([0-9]+)$ws_char+obj//s;
         ($obj, $str) = $self->readval($str, %opts, 'objnum' => $num, 'objgen' => $value);
-        if ($result = $self->test_obj($num, $value)) {
-            $result->merge($obj);
-        }
-        else {
             $result = $obj;
             $self->add_obj($result, $num, $value);
             $result->{' realised'} = 1;
-        }
         $str = update($fh, $str) if $update;       # thanks to kundrat@kundrat.sk
         $str =~ s/^endobj//;
     }

This patch appear to fix the issue in the testfile, but it's totally unclear if it has side effects or if it's doing the right thing.

Let's consider this a starting point.

--
Marco
#
Tue Jun 07 13:41:59 2016 melmothx [...] gmail.com - Correspondence added
Subject:    Re: [rt.cpan.org #113293] Corrupted PDF produced by importPageIntoForm
Date:    Tue, 07 Jun 2016 19:41:44 +0200
To:    bug-PDF-API2 [...] rt.cpan.org
From:    Marco Pessotto <melmothx [...] gmail.com>

And some notes about the readval method, which is called recursively.

1. From read_objnum we pass these options:

 ($object) = $self->readval($stream, %opts, objnum => $num, objgen => $gen, update => 0);

However, in the readval code, I can't find an usage for objnum and
objgen, so it seems to me they are ignored, but could be mistaken.

I suspect those values not obeyed is what lead to the caching problem.


2. From read_objnum

  my $stream = "$num 0 obj" . substr($objects, $start, $length);
  ($object) = $self->readval($stream, %opts, objnum => $num, objgen => $gen, update => 0);

Unclear if that "$num 0 obj" should be "$num $gen obj" instead. I don't
have a PDF with some object revisions, so dunno, just submitted to the
attentions.

3. In readval

  my $update = defined($opts{update}) ? $opts{update} : 1;
  $str = update($fh, $str);

While elsewere we have  $str = update($fh, $str) if $update;
So at the begin of the method, update is called even if update => 0 is
passed.

Point 2. and 3. are unrelated to the issue in the ticket, and are
probably worth a comment in the code or in the doc to clarify it for the
next souls wondering in this area of the code.

I'll continue digging on the issue over the next days, assuming I'll
have some spare time to throw at this.

Cheers

--
Marco
#
Wed Jun 08 11:16:06 2016 melmothx [...] gmail.com - Correspondence added
Subject:    Re: [rt.cpan.org #113293] Corrupted PDF produced by importPageIntoForm
Date:    Wed, 08 Jun 2016 17:15:53 +0200
To:    bug-PDF-API2 [...] rt.cpan.org
From:    Marco Pessotto <melmothx [...] gmail.com>

It looks like the patch is making the pdf.t going into very-very deep recursion, so no, definitively it's not doing the right thing. Kind of
stuck here.

--
Marco
#
Thu Oct 06 23:05:54 2016 steve [...] deefs.net - Correspondence added

Found it.  The problem actually got introduced when the cross-reference stream was originally read, due to this gotcha in readxrtr:
Code: [Select]
for $xmin ($start...$last) {
    # ...
}
# ...
$self->{' maxobj'} = $xmin;

The problem is that the "$xmin" in the for loop is an implicit local variable, so the $xmin outside the loop was still the default (1) instead of $last (27 in your file).  When new_obj gets called later on, it increments maxobj to 2, then clobbers what was in that spot, which happened to be the object stream in your file.  Everything we saw after that followed from that mistake.

The repo has the fix, and it'll be in the next release.  And I'll be very happy to stop banging my head on this problem.  :-)
#
Thu Oct 06 23:06:05 2016 steve [...] deefs.net - Status changed from 'open' to 'resolved'
#
Thu Oct 06 23:19:59 2016 steve [...] deefs.net - Correspondence added

Quote
1. From read_objnum we pass these options:

 ($object) = $self->readval($stream, %opts, objnum => $num, objgen =>
 $gen, update => 0);

 However, in the readval code, I can't find an usage for objnum and > objgen, so it seems to me they are ignored, but could be mistaken.

I suspect those values not obeyed is what lead to the caching problem.
They do indeed seem to be ignored.  This ended up being unrelated, though.

There doesn't seem to be anywhere else in the codebase that uses them either, so I'm removing them to avoid confusion later.
 
Quote
2. From read_objnum
Code: [Select]
my $stream = "$num 0 obj" . substr($objects, $start, $length);
 ($object) = $self->readval($stream, %opts, objnum => $num, objgen =>
$gen, update => 0);

Unclear if that "$num 0 obj" should be "$num $gen obj" instead.
The generation number in object streams is always going to be zero, so even if objgen weren't being ignored, $gen would always be zero.

Quote
3. In readval
Code: [Select]
my $update = defined($opts{update}) ? $opts{update} : 1;
$str = update($fh, $str);

While elsewere we have $str = update($fh, $str) if $update; So at the begin of the method, update is called even if update => 0 is
passed.
Hmm, good catch.  It didn't end up mattering for this bug because we're passing the whole stream, so it was just adding garbage to the end of it, which is probably going to be the case any time we set $update to 0 anyway, but I've added the "if $update" to that line as well.
#
Thu Oct 06 23:23:58 2016 steve [...] deefs.net - Status changed from 'resolved' to 'patched'
#
Mon Oct 10 09:23:41 2016 steve [...] deefs.net - Status changed from 'patched' to 'resolved'
#
Mon Oct 10 09:23:42 2016 steve [...] deefs.net - Broken in 2.026 added
#
Mon Oct 10 09:23:42 2016 steve [...] deefs.net - Fixed in 2.029 added