RT 117184 - Unable to write an opened PDF containing cross-reference streams

*

Offline Phil

  • Global Moderator
  • Hero Member
  • *****
  • 572
    • View Profile
Fri Mar 11 11:03:10 2016 profires [...] gmail.com - Ticket #112932: Ticket created
Subject:    Can't call method "outobjdeep" in 2.026
 
Hi,

I'm using the library in a simple script to update the info structure of PDF files.
References:
Perl version: strawberry version 5.18.2
OS: Windows 7 Enterprise

PDF input files are produced by a Java program with PDF version 1.4 (and I have never had a problem with these).

The issue happens sometimes, when users add an annotation with Acrobat Reader, which raises the PDF version to 1.7 (according to the Acrobat Reader that they use).

After this modification we get the known old bug #48683, as we are using PDF-API2-2.021.

I've just tried the newly released PDF-API2-2.026 to see whether anything has changed, and I get a new error message:
<<
Can't call method "outobjdeep" on an undefined value at D:/tm_programs/perl_portable_pdf/perl/site/lib/PDF/API2/Basic/PDF/Objind.pm line 170.
 
Below is an extract from my sample script:
<<
Code: [Select]
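use POSIX qw(strftime);
use PDF::API2;

# $source holds the path of the PDF to update (defined elsewhere in the script)
my $pdf = PDF::API2->open($source) or die "Can't open PDF file $source: $!";
my $nowDate = strftime( "%Y%m%d%H%M%S", localtime() );

my %h = $pdf->info(
    'CreationDate' => $nowDate,
);
$pdf->saveas($source);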

As this is my first time reporting a bug, please excuse any mistakes.
#
Tue Mar 15 15:07:43 2016 steve [...] deefs.net - Ticket #112932: Correspondence added
 
Are you able to attach a PDF that demonstrates this problem?  If you'd rather it not be publicly visible, you can instead send one to me privately.
#
Tue Mar 15 15:07:43 2016 The RT System itself - Ticket #112932: Status changed from 'new' to 'open'
#
Thu Mar 17 12:33:25 2016 profires [...] gmail.com - Ticket #112932: Correspondence added
Subject:    Re: [rt.cpan.org #112932] Can't call method "outobjdeep" in 2.026
Date:    Thu, 17 Mar 2016 16:33:03 +0000
To:    bug-PDF-API2 [...] rt.cpan.org
From:    Francesco Fiorentino <profires [...] gmail.com>
 
Attached are sample.pdf, on which the attached Perl script (test.pl) works correctly, and a modified copy (sampleMod.pdf), which produces the error message listed above. sampleMod.pdf was obtained by adding a highlight with Adobe Reader XI and saving the file.

« Last Edit: March 07, 2019, 07:41:58 PM by Phil »

*

#
Tue Apr 26 04:07:46 2016 profires [...] gmail.com - Ticket #112932: Correspondence added
Subject:    Re: [rt.cpan.org #112932] Can't call method "outobjdeep" in 2.026
Date:    Tue, 26 Apr 2016 08:07:25 +0000
To:    bug-PDF-API2 [...] rt.cpan.org
From:    Francesco Fiorentino <profires [...] gmail.com>
 
Hi,

With the 2.027 release, the error message no longer appears, but with the same input attached previously, it produces an unreadable file.

Thanks,
Francesco

#
Wed Jun 01 10:12:24 2016 profires [...] gmail.com - Ticket #112932: Correspondence added
Subject:    Re: [rt.cpan.org #112932] Can't call method "outobjdeep" in 2.026
Date:    Wed, 01 Jun 2016 14:12:03 +0000
To:    bug-PDF-API2 [...] rt.cpan.org
From:    Francesco Fiorentino <profires [...] gmail.com>
 
Any feedback about that?

#
Thu Jun 02 09:55:22 2016 steve [...] deefs.net - Ticket #112932: Correspondence added

I suspect that it's the same issue as ticket #113293.
#
Tue Jun 07 15:00:48 2016 MELMOTHX [...] cpan.org - Ticket #112932: Correspondence added
 
Actually, the issue seems unrelated. End of the modified PDF:

Code: [Select]
startxref
116
%%EOF
8 0 obj << /CreationDate (20160607205416) /Creator (Apache FOP Version 1.1) /ModDate (D:20160317171139+01'00') /PDFVersion (1.4) /Producer (Apache FOP Version 1.1) >> endobj
xref
0 1
0000000000 65535 f
8 1
0000009549 00000 n
trailer
<< /Type /XRef /DecodeParms << /Columns 4 /Predictor 12 >> /Filter /FlateDecode /ID [ <951086a159fa774291c81f007ad52c0e> <d0fd218e4aa35740b313e56bfd43b2db> ] /Index [ 9 18 ] /Info 8 0 R /Length 60 /Prev 116 /Root 10 0 R /Size 1 /W [ 1 2 1 ] >>
startxref
9723
%%EOF

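At first glance, it looks like the code just appends this section and keeps the original PDF verbatim, hence the breakage. Note that the appended section is a classic xref table, yet its trailer still carries the cross-reference stream keys (/Type /XRef, /W, /Index, /Filter, /DecodeParms).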
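To look for this kind of mix in another file, the startxref pointers can be walked and classified. This is only a minimal sketch: it assumes the whole file fits in memory, xref_sections is a hypothetical helper (not part of PDF::API2), and the regular expressions are a simplification of real PDF parsing.

```perl
use strict;
use warnings;

# Hypothetical helper: list every "startxref" pointer in a PDF and report
# whether it points at a classic table (the "xref" keyword) or at an indirect
# object (which is how a cross-reference stream is stored).
sub xref_sections {
    my ($bytes) = @_;
    my @sections;
    while ($bytes =~ /startxref\s+(\d+)\s+%%EOF/g) {
        my $offset = $1;
        my $head   = $offset < length($bytes) ? substr($bytes, $offset, 32) : '';
        my $kind   = $head =~ /\Axref\b/            ? 'table'
                   : $head =~ /\A\d+\s+\d+\s+obj\b/ ? 'stream'
                   :                                  'unknown';
        push @sections, [ $offset, $kind ];
    }
    return @sections;
}

# Optional command-line use: perl xref_sections.pl some.pdf
if (@ARGV) {
    open my $fh, '<:raw', $ARGV[0] or die "Can't open $ARGV[0]: $!";
    my $bytes = do { local $/; <$fh> };
    printf "startxref %d -> %s\n", @$_ for xref_sections($bytes);
}
```

For a file shaped like the tail quoted above, the earlier pointer should be reported as a stream and the appended one as a table, although this has not been run against the attached sampleMod.pdf.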
#
Wed Aug 24 05:06:45 2016 dietrich.streifert [...] googlemail.com - Ticket created
Subject:    Simply opening and saving a multipage PDF file corrupts the file
Date:    Wed, 24 Aug 2016 11:06:30 +0200
To:    bug-PDF-API2 [...] rt.cpan.org
From:    Dietrich Streifert <dietrich.streifert [...] googlemail.com>
 
This is for Perl 5.16 on CentOS 7.2, using a simple test file (filename "test.pdf") with four pages:

The following code
Code: [Select]
use PDF::API2;

my $pdf = PDF::API2->open("test.pdf");
$pdf->saveas("test-mod.pdf");
$pdf->end;
generates a corrupt file "test-mod.pdf" which is no longer readable by, e.g., Acrobat Reader, which reports that the document cannot be opened (code 14).

This behaviour makes PDF::API2 unusable for even the simplest modifications.

I've attached both the perl code and the test file (don't know if this gets through the email bug submission at rt.cpan.org)

#
Subject:    [rt.cpan.org #117184]
Date:    Wed, 24 Aug 2016 10:12:45 -0400
To:    bug-PDF-API2 [...] rt.cpan.org
From:    Phil M Perry
 
I see that the original test.pdf is PDF version 1.5. Maybe there's something in there that got corrupted when reading into PDF::API2. Is it possible to create your test.pdf in version 1.4 or even 1.3? Admittedly that's not a great solution -- PDF::API2 needs to be brought into the 21st century and handle up to version 1.7 correctly -- but it may do for the time being.
#
Wed Aug 24 10:13:02 2016 The RT System itself - Status changed from 'new' to 'open'
#
Wed Aug 24 10:27:59 2016 dietrich.streifert [...] googlemail.com - Correspondence added
Subject:    Re: [rt.cpan.org #117184]
Date:    Wed, 24 Aug 2016 16:27:45 +0200
To:    bug-PDF-API2 [...] rt.cpan.org
From:    Dietrich Streifert <dietrich.streifert [...] googlemail.com>
 
You're right! It works if I convert test.pdf to PDF-Version 1.4.
#
Thu Oct 06 23:34:09 2016 steve [...] deefs.net - Subject changed from 'Simply opening and saving a multipage PDF file corrupts the file' to 'Unable to write an opened PDF containing cross-reference streams'
#
Thu Oct 06 23:34:09 2016 steve [...] deefs.net - Severity Wishlist added
#
Thu Oct 06 23:42:42 2016 steve [...] deefs.net - Correspondence added

PDF::API2 got support for reading files with cross-reference streams in version 2.026, but it doesn't yet support writing those files.

The easiest way to implement this would be to convert the object stream to regular objects and save the file normally.  That would eliminate the need to teach PDF::API2 how to write a cross-reference stream, though that's the other option.  Doing so will typically produce a file that's a little smaller, but it isn't necessary.

As a workaround until someone adds that support, you can use importPageIntoForm to copy each page into a new PDF file, or use other copy methods to get the data from the original file to a new one.
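The page-copy workaround might look roughly like the sketch below. The helper name copy_pages is hypothetical, and the PDF::API2 calls (open, new, pages, importpage, saveas) are assumed to behave as documented for the 2.0xx releases; it has not been verified against the files attached to this ticket.

```perl
use strict;
use warnings;

# Sketch of the "copy each page into a new PDF" workaround.
# copy_pages() is a hypothetical helper name, not a PDF::API2 method.
sub copy_pages {
    my ($source_file, $target_file) = @_;
    require PDF::API2;    # loaded on first use

    my $old = PDF::API2->open($source_file);
    my $new = PDF::API2->new();

    # Page numbers are 1-based; with no target index the page is appended.
    $new->importpage($old, $_) for 1 .. $old->pages();

    # The new document is written from scratch, so the cross-reference
    # stream of the source file never has to be written back out.
    $new->saveas($target_file);
    return $target_file;
}
```

A call such as copy_pages("sampleMod.pdf", "sampleMod-copy.pdf") would then replace the open/saveas round trip. Document-level data such as the Info dictionary is probably not carried over by a page import, so it would have to be set again on the new document.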
#
Fri Oct 07 00:21:57 2016 steve [...] deefs.net - Ticket #112932: Correspondence added

On Thu Jun 02 09:55:22 2016, SSIMMS wrote:

Ok, not #113293, but it does appear to be the same as #117184.  sampleMod.pdf contains a cross-reference stream.  PDF::API2 can read them as of version 2.026, but it doesn't know how to write a cross-reference stream yet, nor how to convert from a cross-reference stream to a cross-reference table (which would likely be the easier of the two to implement).

A potential solution and a workaround are given in ticket #117184.
#
Fri Oct 07 00:22:50 2016 steve [...] deefs.net - Ticket #112932: Merged into ticket #117184
#
Fri Oct 07 00:22:50 2016 steve [...] deefs.net - Merged into ticket #117184

<formatting cleanup - Mod.>
« Last Edit: March 07, 2019, 07:42:27 PM by Phil »

*

Steve has rejected RT 120450 as a duplicate of this bug.
120450 reopened, closed as PATCHED
« Last Edit: March 07, 2019, 07:42:48 PM by Phil »

*

Sun Jul 02 23:45:57 2017 steve [...] deefs.net - Correspondence added

Possible solution in ticket 121832.
« Last Edit: March 07, 2019, 07:43:05 PM by Phil »

*

 PhilterPaper commented 1 Dec

A fundamental problem here is that cross-reference streams are PDF 1.5, while PDF::Builder (and API2) are supposed to be 1.4. It should refuse to read in a PDF 1.5 file until all 1.5 features are fully implemented! At any rate, it may be able to read these streams, but it is not yet able to write them, so the former capability isn't terribly useful.
« Last Edit: March 07, 2019, 07:43:26 PM by Phil »

*

Originally opened on RT as 112932 (Can't call method "outobjdeep" in 2.026). Another ticket, RT 117184 (Simply opening and saving a multipage PDF file corrupts the file), was renamed to "Unable to write an opened PDF containing cross-reference streams", and the two tickets were merged under the latter's name. To make things more consistent, the title of this ticket will be changed, too.

RT 121832 is suggested as a fix, but the code does not appear to be in PDF::API2 or PDF::Builder. Need to check more on this, as 121832 was closed as fixed in both products! RT 117184 is still open on PDF::API2.

Add:
A file written with a cross-reference stream would have to be marked as "PDF 1.5" output. That would probably happen automatically anyway, if the only way to produce a cross-reference stream is to read in a PDF 1.5 (or higher) file. If we come up with a way to generate a cross-reference stream natively, it would have to force 1.5 output.
« Last Edit: March 09, 2019, 12:44:18 PM by Phil »

*

Mon Apr 01 11:44:37 2019 PMPERRY@cpan.org - Correspondence added

Ticket 121832 is marked as fixed (resolved), but I don't think Vadim's code was put in, and I don't think the current PDF::API2 (nor PDF::Builder) can deal with writing back out a PDF 1.5 cross-reference stream. I don't know for sure what was "fixed" in that ticket. Perhaps it would be a good time to take another look at either writing out a cross-reference stream or converting it to a classic xref table.

In PDF::Builder, the cross-reference stream output would automatically bump the PDF version to 1.5 (simply reading in such a PDF in the first place will also do so). I have no problems with doing that -- on the other hand, is there a strong argument for converting to an xref table, to stay at PDF 1.4? Cross-reference streams, once read in, seem to be causing more and more trouble, so it would be good to deal with them once and for all.
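The version bump described here can be pictured with a small sketch. Both helpers are hypothetical illustrations, not PDF::Builder or PDF::API2 code, and they only look at the %PDF header line (a /Version entry in the document catalog can override the header, which this sketch ignores).

```perl
use strict;
use warnings;

# Read the version from a PDF header line such as "%PDF-1.4".
sub header_version {
    my ($bytes) = @_;
    return $bytes =~ /\A%PDF-(\d\.\d)/ ? $1 : undef;
}

# Hypothetical policy helper: writing a cross-reference stream requires at
# least PDF 1.5; otherwise the input version is kept unchanged.
sub output_version {
    my ($input_version, $writes_xref_stream) = @_;
    return '1.5' if $writes_xref_stream && $input_version < 1.5;
    return $input_version;
}
```

Converting to a classic xref table instead would correspond to calling output_version with a false second argument, which leaves a 1.4 input at 1.4.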

*

Tue Apr 02 22:13:42 2019 futuramedium [...] yandex.ru - Correspondence added

Phil,

I have a better alternative to the patch (hack) from #121832. To please Acrobat/Reader, an incremental update can append either a classical Xref Table or a compressed Xref Stream. The new patch seems to be working. The test PDF file is from this thread.

1) Producing "hybrid files" to ensure "compatibility with older applications" is not implemented (was not even contemplated -- I don't think it's important anymore).

2) No support (with this patch, but it would not be difficult in general) for files > ~4 GB.

3) Somewhat lousy compression (because of no prediction) if someone updates an unusually large number of objects (i.e., generally unlikely).

4) Of course, updated objects are not stuffed into streams, and furthermore this patch does nothing to "use modern compression" when file is clean-output (IIRC, PDF::API2 can't do it anyway).

5) Important -- this patch also applies changes (2 topmost changes) as per #121911.

In fact, the fixes are minimal: existing code is mostly reused to collect the updates made to the XRef Table (instead of writing them as they come) and then apply them appropriately in either of the two modes.
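The "collect first, then apply" step can be illustrated in isolation. This is a sketch of the idea only: split_xref_lines is a hypothetical name, and the input is assumed to be the text lines that the classic writer would otherwise have printed.

```perl
use strict;
use warnings;

# Separate collected classic-xref lines into subsection headers
# ("first-object count") and entries ("offset gennum n|f").
sub split_xref_lines {
    my @lines = @_;
    my (@index, @entries);
    for my $line (@lines) {
        my @fields = split ' ', $line;
        if (@fields == 2) {
            push @index, @fields;       # becomes the /Index array
        }
        else {
            push @entries, \@fields;    # later packed into the stream body
        }
    }
    return (\@index, \@entries);
}

my ($index, $entries) = split_xref_lines(
    "0 1\n", "0000000000 65535 f \n",
    "8 1\n", "0000009549 00000 n \n",
);
print "Index: @$index\n";
print "Entry: @$_\n" for @$entries;
```

In table mode the collected lines would simply be printed after the xref keyword; in stream mode the index pairs and the entries are used separately, as above.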

+ One (minor) digression: the documentation could be clearer that an instance becomes unusable after calling "saveas", to prevent someone from writing scripts like the commented-out fragment below.
Code: [Select]
use warnings;
use strict;
use feature 'say';

use PDF::API2;

my $pdf = PDF::API2-> open( "test.pdf" );
$pdf-> page;
$pdf-> page;
$pdf-> page;

$pdf-> saveas( "test-mod.pdf" );

# $pdf-> page;
# $pdf-> page;
# $pdf-> saveas( "test-mod++.pdf" );

__END__
Code: [Select]
--- PDF\API2\Basic\PDF\File.old Fri Jul 7 04:53:59 2017
+++ PDF\API2\Basic\PDF\File.pm Wed Apr 3 04:01:26 2019
@@ -522,6 +522,7 @@
         if (defined $result->{'Type'} and defined $types{$result->{'Type'}->val}) {
             bless $result, $types{$result->{'Type'}->val};
+            $result-> {' outto'} = [ $self ];
         }
         # gdj: FIXME: if any of the ws chars were crs, then the whole
         # string might not have been read.
@@ -540,7 +541,7 @@
         }
         $result->{' parent'} = $self;
         weaken $result->{' parent'};
-        $result->{' realised'} = 0;
+#??        $result->{' realised'} = 0;
         # gdj: FIXME: if any of the ws chars were crs, then the whole
         # string might not have been read.
     }
@@ -1282,7 +1283,7 @@
     $tdict->{'Size'} = PDFNum($self->{' maxobj'});

     my $tloc = $fh->tell();
-    $fh->print("xref\n");
+    my @out;
     my @xreflist = sort { $self->{' objects'}{$a->uid}[0] <=> $self->{' objects'}{$b->uid}[0] } (@{$self->{' printed'} || []}, @{$self->{' free'} || []});

@@ -1314,25 +1315,25 @@
 #            $fh->printf("0 1\n%010d 65535 f \n", $ff);
 #        }
         if ($i > $#xreflist || $self->{' objects'}{$xreflist[$i]->uid}[0] != $j + 1) {
-            $fh->print(($first == -1 ? "0 " : "$self->{' objects'}{$xreflist[$first]->uid}[0] ") . ($i - $first) . "\n");
+            push @out, ($first == -1 ? "0 " : "$self->{' objects'}{$xreflist[$first]->uid}[0] ") . ($i - $first) . "\n";
             if ($first == -1) {
-                $fh->printf("%010d 65535 f \n", defined $freelist[$k] ? $self->{' objects'}{$freelist[$k]->uid}[0] : 0);
+                push @out, sprintf("%010d 65535 f \n", defined $freelist[$k] ? $self->{' objects'}{$freelist[$k]->uid}[0] : 0);
                 $first = 0;
             }
             for ($j = $first; $j < $i; $j++) {
                 my $xref = $xreflist[$j];
                 if (defined $freelist[$k] && defined $xref && "$freelist[$k]" eq "$xref") {
                     $k++;
-                    $fh->print(pack("A10AA5A4",
+                    push @out, pack("A10AA5A4",
                                     sprintf("%010d", (defined $freelist[$k] ?
                                                       $self->{' objects'}{$freelist[$k]->uid}[0] : 0)), " ",
                                     sprintf("%05d", $self->{' objects'}{$xref->uid}[1] + 1),
-                                    " f \n"));
+                                    " f \n");
                 }
                 else {
-                    $fh->print(pack("A10AA5A4", sprintf("%010d", $self->{' locs'}{$xref->uid}), " ",
+                    push @out, pack("A10AA5A4", sprintf("%010d", $self->{' locs'}{$xref->uid}), " ",
                                     sprintf("%05d", $self->{' objects'}{$xref->uid}[1]),
-                                    " n \n"));
+                                    " n \n");
                 }
             }
             $first = $i;
@@ -1342,9 +1343,48 @@
             $j++;
         }
     }
-    $fh->print("trailer\n");
-    $tdict->outobjdeep($fh, $self);
-    $fh->print("\nstartxref\n$tloc\n%%EOF\n");
+    if ( exists $tdict-> { Type } and $tdict-> { Type }-> val eq 'XRef' ) {
+
+        my ( @index, @stream );
+        my $len = 2;                                # 2 or 4 will do
+        for ( @out ) {
+            $_ = [ split ];
+            die if $_-> [ 0 ] >= 0xFFFFFFFF;       # extremely unlikely, but better (any?) message would help
+            $len = 4 if $_-> [ 0 ] >= 0xFFFF;
+            @$_ == 2 ? push @index, @$_ : push @stream, $_
+        }
+        my $c = $len == 2 ? 'n' : 'N';
+        my $stream = join '', map {
+            pack "C${c}C", $_-> [ 2 ] eq 'n' ? 1 : 0, @{ $_ }[ 0 .. 1 ]
+        } @stream;
+
+        $i = $self->{ ' maxobj' } ++;
+        $self-> add_obj( $tdict, $i, 0 );
+        $self-> out_obj( $tdict );
+
+        push @index, $i, 1;
+        $stream .= pack "C${c}C", 1, $tloc, 0;
+
+        $tdict-> { Size } = PDFNum( ++ $i );
+        $tdict-> { Index } = PDFArray( map PDFNum( $_ ), @index );
+        $tdict-> { W } = PDFArray( map PDFNum( $_ ), 1, $len, 1 );
+        $tdict-> { Filter } = PDFName( 'FlateDecode' );
+
+        delete $tdict-> { DecodeParms };    # For such streams, prediction improves compression hugely,
+                                            # but "outfilt" just can't do it, alas.
+
+        $stream = PDF::API2::Basic::PDF::Filter::FlateDecode-> new-> outfilt( $stream, 1 );
+        $tdict-> { ' stream' } = $stream;
+        $tdict-> { ' nofilt' } = 1;
+        delete $tdict-> { Length };
+        $self-> ship_out;
+    }
+    else {
+        $fh->print("xref\n", @out, "trailer\n");
+        $tdict->outobjdeep($fh, $self);
+        $fh->print("\n");
+    }
+    $fh->print("startxref\n$tloc\n%%EOF\n");
 }

*

Wed Apr 03 12:53:41 2019 futuramedium [...] yandex.ru - Correspondence added

I should have chosen the offset length (2 or 4 bytes) based on $tloc only. Fixed. I also added filtering (prediction) to the XRef stream. The raw (uncompressed) stream length will grow by up to 25% (as with the file being tested) because of the byte prepended to each "row", but for any substantial change to a PDF file the compression ratio will improve significantly. E.g., if the example script appends 6 pages instead of 3, the compressed stream length already becomes 42 vs. 44 bytes for filtered/unfiltered data.
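The row filtering mentioned here is the PNG "Up" predictor (Predictor 12 in the patch, with Columns set to the row width). A standalone sketch of the encoding and its inverse follows; the function names are hypothetical and the deflate step that would follow is left out.

```perl
use strict;
use warnings;

# Encode fixed-width rows with the PNG "Up" filter: each output row is a
# tag byte (2) followed by the byte-wise difference from the previous row.
sub png_up_encode {
    my ($columns, @rows) = @_;
    my @prev = (0) x $columns;
    my $out  = '';
    for my $row (@rows) {
        my @cur = unpack 'C*', $row;
        $out .= pack 'C*', 2, map { ($cur[$_] - $prev[$_]) % 256 } 0 .. $#cur;
        @prev = @cur;
    }
    return $out;
}

# Reverse the filter: add each row's differences back onto the previous row.
sub png_up_decode {
    my ($columns, $data) = @_;
    my @prev = (0) x $columns;
    my @rows;
    for my $chunk (unpack '(a' . ($columns + 1) . ')*', $data) {
        my ($tag, @delta) = unpack 'C*', $chunk;
        die "unsupported filter type $tag" unless $tag == 2;
        my @cur = map { ($delta[$_] + $prev[$_]) % 256 } 0 .. $#delta;
        push @rows, pack 'C*', @cur;
        @prev = @cur;
    }
    return @rows;
}

# Two 4-byte rows (W = [ 1 2 1 ]) grow from 8 to 10 bytes, i.e. by 25%.
my @rows    = (pack('CnC', 0, 0, 0), pack('CnC', 1, 27173, 0));
my $encoded = png_up_encode(4, @rows);
printf "raw %d bytes, filtered %d bytes\n", 4 * @rows, length $encoded;
```

With only two rows the extra tag bytes cost more than they save; the benefit appears when many consecutive rows differ only slightly, so that the differences compress well.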

One concern may be that gennum is limited to 1 byte, but in reality generation numbers haven't been used (nor object numbers reused) for a long time. In the test file, and in all "modern" files (with an XRef stream) I've seen, the 1st XRef Table entry is "0 0 f". IIRC PDF 2.0 says gennum is always 0.
Code: [Select]
--- PDF\API2\Basic\PDF\File.old Fri Jul 7 04:53:59 2017
+++ PDF\API2\Basic\PDF\File.pm Wed Apr 3 19:27:37 2019
@@ -522,6 +522,7 @@
         if (defined $result->{'Type'} and defined $types{$result->{'Type'}->val}) {
             bless $result, $types{$result->{'Type'}->val};
+            $result-> {' outto'} = [ $self ];
         }
         # gdj: FIXME: if any of the ws chars were crs, then the whole
         # string might not have been read.
@@ -540,7 +541,7 @@
         }
         $result->{' parent'} = $self;
         weaken $result->{' parent'};
-        $result->{' realised'} = 0;
+#??        $result->{' realised'} = 0;
         # gdj: FIXME: if any of the ws chars were crs, then the whole
         # string might not have been read.
     }
@@ -1282,7 +1283,7 @@
     $tdict->{'Size'} = PDFNum($self->{' maxobj'});

     my $tloc = $fh->tell();
-    $fh->print("xref\n");
+    my @out;

     my @xreflist = sort { $self->{' objects'}{$a->uid}[0] <=> $self->{' objects'}{$b->uid}[0] } (@{$self->{' printed'} || []}, @{$self->{' free'} || []});

@@ -1314,25 +1315,25 @@
 #            $fh->printf("0 1\n%010d 65535 f \n", $ff);
 #        }
         if ($i > $#xreflist || $self->{' objects'}{$xreflist[$i]->uid}[0] != $j + 1) {
-            $fh->print(($first == -1 ? "0 " : "$self->{' objects'}{$xreflist[$first]->uid}[0] ") . ($i - $first) . "\n");
+            push @out, ($first == -1 ? "0 " : "$self->{' objects'}{$xreflist[$first]->uid}[0] ") . ($i - $first) . "\n";
             if ($first == -1) {
-                $fh->printf("%010d 65535 f \n", defined $freelist[$k] ? $self->{' objects'}{$freelist[$k]->uid}[0] : 0);
+                push @out, sprintf("%010d 65535 f \n", defined $freelist[$k] ? $self->{' objects'}{$freelist[$k]->uid}[0] : 0);
                 $first = 0;
             }
             for ($j = $first; $j < $i; $j++) {
                 my $xref = $xreflist[$j];
                 if (defined $freelist[$k] && defined $xref && "$freelist[$k]" eq "$xref") {
                     $k++;
-                    $fh->print(pack("A10AA5A4",
+                    push @out, pack("A10AA5A4",
                                     sprintf("%010d", (defined $freelist[$k] ?
                                                       $self->{' objects'}{$freelist[$k]->uid}[0] : 0)), " ",
                                     sprintf("%05d", $self->{' objects'}{$xref->uid}[1] + 1),
-                                    " f \n"));
+                                    " f \n");
                 }
                 else {
-                    $fh->print(pack("A10AA5A4", sprintf("%010d", $self->{' locs'}{$xref->uid}), " ",
+                    push @out, pack("A10AA5A4", sprintf("%010d", $self->{' locs'}{$xref->uid}), " ",
                                     sprintf("%05d", $self->{' objects'}{$xref->uid}[1]),
-                                    " n \n"));
+                                    " n \n");
                 }
             }
             $first = $i;
@@ -1342,9 +1343,48 @@
             $j++;
         }
     }
-    $fh->print("trailer\n");
-    $tdict->outobjdeep($fh, $self);
-    $fh->print("\nstartxref\n$tloc\n%%EOF\n");
+    if ( exists $tdict-> { Type } and $tdict-> { Type }-> val eq 'XRef' ) {
+
+        my ( @index, @stream );
+        for ( @out ) {
+            my @a = split;
+            @a == 2 ? push @index, @a : push @stream, \@a
+        }
+        $i = $self->{ ' maxobj' } ++;
+        $self-> add_obj( $tdict, $i, 0 ); 
+        $self-> out_obj( $tdict );
+
+        push @index, $i, 1;
+        push @stream, [ $i, 0, 'n' ];
+
+        $i = $self->{ ' maxobj' } ++;
+        $self-> add_obj( $tdict, $i, 0 );
+        $self-> out_obj( $tdict );
+
+        my $len = $tloc > 0xFFFF ? 4 : 2;           # don't expect files > 4 Gb
+        my $tpl = $tloc > 0xFFFF ? 'CNC' : 'CnC';   # don't expect gennum > 255, it's absurd.
+                                                    # Adobe doesn't use them anymore anyway
+        my $stream = '';
+        my @prev = ( 0 ) x ( $len + 2 );
+        for ( @stream ) {
+            my @line = unpack 'C*', pack $tpl, $_-> [ 2 ] eq 'n' ? 1 : 0, @{ $_ }[ 0 .. 1 ];
+
+            $stream .= pack 'C*', 2,                # prepend filtering method, "PNG Up"
+                map {( $line[ $_ ] - $prev[ $_ ] + 256 ) % 256 } 0 .. $#line;
+            @prev    = @line;
+        }
+        $tdict-> { Size } = PDFNum( $i + 1 );
+        $tdict-> { Index } = PDFArray( map PDFNum( $_ ), @index );
+        $tdict-> { W } = PDFArray( map PDFNum( $_ ), 1, $len, 1 ); 
+        $tdict-> { Filter } = PDFName( 'FlateDecode' );
+
+        $tdict-> { DecodeParms } = PDFDict;
+        $tdict-> { DecodeParms }-> val-> { Predictor } = PDFNum( 12 );
+        $tdict-> { DecodeParms }-> val-> { Columns } = PDFNum( $len + 2 );
+
+        $stream = PDF::API2::Basic::PDF::Filter::FlateDecode-> new-> outfilt( $stream, 1 );
+        $tdict-> { ' stream' } = $stream;
+        $tdict-> { ' nofilt' } = 1;
+        delete $tdict-> { Length };
+        $self-> ship_out;
+    }
+    else {
+        $fh->print("xref\n", @out, "trailer\n");
+        $tdict->outobjdeep($fh, $self);
+        $fh->print("\n");
+    }
+    $fh->print("startxref\n$tloc\n%%EOF\n");
 }

*

Wow! That's quite a bit of work you've put in -- thank you. It's complicated enough that I want to go over it very carefully (and of course, test it thoroughly) before putting it in PDF::Builder. I can't even yet ask any questions about it! I hope to get it in for release 3.014, unless there are complications, in which case it may slide to 3.015 this summer.

*

Mon Apr 08 17:59:57 2019 futuramedium [...] yandex.ru - Correspondence added

I found some minor issues: though harmless, they'd better be fixed. I hope this is the final version; sorry for the mess.
Code: [Select]
--- PDF\API2\Basic\PDF\File.old Fri Jul 7 04:53:59 2017
+++ PDF\API2\Basic\PDF\File.pm Tue Apr 9 00:46:42 2019
@@ -522,6 +522,8 @@

         if (defined $result->{'Type'} and defined $types{$result->{'Type'}->val}) {
             bless $result, $types{$result->{'Type'}->val};
+            $result-> {' outto'} = [ $self ];
+            weaken $_ for @{$result->{' outto'}};
         }
         # gdj: FIXME: if any of the ws chars were crs, then the whole
         # string might not have been read.
@@ -540,7 +542,7 @@
         }
         $result->{' parent'} = $self;
         weaken $result->{' parent'};
-        $result->{' realised'} = 0;
+#??     $result->{' realised'} = 0;
         # gdj: FIXME: if any of the ws chars were crs, then the whole
         # string might not have been read.
     }
@@ -1282,7 +1284,7 @@
     $tdict->{'Size'} = PDFNum($self->{' maxobj'});

     my $tloc = $fh->tell();
-    $fh->print("xref\n");
+    my @out;
     my @xreflist = sort { $self->{' objects'}{$a->uid}[0] <=> $self->{' objects'}{$b->uid}[0] } (@{$self->{' printed'} || []}, @{$self->{' free'} || []});

@@ -1314,25 +1316,25 @@
 #            $fh->printf("0 1\n%010d 65535 f \n", $ff);
 #        }
         if ($i > $#xreflist || $self->{' objects'}{$xreflist[$i]->uid}[0] != $j + 1) {
-            $fh->print(($first == -1 ? "0 " : "$self->{' objects'}{$xreflist[$first]->uid}[0] ") . ($i - $first) . "\n");
+            push @out, ($first == -1 ? "0 " : "$self->{' objects'}{$xreflist[$first]->uid}[0] ") . ($i - $first) . "\n";
             if ($first == -1) {
-                $fh->printf("%010d 65535 f \n", defined $freelist[$k] ? $self->{' objects'}{$freelist[$k]->uid}[0] : 0);
+                push @out, sprintf("%010d 65535 f \n", defined $freelist[$k] ? $self->{' objects'}{$freelist[$k]->uid}[0] : 0);
                 $first = 0;
             }
             for ($j = $first; $j < $i; $j++) {
                 my $xref = $xreflist[$j];
                 if (defined $freelist[$k] && defined $xref && "$freelist[$k]" eq "$xref") {
                     $k++;
-                    $fh->print(pack("A10AA5A4",
+                    push @out, pack("A10AA5A4",
                                     sprintf("%010d", (defined $freelist[$k] ?
                                                       $self->{' objects'}{$freelist[$k]->uid}[0] : 0)), " ",
                                     sprintf("%05d", $self->{' objects'}{$xref->uid}[1] + 1),
-                                    " f \n"));
+                                    " f \n");
                 }
                 else {
-                    $fh->print(pack("A10AA5A4", sprintf("%010d", $self->{' locs'}{$xref->uid}), " ",
+                    push @out, pack("A10AA5A4", sprintf("%010d", $self->{' locs'}{$xref->uid}), " ",
                             sprintf("%05d", $self->{' objects'}{$xref->uid}[1]),
-                            " n \n"));
+                            " n \n");
                 }
             }
             $first = $i;
@@ -1342,9 +1344,53 @@
             $j++;
         }
     }
-    $fh->print("trailer\n");
-    $tdict->outobjdeep($fh, $self);
-    $fh->print("\nstartxref\n$tloc\n%%EOF\n");
+    if ( exists $tdict-> { Type } and $tdict-> { Type }-> val eq 'XRef' ) {
+
+        my ( @index, @stream );
+        for ( @out ) {
+            my @a = split;
+            @a == 2 ? push @index, @a : push @stream, \@a
+        }
+        my $i = $self->{ ' maxobj' } ++;
+        $self-> add_obj( $tdict, $i, 0 );
+        $self-> out_obj( $tdict );
+
+        push @index, $i, 1;
+        push @stream, [ $tloc, 0, 'n' ];
+
+        my $len = $tloc > 0xFFFF ? 4 : 2;           # don't expect files > 4 Gb
+        my $tpl = $tloc > 0xFFFF ? 'CNC' : 'CnC';   # don't expect gennum > 255, it's absurd.
+                                                    # Adobe doesn't use them anymore anyway
+        my $stream = '';
+        my @prev = ( 0 ) x ( $len + 2 );
+        for ( @stream ) {
+            my @line = unpack 'C*', pack $tpl, $_-> [ 2 ] eq 'n' ? 1 : 0, @{ $_ }[ 0 .. 1 ];
+
+            $stream .= pack 'C*', 2, # prepend filtering method, "PNG Up"
+                map {( $line[ $_ ] - $prev[ $_ ] + 256 ) % 256 } 0 .. $#line;
+            @prev    = @line;
+        }
+        $tdict-> { Size } = PDFNum( $i + 1 );
+        $tdict-> { Index } = PDFArray( map PDFNum( $_ ), @index );
+        $tdict-> { W } = PDFArray( map PDFNum( $_ ), 1, $len, 1 );
+        $tdict-> { Filter } = PDFName( 'FlateDecode' );
+
+        $tdict-> { DecodeParms } = PDFDict;
+        $tdict-> { DecodeParms }-> val-> { Predictor } = PDFNum( 12 );
+        $tdict-> { DecodeParms }-> val-> { Columns } = PDFNum( $len + 2 );
+
+        $stream = PDF::API2::Basic::PDF::Filter::FlateDecode-> new-> outfilt( $stream, 1 );
+        $tdict-> { ' stream' } = $stream;
+        $tdict-> { ' nofilt' } = 1;
+        delete $tdict-> { Length };
+        $self-> ship_out;
+    }
+    else {
+        $fh->print("xref\n", @out, "trailer\n");
+        $tdict->outobjdeep($fh, $self);
+        $fh->print("\n");
+    }
+    $fh->print("startxref\n$tloc\n%%EOF\n");
 }

*

Tue Apr 23 22:50:02 2019 PMPERRY@cpan.org - Correspondence added

Hi Vadim,

It looks like it's almost there. I did encounter one error message while running your test.pl code: Character in 'C' format wrapped in pack at .../File.pm line 1507. That line is after for (@stream) { :
Code: [Select]
my @line = unpack 'C*', pack $tpl, $_->[2] eq 'n'...

Any ideas? I'm running PDF::Builder on Perl 5.26, if you tested at an earlier version. I applied only the changes in your last posting of the diffs to File.pm (they appeared to be cumulative). The old code produced an unloadable corrupted PDF, but the new File.pm code produced a working PDF. It's the original PDF 1.5 input with some new stuff, including a cross reference stream, added after the %%EOF, if that's the correct result. I note that some objects have the same number as those found earlier in the input -- I take it they override (replace) the earlier object of the same number?

Should this cross reference stream output be seen ONLY if the file read in was PDF 1.5 or higher, with a cross reference stream? That is, nothing at PDF 1.4 or lower in PDF::Builder should cause a cross reference stream to be output? If something can cause it, I will need to add a line of code to force a minimum of PDF 1.5 output level.

Finally, I looked at your complaint about 'saveas()' not permitting further updates. Indeed, after saveas(), $pdf is still defined and is still a hash, but $pdf->page() blows up (can't call method new_obj on an undefined value). Could you look at RT 81530 and see if it sounds related? Possibly $pdf should be marked as unusable, or even be undefined, once save(), saveas(), or stringify() is called? At the least, I can expand the documentation to warn about this.
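One reading of "marked as unusable" is a flag that every public method checks. The sketch below uses a stand-in class, Demo::Doc, which is purely hypothetical and does not reflect how PDF::API2 or PDF::Builder are implemented.

```perl
use strict;
use warnings;

package Demo::Doc {    # hypothetical stand-in, not a real PDF class
    use Carp qw(croak);

    sub new { return bless { pages => 0, finished => 0 }, shift }

    # Fail with a clear message instead of an obscure internal error.
    sub _check_open {
        my ($self, $method) = @_;
        croak "$method() called after the document was saved"
            if $self->{finished};
        return;
    }

    sub page {
        my ($self) = @_;
        $self->_check_open('page');
        return ++$self->{pages};
    }

    sub saveas {
        my ($self, $file) = @_;
        $self->_check_open('saveas');
        $self->{finished} = 1;    # a real writer would serialize here
        return $file;
    }
}

my $doc = Demo::Doc->new;
$doc->page for 1 .. 3;
$doc->saveas('test-mod.pdf');
print "second page() call fails: $@" unless eval { $doc->page; 1 };
```

The message produced this way names the actual mistake, which is arguably more helpful than "can't call method new_obj on an undefined value".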

*

I think I may have solved the first problem ('wrapped' warning), but I'd like your opinion on it. After the line
Code: [Select]
       for (@stream) {
and before
Code: [Select]
           my @line = unpack 'C*', pack $tpl, $_->[ 2 ] eq 'n'? 1 : 0, @{ $_ }[ 0 .. 1 ];
I added
Code: [Select]
           $_->[1] &= 0x00FF;
to ensure that the value was in the range 0..255. Apparently, packing with C will do the same thing, but now issues a "wrapped" warning. Anyway, it seems to work. Was this value, which was 65535 in @stream, the one you referred to as "don't expect gennum > 255, it's absurd." or was that the other value?

$tloc was 27173, $len was 2, $tpl was 'CnC'. The first @stream was 0000000000 65535 f (I assume the 0's collapse to integer 0) and the result was 0 0 0 255 (xFFFF trimmed to xFF). The second @stream was 27173 0 n (x6A25) which gave a result of 1 106 37 0 (x6A x25). It's the same result as the original code, without the nasty warning.
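For anyone who wants to reproduce those numbers outside of File.pm, here is a minimal standalone sketch. It assumes the @stream rows are [offset, gennum, type] triples, as the pack line implies; the surrounding scaffolding (@rows, @warnings) is made up for the demonstration and is not part of the library.

```perl
use strict;
use warnings;

# Rows are [offset, gennum, type], mirroring the two @stream entries above.
my @stream = (
    [ 0,     65535, 'f' ],    # "0000000000 65535 f"
    [ 27173, 0,     'n' ],    # "27173 0 n"
);
my $tpl = 'CnC';    # type: 1 byte, offset: 2 bytes big-endian, gennum: 1 byte

# Show that packing the unmasked gennum with 'C' raises the "wrapped" warning.
my @warnings;
{
    local $SIG{__WARN__} = sub { push @warnings, $_[0] };
    my $gen    = 65535;
    my $unused = pack 'C', $gen;
}

my @rows;
for (@stream) {
    $_->[1] &= 0x00FF;    # keep only the low byte of the gennum
    my @line = unpack 'C*', pack $tpl, $_->[2] eq 'n' ? 1 : 0, @{$_}[ 0 .. 1 ];
    push @rows, "@line";
}

print "$_\n" for @rows;
print "warnings without the mask: ", scalar(@warnings), "\n";
```

The two rows printed should match the byte sequences quoted above (0 0 0 255 and 1 106 37 0), and the masked loop itself runs without the warning.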

By the way, I was concerned about both $stream and @stream being used together, so I renamed $stream to $sstream to eliminate any possibility of one being used for the other.

Wed Apr 24 18:25:51 2019 futuramedium [...] yandex.ru - Correspondence added

My test script didn't have a shebang line with a "-w", that's why I didn't see this warning! As you already found, the issue is with generation number 65535 of the 0th object, i.e. what would be the "0000000000 65535 f" line in a classic table. I'd put the following at the exact location you suggested:
Code: [Select]
$_-> [ 1 ] = 0 if $_-> [ 2 ] eq 'f' and
                  $_-> [ 1 ] == 65535;
Examples in the Reference show that gennum 65535 can be used to mark objects as "not to be re-used", i.e. in theory, objects other than the 0th can have it, too. In practice, apart from 65535 for object 0, my "absurd" comment meant that probably no PDF file has ever had such a long and twisted history of incremental updates that the gennum of any object is more than a dozen at most!

Further, as implementation note 16 in the Reference says, "Acrobat 6.0 and later do not use the free list to recycle object numbers; new objects are assigned new numbers." That's in accordance with my observation that the 0th object (1st entry) in an XRef stream has a gennum of 0 in the PDF files I've seen.
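To make the difference between the two proposed fixes concrete, here is a small side-by-side sketch. The helper names mask_gennum and normalize_free are hypothetical and exist only for this comparison; neither appears in File.pm.

```perl
use strict;
use warnings;

# Hypothetical helper: the "clear the high bits" fix.
sub mask_gennum {
    my ( $gen, $type ) = @_;
    return $gen & 0x00FF;
}

# Hypothetical helper: the targeted fix, touching only a free entry with 65535.
sub normalize_free {
    my ( $gen, $type ) = @_;
    return ( $type eq 'f' && $gen == 65535 ) ? 0 : $gen;
}

# [gennum, type] pairs: the usual object 0 entry, a normal entry, an oversized one.
my @cases = ( [ 65535, 'f' ], [ 0, 'n' ], [ 300, 'n' ] );
for my $c (@cases) {
    printf "gennum %5d (%s): mask -> %3d, targeted -> %d\n",
        @$c, mask_gennum(@$c), normalize_free(@$c);
}
```

The mask always yields a value that fits in one byte but silently changes 300 into 44, while the targeted version leaves 300 alone, so a later pack with 'C' would still warn for it.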

Quote
original PDF 1.5 input with some new stuff, including a cross reference stream, added after the %%EOF, if that's the correct result.
That was the whole point -- to allow incremental updates for files that use an XRef stream, so that Acrobat is OK with the result. As I said earlier elsewhere, other viewers don't mind if an update with a classic XRef table is appended to a file with an XRef stream.

Quote
I note that some objects have the same number as those found earlier in the input -- I take it they override (replace) the earlier object of the same number?
That's correct, that's how the incremental update mechanism is described in the Reference.
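As a rough illustration of the override rule (a toy model, not how the library parses a file), the sections of an incrementally updated PDF can be thought of as lists of numbered objects applied oldest first. The object descriptions below are invented for the example.

```perl
use strict;
use warnings;

# Each section is a list of [object number, description], oldest section first.
my @sections = (
    [ [ 1, 'catalog v1' ], [ 2, 'pages v1' ], [ 3, 'info v1' ] ],    # original body
    [ [ 3, 'info v2' ], [ 4, 'annotation' ] ],                       # appended update
);

my %current;
for my $section (@sections) {
    # A later definition of the same object number replaces the earlier one.
    $current{ $_->[0] } = $_->[1] for @$section;
}

print "$_: $current{$_}\n" for sort { $a <=> $b } keys %current;
```

Object 3 ends up with the definition from the appended section, while objects 1 and 2 keep their original definitions.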

Quote
Should this cross reference stream output be seen ONLY if the file read in was PDF 1.5 or higher, with a cross reference stream?
Correct -- in my patch, the presence of a "Type" entry in $tdict and its value being "XRef" are checked. Based on that, either classic or stream XRef information is written. I believe it's robust enough (maybe some other check would be better? A flag set when the file was read?). I think we should only expect well-formed PDFs, which have the correct version if they use an XRef stream. But it's possible that a file's version would be 1.5 or above and yet, for whatever reason, it has a classic table. Then my patch will append a classic table.
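The decision described here can be sketched as follows. This is only a model: $tdict is a plain hash with string values, and xref_style is a hypothetical name. In the library the trailer dictionary holds PDF objects, so the actual check in the patch will look different.

```perl
use strict;
use warnings;

# Hypothetical helper: pick the xref form from a simplified trailer dictionary.
sub xref_style {
    my ($tdict) = @_;
    return ( defined $tdict->{Type} && $tdict->{Type} eq 'XRef' )
        ? 'stream'
        : 'table';
}

# An input that used a cross reference stream, and one that used a classic table.
print xref_style( { Type => 'XRef', Size => 12 } ), "\n";
print xref_style( { Size => 12 } ), "\n";
```

Note that the PDF version never enters the decision, which is why a 1.5+ file that happens to use a classic table gets a classic table appended.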

Quote
Could you look at RT 81530 and see if it sounds related?
Ah, so it's an old issue, and better documentation is on its way :)

Wed Apr 24 20:36:38 2019 PMPERRY@cpan.org - Correspondence added

On Wed Apr 24 18:25:51 2019, vadimr wrote:
Quote
My test script didn't have a shebang line with a "-w", that's why I didn't see this warning!
That'll learn ya! :) I make it a habit to start each .pl with use strict; and use warnings;.

Quote
Code: [Select]
$_-> [ 1 ] = 0 if $_-> [ 2 ] eq 'f' and
                  $_-> [ 1 ] == 65535;
Examples in the Reference show that gennum 65535 can be used to mark objects as "not to be re-used", i.e. in theory, objects other than the 0th can have it, too. In practice, apart from 65535 for object 0, my "absurd" comment meant that probably no PDF file has ever had such a long and twisted history of incremental updates that the gennum of any object is more than a dozen at most!

Further, as implementation note 16 in the Reference says, "Acrobat 6.0 and later do not use the free list to recycle object numbers; new objects are assigned new numbers." That's in accordance with my observation that the 0th object (1st entry) in an XRef stream has a gennum of 0 in the PDF files I've seen.

OK, you explain it's OK to 0 out this particular 16-bit value (xFFFF), as no Reader that handles cross reference streams pays attention to the value anyway. Is that the only place that a value greater than xFF is ever going to show up? If it's documented that Readers don't care what the gennum value is in this case, that's fine, but I'm leery of "observations" that it always works that way -- there might be oddball Readers out there that do care about this value.

I could imagine someone constantly updating a PDF file for some reason, perhaps "saving" instead of "quitting" a Reader. I saw this back in the early days of JPEG file usage -- a co-worker was complaining that his JPEG images were slowly rotting away. I had to explain to him that he was saving the image each time he wanted to quit the viewer, so the image was losing more high frequency data each time! Anyway, it's not impossible that the gennum could end up > 255 in some strange situations.

Since $_->[1] is going to be packed with 'C', I think it would be a good idea to stay with my fix of clearing high bits to ensure that it's in the 0..255 range. If someone does get a gennum > 255, cycling back to 0 might cause problems, but that's life. At least they won't get a "wrapped" warning. If a cross reference stream is always going to handle it as a single byte, it can't be allowed to exceed 255, whatever its purpose. Should we consider treating it as 16 bits, if the standard permits?
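For comparison, here is a sketch of what a 16-bit gennum field would look like. The cross reference stream's W array declares the byte width of each field, so a two-byte third field is allowed by the format. The 'Cnn' template and the matching W of [1 2 2] are assumptions for this example; the patch as discussed uses 'CnC'.

```perl
use strict;
use warnings;

# [offset, gennum, type] for the classic "0000000000 65535 f" entry.
my @entry = ( 0, 65535, 'f' );

# Record any warnings so we can confirm that none are raised.
my @warnings;
local $SIG{__WARN__} = sub { push @warnings, $_[0] };

# 'Cnn': type in 1 byte, offset in 2 bytes, gennum in 2 bytes (W would be [1 2 2]).
my @line = unpack 'C*', pack 'Cnn', $entry[2] eq 'n' ? 1 : 0, @entry[ 0, 1 ];

print "@line\n";
print "warnings: ", scalar(@warnings), "\n";
```

With two bytes for the gennum, 65535 is stored intact as 255 255 and nothing wraps, at the cost of one extra byte per entry.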

Quote
Quote
original PDF 1.5 input with some new stuff, including a cross reference stream, added after the %%EOF, if that's the correct result.
That was the whole point
Let me rephrase my question, then -- is the correct result to output the original, unchanged (almost) PDF, and then tack on these new and replacement objects, and maybe a cross reference stream, after the original %%EOF? You seem to have said "yes".

Quote
Quote
Could you look at RT 81530 and see if it sounds related?
Ah, so it's an old issue, and better documentation is on its way :)
I promise, I will put something in, at least in the POD for save, saveas, and stringify!

Unless you have any further updates or strong objections to something, I think I will put out PDF::Builder 3.014 this weekend, with this ticket closed.