RT 117210 - Opening damaged files (was Error: "can't call method "realise" on a)

*

Offline Phil

  • Global Moderator
  • Hero Member
  • *****
  • 601
    • View Profile
Thu Aug 25 08:41:45 2016 regtapy [...] yandex.ru - Ticket created
Subject:    Error: "can't call method "realise" on an undefined value" while open a pdf-file

Hello,

system info:
PDF::API2 VERSION 2.028
perl v5.10.1
FreeBSD 8.3-RELEASE-p3

I got the error "Can't call method "realise" on an undefined value at /usr/local/lib/perl5/site_perl/5.10.1/PDF/API2.pm line 199." when I tried to open the attached file (dr-hilton.pdf).

The minimum code is:
Code: [Select]
#!/usr/bin/perl

use strict;
use warnings;

use PDF::API2;

my $file = 'dr-hilton.pdf';

# Trap the exception thrown by open() so it can be reported.
eval { my $src_pdf = PDF::API2->open( $file ) };
if ( $@ ) {
    warn "Error: $@";
}

1;
Any ideas what's wrong with that?
Thanks

#
Fri Oct 07 00:10:33 2016 steve [...] deefs.net - Correspondence added
It looks like the file is corrupt.  At the end of the file (byte 187481), there's an "xref" line followed by "1 7".  The 1 indicates that the first object number in the cross-reference table will be 1, and the 7 indicates that there are 7 entries (the next seven lines).  However, the seven entries that follow actually start with object 0 rather than object 1.

The "1 7" is invalid -- according to the spec, if there's only one cross-reference table (and this file only has one), it has to start with 0.  PDF::API2 assumes that the 1 is intentional and doesn't return an error.  Since Adobe Reader opens it without complaining, it looks like it assumes the 1 is a mistake and also doesn't return an error.

If you change the "1 7" to "0 7", the file will open properly in both PDF::API2 and Adobe Reader.

#
Fri Oct 07 00:10:33 2016 The RT System itself - Status changed from 'new' to 'open'
#
Fri Oct 07 00:10:38 2016 steve [...] deefs.net - Status changed from 'open' to 'rejected'
« Last Edit: April 15, 2017, 03:39:58 PM by Phil »

*

Offline Phil

This (rejected) bug report brings up an interesting question. Can (and should) PDF::API2 tolerate minor errors in the files it reads, and fix them? If so, it should warn the user that something was wrong with the input PDF file and that PDF::API2 was able to repair it during the read (the original file should obviously not be touched).

Such a capability should probably wait until the issue of handling different PDF versions (both input and output) is adequately sorted out.

*

Offline Phil

See also RT 106020 — a proposal for doing some sort of validation on PDF files being read in. In that case, it may involve some fixup of a somewhat out-of-spec PDF file, as many readers apparently do. Validation may want to wait for implementation of a PDF version number setting.

*

Offline Phil

Pull request on PDF::API2

 gvsyn commented 8 days ago

When trying to open PDFs from Sharp scanners, open fails reporting invalid PDF version. The header is as follows:

%PDF-1.4 Sharp Scanned ImagePDF
%Sharp Non-Encryption
3 0 obj

From the PDF spec there are no restrictions stating that after the minor version there should be a newline or similar. Adjusted the code to not care what there is after the 1.x

Suggested code fix in lib/PDF/API2/Basic/PDF/File.pm:
Code: [Select]
@@ -241,7 +241,7 @@ sub open {
     binmode $fh, ':raw';
     $fh->seek(0, 0);            # go to start of file
     $fh->read($buffer, 255);
-    unless ($buffer =~ m/^\%PDF\-1\.(\d)+\s*$cr/mo) {
+    unless ($buffer =~ m/^\%PDF\-1\.(\d)+.*$cr/mo) {
         die "$filename not a PDF file version 1.x";
     }
     $self->{' version'} = $1;

*****Implementation: consider grabbing any comment after PDF-x.y and making a new comment line after it
*****  keeping in mind version min/max for read-in version number
« Last Edit: July 07, 2018, 07:30:26 PM by Phil »

*

Offline Phil

Per Gareth Wilkin's (gvsyn) pull request on PDF::API2, it certainly appears harmless to allow arbitrary text after the PDF version number, since a number of PDF readers apparently tolerate it (treating it as a comment). If this PDF were to be written back out, we might want to consider turning the extra material into a proper comment. If it were moved to the next line as an added comment, we would have to be careful not to break any offsets in the cross-reference table! To avoid that, changing the first \s after the version number to a % might work (or, if there are multiple \s's, a later one), since that leaves the byte count unchanged. On the other hand, if we're going to regenerate the cross-reference table anyway, it would be better to move the comment to a new line and be left with an absolutely proper PDF header.
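
As a rough illustration of the in-place option, the sketch below turns the first space or tab after the version number into a %, so the header keeps its length and no cross-reference offsets shift. The helper name comment_out_header_tail is hypothetical; it is not part of PDF::API2 or PDF::Builder, and the header string is the one quoted in the pull request.

```perl
use strict;
use warnings;

# Hypothetical helper (not a PDF::API2 / PDF::Builder function):
# replace the first space or tab after the version number with '%'
# so the header length, and therefore every xref offset, is unchanged.
sub comment_out_header_tail {
    my ($header) = @_;
    (my $fixed = $header) =~ s/^(\%PDF-\d\.\d+)[ \t]/$1%/;
    return $fixed;
}

my $before = '%PDF-1.4 Sharp Scanned ImagePDF';
my $after  = comment_out_header_tail($before);

print "$after\n";
print "length unchanged\n" if length($before) == length($after);
```

A header with nothing after the version number is returned untouched, because the substitution only fires when a space or tab follows the digits.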

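Note: .* by itself is a "greedy" match, so it can run past the end of the header line before backtracking to a later $cr (for example, in a file with bare CR line endings, since . matches CR). I will replace it with .*?, which is non-greedy.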
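
A small sketch of the greedy versus non-greedy difference, using the Sharp header from the pull request with bare CR line endings. The $cr here is a simplified stand-in (CR, LF, or CRLF) and is only an assumption; the pattern actually defined in File.pm may differ.

```perl
use strict;
use warnings;

# Assumption: a simplified end-of-line pattern standing in for the $cr
# variable used in File.pm (CRLF, CR, or LF).
my $cr = '(?:\015\012|\015|\012)';

# Header lines from the pull request, joined with bare CR line endings.
my $buffer = "%PDF-1.4 Sharp Scanned ImagePDF\015"
           . "%Sharp Non-Encryption\015"
           . "3 0 obj\015";

# Capture everything the pattern consumes before the end-of-line match.
my ($greedy)     = $buffer =~ m/^(\%PDF\-1\.\d+.*)$cr/mo;
my ($non_greedy) = $buffer =~ m/^(\%PDF\-1\.\d+.*?)$cr/mo;

print "greedy matched ",     length($greedy),     " bytes\n";
print "non-greedy matched ", length($non_greedy), " bytes\n";
```

Both patterns succeed, but because . matches CR, the greedy form swallows the following lines and only backtracks to the last CR in the buffer, while the non-greedy form stops at the end of the header line.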

CAUTION: a comment containing at least 4 "binary" (128 or higher) bytes immediately following the version header is interpreted as telling the Reader that the PDF contains binary data (the byte values themselves are unimportant). See PDF spec 1.7 §7.5.2. Therefore, be careful NOT to stick any other comments between the version header and the binary flag comment!
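
To make the caution concrete, here is a minimal sketch of a check for that binary marker comment: a second line that starts with % and contains at least four bytes of value 128 or higher. The function name has_binary_marker and the sample buffers are illustrative assumptions, not library code.

```perl
use strict;
use warnings;

# Hypothetical check: does the line right after the version header look
# like the "binary data" marker comment ('%' plus at least four bytes
# with values of 128 or higher)?
sub has_binary_marker {
    my ($buffer) = @_;
    # Split on CRLF, CR, or LF and look at the second line only.
    my @lines = split /\015\012|\015|\012/, $buffer;
    return 0 unless @lines >= 2;
    return 0 unless $lines[1] =~ /^%/;
    my $high = () = $lines[1] =~ /[\x80-\xFF]/g;   # count high bytes
    return $high >= 4 ? 1 : 0;
}

my $with    = "%PDF-1.4\012%\xE2\xE3\xCF\xD3\0121 0 obj\012";
my $without = "%PDF-1.4\012%Sharp Non-Encryption\0123 0 obj\012";

print has_binary_marker($with)    ? "marker\n" : "no marker\n";
print has_binary_marker($without) ? "marker\n" : "no marker\n";
```

Any code that inserts a new comment line after the header would need to run a check like this first and place the new comment after the marker line, not before it.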

Regarding the cross-reference subsection that starts at object number 1 instead of 0: that should certainly produce at least a warning if no other xref table is found, and the non-0 value should be corrected to 0. Apparently many readers tolerate this error, and PDF::Builder should be as forgiving, provided it doesn't cause further errors.
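
One possible shape for such a fixup, as a sketch only: if the lone subsection header claims to start at a non-zero object number but its first entry is the free-list head (generation 65535, type f), treat the start as 0 and warn. The name fix_xref_start, the quiet option, and the sample offsets are all hypothetical; this is not how PDF::Builder necessarily implements it.

```perl
use strict;
use warnings;

# Hypothetical fixup: a lone xref subsection that claims a non-zero
# first object number, but whose first entry is the free-list head
# (generation 65535, type 'f'), is treated as starting at 0.
sub fix_xref_start {
    my ($xref, %opt) = @_;
    my @lines = grep { length } split /[\015\012]+/, $xref;
    shift @lines if @lines && $lines[0] =~ /^xref\s*$/;
    die "no xref subsection header found\n" unless @lines;
    my ($start, $count) = $lines[0] =~ /^(\d+)\s+(\d+)\s*$/
        or die "no xref subsection header found\n";
    if ($start != 0 && @lines > 1 && $lines[1] =~ /^\d{10} 65535 f\s*$/) {
        warn "xref subsection starts at $start, expected 0; corrected\n"
            unless $opt{quiet};     # hypothetical switch to silence the warning
        $start = 0;
    }
    return ($start, $count);
}

# Made-up sample in the shape described in the ticket ("1 7" header,
# CR line endings); only the first two entries are shown.
my $xref = join "\015",
    'xref', '1 7',
    '0000000000 65535 f ',
    '0000000009 00000 n ',
    '';
my ($start, $count) = fix_xref_start($xref, quiet => 1);
print "first object: $start, entries: $count\n";
```

The heuristic is deliberately narrow: it only fires when the first entry is recognizably the entry for object 0, so a genuine subsection starting at object 1 in an incrementally updated file is left alone.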

These decisions may not need to wait, but should be revisited once we decide how we're going to read in a PDF file into internal data structures. For the time being, they can be tolerated (fixed up), with possibly a switch to turn off or limit the error message. Coordinate with 106020/#27, which appears to also be a comment on the header line.
« Last Edit: May 12, 2019, 03:30:17 PM by Phil »

*

Offline Phil

Code has been added to allow a comment on the header line, and to tolerate a variety of out-of-spec formats in the cross reference list, including mislabeling object 0 as object 1. These conditions give warnings. Note that the example dr-hilton.pdf file also uses a single CR as the EOL marker, not the required two bytes. PDF::Builder already tolerated this.
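
For illustration, tolerating the short EOL could look roughly like the sketch below, which accepts cross-reference entries ending in the two-byte EOL the spec calls for as well as a lone CR or LF. The function parse_xref_entries and the sample entries are hypothetical and are not the code that went into PDF::Builder.

```perl
use strict;
use warnings;

# Hypothetical parser: accept xref entries whether they end with a
# two-byte EOL (SP CR, SP LF, or CR LF) or with a lone CR or LF.
sub parse_xref_entries {
    my ($text) = @_;
    my @entries;
    while ($text =~ /(\d{10}) (\d{5}) ([nf])[ \015\012]*/g) {
        push @entries, {
            offset     => $1 + 0,   # byte offset (or next free object)
            generation => $2 + 0,
            type       => $3,       # 'n' in use, 'f' free
        };
    }
    return @entries;
}

# Made-up entries: one set with two-byte EOLs, one with lone CRs.
my $two_byte = "0000000000 65535 f \0120000000017 00000 n \012";
my $cr_only  = "0000000000 65535 f\0150000000017 00000 n\015";

my @a = parse_xref_entries($two_byte);
my @b = parse_xref_entries($cr_only);
printf "%d entries (two-byte EOL)\n", scalar @a;
printf "%d entries (lone CR)\n",      scalar @b;
```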

Closed, will appear in release 3.014.