Problem extracting pages from PDF v. 1.6 documents

Offline Phil

  • Global Moderator
  • Sr. Member
  • *****
  • 417
    • View Profile
Problem extracting pages from PDF v. 1.6 documents
« July 14, 2018, 11:46:07 AM »
carylewis posted

I am trying to import pages from a set of PDFs generated by a third party. I had been using PDF::API2 but have encountered issues where the extracted pages result in PDFs that do not display correctly.

I am encountering the same issues with PDF::Builder.

After extracting one page and saving the document, and verifying the document with ghostscript, I see these errors:

gs -dNOPAUSE -dBATCH -sDEVICE=nullpage new.pdf
GPL Ghostscript 9.23 (2018-03-21)
Copyright (C) 2018 Artifex Software, Inc. All rights reserved.
This software comes with NO WARRANTY: see the file PUBLIC for details.
Processing pages 1 through 1.
Page 1
**** Error reading a content stream. The page may be incomplete.
Output may be incorrect.

**** Error: File has unbalanced q/Q operators (too many Q's)
Output may be incorrect.
**** Error: Form stream has unbalanced q/Q operators (too many q's)
Output may be incorrect.
**** Error reading a content stream. The page may be incomplete.
Output may be incorrect.
**** Error: File did not complete the page properly and may be damaged.
Output may be incorrect.

**** This file had errors that were repaired or ignored.
**** The file was produced by:
**** >>>> PDF::Builder 3.009 [see https://github.com/PhilterPaper/Perl-PDF-Builder/blob/master/SUPPORT] <<<<
**** Please notify the author of the software that produced this
**** file that it does not conform to Adobe's published PDF
**** specification.

This is the perl script:

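Code:
use strict;
use warnings;
use PDF::Builder;

# Create an empty target document and open the source PDF.
my $pdf = PDF::Builder->new();
my $old = PDF::Builder->open('orig.pdf');

# Copy page 2 of the source into the target, then save it.
my $page = $pdf->import_page($old, 2);
$pdf->saveas('new.pdf');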

I have attached the orig.pdf.

Thanks for any help or insights you can provide.

I also tried PDF::Extract, which successfully extracted the two pages into separate documents. Those documents display correctly, but PDF::Builder could not extract pages from them either.

Converting orig.pdf to PDF v. 1.4 allows PDF::Builder to work, but using Ghostscript to convert the files to 1.4 does not scale very well.
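
For reference, the Ghostscript workaround described above might be scripted roughly as follows. This is only a sketch: the helper name gs_downgrade_args and the "-v14.pdf" output naming are made up for illustration, and it assumes a gs binary on the PATH. Each file is still fully rewritten by Ghostscript, so this does nothing to address the scaling concern.

```perl
use strict;
use warnings;

# Hypothetical helper: build the Ghostscript argument list that rewrites
# a PDF through the pdfwrite device at compatibility level 1.4.
sub gs_downgrade_args {
    my ($in, $out) = @_;
    return (
        'gs', '-q', '-dNOPAUSE', '-dBATCH',
        '-sDEVICE=pdfwrite',
        '-dCompatibilityLevel=1.4',
        "-sOutputFile=$out",
        $in,
    );
}

# Convert every file named on the command line: foo.pdf -> foo-v14.pdf
# (the output naming scheme is an assumption for this example).
for my $in (@ARGV) {
    (my $out = $in) =~ s/(\.pdf)?$/-v14.pdf/i;
    system(gs_downgrade_args($in, $out)) == 0
        or warn "gs failed on $in (exit status $?)\n";
}
```

The list form of system() is used so that file names containing spaces or shell metacharacters are passed to gs unchanged.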

[orig.pdf is too large to attach, need to get from GitHub]

*

Phil
Re: Problem extracting pages from PDF v. 1.6 documents
« Reply #1: July 14, 2018, 12:11:03 PM »
PhilterPaper commented:

PDF::Builder (as well as PDF::API2) is known to have problems with PDFs of version 1.5 and up. I tried splitting all run-together lines (at ^M), but it didn't seem to work, so there may be something else. You say it works OK as version 1.4. I take it you can't create it originally as PDF 1.4?

If you (or someone) can isolate the PDF 1.5+ statements that are causing the trouble, we could consider adding code to support these statements. I will mark this "help wanted" in case someone can offer help.
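
As a first step toward isolating such statements, something like the following could report which well-known PDF 1.5+ constructs appear in a file. It is a rough sketch using core Perl only: the function name scan_pdf_features is hypothetical, the three patterns (object streams, cross-reference streams, JPEG 2000 image filters) are just a starting list, and a raw byte scan can in principle produce false hits inside binary stream data.

```perl
use strict;
use warnings;

# Hypothetical helper: look for byte patterns that usually signal
# PDF 1.5+ features. Stream dictionaries are never stored inside
# object streams, so these names are visible in the raw file.
sub scan_pdf_features {
    my ($bytes) = @_;
    my %found;

    # Version from the header comment, e.g. "%PDF-1.6" (undef if absent).
    ($found{version}) = $bytes =~ /%PDF-(\d\.\d)/;

    my %patterns = (
        object_streams => qr{/Type\s*/ObjStm\b},   # compressed object streams (1.5)
        xref_streams   => qr{/Type\s*/XRef\b},     # cross-reference streams (1.5)
        jpeg2000       => qr{/JPXDecode\b},        # JPEG 2000 image filter (1.5)
    );
    for my $name (keys %patterns) {
        my $count = () = $bytes =~ /$patterns{$name}/g;
        $found{$name} = $count;
    }
    return \%found;
}

# Usage: perl scan.pl orig.pdf
if (@ARGV) {
    my $file = shift @ARGV;
    open my $fh, '<:raw', $file or die "Cannot open $file: $!";
    my $bytes = do { local $/; <$fh> };
    close $fh;

    my $found = scan_pdf_features($bytes);
    printf "%-15s %s\n", $_, $found->{$_} // 'unknown' for sort keys %$found;
}
```

A nonzero count only says that the construct is present, not that it is the one causing the trouble; comparing the counts for a failing file and a working one would narrow things down.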

*

Phil
Re: Problem extracting pages from PDF v. 1.6 documents
« Reply #2: July 14, 2018, 06:02:24 PM »
carylewis replied:

I don’t know how to isolate the offending bits. I suspect it has something to do with the metadata. The copied page is somewhat visible but with lots of weird repeating rectangles, so maybe there is something not being copied correctly, like image size?

Could it be a character encoding issue?

*

Phil
Re: Problem extracting pages from PDF v. 1.6 documents
« Reply #3: July 17, 2018, 11:21:37 AM »
If it worked when converted to PDF 1.4, I doubt it's a character encoding issue. I suspect there is something at PDF 1.5 or 1.6 that is not being processed correctly. It might very well be in the metadata. I hope to get some time soon to examine it more deeply.

*

Phil
Re: Problem extracting pages from PDF v. 1.6 documents
« Reply #4: July 17, 2018, 11:23:50 AM »
carylewis replied yesterday:

Thanks for the replies, by the way, it is appreciated.

I agree with you that it's not an encoding issue.

I did some more digging, using iText RUPS, and it appears as though PDF::API2 and PDF::Builder cannot handle the new PDF v. 1.6 technique of indirect objects.

But the structure of the PDF I uploaded is quite complex, and I can't say exactly what is wrong.

Ghostscript version 9.23 can convert these documents to v. 1.4, but the way it does it seems very different from how the Perl libraries do it.

I am willing to help, of course. If you come across anything and need someone to do some coding, please let me know.

Does PDF::Builder use PDF::API2?

*

Phil
Re: Problem extracting pages from PDF v. 1.6 documents
« Reply #5: July 17, 2018, 11:24:52 AM »
Quote
Does PDF::Builder use PDF::API2?

PDF::Builder is a fork of PDF::API2. It is built on the PDF::API2 2.029 code base (with updates) and is still largely compatible with PDF::API2. I'm trying to keep existing interfaces as compatible as possible as I fix bugs and add new function.

The direct answer to your question is "no". It does not pull in or use the PDF::API2 library.

*

Phil
Re: Problem extracting pages from PDF v. 1.6 documents
« Reply #6: September 13, 2018, 02:10:18 PM »
carylewis replied 29 July 2018:

Are there any plans to change the format of the produced PDF to 1.5 or above?

I’m seeing more and more JP2 files, and PDF 1.4 doesn’t support that file type. This necessitates converting files to 1.4, which means converting the JPEG 2000 images to TIFF, which is slow.

*

Phil
Re: Problem extracting pages from PDF v. 1.6 documents
« Reply #7: September 13, 2018, 02:11:40 PM »
Phil replied 30 July 2018:

Yes, I need to do something to properly handle read-in PDFs of 1.5+, and allow production of PDF 1.5+. I'm still pondering what the best way to do this would be. I'm not sure that PDF::Builder (or its predecessor, PDF::API2) even fully implements PDF 1.0, much less 1.4, not to mention higher levels. See https://www.catskilltech.com/forum/pdf-builder-general-discussions/bringing-pdfapi2-into-the-21st-century/ (or #93); contributions and thoughts are welcome.

*

Phil
Re: Problem extracting pages from PDF v. 1.6 documents
« Reply #8: September 13, 2018, 02:14:08 PM »
A Standing Invitation for Contributors

PDF::Builder is known to be incompatible with many PDFs of level 1.5 and up, when they are read in. I know of only one PDF 1.5+ feature that is implemented (cross reference streams) -- it may be assumed that many features first appearing in 1.5 will cause problems. Hell, not even all of PDF 1.0-1.4 is implemented, so other problems could be encountered.

You are invited to isolate PDF incompatibilities (whether found in 1.5 and up, or in earlier versions), and specifically report them (in a new bug thread). With enough detailed information, I can consider implementing them (code contributions are, of course, welcome!). PDF::Builder won't become PDF-1.7 compatible overnight, but at least we can keep chipping away at it.

By the way, does anyone know of a good tool to "dump" a PDF into XML or some other human-readable format? That could make diagnosing and understanding a problem much easier. Even better would be a tool that allows hand-editing of the content and converts it back into PDF (binary conversion and compression). There are a few tools that more or less do this (at least, the dump), but they're either expensive or require that the PDF be uploaded to another site. If anyone's looking for a new Perl-based CPAN project, possibly using PDF::Builder as a library, this could be something good!

Incidentally, I have implemented the automatic "bump" of PDF version level for input PDFs and output features (none yet) mentioned in the previous post, so we're ready on that front.
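
To illustrate the idea of a version "bump" (this is not PDF::Builder's actual code; the function name higher_pdf_version is invented for the example), the rule amounts to keeping whichever version is higher, the output's current level or the level required by an input file or feature:

```perl
use strict;
use warnings;

# Hypothetical helper, not part of PDF::Builder's API: given the current
# output version and the version an input file or feature requires,
# return whichever is higher. Versions are "major.minor" strings.
sub higher_pdf_version {
    my ($current, $candidate) = @_;
    my ($cmaj, $cmin) = split /\./, $current;
    my ($nmaj, $nmin) = split /\./, $candidate;

    # Compare numerically, major part first, so that 2.0 outranks 1.7.
    return $candidate
        if $nmaj > $cmaj or ($nmaj == $cmaj and $nmin > $cmin);
    return $current;
}
```

Under this rule, importing a page from a 1.6 document into a 1.4 output would raise the output to 1.6, while importing a 1.3 page would leave it at 1.4.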