Problem extracting pages from PDF v. 1.6 documents

Offline Phil

  • Global Moderator
  • Sr. Member
  • *****
  • 392
    • View Profile
Problem extracting pages from PDF v. 1.6 documents
« July 14, 2018, 11:46:07 AM »
carylewis posted

I am trying to import pages from a set of PDFs generated by a third party. I had been using PDF::API2 but have encountered issues where the extracted pages result in PDFs that do not display correctly.

I am encountering the same issues with PDF::Builder.

After extracting one page, saving the document, and checking the result with Ghostscript, I see these errors:

gs -dNOPAUSE -dBATCH -sDEVICE=nullpage new.pdf
GPL Ghostscript 9.23 (2018-03-21)
Copyright (C) 2018 Artifex Software, Inc. All rights reserved.
This software comes with NO WARRANTY: see the file PUBLIC for details.
Processing pages 1 through 1.
Page 1
**** Error reading a content stream. The page may be incomplete.
Output may be incorrect.

**** Error: File has unbalanced q/Q operators (too many Q's)
Output may be incorrect.
**** Error: Form stream has unbalanced q/Q operators (too many q's)
Output may be incorrect.
**** Error reading a content stream. The page may be incomplete.
Output may be incorrect.
**** Error: File did not complete the page properly and may be damaged.
Output may be incorrect.

**** This file had errors that were repaired or ignored.
**** The file was produced by:
**** >>>> PDF::Builder 3.009 [see https://github.com/PhilterPaper/Perl-PDF-Builder/blob/master/SUPPORT] <<<<
**** Please notify the author of the software that produced this
**** file that it does not conform to Adobe's published PDF
**** specification.
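The "unbalanced q/Q" messages refer to the PDF graphics-state operators: `q` saves the graphics state and `Q` restores it, so within a content stream they are expected to pair up. A rough sketch of how one might check that on a content stream that has already been decompressed is below. The helper name `qq_balance` is hypothetical, and the tokenizer is deliberately crude: it assumes operators are whitespace-separated, that literal strings contain no nested parentheses, and that there are no inline images.

```perl
use strict;
use warnings;

# Hypothetical helper: report the q/Q nesting of a decoded content stream.
# Returns (final_depth, min_depth):
#   final_depth > 0  means some q was never closed (too many q's)
#   min_depth   < 0  means a Q appeared with no matching q (too many Q's)
sub qq_balance {
    my ($stream) = @_;

    # Crude cleanup so a 'q' or 'Q' inside a string or comment is not counted.
    # Assumption: no nested parentheses in strings and no inline image data.
    $stream =~ s/\((?:\\.|[^\\()])*\)//gs;   # literal strings (...)
    $stream =~ s/<[0-9A-Fa-f\s]*>//g;        # hex strings <...>
    $stream =~ s/%[^\r\n]*//g;               # comments

    my ($depth, $min) = (0, 0);
    for my $token (split ' ', $stream) {
        if ($token eq 'q') {
            $depth++;
        }
        elsif ($token eq 'Q') {
            $depth--;
            $min = $depth if $depth < $min;
        }
    }
    return ($depth, $min);
}
```

Running this on the page's content stream and on any form stream it references, once decoded, would show which one carries the extra `q` and which the extra `Q`.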

This is the Perl script:

Code:
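use strict;
use warnings;
use PDF::Builder;

# Create an empty target document and open the source PDF.
my $pdf = PDF::Builder->new();
my $old = PDF::Builder->open('orig.pdf');

# Copy page 2 of the source into the new document, then save it.
my $page = $pdf->import_page($old, 2);
$pdf->saveas('new.pdf');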

I have attached the orig.pdf.

Thanks for any help or insights you can provide.

I also tried PDF::Extract, which successfully extracted the two pages into separate documents. Those documents display correctly, but PDF::Builder could not extract pages from them either.

Converting orig.pdf to PDF v. 1.4 allows PDF::Builder to work, but using Ghostscript to convert the files to 1.4 does not scale very well.
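For reference, the conversion workaround can be scripted along these lines. This is only a sketch: the helper names `gs_downgrade_command` and `downgrade_to_pdf14` are made up for illustration, it assumes a `gs` executable is on the PATH, and it uses the standard `pdfwrite` device with `-dCompatibilityLevel=1.4`. It does nothing about the scaling concern, since every file is still rewritten by Ghostscript.

```perl
use strict;
use warnings;

# Hypothetical helper: build the Ghostscript argument list that rewrites
# $in as a PDF 1.4 file at $out.
sub gs_downgrade_command {
    my ($in, $out) = @_;
    return (
        'gs', '-q', '-dNOPAUSE', '-dBATCH',
        '-sDEVICE=pdfwrite',
        '-dCompatibilityLevel=1.4',
        "-sOutputFile=$out",
        $in,
    );
}

# Hypothetical helper: run the conversion and return the output path.
# Assumes `gs` is installed and on the PATH.
sub downgrade_to_pdf14 {
    my ($in, $out) = @_;
    # The list form of system() avoids shell quoting problems in file names.
    system(gs_downgrade_command($in, $out)) == 0
        or die "gs failed for $in (exit status $?)\n";
    return $out;
}
```

A call such as `downgrade_to_pdf14('orig.pdf', 'orig-1.4.pdf')` would then produce a file that can be passed to `PDF::Builder->open`.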

[orig.pdf is too large to attach, need to get from GitHub]

*

Re: Problem extracting pages from PDF v. 1.6 documents
« Reply #1: July 14, 2018, 12:11:03 PM »
PhilterPaper commented:

PDF::Builder (as well as PDF::API2) is known to have problems with PDFs of version 1.5 and up. I tried splitting all run-together lines (at ^M), but it didn't seem to work, so there may be something else. You say it works OK as version 1.4. I take it you can't create it originally as PDF 1.4?
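As a first triage step, one could check which files claim PDF 1.5 or later before handing them to the library. The sketch below reads the `%PDF-x.y` header at the start of the file. Both helper names are hypothetical, and note the limitation: a `/Version` entry in the document catalog may override the header, which this check does not look at.

```perl
use strict;
use warnings;

# Hypothetical helper: extract the version from a PDF header such as "%PDF-1.6".
# Takes the first bytes of the file; returns e.g. "1.6", or undef if none is found.
sub pdf_header_version {
    my ($head) = @_;
    return $head =~ /%PDF-(\d+\.\d+)/ ? $1 : undef;
}

# Hypothetical helper: returns 1 if the file's header claims PDF 1.5 or later,
# 0 otherwise (including when no header is found).
sub needs_attention {
    my ($path) = @_;
    open my $fh, '<:raw', $path or die "Cannot open $path: $!";
    read $fh, my $head, 1024;
    close $fh;

    my $version = pdf_header_version($head // '');
    return 0 unless defined $version;

    my ($major, $minor) = split /\./, $version;
    return ($major > 1 || ($major == 1 && $minor >= 5)) ? 1 : 0;
}
```

With something like this, only the files for which `needs_attention` returns 1 would need a closer look or a conversion step.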

If you (or someone) can isolate the PDF 1.5+ statements that are causing the trouble, we could consider adding code to support these statements. I will mark this "help wanted" in case someone can offer help.

*

Re: Problem extracting pages from PDF v. 1.6 documents
« Reply #2: July 14, 2018, 06:02:24 PM »
carylewis replied:

I don’t know how to isolate the offending bits. I suspect it has something to do with the metadata. The copied page is somewhat visible but has lots of weird repeating rectangles, so maybe there is something not being copied correctly, like the image size?

Could it be a character encoding issue?

*

Re: Problem extracting pages from PDF v. 1.6 documents
« Reply #3: July 17, 2018, 11:21:37 AM »
If it worked when converted to PDF 1.4, I doubt it's a character encoding issue. I suspect there is something at PDF 1.5 or 1.6 that is not being processed correctly. It might very well be in the metadata. I hope to get some time soon to examine it more deeply.

*

Re: Problem extracting pages from PDF v. 1.6 documents
« Reply #4: July 17, 2018, 11:23:50 AM »
carylewis replied yesterday:

Thanks for the replies, by the way; they are appreciated.

I agree with you that it's not an encoding issue.

I did some more digging using iText RUPS, and it appears as though PDF::API2 and PDF::Builder cannot handle the new PDF v. 1.6 technique of indirect objects.
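One way to narrow this down without a PDF viewer: PDF 1.5 introduced object streams (`/Type /ObjStm`), which store indirect objects compressed inside a stream, and cross-reference streams (`/Type /XRef`). Whether either is the actual cause here has not been established, but a quick byte scan can at least show if a file uses them. The helper name `pdf15_features` is hypothetical, and this is a sketch rather than a parser.

```perl
use strict;
use warnings;

# Hypothetical helper: count markers of PDF 1.5+ storage features in raw PDF bytes.
# Stream dictionaries are stored uncompressed, so a plain byte scan can see them.
sub pdf15_features {
    my ($bytes) = @_;
    my %count = (
        # "() = ..." forces list context so that scalar() yields the match count.
        object_streams => scalar(() = $bytes =~ m{/Type\s*/ObjStm\b}g),
        xref_streams   => scalar(() = $bytes =~ m{/Type\s*/XRef\b}g),
    );
    return \%count;
}
```

Comparing the counts for orig.pdf and for its Ghostscript-converted 1.4 copy would show whether these structures disappear in the version that imports cleanly.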

But the structure of the PDF I uploaded is quite complex, and I can't say exactly what is wrong.

Ghostscript version 9.23 can convert these documents to v. 1.4, but the way it does so seems very different from how the Perl libraries do it.

I am willing to help, of course. If you come across anything and need someone to do some coding, please let me know.

Does PDF::Builder use PDF::API2?

*

Re: Problem extracting pages from PDF v. 1.6 documents
« Reply #5: July 17, 2018, 11:24:52 AM »
Quote
Does PDF::Builder use PDF::API2?

PDF::Builder is a fork of PDF::API2. It is built on the PDF::API2 2.029 code base (with updates) and is still largely compatible with PDF::API2. I'm trying to keep existing interfaces as compatible as possible as I fix bugs and add new functionality.

The direct answer to your question is "no". It does not pull in or use the PDF::API2 library.