
RT 105579 - Given same input, different (byte- and sizewise) PDF files are created

*

Offline Phil

  • Global Moderator
  • Sr. Member
  • *****
  • 387
    • View Profile
Tue Jun 30 14:03:54 2015 DMITRI [...] cpan.org - Ticket created
Subject:    Given same input, different (byte- and sizewise) PDF files are created

I use PDF::API2 to create very simple, text-only PDF files.  I noticed that when I use the same text as input, PDF::API2 produces different output -- the files have different sizes.  I am attaching a program to demonstrate.  Run it three times: chances are, you will end up with three PDF files of different sizes.

I used ImageMagick's convert utility to convert these PDFs to GIFs: the GIFs are identical.  This is good, but I think the source of randomness should be removed, just for one's sanity's sake.

Code: [Select]
use strict;
use warnings;

use Getopt::Long;
use PDF::API2;

GetOptions(
    "n-pages=i" => \(my $n_pages = 100),
);

my $text = <<'TEXT';
     Messages consist of lines of text.   No  special  provisions
are  made for encoding drawings, facsimile, speech, or structured
text.  No significant consideration has been given  to  questions
of  data  compression  or to transmission and storage efficiency,
and the standard tends to be free with the number  of  bits  con-
sumed.   For  example,  field  names  are specified as free text,
rather than special terse codes.


     A general "memo" framework is used.  That is, a message con-
sists of some information in a rigid format, followed by the main
part of the message, with a format that is not specified in  this
document.   The  syntax of several fields of the rigidly-formated
("headers") section is defined in  this  specification;  some  of
these fields must be included in all messages.


     The syntax  that  distinguishes  between  header  fields  is
specified  separately  from  the  internal  syntax for particular
fields.  This separation is intended to allow simple  parsers  to
operate on the general structure of messages, without concern for
the detailed structure of individual header fields.   Appendix  B
is provided to facilitate construction of these parsers.


     In addition to the fields specified in this document, it  is
expected  that  other fields will gain common use.  As necessary,
the specifications for these "extension-fields" will be published
through  the same mechanism used to publish this document.  Users
may also  wish  to  extend  the  set  of  fields  that  they  use
privately.  Such "user-defined fields" are permitted.


     The framework severely constrains document tone and  appear-
ance and is primarily useful for most intra-organization communi-
cations and  well-structured   inter-organization  communication.
It  also  can  be used for some types of inter-process communica-
tion, such as simple file transfer and remote job entry.  A  more
robust  framework might allow for multi-font, multi-color, multi-
dimension encoding of information.  A  less  robust  one,  as  is
present  in  most  single-machine  message  systems,  would  more
severely constrain the ability to add fields and the decision  to
include specific fields.  In contrast with paper-based communica-
tion, it is interesting to note that the RECEIVER  of  a  message
can   exercise  an  extraordinary  amount  of  control  over  the
message's appearance.  The amount of actual control available  to
message  receivers  is  contingent upon the capabilities of their
individual message systems.
TEXT

my $pdf = PDF::API2->new;
my $font = $pdf->corefont('Courier');

for (my $n = 0; $n < $n_pages; ++$n) {
    my @lines = split /\n/, $text;
    # Change the text up a little bit (move a line to the first
    # position), so that I can tell that there is more than one
    # page in a GIF when I convert it.  (I use ImageMagick's
    # convert utility to convert PDFs to GIF to do pixel-by-pixel
    # comparison).
    my $pick_a_line = splice @lines, $n % @lines, 1;
    my $page_text = join "\n", $pick_a_line, @lines;
    my $page = $pdf->page;
    $page->mediabox(612, 792);
    my $content = $page->text;
    $content->translate(0, 780);
    $content->font($font, 12);
    $content->lead(12);
    $content->section($page_text, 612, 780);
}

$pdf->saveas($ARGV[0]);
#
Tue Jun 30 14:41:31 2015 steve [...] deefs.net - Correspondence added

This is normal.  Dictionaries (hashes) won't always output in the same order, and PDF::API2 uses timestamps to generate IDs.  Both of these can impact compression, resulting in PDFs with different sizes even though they're generated by the same script.
#
Tue Jun 30 14:41:31 2015 The RT System itself - Status changed from 'new' to 'open'
#
Tue Jun 30 14:41:32 2015 steve [...] deefs.net - Status changed from 'open' to 'rejected'
#
Tue Jun 30 15:39:44 2015 DMITRI [...] cpan.org - Correspondence added

This is interesting.  I see a benefit to having deterministic behavior in this regard: one could use size and contents (minus the timestamp and other fixed-length stuff in the PDF header and footer) to check for regressions.
#
Subject:    [rt.cpan.org #105579]
Date:    Wed, 1 Jul 2015 11:43:05 -0400
To:    bug-PDF-API2 [...] rt.cpan.org
From:    Phil M Perry

While it's not a critical problem, I agree that it would be nice for output to be more deterministic, so that integrity checks such as DMITRI proposes could be easily made. Let's think about why IDs are generated with timestamps rather than some deterministic counter, and whether a counter would be sufficient to make multiple PDF document runs (from the same source) essentially the same (except for header timestamps). It is possible that the original author was simply lazy, and picked a timestamp (hopefully microsecond precision) for a unique ID, rather than going through the effort of tracking some global counter instead. What is "normal" practice for PDF generation in Acrobat and other packages?
#
Subject:    [rt.cpan.org #105579]
Date:    Wed, 16 Mar 2016 11:02:09 -0400
To:    bug-PDF-API2 [...] rt.cpan.org
From:    Phil M Perry

See also #113084. It sounds like the same issue.

*

Offline sciurius

  • Jr. Member
  • **
  • 67
    • View Profile
    • Website
Since I always use regression tests for my software, having a means to generate deterministic PDF would really, really be great.
It's a pain to have to convert the PDFs to some bitmapped format just to detect whether they look the same. And then see the tests fail when the converter changes its algorithms...

*

Offline Phil

Please also keep up on ticket RT 113084. I have some comments in there.

*

Offline Phil

A thought: as an interim measure, how about a command line flag or global setting, etc., to suppress the ~time() addition to the ID? That way you could get your deterministic PDF for testing, but a "safe" one (with time) for production, until this issue is settled?
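
A rough sketch of what such a switch could look like. The $USE_TIME_SUFFIX variable and the resource_name() helper are hypothetical, and the '~' . time() suffix format is an assumption about how the names are built; none of this is PDF::API2's actual code.

```perl
use strict;
use warnings;

# Hypothetical global setting: true for "safe" production names,
# false for repeatable names in regression tests.
our $USE_TIME_SUFFIX = 1;

# Hypothetical stand-in for the library's key counter.
my $key = 'CBA';

sub resource_name {
    my ($prefix) = @_;
    my $name = $prefix . $key++;                 # e.g. HeCBA, HeCBB, ...
    $name .= '~' . time() if $USE_TIME_SUFFIX;   # assumed suffix format
    return $name;
}

{
    local $USE_TIME_SUFFIX = 0;          # test mode: no timestamp
    print resource_name('He'), "\n";     # HeCBA
}
print resource_name('He'), "\n";         # HeCBB~<epoch seconds>, varies per run
```

Using local keeps the test setting confined to one block, so production code elsewhere in the same process would still get the timestamped names.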

*

Offline sciurius

I got partly useful results once by modifying the timestamp postfixes into a fixed 'fake' timestamp. I also used this timestamp in the CreationDate.
Also, I needed to defeat hash randomisation by setting PERL_HASH_SEED to a fixed value.
I do not recall whether that was sufficient since (for other reasons) I started using conversion to bitmap to compare the PDFs.

It would be nice to have a PDF open option -fixedtime => value that would also take care of PERL_HASH_SEED (and, possibly, other steps to create deterministic output).
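
One practical wrinkle: PERL_HASH_SEED is read when the interpreter starts, so a script cannot pin it for itself after the fact; it has to be set in the environment of the process that generates the PDF. A minimal sketch of a test harness doing that follows. The run_pinned() helper and the one-line child program are made up for illustration, and PERL_PERTURB_KEYS is set as well because newer Perls also perturb key order.

```perl
use strict;
use warnings;

# Run a child perl with hash randomisation pinned and return its output.
sub run_pinned {
    my ($code) = @_;
    local $ENV{PERL_HASH_SEED}    = '0';
    local $ENV{PERL_PERTURB_KEYS} = '0';
    open my $fh, '-|', $^X, '-e', $code
        or die "cannot run $^X: $!";
    local $/;                      # slurp the whole output
    my $out = <$fh>;
    close $fh;
    return $out;
}

# Stand-in for a PDF generator: prints hash keys in iteration order.
my $code = 'my %h = map { $_ => 1 } "a" .. "z"; print join ",", keys %h';

my $first  = run_pinned($code);
my $second = run_pinned($code);
print $first eq $second ? "identical\n" : "different\n";
```

With both variables pinned, the two runs should report the same key order; in a real test the child would be the PDF-generating script and the comparison would be on the files it writes.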

*

Offline Phil

I've been taking a look through this stuff, and reading up on Perl, PDF, and related topics. One thing done here is a call to pdfkey(), which simply returns the next key value for use in IDs. It is initialized to 'CBA', and apparently goes CBB, CBC, … CBZ, CCA, … ZZZ, AAAA, etc. Since it is initialized to the same starting value at each run (if I'm reading the code correctly), I would say there's a high likelihood of duplicate IDs in two runs of PDF::API2 that do the same thing. Therefore, trying to merge two similar output PDFs produced by PDF::API2 almost guarantees an ID collision. ~time() was apparently added to greatly reduce the chances of duplicate IDs (it acts as a random-appearing number), although duplicates are still possible. pdfkey() is used in a number of font-, colorspace-, and image-related methods. ~time() is added for some of the font-related methods. It's messy, inconsistent, and should be cleaned up.

If the purpose of the exercise is to produce non-colliding internal IDs, it should almost work (not currently used universally). However, this clashes with the desire of programmers to produce deterministic output, so that PDFs can be easily compared. So we can't, for example, seed pdfkey() with a timestamp or UUID or something else guaranteed to be unique across runs and platforms (besides, the very next key returned might be a duplicate anyway).

During a single run of PDF::API2, the use of pdfkey() should result in unique IDs within this document, for newly created IDs. However, importing (merging) of existing PDF file(s) could result in duplicates, which presumably is what led to the use of ~time(). I don't think it sufficient to allow unrestricted import of existing PDF code, and only check new pdfkeys for duplication, as someone might possibly import two or more PDFs, which duplicate each other. I think it will be necessary to check each imported (old) ID against the current list of all active IDs in the document, and update existing IDs if they are found to be duplicates. pdfkey() (actually, a wrapper around it with the rest of the ID) will also need to check if a duplicate is going to be issued, and skip over it. Simple incrementing of a counter (pdfkey) is not sufficient.

Another issue you mentioned was playing with/subverting PERL_HASH_SEED and PERL_HASH_DISABLE_KEY_RANDOMIZATION, to get more deterministic outputs. I did some reading on this (until my eyes glazed over), and apparently behavior has been changed several times in Perl for the purposes of speed and security (make it difficult for a hacker to determine the hash randomization seed and thus predict behavior). It is suggested that if you want deterministic behavior, you should sort keys yourself. If PDF::API2 sorted various keys itself (either universally, or under some sort of debug flag), would that do the job? For example, to always alphabetically sort keys before outputting their content? Sorting only on debug might be a problem in that you can't be absolutely sure that a production run is working OK (it wouldn't be repeatable). Is there any place in PDF::API2 that we really need random ordering of the output of a hash? Do we always output selected fields in a given order?
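
To illustrate the "sort keys yourself" suggestion, here is a minimal sketch of a dictionary writer that always emits keys in sorted order. The dict_to_string() helper and the sample font dictionary are made up for illustration; this is not PDF::API2's actual output code.

```perl
use strict;
use warnings;

# Hypothetical, simplified dictionary writer. Iterating with "sort keys"
# makes the output independent of Perl's per-process hash ordering.
sub dict_to_string {
    my ($dict) = @_;
    my @parts = map { "/$_ $dict->{$_}" } sort keys %$dict;
    return '<< ' . join(' ', @parts) . ' >>';
}

my %font = (
    Type     => '/Font',
    Subtype  => '/Type1',
    BaseFont => '/Courier',
);

print dict_to_string(\%font), "\n";
# << /BaseFont /Courier /Subtype /Type1 /Type /Font >>
```

Because the order no longer depends on the hash seed, the same dictionary serializes to the same bytes on every run, with no environment variables needed.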

Code: [Select]
           newly created ID                             imported ID                     list of active IDs

      create ID with pdfkey(). check          check ID against active list.           original and revised ID
      against active list. revise until       if duplicate, generate new              original and revised ID
      not a duplicate. put in active          non-duplicate. put original and         etc. (would be same if
      list.                                   revised in active list. change ID       a newly created ID).
                                              to revised. watch for use(s) of
                                              original ID downstream, and
                                              change to revised.

We don't have to necessarily make each pdfkey() unique (besides, PDFs from other sources might use different architectures for IDs). We only have to make the overall ID unique within this document. We need to recognize all IDs being imported (can the use be ahead of the definition?) and change them immediately, so that we don't have to backpatch once a formal ID is recognized.
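
The scheme in the table above could be sketched roughly as follows. All names here (%active, new_id(), import_id(), the "_A" style rename suffix) are hypothetical and only illustrate the bookkeeping; this is not code from PDF::API2.

```perl
use strict;
use warnings;

my %active;              # every ID currently in use in this document
my $next_key = 'CBA';    # stand-in for the pdfkey() counter

# Newly created ID: advance the counter until the result is not a duplicate.
sub new_id {
    my ($prefix) = @_;
    my $id;
    do { $id = $prefix . $next_key++ } while $active{$id};
    $active{$id} = 1;
    return $id;
}

# Imported ID: keep it if free, otherwise issue a revised ID. $map is a
# per-imported-file table (original => revised) so that later uses of the
# original ID in that file can be changed to the revised one.
sub import_id {
    my ($original, $map) = @_;
    return $map->{$original} if exists $map->{$original};
    my $id     = $original;
    my $suffix = 'A';
    $id = $original . '_' . $suffix++ while $active{$id};
    $active{$id}      = 1;
    $map->{$original} = $id;
    return $id;
}

my %file1;                                    # rename table for one import
my $new1  = new_id('He');                     # HeCBA
my $imp1  = import_id('HeCBA', \%file1);      # collides, becomes HeCBA_A
my $imp2  = import_id('HeCBB', \%file1);      # free, stays HeCBB
my $new2  = new_id('He');                     # skips HeCBB, gives HeCBC
my $again = import_id('HeCBA', \%file1);      # same mapping: HeCBA_A
print join(' ', $new1, $imp1, $imp2, $new2, $again), "\n";
```

Every step is deterministic, so two runs over the same inputs would assign the same IDs, which is the property the timestamp suffix cannot give.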

*

Offline sciurius

Thanks for your extensive report.

Quote
It is initialized to 'CBA', and apparently goes CBB, CBC, … CBZ, CCA, … ZZZ, AAAA, etc.

Yes, that is Perl's "magical increment on strings".
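
For reference, a small demonstration of the magical increment. The next_key() helper is a hypothetical stand-in for pdfkey(), not the library's code. Note that incrementing 'ZZZ' carries into a new leading character and gives 'AAAA'.

```perl
use strict;
use warnings;

# A string counter advanced with Perl's magical string increment.
my $key = 'CBA';

sub next_key {
    return $key++;    # post-increment: return the current key, then advance
}

my @first = map { next_key() } 1 .. 3;
print "@first\n";     # CBA CBB CBC

# The increment carries within the letter range ...
my $k = 'CBZ';
$k++;
print "$k\n";         # CCA

# ... and grows the string by one character on overflow.
my $z = 'ZZZ';
$z++;
print "$z\n";         # AAAA
```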

Quote
However, this clashes with the desire of programmers to produce deterministic output, so that PDFs can be easily compared.

Yes, but bear in mind that this might only be needed for regression testing.

Quote
So, we can't for example, seed pdfkey() with a timestamp or UUID or something else guaranteed to be unique across runs and platforms (besides, the very next key returned might be a duplicate anyway).

Focussing on PDF::API2, it would be sufficient if the seed can be specified with the new() method.
If -- again, for regression testing -- each document is given its own seed then all documents can be compared and merge operations will not risk id clashes.
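
Something along these lines is what I have in mind, as a sketch only. The Local::KeyGen class and its -seed option are hypothetical; PDF::API2's new() has no such parameter as far as this thread establishes.

```perl
use strict;
use warnings;

{
    # Hypothetical per-document key generator.
    package Local::KeyGen;

    sub new {
        my ($class, %opt) = @_;
        my $seed = defined $opt{-seed} ? $opt{-seed} : 'CBA';
        return bless { key => $seed }, $class;
    }

    sub next_key {
        my ($self) = @_;
        return $self->{key}++;    # magical increment, per document
    }
}

my $doc1 = Local::KeyGen->new(-seed => 'DAA');
my $doc2 = Local::KeyGen->new(-seed => 'EAA');

my @keys1 = map { $doc1->next_key } 1 .. 3;   # DAA DAB DAC
my @keys2 = map { $doc2->next_key } 1 .. 3;   # EAA EAB EAC
print "@keys1\n@keys2\n";
```

Each document is repeatable for a given seed, and two documents only stay clash-free as long as their seeds are far enough apart that the key ranges do not overlap.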

Quote
Another issue you mentioned was playing with/subverting PERL_HASH_SEED and PERL_HASH_DISABLE_KEY_RANDOMIZATION, to get more deterministic outputs. […] Is there any place in PDF::API2 that we really need random ordering of the output of a hash?

I think not. Besides, sorted is just one of the many outcomes of random ;D . If random is OK, sorted is OK as well.

For PDF documents, the security issues that led to PERL_HASH_SEED are not relevant.

*

Offline Phil

I've been poking around the code on this one. It appears that import_page() and importPageIntoForm() put the imported page code into a form XObject, safely encapsulating and isolating resources from any possible name collisions. I think. At least, the comments make that claim. However, it looks like open() and possibly openpage() might be open to problems if they load in an existing page, and then the code adds more resources, etc. to it. In such cases, some registry would have to be kept of the loaded resource names, and if a collision is experienced with new additions, the new names would have to be changed at creation time. Something like that. Has anyone else narrowed down the possible collision points? By the way, the ~time() suffix was only used in a limited number of resource names, not all of them, so there may be other potential collision points.

We might consider whether it's feasible to import a page during open*, and add anything new outside of the form XObject. However, what will happen if the code is supposed to modify or delete existing entries: will they be inaccessible if locked up in a form XObject? What if this "hybrid" page is in turn opened, and more material added? Much more research is needed on this, and if anyone has already done it, I'd appreciate hearing from them. I want to avoid ad hoc solutions and reliance on empirical evidence of where there is a collision threat and how to avoid it; I'd rather properly understand just what's going on. Surely, Adobe products and other PDF editors must have already dealt with this problem: how?

*

Offline Phil

I've done more investigation and testing on this. I have been unable to find any reason why ~time() should be added to font resource names. If I open() an existing file and create a page() (whether at the end, or inserted between existing pages), the existing file is unaltered and the new page is added at the physical end of the file, after the original trailer and %%EOF. There is no mention of existing objects, except that a couple of new objects get the same object number (do they replace the originals?). Likewise, an open() and openpage() to append more content to an existing page just append to the physical end of the file.

Can anyone think of any operations where a name collision might actually happen? The only thing I see at the moment is that font names are condensed down to 4 characters, and I don't know whether that's guaranteed to be unique when on the same page, although the pdfkey() probably would help keep names unique. Since in any one run, time() is probably the same (second resolution) across any page (at least), I suspect that any collision problems with that would have shown up long ago.

makeBase.pl writes Base.pdf, a two page document with several fonts
pdfVerdBold.pl writes Verdana3b.pdf, which adds a third page to Base with Verdana regular and bold
pdfOpenPg.pl writes Extend1.pdf, which adds another line to page 1 of Base
pdfPg1.5.pl writes Verdana1.5.pdf, which inserts a page between Base's 1 and 2, repeating a font

I'm beginning to suspect that there weren't actually any resource name collisions encountered, but someone thought "better safe than sorry" and added the ~time() to almost definitely make the names unique. Unless any evidence shows up that using pdfkey() alone is insufficient, I will remove (comment out) the ~time() portion of the names in version 3.003.

Now PDF files should be deterministic (repeatable), and are a bit shorter than before.

*

Offline Phil

To be released (code with ~time() commented out, but not erased) in 3.003.