
Character sets and encoding

Phil (Global Moderator)
Character sets and encoding
« March 30, 2017, 11:18:55 AM »
There are at least two important issues here:

  • How natively should PDF::API2 support Unicode and in what forms? Right now the code is more or less using ASCII, with one or two grudging nods to UTF-8. As Unicode is supposed to be the One Ring to Rule Them All (or at least, a superset of all encodings), should we do as JavaScript does and convert everything internally to UTF-16 (or perhaps UTF-8)? There would be an input encoding flag to tell PDF::API2 what the incoming text (and markup) is written in, internal processing would all be in one encoding, and on the back end recode to whatever the font in use at that time wants. Note that the font to be used (and thus its encoding) needs to be known throughout the process, so glyph widths, etc. may be determined.
  • What other character encodings should we work in? In the end, you need to produce the character set supported by the font(s) you will be using. Every font seems to have a slightly different take on what glyphs are needed (and that's just for Western European languages!), and we haven't even gotten into other alphabets around the world. I don't think I've seen a font yet that exactly matches a standard encoding such as ISO-8859-1 or CP-1252. At the least, PDF::API2 needs to have a well-defined fallback capability (see Feature Request CTS 5), and warn the user if certain characters can't be used at all. It should also make use of ligatures (such as "fi") where appropriate (and this is language-dependent).

I will add another "wishlist" item: what can we do to make source markup for PDF production easier on authors, particularly for helping them use typographically correct punctuation (see "Smart Quotes") without a lot of labor in marking it up? Converting a double-dash into an em-dash isn't rocket science, but figuring out whether an ASCII apostrophe ' should be ‘ or ’ depends on the context and even the language you're working in. Something compatible with HTML would be great, as it would directly lead into an HTML-to-PDF conversion tool. Or, should we just encourage authors to set up some sort of "compose" key on their keyboard, and type in the correct UTF-8 character manually? A glass keyboard pop-up can also be used.
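The double-dash conversion really isn't rocket science; a naive sketch (a hypothetical helper, not existing PDF::API2 code) of that kind of substitution on an already-decoded string might look like this. Getting apostrophe and quote direction right in general needs context- and language-aware rules, which this deliberately does not attempt:

```perl
use strict;
use warnings;

# Naive typographic cleanup of a decoded (character) string.
# Only the easy cases are handled; real smart-quote logic is
# context- and language-dependent.
sub typographic {
    my ($s) = @_;
    $s =~ s/--/\x{2014}/g;                   # double dash -> em dash
    $s =~ s/(?<=\w)'(?=\w)/\x{2019}/g;       # in-word apostrophe -> right single quote
    $s =~ s/"([^"]+)"/\x{201C}$1\x{201D}/g;  # "..." -> curly double quotes
    return $s;
}
```

Something like this could sit in front of the PDF text calls, or in an HTML-to-PDF preprocessing pass.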

There are lots of characters needed every day which simply aren't on most keyboards, such as em- and en-dashes, required blanks (non-breaking spaces), and even accented characters. What's the best way to get them into the source for a PDF? Finally, different languages (and even different fields using a language) have different rules on punctuation, order, spacing, etc. — it may never be possible (with reasonable markup) to have something flexible enough to work correctly with any language (in other words, such things will have to be hand-assembled in the source, and incorrect for any other language).

Being able to bold, underline, strike out, color, resize, and italicize text in the markup would be useful. Multiple plug-in options might be made available, so markdown, BBCode, LaTeX markup, and others (in addition to HTML + CSS) might be usable.
« Last Edit: March 30, 2017, 11:41:53 AM by Phil »

sciurius (Jr. Member)
Re: Character sets and encoding
« Reply #1: March 31, 2017, 02:36:55 AM »
I think this is a non-issue.
PDF::API2 is a Perl module, and Perl already takes care of everything. When you convert your input to "Perl internal encoding" (using Encode::decode, something you should *always* do) everything works fine already.
Or am I overlooking something?

Phil
Re: Character sets and encoding
« Reply #2: March 31, 2017, 02:08:01 PM »
OK, I see that there is already (at least) some use of Encode in PDF::API2, but haven't checked it thoroughly. It falls upon the developer using PDF::API2 and the end user of that application to properly specify the encoding of whatever text is being read in. If I write some text in CP-1252 (Windows' Latin-1 + "Smart Quotes"), naturally I would have to tell PDF::API2 what it's working with. I don't think that UTF-8 (or any other encoding, for that matter) can be reliably detected just by looking at the byte values. Should the task of converting the input encoding to Perl internal fall upon the application developer, or on PDF::API2? And what exactly is Perl's internal encoding? If it is a superset of Unicode, in what ways?

So, is the encoding support adequate for all input text that PDF::API2 might be fed? I don't seem to recall seeing many (if any) references in the documentation to specifying the input encoding. It seems to be easy to feed it the wrong thing, and get unintended results. At the least, it sounds like the POD documentation needs to be updated to remind users about encoding issues. Keep in mind that the input encoding could change from file to file within one run of an application using PDF::API2 (e.g., the main text and various "include" text files).
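One way an application (rather than PDF::API2 itself) can deal with a per-file input encoding is Perl's I/O layers, which decode as the file is read. The helper below is a hypothetical sketch, not an existing PDF::API2 API:

```perl
use strict;
use warnings;

# Hypothetical helper: slurp a text file, decoding from its declared
# encoding as it is read, so the returned string is in Perl's internal
# (character) form -- ready to hand to PDF::API2's text methods.
# The encoding can differ from file to file within one run.
sub read_decoded {
    my ($file, $encoding) = @_;
    open my $fh, "<:encoding($encoding)", $file
        or die "Can't open $file: $!";
    local $/;                # slurp the whole file
    my $text = <$fh>;
    close $fh;
    return $text;
}

# e.g. $text->text( read_decoded('include.txt', 'cp1252') );
```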

At the other end, Perl needs to match up its internal representation to what a particular font supplies (often more than 256 characters, but not full Unicode). In your CTS 5 feature request, you show that apparently we need to do something about using fallback fonts if a glyph isn't available in the desired font, something like CSS's font-family list.  Do all fonts list the Unicode value of a glyph, so PDF::API2 could check to see if it's available in that font? Is it a simple extension to processing to temporarily switch to another font? (And do application developers have to open these fonts for PDF::API2, or should PDF::API2 do it based on a given list of font names plus other information?) Remember that the font to be used for a particular character needs to be known very early in the process, so that the width of any glyph (for example) may be looked up.

Finally, is any of this going to have an impact on the minimum Perl version allowable, and the PDF version being output (anything greater than PDF 1.4)?

sciurius
Re: Character sets and encoding
« Reply #3: April 01, 2017, 07:08:41 AM »
No, no, no... It's all there already.

Quote
If I write some text to use CP-1252 (Window's Latin 1 + "Smart Quotes"), naturally I would have to tell PDF::API2 what it's working with.

This is the fundamental misconception, unfortunately one that many people make.
You don't tell PDF::API2 what it is working with. You tell Perl, and everything is transparent from there on.

For example, you read a string from a file. This file is encoded in, say, ISO-8859-1. So what you do is basically:
Code:
use Encode qw(decode);

$line = readline($filehandle);          # $line is in the file's external encoding
$data = decode("ISO-8859-1", $line);    # $data is now in Perl internal encoding
$text->text($data);                     # add the decoded text to the PDF

PDF::API2 (and so should all modules) expects the data to be in Perl internal encoding.

Should you want to write this data yourself to a file encoded in UTF-8:
Code:
use Encode qw(encode_utf8);

print $outfile encode_utf8($data);   # encode back to UTF-8 bytes on output

'print' expects raw data (bytes), just like readline delivers raw data.

<fixed broken quote -- Mod.>
« Last Edit: April 01, 2017, 07:56:25 AM by Phil »

Phil
Re: Character sets and encoding
« Reply #4: April 01, 2017, 08:32:57 AM »
Well, you tell the PDF::API2-using application what encoding your incoming file(s) are in, which PDF::API2 in turn tells Perl (if PDF::API2 is reading the file(s) itself). If your application is simply handing off text strings to PDF::API2 (normally the case?), it (your application) would be responsible for first getting the data into Perl internal form. Is that correct? I have a feeling we're heading into a violent agreement.

Is there some place that documents what the Perl internal format (encoding) is? Is it UTF-8, Unicode (UTF-16, like JavaScript), or something else? I've seen it referred to as a "superset" of Unicode or UTF-8, so I'm not sure what it is. Reading through http://perldoc.perl.org/perlunifaq.html and http://stackoverflow.com/questions/15170982/what-the-heck-is-a-perl-string-anyway (among others), I see it described in several different ways — ASCII, bytes if a single-byte encoding, UTF-8 otherwise — so what can I count on?

This source: http://plosquare.blogspot.com/2009/04/viewing-internal-representation-of.html says that there is an internal flag for each string saying whether it's UTF-8 or some unspecified single-byte encoding. That would not help distinguish between ISO-8859-1 and CP-1252. For instance, in some proposed PDF::API2 code there was a soft hyphen (&SHY;) coded as UTF-8. Since 7-bit ASCII is a proper subset of UTF-8, can I just count on anything > 0x7F in a string being UTF-8?

What if I have (either read in from a file, or hard-coded in the Perl code), say, a Latin-1 non-breaking space (&nbsp;)? It is 0xA0 in either case, so what will Perl do with it (it's not legal UTF-8) if I don't do any explicit encoding? For anything not ASCII, do I have to be careful to input it as UTF-8 if it's hard-coded in a Perl program? If I want to look inside a Perl string for some specific content, can I always assume UTF-8, or might it be something else?
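A quick experiment (plain Perl, nothing PDF::API2-specific) bears on several of these questions: an undecoded byte above 0x7F is treated as a character with that code point (i.e., Latin-1 semantics), which is why Latin-1 text happens to survive without decoding while CP-1252 does not:

```perl
use strict;
use warnings;
use Encode qw(decode);

# A byte string containing 0xA0: without any decoding, Perl treats each
# byte as a character, with Latin-1 semantics -- so 0xA0 is already U+00A0.
my $bytes = "a\xA0b";
print ord(substr($bytes, 1, 1)), "\n";               # 160 (U+00A0, no-break space)

# Decoding from Latin-1 yields the same character string...
my $lat1 = decode('ISO-8859-1', "a\xA0b");
print $lat1 eq $bytes ? "same\n" : "different\n";    # same

# ...but the same byte decoded as CP-1252 vs. Latin-1 can differ:
# 0x93 is U+201C (left double quote) in CP-1252, U+0093 (a control) in Latin-1.
printf "%04X\n", ord(decode('cp1252',     "\x93"));  # 201C
printf "%04X\n", ord(decode('ISO-8859-1', "\x93"));  # 0093
```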

Since PDF::API2 knows the internal encoding of the font files it's using (the mapping of Perl internal character representations to the correct glyph), can I assume that everything is handled for me on the output end? There's still the issue of determining whether a desired glyph is found in the specified font, and if not, how we will fall back through a list of fonts to find it. Unless we are writing to a file in a specific non-UTF-8 encoding, we won't have to do any decode()? Also, is Perl 5.8 going to be sufficient to do encoding and decoding, or do we need to require a higher version?
« Last Edit: April 01, 2017, 08:41:03 AM by Phil »

sciurius
Re: Character sets and encoding
« Reply #5: April 02, 2017, 12:55:56 PM »
AFAIK, PDF::API2 does not read user data files.
If it were to read (textual encoded) user data files yes, then it would need to know the encoding.
But in general you pass strings to PDF::API2, and these strings are (should be, must be) in Perl internal encoding.

Perl internal encoding is basically UTF-8, but it falls back to ISO-8859-1 when it can, for efficiency. However, you should never have a reason to care about that.
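That "falls back when it can" behavior is invisible at the language level, which a short experiment can illustrate (plain Perl, hypothetical data):

```perl
use strict;
use warnings;

# Two strings with the same characters, one left in compact single-byte
# storage and one forced to the internal UTF-8 representation, still
# compare equal: the storage format is not part of the string's value.
my $compact  = "caf\x{e9}";        # fits in single-byte storage
my $upgraded = "caf\x{e9}";
utf8::upgrade($upgraded);          # force the multi-byte representation

print $compact eq $upgraded ? "equal\n" : "not equal\n";   # equal
print length($compact), " ", length($upgraded), "\n";      # 4 4
```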

All relevant font operations that work with glyphs have Unicode versions, e.g. encByUni.

Perl version 5.8.3 is considered the minimum for good Unicode handling; 5.8.6 and 5.8.8 are pretty decent. Given that these versions are very old, a requirement of 5.10 or later should not be problematic.

Phil
Re: Character sets and encoding
« Reply #6: April 03, 2017, 09:25:19 AM »
I played around with some test code yesterday, while implementing hyphenation for text-fill operations (soft hyphens are non-ASCII). It doesn't seem too hard to call $string = Encode::decode(MY_ENCODING, $source), to get non-ASCII text into the normal (UTF-8) internal format. I will make a note to beef up the POD documentation to remind users that any strings with non-ASCII characters should be decoded. While true ISO-8859-1 (Latin-1) text is OK to leave undecoded, many people on Windows machines may well end up supplying CP-1252 encoded text strings (a variation of Latin-1), which will cause problems if they use any "Smart Quotes" and other 0x8n and 0x9n characters. It's probably safer to encourage people to always decode, just to get them into the habit.

I don't recall seeing PDF::API2 reading in any files other than font files, so it's probably true that the responsibility for proper encoding will be external to PDF::API2. We also need to keep an eye open for PDF::API2 code using anything other than ASCII text. For example, there was a proposed change for hyphenation, where the soft hyphen was encoded as UTF-8 (two bytes). That's not safe, as someone could give Latin-1 text with a single-byte SHY. Anyway, I have been using the ord() call to get around this for non-ASCII characters — the ordinal value is the same in Latin-1 and UTF-8, whether the character is single byte or multibyte.
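To make that concrete: once the input has been decoded, a soft hyphen is simply the character U+00AD regardless of the source encoding, so line-splitting code can compare characters (or ord() values) directly. A small sketch with made-up data:

```perl
use strict;
use warnings;
use Encode qw(decode);

use constant SHY => chr(0xAD);   # U+00AD SOFT HYPHEN

# The same word containing a soft hyphen, once as Latin-1 bytes and once
# as UTF-8 bytes. After decoding, both are the identical character string.
my $from_latin1 = decode('ISO-8859-1', "hy\xADphen");
my $from_utf8   = decode('UTF-8',      "hy\xC2\xADphen");

print $from_latin1 eq $from_utf8 ? "same\n" : "different\n";   # same

# So splitting logic can just search for the character:
my $pos = index($from_latin1, SHY);
print "$pos ", ord(substr($from_latin1, $pos, 1)), "\n";       # 2 173
```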

Anyway, with your prodding me in the right direction, I think I now understand the ins and outs of what character encodings Perl (and PDF::API2) is using. And it would probably be a good idea to encourage use of Perl 5.10 or higher, especially if someone wants to use non-ASCII text.

sciurius
Re: Character sets and encoding
« Reply #7: April 04, 2017, 03:35:28 AM »
Quote
While true ISO-8859-1 (Latin-1) text is OK to leave undecoded, many people on Windows machines may well end up supplying CP-1252 encoded text strings (a variation of Latin-1), which will cause problems
You hit the nail on the head...

Quote
For example, there was a proposed change for hyphenation, where the soft hyphen was encoded as UTF-8 (two bytes).
Link?

Phil
Re: Character sets and encoding
« Reply #8: April 04, 2017, 11:25:05 AM »
Quote
Link?

First link in RT 98548: https://www.catskilltech.com/forum/feature-requests/rt-98548-hooks-for-line-splitting/ . Unfortunately, the link seems to have evaporated when Steve left Bitbucket. I do recall it mentioning the two-byte sequence \xC2\xAD, which is UTF-8 for a soft hyphen.

sciurius
Re: Character sets and encoding
« Reply #9: April 05, 2017, 02:23:10 AM »
Yet another issue that is nonexistent provided the input data is properly decoded.

sciurius
Re: Character sets and encoding
« Reply #10: April 05, 2017, 02:28:18 AM »
With regard to glyph detection: I must admit I haven't yet been able to find out what calls to use to determine whether a particular glyph is present in a font.
I tried PDF::API2 font operations (e.g., uniByMap etc.) as well as Font::TTF calls.
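For what it's worth, with the standalone Font::TTF CPAN module (not PDF::API2's bundled font code), the Unicode cmap subtable can be queried directly. A sketch, assuming Font::TTF is installed and the caller supplies the font file path:

```perl
use strict;
use warnings;

# Sketch: ask a TrueType font whether it maps a Unicode code point to a
# real glyph, via Font::TTF's cmap table. Glyph ID 0 is the .notdef
# glyph, i.e. "not present in this font".
sub glyph_in_font {
    my ($fontfile, $codepoint) = @_;
    require Font::TTF::Font;
    my $font = Font::TTF::Font->open($fontfile)
        or die "Can't open font file $fontfile";
    my $cmap = $font->{cmap}->read->find_ms;   # preferred Unicode subtable
    my $gid  = $cmap ? $cmap->{val}{$codepoint} : undef;
    $font->release;
    return defined($gid) && $gid != 0;
}

# e.g. fall back through a CSS-like font-family list:
# my ($usable) = grep { glyph_in_font($_, 0xAD) } @font_files;
```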

Phil
Re: Character sets and encoding
« Reply #11: April 05, 2017, 08:59:38 AM »
Quote
Yet another issue that is nonexistent provided the input data is properly decoded.

Not quite. If I'm looking to match a soft hyphen, the code will have to know whether it's to look for ISO-8859-1 (single byte xAD) or UTF-8 (double byte xC2xAD). Or, I can take the ord() of a character and see if it's 173. Still a bit messy.

Regarding glyph detection, maybe take a walk through the examples/ directory and look at the code that prints out all the glyphs in a font (e.g., 020_corefonts). That code seems to be able to tell if a glyph is defined for a given code point.

sciurius
Re: Character sets and encoding
« Reply #12: April 05, 2017, 03:48:45 PM »
Quote
Not quite. If I'm looking to match a soft hyphen, the code will have to know whether it's to look for ISO-8859-1 (single byte xAD) or UTF-8 (double byte xC2xAD)

You are thinking bytes again  :).
Once the string has been decoded into Perl internal encoding, there is no ISO-8859-1 to search for. You search for "­" (which is an actual soft hyphen between quotes but it seems invisible) or "\x{ad}". That the internal encoding consists of a single byte \xad or multi-byte \xc2\xad is irrelevant. Perl takes care of that.
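To make the point concrete (plain Perl, made-up data): pattern matching operates on characters, not on the internal storage, so the same pattern finds the soft hyphen in both of the strings below:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Two decoded strings with identical characters; however Perl chooses to
# store them internally, \x{ad} matches the soft hyphen in both.
my $was_latin1 = decode('ISO-8859-1', "co\xADop");
my $was_utf8   = decode('UTF-8',      "co\xC2\xADop");

for my $s ($was_latin1, $was_utf8) {
    print $s =~ /\x{ad}/ ? "found\n" : "not found\n";   # found
}
```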

Quote
Regarding glyph detection, maybe take a walk through the examples/ directory and look at the code that prints out all the glyphs in a font (e.g., 020_corefonts). That code seems to be able to tell if a glyph is defined for a given code point.

Apparently for the built-in corefonts only. Can you get it to work with an arbitrary TT font?

Phil
Re: Character sets and encoding
« Reply #13: April 05, 2017, 04:30:36 PM »
So I can give a Latin-1 value (UTF-8 ordinal?) as a hex value, and Perl will promote it to UTF-8 multibytes if I'm comparing against a UTF-8 string? Cool! I'll have to try it out some day. Or do both strings have to be the same encoding, even constants being used in a comparison?

Have you tried 022_truefonts? It is supposed to take a single .ttf file as an argument. It might have still been broken in 3.001, but I know I fixed it in 3.002.

sciurius
Re: Character sets and encoding
« Reply #14: April 06, 2017, 03:31:56 PM »
If by "UTF-8 ordinal" you mean the Unicode code point, yes. UTF-8 is an encoding and does not have ordinal values.
When using constant strings, these are automatically decoded into Perl internal encoding when you put "use utf8;" near the beginning of the source file(s).
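A minimal demonstration of that last point, assuming the source file itself is saved as UTF-8:

```perl
use strict;
use warnings;
use utf8;   # this source file is UTF-8; string literals are decoded automatically

# With "use utf8;" in effect, a literal é in the source is one character
# (code point U+00E9), not the two bytes of its UTF-8 encoding.
my $s = "café";
print length($s), "\n";                    # 4
printf "U+%04X\n", ord(substr($s, -1));    # U+00E9
```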