
CTS 9 - Non-latin1 characters missing with corefonts

sciurius
CTS 9 - Non-latin1 characters missing with corefonts
« June 22, 2017, 04:45:51 AM »
When using corefonts (Times-Roman et al.), non-latin1 characters are missing in the PDF even though they are (or should be) present in the font. See the attached program for an example.

<add CTS 9 to subject - Mod.>
« Last Edit: July 07, 2017, 01:27:17 PM by Phil »

Phil
Re: CTS 9 - Non-latin1 characters missing with corefonts
« Reply #1: June 22, 2017, 12:09:38 PM »
When 020_corefonts is run for Times-Roman and latin1 encoding, t-caron and c-caron do show up in the listing (as does e-acute), so they are in the font. I'm not sure if this is times.ttf (I had to change the name in the fourth test from Times-Roman.ttf) or another file. e-acute is proper Latin-1/Windows-1252, while t-caron and c-caron are actually Latin Extended-A; they just happen to be in the Times-Roman font and available under "latin1". It's not clear exactly what constitutes "Latin-1" in the eyes of a font designer, but in most fonts used with PDF it seems to be close to Windows-1252 (Latin-1 plus Smart Quotes), plus a few odd characters and ligatures. Times-Roman under utf8 encoding shows similar glyphs, though at different code points; the two sets are disjoint. All three characters are found in ISO-8859-2, which presumably is why you tried that single-byte encoding.
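
These encoding memberships are easy to check mechanically. Here is a quick sketch in Python (used only because its codec tables make the check self-contained; the facts themselves are language-independent), confirming which single-byte encodings cover each of the three characters:

```python
# Which single-byte encodings can represent each character under discussion?
chars = {"e-acute": "\u00e9", "t-caron": "\u0165", "c-caron": "\u010d"}

for name, ch in chars.items():
    for enc in ("latin-1", "cp1252", "iso8859-2"):
        try:
            ch.encode(enc)
            ok = "yes"
        except UnicodeEncodeError:
            ok = "no"
        print(f"{name} (U+{ord(ch):04X}) in {enc}: {ok}")
```

e-acute (U+00E9) encodes in all three; t-caron (U+0165) and c-caron (U+010D) encode only in ISO-8859-2, matching the observation that they are Latin Extended-A rather than Latin-1.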

In your example file you have use utf8;, so presumably all valid multibyte sequences are being treated as UTF-8 characters. If I remove use utf8;, all special characters (except in the third example, ISO-8859-2) are displayed as pairs of Latin-1 bytes (and UTF-8 shows "tofu"), which is not surprising. I see that when you give -encode => "UTF-8" in the second example, the e-acute is "tofu" (invalid in some way), as are t-caron and c-caron. That doesn't seem right. However, under the default (latin1) encoding in the first example, the UTF-8 source e-acute is recognized. I'm not surprised that t-caron and c-caron are not recognized under the default latin1 encoding, as they are not properly Latin-1.

  • Example 1, corefonts with default (latin1) encoding: I don't think it should be considered a problem if non-Latin-1 characters (t-caron and c-caron) don't show up when using "latin1" encoding, even though they are defined in the font.
  • Example 2, corefonts with explicit "UTF-8" encoding: none of the non-ASCII special characters are recognized. Same with "utf8" encoding. Is "-encode" supposed to be used when the input stream is already UTF-8? I would expect all three special characters to be recognized under UTF-8. First encoding to 'utf8' doesn't help; all three characters are then tofu'd.
  • Example 3, corefonts with input first converted to Latin-2, and displayed as encoded Latin-2, works. The UTF-8 special characters are all recognized by encode.
  • Example 4, ttffont with default encoding, works.
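
The "pairs of Latin-1 bytes" behavior and Example 3's Latin-2 conversion can both be reproduced outside the library. A small Python sketch (again just demonstrating the byte-level mechanics, not PDF::Builder's behavior):

```python
# 1. Misreading UTF-8 bytes as single Latin-1 bytes produces character pairs.
for ch in ("\u00e9", "\u0165", "\u010d"):     # e-acute, t-caron, c-caron
    utf8 = ch.encode("utf-8")                 # two bytes each
    print(f"{ch!r}: UTF-8 bytes {utf8!r} misread as Latin-1 -> {utf8.decode('latin-1')!r}")

# 2. Example 3's approach: transcode the Unicode string to Latin-2 bytes,
#    which a single-byte "latin2" encoding can then display directly.
s = "\u00e9\u0165\u010d"
latin2 = s.encode("iso8859-2")
assert latin2.decode("iso8859-2") == s        # round-trips cleanly
print(latin2)
```

Each of the three characters becomes one byte in Latin-2, which is why Example 3 works where the raw UTF-8 stream does not.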

Should this be treated as a bug of some sort? That is, correct UTF-8 sequences are not being recognized under default and UTF-8 encoding? Is there an official definition of what you should get with UTF-8 input (and use utf8) and various encodings? I'm not sure why 020_corefonts shows t-caron and s-caron under "latin1" encoding, but perhaps it has something to do with the input stream not being true UTF-8, but built character-by-character. I don't see a problem with corefonts rejecting characters which are not true Latin-1, when "latin1" encoding is specified, but it needs to be consistent with other encodings and inputs.

sciurius
Re: CTS 9 - Non-latin1 characters missing with corefonts
« Reply #2: June 22, 2017, 04:58:23 PM »
Quote
Example 1, corefonts with default (latin1) encoding: I don't think it should be considered a problem if non-Latin-1 characters (t-caron and c-caron) don't show up when using "latin1" encoding, even though they are defined in the font.

Are you sure the default encoding is Latin1? And if so, shouldn't this be documented?
(Actually I expect it to be WinAnsiEncoding.)

Quote
Example 2, corefonts with explicit "UTF-8" encoding, none of the non-ASCII special characters are recognized. Same with "utf8" encoding. Is "-encode" supposed to be used when the input stream is already UTF-8? I would expect all three special characters to be recognized under UTF-8.

So do I.

Quote
Should this be treated as a bug of some sort? That is, correct UTF-8 sequences are not being recognized under default and UTF-8 encoding?

It surely does not match my expectations.

Quote
I'm not sure why 020_corefonts shows t-caron and s-caron under "latin1" encoding,

Latin1 encoding is a single-byte encoding and therefore has only characters with ordinals 0 through 255. As you can see from the source, the characters are written using uniByEnc, which seems to return code points for values > 255. t-caron et al. are above 255, hence not part of any single-byte encoding.

Quote
I don't see a problem with corefonts rejecting characters which are not true Latin-1, when "latin1" encoding is specified, but it needs to be consistent with other encodings and inputs.

Exactly. If the default encoding of a corefont is latin1 (which I doubt) then the behaviour is explainable. However, if encoding UTF-8 is valid for a corefont (which I do not know for sure) then I expect example 2 to work.

The bottom line is that currently it is not possible to use many characters with corefonts, even though these characters are available in the font.

Phil
Re: CTS 9 - Non-latin1 characters missing with corefonts
« Reply #3: July 02, 2017, 08:24:49 AM »
Just to touch base on this, I've been looking into whether the text_* methods properly display the given encodings. I'm trying to make a utility to do this in general, and have run into some problems:
  • Steve just made some changes to suppress circular references ("weaken"), and they seem to have broken font handling. I need to get some definite test cases before I open a bug report. So far, commenting out all the new weaken statements seems to work.
  • A number of allowed latin* encodings seem to work, but I'm having problems with UTF-8. Even though the core fonts have Unicode values in them, they don't seem to want to show anything for -encode=>'utf8'. I'm still looking at it. There are claims on StackExchange that corefonts simply don't do UTF-8 — if so, perhaps "utf8" should be removed from the allowed encodings.
While 020_corefonts apparently uses the text() method, it may be doing some strange under-the-covers stuff. I'm still looking into it.

Phil
Re: CTS 9 - Non-latin1 characters missing with corefonts
« Reply #4: July 07, 2017, 11:41:35 AM »
I've been busy with numerous changes flowing over from the official PDF::API2 v2.033, as well as trying to get PDF::Builder in shape to release, but hope to get back to this issue real soon.

I notice there is a file PDF/API2/Resource/uniglyph.txt, which lists all the Unicode glyphs. Besides being severely outdated, it has some odd assignments. For instance, it lists a Euro under both 0x0080 and 0x20AC. The latter is the correct Unicode point, while the former is the Windows (CP-1252) Smart Quotes point. In Unicode, the 000x, 001x, 008x, and 009x rows are all supposed to be control characters, but uniglyph.txt lists printable glyphs for them. I don't know yet if that's why they show up under "latin1" encoding, but I'm wondering if this should all be brought up to standard (e.g., "latin1" showing no glyphs in the 0x, 1x, 8x, and 9x rows), with a "cp1252" encoding added to provide Latin-1 plus Smart Quotes.
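
The Euro mix-up is a concrete case of the Latin-1 versus CP-1252 confusion, and is simple to verify (Python used here only as a neutral way to consult the mapping tables):

```python
# The Euro sign is U+20AC in Unicode, and byte 0x80 in Windows-1252 (CP-1252).
euro = "\u20ac"
assert euro.encode("cp1252") == b"\x80"

# Strict ISO-8859-1 (Latin-1) has no Euro at all; byte 0x80 there is the
# C1 control character U+0080, not a printable glyph.
no_euro = False
try:
    euro.encode("latin-1")
except UnicodeEncodeError:
    no_euro = True
assert no_euro
assert b"\x80".decode("latin-1") == "\u0080"
```

So a uniglyph.txt entry putting a printable Euro at 0x0080 is a CP-1252 assignment leaking into what is nominally Unicode/Latin-1 territory.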

Thoughts? Would you expect 'latin1' encoding (when specified) to show glyphs only in 2x-7x and Ax-Fx, and 'utf8' to reserve 0x-1x and 8x-9x for controls, or might this break a lot of applications (admittedly, poorly coded)?

Phil
Re: CTS 9 - Non-latin1 characters missing with corefonts
« Reply #5: December 25, 2017, 01:39:33 PM »
 PhilterPaper commented on Jul 23

I've been doing some research on this, and most sources claim that Adobe core fonts are simply incompatible with UTF-8 (e.g., see https://github.com/dompdf/dompdf/wiki/About-Fonts-and-Character-Encoding and http://www.perlmonks.org/?node_id=954373). Apparently they are expecting a single byte stream (limited to 256 - 32 glyphs)... see the PDF object mapping ordinals 0..255 to the glyph identifiers. It is possible to select a "plane" of 223 characters + space, and I give an example in the revised missing.pl. However, to switch back and forth among planes is quite clumsy (unless you just have one or two such characters on a page). We may just have to accept that core fonts and UTF-8 are incompatible, unless we do something in the code to automatically remap UTF-8 characters to single byte characters on a plane. That could really bulk up the size of a PDF file if it has to constantly be resetting the font!
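
The plane idea can be sketched abstractly. The following Python is purely hypothetical (build_planes is an invented helper for illustration, not PDF::Builder's automap()): it packs arbitrary Unicode characters into 224-slot single-byte planes, where each plane switch in the output would correspond to a font reset in the PDF.

```python
PLANE_SIZE = 256 - 32            # 224 usable slots per plane (0x20..0xFF)

def build_planes(text):
    """Assign each distinct character a (plane, byte) slot and return
    the plane tables plus the remapped per-character byte stream."""
    slots = {}                   # char -> (plane_index, byte_value)
    planes = [[]]                # planes[i] lists the chars on plane i
    stream = []                  # (plane_index, byte_value) per input char
    for ch in text:
        if ch not in slots:
            if len(planes[-1]) == PLANE_SIZE:
                planes.append([])
            slots[ch] = (len(planes) - 1, 0x20 + len(planes[-1]))
            planes[-1].append(ch)
        stream.append(slots[ch])
    return planes, stream

planes, stream = build_planes("\u00e9\u0165\u010d\u00e9")
# A font switch would be needed wherever consecutive characters sit on
# different planes; this short sample fits entirely on plane 0.
switches = sum(1 for a, b in zip(stream, stream[1:]) if a[0] != b[0])
print(len(planes), switches)
```

With text spanning more than 224 distinct glyphs, every plane boundary in the stream would force a font reset, which is exactly the PDF bloat concern mentioned above.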

See examples/020_corefonts for an example of using multiple planes for a font to support more than 256 - 32 glyphs. PSFonts probably have the same limitation, but TTFonts apparently are happy to handle UTF-8. Does anyone else have useful information?

missing.pl

 PhilterPaper commented on Aug 25

I am in the process of updating the POD (documentation) to show how corefonts and psfonts are limited to single byte encodings, and are incompatible with UTF-8 and other multibyte encodings. Users with extended character range needs will probably be encouraged to use TTF/OTF, which is UTF-8 compatible. The use of automap() for multiple planes will be documented, if you care to go that route. Very likely, this issue will be closed without further action, although I will leave it open for now until the new documentation, etc. settles down.

 PhilterPaper commented on Sep 24

The documentation (POD) for corefonts and Type1 (PS) fonts has been updated to reflect the limit to single-byte encoding, and how to use automap() to access font glyphs that are not in the standard encoding chosen. The documentation (POD) for TrueType (and OpenType) fonts has been updated to discuss the UTF-8 compatibility of this font type.

 PhilterPaper closed this on Sep 24