CTS 16 - Consistent code point handling across font types

Offline Phil

« December 25, 2017, 03:27:48 PM »
 PhilterPaper commented on Nov 3

Ref RT 120048/#47

Core fonts and Type1 fonts are currently limited to single-byte encodings, and use the automap() method to map their glyphs over multiple planes (of up to 256 glyphs each). It would be good to extend them in some way to handle UTF-8 text, so that one would not need to constantly switch between subfonts (planes) to see and use all the glyphs in a font (see 020_corefonts, 021_psfonts, 021_synfonts). Is there any way to natively use UTF-8 with these font types? We want to avoid automatically running automap() and switching planes under the covers, as this would be very bulky and slow. Also, automap() does not guarantee that the same code point will map to the same glyph across different versions of a given font file!

On the other hand, TrueType and OpenType fonts are UTF-8 ready, but utilities such as 021_synfonts need to be extended to show glyphs beyond the first page (plane 0). 022_truefonts shows plane 0 per the encoding, but everything else is listed by CID (Character ID), arranged by Unicode point. Perhaps automap() could be written to handle this? We want 021_synfonts to display all glyphs for a TrueType font.

The idea is to get consistent text handling, regardless of what kind of font (core, Type1, TrueType, etc.) happens to be used. If you're content to stay in a single-byte encoding, you can do that (although automap() should continue to be supported for legacy purposes). If you want to use UTF-8 with core or Type1 fonts, to seamlessly access all glyphs by Unicode point, you should be able to do that.

 PhilterPaper commented on Nov 15

In Type1 (and possibly core) fonts, there is Unicode point information, so in theory we can determine the glyph number (GID, G+nnn) for any desired Unicode character at document creation time. However, the current output mechanism is based on a map from a single byte to a glyph name, so a different mechanism would have to be found to make use of it.
« Last Edit: May 12, 2019, 08:07:41 PM by Phil »

Offline Phil

Re: CTS 16 - Consistent code point handling across font types
« Reply #1: May 06, 2021, 03:07:03 PM »
For the single-byte encodings that Core and Type1 routines support, UTF-8 support could be added with a glue layer that builds one or more custom encoding tables per PDF (or per page). Thus, a given font file might end up with two or more subfonts, each with an encoding table of up to 220+ glyphs (Unicode points), and the glue layer would select which subfont to use on the fly. This of course could mean frequent switching of font objects on a given page, but that would be the price to pay for allowing UTF-8 text (with each character mapped to one of the subfonts).
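
To make the glue layer idea concrete, here is a rough Python sketch (not PDF::Builder code, and not how the library works today). It hands each new character the next free byte slot, and splits a string into runs that can each be written with a single subfont. The class name `SubfontEncoder`, the capacity of 224 slots, and the choice to start at byte 32 are all assumptions made up for illustration.

```python
from typing import Dict, List, Tuple


class SubfontEncoder:
    """Hypothetical glue layer: Unicode text -> (subfont index, bytes) runs."""

    def __init__(self, capacity: int = 224, first_slot: int = 32):
        # Assumption: byte values below first_slot (control characters)
        # are left unused, which gives the "220+" usable slots per subfont.
        if first_slot + capacity > 256:
            raise ValueError("slots must fit in a single byte")
        self.capacity = capacity
        self.first_slot = first_slot
        self.slots: Dict[str, Tuple[int, int]] = {}  # char -> (subfont, byte)

    def lookup(self, ch: str) -> Tuple[int, int]:
        """Return (subfont, byte) for ch, assigning the next free slot if new."""
        if ch not in self.slots:
            n = len(self.slots)
            self.slots[ch] = (n // self.capacity,
                              self.first_slot + n % self.capacity)
        return self.slots[ch]

    def encode(self, text: str) -> List[Tuple[int, bytes]]:
        """Split text into runs; each run uses exactly one subfont."""
        runs: List[Tuple[int, bytearray]] = []
        for ch in text:
            subfont, byte = self.lookup(ch)
            if not runs or runs[-1][0] != subfont:
                runs.append((subfont, bytearray()))  # font switch needed here
            runs[-1][1].append(byte)
        return [(subfont, bytes(buf)) for subfont, buf in runs]
```

Each boundary between runs is one font switch in the page content, so `len(enc.encode(text)) - 1` gives a quick estimate of the switching cost for a piece of text.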

Matters might be improved by building the subfont table(s) for each page, rather than globally. Fill up one subfont encoding table, then start the next one from scratch (empty). The second table would probably largely overlap the first (especially the ASCII content), but very little switching back and forth between subfonts would be needed. Hopefully, no more than one or two subfont tables would be needed for a page, and maybe for an entire document, assuming you're not doing something like a font dump (ttfont, synfont, etc.).
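
One possible reading of the per-page scheme, sketched in Python: when the current table is full, open a new empty one and place every later character in it, even if an earlier table already holds that character. The function name `page_tables` and the default capacity of 224 are assumptions for illustration only.

```python
from typing import List


def page_tables(page_text: str, capacity: int = 224) -> List[List[str]]:
    """Build the subfont tables for one page, filling one before the next.

    Once a table is full, a new empty table is opened and every later
    character is added to it as needed (including characters that an
    earlier table already holds), so output only moves forward from one
    subfont to the next and never has to switch back.
    """
    tables: List[List[str]] = [[]]
    for ch in page_text:
        current = tables[-1]
        if ch in current:
            continue  # already encodable with the current subfont
        if len(current) == capacity:
            current = []  # start the next table from scratch
            tables.append(current)
        current.append(ch)
    return tables
```

With this approach the number of font switches on a page is just `len(tables) - 1`, at the cost of duplicating common characters (such as ASCII) across tables.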

I think that UTF-8-to-single-byte mapping tables would be "first come, first served", rather than trying to maintain Unicode (or Latin-1) ordering over any part of them. That way you could fill 256 glyphs per subfont, packing each table with whatever points come next in the text, so that frequent swapping isn't needed. Even fairly pathological cases such as font dumps, while needing perhaps dozens of subfont tables, are feasible.
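
As a rough illustration of why first come, first served packs better than keeping Unicode order, the Python sketch below counts how many tables a piece of text would need under each scheme. Both function names are made up for this example, and the Unicode-ordered variant (one table per aligned block of 256 code points) is only an assumed stand-in for "maintaining Unicode ordering".

```python
def tables_by_code_point(text: str, size: int = 256) -> int:
    """Tables needed if slots follow Unicode order (one table per block)."""
    return len({ord(ch) // size for ch in text})


def tables_first_come(text: str, size: int = 256) -> int:
    """Tables needed if each new character simply takes the next free slot."""
    distinct = len(set(text))
    return -(-distinct // size)  # ceiling division


sample = "A\u03a9\u20ac"  # Latin A, Greek Omega, Euro sign
print(tables_by_code_point(sample))  # 3: each character is in a different block
print(tables_first_come(sample))     # 1: three characters fit in one table
```

By the same arithmetic, a dump of a font with 5000 glyphs would need 20 tables of 256 slots each, which fits the "dozens" estimate above.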