CTS 23 - Character sets and encoding

Offline Phil

Re: CTS 23 - Character sets and encoding
« Reply #15: September 08, 2017, 10:49:28 PM »
The file PDF::Builder::Resource::uniglyph.txt contains the mapping of Unicode ordinals to glyph names, and vice-versa. It is used to update whenever it is changed. Most of it looks OK, but there are some questionable areas.

In the range 0x00 – 0x1F there are a number of combining accent marks. The 0x02__ entries are listed as "spacing modifier letters". The 0x03__ entries are listed as "combining" diacritical marks in Unicode, but either should be usable for PDF, unless someone wants to write code to position the accent mark over or under the adjoining letter! Pick your poison.

  • 0x01  acute  ´  should be 0xB4 or 0x02CA (uniglyph.txt lists only as Mandarin Chinese second tone, not as acute) or 0x0301
  • 0x02  caron  ̌   (a.k.a. hacek) should be 0x02C7 or 0x030C
  • 0x03  circumflex  ˆ  should be 0x5E or 0x02C6 or 0x0302
  • 0x04  dieresis  ¨  should be 0xA8 or 0x0308
  • 0x05  grave  ̀  should be 0x60 or 0x02CB (uniglyph.txt lists only as Mandarin Chinese fourth tone, not as grave) or 0x0300
  • 0x06  macron  ¯  should be 0xAF or 0x02C9 (uniglyph.txt lists only as Mandarin Chinese first tone, not as macron)  or 0x0304
  • 0x07  ring  ̊  should be 0x02DA or 0x030A
  • 0x08  tilde  ˜  should be 0x7E or 0x02DC or 0x0303
  • 0x09  breve  ̆  should be 0x02D8 or 0x0306
  • 0x0A  ogonek  ̨  should be 0x02DB or 0x0328
  • 0x0B  dotaccent  ˙  should be 0x02D9 or 0x0307
  • 0x0C  hungarumlaut  ̋  (a.k.a. double acute accent) should be 0x02DD or 0x030B
  • 0x0D  cedilla  ¸  should be 0xB8 or 0x0327
  • 0x0E  dblgrave  ̏  should be 0x030F
  • 0x1E  dotlessi  ı  should be 0x0131
  • 0x1F  dotlessj  ȷ  should be 0x0237 (is currently missing from uniglyph.txt)
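
To make the "spacing" versus "combining" distinction concrete, here is a minimal Python sketch (not part of PDF::Builder, which is Perl) that asks the Unicode database about a few of the candidate code points listed above. The `CANDIDATES` table and the `is_combining` helper are names made up for this illustration; only the standard `unicodedata` module is used.

```python
import unicodedata

# A few glyph names and their candidate code points, copied from the list above.
CANDIDATES = {
    "acute": [0x00B4, 0x02CA, 0x0301],
    "caron": [0x02C7, 0x030C],
    "circumflex": [0x005E, 0x02C6, 0x0302],
    "ogonek": [0x02DB, 0x0328],
}

def is_combining(cp):
    """True if the code point has a nonzero canonical combining class."""
    return unicodedata.combining(chr(cp)) != 0

for glyph, cps in CANDIDATES.items():
    for cp in cps:
        kind = "combining" if is_combining(cp) else "spacing"
        print(f"{glyph:12} U+{cp:04X}  {kind:9}  {unicodedata.name(chr(cp))}")
```

The 0x03__ candidates report as combining; the 0x02__ and Latin-1 candidates report as spacing.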

I haven't tested it, but I would assume that it would be a manual operation to build up an accented character by putting down the base letter and then aligning the accent mark over it. See the discussion on compositing accented letters.
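
As a rough idea of what that manual alignment involves, the following Python sketch computes the horizontal shift needed to center a spacing accent over a base letter. It assumes both glyphs have a known advance width in the same units and that each glyph's ink is centered in its advance; the widths below are hypothetical, not taken from any real font, and `accent_x_offset` is a name invented for this example.

```python
def accent_x_offset(base_width, accent_width):
    """Horizontal shift that centers an accent over a base letter.

    Both arguments are advance widths in the same units (e.g. 1/1000 em).
    A negative result means the accent is wider than the base letter.
    """
    return (base_width - accent_width) / 2.0

# Hypothetical widths for illustration only.
base_width = 500     # assumed width of the base letter
accent_width = 333   # assumed width of the accent glyph

dx = accent_x_offset(base_width, accent_width)
print(f"draw base at x, then draw accent at x + {dx}")
```

Vertical placement (for example, raising an accent over a capital letter) would need font metrics as well and is not covered here.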

In the range 0x80 – 0x9F there are a number of MS "Smart Quotes". Note that this is the set from CP-1252, and doesn't match up very well with other Smart Quotes-containing pages.

  • 0x80  Euro  €  should be 0x20AC
  • 0x81  bullet  •  should be 0x2022 (unassigned)
  • 0x82  quotesinglbase  ‚  should be 0x201A
  • 0x83  florin  ƒ  should be 0x0192
  • 0x84  quotedblbase  „  should be 0x201E
  • 0x85  ellipsis  …  should be 0x2026
  • 0x86  dagger  †  should be 0x2020
  • 0x87  daggerdbl  ‡  should be 0x2021
  • 0x88  circumflex  ˆ  should be 0x5E or 0x02C6 or 0x0302
  • 0x89  perthousand  ‰  should be 0x2030
  • 0x8A  Scaron  Š  should be 0x0160
  • 0x8B  guilsinglleft  ‹  should be 0x2039
  • 0x8C  OE  Œ  should be 0x0152
  • 0x8D  bullet  •  should be 0x2022 (unassigned)
  • 0x8E  Zcaron  Ž  should be 0x017D
  • 0x8F  bullet  •  should be 0x2022 (unassigned)
  • 0x90  bullet  •  should be 0x2022 (unassigned)
  • 0x91  quoteleft  ‘  should be 0x2018
  • 0x92  quoteright  ’  should be 0x2019
  • 0x93  quotedblleft  “  should be 0x201C
  • 0x94  quotedblright  ”  should be 0x201D
  • 0x95  bullet  •  should be 0x2022
  • 0x96  endash  –  should be 0x2013
  • 0x97  emdash  —  should be 0x2014
  • 0x98  tilde  ˜  should be 0x7E or 0x02DC or 0x0303
  • 0x99  trademark  ™  should be 0x2122
  • 0x9A  scaron  š  should be 0x0161
  • 0x9B  guilsinglright  ›  should be 0x203A
  • 0x9C  oe  œ  should be 0x0153
  • 0x9D  bullet  •  should be 0x2022 (unassigned)
  • 0x9E  zcaron  ž  should be 0x017E
  • 0x9F  Ydieresis  Ÿ  should be 0x0178
Note that a "bullet" is used for a number of unassigned positions. Its "true" position is 0x95.
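
For anyone who wants to cross-check the list above, this Python sketch rebuilds the 0x80 – 0x9F mapping from Python's own `cp1252` codec and reports which bytes that codec leaves unassigned. It is only a sanity check against an independent CP-1252 table, not a reading of uniglyph.txt; the function name `cp1252_to_unicode` is made up for this example.

```python
def cp1252_to_unicode(first=0x80, last=0x9F):
    """Map each byte in [first, last] to its Unicode code point under CP-1252.

    Returns (mapping, unassigned), where unassigned lists the bytes that
    Python's cp1252 codec refuses to decode.
    """
    mapping = {}
    unassigned = []
    for byte in range(first, last + 1):
        try:
            mapping[byte] = ord(bytes([byte]).decode("cp1252"))
        except UnicodeDecodeError:
            unassigned.append(byte)
    return mapping, unassigned

mapping, unassigned = cp1252_to_unicode()
for byte, cp in mapping.items():
    print(f"0x{byte:02X} -> U+{cp:04X}")
print("unassigned:", [f"0x{b:02X}" for b in unassigned])
```

The unassigned bytes it reports should line up with the positions that uniglyph.txt fills with "bullet".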

Now, PDF does not make use of any of the control characters in these two ranges, either as input to PDF::Builder, or as output in the PDF. It is conceivable that some control characters from the lower range might be implemented at some point (tab and end-of-line/break come to mind), but those are not critical. The upper range maps pretty closely to MS Smart Quotes (WinAnsiEncoding), which many may already be using for their input (although it's not truly Latin-1, and varies slightly from code page to code page).

So, would moving the characters and accent marks in these two ranges break existing applications? Are they used commonly enough to warrant leaving them in the first 256 positions (standard encodings), even though their Unicode points are wrong? All of the characters in the second range are also found in their "correct" Unicode positions, so it shouldn't matter which you use in your text. However, there is the question of what Encode will do with any 0x8_ and 0x9_ characters it finds when attempting to convert to UTF-8. If the input text's encoding is correctly given, and not assumed to be Latin-1, it might work.
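
The Latin-1 versus CP-1252 question can be illustrated outside Perl. The Python sketch below decodes the same two "smart quote" bytes both ways and then encodes each result to UTF-8. It shows the general principle only; it is not a test of what Perl's Encode module does.

```python
raw = b"\x93hi\x94"  # "hi" wrapped in CP-1252 smart double quotes

as_latin1 = raw.decode("latin-1")  # 0x93/0x94 become C1 control characters
as_cp1252 = raw.decode("cp1252")   # 0x93/0x94 become U+201C/U+201D

print([f"U+{ord(c):04X}" for c in as_latin1])
print([f"U+{ord(c):04X}" for c in as_cp1252])

# The UTF-8 output differs accordingly.
print(as_latin1.encode("utf-8"))
print(as_cp1252.encode("utf-8"))
```

With the Latin-1 assumption the quotes survive only as control characters U+0093 and U+0094, so the declared input encoding decides whether the "correct" Unicode points are ever reached.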
« Last Edit: May 12, 2019, 09:09:24 PM by Phil »


Offline sciurius

Re: CTS 23 - Character sets and encoding
« Reply #16: September 10, 2017, 03:54:06 PM »
I think this uniglyph.txt predates the Unicode support that was added to PDF::API2. It seems to be an attempt to bring (a basic form of) Unicode encoding to single-byte fonts. I wonder if this data is actually used anymore.
(It is definitely not used for TT/OTF fonts.)

My advice: remove the offending codes and see if something breaks.
« Last Edit: May 12, 2019, 09:09:45 PM by Phil »