[RT 128674] error "requested cmap '' not installed" with many CJK fonts

*

Offline Phil

  • Global Moderator
  • Hero Member
  • *****
  • 513
    • View Profile
Quote
the .cmap files should be updated to their latest revision, since they are still needed to support legacy fonts and CFF fonts without an internal cmap. AFAIK each revision builds on top of the former, so higher revisions should be compatible with lower revisions apart from the actual glyph counts.

OK, I'll take a look at whether I can generate new .cmap files from the online sources. You wouldn't happen to still have any utilities for doing this? Are the four .cmap files sufficient (when updated), or should we support others? If so, it would probably be better just to ship some generator utility and let users create their own, rather than bloating PDF::Builder with lots of new .cmap files.

Quote
i have looked at the format 14 spec, and it deals largely with variant glyphs, which could affect CJK and/or Arabic and probably some others, but i am not an expert on this.

Indic languages (which do a lot of ligatures, character substitutions, and moving stuff around) might also have a major impact. One thing on my plate is RT 113700, which requires implementing GSUB and GPOS.

*

Offline Phil

Quote
OK, I'll take a look at whether I can generate new .cmap files from the online sources. You wouldn't happen to still have any utilities for doing this? Are the four .cmap files sufficient (when updated), or should we support others? If so, it would probably be better just to ship some generator utility and let users create their own, rather than bloating PDF::Builder with lots of new .cmap files.
my former perl utilities are all lost to bit-rot or hd-failures.

you should get away with having only to update those already present.

Quote
Indic languages (which do a lot of ligatures, character substitutions, and moving stuff around) might also have a major impact. One thing on my plate is RT 113700, which requires implementing GSUB and GPOS.
the only code i know of that implements this correctly is "harfbuzz" or "libicu" (both C/C++).

*

Offline Phil

bobh0303 (Bob Hallissy) said:
Quote
    so actually two [sic] things need to happen (in that order)

    1 - patch to builder and API2 to switch to font cmap if cff encoding is identity (as above).
    2 - patch Font::TTF::Cmap find_ms to prefer (platform:encoding) of 0:4 instead of 3:10/3:1.
    3 - patch Font::TTF::Cmap to support format 14 cmaps and add preference for 0:5 above all.

Re (3):

We're open to proposals (or PRs) for how to support format 14 cmaps. The immediate problems are that

A) the current interface assumes a cmap is a mapping from a codepoint to a glyph:
    val
        A hash keyed by the codepoint value (not a string) storing the glyph id

Format 14 cmaps are not such, but instead map each supported Unicode Variation Sequence (UVS) to a glyph.

B) the format 14 cmap is only a partial cmap that supplies the non-default glyphs for those UVS that override the default mapping contained in the font's "Unicode cmap" (quoting the spec).

Whatever structure we decide to represent the format 14 cmap, I don't think find_ms() should ever return such, but should return the corresponding Unicode cmap.
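
One possible shape for such a structure, as a minimal sketch rather than a proposal for the actual Font::TTF interface: keep the ordinary Unicode cmap as a codepoint-to-glyph mapping and hold the format 14 data beside it, keyed by (base codepoint, variation selector). The class name `UvsCmap`, its method `lookup`, and the glyph ids used with it are hypothetical, and Python is used here only for illustration.

```python
class UvsCmap:
    """Hypothetical container pairing a Unicode cmap with format 14 data."""

    def __init__(self, unicode_cmap, non_default, default=()):
        # codepoint -> glyph id (what the current interface already models)
        self.unicode_cmap = dict(unicode_cmap)
        # (codepoint, selector) -> glyph id, for sequences that override the default
        self.non_default = dict(non_default)
        # (codepoint, selector) pairs that are supported but use the default glyph
        self.default = set(default)

    def lookup(self, codepoint, selector=None):
        """Return a glyph id, or None if the font does not support the request."""
        if selector is None:
            return self.unicode_cmap.get(codepoint)
        key = (codepoint, selector)
        if key in self.non_default:
            return self.non_default[key]
        if key in self.default:
            return self.unicode_cmap.get(codepoint)
        # Unsupported sequence: the caller decides whether to fall back
        # to the plain codepoint lookup.
        return None
```

Because the plain codepoint lookup stays a simple mapping, a routine like find_ms() could keep returning the Unicode cmap, and only callers that care about variation sequences would need the extra table.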

Re (2):

I'm not sure I understand the arguments for changing the preferences as suggested -- feel free to create an issue on Font::TTF and make such arguments.

You can, of course, write your own code to find your preferred cmap. And, in case you're not in control of all the calls to find_ms(), Perl allows you to replace Font::TTF::Cmap::find_ms with your own function.

*

Offline Phil

Hi Bob, thanks for joining in and giving us the Font::TTF perspective!

Alfred, I've been trying to generate replacement *.cmap files, but I haven't quite gotten there. The best data seems to be the "cid2code.txt" files, but I haven't been able to quite match the existing files. For example, trying to create a new japanese.cmap, I find the cid2code file has multiple Unicode values for some CIDs, and there seems no rhyme or reason to which one is in japanese.cmap for that CID. Some CIDs have no Unicode values at all, while some value is given in japanese.cmap (which I don't know where it came from). I've checked the first few hundred entries, and found quite a few of these problems. No way am I going to check all twenty-thousand-plus entries by hand! I've got to get something that can be fully automated.
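
A fully automated pass needs a fixed rule for the ambiguous cases. The following is a minimal sketch, not a reconstruction of how the existing japanese.cmap was built. It assumes a tab-separated cid2code.txt-style table with a header row, `#` comment lines, `*` meaning "no mapping", comma-separated alternatives, and a `v` suffix on vertical forms, and it simply takes the first non-vertical value. The column name, the sample rows, and the function name `cid_to_unicode` are illustrative assumptions.

```python
import io

def cid_to_unicode(lines, column):
    """Return ({cid: codepoint}, [cids without a usable value]) for one column."""
    mapping, unmapped = {}, []
    header, col = None, None
    for raw in lines:
        line = raw.rstrip("\n")
        if not line or line.startswith("#"):
            continue                      # skip blank and comment lines
        fields = line.split("\t")
        if header is None:
            header = fields               # first data line names the columns
            col = header.index(column)
            continue
        cid = int(fields[0])              # CIDs are decimal
        cell = fields[col] if col < len(fields) else "*"
        # Drop "no mapping" markers and vertical-only variants.
        candidates = [v for v in cell.split(",")
                      if v != "*" and not v.endswith("v")]
        if candidates:
            mapping[cid] = int(candidates[0], 16)   # rule: first value wins
        else:
            unmapped.append(cid)
    return mapping, unmapped

# Illustrative rows only; a real run would read the downloaded file instead.
SAMPLE = (
    "# sample data for the sketch\n"
    "CID\tUniJIS-UTF32-H\n"
    "1\t0020,00a0\n"
    "2\t0021\n"
    "3\t*\n"
    "4\t2016v,2225\n"
)

mapping, unmapped = cid_to_unicode(io.StringIO(SAMPLE), "UniJIS-UTF32-H")
```

Collecting the unmapped CIDs separately makes it possible to diff them against the existing .cmap file and see which entries came from some other source.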

*

Offline Phil

bobh0303 (Bob Hallissy) said:
Quote
I did some playing around, dumping cmap data in TrueType/FontFile.pm in the non-.cmap section. My file NotoSansJP-Regular.otf has 5 tables in the cmap section. The first 3 are Platform 0 and the last 2 are Platform 3. All are Ver 0. The encodings are 3,4,5,1,10. ms_enc() reports that the encoding is 10, so apparently it went with the fifth table.

fwiw, I retrieved https://github.com/googlei18n/noto-cjk/blob/master/NotoSansJP-Regular.otf and note there are 6 cmaps:
  • 0/3, 3/1 -- these two point to the same format4 subtable in the file
  • 0/4, 3/10 -- these two point to the same format12 subtable in the file
  • 0/5 -- format14 subtable (for variation sequences)
  • 1/1 -- format6 subtable (Macintosh Japanese scriptmanager code; Apple discourages these)

Note that the 0/3 and 3/1 maps are the exact same data in the font (not just a copy -- the two cmap headers point to the same location in the file), and similarly for 0/4 and 3/10.
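
This sharing is visible directly in the cmap table header: each encoding record is a (platformID, encodingID, offset) triple, and two records share a subtable when their offsets are equal. The sketch below parses a small synthetic table built in the example itself (not the Noto font), using the header layout from the OpenType spec. The function names `list_cmap_records` and `group_by_subtable` are hypothetical.

```python
import struct
from collections import defaultdict

def list_cmap_records(cmap_table):
    """Return [(platform, encoding, offset, format)] from raw cmap table bytes."""
    _version, num_tables = struct.unpack_from(">HH", cmap_table, 0)
    records = []
    for i in range(num_tables):
        # Each encoding record is 8 bytes: uint16, uint16, uint32 (big-endian).
        platform, encoding, offset = struct.unpack_from(">HHI", cmap_table, 4 + 8 * i)
        # Every subtable starts with a uint16 format number.
        (fmt,) = struct.unpack_from(">H", cmap_table, offset)
        records.append((platform, encoding, offset, fmt))
    return records

def group_by_subtable(records):
    """Group (platform, encoding) pairs that point at the same subtable."""
    groups = defaultdict(list)
    for platform, encoding, offset, fmt in records:
        groups[(offset, fmt)].append((platform, encoding))
    return dict(groups)

# Synthetic table: four records, two stub subtables holding only a format word.
entries = [(0, 3, 36), (0, 4, 38), (3, 1, 36), (3, 10, 38)]
table = (
    struct.pack(">HH", 0, len(entries))
    + b"".join(struct.pack(">HHI", p, e, o) for p, e, o in entries)
    + struct.pack(">HH", 4, 12)          # format 4 at offset 36, format 12 at 38
)

shared = group_by_subtable(list_cmap_records(table))
```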

*

Offline Phil

Hmm. I don't recall seeing that last one (platform 1). Does Font::TTF properly handle it? It may be a moot point, if Apple doesn't want it used any more. As I said, Font::TTF apparently chose 3/10.

*

Offline Phil

terefang (Alfred Reibenschuh) said:

looking at https://docs.microsoft.com/en-us/typography/opentype/spec/name#platform-specific-encoding-and-language-ids-unicode-platform-platform-id--0

 - for the current implementation of api2/builder the best cmap(s) around would be 0/6, 0/4 and 3/10 in that order.
 - if it happens that those are actually the same, nothing better can happen.
 - cmap 0/4 has unicode 2.0 context, whereas cmap 0/6 has full unicode support.

so the preference order would be 0/6, 0/4, 3/10, 0/3, 3/1 for script fonts, and 3/0 for symbol fonts. in microsoft environments the preference would change to 0/6, 3/10, 0/4, 3/1, 0/3 for script fonts.
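
Expressed as code, such a preference list amounts to a first-match search over the (platform, encoding) pairs a font offers. This is a sketch of the idea only, not Font::TTF's find_ms(). The function name `pick_cmap` and the two list constants are hypothetical, with the orders copied from the post above.

```python
# Preference orders for script fonts, as (platformID, encodingID) pairs.
GENERIC_ORDER = [(0, 6), (0, 4), (3, 10), (0, 3), (3, 1)]
MICROSOFT_ORDER = [(0, 6), (3, 10), (0, 4), (3, 1), (0, 3)]

def pick_cmap(available, preference):
    """Return the first preferred pair the font provides, or None."""
    offered = set(available)
    for pair in preference:
        if pair in offered:
            return pair
    return None

# The six cmaps reported earlier in the thread for NotoSansJP-Regular.otf.
noto_like = [(0, 3), (0, 4), (0, 5), (1, 1), (3, 1), (3, 10)]

generic_choice = pick_cmap(noto_like, GENERIC_ORDER)
microsoft_choice = pick_cmap(noto_like, MICROSOFT_ORDER)
```

For that font the first order selects 0/4 and the second selects 3/10, which per Bob's note point at the same format 12 subtable, so the choice makes no practical difference there.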

*

Offline Phil

terefang (Alfred Reibenschuh) said:

i recommend using the Adobe-*-UCS2 files from: https://github.com/apache/pdfbox/tree/trunk/fontbox/src/main/resources/org/apache/fontbox/cmap for cmap generation

*

Offline Phil

Quote
i recommend using the Adobe-*-UCS2 files from: https://github.com/apache/pdfbox/tree/trunk/fontbox/src/main/resources/org/apache/fontbox/cmap for cmap generation

  • Is, for example, Adobe-Japan1-UCS2 the follow-on to Adobe-Japan1-6? They don't seem to be in the same format, and other sources have an Adobe-Japan1-7 (though not with the information I need).
  • In Adobe-Japan1-UCS2, I don't see anything that looks like the CID-to-Unicode mapping I need. Do you know how to read this file? The file I was trying to use has a decimal CID, with a variety of Unicode (and other) values for each (although some CIDs have none, and some have multiple Unicode points).


*

Offline Phil

Bob Hallissy on RT system said:
Quote
<URL: https://rt.cpan.org/Ticket/Display.html?id=128768 >

AFAICT, the request is to change the order in which find_ms() locates a cmap, specifically to prefer the PlatformID=0 ("Unicode") cmaps over the PlatformID=3 ("Microsoft") cmaps.

At this point I haven't seen any reasons put forward as to why this change was being requested, and since there are lots of fonts that have Microsoft cmaps but no Unicode cmaps, I see no benefit in making the change.

Additionally, application writers can, of course, write their own code to find their preferred cmap. And, in case they're not in control of all the calls to find_ms(), they can replace Font::TTF::Cmap::find_ms with their own function.

128674 has been rejected. It sounds like Font::TTF will not be modified to accommodate Alfred's suggested revision. So where does that leave us? Should we replace the find_ms() method with our own version, and if so, what is the justification? Keep in mind that PDF::Builder can be used on a wide range of platforms, including Linux, Mac, and Windows. A fixed lookup sequence may not be desirable. I assume that the Unicode-to-CID mapping is only of concern to the PDF writer (producer), and irrelevant to the reader (at least, if the font is embedded... what happens if the font has to be on the reader (consumer) machine?).

Further thoughts, given that Font::TTF itself is not going to change? And does this tie in at all with the *.cmap file usage?

*

Offline Phil

bobh0303 (Bob Hallissy) said:

If there is a functional reason to change find_ms() I'm open to doing it but I haven't heard a reason yet.

I guess it boils down to what behavior or problem are you having with the current code? It sounds to me like it is just a personal preference or bias against PlatformID=3. What am I not seeing?

*

Offline Phil

I have no idea. Only Alfred (@terefang) can tell us the reasons for his request. My skin in the game is ensuring that PDF::Builder works correctly for as many people as possible -- I'm open to any and all reasonable ideas.

*

Offline Phil

So Alfred, would Bob's find_cmap() routine (I think it stands independently of Font::TTF innards) do the job for you? Would the lookup list be an option to ttfont() or a new parameter or something else? And how would you prefer to handle MS vs non-MS systems? Does this pertain to just the producer (writer), or also to the consumer (reader) depending on whether or not the font is embedded? What should default behavior be (e.g., same as today)?

Once the selection of a built-in cmap is out of the way, what remains to be done with the four *.cmap files? I still don't have a good update for them. Should they continue to be used in preference to any built-in font cmap, but just for those 4 specific cases?

*

Offline Phil

bobh0303 (Bob Hallissy) said:

Quote
Bob's find_cmap() routine (I think it stands independently of Font::TTF innards)

Actually it does one thing that depends on knowledge of Font::TTF innards: it sets an internal object variable (' mstable') so as to remember the cmap that was actually found. This variable is then used by Font::TTF::Cmap::ms_lookup() to find glyphs using the designated cmap.

I did this on the assumption that your code is actually using the ms_lookup() function. If it is not, it would be cleaner to adjust the code in the gist to not set $cmap->{' mstable'} and then it would be completely free of any innards knowledge.