
[RT 128674] error "requested cmap '' not installed" with many CJK fonts

  • 49 Replies
  • 2041 Views
*

Offline Phil

  • Global Moderator
  • Hero Member
  • *****
  • 601
    • View Profile
jgreely (J Greely) replied:

Playing with Adobe's perl scripts (in particular, I rewrote my quick hack to use the output of unicode-list.pl -g), it looks like:

  • an "identity" cmap is a framework for building mappings, not an actual mapping,
  • glyph IDs are not a continuous range from 0..$num_glyphs,
  • a single glyph ID can map to multiple Unicode code points. For instance, in SourceHanSans, glyph 22397 => [0x6A02, 0xF914, 0xF95C, 0xF9BF].
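The many-to-one relationship in that last bullet is easy to model. A small sketch (Python rather than Perl, for brevity; apart from the glyph 22397 entry taken from SourceHanSans, all values are invented):

```python
# Hypothetical slice of a glyph-to-Unicode table: a single glyph ID can
# stand for several Unicode code points (here, CJK compatibility
# ideographs that all render as the same character).
g2u = {
    22397: [0x6A02, 0xF914, 0xF95C, 0xF9BF],  # SourceHanSans example above
    36:    [0x0041],                          # an ordinary one-to-one entry
}

# Inverting g2u yields the u2g direction: every code point maps to
# exactly one glyph, even though the reverse mapping is a list.
u2g = {cp: gid for gid, cps in g2u.items() for cp in cps}

assert u2g[0x6A02] == u2g[0xF9BF] == 22397
assert len(u2g) == 5
```

Note that the inversion is lossless only in the u2g direction; going back from a glyph to "the" code point is inherently ambiguous.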

*

So where does this leave us? Is a .cmap file applicable only to TrueType/OpenType font file(s) that use (or permit) that particular mapping? If that's the case, this is big trouble. Do TTF/OTF contain the Unicode point itself (if there is one) for each glyph ('cmap' field)? If we can read in the Unicode value from the font file, we can dispense with the .cmap files, but then you need to have the TTF or OTF file in hand to generate the PDF (was that why .cmap was handled as a separate file?).

It is true that there is not a one-to-one mapping between glyphs and Unicode points. There are many ligatures, swashes, and other such typographic effects which have no Unicode point (or map to multiple points, such as a 'ct' ligature to [0x0063, 0x0074]). Unless the PDF author can specify a glyph by Glyph ID (CID), which will be font-specific, they won't even be able to put such glyphs into the PDF (that sounds like a possible PDF::Builder enhancement: specifying glyphs by ID, including choosing swashes). Note that even if an author can specify a character by glyph (font-file specific), or an automatic substitution can be made (such as ligatures), the Unicode value(s) are still needed so that the PDF can be searched. Online documentation for 'cmap' suggests that even within a font file, there may be multiple cmaps (mappings between Unicode and CIDs).
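The searchability point can be sketched as a toy ToUnicode lookup (Python for illustration; both glyph IDs below are hypothetical):

```python
# A PDF's ToUnicode CMap maps each glyph (CID) back to the Unicode text
# it represents -- possibly several code points for one ligature glyph.
# The glyph IDs 991 and 992 are made up for this sketch.
to_unicode = {
    991: "ct",   # a 'ct' ligature glyph -> U+0063 U+0074
    992: "ffl",  # an 'ffl' ligature glyph -> f+f+l
}

def extract_text(cids):
    """What a reader recovers for search/copy from a run of CIDs."""
    return "".join(to_unicode.get(cid, "\uFFFD") for cid in cids)

# Text set with a ligature glyph still searches as plain letters:
assert extract_text([991]) == "ct"
```

Without that mapping, the reader sees only opaque CIDs, which is exactly why the Unicode values are still needed even for font-specific glyphs.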

The problems with "Identity" aside, can anyone tell if the four .cmap files currently supplied with PDF::Builder (and PDF::API2) are still valid, and for all font files claiming to use those mappings?

It's beginning to sound like this whole thing is fast becoming a big mess. I have a tendency to overthink these things and go down the rabbit hole of getting too complex a solution, so I'm hoping there is something better to do. Alfred, care to chime in on the subject, as you know the history of it?

*

jgreely (J Greely) said:

Does Font::FreeType provide everything you need?

*

Font::FreeType and Font::FreeType::Glyph don't appear to offer anything I need (unless I'm badly misreading their descriptions). Maybe Font::TTF::Cmap will do the job? Font::TTF is already a prereq.

What we're starting with is the Unicode value of a character we wish to output, and at some point in the process we need a CID/Glyph ID to tell the font exactly which one to use (the CID is what's output in the PDF file). For now, let's ignore the issue of automatic ligatures and other substitutions (GSUB) and positioning (GPOS) vital to Indic (#35) and Arabic family languages, optional ligatures for English, etc. (and their suppression), the choice of swashes and other glyph variants, and the use of font-specific glyphs with no Unicode value. Obviously, the encoding you use for your text needs to be one that has the correct mapping available, whether it's an external (xxx.cmap) or internal (cmap table). Once the appropriate glyph is found, its metrics (width) can be obtained.
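A minimal sketch of that pipeline (Python; the cmap and width tables are entirely invented):

```python
# code point -> glyph ID via the font's cmap, then glyph ID -> advance
# width via the metrics table; the glyph IDs (CIDs) are what end up in
# the PDF content stream. All table values here are invented.
cmap = {0x0041: 36, 0x0042: 37}    # 'A' -> GID 36, 'B' -> GID 37
hmtx = {36: 667, 37: 660}          # advance widths in font units

def shape(text):
    gids = [cmap[ord(ch)] for ch in text]   # encoding step (u2g)
    widths = [hmtx[g] for g in gids]        # metrics step
    return gids, widths

assert shape("AB") == ([36, 37], [667, 660])
```

GSUB/GPOS would sit between the two steps, rewriting the GID list before metrics are taken, which is why it is set aside here.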

I'm hoping that the Unicode information (mapping) I need can always be found in the font file, and that we fall back to the .cmap files only if the font file is not at hand during production of the PDF. In that case, where the font metrics come from is an interesting question -- is it assumed that it's a fixed-pitch font? Would it be a great hardship to eliminate .cmap files and require that the font be present for the PDF writer?

*

jgreely (J Greely) said:

How would you even generate a PDF if the font file wasn't available when the script ran?

Answering the earlier question about the coverage of Noto Sans JP, Google has multiple packaging options for Noto, but the one I picked turned out to be the most generically compatible option.

*

Quote
How would you even generate a PDF if the font file wasn't available when the script ran?

It's done all the time with core files, which are usually TTF behind the scenes. PDF::Builder supplies its own mapping (Latin-1) and metrics files, and it is assumed that the appropriate file will be found on the reader's machine. I haven't gone through the deep details, but my understanding is that the writer doesn't read (or embed) the font file.

For TTF/OTF, Type1, etc., I'm not sure if they actually require the font to be present on the machine writing the PDF. Possibly they do, otherwise where would they get the metrics?

In general, font handling is a mess. It would be nice to have (at least) TTF/OTF and Type1/PS handled identically, or at least as close to that as possible. Core should probably not be used for serious publishing work, as the encoding is limited and metrics may not match up. Bitmapped (BDF) is probably only for novelty effects.

*

jgreely (J Greely) said:

Well, yes, but corefont() doesn't matter for this issue, because it's guaranteed that a PDF reader has a metrics-compatible font in some format; that predates Acrobat 1.0, back when they were the PostScript core fonts that shipped with every printer. I don't remember which version of Acrobat first bundled the fonts used by cjkfont(), but it was back in the Nineties.

If you call PDF::API2::ttfont(), you are definitely loading a complete font file from disk:
Code:
sub new {
    my ($class,$pdf,$file,%opts)=@_;
    my $data={};
    die "cannot find font '$file' ..." unless(-f $file);
    my $font=Font::TTF::Font->open($file);
    ...

*

OK, but to get the discussion back on track,

  • can we come up with generic *.cmap files (especially identity) that are guaranteed to work?
  • if not, can we read any TTF/OTF file to get its mapping(s), and compare to what the PDF writer is claiming for the encoding used? This would generate u2g and g2u lists on the fly. Do Type1 font files include this information?
   
Core fonts aside (whose use should probably be discouraged), it looks like other font methods need to have the font files present anyway at PDF generation.

Most TTF/OTF seem to have Unicode mapping(s) available, so it might be that we have to convert any single-byte encoded text to UTF-8 on the fly. We just have to be careful to leave room for GSUB (many-to-one Unicode-to-glyph possible) and GPOS work, optional ligatures and swashes, etc., and make sure the correct text is available for searching (e.g., 'ffl' ligature searched as f+f+l). Embedding fonts is good, too (since synthetic fonts can be embedded, perhaps Type1 can be too?).
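The "convert single-byte encoded text on the fly" idea might look like this sketch (Python; the cmap contents and glyph IDs are invented):

```python
# Re-encoding single-byte text on the fly: decode the author's Latin-1
# bytes to Unicode code points, then map through the font's cmap
# (invented here) to glyph IDs. The decoded text stays available as the
# searchable content, separate from the GIDs that get written out.
cmap = {0x00E9: 142, 0x0041: 36}          # e-acute and 'A', invented GIDs

def encode_latin1(raw: bytes):
    text = raw.decode("latin-1")          # single-byte -> Unicode
    return text, [cmap[ord(ch)] for ch in text]

text, gids = encode_latin1(b"\x41\xE9")   # "A" + e-acute
assert text == "A\u00e9"
assert gids == [36, 142]
```

Any GSUB/ligature pass would then operate on the GID list, while the decoded text is what should feed the PDF's searchable layer.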

*

jgreely (J Greely) said:

  • No, I don't think so. Apart from identity, the old ones seem to work well enough (I've printed entire Japanese novels with PDF::API2 without spotting any incorrect kanji mappings), so maybe the short term solution is simply to provide a better error message for anything that falls through to "identity".
  • Type 1 fonts had a relatively small number of well-defined encodings, with all the interesting goodies stored in separate AFM/TFM/XFM files (and all sorts of workarounds for the limitations that imposed). I'm not even sure what box my old PostScript manuals are in these days, so I can't be more specific. Fortunately that's a completely different code path that doesn't involve g2u/u2g, and anyone wanting to use Unicode text won't miss them.

*

terefang (Alfred Reibenschuh) said:

new resolution proposal:
Code:
if (defined $data->{'cff'}->{'ROS'}) {
to
Code:
if ((defined $data->{'cff'}->{'ROS'}) && ($data->{'cff'}->{'ROS'}->[1] ne 'Identity')) {
then if identity pops up, the opentype cmap will be used instead.
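Alfred's one-line change amounts to this decision, restated as a Python sketch (`ros` mirrors the CFF ROS array, [Registry, Ordering, Supplement]):

```python
# Use a packaged .cmap only when the CFF names a real character
# ordering; for 'Identity' (or no ROS at all), fall back to the font's
# own OpenType cmap table instead.
def use_external_cmap(ros):
    return ros is not None and ros[1] != 'Identity'

assert use_external_cmap(['Adobe', 'Japan1', 6]) is True      # e.g. japanese.cmap
assert use_external_cmap(['Adobe', 'Identity', 0]) is False   # internal cmap
assert use_external_cmap(None) is False                       # internal cmap
```

The ROS values above are illustrative; 'Adobe-Japan1' is a real ordering, but which .cmap file it selects is up to the caller.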

*

jgreely (J Greely) said:

That works for SourceHanSans, and at least gets NotoSansJP to render alphanumerics (but not kanji).

Out of curiosity, I tried the NotoSansCJKjp packaging, and that one does work, including kanji.

*

Hmm. Looking at (Alfred's revised) code, is it possible that there was no real 'cmap' in the file, and thus u2g and g2u were not populated? That is, it was depending on there being an identity.cmap of some sort? Or perhaps there were other version(s) of the cmap than the expected MS version (->find_ms())?

Certainly, we could restrict .cmap usage to the four currently supported versions, and otherwise look for the font file's cmap section(s), but apparently we can't depend on find_ms always working? If there is no MS cmap, what other ones should we look for? At the least, we could check that u2g and g2u of some sort were successfully loaded, before continuing.

Add:
First, a correction. Apparently (according to the Font::TTF::Cmap documentation) find_ms() looks first for the MS cmap, and then tries others. If there are other cmaps, it should use one of them.

I did some playing around, dumping cmap data in TrueType/FontFile.pm in the non-.cmap section. My file NotoSansJP-Regular.otf has 5 tables in the cmap section. The first 3 are Platform 0 and the last 2 are Platform 3. All are Ver 0. The encodings are 3,4,5,1,10. ms_enc() reports that the encoding is 10, so apparently it went with the fifth table.

Dumping out each u2g entry as it was created, there are a lot of gaps in the Unicode sequence once you get past Latin-1, but the Glyph IDs increase fairly steadily (although I did see some out-of-sequence). There are many Unicode entries in 1F1xx and 2xxxx ranges. All told there are 16697 entries, both by count and getting the size of u2g. I don't know if there are any duplicate glyphs or if they're all unique.

P.S. There is a third ROS array value, some sort of Revision number or something. Currently that's ignored when deciding to use, say, japanese.cmap. I'm wondering if using a later (or just different) revision number might cause problems.

*

terefang (Alfred Reibenschuh) said:

i have done some further research (reading otf spec, dumping/diffing fonts cmap).

seems like the age of Font::TTF is showing ...

find_ms is picking the Windows NT cmap (encoding 1 = Unicode 2.0 BMP) or MS Unicode (encoding 10 = full Unicode repertoire).

the cmaps 3:10 and 0:4 from noto are identical, but i would prefer 0:4 and 0:5 over 3:10 for unicode context.

also Font::TTF (as of 1.06) does not support format 14 cmaps (noto's 0:5) from the otf spec.

the otf spec still refers to unicode 2.0 -- :-((

so actually three things need to happen (in that order)

1 - patch to builder and API2 to switch to font cmap if cff encoding is identity (as above).
2 - patch Font::TTF::Cmap find_ms to prefer (platform:encoding) of 0:4 instead of 3:10/3:1.
3 - patch Font::TTF::Cmap to support format 14 cmaps and add preference for 0:5 above all.
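That preference order can be sketched as a small selection function (Python; subtables are (platform, encoding) pairs, and the sample list matches the five tables reported for NotoSansJP-Regular.otf above):

```python
# Try Unicode-platform subtables (0,5 then 0,4) before the Microsoft
# ones that find_ms() prefers today.
PREFERENCE = [(0, 5), (0, 4), (3, 10), (3, 1)]

def pick_cmap(subtables):
    for want in PREFERENCE:
        if want in subtables:
            return want
    return None

noto = [(0, 3), (0, 4), (0, 5), (3, 1), (3, 10)]
assert pick_cmap(noto) == (0, 5)          # instead of find_ms()'s (3, 10)
```

Of course, preferring (0,5) is only useful once format 14 parsing exists; until then (0,4) would be the practical first choice.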


*

Alfred, are your observations on preferable choices based only on looking at Noto fonts, or are they universal? We don't want to change PDF::Builder and/or Font::TTF based only on Noto (which could turn out to be an odd duck). If it's Noto-specific, perhaps an option could be added to ttfont() to prefer different cmaps (and/or .cmap files) based on the specific font you're using?

  • No problem, but I'm becoming concerned about the age and currency of the .cmap files. Most of them have later revisions out (I'm assuming ROS[2] is a revision number of some sort). Should we consider using the font's built-in cmap by preference, and the .cmap files only as a fallback? Why are there .cmap files in the first place, and do they still serve a purpose?
  • If this is universally good, and you can get Bob Hallissy et al. to quickly change Font::TTF, that would work. Otherwise, I think there is enough information in the font file to bypass find_ms() and use our own code to do what you want (find the appropriate cmap, and populate u2g and g2u).
  • I have no idea what's involved with this. Again, it's probably possible to bypass Font::TTF functions and work directly with the font data, if Bob can't provide a quick update.

*

terefang (Alfred Reibenschuh) said:

Quote
Alfred, are your observations on preferable choices based only on looking at Noto fonts, or are they universal? We don't want to change PDF::Builder and/or Font::TTF based only on Noto (which could turn out to be an odd duck). If it's Noto-specific, perhaps an option could be added to ttfont() to prefer different cmaps (and/or .cmap files) based on the specific font you're using?

the observations are universal based on the otf-spec 1.8.3 (https://docs.microsoft.com/en-us/typography/opentype/spec/)

the real problem is that this breaks down based on the particular font files themselves, since some are highly opinionated relative to one ttf/otf spec or the other, based on apple or windows with unix/linux in between, or on "what works best" observations at various points in time from 1991 to today.

Quote
No problem, but I'm becoming concerned about the age and currency of the .cmap files. Most of them have later revisions out (I'm assuming ROS[2] is a revision number of some sort). Should we consider using the font's built-in cmap by preference, and the .cmap files only as a fallback? Why are there .cmap files in the first place, and do they still serve a purpose?

  • ROS[2] definitely is a revision number.
  • since cff/Identity fonts were not even supported by the old code, switching to a font's internal cmap on ROS[1] == 'Identity' is about the best thing you can do to improve the situation.
  • the .cmap files should be updated to their latest revision, since they are still needed to support legacy fonts and cff fonts without an internal cmap. AFAIK each revision builds on top of the former, so higher revisions should be compatible with lower revisions aside from the actual glyph counts.
Quote
2. If this is universally good, and you can get Bob Hallissy et al. to quickly change Font::TTF, that would work. Otherwise, I think there is enough information in the font file to bypass find_ms() and use our own code to do what you want (find the appropriate cmap, and populate u2g and g2u).
3. I have no idea what's involved with this. Again, it's probably possible to bypass Font::TTF functions and work directly with the font data, if Bob can't provide a quick update.

i have tested Font::TTF 1.06 and, aside from not supporting the format 14 cmap, it is as good as you might want.

i have looked at the format 14 spec and it deals largely with variant glyphs, which could probably impact cjk and/or arabic some more, but i am not an expert on this.

with the identity fix in place, the general assumption can be made that newer cff fonts will be better supported, as they seem to have internal cmaps, so you should be fine unless someone starts using the cff2 format (which is currently only an apple thingy).

ttf/otf technology stopped being an exact science at the point adobe, ms, and apple shared the spec.