[RT 128674] error "requested cmap '' not installed" with many CJK fonts

*

Offline Phil

  • Global Moderator
  • Hero Member
  • *****
  • 572
    • View Profile
PDF::API2 version 2.033
Font::TTF version 1.06
Perl version 5.20.2

Sample code:
Code: [Select]
#!/usr/bin/env perl
use strict;
use warnings;
use PDF::API2;

my $pdf = PDF::API2->new();
$pdf->page();

# NotoSansJP-Medium.otf was downloaded from here:
# https://github.com/googlei18n/noto-cjk/blob/master/NotoSansJP-Medium.otf
$pdf->ttfont("NotoSansJP-Medium.otf");

dies with:

Quote
requested cmap '' not installed at /Users/Shared/perlbrew/perls/perl-5.20.2/lib/site_perl/5.20.2/PDF/API2/Resource/CIDFont/TrueType/FontFile.pm line 27.

About a third of the CJK fonts I have don't work with PDF::API2. I used Noto Sans JP in the above example because it's freely available, but it's not the only one, and at least one other PostScript-flavored OTF font that I own does work (DFKyoKaShoStd-W4.otf). Is there a workaround or a fix?

-j

*

Offline Phil

In FontFile.pm (and in CJKFont.pm, if that one is called), the ROS passed to _look_for_cmap() is 'Adobe:Identity'. This is not one of the four supported CMaps: Adobe:Japan1, Adobe:Korea1, Adobe:CNS1 (traditional), and Adobe:GB1 (simplified). https://github.com/adobe-type-tools/cmap-resources seems to contain the up-to-date CMaps, so I'll have to look at what's involved in updating PDF::Builder's list of CMaps from there (it's open source; we just have to keep the copyrights). At the least, Adobe:Identity should be added, and possibly some others (as well as updates to the existing four). There's a lot of material there, so I want to understand what it all is before I bloat PDF::Builder with unnecessary CJK-related files. At a minimum, this means adding some more files to the CMap directory and updating the internal lists of available CMaps in FontFile.pm and CJKFont.pm.
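
The empty name in the error message is what a failed hash lookup would produce. A minimal sketch, assuming the lookup table holds only the four pairs named above and that the font reports registry 'Adobe' and ordering 'Identity'. The helper name cmap_name_for_ros is made up for illustration and is not copied from FontFile.pm:

```perl
use strict;
use warnings;

# The four registry:ordering pairs named in the post, mapped to .cmap names.
my %cffcmap = (
    'Adobe:Japan1' => 'japanese',
    'Adobe:Korea1' => 'korean',
    'Adobe:CNS1'   => 'traditional',
    'Adobe:GB1'    => 'simplified',
);

# Hypothetical helper: returns the .cmap name for a ROS, or undef if unknown.
sub cmap_name_for_ros {
    my ($registry, $ordering) = @_;
    return $cffcmap{"$registry:$ordering"};
}

my $name = cmap_name_for_ros('Adobe', 'Identity');   # not in the table

# An undefined name stands in as an empty string, hence: requested cmap ''
printf "requested cmap '%s' not installed\n", $name // '';
```

Checking the lookup result before asking for the file would at least allow a clearer message that names the unsupported ROS.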

We may want to consider generating the list of available CMaps on the fly by reading the CMap directory, so that users can add whatever CMaps they want to that directory, rather than shipping PDF::Builder with everything under the sun.
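
A minimal sketch of such a directory scan, not taken from PDF::Builder. The helper name available_cmaps is hypothetical, and restricting file names to letters and digits is an assumption:

```perl
use strict;
use warnings;

# Hypothetical helper: list the .cmap files found in $dir as name => path.
# Returns an empty list if the directory cannot be opened.
sub available_cmaps {
    my ($dir) = @_;
    my %found;
    opendir(my $dh, $dir) or return %found;
    for my $file (readdir $dh) {
        # assumed naming rule: letters and digits, then ".cmap"
        next unless $file =~ /\A([a-z0-9]+)\.cmap\z/i;
        $found{lc $1} = "$dir/$file";
    }
    closedir $dh;
    return %found;
}
```

A caller could run this once at startup and consult the resulting hash instead of a hard-coded list. It only discovers files; it does not check that a file's contents are usable.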
 

*

Offline Phil

There are some Perl tools for dealing with the CMap files at https://github.com/adobe-type-tools/perl-scripts; however, their output still doesn't look anything like the "cmap" files shipped with PDF::Builder. The PDF::Builder files contain Perl code mappings of Unicode-to-glyphID and vice-versa, while the Adobe files seem to be just CID ranges, and are in some sort of PostScript format (even after "conversion" with cmap-tool.pl). I just don't see anything listed among the tools that claims to turn the file into a PDF::Builder-compatible cmap file. So, I'm just going to have to declare myself stuck on this one until someone familiar with CMaps comes along and can explain what such a file is supposed to be. The Adobe files don't seem to hold the information needed for the CMap file (Unicode to/from GlyphID). Perhaps it can be extracted from a font in some manner?

*

Offline Phil

Date: Sun, 3 Mar 2019 14:23:23 +0100
From: "Alfred Reibenschuh" <alfredreibenschuh [...] gmx.net>

replace the section

#/lib/PDF/API2/Resource/CIDFont/TrueType/FontFile.pm in sub new
#-------------------------------------------
Code: [Select]
if(defined $data->{cff}->{ROS}) {
    # map the font's "Registry:Ordering" to the name of a .cmap file
    my %cffcmap=(   'Adobe:Japan1'=>'japanese',
                    'Adobe:Korea1'=>'korean',
                    'Adobe:CNS1'=>'traditional',
                    'Adobe:GB1'=>'simplified',
                    'Adobe:Identity'=>'identity', # NEW CMAP
    );
    my $ccmap=_look_for_cmap($cffcmap{"$data->{cff}->{ROS}->[0]:$data->{cff}->{ROS}->[1]"});
    $data->{u2g}=$ccmap->{u2g};
    $data->{g2u}=$ccmap->{g2u};
} else

#-------------------------------------------

and create a new cmap file:

#/lib/PDF/API2/Resource/CIDFont/CMap/identity.cmap
#-------------------------------------------
Code: [Select]
$cmap->{identity}={
    'ccs' => [
        'Adobe',      # registry
        'Identity',   # ordering
        0,            # supplement
    ],
    'cmap' => { # perl unicode maps to adobe cmap (TBD)
        'ident' => [
            'ident',
            'Identity'
        ],
    },
    'g2u' => [
        0x0000 .. 0xffff
    ],
    'u2g' => {
        map { $_ => $_ } (0x0000 .. 0xffff)
    }
};

#-------------------------------------------

it could be that the g2u/u2g data does not work and you have to include the raw array and map instead.

Alfred Reibenschuh

*

Offline Phil

Sun Mar 03 11:29:42 2019 PMPERRY@cpan.org - Correspondence added

Hi Alfred,

Good to see you're still active in this area!

Unless "Identity" is a very different beast than the other CMaps, I suspect that this won't work. For example, u2g[x20] = space = GId[1], and the map is not monotonically increasing. Can someone show that it does work?

Per https://github.com/PhilterPaper/Perl-PDF-Builder/issues/98, I have found some CMap sources and tools from Adobe, but they don't seem to have the u2g/g2u information needed for the .cmap files used here. Do you have an idea of how to generate .cmap files? I'm thinking in terms of a tool that would take an Adobe CMap file that someone wants to use and generate a .cmap file from it. In any case, it would probably still be a good idea to add Adobe:Identity to the list.

regards, Phil Perry

*

Offline Phil

terefang (Alfred Reibenschuh) replied:

not that active but this got me interested because i had similar problems in java.

i have had a look at: https://github.com/adobe-type-tools/cmap-resources/blob/master/Adobe-Identity-0/CMap/Identity-H

which suggested my fix but dumping the cmap table from noto suggests otherwise.

hmm ... the original code was written under the premise that cff/otf files did not contain a cmap but have everything stuffed into the cff table.

for building cmaps i suggest reading: http://blogs.adobe.com/CCJKType/2012/03/building-utf32-cmaps.html

a simpler fix would be checking for Adobe-Identity-0 and the existence of a cmap table in the font, and using that table instead if present.

-- Alfred

*

Offline Phil

OK, I can see that the "cidrange" blocks are giving the Unicode ("u") and equivalent CID/GlyphID ("g") values. However, that's not usable for PDF::Builder, as currently implemented. It needs explicit u2g[] and g2u[] mappings, and in Perl code. I kind of hate to use a preprocessor to expand those tens of thousands of entries into explicit u2g and g2u entries, although it could be done. I wonder if it might be better to handle "Identity" as a special case, where u=g (or whatever the proper mapping turns out to be), and g=fn_u2g(u) and u=fn_g2u(g) rather than using array lookups.

I haven't tried your code yet, but I'm worried about whether "Identity" actually has G+0 = U+0000, G+1 = U+0001, etc. As for the other CMaps, G+0 is .notdef, and the mapping to ASCII starts with G+1 at U+0020. "Identity" would indeed be a horse of a different color if it is this way. I guess there's only one way to find out for sure!

Existing .cmap files suggest that there is really nothing to be gained by replacing them with such functions, as there is very little pattern to u2g and g2u, and the functions would be just as large as the present tables. In that case, a one-off identity.cmap in the usual format, along with a tool to convert online CMap resources to .cmap files, might be just as good (rather than adding code to treat Identity as a special case). I'll have to think about it -- it might work if PDF::Builder scans for .cmap files during startup, and builds the ROS[0]:ROS[1]-to-filename list. That way, someone could add more .cmap files if they need them.
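
A rough sketch of the ROS[0]:ROS[1]-to-filename list, assuming the .cmap data has already been loaded into a hash keyed by name and that each entry carries a 'ccs' array of registry, ordering, and supplement as in the identity.cmap stub. The helper name ros_index and the sample data are illustrative only:

```perl
use strict;
use warnings;

# Hypothetical helper: given name => cmap data (each with a 'ccs' array of
# registry, ordering, supplement), build "Registry:Ordering" => name.
sub ros_index {
    my ($cmaps) = @_;
    my %index;
    for my $name (sort keys %$cmaps) {
        my $ccs = $cmaps->{$name}{ccs} or next;   # skip entries without ccs
        my ($registry, $ordering) = @$ccs;
        $index{"$registry:$ordering"} = $name;
    }
    return \%index;
}

# Sample data; the supplement values are placeholders.
my $index = ros_index({
    identity => { ccs => [ 'Adobe', 'Identity', 0 ] },
    japanese => { ccs => [ 'Adobe', 'Japan1',   0 ] },
});
print "$_ => $index->{$_}\n" for sort keys %$index;
```

With something like this, a user-supplied .cmap file would become selectable by its own 'ccs' entry, with no change to the hard-coded table.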

I will have to look what the "cmap" element is doing. It includes an "Identity" component, but I'm guessing that there's nothing implemented. The whole element is marked "TBD".

*

Offline Phil

jgreely (J Greely) replied:

FYI, I tried making Alfred's suggested change in PDF::API2, and it didn't work. NotoSansJP-Medium.otf loaded, but every character was shifted by 31 ("11" became "PP", etc), and the resulting PDF rendered very slowly.
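
The "11" to "PP" result can be checked with a little arithmetic. This is a sketch under an assumption not read from the font: that glyph 1 holds U+0020, glyph 2 holds U+0021, and so on through ASCII, while the stub's u2g table sends every code point to the glyph with the same number. The sub names are made up for the example:

```perl
use strict;
use warnings;

# Identity table from the suggested identity.cmap: code point N selects glyph N.
sub identity_u2g {
    my ($u) = @_;
    return $u;
}

# Assumed font layout: glyph 1 is U+0020, glyph 2 is U+0021, and so on,
# so glyph N shows the character with code point N + 31.
sub assumed_glyph_to_char {
    my ($g) = @_;
    return chr($g + 31);
}

# What would be displayed when the text asks for $char.
sub rendered {
    my ($char) = @_;
    return assumed_glyph_to_char(identity_u2g(ord $char));
}

# '1' is U+0031 = 49, so glyph 49 is selected, which holds chr(80).
print rendered('1'), "\n";   # prints "P"
```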

*

Offline Phil

That's off by decimal 31? That could be accounted for if G+1 should map to U+0020, and so on. What happens if you change Alfred's code to
Code: [Select]
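     'g2u' => [
         0xFFFD,              # G+0
         0x0020 .. 0xFFFF     # G+1 = U+0020, G+2 = U+0021, ...
     ],
     'u2g' => {
         # the outer parentheses stop map from also consuming the final pair
         (map { $_ => ($_-31) } (0x0020 .. 0xFFFF)),
         0xFFFD => 0
     }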

? I haven't actually tried it, but the syntax should be close.

*

Offline Phil

jgreely (J Greely) replied:

That's sufficient to fix ASCII, but completely wrong for kanji (月 = U+6708 renders as 玞 = U+739E). Still renders incredibly slowly as well (about 10 seconds in MacOS Preview.app compared to the similarly-sized DFKyoKaShoStd-W4.otf, which comes up instantly; PDF size is about the same).

*

Offline Phil

So does this Noto Sans JP cover just ASCII and CJK (or even just Japanese alphabets), rather than all scripts? In other words, just a subset of all characters? I thought the intent of Noto was to provide glyphs for every Unicode character, in order to avoid tofus. Maybe that was just too big a font file. And why is it claiming "Identity" mapping 0-ffff:0-ffff if that's not what it's providing?

Let me try to take a look at the contents of this font file and see if my font-dump routines (in examples/) give any clue as to what's going on.

*

Offline Phil

jgreely (J Greely) replied:

FontExplorer Pro says it has 18,570 characters and covers Cyrillic, Hangul, Bopomofo, Greek, IPA extensions, etc, so most likely it's just using the Japanese flavor of CJK glyphs. Looks like they build separate localized fonts for each target country.

*

Offline Phil

OK, I finally got examples/022_truefonts to dump all the glyphs in the NotoSansJP-Regular.otf font file. It reports 17802 glyphs. They are rather scattered around, with large gaps between some CJK characters (when ordered by CID), and others in large contiguous chunks. They do not appear to be in any order per the Unicode standard. For example, I see Katakana (U+30Dx neighborhood) appearing near the end at around G+65300 (0xFF10).

I must conclude that either 1) this Noto file is in fact not ordered in some sort of "Identity" one-to-one mapping, or 2) Alfred's .cmap file is missing something or broken, or 3) I/we just don't understand what "Identity" is supposed to be and do when it comes to CMaps.

Since Unicode does not define every single code point between U+0000 and U+FFFF, I would expect gaps in the CID (G+nnnnn) sequence, or possibly some sort of dummy placeholders in the gaps. If this font is claiming almost 18,000 glyphs, that should cover much of Unicode, but I think there should probably be many more than that. My Unicode 3.0 book claims 49,194 characters (I'll take their word for it -- I'm not going to count the damned things).

Anyone out there have any ideas on where to go from here?

*

Offline Phil

jgreely (J Greely) replied:

Note: same error with Adobe's free SourceHanSans-Medium.otf.

I'm just stabbing blindly here, but it looks like you need to parse the cmap table directly to get correct results.

I've attached a crude Perl script that parses the output of ttx from Adobe's Font Development Kit and generates a suitable identity.cmap for replacing the stub Alfred suggested. This mostly works with NotoSansJP-Medium.otf and SourceHanSans-Medium.otf, but the file has to be generated for each font; they're quite different internally. Interestingly, in both cases, the character that comes out wrong is 金 (U+91D1) (my test script generates locale-specific calendars, so the only kanji are the days of the week; I'm sure there are a lot of other CJK errors I'm not seeing, but 6 out of 7 worked).
Code: [Select]
ttx -q -t cmap -o - NotoSansJP-Regular.otf 2>/dev/null | ./cmap2perl.pl > identity.cmap
Code: [Select]
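#!/usr/bin/env perl
# Reads the XML that `ttx -t cmap` writes for one font and prints an
# identity.cmap for that font.

use strict;

my %g2u;    # CID => "0xXXXX" Unicode code point
my $incmap = 0;
while (<>) {
    if (m|<cmap_format_4 platformID="0"|) {
        $incmap++;
        next;
    } elsif (m|</cmap_format_4>|) {
        last;    # stop at the first closing cmap_format_4 tag
    }
    # <map code="0x29" name="cid00010"/><!-- RIGHT PARENTHESIS -->
    # a line without a <map> element leaves both variables undefined and
    # lands under the empty key
    my ($code,$id) = m|<map code="0x([0-9a-fA-F]+)" name="cid(\d+)"|;
    $g2u{$id} = sprintf("0x%04X",hex($code));
}

print<<'EOF';
$cmap->{identity}={
    'ccs' => [
         'Adobe',      # registry
         'Identity',    # ordering
         0,               # supplement
     ],
     'cmap' => { # perl unicode maps to adobe cmap (TBD)
         'ident'         =>  [
              'ident',
              'Identity'
         ],
     },
     'g2u' => [
EOF
# one g2u entry per CID seen, in ascending CID order
foreach my $id (sort {$a <=> $b} keys %g2u) {
    print "        $g2u{$id},\n";
}
print <<EOF;
     ],
     'u2g' => {
EOF
# u2g entries in code point order; oct() converts the "0x..." string
foreach my $id (sort {$g2u{$a} cmp $g2u{$b}} keys %g2u) {
    printf("        '%d' => '%d',\n",oct($g2u{$id}),$id);
}
print <<EOF;
     }
 };
EOF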
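
The script writes g2u in ascending CID order and emits nothing for CIDs that the cmap subtable does not mention. If g2u is meant to be indexed by glyph ID, as the `0x0000 .. 0xffff` list in the stub implies, a gap in the CIDs would shift every later entry. Whether that explains the 金 result has not been checked. A possible variant of the table-building step, with a hypothetical helper name and an assumed filler value of 0x0000 for unmapped CIDs:

```perl
use strict;
use warnings;

# Hypothetical helper: turn [code point, CID] pairs into a g2u array indexed
# by CID and a u2g hash. Unmapped CIDs get $filler so that later entries
# keep their index.
sub build_tables {
    my ($pairs, $filler) = @_;
    $filler //= 0x0000;          # assumed placeholder for unmapped CIDs
    my (@g2u, %u2g);
    for my $pair (@$pairs) {
        my ($code, $cid) = @$pair;
        $g2u[$cid] //= $code;    # keep the first code point seen for a CID
        $u2g{$code} = $cid;
    }
    $_ //= $filler for @g2u;     # fill the gaps, including CID 0 if unmapped
    return (\@g2u, \%u2g);
}

# U+0020 at CID 1 and U+0029 at CID 10, with nothing in between.
my ($g2u, $u2g) = build_tables([ [0x20, 1], [0x29, 10] ]);
printf "%d entries, g2u[10] = 0x%04X\n", scalar @$g2u, $g2u->[10];
```

Unlike the script above, this keeps the first code point seen for a CID, not the last; which choice is right for a given font is an open question.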

*

Offline Phil

If there is no one 'identity.cmap', but a new one has to be generated for each font file, well, that's absurd. We might as well just supply some tools with PDF::Builder and tell users to build their own .cmap files. I wonder if it would be better to just get rid of the whole .cmap file business. All it seems to be is a mapping of Unicode number to CID (u2g) and the corresponding inverse (g2u). Isn't all that information available in a TTF or OTF font file anyway? Or is the Unicode number for each glyph missing in at least some font files? Was the assumption that a "producer" might not have the appropriate font file in hand? If so, where do character widths and other information come from?

It makes me wonder if the supplied Japanese, Korean, and (two flavors of) Chinese .cmap files have errors in them, especially when applied to fonts following later revisions of the CMap standards (e.g., japanese.cmap is Rev 6, while the current is Rev 7). If they're "clean", why is Identity so fubarred? Note that even Latin-1 (Windows-1252/standard encoding) doesn't get done correctly for NotoSansJP in examples/022_truefonts!
« Last Edit: March 06, 2019, 08:33:32 AM by Phil »