
RT 113700 - Khmer script incorrectly rendered

Phil (Global Moderator)
« October 20, 2016, 07:58:20 PM »
Tue Apr 12 05:28:49 2016 jeromekampot [...] gmail.com - Ticket created
Subject:    Khmer script incorrectly rendered
Date:    Tue, 12 Apr 2016 16:28:39 +0700
To:    bug-PDF-API2 [...] rt.cpan.org
From:    Jerome B <jeromekampot [...] gmail.com>
 
It seems subscripts (footer letters) are not rendered for Khmer script. PDF-API2 v2.0.27, perl 5.18.2

The code below should render 2 consonants, "under" each other but instead renders them next to each other with the "Coeng" placeholder. I tested with different fonts including KhmerOS.ttf (http://sourceforge.net/projects/khmer/files/Fonts%20-%20KhmerOS/KhmerOS%20Fonts%205.0-%20LGPL%20Licence/)

Copy/paste of the text seems to work correctly so I guess it has something to do with the way the font is generated.

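Code:
use strict;
use warnings;
use PDF::API2;

# Create the document and load the Khmer TrueType font
my $pdf = PDF::API2->new(-file => "testkhmer.pdf", -encode => 'utf8');
my $page = $pdf->page;
my $font = $pdf->ttfont('../font/khmerOS.ttf');

my $text = $page->text();
$text->font($font, 20);
$text->translate(200, 700);
# KA, COENG, KA: the second KA should appear as a subscript under the first
$text->text("\x{1780}\x{17D2}\x{1780}");
$pdf->save();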

<formatting cleanup - Mod.>
« Last Edit: May 01, 2017, 10:55:10 AM by Phil »

Phil (Global Moderator)
Re: RT 113700 - Khmer script incorrectly rendered
« Reply #1: December 25, 2017, 06:06:21 PM »
 PhilterPaper commented on Nov 3

I don't know Khmer, so I'll add a "help wanted" label to this issue. I will attempt again to contact the original reporter, to see if they can try it.

The original code above (113700.pl) produces two KA's side-by-side, with a COENG under the left one. I have no idea if that's correct or not. Jerome says one consonant should be "under" the other -- does that mean "stacked vertically"?

I sort of copied a phrase ("a dog") from a Wikipedia article on the Khmer language, matching up characters as best I could. It seems to work as expected (vowel marks over or under the preceding consonant), but maybe this is something totally different from the COENG example.

Anyway, in a month or two I will close this issue, unless someone shows up who knows Khmer and how the text should appear.

113700.pl
testkhmer.pdf  <files not attached, obsolete>

 jbenezech commented on Nov 3

Thanks for looking into this Phil. I didn't expect this issue to be on anybody's list.
Khmer is written kind of horizontally, left to right, but has some tricks. I think the word Dog is a perfect example of these.

This word is composed of 3 letters:

  • Chha : consonant (\x{1786})
  • Ka : consonant (\x{1780})
  • Ae : vowel (\x{17C2})
   
You would read it something like Chkae
https://translate.google.com/translate_tts?ie=UTF-8&q=%E1%9E%86%E1%9F%92%E1%9E%80%E1%9F%82&tl=km&total=1&idx=0&textlen=4&tk=185153.308593&client=t

The written form obeys these 2 rules:

  • Vowel is placed before consonants
  • Each consonant can have 2 forms, the "normal" form and the "subscript" form. When two consonants form a consonant cluster, the second one takes the subscript form and is placed underneath.
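
To make the cluster rule concrete, the word can be spelled out one code point at a time. This is only a sketch: the code points are the ones named in this thread, and describe() is a made-up helper, not part of PDF::API2 or PDF::Builder.

```perl
use strict;
use warnings;
use charnames ();   # provides charnames::viacode

# "Dog" as stored in Unicode (logical order): CHA, COENG, KA, AE.
# COENG marks the consonant that follows it (KA) as the subscript form.
my $dog = "\x{1786}\x{17D2}\x{1780}\x{17C2}";

# describe() is a hypothetical helper: one "U+XXXX NAME" string per code point.
sub describe {
    my ($str) = @_;
    return map {
        sprintf('U+%04X %s', ord($_), charnames::viacode(ord($_)) // '?')
    } split //, $str;
}

print "$_\n" for describe($dog);
```

The stored string has four code points, while the correctly drawn word has three visible shapes, because COENG and the following KA are drawn as one subscript glyph.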
   
In your attached example, the letter under the Chha is actually the vowel "ou" so this is not the word dog and I think is just not correct.

I have corrected the testcase. See attached. Note that I omitted the last 3 letters which are just the word for "a" (a dog).

I also attach a pdf with the correct rendering of the word. Note that I produced this pdf by copy pasting from testkhmer.pdf into LibreOffice writer then exporting as pdf.

Khmer - Dog.pdf
testkhmer.pdf
113700.pl   <files not attached>

 jbenezech commented on Nov 3

I played around a bit and tested with Tamil script. It seems to have the same issue. So I guess this is related to the Virama sign in general and probably affects most Devanagari-related scripts.

 PhilterPaper commented on Nov 4

Hi, and thanks for participating in this issue. Are you the original reporter of this problem?

The sample from Wiki was "dog a (one)", so that was my intent. The glyphs didn't seem to quite match what my Unicode book shows for Khmer, but it's close-ish.

What this all comes down to is this: is PDF::Builder outputting the correct sequence of bytes (and any surrounding information), with the problem being that various PDF readers mess up the presentation, or is PDF::Builder putting out incorrectly sequenced bytes (or other information) in the first place? I can hopefully do something about the latter case, but I can't do anything about the former. Are PDF readers depending on other information to tell them how to properly render this script, and is that information missing or incorrect?

 PhilterPaper commented on Nov 4

Let me see if I understand what you sent. You revised 113700.pl to produce the correct form of "dog" (and dropped the "(one)" word). You took the PDF produced, which wasn't rendering correctly, ran it through LibreOffice and re-exported as a PDF that renders correctly? Does that imply that the byte order is correct, but there is something about the PDF file that's different? I will try to disassemble the PDF file to see how it compares to what is directly produced by PDF::Builder.

If the testkhmer.pdf you sent (revised) is correctly rendered for both words of text, does that satisfy the original complaint filed with bug 113700? If so, all I need to do is figure out what LibreOffice did to fix the PDF, and incorporate that into PDF::Builder. Or is it still incorrect in some way?

Update: I ran your revised 113700.pl to produce a new testkhmer.pdf, and it appears to render exactly the same as the testkhmer.pdf you attached above (no trip through LibreOffice). Please clarify what's in the files you sent.

 jbenezech commented on Nov 5

I am the original reporter indeed.

Let me try to be more clear.
Attached are 4 files:

  • khmer-fail.pdf : output of the sample perl script
  • khmer-fail.jpg : previous pdf saved as image
  • khmer-expected.pdf : pdf exported by libreoffice writer after copy/pasting the text from khmer-fail.pdf
  • khmer-expected.jpg : previous pdf saved as image
   
I have opened the failing pdf in several viewers/OS with the same display.

Since copy/pasting the text from the failing PDF seems to render the correct text in other editors, I would guess the byte order is correct. So there must be something different in the properties of the PDF file itself.

What I can see about the properties of the pdf font

Failing pdf:
KhmerOS
TrueType (CID)
Encoding: Identity-H
Embedded

Expected pdf:
Khmer OS
TrueType
Encoding: WinAnsi
Embedded Subset

Digging further, it seems that this might not be an issue with this library but possibly something to do with Ubuntu 14.04 as well. I ran a test in Python which produces the exact same (failing) result. Unfortunately, I don't have other OS I can test on at the moment.

I guess this might have something to do with CIDtoGIDMap but really have no idea what this should be or where it comes from.

khmer-expected.pdf
khmer-fail.pdf
khmer-fail  <as 4 glyphs>
khmer-expected  <as 3 glyphs>

 PhilterPaper commented on Nov 5

OK, I can see that the 'failed' PDF has 3 characters, and the 'expected' (correct) version combined the first two and moved them after the third.

I fed the two PDFs into WinMerge, and unfortunately it appears that LibreOffice did massive changes to the file. It's not simply a matter of a record or two added or deleted. I will look at disassembling both files to see how they differ, but this unfortunately promises to be a lengthy process. I'm hoping that in the end it's just a minor change to the PDF::Builder output, and that most of the differences are unimportant with regards to the rendering of the text.

Thank you for helping with this, and don't hesitate to add more information if you can. I don't know if I can get to this any time soon, but I'll try.

 PhilterPaper commented on Nov 5

This is going to be worse than I thought. I was able to uncompress the PDF files (deflate compressed streams) using PDFtk, but it looks like even that rearranged and renumbered a lot of stuff. I will try to find another way to uncompress everything so I can look at it. It's also possible that the uncompressed code was corrupted -- for instance, the section which maps a subset of the TTF shows only three glyphs, with a Unicode value of U+1786 for all of them (CHA). The content object (the actual text output) shows just three glyphs, but until I can decode the subsetting stream, I don't know which they are. Shouldn't there be four?

Also, the glyph order (given as four Unicode entries) is U+1786 (CHA) U+17D2 (COENG) U+1780 (KA) U+17C2 (AE). I see that this seems to be Left-to-Right on "failed" and Right-to-Left on "expected" (AE is on the right on "failed", and on the left on "expected"). Can you confirm that this is the correct ordering? Is this something to do with the syntax of the language, that glyphs are taken out of order, or is there a problem somewhere?

Is there any chance of your running both 113700.pl and LibreOffice without any compression (no "Flate" compression used)? I attach the 113700.pl file with compression turned off:
113700.pl.txt
LibreOffice messed with the text position on the page and the font size, so who knows what else it changed.

The only difference between "failed" and "expected" is that "expected" is "failed" run through LibreOffice?

 jbenezech commented on Nov 5

Please find attached

  • testkhmer.pdf : output of the script without compression
  • pdf-from-printer.pdf : I could not find a way to disable compression on libreoffice. I printed a text file to the system pdf printer which hopefully doesn't add too much garbage to the file
  • pdf-create.py.txt : a python script that demonstrates the same problem in python
  • pdf-from-python.pdf : output of the python script
   
Regarding ordering, AE should be the last character in the unicode sequence but the glyph should be rendered first (as shown in the "expected" pdf).
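
The logical-to-visual move described here could be sketched roughly as follows. This is an illustration under stated assumptions, not what any PDF library or shaping engine actually does: to_visual_order() is a hypothetical helper, only U+17C1 to U+17C3 (E, AE, AI) are treated as vowels drawn on the left, and the consonant range U+1780 to U+17B3 is simply the one used elsewhere in this thread.

```perl
use strict;
use warnings;

# Assumption: letters that can start a cluster or follow COENG.
my $BASE    = qr/[\x{1780}-\x{17B3}]/;
my $COENG   = qr/\x{17D2}/;
# Assumption: only E, AE and AI are handled as left-side vowels.
my $PREBASE = qr/[\x{17C1}-\x{17C3}]/;

# to_visual_order() is a hypothetical helper. It moves a left-side vowel in
# front of the whole cluster it follows (a base letter plus any number of
# COENG+letter pairs), which is the order the glyphs are drawn in.
sub to_visual_order {
    my ($text) = @_;
    $text =~ s/($BASE(?:$COENG$BASE)*)($PREBASE)/$2$1/g;
    return $text;
}

my $logical = "\x{1786}\x{17D2}\x{1780}\x{17C2}";   # CHA COENG KA AE
my $visual  = to_visual_order($logical);            # AE CHA COENG KA
```

The vowel moves in front of the whole cluster, not just one position to the left, so for "dog" AE ends up before CHA even though it follows KA in the stored text.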

testkhmer.pdf
pdf-from-printer.pdf
pdf-from-python.pdf
pdf-create.py.txt  <files not attached>

 PhilterPaper commented on Nov 6

Curiouser and curiouser. I don't have Python installed, so I can't run your script, but it appears to be using a different PDF library, and not PDF::Builder or PDF::API2. Maybe their library was translated from PDF::API2? Anyway, like PDF::Builder, it outputs four glyphs in the same order, producing the same "fail" result. Also note that it's producing PDF version 1.3 output, which may be missing a lot of functionality.

The printer version I could uncompress, and like your "expected" version, it has three glyphs output, so someone is doing some processing to consolidate and rearrange the glyphs, something which PDF::Builder may need to do. I can see the three glyphs are (in order) U+17C2 (AE) U+1786 (CHA) and U+FFFD. That last one may be a problem with decompression, as FFFD is normally "Invalid Character". I'm not sure what happened to U+17D2 (COENG) and U+1780 (KA)... did they get preprocessed and consolidated, or something else? I thought the whole intent was for the reader to do such processing. By the way, was this printer version output from LibreOffice, or from a PDF reader?

Regarding ordering, if AE (U+17C2) is the last glyph in the sequence, under whose rules is it moved to the front? Do vowels always get moved ahead of consonants? Again, that's something that should reasonably be left to the reader, but apparently this reordering needs to be done in producing the PDF! You mentioned that this seems to be a problem with other Devanagari-based alphabets in PDF::Builder.

I'll keep chipping away at this, but for the moment I'm stuck. The first thing is to find a clean decompression (deflate) utility to uncompress the Flate-compressed streams. PDFtk is doing a lot more than just that, and I fear it may be breaking something in the process. I don't need a working PDF created, just something with human-readable uncompressed streams. Finally, I appreciate your efforts on this, and welcome any further help you can give.
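
For inspecting Flate-compressed streams without rewriting the file, a few lines of Perl with the core Compress::Zlib module may be enough. The following is a rough sketch, not a PDF parser: inflate_streams() is a hypothetical helper that ignores /Length, filter chains and predictors, and simply tries to inflate whatever sits between "stream" and "endstream".

```perl
use strict;
use warnings;
use Compress::Zlib qw(inflateInit Z_OK Z_STREAM_END);

# inflate_streams() is a hypothetical helper. It takes the raw bytes of a PDF
# and returns the contents of every stream that inflates cleanly. Streams
# that are not Flate-compressed fail to inflate and are skipped.
sub inflate_streams {
    my ($pdf) = @_;
    my @decoded;
    while ($pdf =~ /(?<!end)stream\r?\n(.*?)endstream/sg) {
        my $data = $1;                      # copy: inflate() consumes its input
        my $z = inflateInit() or next;
        my ($out, $status) = $z->inflate($data);
        next unless defined $out
            && ($status == Z_OK || $status == Z_STREAM_END);
        push @decoded, $out;                # end-of-line bytes after the data are ignored
    }
    return @decoded;
}
```

To use it, read the PDF in binary mode (for example with open(my $fh, '<:raw', $file)), slurp it into a scalar, and print each returned string.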

 PhilterPaper commented on Nov 6

Attached below is a dump of the KhmerOS.ttf font:
022_truefonts.KhmerOS_.pdf
113700.pl.txt

It shows the desired KA.sub character at G+382. My suspicion is that there is logic in LibreOffice and elsewhere that knows if it sees COENG and a following character (e.g., KA), to get the glyph name of that character ("uni1780"), add ".sub" ("uni1780.sub"), get the CID (382 decimal, 017e hex), and replace the COENG and KA code points with this CID. It may not be that simple in general (there are other subscripts that can be added to base characters). Anyway, I manually edited the testkhmer.pdf file produced by 113700.pl, and it fails to show KA.sub. Maybe it needs to be explicitly listed in the mapping, or maybe the zero width is suppressing it.
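
The suspected COENG handling can be sketched in a few lines. This is an illustration of the idea only: glyph_names() is a hypothetical helper, the "uniXXXX" and ".sub" naming follows the KhmerOS dump mentioned above and may differ in other fonts, and the only glyph number taken from this thread is 382 for uni1780.sub.

```perl
use strict;
use warnings;

# glyph_names() is a hypothetical helper. It names each code point "uniXXXX"
# and folds COENG (U+17D2) plus the following letter into one "uniXXXX.sub".
sub glyph_names {
    my ($text) = @_;
    my @cps = map { ord($_) } split //, $text;
    my @names;
    while (@cps) {
        my $cp = shift @cps;
        if ($cp == 0x17D2 && @cps && $cps[0] >= 0x1780 && $cps[0] <= 0x17B3) {
            # COENG followed by a consonant or independent vowel: one glyph
            push @names, sprintf('uni%04X.sub', shift @cps);
        } else {
            push @names, sprintf('uni%04X', $cp);
        }
    }
    return @names;
}

# Only this one entry comes from the font dump; a real implementation would
# have to look every glyph name up in the font itself.
my %name_to_gid = ('uni1780.sub' => 382);

my @names = glyph_names("\x{1786}\x{17D2}\x{1780}\x{17C2}");
```

For the "dog" string this yields three glyph names from four code points.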

The AE vowel is shown as combining to the right, so why is it to be specified as the last Unicode point? Shouldn't it be specified first? Or is it intended to combine with the consonant given immediately before it? I tried AE as both the first glyph code, and the last. I'm confused.

 PhilterPaper commented on Nov 8

I've made some progress, but it's very, very ugly code. Attached are

  • a revised PDF/Builder/Resource/CIDFont.pm which looks for a COENG (U+17D2) and a following consonant or independent vowel (U+1780 .. U+17B3) and changes the generated CID string to replace the pair of CIDs (G+nnn) with a new CID of the subscripted character.
  • a revised 113700.pl to generate a test PDF with 3 examples
  • testkhmer.pdf output of 113700.pl
   
CIDFont.pm.txt
113700.pl.txt
testkhmer.pdf  <files not attached>

This is hard coded to work with a narrow range of Khmer alphabet characters. At this point, I want to see if I'm more or less on the right track, before doing a lot more work. I'm not sure if it will work for other Khmer TTF files (if they have different CID assignments), how subscripts etc. other than "*.sub" names work, and I need to write a name-to-CID function for PDF::Builder. The AE vowel needed to be manually moved to before the consonant -- what are the rules for that (it needs to be automated)? And of course, more work would have to be done for other Devanagari-family scripts.

Anyway, please temporarily replace your CIDFont.pm file with the new one, and try it out on some Khmer text. I will need to know if the rule set needs to be greatly extended to handle all the other .sub.alt, .a, .sub.a, .sub.alt.a, .au, .sub.au, .sub.alt.au, .sub.alt1, .sub.alt2, .alt1, .sub.alt3, and 4 or so special cases. That doesn't even consider whether CIDs (numbers and names) are constant across different font files.

I'm working on the assumption that LibreOffice, etc. are doing something like this substitution and rearrangement internally, rather than the PDF reader doing it. That could explain why (for the "dog" example) there are only 3 glyphs being output, rather than the 4 you would expect from naively changing Unicode points to CIDs.

Thanks!

 jbenezech commented on Nov 9

I can confirm that the subscript is now rendered correctly.
AE is still misplaced on the right side using the initial test script but, as you mentioned, you did move it manually in the second line of your latest script. That second line now renders the word "dog" correctly, but AE should not have to be manually moved.

I tried a Java script using iText, with the same failed result, so this seems to be a very common issue. Although iText seems to say they support Devanagari scripts in their commercial edition.
Found this link here http://palashray.com/making-itext-work-with-indic-scripts/ which seems to indicate that there should be a glyph substitution table within the font file. Could it be that PDF::Builder skips reading this table?

 PhilterPaper commented on Nov 9

If this stuff is built into the font, all the better. I don't see anything in PDF::Builder that claims to be doing "glyph substitution" or anything similar, so it probably doesn't read or process this table. I will look around for documentation on this. If you happen to know a good online source for documentation (and even better, some code), I'd appreciate hearing about it.

The article about iText seems to be talking about ligatures (substituting one glyph for a stream of two or more glyphs: AB -> C), and not subscripted glyphs, although it might be applicable to that, too (not a ligature: A+B -> Ab, where + is COENG).

Can you point me to any documentation on these tables, as well as rules for moving the AE vowel? Is it always a move left by one glyph? It seems silly to put it after the glyph in the stream, but expect it to render before that glyph. It would be simple enough to look ahead for AE and any other vowels to which this applies.

If a given font does not contain glyph substitution rules, would it be reasonable to hard code fallback rules? If so, what are all the rules we would need to support? I suspect that there will be a lot of them, and not just for Khmer. If full support for glyph substitution is common, special code like I just wrote could just be left out (handle as is done today) if the font lacks the table(s).

Add: I've done some searching, and I see lots of description of external files to define ligatures and other operations you want to do with glyphs, but so far, nothing built into a font. Still looking.

 PhilterPaper commented on Nov 10

I'm making some progress here. I found that the GSUB rules are in the font file, but need to be explicitly read in to a huge hash array. Then I have to figure out how to interpret all these rules and lists of affected glyphs, etc. They're huge, and quite complicated. I have found the sections that handle the consonants and vowels with COENG, and I think I understand what they're doing, so I can get the same results as my code of a few days ago. Hopefully there are rules in there about moving vowels like AE.

Add: I don't see anything that looks like a simple move of AE from the right side of a consonant to the left, at least not in the GSUB (glyph substitution) data. Do you know if it might be lurking somewhere else in the font? Perhaps in a GPOS table somewhere?

If I go ahead and read the GSUB data for all TTF files, I have to see if there are any cases where we would not want to implement some of the rules (e.g., always replacing f and i by fi for Latin alphabet text).

Add: This might be something like $font->glyph_sub(1); to turn on glyph substitution when desired. You would always do this for Devanagari family languages, but optionally for others.
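
As a rough picture of what such an opt-in switch might control, here is a table-driven substitution pass over a list of glyph names. Everything in it is an assumption for illustration: the %rules layout is invented (real GSUB lookups are far richer), and substitute() is a made-up stand-in, not the proposed glyph_sub method.

```perl
use strict;
use warnings;

# Hypothetical rule table: two glyph names (joined by a space) map to one
# replacement glyph. A real table would be read from the font's GSUB data.
my %rules = (
    'f i'             => 'fi',            # Latin ligature, usually optional
    'uni17D2 uni1780' => 'uni1780.sub',   # Khmer COENG + KA, always wanted
);

# substitute() is a made-up helper. With $enabled false the glyph list
# passes through untouched, which is today's behaviour.
sub substitute {
    my ($enabled, @glyphs) = @_;
    return @glyphs unless $enabled;
    my @out;
    while (@glyphs) {
        if (@glyphs >= 2 && exists $rules{"$glyphs[0] $glyphs[1]"}) {
            # Replace the matching pair with its single substitute glyph
            push @out, $rules{ join ' ', splice(@glyphs, 0, 2) };
        } else {
            push @out, shift @glyphs;
        }
    }
    return @out;
}
```

Keeping the switch separate from the rules leaves room to apply script-required substitutions while leaving discretionary ones such as "fi" off.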