CTS 13 - Small Caps missing for some ligatures

Offline Phil

« December 25, 2017, 04:20:31 PM »
 PhilterPaper commented on Oct 24

Ref RT 120048 (#47): the discovery that some lower case characters (mostly ligatures) do not appear to have upper case equivalents. When "small caps" are used in synthetic fonts (-caps => 1), these characters are left unchanged. So, a word like "field" would come out in small caps as "fiELD" (imagine the "fi" is the U+FB01 f+i ligature). Presumably, the desired behavior (since there is no capitalized version of this ligature) would be to replace the "fi" ligature by "f" and "i", and then small-cap those. I can only speak for ligatures based on the Latin alphabet, and even then, I'm not sure what the capitalization rules are for some of them. There are a few single characters too, which currently may or may not uppercase properly, and here they are grouped with the ligatures.

Right now, it's up to the user (provider of the text) to be aware of whether they are using these ligatures in their text when they are going to display small caps. One choice would be to provide a translation utility to scan the text for certain code points, and replace them by pairs or triplets of characters. The user of PDF::Builder would have to manually call this utility routine everywhere they are feeding text that might contain such lower case ligatures to output using a small caps synthetic font.
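A translation utility of the kind described above might look roughly like this. It is a minimal sketch in Python, used only for illustration (PDF::Builder itself is Perl). The names `LIGATURE_EXPANSIONS` and `expand_ligatures` are hypothetical and not part of PDF::Builder, and the table covers only a handful of Latin ligature code points.

```python
# Hypothetical expansion table: ligature code point -> plain lower case letters.
# This is illustrative data, not anything shipped with PDF::Builder.
LIGATURE_EXPANSIONS = {
    "\u00DF": "ss",   # eszett
    "\u017F": "s",    # long s
    "\uFB00": "ff",
    "\uFB01": "fi",
    "\uFB02": "fl",
    "\uFB03": "ffi",
    "\uFB04": "ffl",
    "\uFB05": "st",   # long s + t
    "\uFB06": "st",   # short s + t
}


def expand_ligatures(text: str) -> str:
    """Replace each listed ligature by its component letters.

    Characters that are not in the table pass through unchanged.
    """
    return "".join(LIGATURE_EXPANSIONS.get(ch, ch) for ch in text)


if __name__ == "__main__":
    # Print code points rather than glyphs to avoid console encoding issues.
    word = "\uFB01eld"
    print([f"U+{ord(c):04X}" for c in expand_ligatures(word)])
```

A caller would run text through `expand_ligatures()` before handing it to a small caps font, so that a word containing U+FB01 arrives as plain "field" and every letter can be small-capped.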

More automated would be to update the output routines to scan and translate on-the-fly, once they know that a small caps font is being used. We would have to be careful to catch all uses of a font, such as the advancewidth() routine, so that proper character widths are used. Finally, we could look at the actual internal structure of the small caps font, and update the font to use pairs or triplets of characters when such ligatures are encountered in the input text. This is probably the most complicated method, but it would have the highest performance, and would treat such ligatures just like any other character. For example, the "ij" ligature is already small-capped as "IJ", so this has been done before.

Code point | Letters | Glyph | Upper case | Upper glyph | Notes
U+00DF | ss or sz | ß | SS | | German sharp s (eszett), resembles Greek beta. U+1E9E ẞ may be an acceptable upper case version (still a single glyph), rather than using a double-S, but is still uncommon in fonts
U+0149 | 'n | ŉ | 'N | | upper case is 'N (two glyphs, U+02BC U+004E) (Afrikaans, use discouraged). The capitalization rule in Afrikaans is a bit complicated
U+017F | f | ſ | S | | actually a "long s" (looks like "f" without the crossbar). Technically not a ligature
U+FB00 | ff | ﬀ | FF | |
U+FB01 | fi | ﬁ | FI | |
U+FB02 | fl | ﬂ | FL | |
U+FB03 | ffi | ﬃ | FFI | |
U+FB04 | ffl | ﬄ | FFL | |
U+FB05 | ft | ﬅ | ST | | actually a long s, not an f
U+FB06 | st | ﬆ | ST | | short s plus t
U+A733 | aa | ꜳ | AA | Ꜳ |
U+A735 | ao | ꜵ | AO | Ꜵ |
U+A737 | au | ꜷ | AU | Ꜷ |
U+A739 | av | ꜹ | AV | Ꜹ |
U+A73B | av-bar | ꜻ | AV-bar | Ꜻ |
U+A73D | ay | ꜽ | AY | Ꜽ |
U+1F670 | et | 🙰 | ET | |
U+A74F | oo | ꝏ | OO | Ꝏ | Massachusett language
U+A729 | tz | ꜩ | TZ | Ꜩ | used in German
U+1D6B | ue | ᵫ | UE | |
U+A761 | vy | ꝡ | VY | Ꝡ |

There are others, but they are very font and language (orthography) dependent. Note that the Greek "final sigma" (terminal sigma) maps to upper case Sigma, as does the ordinary sigma. Dotless i (U+0131) uppercases to plain I, while dotless j (U+0237) has no upper case mapping of its own in Unicode and would have to be mapped to J explicitly. The Dutch 't and 's, like the Afrikaans 'n, are normally not capitalized (except when the entire word is capitalized). See and similar articles for more than you would ever want to know about capitalization rules. They are inconsistent and complicated enough across languages that it may not be worth trying to fully automate them (e.g., title and sentence caps), but still, it would be jarring to see lower case letters (or ligatures) mixed in with capitals/small caps when you have requested capitalization.
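To see which of these characters have an upper case form at the Unicode level (as opposed to in a particular font's tables), one can ask a Unicode-aware uppercasing routine. The sketch below uses Python's `str.upper()`, which applies Unicode's full case mappings; it is only an illustration, `upper_kind` is a hypothetical helper name, and the results depend on the Unicode version built into the interpreter. It says nothing about what Perl's uc() or PDF::Builder's font data do.

```python
def upper_kind(ch: str) -> str:
    """Classify how str.upper() treats a single character.

    "none"     : the character is returned unchanged
    "single"   : it maps to one different code point
    "sequence" : it maps to two or more code points
    """
    up = ch.upper()
    if up == ch:
        return "none"
    return "single" if len(up) == 1 else "sequence"


if __name__ == "__main__":
    # Print code points only, to avoid console encoding issues.
    for ch in ["\u00DF", "\uFB01", "\uA733", "\u1D6B"]:
        print(f"U+{ord(ch):04X} -> {upper_kind(ch)}")
```

With a reasonably recent Unicode database, eszett and the f+i ligature come back as sequences ("SS", "FI"), U+A733 maps to the single code point U+A732, and U+1D6B has no upper case mapping at all, so only the last would need a hand-written rule at this level.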

While we're on the subject, we might also want to check if a given font contains the requested ligature, and if not, replace it by the (lower case) appropriate characters. This could blend into the request (#56) for fallback fonts for missing glyphs... to either replace them with separate characters, or use a different font that does contain them. This might even be needed for ligatures (such as "oe" or "ij") that normally have upper case equivalents. Finally, uppercasing in general of such ligatures, not just for small caps, would be of interest.
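The "replace a missing ligature by its separate characters" idea could be sketched as follows. This is Python for illustration only; `replace_unsupported` and the `supported` set (standing in for "code points this font has a glyph for") are hypothetical, and using Unicode compatibility decomposition (NFKD) to find the component letters is an assumption of this sketch, not something PDF::Builder is known to do. NFKD covers the U+FB00 to U+FB06 ligatures but not characters such as "oe" (U+0153) or eszett, which have no decomposition and would need an explicit table. It also splits accented letters into base letter plus combining mark, which a real implementation would want to exclude.

```python
import unicodedata


def replace_unsupported(text: str, supported: set) -> str:
    """Replace characters the font lacks by their compatibility decomposition.

    A character is replaced only if it decomposes to something different
    and every resulting character is itself supported. Anything else is
    left alone, for a fallback font or missing-glyph handling to deal with.
    """
    out = []
    for ch in text:
        if ch in supported:
            out.append(ch)
            continue
        decomposed = unicodedata.normalize("NFKD", ch)
        if decomposed != ch and all(c in supported for c in decomposed):
            out.append(decomposed)
        else:
            out.append(ch)
    return "".join(out)


if __name__ == "__main__":
    basic_latin = set("abcdefghijklmnopqrstuvwxyz")
    print(replace_unsupported("wa\uFB04e", basic_latin))  # prints "waffle"
```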

This concerns only the uppercasing of certain lower case Latin alphabet ligatures (once the decision has been made to use them in the text source). Whether it is appropriate to replace a pair or triplet of lower case letters with a ligature depends upon the language, the font being used, and the word itself. For example, in English orthography, a "shelfful" of books should not use the "ff" ligature, while a "waffle" could use the "ffl" ligature. PDF::Builder should probably leave this to the author of the text. Even an automated search for ligature candidates would have to have an exclusion list (e.g., shelfful) and be aware of the language being used, subsets of that language and where a ligature would be appropriate, and the font in use (what ligatures are supported).
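To illustrate why an automated ligature search needs an exclusion list, here is a small Python sketch (illustration only; nothing here is proposed for PDF::Builder). `EXCLUDED_WORDS`, `LIGATURES`, `ligate_word`, and `ligate_text` are hypothetical names, the rules are English-only, and the sketch ignores the font and any language subset entirely.

```python
import re

# Words in which a ligature would span a morpheme boundary (illustrative list).
EXCLUDED_WORDS = {"shelfful"}

# Longest sequences first, so "ffl" is tried before "ff" and "fl".
LIGATURES = [
    ("ffi", "\uFB03"),
    ("ffl", "\uFB04"),
    ("ff", "\uFB00"),
    ("fi", "\uFB01"),
    ("fl", "\uFB02"),
]


def ligate_word(word: str) -> str:
    """Insert ligatures into one word unless it is on the exclusion list."""
    if word.lower() in EXCLUDED_WORDS:
        return word
    for letters, ligature in LIGATURES:
        word = word.replace(letters, ligature)
    return word


def ligate_text(text: str) -> str:
    """Apply ligate_word() to every run of ASCII letters in the text."""
    return re.sub(r"[A-Za-z]+", lambda m: ligate_word(m.group(0)), text)
```

Here `ligate_text("a waffle")` yields the word with U+FB04 in place of "ffl", while "shelfful" passes through untouched. Every further exception would have to be added to the list by hand, which is a good part of the argument for leaving the decision to the author of the text.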

 PhilterPaper commented on Nov 3

Some further thoughts...

  • We need to see if ->{'upper'} is replacement text (a string), or just points to another code point index. If it's a string, it might be possible in the Small Caps code to check whether 'upper' is undefined for certain ligatures, etc., and add one (e.g., for eszett, create 'upper' = 'SS'). If 'upper' merely points to another Unicode point, we would probably have to create a full entry for the new 'upper'. Check to see if it's a single point, or allows an array of points (functionally equivalent to a string).
  • Don't forget that some fonts have upper case equivalents for various ligatures and other characters, while others have none at all. For those lacking an upper case equivalent, we would need to provide a series of characters (e.g., 'ij' -> 'IJ', or 'oe' -> 'OE'). In some fonts, it is possible that glyphs don't exist for upper case forms of some ligatures, even those with Unicode points.
  • In addition to the glyphs in the table, Unicode may provide other ligatures and special characters with or without equivalent upper case forms (e.g., long s, 'n, 't, etc.). A given font may or may not provide a glyph for an upper case form (or even for the lower case form). PDF::Builder could run into the situation where the text requests, say, the 'ffl' ligature, and the font doesn't provide it. The three separate letters would then have to be substituted.
  • Generic uppercasing of a character or string containing ligatures and special characters is related to Small Caps (and might share code), but there is the complication that it would be font- and encoding-specific. We need to fully understand what Perl's uc() function does on non-ASCII characters (such as accented Latin characters) for various encodings and for UTF-8. It might be possible to offer an extended 'to_upper()' function, given a string, encoding, and font information.
  • Some fonts contain ligatures (e.g., tt, ttl, etc.) that do not have Unicode points. I'm not sure how a user or application would specify these in text in the first place (giving the CID instead of a Unicode value?). We may find it useful to provide a clean way to give such ligatures in text.
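The first bullet's idea of adding a missing 'upper' entry can be sketched as below, under the explicit assumption that 'upper' may hold a replacement string (which the bullet itself says still has to be checked). This is Python for illustration only; `glyph_table`, `DEFAULT_UPPER`, `fill_missing_upper`, and `to_small_cap_source` are hypothetical names, and the dictionary layout is invented, not PDF::Builder's internal structure.

```python
# Hypothetical defaults for characters whose font entry lacks an 'upper' form.
DEFAULT_UPPER = {
    "\u00DF": "SS",   # eszett
    "\u017F": "S",    # long s
    "\uFB00": "FF",
    "\uFB01": "FI",
    "\uFB02": "FL",
    "\uFB03": "FFI",
    "\uFB04": "FFL",
    "\uFB05": "ST",
    "\uFB06": "ST",
}


def fill_missing_upper(glyph_table: dict) -> dict:
    """Add a default 'upper' string to entries that exist but lack one.

    Entries that already define 'upper' are left alone, and no new
    entries are created for characters the font does not contain.
    """
    for ch, replacement in DEFAULT_UPPER.items():
        entry = glyph_table.get(ch)
        if entry is not None and "upper" not in entry:
            entry["upper"] = replacement
    return glyph_table


def to_small_cap_source(text: str, glyph_table: dict) -> str:
    """Map each character to its 'upper' string, if the table has one."""
    return "".join(glyph_table.get(ch, {}).get("upper", ch) for ch in text)
```

If 'upper' turns out to be a single code point index rather than a string, this approach does not carry over directly, and a full entry (or an array of points) would be needed instead, as the bullet notes.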

 PhilterPaper commented on Nov 12

The GSUB tables in TTF files may provide information about available non-Unicode ligatures (e.g., ttl) in some fonts, which could be used to properly uppercase such ligatures. However, since they do not have Unicode points, they will not be ligatures in the raw text code in the first place (only dynamically during output and glyph selection), so uppercasing may not have any problem. Only those ligatures and special characters (e.g., long s) with defined Unicode points will likely be a problem for uppercasing and small caps.

Note that some fonts define "petite caps", which are similar in function to "small caps", but match the x-height of lowercase letters (with small caps being slightly taller).

 PhilterPaper commented on Nov 25

After exploring the GSUB and GPOS capabilities of OpenType, it appears that good practice is not to use the Unicode ligature points (except possibly for eszett, which is commonly treated as a letter), but to let the rendering system (glyph production and substitution) build ligatures on the fly. This way, letters are always discrete (e.g., 'f' and 'i') rather than already being ligatures in the source ('fi'), and can be capitalized and small-capped without worrying about dealing with ligatures. In addition, the glyph substitution code can insert the original letters for search purposes.

A possible downside is that figuring the width of a word (advancewidth) could get a bit complicated if some letter sequences are replaced by ligatures on-the-fly. At the least, the code cannot simply look up character widths by Unicode point, but has to ask the output routines if they plan to combine any letters into ligatures.
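The width problem can be made concrete with a toy example. This is a Python sketch for illustration only: the glyph widths are invented numbers in 1/1000 em units, `LIGATURE_RULES` stands in for whatever substitutions the output routines would actually perform, and `shape`, `naive_width`, and `shaped_width` are hypothetical names unrelated to PDF::Builder's advancewidth().

```python
# Invented advance widths in font units (1/1000 em); not from any real font.
WIDTHS = {"f": 333, "i": 278, "e": 444, "l": 278, "d": 500, "\uFB01": 556}

# Stand-in for the substitutions the renderer would make on the fly.
LIGATURE_RULES = [("fi", "\uFB01")]


def shape(text: str) -> str:
    """Apply the ligature substitutions the renderer is assumed to make."""
    for letters, ligature in LIGATURE_RULES:
        text = text.replace(letters, ligature)
    return text


def naive_width(text: str) -> int:
    """Sum per-character widths, ignoring any ligature substitution."""
    return sum(WIDTHS[ch] for ch in text)


def shaped_width(text: str) -> int:
    """Sum widths of the glyphs that would actually be output."""
    return sum(WIDTHS[glyph] for glyph in shape(text))


if __name__ == "__main__":
    print(naive_width("field"), shaped_width("field"))  # prints "1833 1778"
```

With these made-up numbers the two results differ by 55 units for a single word, which is why the width calculation would have to go through the same substitution step as the output.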

Eszett, long s, and possibly 'n/'s/'t, might still be problematic in capitalization and small caps, and require special treatment. I need to look at whether TrueType/OpenType fonts have any content that helps with determining if a given ligature (such as eszett) has an uppercase or small caps equivalent. I don't think there's any help for ligatures in core or Type1 fonts, although many do have a few ligatures.
« Last Edit: May 12, 2019, 07:50:14 PM by Phil »


Offline Phil

Re: CTS 13 - Small Caps missing for some ligatures
« Reply #1: January 07, 2019, 09:27:11 PM »
I just pushed to the code repository some improvements to synthetic font handling ($pdf->synfont()). Although I was able to take care of eszett (ß) folding to "SS" for small-caps (and dotless i and dotless j to I and J), I was unable to come up with a fix for ligatures and long s (ſ). The problem is that these other characters will be on alternate planes (plane 1+), and so far I have not found a way to access ASCII letters from those planes. Therefore, a ligature such as "ffi" cannot be replaced by small caps "FFI".
« Last Edit: May 12, 2019, 07:50:32 PM by Phil »