Hyphenation

*

Offline sciurius

  • Jr. Member
  • **
  • 67
    • View Profile
    • Website
Hyphenation
« May 02, 2017, 05:27:14 AM »
I strongly advise against adding hyphenating code. You'll find yourself in a terrible mess before you know it.

Note that this applies to the hyphenating code. Support for hyphenation is greatly appreciated but should be handled via external libraries/tools.

*

Offline Phil

  • Global Moderator
  • Sr. Member
  • *****
  • 430
    • View Profile
Re: Hyphenation
« Reply #1: May 02, 2017, 12:33:38 PM »
If I understand your post, you are advising against including hyphenation code (or paragraph-shaping code), and suggesting that I should instead provide an interface to external modules. Is this correct? That's fine by me, and I'm open to simply providing interfaces to good hyphenation/shaping code. Suggestions are more than welcome. Right now, I just have some very simple hyphenation (soft hyphens, camelCase, punctuation, runs of letters or digits, etc., but no word splitting). Hopefully any external modules will cover those, too.

Can I presume that you've taken a quick look at 3.003, just released last night?

*

Offline sciurius

Re: Hyphenation
« Reply #2: May 03, 2017, 05:29:45 AM »
Personally I would draw a line between the PDF technical aspects (document structure, graphics, fonts, ...) and typesetting. And paragraph shaping is typesetting (just like changing contrast on images belongs to the realm of image manipulation).

It is fine to provide some basic facilities for paragraph shaping, but be very careful about adding more, since it won't stop until you have reimplemented LibreOffice. It is fine to provide very basic (and language-agnostic) word breaking (i.e. on soft and hard hyphens), but anything else will be great for some and a nuisance for others.

Please bear in mind that many users are using PDF::API2 to produce native language (or mixed language) documents. So do not hyphenate by default but let the user explicitly ask for it. And don't fall back to "en" if support for the user-designated language is not available.

FWIW, I would not put Hyphenate_en.pm under PDF::API2::Content, probably better under PDF::API2::Utils or something similar. Also, "_en" is too simple. There's en_US, en_CA, en_GB, and so on.

The code in Hyphenate_en.pm is talking about encodings again. Remember, you do not need to deal with encodings in Perl. Just replace the literal 173 by "\x{ad}" and it will just work.
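
A minimal sketch of that last point, assuming the text is already a decoded Perl character string. The sample word is made up, and this is not the code from Hyphenate_en.pm:

```perl
use strict;
use warnings;

# U+00AD SOFT HYPHEN; chr(173) and "\x{ad}" are the same one-character string.
my $SHY = "\x{ad}";

# Hypothetical sample word with two discretionary break points.
my $word = "hy${SHY}phen${SHY}ation";

# Split at the soft hyphens to get the fragments a line breaker could use.
my @fragments = split /\x{ad}/, $word;

# Remove the soft hyphens before the text is output.
(my $clean = $word) =~ s/\x{ad}//g;

print join('|', @fragments), "\n";   # hy|phen|ation
print "$clean\n";                    # hyphenation
```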

If you insist on splitting on punctuation, you may consider using the built-in character class patterns like [:punct:].
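
For illustration only, a small sketch of how such a class could be used. Note that [:punct:] is only valid inside a bracketed class. The sample string is made up, and splitting after every punctuation character is just the simplest possible rule:

```perl
use strict;
use warnings;

# Hypothetical sample: a path-like "computer word".
my $text = 'lib/PDF_API2.pm';

# POSIX classes must sit inside a bracketed class, hence [[:punct:]].
# Split after each punctuation character, keeping it on the left fragment.
my @pieces = split /(?<=[[:punct:]])/, $text;

print join(' ', @pieces), "\n";   # lib/ PDF_ API2. pm
```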

*

Offline Phil

Re: Hyphenation
« Reply #3: May 03, 2017, 08:53:31 AM »
All fine points! I welcome critical discussion of what direction this should go in.

Quote
Personally I would draw a line between the PDF technical aspects (document structure, graphics, fonts, ...) and typesetting. And paragraph shaping is typesetting (just like changing contrast on images belongs to the realm of image manipulation).

It is fine to provide some basic facilities for paragraph shaping, but be very careful about adding more, since it won't stop until you have reimplemented LibreOffice. It is fine to provide very basic (and language-agnostic) word breaking (i.e. on soft and hard hyphens), but anything else will be great for some and a nuisance for others.
So, you would recommend that real paragraph shaping and other typesetting functions be kept out of PDF::API2 and in a separate package (that might call PDF::API2)? That's reasonable. I've been wondering where a good place is to draw the line. Some very, very basic calls like paragraph() and section() were already there and possibly being used, so I'll leave them (unless you can prove that no one is using them). I won't add anything to do markup within a paragraph (bold, italic, etc.) within PDF::API2.

Quote
Please bear in mind that many users are using PDF::API2 to produce native language (or mixed language) documents. So do not hyphenate by default but let the user explicitly ask for it. And don't fall back to "en" if support for the user-designated language is not available.
I realize that different languages will have different hyphenation rules, and there may even be different rules for different applications (publishers, etc.). As you say, it would probably be better not to hyphenate by default. If a user does request hyphenation, but does not have their language hyphenation support installed, do you think it would be better to not fall back to 'en' (simply refuse to hyphenate)?

This brings up a point that I've long been curious about. In right-to-left (RTL) and bidirectional scripts such as Hebrew and Arabic, what does "left justified" mean (and, by extension, "right justified")? Is it the side where the line begins, or is it "left is left"? In other words, to "left justify" Hebrew, would the lines align on the physical right? That is, does justification use a logical left and right, rather than a physical left and right? I suppose the same question arises with Chinese and other East Asian languages when written top-to-bottom… where is justification?

Quote
FWIW, I would not put Hyphenate_en.pm under PDF::API2::Content, probably better under PDF::API2::Utils or something similar. Also, "_en" is too simple. There's en_US, en_CA, en_GB, and so on.
_en was intended to be a basic fallback (at least for English). en_US, etc. should override it (_en would be ignored if en_US was installed and that was the language request).

Do you have a specific reason for installing hyphenation support in some other place than Content? Is some other place better?

Quote
The code in Hyphenate_en.pm is talking about encodings again. Remember, you do not need to deal with encodings in Perl. Just replace the literal 173 by "\x{ad}" and it will just work.
I'll look at that again.

Quote
If you insist on splitting on punctuation, you may consider using the builtin character class patterns like [:punct:] .
I didn't want to split on all punctuation, just places where it would make sense to a reader in the flow of the text. For example, you wouldn't want to split at quotation marks or opening brackets. Also, normally a hard hyphen is not a split point.

*

Offline sciurius

Re: Hyphenation
« Reply #4: May 04, 2017, 08:16:43 AM »
Quote
Some very, very basic calls like paragraph() and section() were already there and possibly being used, so I'll leave them

Yes, that's fine. I'd expect the extension package to have similar (and even improved) functions.

Quote
If a user does request hyphenation, but does not have their language hyphenation support installed, do you think it would be better to not fall back to 'en' (simply refuse to hyphenate)?

Definitely. It is better to have non-hyphenated results than wrongly hyphenated.

Quote
In bidirectional (RTL) Middle Eastern languages, what is "left justified" …

I'm sorry, but I'm not familiar with this.

Quote
Do you have a specific reason for installing hyphenation support in some other place than Content? Is some other place better?

Hyphenate_en.pm doesn't have a relation to PDF. It is a general module providing general functions.

Quote
I didn't want to split on all punctuation, just places where it would make sense to a reader in the flow of the text. For example, you wouldn't want to split at quotation marks or opening brackets.

Human texts do not contain punctuation inside words. I think it's a computer-originated idiom to use things like long_variable_names and CamelCaseWords. And I'm not sure whether I'd want these to be split.

Quote
Also, normally a hard hyphen is not a split point.

Think again. Does the name 'hyphen' ring a bell?

*

Offline Phil

Re: Hyphenation
« Reply #5: May 04, 2017, 09:00:06 AM »
Quote
I didn't want to split on all punctuation, just places where it would make sense to a reader in the flow of the text. For example, you wouldn't want to split at quotation marks or opening brackets.

Human texts do not contain punctuation inside words. I think it's a computer-originated idiom to use things like long_variable_names and CamelCaseWords. And I'm not sure whether I'd want these to be split.
How about "computer-originated" (or, "re-educated" or "co-operative")? Other than hard hyphens and apostrophes, you're right that splitting should normally be only within words. However, as a practical matter, when you have long URLs, variable names, and other computer stuff, they're going to need to be split up to fit on lines. I could easily see a long URL that won't fit within an entire line — are there typesetting conventions for how to deal with that? E.g., split after a / or _, and do/do not hyphenate?

Perhaps a long computer word should preferably be given its own line if necessary, and only if it's too long for even that, split it at some point?

Quote
Quote
Also, normally a hard hyphen is not a split point.

Think again. Does the name 'hyphen' ring a bell?
Initially I had it always split on a hard hyphen. Then in checking on some English grammatical rules, I read (a number of sources) that a hard hyphen should not be a split point. So I changed it to user-selectable. Maybe it's a language-specific rule?

*

Offline sciurius

Re: Hyphenation
« Reply #6: May 05, 2017, 03:50:13 PM »
Quote
Are there typesetting conventions for how to deal with that? E.g., split after a / or _, and do/do not hyphenate?

Perhaps a long computer word should preferably be given its own line if necessary, and only if it's too long for even that, split it at some point?

An old typesetter once taught me that if the text doesn't fit nicely, rewrite it. Trying to stretch or squeeze more than a small amount makes the end result ugly.

I think the main problem is mixing two things that should be distinct: text paragraphs and arbitrary content. While it is (almost always) possible to automatically format a text paragraph (where 'text' is human prose), this is not possible for arbitrary content. Hence arbitrary content should be typeset 'as is', unformatted, possibly in the form of an example, figure, quote or something appropriate.

URLs do not normally occur in formatted text paragraphs, only in badly written articles. Remember, we're producing PDF documents. Why print a long and ugly URL when it can be stashed away as a link?

A good example is (not surprisingly) the PDF Reference documentation. It is formatted very well: there are many, many 'computer words', and yet none of them are broken. Probably the ugliest paragraph is at the end of page 420 (ref. version 1.7), where they decided (correctly, IMHO) not to break the matrix.

Personally, if URLs are needed in the text, I have made a habit of turning them into footnotes. See e.g. http://johan.vromans.org/articles/wxglade.pdf, page 3.

Quote
Then in checking on some English grammatical rules, I read (a number of sources) that a hard hyphen should not be a split point.

AFAIK, the purpose of a hyphen (hard U+2010, discretionary U+00AD) is to split on. If this is not desired, use non-breaking hyphen (U+2011, yes, the name is confusing). The problem is whether U+002D (ambiguous hyphen) should be treated as U+2010 or as U+2011.

Word processor manuals explicitly advise using non-breaking hyphens where appropriate (e.g. in telephone numbers), so it is safe to consider U+002D to be a split point. It may, however, be wise to add an option to change this default behaviour.
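
A rough sketch of that default plus an option to change it. The function and option names are hypothetical, and soft hyphens (U+00AD) are left out to keep it short:

```perl
use strict;
use warnings;

# Return the fragments of $word that a line may be broken between.
# The option name 'split_hyphen_minus' is hypothetical, for illustration.
sub hyphen_fragments {
    my ($word, %opt) = @_;
    my $split_ascii = $opt{split_hyphen_minus} // 1;   # default: U+002D splits

    # U+2010 HYPHEN always allows a break after it; U+2011 never does.
    my $class = $split_ascii ? qr/[-\x{2010}]/ : qr/\x{2010}/;

    # Break after the hyphen, keeping it at the end of the left fragment.
    return split /(?<=$class)/, $word;
}

my @on  = hyphen_fragments("well-known");                           # well- known
my @off = hyphen_fragments("well-known", split_hyphen_minus => 0);  # well-known
my @nb  = hyphen_fragments("555\x{2011}1234");                      # one fragment
```

With the default, U+002D behaves like U+2010; with the option switched off, only U+2010 is a split point. A word containing U+2011 is never split at that character.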

*

Offline Phil

Re: Hyphenation
« Reply #7: June 09, 2017, 11:26:15 PM »
I finally had some time to get back to thinking about this issue, and here's where things stand. First, there are 4 different kinds of hyphens to worry about:
  • U+002D  ASCII hyphen/minus sign — what almost everyone is going to type in when preparing text to be fed to PDF::Builder. Under most circumstances, it is apparently OK to split a line immediately after a hyphen. I don't know if it would be appropriate to allow splitting to be suppressed here. Some sources say that a hyphen used to attach a prefix or suffix (e.g., "co-operation") should not be a split point (effectively a non-breaking hyphen). Other sources claim that compound words (e.g., "third-year medical student", between "third" and "year") should not be split. It's confusing. This hyphen, which looks the same as the other three, is available on every keyboard, whereas the other three are usually not easy to type in (at least, not without knowing their UTF-8 byte sequence or Unicode value).
  • U+2010  Unicode hyphen — from what I can tell, it should be treated the same as the U+002D hyphen.
  • U+00AD  soft hyphen (optional hyphen, discretionary hyphen), HTML entity &shy;. Although there have been a lot of conflicting definitions, in the end it seems to mark a place where a word may be split. Note that soft hyphens need to be removed from text before being passed to PDF::Builder's output routines, or they will show up as hyphens in the reader.
  • U+2011  non-breaking hyphen — this looks like a normal hyphen (U+002D), but if at all possible, avoid breaking the word (line) after it. It is used for things such as telephone numbers, some date formats, and Social Security numbers, and ideally they should be kept as one unit, but if the text is longer than the line length (column width), you're gonna have to split it somewhere, and it might as well be after a non-breaking hyphen. If the text has to be split, you might as well first fill up as much as possible of the current line, rather than leaving a huge gap and starting the long text on the next line (and still having to split it).
Regarding the non-appearance on most keyboards of the last three hyphens, perhaps user input handling could include some sort of preprocessor, such as escape sequences (e.g., \- is a SHY, \= is a non-breaking hyphen) to turn them into Latin-1 or UTF-8 characters. For now, this is beyond the scope of PDF::Builder, although it might be added later. We will assume that all four hyphens arrive as native (binary) Latin-1 or UTF-8 sequences, and leave it at that.
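
As a sketch of what such a preprocessor might look like (nothing like this exists in PDF::Builder; the function name is hypothetical, and the \\ escape for a literal backslash is an extra assumption not mentioned above):

```perl
use strict;
use warnings;

# Map the suggested escapes to characters.
my %ESCAPE = (
    '-'  => "\x{ad}",     # \-  => U+00AD SOFT HYPHEN
    '='  => "\x{2011}",   # \=  => U+2011 NON-BREAKING HYPHEN
    '\\' => '\\',         # \\  => literal backslash (assumption)
);

sub expand_hyphen_escapes {
    my ($text) = @_;
    # Replace only the known escapes; any other backslash is left alone.
    $text =~ s/\\([-=\\])/$ESCAPE{$1}/g;
    return $text;
}

my $out = expand_hyphen_escapes('hy\-phen\-ation, call 555\=1234');
```

The result is a character string containing U+00AD and U+2011, which the later splitting and output steps can treat like any other input.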

We also need to consider whether the Unicode hyphen U+2010 and non-breaking hyphen U+2011 should be replaced by normal hyphens (U+002D) for consistent appearance (assuming that possibly the font either doesn't have U+2010 or U+2011, or they look different). Soft hyphens (U+00AD) all need to be removed in any case, so if one ends up being used as a split point, it will be replaced by a normal hyphen.

There is a U+2012 "figure dash", which may look like an en-dash (U+2013), as well as an em-dash and a quotation dash, but I have no plans to deal with these (should we?). It is usually permissible to split a line after an em-dash (without adding a hyphen, of course), but not a figure- or en-dash. A quotation dash, which apparently looks much like an em-dash, is used unpaired before the attribution of a quote, so you probably would never break after it, although possibly before (if it's an inline attribution).

I will make the default not to hyphenate, and let the user explicitly choose to hyphenate. There are a few calls in base PDF::Builder (text fill, paragraph, section, etc.) which should probably implement some level of line (word) splitting, but full-fledged paragraph shaping will not be built into the base PDF::Builder. Paragraph shaping involves getting all the possible word splits in the paragraph, and deciding when and where to hyphenate to get the best appearance. This can mean minimizing the sum of numeric "penalties" for too many consecutive lines ending in hyphens, "rivers" of white space, splitting of proper names and titles (language- and culture-dependent), widows and orphans (which means you need to find out if the following paragraph will have at least 4 lines of output), hyphenation on the last word of a column (or worse, a page), too-short last lines in a paragraph, and probably other considerations. Such items may be language- and even publisher-dependent, and (at least) the settings would have to be made specific to language and publisher, but an actual paragraph shaping routine might be itself language-independent.

Where words can be split is language-dependent, and may also depend on typesetting standards of a given publisher. English is straightforward in the sense that you simply find a split point (per rules and exceptions list), stick a hyphen on the end of the first fragment, and start the next line with the remainder (which in turn may need to be split again!). Some languages, such as German, may require doubling of one or more letters at the split, complicating calculations for line lengths. Anyway, if PDF::Builder itself is going to make use of language/publisher specific splitting libraries, there could be a PDF::Builder::WordSplit::Hyphenate_xx_xx module for each flavor, where xx_xx could be just a language code (like "en") or it could be language+country (e.g., en_GB). I will have to look and see if there is some sort of locale information in PDF::Builder, or if it needs to be added. There could even be publisher-specific extensions (e.g., de_DE_SV for Springer-Verlag German-language texts). Hyphenation would not be done if the requested language support module is not found (no fallback to, say, English), but we could consider allowing en as a fallback for any English en_XX request, or simply require an exact match to avoid unexpected results (could be a setting).
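
To make the lookup rule concrete, here is a minimal sketch. The namespace is the one proposed above; the function names and the 'allow_base' setting are hypothetical, and no such modules are assumed to exist yet:

```perl
use strict;
use warnings;

# Build the candidate module names for a locale such as "en_GB".
sub hyphenator_candidates {
    my ($locale, %opt) = @_;
    my $prefix = 'PDF::Builder::WordSplit::Hyphenate_';
    my @names  = ($prefix . $locale);              # exact match first

    # Optionally fall back from en_GB to en, never to a different language.
    if ($opt{allow_base} && $locale =~ /^([a-z]{2,3})_/) {
        push @names, $prefix . $1;
    }
    return @names;
}

# Return the first candidate that loads, or undef (meaning: do not hyphenate).
sub load_hyphenator {
    my ($locale, %opt) = @_;
    for my $module (hyphenator_candidates($locale, %opt)) {
        (my $file = "$module.pm") =~ s{::}{/}g;
        return $module if eval { require $file; 1 };
    }
    return;
}
```

If load_hyphenator() returns undef, the caller simply skips hyphenation instead of borrowing another language's rules.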

Hyphenate_xx_xx() would be fed either a single word (just doing "greedy" line splitting) or an array of the entire paragraph's words, and return in some form both the word fragments and the source of each split: hyphen or non-breaking hyphen (both of which need to be restored), soft hyphen, or by language algorithm. The paragraph shape routine might use different priorities, such as preferring to split on a soft hyphen or a hard hyphen if available (of equal or different priorities), and then try other splits. There might even be some sort of priority value built into the returned data, indicating where the preferred splits are.

Now, besides normal human prose, there can also be "computer" words, such as camelCase and underscore_separated_words, as well as long URLs with /'s and the like. In technical documents it may not always be possible to avoid typesetting such things (although the result may not be all that elegant). The current code splits camelCase between a lowercase and an Uppercase ASCII letter (note that names such as MacDonald could end up being split Mac- Donald, which is undesirable), as well as after runs of letters (ASCII only) or numbers or after certain punctuation. You don't want to split just after opening brackets [ ( { etc., nor opening (left) quotation marks of various kinds, nor just before ] ) } or closing/right quotation marks. To extend these to non-ASCII letters would be difficult enough for Latin-based alphabets, never mind non-Latin! The current code has hard coded switches, and could be extended to make these a hash in the argument list. We also need to consider whether adding hyphens to a (split) URL or other technical term is risking introducing errors and confusion if the reader thinks the hyphen is actually part of the word! However, it is very easy for URLs etc. to exceed the line length (even in a footnote), and thus require splitting.
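
The camelCase rule described above can be sketched in one regular expression. This is an illustration of the rule only, not the actual code, and the function name is hypothetical:

```perl
use strict;
use warnings;

# Split between an ASCII lowercase letter and an ASCII uppercase letter.
sub split_camel_case {
    my ($word) = @_;
    return split /(?<=[a-z])(?=[A-Z])/, $word;
}

my @parts = split_camel_case('camelCaseWord');   # camel, Case, Word
my @name  = split_camel_case('MacDonald');       # Mac, Donald (undesirable)
```

The second call shows the proper-name problem: the rule cannot tell MacDonald from a computer-style identifier.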

The current hyphenation looks only at the last word in a line (that is too long to fit, unsplit). This is known as "greedy" line splitting, and while it makes a paragraph most compact, it takes no action to prevent orphans and widows, nor other undesirable effects (e.g., hyphenated last word on a page). I'm really not sure whether there is a point to doing full splitting (according to language and publisher rules) for the little that paragraph() and section() will be used in full-bore quality typesetting. It would be nice to allow folding of long URLs and other computerese, but it might be better to do proper line splitting in another package.
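
For reference, a minimal sketch of greedy filling without any hyphenation. The function name is hypothetical, and the width callback here just counts characters, where real code would measure glyph widths:

```perl
use strict;
use warnings;

# Greedy fill: put as many words as fit on each line, then start a new one.
# $width_of is a caller-supplied code ref returning the width of a string.
sub greedy_lines {
    my ($words, $max_width, $width_of) = @_;
    my @lines;
    my $line = '';
    for my $word (@$words) {
        my $trial = length($line) ? "$line $word" : $word;
        if (length($line) && $width_of->($trial) > $max_width) {
            push @lines, $line;      # current line is full
            $line = $word;           # an over-long word gets its own line
        }
        else {
            $line = $trial;
        }
    }
    push @lines, $line if length($line);
    return @lines;
}

my @lines = greedy_lines([qw(the quick brown fox jumps)], 11,
                         sub { length $_[0] });
# @lines is ('the quick', 'brown fox', 'jumps')
```

A hyphenating version would, at the point where a word does not fit, try to split only that word and keep the first fragment on the current line. Nothing earlier in the paragraph is ever reconsidered, which is why widows, orphans and the like cannot be avoided this way.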

So, my proposal is to rename Hyphenate_en.pm to Hyphenate_basic.pm (language-independent), make all hyphenation optional (off by default), use only the current forms of hyphenation, no support for U+2010 and U+2011 hyphen variants, and leave all language-specific word splitting to another package. I may make some of the currently hard coded switches accessible in the call as hash elements. The base PDF::Builder has only rudimentary formatting and paragraph formation capabilities (text fill, paragraph, section) and they probably won't get any more enhancement than this level of hyphenation. If someone wants to use them for production, they can supply their text with SHY's already inserted, but will have to put up with no control over widows and orphans and column-break hyphens. Real typesetting (using PDF::Builder as its base) will have to do a much better job of paragraph shaping, and I agree that it's better left to a separate package.

Thoughts and comments?

*

Offline sciurius

Re: Hyphenation
« Reply #8: June 10, 2017, 08:50:52 AM »
The two major points are language-neutral basic splitting, and it being turned off by default. I wholeheartedly agree with both.

*

Offline Phil

Re: Hyphenation
« Reply #9: July 25, 2017, 04:22:13 PM »
I came across something interesting in the PDF-1.7 specification. It suggests that when words are split (at other than a hard hyphen), a soft hyphen be used (which a reader should display like a hard hyphen). When a screen reader or other text scraper sees the soft hyphen at the end of a line, it knows it can simply discard it when gluing the line back together into a long string. Also, resizable PDF reader displays can then reflow text into longer or shorter lines without introducing spurious hard hyphens in the middle of words.

*

Offline Phil

Re: Hyphenation
« Reply #10: December 17, 2017, 05:01:47 PM »
I just installed Text::Reflow and hope to have some time to play with it soon. I already see two major problems with it:
  • Line size is in characters, rather than points (or other dimensions). This is OK for fixed-pitch fonts, but will be a major problem with proportional fonts. We can't just count characters, but have to see what width each glyph takes up.
  • The rules are for English, or at least, the "no break here" words lists (titles, conjunctives) are English. At the least, a way will have to be provided to allow other languages' titles and conjunctive words to be handled.
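
To illustrate the first problem, a small sketch with made-up advance widths. The table and function name are hypothetical; a real program would take the widths from the font's metrics:

```perl
use strict;
use warnings;

# Made-up advance widths in 1/1000 em, for illustration only.
my %ADVANCE = (i => 278, l => 278, m => 833, w => 722);
my $DEFAULT = 500;

# Width of $text in points at the given font size.
sub text_width {
    my ($text, $font_size) = @_;
    my $units = 0;
    $units += $ADVANCE{$_} // $DEFAULT for split //, $text;
    return $units * $font_size / 1000;
}

my $narrow = text_width('ill', 10);   # 8.34
my $wide   = text_width('mmm', 10);   # 24.99
```

Both strings are three characters long, yet one is about three times as wide as the other, so a character count cannot stand in for line width.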
I haven't even run Text::Reflow yet, so I don't know how it's splitting within words (or if it even does), and what spelling rules it's using to split. Non-English languages and orthographies will have different rules, including repeating a letter on the next line, which could greatly complicate an algorithm that thinks it can have a split point and that's that. Also, ligatures and other glyph substitutions and positioning probably need to be disposed of first, before words are split.

At this point, I don't see prereq'ing Text::Reflow for PDF::Builder, but perhaps mining it for algorithms and ideas, to extend into my code.
« Last Edit: December 21, 2017, 05:41:22 PM by MrPhil »