I finally had some time to get back to thinking about this issue, and here's where things stand. First, there are 4 different kinds of hyphens to worry about:
- U+002D ASCII hyphen/minus sign: what almost everyone is going to type when preparing text to be fed to PDF::Builder. Under most circumstances, it is apparently OK to split a line immediately after a hyphen. I don't know if it would be appropriate to allow splitting to be suppressed here. Some sources say that a hyphen used to attach a prefix or suffix (e.g., "co-operation") should not be a split point (effectively a non-breaking hyphen). Other sources claim that compound words (e.g., "third-year medical student", between "third" and "year") should not be split. It's confusing. This hyphen, which looks the same as the other three, is available on every keyboard, whereas the other three are usually not easy to type (at least, not without knowing their UTF-8 byte sequence or Unicode value).
- U+2010 Unicode hyphen: from what I can tell, it should be treated the same as the U+002D hyphen.
- U+00AD soft hyphen (optional hyphen, discretionary hyphen), HTML entity &shy;: there have been a lot of conflicting definitions, but in the end it seems to mark a place where a word may be split. Note that soft hyphens need to be removed from text before it is passed to PDF::Builder's output routines, or they will show up as hyphens in the reader.
- U+2011 non-breaking hyphen: this looks like a normal hyphen (U+002D), but if at all possible, avoid breaking the word (line) after it. It is used for things such as telephone numbers, some date formats, and Social Security numbers, which ideally should be kept as one unit. However, if the text is longer than the line length (column width), you're going to have to split it somewhere, and it might as well be after a non-breaking hyphen. In that case, you might as well first fill up as much as possible of the current line, rather than leaving a huge gap and starting the long text on the next line (and still having to split it).
Regarding the absence of the last three hyphens from most keyboards, perhaps user input handling could include some sort of preprocessor that turns escape sequences (e.g., \- for a SHY, \= for a non-breaking hyphen) into Latin-1 or UTF-8 characters. For now, this is beyond the scope of PDF::Builder, although it might be added later. We will assume that any of these four hyphens arrive as native (binary) Latin-1 or UTF-8 characters, and leave it at that.
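As a rough illustration of the preprocessor idea, here is a minimal sketch in Python (used only for brevity; PDF::Builder itself is Perl). The function name and the escape table are hypothetical, and the `\\` escape for a literal backslash is an added assumption, not something proposed above.

```python
SOFT_HYPHEN = "\u00ad"          # U+00AD
NON_BREAKING_HYPHEN = "\u2011"  # U+2011

# Hypothetical escape table: "\-" and "\=" are the examples floated above;
# "\\" (a literal backslash) is an extra assumption for this sketch.
ESCAPES = {
    "\\-": SOFT_HYPHEN,
    "\\=": NON_BREAKING_HYPHEN,
    "\\\\": "\\",
}


def expand_hyphen_escapes(text):
    """Replace two-character escape sequences with the characters they name."""
    out = []
    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        if pair in ESCAPES:
            out.append(ESCAPES[pair])
            i += 2
        else:
            out.append(text[i])
            i += 1
    return "".join(out)
```

Scanning left to right (rather than chaining string replacements) keeps an escaped backslash from being misread as the start of another escape.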
We also need to consider whether the Unicode hyphen U+2010 and non-breaking hyphen U+2011 should be replaced by normal hyphens (U+002D) for consistent appearance (the font may not have U+2010 or U+2011, or they may look different). Soft hyphens (U+00AD) all need to be removed in any case, so if one ends up being used as a split point, it will be replaced by a normal hyphen.
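The clean-up described here could look roughly like the following Python sketch (again only an illustration, not PDF::Builder code). The function names and the `map_variants` switch are hypothetical; whether the variants should be mapped at all is exactly the open question.

```python
SOFT_HYPHEN = "\u00ad"

# Hyphen variants that could be mapped to U+002D for a consistent look.
VARIANT_TO_ASCII = str.maketrans({"\u2010": "-", "\u2011": "-"})


def normalize_for_output(text, map_variants=True):
    """Drop soft hyphens; optionally map U+2010 and U+2011 to U+002D."""
    text = text.replace(SOFT_HYPHEN, "")
    if map_variants:
        text = text.translate(VARIANT_TO_ASCII)
    return text


def split_at_soft_hyphen(word, index):
    """Split at the soft hyphen found at `index`.

    The first fragment ends in a visible U+002D; any remaining soft
    hyphens in either fragment are removed.
    """
    if word[index] != SOFT_HYPHEN:
        raise ValueError("no soft hyphen at that position")
    head = normalize_for_output(word[:index]) + "-"
    tail = normalize_for_output(word[index + 1:])
    return head, tail
```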
There is a U+2012 "figure dash", which may look like an en-dash (U+2013), as well as an em-dash and a quotation dash, but I have no plans to deal with these (should we?). It is usually permissible to split a line after an em-dash (without adding a hyphen, of course), but not a figure- or en-dash. A quotation dash, which apparently looks much like an em-dash, is used unpaired before the attribution of a quote, so you probably would never break after it, although possibly before (if it's an inline attribution).
I will make the default not to hyphenate, and let the user explicitly choose to hyphenate. There are a few calls in base PDF::Builder (text fill, paragraph, section, etc.) which should probably implement some level of line (word) splitting, but full-fledged paragraph shaping will not be built into the base PDF::Builder. Paragraph shaping involves getting all the possible word splits in the paragraph, and deciding when and where to hyphenate to get the best appearance. This can mean minimizing the sum of numeric "penalties" for too many consecutive lines ending in hyphens, "rivers" of white space, splitting of proper names and titles (language- and culture-dependent), widows and orphans (which means you need to find out if the following paragraph will have at least 4 lines of output), hyphenation on the last word of a column (or worse, a page), too-short last lines in a paragraph, and probably other considerations. Such items may be language- and even publisher-dependent, and (at least) the settings would have to be made specific to language and publisher, but an actual paragraph shaping routine might itself be language-independent.
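To make the "sum of penalties" idea concrete, here is a deliberately tiny Python sketch that scores one candidate layout on just two of the criteria listed above (consecutive hyphenated lines and a too-short last line). The function name, the penalty weights, and the 25% threshold are all made-up assumptions, and line width is counted in characters rather than real font metrics.

```python
def layout_penalty(lines, width,
                   hyphen_run_penalty=50,
                   short_last_line_penalty=100,
                   min_last_fraction=0.25):
    """Score one candidate paragraph layout; lower is better.

    All weights and thresholds are arbitrary placeholders.
    """
    penalty = 0
    run = 0
    for line in lines:
        if line.endswith("-"):
            run += 1
            if run > 1:  # second and later lines in a hyphenated run
                penalty += hyphen_run_penalty
        else:
            run = 0
    # Penalize a last line that is very short compared to the column width.
    if len(lines) > 1 and len(lines[-1]) < min_last_fraction * width:
        penalty += short_last_line_penalty
    return penalty
```

A paragraph shaper would generate several candidate layouts and keep the one with the lowest total, which is why it needs every possible split point up front rather than just the last word of each line.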
Where words can be split is language-dependent, and may also depend on typesetting standards of a given publisher. English is straightforward in the sense that you simply find a split point (per rules and exceptions list), stick a hyphen on the end of the first fragment, and start the next line with the remainder (which in turn may need to be split again!). Some languages, such as German, may require doubling of one or more letters at the split, complicating calculations for line lengths. Anyway, if PDF::Builder itself is going to make use of language/publisher specific splitting libraries, there could be a PDF::Builder::WordSplit::Hyphenate_xx_xx module for each flavor, where xx_xx could be just a language code (like "en") or it could be language+country (e.g., en_GB). I will have to look and see if there is some sort of locale information in PDF::Builder, or if it needs to be added. There could even be publisher-specific extensions (e.g., de_DE_SV for Springer-Verlag German-language texts). Hyphenation would not be done if the requested language support module is not found (no fallback to, say, English), but we could consider allowing en as a fallback for any English en_XX request, or simply require an exact match to avoid unexpected results (could be a setting).
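The lookup rule (exact match by default, optional fallback to the bare language code, and no hyphenation at all if nothing matches) might be sketched like this in Python. The function names and the `allow_language_fallback` setting are hypothetical, and `installed` stands in for whatever mechanism would actually discover the available modules.

```python
def candidate_modules(locale_tag, allow_language_fallback=False):
    """List module names to try, most specific first."""
    names = ["Hyphenate_" + locale_tag]
    if allow_language_fallback and "_" in locale_tag:
        language = locale_tag.split("_")[0]
        names.append("Hyphenate_" + language)
    return names


def pick_module(locale_tag, installed, allow_language_fallback=False):
    """Return the first installed candidate, or None (do not hyphenate)."""
    for name in candidate_modules(locale_tag, allow_language_fallback):
        if name in installed:
            return name
    return None
```

Note that there is never a fallback across languages: a request for "fr" with only English installed returns None, so no hyphenation is attempted.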
Hyphenate_xx_xx() would be fed either a single word (just doing "greedy" line splitting) or an array of the entire paragraph's words, and return in some form both the word fragments and the source of each split: hyphen or non-breaking hyphen (both of which need to be restored), soft hyphen, or by language algorithm. The paragraph shape routine might use different priorities, such as preferring to split on a soft hyphen or a hard hyphen if available (of equal or different priorities), and then try other splits. There might even be some sort of priority value built into the returned data, indicating where the preferred splits are.
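One possible shape for the returned data is sketched below in Python, purely as an illustration. The class names, the field names, and the priority numbers (lower meaning "preferred") are assumptions; only the explicit hyphen characters are handled, with `ALGORITHM` left as a placeholder for splits a language module would contribute.

```python
from dataclasses import dataclass
from enum import Enum


class SplitSource(Enum):
    HYPHEN = "hyphen"
    NON_BREAKING_HYPHEN = "non-breaking hyphen"
    SOFT_HYPHEN = "soft hyphen"
    ALGORITHM = "language algorithm"  # placeholder, not produced here


@dataclass(frozen=True)
class SplitPoint:
    offset: int          # index where the second fragment starts
    source: SplitSource  # why a split is allowed here
    priority: int        # assumed convention: lower is preferred


# Arbitrary example priorities: soft hyphen first, non-breaking hyphen last.
EXPLICIT = {
    "\u00ad": (SplitSource.SOFT_HYPHEN, 0),
    "-": (SplitSource.HYPHEN, 1),
    "\u2010": (SplitSource.HYPHEN, 1),
    "\u2011": (SplitSource.NON_BREAKING_HYPHEN, 9),
}


def explicit_split_points(word):
    """Find split points marked by hyphen characters already in the word."""
    points = []
    for i, ch in enumerate(word[:-1]):  # a trailing hyphen is not a split
        if ch in EXPLICIT:
            source, priority = EXPLICIT[ch]
            points.append(SplitPoint(i + 1, source, priority))
    return points
```

Carrying the source along with each split lets the caller decide whether the character at the split must be kept (hard and non-breaking hyphens) or replaced by a visible hyphen (soft hyphen).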
Now, besides normal human prose, there can also be "computer" words, such as camelCase and underscore_separated_words, as well as long URLs with /'s and the like. In technical documents it may not always be possible to avoid typesetting such things (although the result may not be all that elegant). The current code splits camelCase between a lowercase and an uppercase ASCII letter (note that names such as MacDonald could end up being split Mac- Donald, which is undesirable), as well as after runs of letters (ASCII only) or numbers or after certain punctuation. You don't want to split just after opening brackets [ ( { etc., nor opening (left) quotation marks of various kinds, nor just before ] ) } or closing/right quotation marks. To extend these to non-ASCII letters would be difficult enough for Latin-based alphabets, never mind non-Latin! The current code has hard-coded switches, and could be extended to make these a hash in the argument list. We also need to consider whether adding hyphens to a (split) URL or other technical term risks introducing errors and confusion if the reader thinks the hyphen is actually part of the word! However, it is very easy for URLs etc. to exceed the line length (even in a footnote), and thus require splitting.
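A simplified Python sketch of these "computer word" rules follows. It is not the existing PDF::Builder code: the function names and the `SPLIT_AFTER` punctuation set are assumptions, and splitting after runs of letters or digits is left out to keep it short.

```python
import re

# Zero-width position between a lowercase and an uppercase ASCII letter.
CAMEL_BOUNDARY = re.compile(r"(?<=[a-z])(?=[A-Z])")

SPLIT_AFTER = set("/_.")           # hypothetical punctuation set
OPENERS = set("([{\u2018\u201c")   # brackets and left quotation marks
CLOSERS = set(")]}\u2019\u201d")   # brackets and right quotation marks


def camel_case_splits(word):
    """Offsets where a lowercase ASCII letter is followed by an uppercase one."""
    return [m.start() for m in CAMEL_BOUNDARY.finditer(word)]


def technical_splits(word):
    """Candidate split offsets for identifiers, paths, and URLs."""
    points = set(camel_case_splits(word))
    for i, ch in enumerate(word[:-1]):
        if ch in SPLIT_AFTER:
            points.add(i + 1)
    # Never split just after an opener or just before a closer. The opener
    # check is a guard in case SPLIT_AFTER is ever extended.
    return sorted(
        p for p in points
        if word[p - 1] not in OPENERS and word[p] not in CLOSERS
    )
```

`camel_case_splits("MacDonald")` returns `[3]`, which reproduces the undesirable Mac- Donald split mentioned above; an exceptions list or a user switch would be needed to suppress it.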
The current hyphenation looks only at the last word in a line (that is too long to fit, unsplit). This is known as "greedy" line splitting, and while it makes a paragraph most compact, it takes no action to prevent orphans and widows, nor other undesirable effects (e.g., hyphenated last word on a page). I'm really not sure whether there is a point to doing full splitting (according to language and publisher rules) for the little that paragraph() and section() will be used in full-bore quality typesetting. It would be nice to allow folding of long URLs and other computerese, but it might be better to do proper line splitting in another package.
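For reference, greedy line splitting can be sketched in a few lines of Python. This is an illustration under simplifying assumptions, not the current PDF::Builder implementation: width is measured in characters rather than font metrics, and `greedy_fill`, `split_word`, and `split_after_hyphen` are hypothetical names. The splitter is expected to return a non-empty head that fits in `room`, plus the remainder.

```python
def greedy_fill(words, width, split_word=None):
    """Fill lines left to right, splitting only the word that overflows."""
    lines, current = [], ""
    queue = list(words)
    while queue:
        word = queue.pop(0)
        sep = " " if current else ""
        if len(current) + len(sep) + len(word) <= width:
            current += sep + word
            continue
        room = width - len(current) - len(sep)
        pieces = split_word(word, room) if split_word else None
        if pieces:
            head, tail = pieces
            current += sep + head
            queue.insert(0, tail)   # the remainder may need splitting again
        elif not current:
            current = word          # unsplittable and too long: let it overflow
        else:
            queue.insert(0, word)   # retry this word on a fresh line
        lines.append(current)
        current = ""
    if current:
        lines.append(current)
    return lines


def split_after_hyphen(word, room):
    """Example splitter: break after the last hard hyphen that still fits."""
    if room < 2:
        return None
    cut = word.rfind("-", 0, room)
    if cut <= 0:
        return None
    return word[:cut + 1], word[cut + 1:]
```

With a width of 10, `greedy_fill("a third-year medical student".split(), 10, split_after_hyphen)` fills the first line as "a third-" instead of leaving "a" alone on it, which is the compactness benefit described above; nothing in the loop looks ahead, so widows, orphans, and hyphen runs go unchecked.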
So, my proposal is to rename Hyphenate_en.pm to Hyphenate_basic.pm (language-independent), make all hyphenation optional (off by default), use only the current forms of hyphenation, provide no support for the U+2010 and U+2011 hyphen variants, and leave all language-specific word splitting to another package. I may make some of the currently hard-coded switches accessible in the call as hash elements. The base PDF::Builder has only rudimentary formatting and paragraph formation capabilities (text fill, paragraph, section) and they probably won't get any more enhancement than this level of hyphenation. If someone wants to use them for production, they can supply their text with SHYs already inserted, but will have to put up with no control over widows and orphans and column-break hyphens. Real typesetting (using PDF::Builder as its base) will have to do a much better job of paragraph shaping, and I agree that it's better left to a separate package.
Thoughts and comments?