RT 98548 - hooks for line-splitting

*

Phil
« October 21, 2016, 03:35:41 PM »
Subject:    hooks for line-splitting
Date:    Tue, 02 Sep 2014 12:13:47 -0400
To:    bug-PDF-API2 [...] rt.cpan.org
From:    Phil M Perry

PDF::API2 v2.022   Perl 5.16.3  Windows 7   severity: Wishlist

Content.pm's text_fill_*() methods can currently split a line only at a space (x20) character. It would be good to be able to hyphenate words properly, to fill a line better. It's easy enough to split at camelCase, at internal non-letters (hard hyphens, digits, punctuation), and at soft hyphens (SHY, xAD in Latin-1). It's fairly involved to split complete words properly, and different languages have different rules. I think that the first three cases could be implemented in the text_fill_*() methods, but we might have to pass control to a user-supplied routine for splitting of complete words.
#
Mon May 04 00:09:59 2015 steve [...] deefs.net - Correspondence added

econtrario contributed a patch to implement part of this a few months ago:
https://bitbucket.org/ssimms/pdfapi2/pull-request/2/_text_fill_line-with-space-hyphen-and-soft/diff
[This repository has disappeared. Perhaps it's somewhere on GitHub now? -- Mod.]

It needs some tests to be added.
#
Mon May 04 00:10:00 2015 The RT System itself - Status changed from 'new' to 'open'
#
Subject:    Re: [rt.cpan.org #98548] hooks for line-splitting
Date:    Mon, 4 May 2015 16:10:44 -0400
To:    bug-PDF-API2 [...] rt.cpan.org
From:    Phil M Perry

Steve,

I'm a bit concerned that the new code is using hard-coded single byte encoding for SHY and (?) xC2. xAD is SHY in Latin-1, but xC2 appears to be Â, so I'm not sure what encoding this is. At any rate, before committing any new non-ASCII character handling code, I think we should decide how we want to handle various encodings. Splitting words will require knowing if we're in the middle of a single multibyte character.
#
Subject:    [rt.cpan.org #98548]
Date:    Sun, 24 Jan 2016 16:40:58 -0500
To:    bug-PDF-API2 [...] rt.cpan.org
From:    Phil M Perry

Ah, I see I misread the code. It's apparently not two Latin-1 characters xC2 and xAD (one of which is SHY), but the UTF-8 representation of a single SHY. A few comment lines in the code would have helped. Anyway, we still have the issue of what character encoding we're working in -- can we count on one particular encoding, or should we be able to handle a variety of encodings? Many, if not all, of the font sets supplied for PDF appear to be in something close to Windows-1252 (more or less Latin-1), so can we even work with UTF-8 text? Before we embark on changes hard-coded for one encoding or another, let's be clear about which character encodings are even possible to use.
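The byte pair in question can be checked with the core Encode module. This is only a small illustration of the encoding point, not code from the patch; the variable names are made up for the example.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# U+00AD SOFT HYPHEN as a one-character Perl string
my $shy = "\x{AD}";

# Encoded as UTF-8 it becomes the two bytes C2 AD;
# encoded as Latin-1 it is the single byte AD.
my $utf8_hex   = uc(unpack('H*', encode('UTF-8',      $shy)));
my $latin1_hex = uc(unpack('H*', encode('ISO-8859-1', $shy)));
printf "UTF-8: %s  Latin-1: %s\n", $utf8_hex, $latin1_hex;

# Reading the UTF-8 bytes as if they were Latin-1 yields two characters,
# the first of which is xC2 (the misreading described above).
my $misread = decode('ISO-8859-1', "\xC2\xAD");
printf "misread as Latin-1: %d characters, first is x%02X\n",
    length($misread), ord(substr($misread, 0, 1));
```

Once a string has been decoded into Perl characters, the soft hyphen is the single character chr(0xAD), whatever byte sequence the source used for it.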
#
Wed Feb 17 16:49:59 2016 steve [...] deefs.net - Correspondence added

Given encoding issues and the complications of implementing hyphenation rules for multiple languages, this is something that's better left to an add-on module.
#
Wed Feb 17 16:50:00 2016 steve [...] deefs.net - Status changed from 'open' to 'rejected'
#
Subject:    [rt.cpan.org #98548]
Date:    Thu, 18 Feb 2016 15:34:39 -0500
To:    bug-PDF-API2 [...] rt.cpan.org
From:    Phil M Perry

True, but should we think about building in some simple word-splitting scenarios? I would really like to split words after hyphens, but beyond that, it could get messy with non-ASCII characters. You don't want arbitrary (non-language-sensitive) word splitting between accented Latin characters and ASCII letters without being fully aware of the encoding used. You also don't want to end up accidentally splitting within a UTF-8 multibyte character. Em and en dashes, non-breaking spaces, soft hyphens, and spaces of various widths will depend on the encoding.

ASCII characters and text are easy enough, but what to do about anything not ASCII? Perhaps allow splitting only between ASCII characters (0xxxxxxx bytes) for now? That should be safe for multibyte UTF-8 characters, as every byte of a non-ASCII character starts with a 1 bit (1xxxxxxx). I think we could safely break between ASCII characters at hyphens and other non-letters in the range x21..x7E, and at letter boundaries (letter-to-non-letter or non-letter-to-letter transitions, as well as lower-to-upper and upper-to-lower camelCase). Would that be useful?

A dummy hook might be put in for future calling of user-supplied hyphenation routines for various encodings and languages, or we could just mark the spot in the code for now. For English, at least, a minimum of two characters must be left on each line, and we must be careful not to split something like O'Mallory into O'-Mallory, or to treat Ma as camelCase and split it as O'M-allory.
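A rough sketch of the ASCII-only idea, to make the proposal concrete. Everything here is an assumption for illustration: the function name ascii_split_points is hypothetical, the word is taken to be a byte string, only a subset of the rules is covered (after a hard hyphen, lower-to-upper camelCase, letter/digit transitions), and upper-to-lower splitting is deliberately left out so that the Ma in O'Mallory is not treated as camelCase.

```perl
use strict;
use warnings;

# Hypothetical helper: return the byte offsets at which $word may be broken
# (each offset is where the second fragment would start).
sub ascii_split_points {
    my ($word, $min) = @_;
    $min //= 2;                                   # minimum bytes kept on each side
    my $len = length $word;
    my @points;
    for my $pos (1 .. $len - 1) {
        my $prev = substr $word, $pos - 1, 1;
        my $next = substr $word, $pos, 1;
        next if $pos < $min || $len - $pos < $min;
        # Both neighbours must be ASCII (0xxxxxxx), so a multibyte
        # UTF-8 sequence (all bytes 1xxxxxxx) is never cut or touched.
        next if ord($prev) > 0x7F || ord($next) > 0x7F;
        next if $prev eq "'" || $next eq "'";     # leave O'Mallory alone
        next if $next eq '-';                     # never start a line with the hyphen
        push @points, $pos
            if $prev eq '-'                                    # after a hard hyphen
            || ($prev =~ /[a-z]/    && $next =~ /[A-Z]/)       # camelCase
            || ($prev =~ /[A-Za-z]/ && $next =~ /[0-9]/)       # end of a letter run
            || ($prev =~ /[0-9]/    && $next =~ /[A-Za-z]/);   # end of a digit run
    }
    return @points;
}

# The last word holds the UTF-8 bytes C3 AF in the middle of ASCII text.
for my $w ('camelCase', 'well-known', "O'Mallory", 'abc123def', "na\xC3\xAFveTest") {
    printf "%-12s => %s\n", $w, join(',', ascii_split_points($w)) || '(none)';
}
```

Note that the two-character minimum is counted in bytes here, which only matches characters for pure ASCII text; a version working on decoded strings would count characters instead.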

I'd sure like to get some other people participating in this discussion, to get some more viewpoints and algorithm experience. Perhaps we should just go ahead with starting the add-on module with the above simple algorithm, and flesh it out over time?
« Last Edit: April 01, 2017, 12:30:44 PM by Phil »

*

Phil
« Reply #1: April 03, 2017, 11:05:30 AM »
Status report: I have started the implementation of a hyphenation (word splitting) routine so that text-fill methods work better. Right now, it splits at soft and hard hyphens, in camelCase, after runs of digits, after runs of ASCII letters, and after specific ASCII punctuation. Currently, it does not recognize non-ASCII letters or punctuation. It is split out into an independent module, Hyphenate_en, for English hyphenation rules (other languages will get their own modules). Currently, I don't have any code to split normal English words (all letters), and am looking at sources for hyphenation algorithms.

Note that this looks only at one line at a time. It does not implement fancier algorithms, such as Knuth-Plass, that attempt to balance hyphenation over multiple lines to avoid multiple consecutive lines ending in a hyphen, and "rivers of whitespace" flowing down a paragraph.
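To show what "one line at a time" means in practice, here is a minimal greedy fill with a caller-supplied splitting hook. It is a sketch under stated assumptions, not the module's code: fill_line and the example hook are hypothetical names, and width is approximated by character count, where real code would measure the rendered width of the text.

```perl
use strict;
use warnings;

# Greedy fill: take words until the next one no longer fits, then ask an
# optional hook whether the overflowing word can be split.
sub fill_line {
    my ($words, $max, $hook) = @_;     # $words: array ref, consumed from the front
    my $line = '';
    while (@$words) {
        my $word  = $words->[0];
        my $sep   = length($line) ? ' ' : '';
        my $avail = $max - length($line) - length($sep);
        if (length($word) <= $avail) {
            $line .= $sep . shift @$words;
            next;
        }
        # The word does not fit: let the hook propose (head, tail), if it can.
        if ($hook) {
            my ($head, $tail) = $hook->($word, $avail);
            if (defined $head && length($head) <= $avail) {
                $line .= $sep . $head;
                $words->[0] = $tail;   # the remainder starts the next line
                last;
            }
        }
        # A single word wider than the line gets a line of its own,
        # so the caller's loop always makes progress.
        $line = shift @$words if !length($line);
        last;
    }
    return $line;
}

# Example hook: split after the last hard hyphen that still fits.
my $after_hyphen = sub {
    my ($word, $avail) = @_;
    return if $avail < 2;
    my $pos = rindex($word, '-', $avail - 1);
    return if $pos < 1;
    return (substr($word, 0, $pos + 1), substr($word, $pos + 1));
};

my @words = split ' ', 'a well-known line-splitting problem';
my @lines;
push @lines, fill_line(\@words, 18, $after_hyphen) while @words;
print "$_\n" for @lines;
```

Because each line is decided without looking ahead, nothing here prevents several consecutive lines from ending in a hyphen; that is the limitation a paragraph-level method such as Knuth-Plass addresses.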

This code is still preliminary, and could be significantly revised. Among the issues I'm looking at:

  • It is possible that additional parameters or options will be added to permit the fine-tuning of behavior without having to edit the code. For instance, turning on and off camelCase splitting.

  • I'm looking at ways to suppress splitting at a given point in an input string, via a flag (option) or added parameter. This would be for cases where the automatic hyphenation is splitting a word at an inappropriate point, and you want to suppress that split on a case-by-case basis, without disabling other instances of such a split via a flag/option or code change.

  • Should a lot of common (non-language-specific) material be pulled out into another (common) routine, or leave it in Hyphenate_$lang.pm? Should things like hard and soft hyphens, camelCase, punctuation, runs of letters and digits, etc. be under language-specific routines, or put into one common routine? I'm not familiar with word-splitting rules in other languages (except that German may require doubling of the last letter at a split in some circumstances).

  • Non-ASCII letters and punctuation are not yet supported. I am updating the documentation to remind users of PDF::API2 to decode strings containing non-ASCII characters ($string = Encode::decode(SOURCE_ENCODING, $source)). This will affect camelCase, runs-of-letters, and punctuation splitting. It is important to get en and em dashes supported as split points. (A split should come after the dash, never before it.)

  • Clarification on priority levels for splitting. Currently all split points are treated equally (they are all collected into one large @splitLoc array), but should it be configurable to look for the highest-priority split points first (e.g., hard and soft hyphens), split there if any are found, and move down to the next split priority only if none are found? Flags would be added to set this behavior.

  • Right now, once the word has been examined for potential split points, it searches from right to left for the first split that fits the specified width. I'm thinking about speeding up the trial-and-error splitting by estimating the start point based on $width/$em, and finding the closest @splitLoc entry to start at. In real life, most words are reasonably short, and it may not be worth the extra code to speed up the rare extra-long word.

  • Are soft hyphens always code 173 (xAD)? For many single-byte encodings, they may not be. If we require internal wide encoding (UTF-8) for any string containing non-ASCII characters, this would likely be a non-issue.

  • The text-fill utility first splits up lines on ASCII blanks (0x20). Are there any other forms of "spaces" that preliminary splitting should also be done on? Naturally, required blanks (non-breaking spaces) should not be treated this way (as a point to split text into words). Should runs of spaces be condensed into single spaces (like HTML does), or honored in the PDF output?

  • This "greedy" algorithm looks only at one line at a time, and can hyphenate the last word on a page, which is undesirable. It can also leave a single line ("widow") at the end of a paragraph, to go to the next page, and can leave an undesirably short final line on a paragraph. Finally, it will split on hard hyphens ("-" within a word), which may be undesirable (a flag could control this). It may be a good idea to revisit the whole concept of paragraph formation (per Knuth-Plass?) to look at the paragraph and its line breaking as a whole. This would be quite a change to the existing code, so perhaps it should be put off to a new module.

  • If Hyphenate_$lang.pm is not installed, should we fall back to English (en), the current behavior, or turn off hyphenation altogether? A lot of non-language-specific splitting could still be done.
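The priority-level and right-to-left items in the list above could be sketched roughly as follows. This is an assumption-laden illustration rather than the current code: choose_split and the prioritize option are hypothetical names, split positions are passed in as tiers (highest priority first), and width is approximated by character count instead of measured text width.

```perl
use strict;
use warnings;

# @$tiers lists split positions by priority, highest first (for example
# hard/soft hyphens, then camelCase). Within a group, positions are tried
# right to left, so the longest fragment that fits is chosen.
sub choose_split {
    my ($word, $avail, $tiers, %opt) = @_;
    my @groups = $opt{prioritize}
        ? @$tiers                         # fall through tier by tier
        : ([ map { @$_ } @$tiers ]);      # treat all split points equally
    for my $group (@groups) {
        for my $pos (sort { $b <=> $a } @$group) {
            my $head = substr $word, 0, $pos;
            $head .= '-' unless $head =~ /-\z/;   # add a hyphen if none is there
            return $pos if length($head) <= $avail;
        }
    }
    return;    # no split fits
}

my $word  = 'pre-processTextLines';
my $tiers = [ [4], [11, 15] ];            # after the hyphen; camelCase points

printf "equal:       %s\n", choose_split($word, 13, $tiers);
printf "prioritized: %s\n", choose_split($word, 13, $tiers, prioritize => 1);
```

With all points treated equally, the rightmost split that fits wins; with prioritize set, a hyphen split is preferred even when a lower-priority split would fill the line better. That trade-off is the reason the behavior would need to be controlled by a flag.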

I'm probably going to lay off this for a few weeks or more, as tax time is approaching and I also have some repairs and cleanup around the house I need to make after a snowy winter. I'll keep thinking about it, but I won't promise that this will be out in release 3.003. Also, even when it is released, I want developers and users of PDF::API2 to be aware that the final hyphenation capabilities may not be written in stone for some time to come! I'd like to have people playing with this, and giving feedback on what could be improved, before calling it a final release (because some changes may not be backwards compatible).
« Last Edit: April 03, 2017, 01:23:38 PM by Phil »

*

Phil
« Reply #2: April 04, 2017, 05:50:44 PM »
Re: #9 (the ninth item in the list above, on the greedy algorithm and hard hyphens). The code has been revised so that, by default, it does not split on a hard (explicit) hyphen (-). There is a switch to control this. It also will not split at the end of a digit run or letter run, or after punctuation, if the next character is a hard hyphen (-).

*

Phil
« Reply #3: December 25, 2017, 03:17:54 PM »
PhilterPaper commented on Nov 13

This one is going to take some additional thought about how much hyphenation should be handled by the basic PDF::Builder code, and how much should be left to higher level typesetting packages (i.e., paragraph shaping). I'll mark it as 'stalled' for now, but leave it open.