Subject: hooks for line-splitting
Date: Tue, 02 Sep 2014 12:13:47 -0400
To: bug-PDF-API2 [...] rt.cpan.org
From: Phil M Perry
PDF::API2 v2.022 Perl 5.16.3 Windows 7 severity: Wishlist
Content.pm's text_fill_*() methods can currently only split a line at a space (x20) character. It would be good to be able to properly hyphenate words, to better fill a line. It's easy enough to split at camelCase, internal non-letters (hard hyphens, digits, punctuation), and at soft hyphens (SHY, U+00AD). It's fairly involved to properly split complete words, and different languages have different rules. I think that the first three cases could be implemented in the text_fill_*() methods, but we might have to pass control to a user-supplied routine for splitting of complete words.
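A minimal sketch of the three "easy" cases named above, written in Python purely for illustration; it is not PDF::API2 code, and the function name split_points is hypothetical. It assumes the word is already a decoded Unicode string, and it only reports where a break would be allowed, not whether a hyphen glyph should be drawn there.

```python
SOFT_HYPHEN = "\u00ad"  # SHY, invisible unless the line is broken here


def split_points(word):
    """Return indexes i where word may be broken between word[i-1] and word[i].

    Hypothetical helper covering only the three easy cases:
    soft hyphens, camelCase, and internal non-letters.
    """
    points = []
    for i in range(1, len(word)):
        prev, cur = word[i - 1], word[i]
        if prev == SOFT_HYPHEN:
            points.append(i)  # break right after a soft hyphen
        elif prev.islower() and cur.isupper():
            points.append(i)  # camelCase: lower-case letter then upper-case
        elif not prev.isalpha() and cur.isalpha():
            points.append(i)  # after a hard hyphen, digit, or punctuation
    return points


if __name__ == "__main__":
    for w in ("camelCase", "well-known", "hy\u00adphen", "word"):
        print(repr(w), split_points(w))
```

Splitting complete words by language rules is deliberately left out; that is the part a user-supplied routine would have to handle.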
#
Mon May 04 00:09:59 2015 steve [...] deefs.net - Correspondence added
econtrario contributed a patch to implement part of this a few months ago:
https://bitbucket.org/ssimms/pdfapi2/pull-request/2/_text_fill_line-with-space-hyphen-and-soft/diff
[This repository has disappeared. Perhaps it's somewhere on GitHub now? -- Mod.]
It needs some tests to be added.
#
Mon May 04 00:10:00 2015 The RT System itself - Status changed from 'new' to 'open'
#
Subject: Re: [rt.cpan.org #98548] hooks for line-splitting
Date: Mon, 4 May 2015 16:10:44 -0400
To: bug-PDF-API2 [...] rt.cpan.org
From: Phil M Perry
Steve,
I'm a bit concerned that the new code is using hard-coded single byte encoding for SHY and (?) xC2. xAD is SHY in Latin-1, but xC2 appears to be Â, so I'm not sure what encoding this is. At any rate, before committing any new non-ASCII character handling code, I think we should decide how we want to handle various encodings. Splitting words will require knowing if we're in the middle of a single multibyte character.
#
Subject: [rt.cpan.org #98548]
Date: Sun, 24 Jan 2016 16:40:58 -0500
To: bug-PDF-API2 [...] rt.cpan.org
From: Phil M Perry
Ah, I see I misread the code. It's apparently not two Latin-1 characters xC2 and xAD (one of which is SHY), but a UTF-8 representation of a SHY. A few comment lines in the code would have helped. Anyway, we still have the issue of what character encoding we're working in -- can we count on one particular encoding, or should we be able to handle a variety of encodings? Many, if not all, of the font sets supplied for PDF appear to be in something close to Windows-1252 (more or less Latin-1), so can we even work with UTF-8 text? Before we embark on changes hard coded for one encoding or another, let's be clear what character encodings are even possible to use.
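For reference, the byte values discussed above can be checked directly. This is a small Python illustration (not taken from the patch) showing that SHY is the single byte xAD in Latin-1 and Windows-1252, the two bytes xC2 xAD in UTF-8, and that reading the UTF-8 form as Latin-1 yields "Â" followed by SHY, which is the misreading described.

```python
SHY = "\u00ad"  # soft hyphen

# The same character under the encodings mentioned in this ticket.
utf8 = SHY.encode("utf-8")      # two bytes: 0xC2 0xAD
latin1 = SHY.encode("latin-1")  # one byte: 0xAD
cp1252 = SHY.encode("cp1252")   # one byte: 0xAD

# Interpreting the UTF-8 bytes as Latin-1 produces two characters:
# U+00C2 (A with circumflex) followed by U+00AD (SHY).
misread = utf8.decode("latin-1")

if __name__ == "__main__":
    print(utf8.hex(), latin1.hex(), cp1252.hex())
    print([hex(ord(c)) for c in misread])
```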
#
Wed Feb 17 16:49:59 2016 steve [...] deefs.net - Correspondence added
Given encoding issues and the complications of implementing hyphenation rules for multiple languages, this is something that's better left to an add-on module.
#
Wed Feb 17 16:50:00 2016 steve [...] deefs.net - Status changed from 'open' to 'rejected'
#
Subject: [rt.cpan.org #98548]
Date: Thu, 18 Feb 2016 15:34:39 -0500
To: bug-PDF-API2 [...] rt.cpan.org
From: Phil M Perry
True, but should we think about building in some simple word splitting scenarios? I would really like to split words after hyphens, but beyond that, it could get messy with non-ASCII characters. You don't want arbitrary (non-language sensitive) word splitting between accented Latin characters and ASCII letters, without being fully aware of the encoding used. You also don't want to end up accidentally splitting within a UTF-8 multibyte character. Em and en dashes, non-breaking spaces, soft hyphens, and various thickness space characters will depend on the encoding.
ASCII characters and text are easy enough, but what to do about anything not ASCII? Perhaps allow splitting only between ASCII characters (0xxxxxxx byte) for now? It should be safe for multibyte UTF-8 characters, as all bytes for non-ASCII start with a 1 bit (1xxxxxxx). I think we could safely break between ASCII characters for hyphen and other non-letters in the range x21..x7E, and letters (letter to non-letter, or non-letter to letter transition, as well as lower-to-upper and upper-to-lower camelCase). Would that be useful?
A dummy hook might be put in for future calling of user-supplied hyphenation routines for various encodings and languages, or just mark the spot in the code for now. For English, at least, a minimum of two characters must be left on each line, and be careful about not splitting something like O'Mallory into O'-Mallory, or thinking Ma is camelCase and splitting it O'M-allory.
I'd sure like to get some other people participating in this discussion, to get some more viewpoints and algorithm experience. Perhaps we should just go ahead with starting the add-on module with the above simple algorithm, and flesh it out over time?
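As a starting point for discussion, here is a rough Python sketch of the ASCII-only rule proposed above. It is an illustration, not PDF::API2 code, and the name ascii_break_points and its min_chars parameter are hypothetical. It works on raw bytes and only allows a break when the bytes on both sides are printable ASCII (x21..x7E), so it can never land inside a UTF-8 multibyte character. Three simplifications are assumptions of this sketch rather than anything settled here: only lower-to-upper counts as camelCase (so "Ma" is not split), any break next to an apostrophe is refused (so O'Mallory is left alone), and the two-character minimum is counted in bytes.

```python
def ascii_break_points(data, min_chars=2):
    """Return byte offsets i where data may be broken between data[i-1] and data[i].

    Hypothetical helper: breaks are allowed only between two printable ASCII
    bytes, at a letter/non-letter transition or a lower-to-upper transition.
    """
    def printable_ascii(b):
        # UTF-8 lead and continuation bytes are >= 0x80 and never qualify.
        return 0x21 <= b <= 0x7E

    def upper(b):
        return 0x41 <= b <= 0x5A

    def lower(b):
        return 0x61 <= b <= 0x7A

    def letter(b):
        return upper(b) or lower(b)

    apostrophe = 0x27
    points = []
    # Leave at least min_chars bytes on each side of the break.
    for i in range(min_chars, len(data) - min_chars + 1):
        prev, cur = data[i - 1], data[i]
        if not (printable_ascii(prev) and printable_ascii(cur)):
            continue  # space, control, or part of a multibyte character
        if prev == apostrophe or cur == apostrophe:
            continue  # assumption: never break next to an apostrophe
        if letter(prev) != letter(cur):
            points.append(i)  # letter to non-letter, or non-letter to letter
        elif lower(prev) and upper(cur):
            points.append(i)  # camelCase, lower-to-upper only
    return points


if __name__ == "__main__":
    for text in ("well-known", "camelCase", "O'Mallory", "caf\u00e9-bar"):
        print(text, ascii_break_points(text.encode("utf-8")))
```

Note that, following the rule as stated, "well-known" yields a break both before and after the hyphen; choosing between them, and deciding where a user-supplied hyphenation hook would be called, is left open.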