Online documentation and examples output are now available for our products
A meow massages the heart.
— Stuart McMillan
Posted on 2022-Apr-20 at 23:57:00 by Phil
Last update on 2023-Apr-30 at 12:08:00 by Phil
One of the important aspects of text processing is the shaping of a paragraph of text (initially one long string) into multiple lines, in a visually-pleasing manner. This is also known as line splitting (not to be confused with word splitting, for hyphenation purposes, although paragraph shaping often makes use of word-splitting). The intent is to fill up each line to be “reasonably” full, and thus minimize the number of lines used on the page. There are a number of other criteria, to be discussed below, to be considered in deciding where to split the input into lines.
Shaping can be used with proportional (variable width) fonts (the usual case, although there is no reason it can't be used with fixed-pitch (constant width) fonts. So long as all measurements are in the same length units, whatever algorithm you use, should work OK.
The simplest approach is known as the “greedy” algorithm. This is a fairly fast method that simply takes all the words that will fit into a line. When you come to a word which will not fit, you call a hyphenation routine to split up the word, and find the longest chunk (from its beginning) that will fit, including the added hyphen, to be added to this line. The remaining fragment goes on the next line, and if fully-justifying the paragraph, the spaces are expanded to the desired overall length.
Note that you are optimizing one line at a time (fit it as fully as possible), not globally (the entire paragraph, or even the entire document). This means that a decision made in one place can potentially make the rest of the paragraph much worse than it should be (local versus global optimization). By the time you find that a particular line fits very poorly, you can’t go back and change the earlier lines.
Widely known, yet often misunderstood, is the algorithm devised by Donald Knuth and his graduate student Michael Plass, for the latter’s doctoral thesis. This is the “Knuth-Plass” algorithm, also well known as the “TeX” algorithm, as it came to fame in the (La)TeX text processing system.
This method, in a nutshell, is an optimization (minimization) of the overall cost of a weighted path through choices of where to split the input string into individual lines. There are various “costs” (demerits and penalties) associated with each choice, and the intent is to globally (at the paragraph level, not individual lines) minimize this cost. Demerits and penalties include splitting a word (hyphen at the end of a line), multiple hyphenated lines in a row, hyphenating the next-to-last line, lines too loose (widely spaced words) or too tight (too little space between words), too much change in tightness from line to line, and some others.
Naïvely implemented, such an optimization can have as bad a run time as \(O(2^n)\) (\(n\) is the number of split opportunities), or at least, quadratic (\(O(n^2)\)). Fortunately, much of the effort put into Knuth-Plass was in optimizing the process, to radically reduce the number of split opportunities to be examined. This results in performance that is nearly linear (\(O(n)\)) with the number of words!
The greedy algorithm certainly can produce usuable text, although there is wide agreement that TeX (Knuth-Plass) generally produces a better-looking paragraph (see the list of demerits and penalties above, that it tries to minimize). Both typically use word-splitting (hyphenation), but Knuth-Plass tries to keep split words to a minimum, and for all that, usually takes up no additional lines. The downside to using Knuth-Plass is that it assumes that all lines are the same length, so kludges are needed to handle text flow around floats and inserts. It also doesn’t natively handle changes in fonts or font-sizes, needing enhancements to handle these.
In the following sections, I will share thoughts on another approach to paragraph shaping, one a bit more sophisticated than the greedy algorithm, yet easier to comprehend and implement than Knuth-Plass.
Note: your browser may be alarmed by some of these and refuse to download a PDF file without an OK — I think this is just due to the protocol being http instead of https.
Posted on 2022-Apr-20 at 23:52:00 by Phil
Last update on 2022-Dec-10 at 16:20:00 by Phil
There is a long list of criteria for a “good” implementation of paragraph shaping:
Posted on 2022-Apr-21 at 15:23:00 by Phil
So, is there a new algorithm that we can use, instead of either the greedy or the Knuth-Plass ones? Maybe there is an easier-to-implement method (instead of Knuth-Plass), that will properly handle a number of issues that arise with the original algorith, such as non-constant line lengths, page breaks at mid-paragraph, full Unicode compatibility, etc.
I propose to start with something close to the greedy algorithm, but blending in useful aspects of Knuth-Plass in a way that might not be elegant Computer Science, but would produce acceptable results.
You can’t do much if you can’t measure something, and assign some sort of cost or penalty, and then minimize the sum of all costs and penalties (suitably weighted). So, what sort of things can be measured in a paragraph?
Anything missing from this list?
All the above, except for whitespace rivers, paragraph splitting at end of column, and possibly short last line penalty, are found in Knuth-Plass. There is also discussion of optimizing an entire page, but I don’t know what kind of work has been done on this, and whether there are additional things to measure. For the time being, we’ll leave it at paragraph optimization.
Posted on 2022-Apr-21 at 19:49:00 by Phil
Last update on 2022-Dec-14 at 12:10:00 by Phil
We can measure all sorts of things, and come up with a total cost of splitting a given paragraph in a certain way. The weightings of various demerits or costs, and penalties, is yet to be determined, and might even be fine-tuned for an eye-pleasing layout on a case-by-base basis (even varying from paragraph to paragraph!). Where have we taken a different road from Knuth-Plass? So far, not much different, although we are going to have to have some differences to properly handle varying line lengths (due to non-rectangular column enclosures), splitting in the middle of a paragraph (suppress hyphenation on that line), proper Unicode behavior, and suppressing short last lines.
One thing we don’t want to do is repeatedly lay out the entire paragraph while tweaking the split points. That’s just too expensive. Knuth-Plass cheats a bit by assuming that all lines are the same length, which leads to some substantial simplification. Without that, it may be harder. We need to backtrack or look back no more than two or three previous lines, at most, to keep run-time to reasonable levels. Some things in the list just really can’t be done if we can’t update some number of previously built lines. Keep in mind that when a line split point is adjusted, all the running tallies of costs will have to be redone if we’re not careful! Perhaps each line can have some values in its line node so that we don’t have to go back more than the first line that has its split point adjusted. All previous lines will be unaffected. When and if the column is filled, if there is more content in the paragraph, something will have to be done to check that a widow isn’t going to be created in the next column (which may be of different width, and so we need to know the line length). If there will be a widow, we need to either redo this entire page, either with decreased leading to pull the widow into this column, or with increased leading to push an extra line into the next column (unless it creates an orphan in this column!).
Let’s start with what might be called a “semi-greedy” algorithm. Fill the line with complete words (no hyphenation yet) until the next won’t fit. See if the line will be excessively loose (including compared to the previous line’s tightness) if you stop there. See if adding the next word, which will require compressing spaces (\(r < 0\)), will make the line too tight (including compared to the previous line). If neither of those work, is the previous line hyphenated? If not, and this line is not the penultimate line or the last line in the column, go ahead and hyphenate the word, fitting as much as possible into this line, but obeying the tightness limits mentioned before. That will produce reasonable looking output, but may not be globally optimal.
If the line is too loose because it cannot be hyphenated, we will have to go back up to three lines and start adjusting line splits there. Is there a way to decide which of the three previous lines to choose? We want to loosen one or more previous lines to flow some text onto this line until it is tight enough, but we have to be careful not to loosen up those lines so much that they fail to obey their own tightness constraints. Start with a line that is hyphenated, and move the hyphenation point earlier (leftwards) in the word, reflowing up to four lines (and checking tightness and hyphenation constraints). If successful, we’re done and can go on to the next line. If we are on the last line and it is too short, the limit of three previous lines to adjust doesn’t give us much runway, but we can still try loosening up all three lines, which (as long as they continue to obey their tightness constraints) should at least improve the last line.
A note on tightening or loosening a line: if the last word has been split (hyphenated), if we want to tighten it, we can start the splitting from the right (trailing) end of the word, rather than the left (beginning) end. Especially if you quit just after getting "in range" of the desired \(r\) range, you might go one split further in the appropriate direction, even if you do end up going slightly out of range. Conversely, if you want to loosen a line, you could start splitting from the left (particularly if the original split was done from the right, in an effort to tighten the line), particularly if you quit as soon as you’re within the \(r\) range. In both cases, if there is a choice of split points that falls within a reasonable (or at worst, just outside of) \(r\) value range, you can pick which one in an effort to tighten or loosen the line. Keep in mind that a change in tightness for one line may affect many or all of the lines later in the paragraph, possibly leading to undesirable shaping! To speed up the process, store the full unhyphenated word in the data structure, while holding all the possible split points there, too. That way, an adjacent word split (hyphenation) point could be examined at relatively low cost. Don’t forget word splitting rules in other languages, such as Dutch and German.
You will notice that this algorithm is rather ad hoc, and does not compute a total cost (to minimize), as Knuth-Plass does. It would certainly be possible to compute the cost figure after all is said and done, but then the only course of action would be to start from the beginning and redo the whole paragraph with a different split start point and compare the results. At least, we could compute a running total cost, and abandon this attempt once the cost exceeds the best figure found so far (there is no point in continuing, since the cost is guaranteed to be higher). Even if we start one line further along each time, it’s going to be something like \(O(n^2)\), which is undesirable performance. As with Knuth-Plass, we want to make only one pass through the paragraph, calculating costs as we go, and pruning out bad choices, so that we minimize the work done to find a near-optimal path (choices of line-splitting points).
At this point, I think I will try out this semi-greedy algorithm and see how it behaves, before thinking any more about improvements. This may take a while, but I will try to remember to report back with results.
Posted on 2022-May-02 at 12:48:00 by Phil
Last update on 2022-Dec-14 at 12:17:00 by Phil
Another approach that occurs to me would be to make several passes over a paragraph (or page), refining the line splitting with each pass. The first pass would be simply a greedy split that does not split any words (except emergency cases where an intact word won’t even fit in the column all by itself!). Calculate the cost (various penalties and demerits). Find the line with the highest penalty and “fix” it by adjusting the break point (which might actually be on an earlier or later line!). Recalculate the total cost and repeat some number of times.
This is something of a local optimization, and unfortunately involves calculating nearly the entire paragraph (or page) cost function from scratch at each pass (as well as possibly having to update later break points). That alone might scuttle the proposition, as all later lines may need to be updated. Perhaps it can be the basis for some algorithm, if that issue can be suitably addressed. One way might be to limit the number of attempts to “fix” the paragraph to, say, 10% of the number of lines. Keep track of the total penalty score for the original attempt, and for each fix attempt (as well as what was changed up to that point), and pick the lowest score and be done with it.
All content © copyright 2005 – 2023
by Catskill Technology Services, LLC.
All rights reserved.
Note that Third Party software (whether Open Source or proprietary) on this
site remains under the copyright and license of its owners.
Catskill Technology Services, LLC does not claim copyright over such software.
This page is https://www.catskilltech.com/paragraph-shaping.html
Search Quotations database.
Last updated Fri, 04 Aug 2023 at 11:32 AM