CTS logo
hazy blue Catskill Mountains in distance

News:

Give our new Discussions area a try!

PDF::Builder v3.024 Released, 12 September 2022
   Please see the CPAN listing, GitHub entry.

PDF::Table v1.003 Released, 05 July 2022
   Please see the CPAN listing, as well as the GitHub entry.


A Thought…

Those who believe in absurdities are capable of atrocities.

   — Voltaire

Paragraph shaping

Posted on 2022-Apr-20 at 23:57:00 (last update on 2022-Sep-18 at 21:34:00) by Phil

One of the important aspects of text processing is the shaping of a paragraph of text (initially one long string) into multiple lines, in a visually-pleasing manner. This is also known as line splitting (not to be confused with word splitting, for hyphenation purposes, although paragraph shaping often makes use of word-splitting). The intent is to fill up each line to be “reasonably” full, and thus minimize the number of lines used on the page. There are a number of other criteria, to be discussed below, to be considered in deciding where to split the input into lines.

Shaping can be used with proportional (variable width) fonts (the usual case, although there is no reason it can't be used with fixed-pitch (constant width) fonts. So long as all measurements are in the same length units, whatever algorithm you use, should work OK.

Established methods

Greedy algorithm

The simplest approach is known as the “greedy” algorithm. This is a fairly fast method that simply takes all the words that will fit into a line. When you come to a word which will not fit, you call a hyphenation routine to split up the word, and find the longest chunk (from its beginning) that will fit, including the added hyphen, to be added to this line. The remaining fragment goes on the next line, and if fully-justifying the paragraph, the spaces are expanded to the desired overall length.

Note that you are optimizing one line at a time (fit it as fully as possible), not globally (the entire paragraph, or even the entire document). This means that a decision made in one place can potentially make the rest of the paragraph much worse than it should be (local versus global optimization). By the time you find that a particular line fits very poorly, you can’t go back and change the earlier lines.

Knuth-Plass algorithm

Widely known, yet often misunderstood, is the algorithm devised by Donald Knuth and his graduate student Michael Plass, for the latter’s doctoral thesis. This is the “Knuth-Plass” algorithm, also well known as the “TeX” algorithm, as it came to fame in the (La)TeX text processing system.

This method, in a nutshell, is an optimization (minimization) of the overall cost of a weighted path through choices of where to split the input string into individual lines. There are various “costs” (demerits and penalties) associated with each choice, and the intent is to globally (at the paragraph level, not individual lines) minimize this cost. Demerits and penalties include splitting a word (hyphen at the end of a line), multiple hyphenated lines in a row, hyphenating the next-to-last line, lines too loose (widely spaced words) or too tight (too little space between words), too much change in tightness from line to line, and some others.

Naïvely implemented, such an optimization can have as bad a run time as \(O(2^n)\) (\(n\) is the number of split opportunities), or at least, quadratic (\(O(n^2)\)). Fortunately, much of the effort put into Knuth-Plass was in optimizing the process, to radically reduce the number of split opportunities to be examined. This results in performance that is nearly linear (\(O(n)\)) with the number of words!

Which to use?

The greedy algorithm certainly can produce usuable text, although there is wide agreement that TeX (Knuth-Plass) generally produces a better-looking paragraph (see the list of demerits and penalties above, that it tries to minimize). Both typically use word-splitting (hyphenation), but Knuth-Plass tries to keep split words to a minimum, and for all that, usually takes up no additional lines. The downside to using Knuth-Plass is that it assumes that all lines are the same length, so kludges are needed to handle text flow around floats and inserts. It also doesn’t natively handle changes in fonts or font-sizes, needing enhancements to handle these.

In the following sections, I will share thoughts on another approach to paragraph shaping, one a bit more sophisticated that the greedy algorithm, yet easier to comprehend and implement than Knuth-Plass.

Some resources

Note: your browser may be alarmed by some of these and refuse to download a PDF file without an OK — I think this is just due to the protocol being http instead of https.


Posted on 2022-Apr-20 at 23:52:00 (last update on 2022-May 02 at 12:27:00) by Phil

There is a long list of criteria for a “good” implementation of paragraph shaping:

  • Minimize the number of lines output in the finished result, while meeting the other goals. A method is not really any improvement if it takes up much more page space (more lines) than other methods. Knuth-Plass can use up to 10% more lines than Greedy, but the improved look of the results (particularly with fewer hyphenated words) usually outweighs this cost.
  • Minimize the total number of hyphenated lines, whether due to split words or a split in a compound word (explicit manifest or hard hyphen in the input string). Text just looks bad, and is tiring to read, when too many words get split, and the right edge of the paragraph is full of hyphens.
  • Avoid consecutive hyphenated lines, again whether due to split words or compound words. Again, bad-looking results.
  • Try for consistent line lengths. A paragraph looks better when all lines are close to the same length. Full justification (even edges on both left and right) can hide some unevenness through space (glue) stretching, but even if using Ragged Right (flush left) it should be reasonably even. For Ragged Right, try to avoid a complete word “overhanging” empty space on the right edge.
  • Avoid hyphenating the penultimate line (the next-to-last line). It doesn’t make sense to split a word when presumably there is space for it on the short last (next) line. This is probably more of a typesetting custom than having any real effect on readability.
  • Maintain consistent tightness of lines. If a line’s inter-word spaces (“glue” in Knuth-Plass) can be reduced (tighter) or increased (looser), you don’t want to have radical changes in tightness from one line to the next. Knuth-Plass measures this as \(r\), where a value of less than -0.5 is “tight”, -0.5 to +0.5 is “normal”, +0.5 to something like +1.0 is “loose”, and anything stretched greater than +1.0 is “very loose” (slightly penalized as undesirable). Try to avoid changing more than one level going from one line to the next, e.g., tight to normal is OK, or very loose to loose, but tight to loose is excessive and unpleasant to read.
  • Minimize rivers of whitespace flowing through the text, where spaces (glue), especially wide (loose or very loose) ones, more or less align vertically, appearing to split the paragraph into left and right banks. This attracts the eye and can make a reader lose their place. I’m not sure that Knuth-Plass addresses this to any extent. The original paper does say that this is seldom a problem unless very loose lines are permitted.
  • Avoid a very short last line by loosening previous lines so that the last line is at least 25 to 33% of its allowed line length. It just looks bad to see a last line in a paragraph that is only one word or so. Knuth-Plass, in some implementations, may address this (Holkner attempts to). Rubinstein referred to a single word on the last line as a cub, and something to be avoided.
  • Allow a paragraph at the bottom of a page (or column) to be split in the middle. An input would be “I expect that there are \(N\) lines left in this column, so avoid hyphenating line \(N\), and I can put the rest of the paragraph in the next column.” This prohibition on hyphenation should be much stronger than the other prohibitions, as readers don’t like having to remember part of a word when flipping a page. I don’t think that Knuth-Plass addresses this. If the algorithm can handle getting to the bottom of a column and going on to a new column (without manual intervention, such as giving an expected split point), it still has to discourage hyphenation on this last line. Knuth-Plass assumed a page of indefinite length (as well as fixed length lines), and left any paragraph splitting to the calling application.
  • Avoid widows and orphans when splitting a paragraph across a page or column break. Unless the paragraph is only one line long, make sure you have at least two lines at the bottom of a page (or it’s an orphan) and two lines at the top of a page (or it’s a widow). Knuth-Plass does not address this, although Holkner does attempt to. Often you can’t do much about reducing (or increasing) the number of lines in a paragraph through playing with break points and line tightness, so a better solution may be to reduce leading to pull a widow onto this page, or increase it to push another line onto the next page and keep the last line company. Likewise, an orphan can be pushed to the next page though increased leading, or a line can be pulled from the next page though decreased leading, to keep the first line company (assuming this doesn’t create a widow on the next page). Don’t forget that section headings (or graphic dividers) should be kept with the first lines of the paragraph and not orphaned themselves.
  • Avoid a stack of two consecutive lines that begin with the same word (or fragment) or end with the same word or fragment, per Rubinstein (via Holkner). This is undesirable, as the reader’s eye may be misled and pulled onto the wrong line. Knuth-Plass does not address this.
  • Allow varying line lengths, rather than one fixed line length. Note that this refers to the maximum line length for each line being different, and not the fraction of the line filled being consistent, as above. Knuth-Plass is built around a fixed, constant, line length. Many implementations have added kludges to try to support multiple line lengths in some manner (usually an array of line lengths) so that non-rectangular paragraphs can be supported (most often for the purpose of flowing text around a float or other insert).
  • Compatibility with Unicode word and line splitting rules, and proper handling of all Unicode characters. As Knuth-Plass was written from the viewpoint of simple single-byte encodings and (mostly) American English grammar rules, there may be incompatibilities.

Inputs

  • The original Knuth-Plass algorithm assumes a single string in one font and font size. An improvement is to allow an array of structures or hash arrays, each containing some text to put into the paragraph. Variations in the font, including italic and bold text, would require multiple array elements (one for each). A change in (human) language would get its own element, so that the non-default language can be used for hyphenation and possibly for bidirectional language support. A change in font used, or its size, would require their own elements. The output structures of something like Harfbuzz::Shaper (in Perl) could be a good start for the element data. The leading needed by each element would be needed, so that taller text could be accommodated. A very simple text, with no font changes or variations, could be in a single element. If we are continuing output on a new page or column, we need to know which element we are within, and where within that element.
  • An outline of the column to be filled. This could be a polyline or it could include curved (arc, elliptical, spline) sections. it probably cannot intersect itself or otherwise be anything but a simple enclosure with a single interior. It does not have to be convex, as the intent is to allow inserts (shorter lines). The intersection of each line from the top, with the outline, determines the baseline for this line. There should be only a single baseline resulting, which limits concavities (it probably is easiest just to discard all but the longest baseline, if multiple ones result). Many implementations allow an array of line lengths, but the problem with this is that if leading has to be changed, to deal with widows and orphans, the column will now be the wrong shape! For a very simple rectangular column (of indefinite length), a single line length might be given instead.
  • The vertical offset in the column to start at. We might have already shaped one paragraph in this column, and thus need to start part way down the outline of the column.
  • Optional settings and overrides, such as the amount of indentation on the first line of the paragraph (e.g., default 2em, but 0 following a heading), or the leading to use. This implies that there are a number of global (for all paragraphs) settings that could be set. If we are starting a new column, part way through a paragraph, we need to know where to pick up with the processing.

Outputs

  • Modified (or new) array of text segments, ready to render. Each could be an entire line in the paragraph, or just a word or two. The x and y position of each element would be included, so it can be rendered in the right place. Much information will be copied over (font, font size, language, color, etc.) from the input array.
  • Line values: whether a hyphen is needed at the end of the line, space compress/expand (tighter/looser), which might be fed to a word spacing routine rather than individually outputting each word (depends on what the rendering system supports).
  • Any remaining content (position in the input) if we reached the end of the column, and need to start another. Otherwise, the place to start the next paragraph (plus any inter-paragraph vertical space). We should always have a complete last line — no need to pick up in the middle of a line.

Posted on 2022-Apr-21 at 15:23:00 by Phil

So, is there a new algorithm that we can use, instead of either the greedy or the Knuth-Plass ones? Maybe there is an easier-to-implement method (instead of Knuth-Plass), that will properly handle a number of issues that arise with the original algorith, such as non-constant line lengths, page breaks at mid-paragraph, full Unicode compatibility, etc.

I propose to start with something close to the greedy algorithm, but blending in useful aspects of Knuth-Plass in a way that might not be elegant Computer Science, but would produce acceptable results.

What can we measure?

You can’t do much if you can’t measure something, and assign some sort of cost or penalty, and then minimize the sum of all costs and penalties (suitably weighted). So, what sort of things can be measured in a paragraph?

  • Count the number of lines in the paragraph. This is a weak measure — we want to tend towards a lower line count, but it’s not the be-all and end-all.
  • Count the number of hyphens at the end of the lines, both explicit (manifest) and added by word-splitting. We want to reduce this count.
  • Penalize fairly strongly two or more consecutive lines ending with a hyphen, a hyphenated penultimate line, and even more strongly, the last line in a column hyphenated (when the paragraph is split between two columns).
  • Measure the tightness of lines, penalizing both extreme tightness (\(r\) approaching -1.0) and extreme looseness (\(r\) exceeding +1.0), as well as more than one “step” of tightness between two lines (e.g., tight to loose). There is no reason to expand (loosen) the last line, but it might be desirable to tighten it slightly to fit, so \(r\) on the last line should be 0 or less.
  • Measure the variance in line length (before any spaces compressed or expanded). This might be \((\mathrm {line}_{N + 1} - \mathrm {line}_N)^2\) or something similar. Squaring the difference both takes care of plus and minus signs, and emphasizes (for values > 1) greater differences. We wish to minimize variance, even more so if doing Ragged Right justification.
  • Measure the length of the last line, as a percentage of the allowed line length. Penalize a line shorter than some arbitrary amount, such as 33%, unless this is the only line.
  • Knuth-Plass looks for a word completely “floating” above a shorter following line (large space at the end), but I’m not sure how much value this adds, or how easy it is to detect. Perhaps the x position of the beginning of the last word on line \(N\) should not exceed the \(x\) position of the end of the last word on line \(N+1\)? In any case, it would be of more visual importance for a Ragged Right justification.
  • There is discussion in Knuth-Plass regarding “rivers” of whitespace, but it doesn’t sound like they do much about it. I suppose that it could be measured by looking at line 1 whitespace start and finish points, line 2’s points, and seeing if any overlaps exceed a certain percentage of the narrower space (they may be different sizes, depending on \(r\) values).For those that do, compare line 3 to line 2 in a similar manner, seeing if anything continues through all three. Keep repeating, not forgetting to check for new rivers arising between line 2 and 3, etc. Decide on the maximum number of lines that a “river” has to run before it’s visually disturbing (say, 4). Extra weighting might be given to nearly straight rivers flowing down the paragraph, defined as very large overlap percentages.
  • Some typesetting styles like to have their punctuation, including any hyphens, hanging out over the right edge (past the maximum line length). This could be accommodated with an appropriate switch. Any trailing non-alphanumeric(s) on a potential last word for a line would be trimmed off before fitting the word (and then replaced, with the word ending at the max).
  • Hyphenation should be handled with a language-appropriate word-splitting routine (default American English), with any alternate language information carried along with the input text for this purpose. Note should be made that some languages have rather odd hyphenation rules — German and Dutch come to mind, as in some cases repeating, changing, or adding letters at the split! This complicates handling word splitting at the end of a line. Soft (hidden) hyphens should be supported, with a switch to indicate that SHY’s indicate either the only place to split this word, or the preferred place, or just an additional place. In any case, don’t forget to remove all SHY’s before final rendering, as some systems (such as PDF readers) will display them as hard hyphens!

Anything missing from this list?

All the above, except for whitespace rivers, paragraph splitting at end of column, and possibly short last line penalty, are found in Knuth-Plass. There is also discussion of optimizing an entire page, but I don’t know what kind of work has been done on this, and whether there are additional things to measure. For the time being, we’ll leave it at paragraph optimization.


Posted on 2022-Apr-21 at 19:49:00 by Phil

We can measure all sorts of things, and come up with a total cost of splitting a given paragraph in a certain way. The weightings of various demerits or costs, and penalties, is yet to be determined, and might even be fine-tuned for an eye-pleasing layout on a case-by-base basis (even varying from paragraph to paragraph!). Where have we taken a different road from Knuth-Plass? So far, not much different, although we are going to have to have some differences to properly handle varying line lengths (due to non-rectangular column enclosures), splitting in the middle of a paragraph (suppress hyphenation on that line), proper Unicode behavior, and suppressing short last lines.

One thing we don’t want to do is repeatedly lay out the entire paragraph while tweaking the split points. That’s just too expensive. Knuth-Plass cheats a bit by assuming that all lines are the same length, which leads to some substantial simplification. Without that, it may be harder. We need to backtrack or look back no more than two or three previous lines, at most, to keep run-time to reasonable levels. Some things in the list just really can’t be done if we can’t update some number of previously built lines. Keep in mind that when a line split point is adjusted, all the running tallies of costs will have to be redone if we’re not careful! Perhaps each line can have some values in its line node so that we don’t have to go back more than the first line that has its split point adjusted. All previous lines will be unaffected. When and if the column is filled, if there is more content in the paragraph, something will have to be done to check that a widow isn’t going to be created in the next column (which may be of different width, and so we need to know the line length). If there will be a widow, we need to either redo this entire page, either with decreased leading to pull the widow into this column, or with increased leading to push an extra line into the next column (unless it creates an orphan in this column!).

Let’s start with what might be called a “semi-greedy” algorithm. Fill the line with complete words (no hyphenation yet) until the next won’t fit. See if the line will be excessively loose (including compared to the previous line’s tightness) if you stop there. See if adding the next word, which will require compressing spaces (\(r < 0\)), will make the line too tight (including compared to the previous line). If neither of those work, is the previous line hyphenated? If not, and this line is not the penultimate line or the last line in the column, go ahead and hyphenate the word, fitting as much as possible into this line, but obeying the tightness limits mentioned before. That will produce reasonable looking output, but may not be globally optimal.

If the line is too loose because it cannot be hyphenated, we will have to go back up to three lines and start adjusting line splits there. Is there a way to decide which of the three previous lines to choose? We want to loosen one or more previous lines to flow some text onto this line until it is tight enough, but we have to be careful not to loosen up those lines so much that they fail to obey their own tightness constraints. Start with a line that is hyphenated, and move the hyphenation point earlier (leftwards) in the word, reflowing up to four lines (and checking tightness and hyphenation constraints). If successful, we’re done and can go on to the next line. If we are on the last line and it is too short, the limit of three previous lines to adjust doesn’t give us much runway, but we can still try loosening up all three lines, which (as long as they continue to obey their tightness constraints) should at least improve the last line.

You will notice that this algorithm is rather ad hoc, and does not compute a total cost (to minimize), as Knuth-Plass does. It would certainly be possible to compute the cost figure after all is said and done, but then the only course of action would be to start from the beginning and redo the whole paragraph with a different split start point and compare the results. At least, we could compute a running total cost, and abandon this attempt once the cost exceeds the best figure found so far (there is no point in continuing, since the cost is guaranteed to be higher). Even if we start one line further along each time, it’s going to be something like \(O(n^2)\), which is undesirable performance. As with Knuth-Plass, we want to make only one pass through the paragraph, calculating costs as we go, and pruning out bad choices, so that we minimize the work done to find a near-optimal path (choices of line-splitting points).

At this point, I think I will try out this semi-greedy algorithm and see how it behaves, before thinking any more about improvements. This may take a while, but I will try to remember to report back with results.


Posted on 2022-May-02 at 12:48:00 by Phil

Another approach that occurs to me would be to make several passes over a paragraph (or page), refining the line splitting with each pass. The first pass would be simply a greedy split that does not split any words (except emergency cases where an intact word won’t even fit in the column all by itself!). Calculate the cost (various penalties and demerits). Find the line with the highest penalty and “fix” it by adjusting the break point (which might actually be on an earlier or later line!). Recalculate the total cost and repeat some number of times.

This is something of a local optimization, and unfortunately involves calculating nearly the entire paragraph (or page) cost function from scratch at each pass (as well as possibly having to update later break points). That alone might scuttle the proposition, as all later lines may need to be updated. Perhaps it can be the basis for some algorithm, if that issue can be suitably addressed.

 

All content © copyright 2005 – 2022 by Catskill Technology Services, LLC.
All rights reserved.
Note that Third Party software (whether Open Source or proprietary) on this site remains under the copyright and license of its owners. Catskill Technology Services, LLC does not claim copyright over such software.

 

This page is https://www.catskilltech.com/paragraph-shaping.html

Search Quotations database.

Last updated Sun, 18 Sep 2022 at 9:34 PM

Valid HTML 5

Sat, 24 Sep 2022 at 6:01 PM EDT