Paragraph shaping
Posted on 2022-Apr-20 at 23:57:00 by Phil
Last update on 2023-Apr-30 at 12:08:00 by Phil
One of the important aspects of text processing is the shaping of a
paragraph of text (initially one long string) into multiple lines, in a
visually-pleasing manner. This is also known as line splitting
(not to be confused with word splitting, for hyphenation purposes,
although paragraph shaping often makes use of word-splitting). The
intent is to fill up each line to be “reasonably” full, and thus
minimize the number of lines used on the page. There are a number of other
criteria, to be discussed below, to be considered in deciding where to split
the input into lines.
Shaping can be used with proportional (variable width) fonts (the usual
case, although there is no reason it can't be used with fixed-pitch (constant
width) fonts. So long as all measurements are in the same length units, whatever
algorithm you use, should work OK.
Established methods
Greedy algorithm
The simplest approach is known as the “greedy” algorithm. This
is a fairly fast method that simply takes all the words that will fit into a
line. When you come to a word which will not fit, you call a hyphenation
routine to split up the word, and find the longest chunk (from its beginning)
that will fit, including the added hyphen, to be added to this line.
The remaining fragment goes on the next line, and if fully-justifying the
paragraph, the spaces are expanded to the desired overall length.
Note that you are optimizing one line at a time (fit it as fully as
possible), not globally (the entire paragraph, or even the entire document).
This means that a decision made in one place can potentially make the rest of
the paragraph much worse than it should be (local versus global optimization).
By the time you find that a particular line fits very poorly, you can’t
go back and change the earlier lines.
Knuth-Plass algorithm
Widely known, yet often misunderstood, is the algorithm devised by Donald
Knuth and his graduate student Michael Plass, for the latter’s doctoral
thesis. This is the “Knuth-Plass” algorithm, also well known as
the “TeX” algorithm, as it came to fame in the (La)TeX text
processing system.
This method, in a nutshell, is an optimization (minimization) of the overall
cost of a weighted path through choices of where to split the input string into
individual lines. There are various “costs” (demerits and
penalties) associated with each choice, and the intent is to globally
(at the paragraph level, not individual lines) minimize this cost.
Demerits and penalties include splitting a word (hyphen at the end of a
line), multiple hyphenated lines in a row, hyphenating the next-to-last line,
lines too loose (widely spaced words) or too tight (too little space between
words), too much change in tightness from line to line, and some others.
Naïvely implemented, such an optimization can have as bad a run time as
\(O(2^n)\) (\(n\) is the number of split opportunities), or
at least, quadratic (\(O(n^2)\)). Fortunately, much of the effort
put into Knuth-Plass was in optimizing the process, to radically reduce the
number of split opportunities to be examined. This results in performance that
is nearly linear (\(O(n)\)) with the number of words!
Which to use?
The greedy algorithm certainly can produce usable text, although
there is wide agreement that TeX (Knuth-Plass) generally produces a
better-looking paragraph (see the list of demerits and penalties above, that
it tries to minimize). Both typically use word-splitting (hyphenation), but
Knuth-Plass tries to keep split words to a minimum, and for all that, usually
takes up no additional lines. The downside to using Knuth-Plass is that it
assumes that all lines are the same length, so kludges are needed to handle
text flow around floats and inserts. It also doesn’t natively handle
changes in fonts or font-sizes, needing enhancements to handle these.
In the following sections, I will share thoughts on another approach to
paragraph shaping, one a bit more sophisticated than the greedy algorithm, yet
easier to comprehend and implement than Knuth-Plass.
Some resources
Note: your browser may be alarmed by some of these and refuse to
download a PDF file without an OK — I think this is just due to the
protocol being http instead of https.
Posted on 2022-Apr-20 at 23:52:00 by Phil
Last update on 2024-Jan-13 at 13:40:00 by Phil
There is a long list of criteria for a “good”
implementation of paragraph shaping:
- Line Count
- Minimize the number of lines output in the finished result, while
meeting the other goals. A method is not really any improvement if it takes up
much more page space (more lines) than other methods. Knuth-Plass can
use up to 10% more lines than Greedy, but the improved look of the results
(particularly with fewer hyphenated words) usually outweighs this cost.
- Avoid widows and orphans when splitting a paragraph across a page
or column break. Unless the paragraph is only one line long, make sure you
have at least two lines at the bottom of a page (or it’s an orphan)
and two lines at the top of a page (or it’s a widow). Knuth-Plass
does not address this, although Holkner does attempt to.
Often you can’t
do much about reducing (or increasing) the number of lines in a paragraph
through playing with break points and line tightness, so a better solution may
be to reduce leading to pull a widow onto this page, or increase it to
push another line onto the next page and keep the last line company. Likewise,
an orphan can be pushed to the next page though increased leading, or a line
can be pulled from the next page though decreased leading, to keep the first
line company (assuming this doesn’t create a widow on the next page).
Don’t forget that section headings (or graphic dividers) should be
kept with the first lines of the paragraph and not orphaned themselves.
- Hyphenation
- Minimize the total number of hyphenated lines, whether due to
split words or a split in a compound word (explicit manifest or hard hyphen in
the input string). Text just looks bad, and is tiring to read, when too many
words get split, and the right edge of the paragraph is full of hyphens.
- Avoid consecutive hyphenated lines, again whether due to split
words or compound words. Again, bad-looking results.
- Avoid hyphenating the penultimate line (the next-to-last line).
It doesn’t make sense to split a word when presumably there is space for
it on the short last (next) line. This is probably more of a typesetting custom
than having any real effect on readability. There are plenty of cases imaginable
where refusing to hyphenate a line will create other problems, from an
excessively loose line resulting, to a widow being created, to the next
(ultimate) line being too long to fit on just one line, and a very short
ultimate line results… Assign a very small penalty to this (if at all),
rather than making a lot of problems elsewhere.
- Allow a paragraph at the bottom of a page (or column) to be split in
the middle. An input would be “I expect that there are \(N\) lines
left in this column, so avoid hyphenating line \(N\), and I can put the rest
of the paragraph in the next column.” This prohibition on hyphenation
should be much stronger than the other prohibitions, as readers don’t
like having to remember part of a word when flipping a page. I don’t
think that Knuth-Plass addresses this. If the algorithm can handle getting to
the bottom of a column and going on to a new column (without manual
intervention, such as giving an expected split point), it still has to
discourage hyphenation on this last line. Knuth-Plass assumed a page of
indefinite length (as well as fixed length lines), and left any paragraph
splitting to the calling application.
- Line Lengths
- Try for consistent line lengths. A paragraph looks better when
all lines are close to the same length. Full justification (even edges on
both left and right) can hide some unevenness through space (glue) stretching,
but even if using Ragged Right (flush left) it should be reasonably even. For
Ragged Right, try to avoid a complete word (even a very short one or two letter
word) “overhanging” empty space on the right edge.
- Avoid a very short last line by loosening previous lines so that
the last line is at least 25 to 33% of its allowed line length. It just looks
bad to see a last line in a paragraph that is only one word or so. Some
algorithms are content to penalize only a single word on the last line (what if
it is a very long word?), but others may want to fill a minimum
percentage of the available length. Perhaps, fill to a minimum length, and if
it’s only a single word, let it pass. Knuth-Plass, in some
implementations, may address this (Holkner attempts to).
Rubinstein referred to
a single word on the last line as a cub, and something to be avoided.
- Allow varying line lengths, rather than one fixed line length.
Note that this refers to the maximum line length for each line being different,
and not the fraction of the line filled being consistent, as above.
Knuth-Plass is built around a fixed, constant, line length. Many implementations
have added kludges to try to support multiple line lengths in some manner
(usually an array of line lengths) so that non-rectangular paragraphs can be
supported (most often for the purpose of flowing text around a float or other
insert).
- Tightness
- Maintain reasonable tightness of lines. If a line’s
inter-word spaces (“glue” in Knuth-Plass) are cut to less than
approximately half their nominal size, the line becomes difficult to read, as
the words run together. If a line’s glue is stretched too much
(more than roughly doubling the length), it becomes difficult to read due to too
much inter-word spacing. In extreme cases, the reader’s eye can end up
jumping up or down a line as it scans across!
- Maintain consistent tightness of lines. If a line’s
inter-word spaces (“glue” in Knuth-Plass) can be reduced (tighter)
or increased (looser), you don’t want to have radical changes in
tightness from one line to the next. Knuth-Plass measures this as \(r\),
where a value of less than -0.5 is “tight”, -0.5 to +0.5 is
“normal”, +0.5 to something like +1.0 is “loose”, and
anything stretched greater than +1.0 is “very loose” (slightly
penalized as undesirable). Try to avoid changing more than one level going from
one line to the next, e.g., tight to normal is OK, or very loose to loose, but
tight to loose is excessive and unpleasant to read. Knuth-Plass bins sentences
into one of four categories (tight, normal, loose, very loose) and permits up
to one level change up or down. A better algorithm might be to calculate the
final \(r\) of a sentence, and allow the next sentence to be \(r \pm
0.3 \) without penalty and \(r \pm 0.6\) with some small penalty (and a much
larger penalty beyond that).
Something else that can be done is to adjust tracking in
lines. This is tweaking the inter-letter spacing within words, in order to make
inter-word spacing more consistent (i.e., similar-sized spaces on each
line). Don’t carry this to extremes, but it can be used to make the
result more visually pleasing.
- Minimize rivers of whitespace flowing through the text, where
spaces (glue), especially wide (loose or very loose) ones, more or less
align vertically, appearing to split the paragraph into left and right banks.
This attracts the eye and can make a reader lose their place.
I’m not sure that Knuth-Plass addresses this to any extent. The original
paper does say that this is seldom a problem unless very loose lines
are permitted.
- Avoid a stack of two consecutive lines that begin with the
same word (or fragment) or end with the same word or fragment, per Rubinstein
(via Holkner).
This is undesirable, as the reader’s eye may be misled
and pulled onto the wrong line. Knuth-Plass does not address this.
- Compatibility with Unicode word and line splitting rules, and
proper handling of all Unicode characters. As Knuth-Plass was written from the
viewpoint of simple single-byte encodings and (mostly) American English grammar
rules, there may be incompatibilities.
Inputs
- The original Knuth-Plass algorithm assumes a single string in one
font and font size. An improvement is to allow
an array of structures or hash arrays, each containing some text to put
into the paragraph. Variations in the font, including italic and bold text,
would require multiple array elements (one for each). A change in (human)
language would get its own element, so that the non-default language can be
used for hyphenation and possibly for bidirectional language support. A change
in font used, or its size, would require their own elements. The output
structures of something like Harfbuzz::Shaper (in Perl) could be a good start
for the element data. The leading needed by each element would be
needed, so that taller text could be accommodated. A very simple text, with no
font changes or variations, could be in a single element. If we are continuing
output on a new page or column, we need to know which element we are within,
and where within that element.
- An outline of the column to be filled. This could be a polyline
or it could include curved (arc, elliptical, spline) sections. It presumably
cannot intersect itself or otherwise be anything but a simple enclosure with a
single interior.
It does not have to be convex, as the intent is to allow inserts
(shorter lines). The intersection of each line from the top, with the outline,
determines the baseline for this line. There should be only a single baseline
resulting, which limits concavities (it probably is easiest just to discard all
but the longest baseline, if multiple ones result). Many implementations allow
an array of line lengths, but the problem with this is that if leading has to
be changed, to deal with widows and orphans, the column will now be the wrong
shape! For a very simple rectangular column (of indefinite length), a single
line length might be given instead.
- The vertical offset in the column to start at. We might have
already shaped one paragraph in this column, and thus need to start part way
down the outline of the column.
- Optional settings and overrides, such as the amount of indentation
on the first line of the paragraph (e.g., default 2em, but 0 following a
heading), or the leading to use. This implies that there are a number of global
(for all paragraphs) settings that could be set. If we are starting a new
column, part way through a paragraph, we need to know where to pick up with
the processing.
Outputs
- Modified (or new) array of text segments, ready to render.
Each could be an entire line in
the paragraph, or just a word or two. The x and y position of each element
would be included, so it can be rendered in the right place. Much information
will be copied over (font, font size, language, color, etc.) from the input
array.
- Line values: whether a hyphen is needed at the end of the line,
space compress/expand (tighter/looser), which might be fed to a word spacing
routine rather than individually outputting each word (depends on what the
rendering system supports).
- Any remaining content (position in the input) if we reached the
end of the column, and need to start another. Otherwise, the place to start the
next paragraph (plus any inter-paragraph vertical space). We should always have
a complete last line — no need to pick up in the middle of a line.
Posted on 2022-Apr-21 at 15:23:00 by Phil
Last update on 2024-Jan-16 at 14:10:00 by Phil
So, is there a new algorithm that we can use, instead of either the greedy
or the Knuth-Plass ones? Maybe there is an easier-to-implement method (instead
of Knuth-Plass), that will properly handle a number of issues that arise with
the original algorith, such as non-constant line lengths, page breaks at
mid-paragraph, full Unicode compatibility, etc.
I propose to start with something close to the greedy algorithm, but
blending in useful aspects of Knuth-Plass in a way that might not be elegant
Computer Science, but would produce acceptable results.
What can we measure?
You can’t do much if you can’t measure something, and assign
some sort of cost or penalty, and then minimize the sum of all costs and
penalties (suitably weighted). So, what sort of things can be measured in a
paragraph?
- Count the number of lines in the paragraph. This is a weak measure —
we want to tend towards a lower line count, but it’s not the be-all and
end-all.
- Count the number of hyphens at the end of the lines, both explicit
(manifest) and added by word-splitting. We want to reduce this count.
- Penalize fairly strongly two or more consecutive lines ending with a
hyphen, a hyphenated penultimate line, and even more strongly, the last line in
a column hyphenated (when the paragraph is split between two columns).
- Measure the tightness of lines, penalizing both extreme tightness
(\(r\) approaching -1.0) and extreme looseness (\(r\) exceeding +1.0), as
well as more than one “step” of tightness between two lines (e.g.,
tight to loose). There is no reason to expand (loosen) the last line, but it
might be desirable to tighten it slightly to fit, so \(r\) on the last line
should be 0 or less.
- Measure the variance in line length (before any spaces compressed or
expanded). This might be
\((\mathrm {line}_{N + 1} - \mathrm {line}_N)^2\) or
something similar. Squaring the
difference both takes care of plus and minus signs, and emphasizes (for values
> 1) greater differences. We wish to minimize variance, even more so if doing
Ragged Right justification.
- Measure the length of the last line, as a percentage of the allowed line
length. Penalize a line shorter than some arbitrary amount, such as 33%, unless
this is the only line.
- Knuth-Plass looks for a word completely “floating” above a
shorter following line (large space at the end), but I’m not sure how much
value this adds, or how easy it is to detect. Perhaps the x position of the
beginning of the last word on line \(N\) should not exceed the \(x\) position of
the end of the last word on line \(N+1\)? In any case, it would be of more
visual importance for a Ragged Right justification.
- There is discussion in Knuth-Plass regarding “rivers” of
whitespace, but it doesn’t sound like they do much about it. I suppose
that it could be measured by looking at line 1 whitespace start and finish
points, line 2’s points, and seeing if any overlaps exceed a certain
percentage of the narrower space (they may be different sizes, depending on
\(r\) values).For those that do, compare line 3 to line 2 in a similar
manner, seeing if anything continues through all three. Keep repeating, not
forgetting to check for new rivers arising between line 2 and 3, etc. Decide on
the maximum number of lines that a “river” has to run before
it’s visually disturbing (say, 3 or 4). Extra weighting might be given to
nearly straight rivers flowing down the paragraph, defined as very large
overlap percentages. Also pay attention to diagonal
rivers, which can still be very eye-catching.
- Some typesetting styles like to have their punctuation, including any
hyphens, hanging out over the right edge (past the maximum line length). This
could be accommodated with an appropriate switch. Any trailing
non-alphanumeric(s) on a potential last word for a line would be trimmed off
before fitting the word (and then replaced, with the word ending at the max).
Or, you can do what KP recommends, and make the trailing punctuation 0-length,
with the lost width added to the following space. Generally, you want to avoid
having more than one punctuation character hanging over the end of the line, so
this could force hyphenation of the word, or even moving the entire thing to
the following line!
- Hyphenation should be handled with a language-appropriate word-splitting
routine (default American English), with any alternate language information
carried along with the input text for this purpose. Note should be made that
some languages have rather odd hyphenation rules — German and Dutch come
to mind, as in some cases repeating, changing, or adding letters at the
split! This complicates handling word splitting at the end of a line. Soft
(hidden) hyphens should be supported, with a switch to indicate that
SHY’s indicate either the only place to split this word, or the
preferred place, or just an additional place. In any case,
don’t forget to remove all SHY’s before final rendering, as some
systems (such as PDF readers) will display them as hard hyphens! Some
authorities suggest using soft hyphens at a line split, rather than hard
hyphens. They should render the same, but some Readers and Browsers may do a
better job of audio or braille output, or other reformatting of the text, if
they see a soft hyphen (and thus know that a word split at the end of a line
should be closed back up).
Anything missing from this list?
All the above, except for whitespace rivers, paragraph splitting at end
of column, and possibly short last line penalty, are found in Knuth-Plass.
There is also discussion of optimizing an entire page, but I
don’t know what kind of work has been done on this, and whether there are
additional things to measure. For the time being, we’ll leave it at
paragraph optimization.
Posted on 2022-Apr-21 at 19:49:00 by Phil
Last update on 2022-Dec-14 at 12:10:00 by Phil
We can measure all sorts of things, and come up with a total cost of
splitting a given paragraph in a certain way. The weightings of various
demerits or costs, and penalties, is yet to be determined, and might even be
fine-tuned for an eye-pleasing layout on a case-by-base basis (even varying
from paragraph to paragraph!). Where have we
taken a different road from Knuth-Plass? So far, not much different, although
we are going to have to have some differences to properly handle varying
line lengths (due to non-rectangular column enclosures), splitting in the
middle of a paragraph (suppress hyphenation on that line), proper Unicode
behavior, and suppressing short last lines.
One thing we don’t want to do is repeatedly lay out the entire
paragraph while tweaking the split points. That’s just too expensive.
Knuth-Plass cheats a bit by assuming that all lines are the same length, which
leads to some substantial simplification. Without that, it may be harder. We
need to backtrack or look back no more than two or three previous lines, at
most, to keep run-time to reasonable levels. Some things in the list just
really can’t be done if we can’t update some number of
previously built lines. Keep in mind that when a line split point
is adjusted, all the running tallies of costs will have to be redone if
we’re not careful! Perhaps each line can have some values in its line
node so that we don’t have to go back more than the first line that has
its split point adjusted. All previous lines will be unaffected. When and if the
column is filled, if there is more content in the paragraph, something will have
to be done to check that a widow isn’t going to be created in the next
column (which may be of different width, and so we need to know the line
length). If there will be a widow, we need to either redo this entire
page, either with decreased leading to pull the widow into this column, or with
increased leading to push an extra line into the next column (unless it creates
an orphan in this column!).
Let’s start with what might be called a “semi-greedy”
algorithm. Fill the line with complete words (no hyphenation yet) until the
next won’t fit. See if the line will be excessively loose (including
compared to the previous line’s tightness) if you stop there. See if
adding the next word, which will require compressing spaces (\(r < 0\)),
will make the line too tight (including compared to the previous line). If
neither of those work, is the previous line hyphenated? If not, and this line
is not the penultimate line or the last line in the column, go ahead and
hyphenate the word, fitting as much as possible into this line, but obeying the
tightness limits mentioned before. That will produce reasonable looking
output, but may not be globally optimal.
If the line is too loose because it cannot be hyphenated, we will have to
go back up to three lines and start adjusting line splits there. Is there a way
to decide which of the three previous lines to choose? We want to loosen one
or more previous lines to flow some text onto this line until it is tight
enough, but we have to be careful not to loosen up those lines so much that
they fail to obey their own tightness constraints. Start with a line that is
hyphenated, and move the hyphenation point earlier (leftwards) in the word,
reflowing up to four lines (and checking tightness and hyphenation constraints).
If successful, we’re done and can go on to the next line. If we are on
the last line and it is too short, the limit of three previous lines to
adjust doesn’t give us much runway, but we can still try loosening up all
three lines, which (as long as they continue to obey their tightness
constraints) should at least improve the last line.
A note on tightening or loosening a line: if the last word has been split
(hyphenated), if we want to tighten it, we can start the splitting from the
right (trailing) end of the word, rather than the left (beginning) end.
Especially if you quit just after getting "in range" of the desired \(r\) range,
you might go one split further in the appropriate direction, even if you
do end up going slightly out of range. Conversely, if you want to loosen
a line, you could start splitting from the left (particularly if the original
split was done from the right, in an effort to tighten the line), particularly
if you quit as soon as
you’re within the \(r\) range. In both cases, if there is a choice of
split points that falls within a reasonable (or at worst, just outside of) \(r\)
value range, you can pick which one in an effort to tighten or loosen the line.
Keep in mind that a change in tightness for one line may affect many or all of
the lines later in the paragraph, possibly leading to undesirable shaping! To
speed up the process, store the full unhyphenated word in the data structure,
while holding all the possible split points there, too. That way, an adjacent
word split (hyphenation) point could be examined at relatively low cost.
Don’t forget word splitting rules in other languages, such as Dutch and
German.
You will notice that this algorithm is rather ad hoc, and does not compute a
total cost (to minimize), as Knuth-Plass does. It would certainly be possible to
compute the cost figure after all is said and done, but then the only
course of action would be to start from the beginning and redo the whole
paragraph with a different split start point and compare the results. At least,
we could compute a running total cost, and abandon this attempt once the cost
exceeds the best figure found so far (there is no point in continuing, since
the cost is guaranteed to be higher). Even if we start one line further along
each time, it’s going to be something like \(O(n^2)\),
which is undesirable performance. As with Knuth-Plass, we want to make only one
pass through the paragraph, calculating costs as we go, and pruning out bad
choices, so that we minimize the work done to find a near-optimal path (choices
of line-splitting points).
At this point, I think I will try out this semi-greedy algorithm and see how
it behaves, before thinking any more about improvements. This may take a while,
but I will try to remember to report back with results.
Posted on 2022-May-02 at 12:48:00 by Phil
Last update on 2022-Dec-14 at 12:17:00 by Phil
Another approach that occurs to me would be to make several passes over a
paragraph (or page), refining the line splitting with each pass. The first pass
would be simply a greedy split that does not split any words (except
emergency cases where an intact word won’t even fit in the column all by
itself!). Calculate the cost (various penalties and demerits). Find the line
with the highest penalty and “fix” it by adjusting the break point
(which might actually be on an earlier or later line!). Recalculate the total
cost and repeat some number of times.
This is something of a local optimization, and unfortunately involves
calculating nearly the entire paragraph (or page) cost function from scratch at
each pass (as well as possibly having to update later break points). That alone
might scuttle the proposition, as all later lines may need to be updated.
Perhaps it can be the basis for some algorithm, if that issue can be suitably
addressed. One way might be to limit the number of attempts to “fix”
the paragraph to, say, 10% of the number of lines. Keep track of the total
penalty score for the original attempt, and for each fix attempt (as well as
what was changed up to that point), and pick the lowest score and be done with
it.