NAME

Text::KnuthPlass - Breaks paragraphs into lines using the TeX (Knuth-Plass) algorithm

SYNOPSIS

To use with plain text, with a paragraph indentation of 2 characters. Note that you should also set the shrinkability of spaces to 0 in the new() call:

    use Text::KnuthPlass;
    my $typesetter = Text::KnuthPlass->new(
        'indent' => 2, # two characters,
        # set space shrinkability to 0
        'space' => { 'width' => 3, 'stretch' => 6, 'shrink' => 0 },
        # can let 'measure' default to character count
        # default line lengths to 78 characters
    );
    my @lines = $typesetter->typeset($paragraph);
    ...

    for my $line (@lines) {
        for my $node (@{$line->{'nodes'}}) {
            if ($node->isa("Text::KnuthPlass::Box")) { 
                # a Box is a word or word fragment (no hyphen on fragment)
                print $node->value();
            } elsif ($node->isa("Text::KnuthPlass::Glue")) {
                # a Glue is (at least) a single space, but you can look at 
                # the line's 'ratio' to insert additional spaces to 
                # justify the line. we also are glossing over the skipping
                # of any final glue at the end of the line
                print " ";
            }
            # ignoring Penalty (word split point) within line
        }
        if ($line->{'nodes'}[-1]->is_penalty()) { print "-"; }
        print "\n";
    }
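
The comments in the loop above mention inserting additional spaces to justify a plain-text line. A minimal sketch of one way to do that is shown here. It does not use the line's 'ratio'; instead it simply pads the gaps between already-assembled words up to a target width, which is a simpler substitute technique. The helper name justify_words is hypothetical and not part of Text::KnuthPlass, and the last line of a paragraph would normally be left unpadded.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Text::KnuthPlass): pad the gaps between
# words with extra spaces so the line is exactly $width characters wide.
sub justify_words {
    my ($width, @words) = @_;
    return join(' ', @words) if @words < 2;

    my $gaps  = @words - 1;
    my $chars = 0;
    $chars += length($_) for @words;

    my $spaces = $width - $chars;                  # spaces to distribute
    return join(' ', @words) if $spaces < $gaps;   # line too long: no padding

    my $base  = int($spaces / $gaps);   # every gap gets at least this many
    my $extra = $spaces % $gaps;        # the leftmost gaps get one more

    my $out = '';
    for my $i (0 .. $#words) {
        $out .= $words[$i];
        $out .= ' ' x ($base + ($i < $extra ? 1 : 0)) if $i < $gaps;
    }
    return $out;
}

my $line = justify_words(20, qw(the quick brown fox));
print "$line|\n";    # the line is 20 characters wide before the '|'
```

Giving the leftover spaces to the leftmost gaps is an arbitrary choice; alternating sides from line to line tends to look more even.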

To use with PDF::Builder (or PDF::API2):

    my $text = $page->text();
    $text->font($font, 12);
    $text->leading(13.5);

    my $t = Text::KnuthPlass->new(
        'indent' => 2*$text->text_width('M'), # 2 ems
        'measure' => sub { $text->text_width(shift) }, 
        'linelengths' => [235]  # points
    );
    my @lines = $t->typeset($paragraph);

    my $y = 500;  # PDF decreases y down the page
    for my $line (@lines) {
        my $x = 50;  # left margin
        for my $node (@{$line->{'nodes'}}) {
            $text->translate($x,$y);
            if ($node->isa("Text::KnuthPlass::Box")) {
                # a Box is a word or word fragment (no hyphen on fragment)
                $text->text($node->value());
                $x += $node->width();
            } elsif ($node->isa("Text::KnuthPlass::Glue")) {
                # a Glue is a variable-width space
                $x += $node->width() + $line->{'ratio'} *
                    ($line->{'ratio'} < 0 ? $node->shrink(): $node->stretch());
                # we also are glossing over the skipping
                # of any final glue at the end of the line
            }
            # ignoring Penalty (word split point) within line
        }
        # explicitly add a hyphen at a line-ending split word
        if ($line->{'nodes'}[-1]->is_penalty()) { $text->text("-"); }
        $y -= $text->leading(); # go to next line down
    }

METHODS

$t = Text::KnuthPlass->new(%opts)

The constructor takes a number of options. The most important ones are:

measure

A subroutine reference to determine the width of a piece of text. This defaults to length(shift), which is what you want if you're typesetting plain monospaced text. You will need to change this to plug into your font metrics if you're doing something graphical. For PDF::Builder (also PDF::API2), this would be the advancewidth() method (alias text_width()), which returns the width of a string (in the present font and size) in points.

    'measure' => sub { length(shift) },  # default, for character output
    'measure' => sub { $text->advancewidth(shift) }, # PDF::Builder/API2

linelengths

This is an array of line lengths. For instance, [30,40,50] will typeset a triangle-shaped piece of text with three lines. If the text spills over to more than three lines, the final value in the array is used for all further lines. So to typeset an ordinary block-shaped column of text, you only need to specify an array with one value; the default is [78]. Note that this default is a character count, not a length in points (as needed by PDF::Builder or PDF::API2).

    'linelengths' => [$lw, $lw, $lw-6, $lw-6, $lw],

This would set the first two lines in the paragraph to $lw length, the next two to 6 less (such as for a float inset), and finally back to full length. At each line, the first element is consumed, but the last element is never removed. Any paragraph indentation set will result in a shorter-appearing first line, which actually has blank space at its beginning. Start output of the first line at the same x value as you do the other lines.

Setting linelengths in the new() (constructor) call resets the internal line length list to the new elements, overwriting anything that was already there (such as any remaining line lengths left over from a previous typeset() call). Subsequent typeset() calls will continue to consume the existing line length list, until the last element is reached. You can either reset the list for the next paragraph with the typeset() call, or call the line_lengths() method to get or set the list.
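
The consumption rule described above (the first element is used up at each line, and the last element is never removed) can be modeled in a few lines of plain Perl. This is only an illustrative sketch of the documented behavior, not the module's internal code; next_line_length is a hypothetical helper name.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Text::KnuthPlass): return the length to
# use for the next line. The first element is consumed unless it is the
# only one left, in which case it is reused for every further line.
sub next_line_length {
    my ($lengths) = @_;    # array reference, modified in place
    return @$lengths > 1 ? shift @$lengths : $lengths->[0];
}

my $lw = 78;
my @linelengths = ($lw, $lw, $lw-6, $lw-6, $lw);

# Ask for seven lines, two more than the list holds.
my @used = map { next_line_length(\@linelengths) } 1 .. 7;
print "@used\n";    # 78 78 72 72 78 78 78
```

After the seven calls only the final element remains in @linelengths, which is why a following paragraph would be set at full width unless the list is reset.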

indent

This sets the global (default) paragraph indentation, unless overridden on a per-paragraph basis by an indent entry in a typeset() call. The units are the same as for measure and linelengths. A "Box" of value '' and width of indent is inserted before the first node of the paragraph. Your rendering code should know how to handle this by starting at the same x coordinate as other lines, and then moving right (or left) by the indicated amount.

    'indent' => 2,  # 2 character indentation
    'indent' => 2*$text->text_width('M'),  # 2 ems indentation
    'indent' => -3,  # 3 character OUTdent

If the value is negative, a negative-width space Box is added. The overall line will be longer than other lines, by that amount. Again, your rendering code should handle this in a similar manner as with a positive indentation, but move left by the indicated amount. Be careful to have your starting x value far enough to the right that text will not end up being written off-page.

tolerance

How much leeway we have in leaving wider spaces than the algorithm would prefer. The tolerance is the maximum glue expansion ratio to tolerate in a possible solution, before discarding that solution as so infeasible as to be a waste of time to pursue further. Most of the time, the tolerance is going to have a value in the 1 to 3 range. One approach is to try with tolerance => 1, and if no successful layout is found, try again with 2, and then 3 and perhaps even 4.
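
The retry approach can be sketched as a small loop. This is a hedged example, not code from the module: typeset_with_fallback is a hypothetical helper, and it assumes that typeset() returns an empty list when no layout is found, which the documentation implies but does not state outright. The typesetter is built through a code reference so that the sketch does not depend on any particular set of constructor options.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Text::KnuthPlass): try each tolerance
# in turn and return the tolerance that worked, followed by the lines.
# $make is a code reference that builds a typesetter for one tolerance,
# for example:
#     sub { Text::KnuthPlass->new('tolerance' => shift, 'indent' => 2) }
# ASSUMPTION: typeset() returns an empty list on failure.
sub typeset_with_fallback {
    my ($make, $paragraph, @tolerances) = @_;
    for my $tolerance (@tolerances) {
        my @lines = $make->($tolerance)->typeset($paragraph);
        return ($tolerance, @lines) if @lines;
    }
    return;    # no tolerance in the list produced a layout
}
```

A call such as my ($used, @lines) = typeset_with_fallback($make, $paragraph, 1 .. 4); then yields the loosest setting that was actually needed, which can be useful to log when tuning a layout.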

hyphenator

An object which hyphenates words. If you have the Text::Hyphen product installed (which is highly recommended), then a Text::Hyphen object is instantiated by default; if not, an object of the class Text::KnuthPlass::DummyHyphenator is instantiated - this simply finds no hyphenation points at all. So to turn hyphenation off, set

    'hyphenator' => Text::KnuthPlass::DummyHyphenator->new()

To typeset non-English text, pass in a Text::Hyphen-like object which responds to the hyphenate method, returning a list of hyphen positions for that particular language (native Text::Hyphen defaults to American English hyphenation rules). (See Text::Hyphen for the interface.)

space

Fine tune space (glue) width, stretchability, and shrinkability.

    'space' => { 'width' => 3, 'stretch' => 6, 'shrink' => 9 },

For typesetting constant width text or output to a text file (characters), we suggest setting the shrink value to 0. This prevents the glue spaces from being shrunk to less than one character wide, which could result in either no spaces between words, or overflow into the right margin.

    'space' => { 'width' => 3, 'stretch' => 6, 'shrink' => 0 },

infinity

The default value for infinity is, as is customary in TeX, 10000. While this is a far cry from the real infinity, so long as it is substantially larger than any other demerit or penalty, it should take precedence in calculations. Both positive and negative infinity are used in the code for various purposes, including a +inf penalty for something absolutely forbidden, and -inf for something absolutely required (such as a line break at the end of a paragraph).

    'infinity' => 10000,

hyphenpenalty

Set the penalty for an end-of-line hyphen at 50. You may want to try a somewhat higher value, such as 100+, if you see too much hyphenation on output. Remember that excessively short lines are prone to splitting words and being hyphenated, no matter what the penalty is.

    'hyphenpenalty' => 50,

There does not appear to be anything in the code to find and prevent multiple contiguous (adjacent) hyphenated lines, nor to prevent the penultimate (next-to-last) line from being hyphenated, nor to prevent the hyphenation of a line where you anticipate the paragraph to be split between columns. Something may be done in the future about these three special cases, which are considered to not be good typesetting.
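
Until the module handles these cases itself, a caller can at least detect runs of hyphenated lines after typesetting. The following is a minimal sketch, not part of the module: longest_hyphen_run is a hypothetical helper, and it relies on the SYNOPSIS convention that a line whose last node answers true to is_penalty() was split at a hyphenation point.

```perl
use strict;
use warnings;

# Hypothetical post-check (not part of Text::KnuthPlass): return the length
# of the longest run of consecutive lines that end in a penalty node, that
# is, lines the SYNOPSIS rendering loop would finish with a hyphen.
sub longest_hyphen_run {
    my (@lines) = @_;
    my ($run, $longest) = (0, 0);
    for my $line (@lines) {
        my $last = $line->{'nodes'}[-1];
        if (defined $last && $last->is_penalty()) {
            $run++;
            $longest = $run if $run > $longest;
        } else {
            $run = 0;    # an unhyphenated line ends the current run
        }
    }
    return $longest;
}
```

If the result is greater than 1, one option is to typeset the paragraph again with a higher hyphenpenalty and keep whichever layout looks better.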

demerits

Various demerits used in calculating penalties, including fitness, which is used when line tightness (ratio) changes by more than one class between two lines.

    'demerits' => { 'line' => 10, 'flagged' => 100, 'fitness' => 3000 },
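
To make "changes by more than one class" concrete, here is a sketch of the fitness classes as they are defined in the Knuth-Plass paper. The thresholds are an assumption taken from that paper, not from this module's source, so the module's internal values may differ; both function names are hypothetical.

```perl
use strict;
use warnings;

# Fitness classes as defined in the Knuth-Plass paper.
# ASSUMPTION: the module's internal thresholds may differ from these.
sub fitness_class {
    my ($ratio) = @_;
    return 0 if $ratio < -0.5;    # tight
    return 1 if $ratio <= 0.5;    # decent
    return 2 if $ratio <= 1;      # loose
    return 3;                     # very loose
}

# The 'fitness' demerit applies when two adjacent lines differ by more
# than one class, for example a tight line followed by a loose one.
sub fitness_demerit_applies {
    my ($prev_ratio, $ratio) = @_;
    return abs(fitness_class($prev_ratio) - fitness_class($ratio)) > 1 ? 1 : 0;
}

print fitness_demerit_applies(-0.8, 0.7), "\n";    # 1 (tight, then loose)
print fitness_demerit_applies( 0.2, 0.7), "\n";    # 0 (decent, then loose)
```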

There may be other options for fine-tuning the output. If you know your way around TeX, dig into the source to find out what they are. At some point, this package will support additional tuning by allowing the setting of more parameters which are currently hard-coded. Please let us know if you found any more parameters that would be useful to allow additional tuning!

$t->typeset($paragraph_string, %opts)

This is the main interface to the algorithm, made up of the constituent parts below. It takes a paragraph of text and returns a list of lines (array of hashes) if suitable breakpoints could be found.

The typesetter currently allows several options:

indent

Override the global paragraph indentation value just for this paragraph. This can be useful for instances such as not indenting the first paragraph in a section.

    'indent' => 0,  # default set in new() is 2ems

linelengths

The array of line lengths may be set here, in typeset. As with new(), it will override whatever existing line lengths array is left over from earlier operations.

Possibly (in the future) many other global settings set in new() may be overridden on a per-paragraph basis in typeset().

The returned list has the following structure:

    (
        { 'nodes' => \@nodes, 'ratio' => $ratio },
        { 'nodes' => \@nodes, 'ratio' => $ratio },
        ...
    )

The node list in each element will be a list of objects. Each object will be either Text::KnuthPlass::Box, Text::KnuthPlass::Glue or Text::KnuthPlass::Penalty. See below for more on these.

The ratio is the amount of stretch or shrink which should be applied to each glue element in this line. The corrected width of each glue node should be:

    $node->width() + $line->{'ratio'} *
        ($line->{'ratio'} < 0 ? $node->shrink() : $node->stretch());

Each box, glue or penalty node has a width attribute. Boxes have values, which are the text which went into them (including a wide null blank for paragraph indentation, a special case); glue has stretch and shrink to determine how much it should vary in width. That should be all you need for basic typesetting; for more, see the source, and see the original Knuth-Plass paper in "Digital Typography".
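
As a worked example of the glue width formula above, the following sketch applies it to plain numbers. The values are illustrative only, and adjusted_glue_width is a hypothetical helper whose arguments stand in for $node->width(), $node->stretch(), $node->shrink() and $line->{'ratio'}.

```perl
use strict;
use warnings;

# Adjusted width of one glue node, following the formula above. A negative
# ratio scales the shrink amount; zero or positive scales the stretch.
sub adjusted_glue_width {
    my ($width, $stretch, $shrink, $ratio) = @_;
    return $width + $ratio * ($ratio < 0 ? $shrink : $stretch);
}

# Illustrative values: a 3-point space that may stretch by 1.5 points
# or shrink by 1 point.
printf "%.2f\n", adjusted_glue_width(3, 1.5, 1,  0.5);    # 3.75 (stretched)
printf "%.2f\n", adjusted_glue_width(3, 1.5, 1, -0.5);    # 2.50 (shrunk)
printf "%.2f\n", adjusted_glue_width(3, 1.5, 1,  0);      # 3.00 (natural)
```

Because every glue node on a line shares the same ratio, the adjusted widths add up so that the line fills its target length.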

Why typeset rather than something like linesplit? Per "ACKNOWLEDGEMENTS", this code is ported from the JavaScript product typeset.

This method is a thin wrapper around the three methods below.

$t->line_lengths()

@list = $t->line_lengths() # Get
$t->line_lengths(@list) # Set

Get or set the linelengths list of allowed line lengths. This permits you to do more elaborate operations on this array than simply replacing (resetting) it, as done in the new() and typeset() methods. For example, at the bottom of a page, you might cancel any further inset for a float, by deleting all but the last element of the list.

    my @temp_LL = $t->line_lengths();
    # cancel remaining line shortening
    splice(@temp_LL, 0, scalar(@temp_LL)-1);
    $t->line_lengths(@temp_LL);

On a "Set" request, you must have at least one length element in the list. If the list is empty, it is assumed to be a "Get" request.

$t->break_text_into_nodes($paragraph_string, %opts)

This turns a paragraph into a list of box/glue/penalty nodes. It's fairly basic, and designed to be overloaded. It should also support multiple justification styles (centering, ragged right, etc.), but full support will come in a future release; right now, only full justification is complete (see the style option below).

'style' => "string_name"

"justify"

Fully justify the text (flush left and right). This is the default, and currently the only choice implemented.

"left"

Not yet implemented. This will be flush left, ragged right (reversed for RTL scripts).

"right"

Not yet implemented. This will be flush right, ragged left (reversed for RTL scripts).

"center"

Implemented, but not yet fully tested. This is centered text within the indicated line width.

If you are doing clever typography or using non-Western languages you may find that you will want to break text into nodes yourself, and pass the list of nodes to the methods below, instead of using this method.

break

This implements the main body of the algorithm; it turns a list of nodes (produced from the above method) into a list of breakpoint objects.

@lines = $t->breakpoints_to_lines(\@breakpoints, \@nodes)

And this takes the breakpoints and the nodes, and assembles them into lines.

boxclass()

glueclass()

penaltyclass()

For subclassers.

AUTHOR

Originally written by Simon Cozens, <simon at cpan.org>

Since 2020, maintained by Phil Perry

ACKNOWLEDGEMENTS

This module is a Perl translation (originally by Simon Cozens) of Bram Stein's "Typeset" JavaScript Knuth-Plass implementation.

BUGS

Please report any bugs or feature requests to the issues section of https://github.com/PhilterPaper/Text-KnuthPlass.

Do NOT under ANY circumstances open a PR (Pull Request) to report a bug. It is a waste of both your and our time and effort. Open a regular ticket (issue), and attach a Perl (.pl) program illustrating the problem, if possible. If you believe that you have a program patch, and offer to share it as a PR, we may give the go-ahead. Unsolicited PRs may be closed without further action.

COPYRIGHT & LICENSE

Copyright (c) 2011 Simon Cozens.

Copyright (c) 2020-2022 Phil M Perry.

This program is released under the following license: Perl, GPL

 


 

This page is https://www.catskilltech.com/Documentation/Text/KnuthPlass.html


Last updated Wed, 08 Nov 2023 at 11:28 PM
