Customization

Formatting Lines

If you specify a callable object as the format property of a LineBreak object, it should accept three arguments:

callable_object(self, context, string) -> text_or_None

self is a LineBreak object, context is a string identifying the context in which the callable was invoked, and string is the fragment of the Unicode string immediately before or after the breaking position.

context   When                     Value of string
"sot"     Beginning of text        Fragment of first line
"sop"     After mandatory break    Fragment of next line
"sol"     After arbitrary break    Fragment on sequel of line
""        Just before any breaks   Complete line without trailing SPACEs
"eol"     Arbitrary break          SPACEs leading breaking position
"eop"     Mandatory break          Newline and its leading SPACEs
"eot"     End of text              SPACEs (and newline) at end of text

The callable object should return the modified text fragment, or None to indicate that no modification occurred. Note that a modification in the "sot", "sop" or "sol" context may affect the choice of subsequent breaking positions, whereas a modification in the other contexts will not.

Note

String arguments are actually sequences of grapheme clusters. See documentation of GCStr class.

For example, the following code folds lines and removes trailing spaces:

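from textseg import LineBreak

def format(self, event, string):
    # In every end-of-line context ("eol", "eop", "eot") the fragment holds
    # the trailing SPACEs (and newline); replace it with a bare newline.
    if event.startswith('eo'):
        return "\n"
    # Leave all other fragments unchanged.
    return None

# text is the Unicode string to be folded.
lb = LineBreak(format = format)
output = ''.join([str(s) for s in lb.wrap(text)])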

User-Defined Breaking Behaviors

When a line produced by an arbitrary break is expected to exceed any of charmax, width or minwidth, an urgent break may be performed on the following string. If you specify a callable object as the value of the urgent attribute, it should accept two arguments:

callable_object(self, string) -> [text, ...]

self is a LineBreak object and string is a Unicode string to be broken.

The callable object should return a list of the pieces into which string is broken.

Note

String argument is actually a sequence of grapheme clusters. See GCStr class.

For example, the following code inserts hyphens into the names of certain chemical substances (such as Titin) so that they can be folded:

# Example not yet written
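
Pending an official example, a minimal sketch of such a callable could look like the following. It assumes that breaking after each "yl" suffix is acceptable for the names in question and that str() converts the grapheme cluster string to a plain string; the name hyphenize is illustrative and not part of textseg.

```python
import re

def hyphenize(self, string):
    """Split a long chemical name after each "yl" and mark splits with "-"."""
    # Capture the shortest run of word characters that ends in "yl" and is
    # followed by another word character. With one capturing group,
    # re.split() returns [unmatched, matched, unmatched, matched, ...].
    parts = re.split(r'(\w+?yl(?=\w))', str(string))
    pieces = []
    for i, part in enumerate(parts):
        if not part:
            continue
        # Odd indices are the captured "...yl" pieces; each is followed by
        # more text, so a hyphen there never ends up at the very end.
        pieces.append(part + '-' if i % 2 else part)
    return pieces

# self is unused in this sketch, so None stands in for the LineBreak object.
parts = hyphenize(None, "methionylthreonylglutamine")
print(parts)
```

The callable would then be given as the value of the urgent attribute. Note that this sketch does not check whether each piece actually fits within the line width.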

If you specify a (regular expression, callable object[, flags]) tuple as an item of the prep option, the callable object should accept two arguments:

callable_object(self, string) -> [text, ...]

self is a LineBreak object and string is the Unicode string matched by the regular expression.

The callable object should return a list of the pieces into which string is broken.

For example, the following code breaks HTTP URLs according to the [CMOS] rule:

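import re
from textseg import fill

urire = re.compile(r'\b(?:url:)?http://[\x21-\x7E]+',
                   re.I | re.U)

def breakURI(self, s):
    r = ''    # piece being accumulated
    ret = []  # completed pieces
    b = ''    # previous character
    for c in s:
        if b == '':
            # The first character starts the first piece.
            r = c
        elif r.lower().endswith('url:'):
            # Break after a leading "URL:" prefix.
            ret.append(r)
            r = c
        elif b in '/' and c not in '/' or \
             b not in '-.' and c in '-~.,_?#%=&' or \
             b in '=&' or c in '=&':
            # Break after a run of slashes, before punctuation,
            # and on both sides of "=" and "&".
            if r != '':
                ret.append(r)
            r = c
        else:
            r += c
        b = c
    if r != '':
        ret.append(r)
    return ret

# text is the Unicode string to be folded.
output = fill(text, prep = [(urire, breakURI)])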

Changed in version 0.1.1: the prep attribute accepts tuples with a third item, flags.

Preserving State

A LineBreak object can behave as a dictionary. Any items stored in it are preserved throughout its lifetime.

For example, the following code separates paragraphs with empty lines:

# Example not yet written
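
Pending an official example, one way such a format callable could be sketched is shown below. It relies on the dictionary-style access described above and on the contexts listed under Formatting Lines. The name paraformat and the key 'line' are illustrative choices, and a plain dict stands in for the LineBreak object so that the sketch runs without textseg.

```python
def paraformat(self, context, string):
    """Format callable that remembers the current line in self['line']."""
    if context in ('sot', 'sop'):
        # A new paragraph starts: forget the previously stored line.
        self['line'] = ''
    elif context == '':
        # Complete line just before a break: remember it.
        self['line'] = string
    elif context == 'eol':
        return "\n"
    elif context == 'eop':
        # Add an empty line only after a paragraph whose last line has text.
        return "\n\n" if len(self['line']) else "\n"
    elif context == 'eot':
        return "\n"
    return None

# Stand-in for a LineBreak object (assumption: it supports item access).
state = {}
paraformat(state, 'sot', 'First paragraph.')
paraformat(state, '', 'First paragraph.')
end_of_paragraph = paraformat(state, 'eop', "\n")
```

With textseg, the callable would presumably be passed as the format property, in the same way as the format example earlier in this section.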

Calculating String Size

If you specify a callable object as the value of the sizing property, it will be called with five arguments:

callable_object(self, length, pre, spc, string) -> number_of_columns

self is a LineBreak object, length is the size of the preceding string, pre is the preceding Unicode string, spc is the additional SPACEs and string is the Unicode string to be processed.

The callable object should return the calculated number of columns of pre + spc + string. The number of columns need not be an integer: its unit may be chosen freely, but it must be the same as that of the minwidth and width properties.

Note

String arguments are actually sequences of grapheme clusters. See GCStr class.

For example, the following code handles lines with tab stops every eight columns:

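from textseg import fill
from textseg.Consts import lbcSP

def sizing(self, cols, pre, spc, string):
    spcstr = spc + string
    for i, c in enumerate(spcstr):
        # At the first non-SPACE cluster, add the width of the rest and stop.
        if c.lbc != lbcSP:
            cols += spcstr[i:].cols
            break
        if c == "\t":
            # Advance to the next multiple of eight columns.
            cols += 8 - (cols % 8)
        else:
            cols += c.cols
    return cols

# text is the Unicode string to be folded; TAB is tailored to class SP.
output = fill(text, lbc = {ord("\t"): lbcSP}, sizing = sizing,
              expand_tabs = False)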
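
The tab-stop arithmetic used above, cols += 8 - (cols % 8), can be checked in isolation. The following is a standalone sketch on plain strings; the helper names are hypothetical, and it assumes every character other than TAB occupies one column, which the GCStr-based code above does not assume.

```python
def next_tab_stop(cols, tab_width=8):
    """Return the column reached by a TAB typed at column cols."""
    return cols + tab_width - (cols % tab_width)

def columns_with_tabs(cols, text, tab_width=8):
    """Count columns of text starting at column cols, expanding TABs."""
    for ch in text:
        if ch == "\t":
            cols = next_tab_stop(cols, tab_width)
        else:
            cols += 1  # assumption: one column per non-TAB character
    return cols

for start in (0, 3, 7, 8):
    print(start, "->", next_tab_stop(start))
```

A TAB always advances by at least one column: at a column that is already a multiple of eight, it moves a full eight columns to the next stop.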

Tailoring Character Properties

Character properties may be tailored with the lbc and eaw options. Several constants are defined to make tailoring convenient.

Line Breaking Properties

Non-starters of Kana-like Characters

textseg.Consts.KANA_NONSTARTERS
textseg.Consts.IDEOGRAPHIC_ITERATION_MARKS
textseg.Consts.KANA_SMALL_LETTERS
textseg.Consts.KANA_PROLONGED_SOUND_MARKS
textseg.Consts.MASU_MARK

By default, several hiragana, katakana and characters corresponding to kana are treated as non-starters (NS or CJ). When the lbc attribute is updated with the following items, these characters are treated as normal ideographic characters (ID).

{ KANA_NONSTARTERS: lbcID }

All of the characters below.

{ IDEOGRAPHIC_ITERATION_MARKS: lbcID }

Ideographic iteration marks. U+3005 々 IDEOGRAPHIC ITERATION MARK, U+303B 〻 VERTICAL IDEOGRAPHIC ITERATION MARK, U+309D ゝ HIRAGANA ITERATION MARK, U+309E ゞ HIRAGANA VOICED ITERATION MARK, U+30FD ヽ KATAKANA ITERATION MARK and U+30FE ヾ KATAKANA VOICED ITERATION MARK.

Note

Some of them are neither hiragana nor katakana.

{ KANA_SMALL_LETTERS: lbcID }

Hiragana or katakana small letters.

Hiragana small letters: U+3041 ぁ “A”, U+3043 ぃ “I”, U+3045 ぅ “U”, U+3047 ぇ “E”, U+3049 ぉ “O”, U+3063 っ “TU”, U+3083 ゃ “YA”, U+3085 ゅ “YU”, U+3087 ょ “YO”, U+308E ゎ “WA”, U+3095 ゕ “KA”, U+3096 ゖ “KE”.

Katakana small letters: U+30A1 ァ “A”, U+30A3 ィ “I”, U+30A5 ゥ “U”, U+30A7 ェ “E”, U+30A9 ォ “O”, U+30C3 ッ “TU”, U+30E3 ャ “YA”, U+30E5 ュ “YU”, U+30E7 ョ “YO”, U+30EE ヮ “WA”, U+30F5 ヵ “KA”, U+30F6 ヶ “KE”.

Katakana phonetic extensions: U+31F0 ㇰ “KU” - U+31FF ㇿ “RO”.

Halfwidth katakana small letters: U+FF67 ァ “A” - U+FF6F ッ “TU”.

Note

These letters and the prolonged sound marks below may be treated either as non-starters or as normal ideographic characters. See [JISX4051] 6.1.1, [JLREQ] 3.1.7 or [UAX14].

Note

U+3095 ゕ “KA”, U+3096 ゖ “KE”, U+30F5 ヵ “KA” and U+30F6 ヶ “KE” are considered to be neither hiragana nor katakana.

{ KANA_PROLONGED_SOUND_MARKS: lbcID }

Hiragana or katakana prolonged sound marks. U+30FC ー KATAKANA-HIRAGANA PROLONGED SOUND MARK and U+FF70 ー HALFWIDTH KATAKANA-HIRAGANA PROLONGED SOUND MARK.

{ MASU_MARK: lbcID }

U+303C 〼 MASU MARK.

Note

Although this character is not kana, it is usually regarded as an abbreviation for the sequence of hiragana ま す or katakana マ ス (MA and SU).

Note

This character is classified as non-starter (NS) by [UAX14] and as Class 13 (corresponding to ID) by [JISX4051] and [JLREQ].

Ambiguous Quotation Marks

textseg.Consts.BACKWARD_QUOTES
textseg.Consts.FORWARD_QUOTES
textseg.Consts.BACKWARD_GUILLEMETS
textseg.Consts.FORWARD_GUILLEMETS

By default, some punctuation marks are treated as ambiguous quotation marks (QU).

{ BACKWARD_QUOTES: lbcOP, FORWARD_QUOTES: lbcCL }

Some languages (Dutch, English, Italian, Portuguese, Spanish, Turkish and most East Asian languages) use rotated-9-style punctuation marks (‘ “) as opening and 9-style punctuation marks (’ ”) as closing quotation marks.

{ FORWARD_QUOTES: lbcOP, BACKWARD_QUOTES: lbcCL }

Some others (Czech, German and Slovak) use 9-style punctuation marks (’ ”) as opening and rotated-9-style punctuation marks (‘ “) as closing quotation marks.

{ BACKWARD_GUILLEMETS: lbcOP, FORWARD_GUILLEMETS: lbcCL }

French, Greek, Russian etc. use left-pointing guillemets (« ‹) as opening and right-pointing guillemets (» ›) as closing quotation marks.

{ FORWARD_GUILLEMETS: lbcOP, BACKWARD_GUILLEMETS: lbcCL }

German and Slovak use right-pointing guillemets (» ›) as opening and left-pointing guillemets (« ‹) as closing quotation marks.

Danish, Finnish, Norwegian and Swedish use 9-style or right-pointing punctuation marks (’ ” » ›) as both opening and closing quotation marks.

East_Asian_Width Properties

textseg.Consts.AMBIGUOUS_ALPHABETICS
textseg.Consts.AMBIGUOUS_CYRILLIC
textseg.Consts.AMBIGUOUS_GREEK
textseg.Consts.AMBIGUOUS_LATIN

Some particular letters of the Latin, Greek and Cyrillic scripts have the ambiguous (A) East_Asian_Width property. Thus, these characters are treated as wide when the eastasian_context attribute is true. When the eaw attribute is updated with the following values, those characters are always treated as narrow.

{ AMBIGUOUS_ALPHABETICS: eawN }

Treat all of the characters below as East_Asian_Width neutral (N).

{ AMBIGUOUS_CYRILLIC: eawN }
{ AMBIGUOUS_GREEK: eawN }
{ AMBIGUOUS_LATIN: eawN }

Treat letters of the Cyrillic, Greek and Latin scripts that have ambiguous (A) width as neutral (N).

textseg.Consts.QUESTIONABLE_NARROW_SIGNS

On the other hand, although several characters have occasionally been rendered as wide by a number of implementations for East Asian character sets, they are given the narrow (Na) East_Asian_Width property simply because they have fullwidth (F) compatibility characters. When the eaw attribute is updated with the following value, those characters are treated as ambiguous, that is, wide when the eastasian_context attribute is true.

{ QUESTIONABLE_NARROW_SIGNS: eawA }
U+00A2 ¢ CENT SIGN, U+00A3 £ POUND SIGN, U+00A5 ¥ YEN SIGN (or yuan sign), U+00A6 ¦ BROKEN BAR, U+00AC ¬ NOT SIGN, U+00AF ¯ MACRON.
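
The default East_Asian_Width values discussed in this section can be inspected with Python's standard unicodedata module, independently of textseg. In the sketch below the three sample letters (æ, α, Я) are illustrative picks rather than the full contents of the textseg constants, and the values reported depend on the Unicode data bundled with the interpreter.

```python
import unicodedata

# Illustrative samples only; the textseg constants cover more characters.
ambiguous_letters = ["\u00E6", "\u03B1", "\u042F"]   # Latin, Greek, Cyrillic
# The signs listed for QUESTIONABLE_NARROW_SIGNS ...
narrow_signs = ["\u00A2", "\u00A3", "\u00A5", "\u00A6", "\u00AC", "\u00AF"]
# ... and their fullwidth compatibility counterparts, in the same order.
fullwidth_signs = ["\uFFE0", "\uFFE1", "\uFFE5", "\uFFE4", "\uFFE2", "\uFFE3"]

def widths(chars):
    """Map each character to its East_Asian_Width property value."""
    return {c: unicodedata.east_asian_width(c) for c in chars}

for label, chars in [("ambiguous letters", ambiguous_letters),
                     ("narrow signs", narrow_signs),
                     ("fullwidth counterparts", fullwidth_signs)]:
    for c, w in widths(chars).items():
        print("%-22s U+%04X %-2s %s" % (label, ord(c), w, unicodedata.name(c)))
```

The output lists the property value of each sample next to its character name, which makes it easy to see which characters a tailoring of the eaw attribute would affect.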