.. pytextseg documentation

=============
Customization
=============

.. _`Formatting Lines`:

Formatting Lines
================

If you specify a callable object as the format property of a LineBreak
object, it should accept three arguments::

    callable_object(self, context, string) -> text_or_None

*self* is a LineBreak object, *context* is a string identifying the context
in which the callable was invoked, and *string* is a fragment of the Unicode
string immediately before or after the breaking position.

+-----------+-----------------------+----------------------------------+
| *context* | When                  | Value of *string*                |
+===========+=======================+==================================+
| ``"sot"`` | Beginning of text     | Fragment of first line           |
+-----------+-----------------------+----------------------------------+
| ``"sop"`` | After mandatory break | Fragment of next line            |
+-----------+-----------------------+----------------------------------+
| ``"sol"`` | After arbitrary break | Fragment continuing the line     |
+-----------+-----------------------+----------------------------------+
| ``""``    | Just before any break | Complete line without trailing   |
|           |                       | SPACEs                           |
+-----------+-----------------------+----------------------------------+
| ``"eol"`` | Arbitrary break       | SPACEs preceding the breaking    |
|           |                       | position                         |
+-----------+-----------------------+----------------------------------+
| ``"eop"`` | Mandatory break       | Newline and its preceding SPACEs |
+-----------+-----------------------+----------------------------------+
| ``"eot"`` | End of text           | SPACEs (and newline) at end of   |
|           |                       | text                             |
+-----------+-----------------------+----------------------------------+

The callable object should return the modified text fragment, or it may
return ``None`` to indicate that no modification occurred.  Note that a
modification in the context of ``"sot"``, ``"sop"`` or ``"sol"`` may affect
the choice of subsequent breaking positions, while a modification in any
other context will not.

.. note::
   String arguments are actually sequences of grapheme clusters.
   See documentation of GCStr class.

For example, the following code folds lines and removes trailing spaces::

    from textseg import LineBreak

    def format(self, event, string):
        # In the "eol", "eop" and "eot" contexts, string holds the trailing
        # spaces (and newline); replace them with a single newline.
        if event.startswith('eo'):
            return "\n"
        # Leave every other fragment unchanged.
        return None

    lb = LineBreak(format = format)
    # text is the input string to be folded.
    output = ''.join([str(s) for s in lb.wrap(text)])

.. _`User-Defined Breaking Behaviors`:

User-Defined Breaking Behaviors
===============================

When a line produced by an arbitrary break is expected to exceed the limit
set by :attr:`charmax`, :attr:`width` or :attr:`minwidth`, an
**urgent break** may be performed on the following string.
If you specify a callable object as the value of the :attr:`urgent`
attribute, it should accept two arguments::

    callable_object(self, string) -> [text, ...]

*self* is a :class:`LineBreak` object and *string* is the Unicode string to
be broken.

The callable object should return a list of the broken pieces of *string*.

.. note::
   String argument is actually a sequence of grapheme clusters.
   See :class:`GCStr` class.

For example, the following code inserts hyphens into the names of several
chemical substances (such as Titin) so that they may be folded::

    # Example not yet written

If you specify a ``(regular expression, callable object[, flags])`` tuple as
an item of the :attr:`prep` option, the callable object should accept two
arguments::

    callable_object(self, string) -> [text, ...]

*self* is a :class:`LineBreak` object and *string* is the Unicode string
matched by *regular expression*.

The callable object should return a list of the broken pieces of *string*.

For example, the following code breaks HTTP URLs according to the [CMOS]_
rule::

    import re
    from textseg import fill

    urire = re.compile(r'\b(?:url:)?http://[\x21-\x7E]+', re.I | re.U)

    def breakURI(self, s):
        r = ''      # fragment being accumulated
        ret = []    # completed fragments
        b = ''      # previous character
        for c in s:
            if b == '':
                r = c
            elif r.lower().endswith('url:'):
                # Break after a leading "url:" prefix.
                ret.append(r)
                r = c
            elif (b in '/' and c not in '/') or \
                 (b not in '-.' and c in '-~.,_?#%=&') or \
                 b in '=&' or c in '=&':
                # Break between a slash and a non-slash, before punctuation
                # not preceded by "-" or ".", and around "=" and "&".
                if r != '':
                    ret.append(r)
                r = c
            else:
                r += c
            b = c
        if r != '':
            ret.append(r)
        return ret

    output = fill(text, prep = [(urire, breakURI)])

.. versionchanged:: 0.1.1
   prep attribute accepts tuples with third item *flags*.

.. _`Preserving State`:

Preserving State
----------------

A :class:`LineBreak` object can behave as a dictionary.
Any items stored in it are preserved throughout its lifetime.

For example, the following code separates paragraphs with empty lines::

    # Example not yet written

.. _`Calculating String Size`:

Calculating String Size
=======================

If you specify a callable object as the value of the :attr:`sizing`
property, it will be called with five arguments::

    callable_object(self, length, pre, spc, string) -> number_of_columns

*self* is a :class:`LineBreak` object, *length* is the size of the preceding
string, *pre* is the preceding Unicode string, *spc* is the additional SPACEs
and *string* is the Unicode string to be processed.

The callable object should return the calculated number of columns of
``pre + spc + string``.
The number of columns need not be an integer: the unit may be chosen freely,
but it should be the same as that of the :attr:`minwidth` and :attr:`width`
properties.

.. note::
   String arguments are actually sequences of grapheme clusters.
   See :class:`GCStr` class.

For example, the following code processes lines with tab stops at every
eight columns::

    from textseg import fill
    from textseg.Consts import lbcSP

    def sizing(self, cols, pre, spc, string):
        spcstr = spc + string
        for i, c in enumerate(spcstr):
            if c.lbc != lbcSP:
                # First non-space cluster: add the width of the rest and stop.
                cols += spcstr[i:].cols
                break
            if c == "\t":
                # Advance to the next tab stop.
                cols += 8 - (cols % 8)
            else:
                cols += c.cols
        return cols

    output = fill(text, lbc = {ord("\t"): lbcSP}, sizing = sizing,
                  expand_tabs = False)

.. _`Tailoring Character Properties`:

Tailoring Character Properties
==============================

.. currentmodule:: textseg.Consts

Character properties may be tailored by :attr:`lbc` and :attr:`eaw` options.
Some constants are defined for convenience of tailoring.

Line Breaking Properties
------------------------

Non-starters of Kana-like Characters
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. data:: KANA_NONSTARTERS
.. data:: IDEOGRAPHIC_ITERATION_MARKS
.. data:: KANA_SMALL_LETTERS
.. data:: KANA_PROLONGED_SOUND_MARKS
.. data:: MASU_MARK

By default, several hiragana, katakana and characters corresponding to kana
are treated as :term:`non-starter`\ s (NS or CJ).
When the :attr:`lbc` attribute is updated with the following items, these
characters are treated as normal :term:`ideographic character`\ s (ID).

``{ KANA_NONSTARTERS: lbcID }``
    All of the characters below.

``{ IDEOGRAPHIC_ITERATION_MARKS: lbcID }``
    Ideographic iteration marks.
    |udl3005|, |udl303B|, |udl309D|, |udl309E|, |udl30FD| and |udl30FE|.

    .. note::
       Some of them are neither hiragana nor katakana.

``{ KANA_SMALL_LETTERS: lbcID }``
    Hiragana or katakana small letters.

    Hiragana small letters:
    |uds3041|, |uds3043|, |uds3045|, |uds3047|, |uds3049|, |uds3063|,
    |uds3083|, |uds3085|, |uds3087|, |uds308E|, |uds3095|, |uds3096|.

    Katakana small letters:
    |uds30A1|, |uds30A3|, |uds30A5|, |uds30A7|, |uds30A9|, |uds30C3|,
    |uds30E3|, |uds30E5|, |uds30E7|, |uds30EE|, |uds30F5|, |uds30F6|.

    Katakana phonetic extensions:
    |uds31F0| - |uds31FF|.

    Halfwidth katakana small letters:
    |udsFF67| - |udsFF6F|.

    .. note::
       These letters and the prolonged sound marks below are optionally
       treated either as non-starters or as normal ideographic characters.
       See [JISX4051]_ 6.1.1, [JLREQ]_ 3.1.7 or [UAX14]_.

    .. note::
       |uds3095|, |uds3096|, |uds30F5| and |uds30F6| are considered to be
       neither hiragana nor katakana.

``{ KANA_PROLONGED_SOUND_MARKS: lbcID }``
    Hiragana or katakana prolonged sound marks.
    |udl30FC| and |udlFF70|.

``{ MASU_MARK: lbcID }``
    |udl303C|.

    .. note::
       Although this character is not kana, it is usually regarded as an
       abbreviation of the sequence of hiragana |uc307E| |uc3059| or
       katakana |uc30DE| |uc30B9|, MA and SU.

    .. note::
       This character is classified as non-starter (NS) by [UAX14]_ and as
       Class 13 (corresponding to ID) by [JISX4051]_ and [JLREQ]_.

Ambiguous Quotation Marks
^^^^^^^^^^^^^^^^^^^^^^^^^

.. data:: BACKWARD_QUOTES
.. data:: FORWARD_QUOTES
.. data:: BACKWARD_GUILLEMETS
.. data:: FORWARD_GUILLEMETS

By default, some punctuation marks are
:term:`ambiguous quotation mark`\ s (QU).

``{ BACKWARD_QUOTES: lbcOP, FORWARD_QUOTES: lbcCL }``
    Some languages (Dutch, English, Italian, Portuguese, Spanish, Turkish and
    most East Asian) use rotated-9-style punctuation marks
    (|uc2018| |uc201C|) as opening and 9-style punctuation marks
    (|uc2019| |uc201D|) as closing quotation marks.

``{ FORWARD_QUOTES: lbcOP, BACKWARD_QUOTES: lbcCL }``
    Some others (Czech, German and Slovak) use 9-style punctuation marks
    (|uc2019| |uc201D|) as opening and rotated-9-style punctuation marks
    (|uc2018| |uc201C|) as closing quotation marks.

``{ BACKWARD_GUILLEMETS: lbcOP, FORWARD_GUILLEMETS: lbcCL }``
    French, Greek, Russian etc. use left-pointing guillemets
    (|uc00AB| |uc2039|) as opening and right-pointing guillemets
    (|uc00BB| |uc203A|) as closing quotation marks.

``{ FORWARD_GUILLEMETS: lbcOP, BACKWARD_GUILLEMETS: lbcCL }``
    German and Slovak use right-pointing guillemets (|uc00BB| |uc203A|) as
    opening and left-pointing guillemets (|uc00AB| |uc2039|) as closing
    quotation marks.

Danish, Finnish, Norwegian and Swedish use 9-style or right-pointing
punctuation marks (|uc2019| |uc201D| |uc00BB| |uc203A|) as both opening and
closing quotation marks.

East_Asian_Width Properties
---------------------------

.. data:: AMBIGUOUS_ALPHABETICS
.. data:: AMBIGUOUS_CYRILLIC
.. data:: AMBIGUOUS_GREEK
.. data:: AMBIGUOUS_LATIN

Some letters of the Latin, Greek and Cyrillic scripts have the ambiguous (A)
:term:`East_Asian_Width` property.  Thus, these characters are treated as
wide when the :attr:`eastasian_context` attribute is true.
When the :attr:`eaw` attribute is updated with the following values, those
characters are always treated as narrow.


``{ AMBIGUOUS_ALPHABETICS: eawN }``
    Treat all of the characters below as East_Asian_Width neutral (N).

``{ AMBIGUOUS_CYRILLIC: eawN }``

``{ AMBIGUOUS_GREEK: eawN }``

``{ AMBIGUOUS_LATIN: eawN }``
    Treat letters of the Cyrillic, Greek and Latin scripts that have
    ambiguous (A) width as neutral (N).

.. data:: QUESTIONABLE_NARROW_SIGNS

On the other hand, several characters have occasionally been rendered as
wide by a number of implementations for East Asian character sets, yet they
are given the narrow (Na) East_Asian_Width property merely because they have
fullwidth (F) compatibility characters.
When the :attr:`eaw` attribute is updated with the following values, those
characters are treated as ambiguous, that is, wide when the
:attr:`eastasian_context` attribute is true.

``{ QUESTIONABLE_NARROW_SIGNS: eawA }``
    |udl00A2|, |udl00A3|, |udl00A5| (or yuan sign), |udl00A6|, |udl00AC|,
    |udl00AF|.

.. .....................................
.. .. below are substitution definitions
.. .....................................

.. |udl00A2| unicode:: U +00A2 x20 U+00A2 x20 CENT x20 SIGN
.. |udl00A3| unicode:: U +00A3 x20 U+00A3 x20 POUND x20 SIGN
.. |udl00A5| unicode:: U +00A5 x20 U+00A5 x20 YEN x20 SIGN
.. |udl00A6| unicode:: U +00A6 x20 U+00A6 x20 BROKEN x20 BAR
.. |uc00AB| unicode:: U+00AB .. LEFT-POINTING DOUBLE ANGLE QUOTATION MARK
.. |udl00AC| unicode:: U +00AC x20 U+00AC x20 NOT x20 SIGN
.. |udl00AF| unicode:: U +00AF x20 U+00AF x20 MACRON
.. |uc00BB| unicode:: U+00BB .. RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK
.. |uc2018| unicode:: U+2018 .. LEFT SINGLE QUOTATION MARK
.. |uc2019| unicode:: U+2019 .. RIGHT SINGLE QUOTATION MARK
.. |uc201C| unicode:: U+201C .. LEFT DOUBLE QUOTATION MARK
.. |uc201D| unicode:: U+201D .. RIGHT DOUBLE QUOTATION MARK
.. |uc2039| unicode:: U+2039 .. SINGLE LEFT-POINTING ANGLE QUOTATION MARK
.. |uc203A| unicode:: U+203A .. SINGLE RIGHT-POINTING ANGLE QUOTATION MARK
.. |udl3005| unicode:: U +3005 x20 U+3005 x20 IDEOGRAPHIC x20 ITERATION x20 MARK
.. |udl303B| unicode:: U +303B x20 U+303B x20 VERTICAL x20 IDEOGRAPHIC x20 ITERATION x20 MARK
.. |udl303C| unicode:: U +303C x20 U+303C x20 MASU x20 MARK
.. |uds3041| unicode:: U +3041 x20 U+3041 x20 "A"
.. |uds3043| unicode:: U +3043 x20 U+3043 x20 "I"
.. |uds3045| unicode:: U +3045 x20 U+3045 x20 "U"
.. |uds3047| unicode:: U +3047 x20 U+3047 x20 "E"
.. |uds3049| unicode:: U +3049 x20 U+3049 x20 "O"
.. |uc3059| unicode:: U+3059 .. HIRAGANA LETTER SU
.. |uds3063| unicode:: U +3063 x20 U+3063 x20 "TU"
.. |uc307E| unicode:: U+307E .. HIRAGANA LETTER MA
.. |uds3083| unicode:: U +3083 x20 U+3083 x20 "YA"
.. |uds3085| unicode:: U +3085 x20 U+3085 x20 "YU"
.. |uds3087| unicode:: U +3087 x20 U+3087 x20 "YO"
.. |uds308E| unicode:: U +308E x20 U+308E x20 "WA"
.. |uds3095| unicode:: U +3095 x20 U+3095 x20 "KA"
.. |uds3096| unicode:: U +3096 x20 U+3096 x20 "KE"
.. |udl309D| unicode:: U +309D x20 U+309D x20 HIRAGANA x20 ITERATION x20 MARK
.. |udl309E| unicode:: U +309E x20 U+309E x20 HIRAGANA x20 VOICED x20 ITERATION x20 MARK
.. |uds30A1| unicode:: U +30A1 x20 U+30A1 x20 "A"
.. |uds30A3| unicode:: U +30A3 x20 U+30A3 x20 "I"
.. |uds30A5| unicode:: U +30A5 x20 U+30A5 x20 "U"
.. |uds30A7| unicode:: U +30A7 x20 U+30A7 x20 "E"
.. |uds30A9| unicode:: U +30A9 x20 U+30A9 x20 "O"
.. |uc30B9| unicode:: U+30B9 .. KATAKANA LETTER SU
.. |uds30C3| unicode:: U +30C3 x20 U+30C3 x20 "TU"
.. |uc30DE| unicode:: U+30DE .. KATAKANA LETTER MA
.. |uds30E3| unicode:: U +30E3 x20 U+30E3 x20 "YA"
.. |uds30E5| unicode:: U +30E5 x20 U+30E5 x20 "YU"
.. |uds30E7| unicode:: U +30E7 x20 U+30E7 x20 "YO"
.. |uds30EE| unicode:: U +30EE x20 U+30EE x20 "WA"
.. |uds30F5| unicode:: U +30F5 x20 U+30F5 x20 "KA"
.. |uds30F6| unicode:: U +30F6 x20 U+30F6 x20 "KE"
.. |udl30FC| unicode:: U +30FC x20 U+30FC x20 KATAKANA-HIRAGANA x20 PROLONGED x20 SOUND x20 MARK
.. |udl30FD| unicode:: U +30FD x20 U+30FD x20 KATAKANA x20 ITERATION x20 MARK
.. |udl30FE| unicode:: U +30FE x20 U+30FE x20 KATAKANA x20 VOICED x20 ITERATION x20 MARK
.. |uds31F0| unicode:: U +31F0 x20 U+31F0 x20 "KU"
.. |uds31FF| unicode:: U +31FF x20 U+31FF x20 "RO"
.. |udsFF67| unicode:: U +FF67 x20 U+FF67 x20 "A"
.. |udsFF6F| unicode:: U +FF6F x20 U+FF6F x20 "TU"
.. |udlFF70| unicode:: U +FF70 x20 U+FF70 x20 HALFWIDTH x20 KATAKANA-HIRAGANA x20 PROLONGED x20 SOUND x20 MARK
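
As a rough illustration of how the East_Asian_Width classes discussed in this
section can affect column counts, the following sketch uses only the standard
``unicodedata`` module.  The ``columns`` function and its ``overrides``
mapping are hypothetical stand-ins for the effect of the :attr:`eaw`
tailoring; they are not part of textseg, and the sketch ignores grapheme
clusters and combining marks.

```python
import unicodedata

# Classes that always occupy two columns in this simplified model.
WIDE_CLASSES = {"W", "F"}


def columns(text, eastasian_context=False, overrides=None):
    """Count columns of text (hypothetical helper, not the textseg API).

    Ambiguous (A) characters count as wide only when eastasian_context is
    true.  overrides maps a single character to a replacement
    East_Asian_Width class, imitating a tailoring table.
    """
    overrides = overrides or {}
    total = 0
    for ch in text:
        # Use the tailored class if present, else the Unicode default.
        eaw = overrides.get(ch, unicodedata.east_asian_width(ch))
        if eaw in WIDE_CLASSES or (eaw == "A" and eastasian_context):
            total += 2
        else:
            total += 1
    return total


# Greek letters are ambiguous (A): narrow by default, wide in East Asian
# context unless they are tailored to neutral (N).
greek_as_neutral = {ch: "N" for ch in "αβγ"}
print(columns("αβγ"))
print(columns("αβγ", eastasian_context=True))
print(columns("αβγ", eastasian_context=True, overrides=greek_as_neutral))

# The yen sign is narrow (Na) by default; tailoring it to ambiguous (A)
# makes it wide in East Asian context.
print(columns("¥100", eastasian_context=True))
print(columns("¥100", eastasian_context=True, overrides={"¥": "A"}))
```

In this model a tailoring entry only changes the class that is looked up for
a character; the decision whether an ambiguous character is wide still
depends on the context flag, which mirrors the behavior described above for
:attr:`eastasian_context`.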