.. pytextseg documentation

=============
Customization
=============

.. _`Formatting Lines`:

Formatting Lines
================

If you specify a callable object as the format property of a LineBreak object,
it should accept three arguments::

    callable_object(self, context, string) -> text_or_None

*self* is the LineBreak object,
*context* is a string indicating the context in which the callable was called,
and *string* is the fragment of the Unicode string preceding or following the
breaking position.

  +-----------+----------------------+----------------------------------+
  | *context* | When                 | Value of *string*                |
  +===========+======================+==================================+
  | ``"sot"`` | Beginning of text    | Fragment of first line           |
  +-----------+----------------------+----------------------------------+
  | ``"sop"`` | After mandatory break| Fragment of next line            |
  +-----------+----------------------+----------------------------------+
  | ``"sol"`` | After arbitrary break| Fragment on sequel of line       |
  +-----------+----------------------+----------------------------------+
  | ``""``    | Just before any      | Complete line without trailing   |
  |           | breaks               | SPACEs                           |
  +-----------+----------------------+----------------------------------+
  | ``"eol"`` | Arbitrary break      | SPACEs leading breaking position |
  +-----------+----------------------+----------------------------------+
  | ``"eop"`` | Mandatory break      | Newline and its leading SPACEs   |
  +-----------+----------------------+----------------------------------+
  | ``"eot"`` | End of text          | SPACEs (and newline) at end of   |
  |           |                      | text                             |
  +-----------+----------------------+----------------------------------+

The callable object should return the modified text fragment, or ``None``
to indicate that no modification occurred.
Note that a modification in the ``"sot"``, ``"sop"`` or ``"sol"`` context
may affect the choice of subsequent breaking positions, while a modification
in the other contexts will not.

.. note::
   String arguments are actually sequences of grapheme clusters.
   See documentation of GCStr class.

For example, the following code folds lines and removes trailing spaces::

    from textseg import LineBreak
    
    def format(self, event, string):
        if event.startswith('eo'):
            return "\n"
        return None
    
    lb = LineBreak(format = format)
    output = ''.join([str(s) for s in lb.wrap(text)])
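
As a further illustration, the sketch below also indents continuation lines.
The name ``indent_format`` and the four-space indent are assumptions made for
this example only, and it assumes that a plain ``str`` is an acceptable return
value::

    def indent_format(self, context, string):
        # "sol": the fragment that starts a line after an arbitrary break.
        if context == "sol":
            return "    " + str(string)
        # "eol", "eop", "eot": replace trailing spaces with a single newline.
        if context.startswith("eo"):
            return "\n"
        # Any other context: leave the fragment unmodified.
        return None

    # Assumed wiring, mirroring the example above:
    #     lb = LineBreak(format = indent_format)

Because the indent is added in the ``"sol"`` context, it may affect where the
following breaks fall, as noted above.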

.. _`User-Defined Breaking Behaviors`:

User-Defined Breaking Behaviors
===============================

When a line produced by an arbitrary break is expected to exceed the limit of
either :attr:`charmax<textseg.LineBreak.charmax>`,
:attr:`width<textseg.LineBreak.width>`
or :attr:`minwidth<textseg.LineBreak.minwidth>`, an **urgent break** may be
performed on the following string.
If you specify a callable object as the value of the
:attr:`urgent<textseg.LineBreak.urgent>` attribute,
it should accept two arguments::

    callable_object(self, string) -> [text, ...]

*self* is a :class:`LineBreak<textseg.LineBreak>` object and *string* 
is a Unicode string to be broken.

The callable object should return a list of the pieces into which *string*
is broken.

.. note::
   String argument is actually a sequence of grapheme clusters.
   See :class:`GCStr<textseg.GCStr>` class.

For example, the following code inserts hyphens into the names of several
chemical substances (such as Titin) so that they may be folded::

    # Example not yet written
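
A minimal sketch of such a callable might look like the following.
The name ``hyphenize`` and the rule of breaking after each ``yl`` syllable
are assumptions made for illustration. It also assumes that the callable may
convert its argument with ``str()`` and return plain strings::

    import re

    def hyphenize(self, string):
        # Split after every "...yl" that is followed by another word
        # character.  The capturing group keeps the matched pieces, which
        # land at the odd indices of the resulting list.
        parts = re.split(r'(\w+?yl(?=\w))', str(string))
        pieces = []
        for i, part in enumerate(parts):
            if part == '':
                continue
            # Only the matched "...yl" pieces receive a hyphen.
            pieces.append(part + '-' if i % 2 else part)
        return pieces

    print(hyphenize(None, 'methionylthreonylglutamine'))
    # ['methionyl-', 'threonyl-', 'glutamine']

    # Assumed wiring:
    #     lb = LineBreak(urgent = hyphenize)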

If you specify a ``(regular expression, callable object[, flags])`` tuple as
an item of the :attr:`prep<textseg.LineBreak.prep>` option, the callable
object should accept two arguments::

    callable_object(self, string) -> [text, ...]

*self* is a :class:`LineBreak<textseg.LineBreak>` object and
*string* is a Unicode string matched by *regular expression*.

The callable object should return a list of the pieces into which *string*
is broken.

For example, the following code breaks HTTP URLs according to the [CMOS]_
rule::

    import re
    from textseg import fill

    # Match HTTP URLs, optionally prefixed by "url:".
    urire = re.compile(r'\b(?:url:)?http://[\x21-\x7E]+',
                       re.I | re.U)

    def breakURI(self, s):
        r = ''      # piece being accumulated
        ret = []    # completed pieces
        b = ''      # previous character
        for c in s:
            if b == '':
                r = c
            elif r.lower().endswith('url:'):
                ret.append(r)
                r = c
            elif b in '/' and c not in '/' or \
                 b not in '-.' and c in '-~.,_?#%=&' or \
                 b in '=&' or c in '=&':
                if r != '':
                    ret.append(r)
                r = c
            else:
                r += c
            b = c
        if r != '':
            ret.append(r)
        return ret

    output = fill(text, prep = [(urire, breakURI)])

.. versionchanged:: 0.1.1
   prep attribute accepts tuples with third item *flags*. 

.. _`Preserving State`:

Preserving State
----------------

A :class:`LineBreak<textseg.LineBreak>` object can behave as a dictionary.
Any items stored in it are preserved throughout its lifetime.

For example, the following code separates paragraphs with empty lines::

    # Example not yet written
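
One possible sketch of such a formatter is shown below.
The name ``para_format`` and the key ``"line"`` are assumptions made for this
example. It relies on the dictionary behavior described above to remember the
last complete line between calls::

    def para_format(self, context, string):
        if context in ("sot", "sop"):
            # A new paragraph starts: forget the remembered line.
            self["line"] = ""
        elif context == "":
            # Remember the complete line just before the break.
            self["line"] = str(string)
        elif context == "eol":
            return "\n"
        elif context == "eop":
            # Add an empty line only after a non-empty line, so that
            # consecutive mandatory breaks are not doubled.
            return "\n\n" if self["line"] else "\n"
        elif context == "eot":
            return "\n"
        return None

    # Assumed wiring:
    #     lb = LineBreak(format = para_format)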

.. _`Calculating String Size`:

Calculating String Size
=======================

If you specify a callable object as the value of the
:attr:`sizing<textseg.LineBreak.sizing>` property,
it will be called with five arguments::

    callable_object(self, length, pre, spc, string) -> number_of_columns

*self* is a :class:`LineBreak<textseg.LineBreak>` object,
*length* is the size of the preceding string,
*pre* is the preceding Unicode string, *spc* is additional SPACEs and
*string* is the Unicode string to be processed.

The callable object should return the calculated number of columns of
``pre + spc + string``.
The number of columns need not be an integer: its unit may be freely chosen,
but it should be the same as that of the
:attr:`minwidth<textseg.LineBreak.minwidth>` and
:attr:`width<textseg.LineBreak.width>`
properties.

.. note::
   String arguments are actually sequences of grapheme clusters.
   See :class:`GCStr<textseg.GCStr>` class.

For example, the following code processes lines with tab stops every eight
columns::

    from textseg import fill
    from textseg.Consts import lbcSP
    
    def sizing(self, cols, pre, spc, string):
        spcstr = spc + string
        i = 0
        for c in spcstr:
            if c.lbc != lbcSP:
                cols += spcstr[i:].cols
                break
            if c == "\t":
                cols += 8 - (cols % 8)
            else:
                cols += c.cols
            i = i + 1
        return cols
    
    output = fill(text, lbc = {ord("\t"): lbcSP}, sizing = sizing,
                  expand_tabs = False)
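
The tab-stop arithmetic used above can be checked in isolation.
The helper name ``next_tab_stop`` is hypothetical and is not part of
textseg::

    def next_tab_stop(cols, tabsize=8):
        # Advance to the next multiple of tabsize strictly greater than cols.
        return cols + tabsize - (cols % tabsize)

    print(next_tab_stop(0))   # 8
    print(next_tab_stop(3))   # 8
    print(next_tab_stop(8))   # 16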

.. _`Tailoring Character Properties`:

Tailoring Character Properties
==============================

.. currentmodule:: textseg.Consts

Character properties may be tailored by :attr:`lbc<textseg.LineBreak.lbc>` and 
:attr:`eaw<textseg.LineBreak.eaw>` options.
Some constants are defined for convenience of tailoring.

Line Breaking Properties
------------------------

Non-starters of Kana-like Characters
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. data:: KANA_NONSTARTERS
.. data:: IDEOGRAPHIC_ITERATION_MARKS
.. data:: KANA_SMALL_LETTERS
.. data:: KANA_PROLONGED_SOUND_MARKS
.. data:: MASU_MARK

By default, several hiragana, katakana and characters corresponding to kana
are treated as :term:`non-starter`\ s (NS or CJ).
When the :attr:`lbc<textseg.LineBreak.lbc>` attribute is updated with the
following items,
these characters are treated as normal :term:`ideographic character`\ s (ID).

``{ KANA_NONSTARTERS: lbcID }``
    All of the characters below.

``{ IDEOGRAPHIC_ITERATION_MARKS: lbcID }``
    Ideographic iteration marks.
    |udl3005|, |udl303B|, |udl309D|, |udl309E|, |udl30FD| and |udl30FE|.

    .. note:: Some of them are neither hiragana nor katakana.

``{ KANA_SMALL_LETTERS: lbcID }``

    Hiragana or katakana small letters.

    Hiragana small letters:
    |uds3041|, |uds3043|, |uds3045|, |uds3047|, |uds3049|, |uds3063|,
    |uds3083|, |uds3085|, |uds3087|, |uds308E|, 
    |uds3095|, |uds3096|.

    Katakana small letters:
    |uds30A1|, |uds30A3|, |uds30A5|, |uds30A7|, |uds30A9|, |uds30C3|,
    |uds30E3|, |uds30E5|, |uds30E7|, |uds30EE|, 
    |uds30F5|, |uds30F6|.

    Katakana phonetic extensions:
    |uds31F0| - |uds31FF|.

    Halfwidth katakana small letters:
    |udsFF67| - |udsFF6F|.

    .. note:: These letters and prolonged sound marks below are optionally
       treated either as non-starter or as normal ideographic.
       See [JISX4051]_ 6.1.1, [JLREQ]_ 3.1.7 or [UAX14]_.

    .. note:: |uds3095|, |uds3096|, |uds30F5| and |uds30F6| are considered 
       to be neither hiragana nor katakana.

``{ KANA_PROLONGED_SOUND_MARKS: lbcID }``

    Hiragana or katakana prolonged sound marks.
    |udl30FC| and |udlFF70|.

``{ MASU_MARK: lbcID }``
    |udl303C|.

    .. note:: Although this character is not kana, it is usually regarded as
       an abbreviation of the hiragana sequence |uc307E| |uc3059| or the
       katakana sequence |uc30DE| |uc30B9|, MA and SU.

    .. note:: This character is classified as non-starter (NS) by [UAX14]_
       and as Class 13 (corresponding to ID) by [JISX4051]_ and [JLREQ]_.

Ambiguous Quotation Marks
^^^^^^^^^^^^^^^^^^^^^^^^^

.. data:: BACKWARD_QUOTES
.. data:: FORWARD_QUOTES
.. data:: BACKWARD_GUILLEMETS
.. data:: FORWARD_GUILLEMETS

By default, some punctuation marks are :term:`ambiguous quotation mark`\ s (QU).

``{ BACKWARD_QUOTES: lbcOP, FORWARD_QUOTES: lbcCL }``
    Some languages (Dutch, English, Italian, Portuguese, Spanish, Turkish and
    most East Asian) use rotated-9-style punctuation marks (|uc2018| |uc201C|)
    as opening and 9-style punctuation marks (|uc2019| |uc201D|) as closing
    quotation marks.

``{ FORWARD_QUOTES: lbcOP, BACKWARD_QUOTES: lbcCL }``
    Some others (Czech, German and Slovak) use 9-style punctuation marks
    (|uc2019| |uc201D|) as opening and rotated-9-style punctuation marks
    (|uc2018| |uc201C|) as closing quotation marks.

``{ BACKWARD_GUILLEMETS: lbcOP, FORWARD_GUILLEMETS: lbcCL }``
    French, Greek, Russian etc. use left-pointing guillemets (|uc00AB| |uc2039|)
    as opening and right-pointing guillemets (|uc00BB| |uc203A|) as closing
    quotation marks.

``{ FORWARD_GUILLEMETS: lbcOP, BACKWARD_GUILLEMETS: lbcCL }``
    German and Slovak use right-pointing guillemets (|uc00BB| |uc203A|) as
    opening and left-pointing guillemets (|uc00AB| |uc2039|) as closing
    quotation marks.

Danish, Finnish, Norwegian and Swedish use 9-style or right-pointing
punctuation marks (|uc2019| |uc201D| |uc00BB| |uc203A|) as both opening and
closing quotation marks.
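
As a usage sketch, several of the items above may be combined into a single
dictionary and passed as the :attr:`lbc<textseg.LineBreak.lbc>` option.
This fragment assumes that ``lbcID``, ``lbcOP`` and ``lbcCL`` can be imported
from ``textseg.Consts`` in the same way as ``lbcSP``, and that ``text`` holds
the input string::

    from textseg import fill
    from textseg.Consts import (BACKWARD_QUOTES, FORWARD_QUOTES,
                                KANA_SMALL_LETTERS, lbcCL, lbcID, lbcOP)

    lbc = {
        KANA_SMALL_LETTERS: lbcID,  # small kana as normal ideographic
        BACKWARD_QUOTES: lbcOP,     # rotated-9-style quotes open
        FORWARD_QUOTES: lbcCL,      # 9-style quotes close
    }
    output = fill(text, lbc = lbc)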

East_Asian_Width Properties
---------------------------

.. data:: AMBIGUOUS_ALPHABETICS
.. data:: AMBIGUOUS_CYRILLIC
.. data:: AMBIGUOUS_GREEK
.. data:: AMBIGUOUS_LATIN

Some particular letters of the Latin, Greek and Cyrillic scripts have the
ambiguous (A) :term:`East_Asian_Width` property.  Thus, these characters are
treated as wide when the
:attr:`eastasian_context<textseg.LineBreak.eastasian_context>`
attribute is true.
When the :attr:`eaw<textseg.LineBreak.eaw>` attribute is updated with the
following values, those characters are always treated as narrow.

``{ AMBIGUOUS_ALPHABETICS: eawN }``
    Treat all of the characters below as East_Asian_Width neutral (N).

``{ AMBIGUOUS_CYRILLIC: eawN }``

``{ AMBIGUOUS_GREEK: eawN }``

``{ AMBIGUOUS_LATIN: eawN }``
    Treat letters of the Cyrillic, Greek and Latin scripts having ambiguous
    (A) width as neutral (N).

.. data:: QUESTIONABLE_NARROW_SIGNS

On the other hand, although several characters have occasionally been rendered
as wide characters by a number of implementations for East Asian character
sets, they are given the narrow (Na) East_Asian_Width property simply because
they have fullwidth (F) compatibility characters.
When the :attr:`eaw<textseg.LineBreak.eaw>` attribute is updated with the
following values, those characters are treated as ambiguous, that is,
wide when the :attr:`eastasian_context<textseg.LineBreak.eastasian_context>`
attribute is true.

``{ QUESTIONABLE_NARROW_SIGNS: eawA }``
    |udl00A2|, |udl00A3|, |udl00A5| (or yuan sign),
    |udl00A6|, |udl00AC|, |udl00AF|.
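
As a usage sketch, the width tailorings above may be passed together as the
:attr:`eaw<textseg.LineBreak.eaw>` option.
This fragment assumes that ``eawN`` and ``eawA`` can be imported from
``textseg.Consts``, that ``fill()`` accepts the ``eaw`` and
``eastasian_context`` options in the same way as ``lbc`` and ``sizing``,
and that ``text`` holds the input string::

    from textseg import fill
    from textseg.Consts import (AMBIGUOUS_LATIN, QUESTIONABLE_NARROW_SIGNS,
                                eawA, eawN)

    output = fill(text,
                  eaw = {AMBIGUOUS_LATIN: eawN,            # always narrow
                         QUESTIONABLE_NARROW_SIGNS: eawA}, # context-dependent
                  eastasian_context = True)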

.. .....................................
.. .. below are substitution definitions
.. .....................................

.. |udl00A2| unicode:: U +00A2 x20 U+00A2 x20 CENT x20 SIGN
.. |udl00A3| unicode:: U +00A3 x20 U+00A3 x20 POUND x20 SIGN
.. |udl00A5| unicode:: U +00A5 x20 U+00A5 x20 YEN x20 SIGN
.. |udl00A6| unicode:: U +00A6 x20 U+00A6 x20 BROKEN x20 BAR
.. |uc00AB|  unicode:: U+00AB .. LEFT-POINTING DOUBLE ANGLE QUOTATION MARK
.. |udl00AC| unicode:: U +00AC x20 U+00AC x20 NOT x20 SIGN
.. |udl00AF| unicode:: U +00AF x20 U+00AF x20 MACRON
.. |uc00BB|  unicode:: U+00BB .. RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK
.. |uc2018|  unicode:: U+2018 .. LEFT SINGLE QUOTATION MARK
.. |uc2019|  unicode:: U+2019 .. RIGHT SINGLE QUOTATION MARK
.. |uc201C|  unicode:: U+201C .. LEFT DOUBLE QUOTATION MARK
.. |uc201D|  unicode:: U+201D .. RIGHT DOUBLE QUOTATION MARK
.. |uc2039|  unicode:: U+2039 .. SINGLE LEFT-POINTING ANGLE QUOTATION MARK
.. |uc203A|  unicode:: U+203A .. SINGLE RIGHT-POINTING ANGLE QUOTATION MARK
.. |udl3005| unicode:: U +3005 x20 U+3005 x20 IDEOGRAPHIC x20 ITERATION x20 MARK
.. |udl303B| unicode:: U +303B x20 U+303B x20 VERTICAL x20 IDEOGRAPHIC x20 ITERATION x20 MARK
.. |udl303C| unicode:: U +303C x20 U+303C x20 MASU x20 MARK
.. |uds3041| unicode:: U +3041 x20 U+3041 x20 "A"
.. |uds3043| unicode:: U +3043 x20 U+3043 x20 "I"
.. |uds3045| unicode:: U +3045 x20 U+3045 x20 "U"
.. |uds3047| unicode:: U +3047 x20 U+3047 x20 "E"
.. |uds3049| unicode:: U +3049 x20 U+3049 x20 "O"
.. |uc3059|  unicode:: U+3059 .. HIRAGANA LETTER SU
.. |uds3063| unicode:: U +3063 x20 U+3063 x20 "TU"
.. |uc307E|  unicode:: U+307E .. HIRAGANA LETTER MA
.. |uds3083| unicode:: U +3083 x20 U+3083 x20 "YA"
.. |uds3085| unicode:: U +3085 x20 U+3085 x20 "YU"
.. |uds3087| unicode:: U +3087 x20 U+3087 x20 "YO"
.. |uds308E| unicode:: U +308E x20 U+308E x20 "WA"
.. |uds3095| unicode:: U +3095 x20 U+3095 x20 "KA"
.. |uds3096| unicode:: U +3096 x20 U+3096 x20 "KE"
.. |udl309D| unicode:: U +309D x20 U+309D x20 HIRAGANA x20 ITERATION x20 MARK
.. |udl309E| unicode:: U +309E x20 U+309E x20 HIRAGANA x20 VOICED x20 ITERATION x20 MARK
.. |uds30A1| unicode:: U +30A1 x20 U+30A1 x20 "A"
.. |uds30A3| unicode:: U +30A3 x20 U+30A3 x20 "I"
.. |uds30A5| unicode:: U +30A5 x20 U+30A5 x20 "U"
.. |uds30A7| unicode:: U +30A7 x20 U+30A7 x20 "E"
.. |uds30A9| unicode:: U +30A9 x20 U+30A9 x20 "O"
.. |uc30B9|  unicode:: U+30B9 .. KATAKANA LETTER SU
.. |uds30C3| unicode:: U +30C3 x20 U+30C3 x20 "TU"
.. |uc30DE|  unicode:: U+30DE .. KATAKANA LETTER MA
.. |uds30E3| unicode:: U +30E3 x20 U+30E3 x20 "YA"
.. |uds30E5| unicode:: U +30E5 x20 U+30E5 x20 "YU"
.. |uds30E7| unicode:: U +30E7 x20 U+30E7 x20 "YO"
.. |uds30EE| unicode:: U +30EE x20 U+30EE x20 "WA"
.. |uds30F5| unicode:: U +30F5 x20 U+30F5 x20 "KA"
.. |uds30F6| unicode:: U +30F6 x20 U+30F6 x20 "KE"
.. |udl30FC| unicode:: U +30FC x20 U+30FC x20 KATAKANA-HIRAGANA x20 PROLONGED x20 SOUND x20 MARK
.. |udl30FD| unicode:: U +30FD x20 U+30FD x20 KATAKANA x20 ITERATION x20 MARK
.. |udl30FE| unicode:: U +30FE x20 U+30FE x20 KATAKANA x20 VOICED x20 ITERATION x20 MARK
.. |uds31F0| unicode:: U +31F0 x20 U+31F0 x20 "KU"
.. |uds31FF| unicode:: U +31FF x20 U+31FF x20 "RO"
.. |udsFF67| unicode:: U +FF67 x20 U+FF67 x20 "A"
.. |udsFF6F| unicode:: U +FF6F x20 U+FF6F x20 "TU"
.. |udlFF70| unicode:: U +FF70 x20 U+FF70 x20 HALFWIDTH x20 KATAKANA-HIRAGANA x20 PROLONGED x20 SOUND x20 MARK