Package Contents

textseg module

The pytextseg package provides functions for wrapping plain text: fill() and wrap() are Unicode-aware alternatives to the functions of the same names in the standard textwrap module; fold() and unfold() are aimed mainly at plain text messages such as e-mail.

It also provides lower-level interfaces for text segmentation: the LineBreak class for line breaking and the GCStr class for grapheme cluster segmentation.

If you are impatient, see “Functions”.

Functions

textseg.fold(string, method='plain', tabsize=8, charset=None, language=None, **kwds)

fold(string[, method, options...]) -> unicode

Fold the lines of string so that each fits in no more than width columns, and return the result.

The following options may be specified for the method argument.

"fixed"
Lines beginning with “>” are not folded. Paragraphs are separated by an empty line.
"flowed"
“Format=Flowed; DelSp=Yes” formatting defined by RFC 3676.
"plain"
Default method. All lines are folded.

Surplus SPACEs and horizontal tabs at the end of each line are removed, newline sequences are replaced by the one specified by the optional newline argument, and a newline is appended at the end of the text if there is none. Horizontal tabs are treated as tab stops according to the tabsize argument.
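
The end-of-line clean-up described above can be sketched with the standard library alone. This is only an illustration, not the textseg implementation: normalize_lines is a hypothetical helper, and the set of newline sequences it recognizes (CRLF, CR, LF) is an assumption.

```python
import re


def normalize_lines(text, newline="\n"):
    """Strip trailing spaces/tabs, unify newlines, end the text with a newline."""
    # Assumption: only CRLF, CR and LF count as newline sequences here.
    lines = re.split(r"\r\n|\r|\n", text)
    if lines and lines[-1] == "":
        lines.pop()  # the text already ended with a newline (or was empty)
    return "".join(line.rstrip(" \t") + newline for line in lines)


print(repr(normalize_lines("a  \r\nb\t")))  # 'a\nb\n'
```

Tab-stop expansion and the folding itself are left out of this sketch.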

charset or language is used to determine the language/region context: East Asian or not.

For other named arguments, see the instance attributes of the LineBreak class.

textseg.unfold(string, method='fixed', newline='\n', **kwds)

unfold(text[, method]) -> unicode

Join the folded paragraphs of string and return the result. The following options may be specified for the method argument.

"fixed"
Default method. Lines beginning with ">" are not joined. An empty line is treated as a paragraph separator.
"flowed"
Unfold “Format=Flowed; DelSp=Yes” formatting defined by RFC 3676.
"flowedsp"
Unfold “Format=Flowed; DelSp=No” formatting defined by RFC 3676.
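
The difference between the two flowed methods can be illustrated with a minimal sketch of RFC 3676 unfolding. It is not the textseg implementation: unfold_flowed is a hypothetical helper, quoted lines (those beginning with ">") are not handled, and only "\n" is treated as a line separator.

```python
def unfold_flowed(text, delsp=True):
    """Join soft-broken lines of a format=flowed text (quoting not handled)."""
    paragraphs, current = [], ""
    for line in text.split("\n"):
        if line.startswith(" "):
            line = line[1:]  # undo space-stuffing
        # A trailing space marks a soft break, except on the signature
        # separator "-- ", which is always a fixed line.
        if line.endswith(" ") and line != "-- ":
            # DelSp=Yes drops the trailing space; DelSp=No keeps it.
            current += line[:-1] if delsp else line
        else:
            paragraphs.append(current + line)
            current = ""
    if current:
        paragraphs.append(current)
    return "\n".join(paragraphs)


print(unfold_flowed("Hello \nworld.\nBye.", delsp=False))
print(unfold_flowed("日本語 \nテキスト", delsp=True))
```

With DelSp=No the space that marks the soft break stays in the text, which suits scripts that separate words with spaces. With DelSp=Yes it is removed, which suits text written without spaces between words.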

textwrap Style Functions

textseg.fill(text, **kwds)

fill(text[, options...]) -> unicode

Reformat the single paragraph in text to fit in lines of no more than width columns, and return a new string containing the entire wrapped paragraph. Optional named arguments are passed to the wrap() function.

textseg.wrap(text, width=70, initial_indent='', subsequent_indent='', expand_tabs=True, replace_whitespace=True, fix_sentence_endings=False, break_long_words=True, break_on_hyphens=True, drop_whitespace=True, **kwds)

wrap(text[, options...]) -> [unicode]

Wrap the paragraphs of a text, then return a list of wrapped lines.

Reformat each paragraph in text so that it fits in lines of no more than width columns if possible, and return a list of wrapped lines. By default, tabs in text are expanded and all other whitespace characters (including newline) are converted to space.

See the textwrap documentation for details of the options.

Note

Some options have no effect in this module: fix_sentence_endings, break_on_hyphens, drop_whitespace.

For other named arguments see instance attributes of LineBreak class.
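
To see why a column-based width matters, the sketch below wraps unspaced Japanese text with the standard textwrap module, which measures width in characters. The columns helper is a hypothetical, rough approximation of display width (wide and fullwidth characters count as two columns); it is not part of textseg.

```python
import textwrap
import unicodedata


def columns(s):
    """Rough display width: wide/fullwidth characters count as 2 columns."""
    return sum(2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
               for ch in s)


text = "日本語のテキスト" * 3          # 24 wide characters, no spaces
lines = textwrap.wrap(text, width=10)  # textwrap counts characters

for line in lines:
    print(len(line), columns(line), line)
```

The first two lines contain 10 characters each but occupy about 20 columns, twice the requested width.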

GCStr class

class textseg.GCStr

The GCStr class treats a Unicode string as a sequence of extended grapheme clusters as defined by Unicode Standard Annex #29 ([UAX29]).

static __new__(string, lb=None)

GCStr(string[, lb]) -> GCStr

Create a new grapheme cluster string (GCStr object) from the Unicode string string.

The optional LineBreak object lb controls breaking features; attributes of that LineBreak object affect the new GCStr object.
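
As a rough illustration of what a grapheme cluster is, the sketch below groups each base character with the combining marks that follow it. simple_clusters is a hypothetical helper built on the standard library; the real rules of [UAX29] cover many more cases (Hangul syllables, joiners and so on), so this is only an approximation.

```python
import unicodedata


def simple_clusters(s):
    """Group each base character with the combining marks that follow it."""
    clusters = []
    for ch in s:
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch  # attach the mark to the preceding cluster
        else:
            clusters.append(ch)
    return clusters


s = "e\u0301a"  # 'e' + COMBINING ACUTE ACCENT, then 'a'
print(len(s), len(simple_clusters(s)))  # 3 2
```

The string has three Unicode characters but only two user-perceived characters, which is the distinction a grapheme cluster string makes.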

center(width, fillchar=' ')

S.center(width[, fillchar]) -> GCStr

Return S centered in a string of width columns. Padding is done using the specified fill character (default is a space).

endswith(suffix, start=0, end=None)

S.endswith(suffix[, start[, end]]) -> bool

Return True if S ends with the specified suffix, False otherwise. With optional start, test S beginning at that position. With optional end, stop comparing S at that position. suffix can also be a tuple of strings to try.

expandtabs(tabsize=8)

S.expandtabs([tabsize]) -> GCStr

Return a copy of S where all tab characters are expanded using spaces. If tabsize is not given, a tab size of 8 columns is assumed.

join(iterable)

S.join(iterable) -> GCStr

Return a grapheme cluster string which is the concatenation of the strings in the iterable. The separator between elements is S.

ljust(width, fillchar=' ')

S.ljust(width[, fillchar]) -> GCStr

Return S left-justified in a grapheme cluster string of width columns. Padding is done using the specified fill character (default is a space).

rjust(width, fillchar=' ')

S.rjust(width[, fillchar]) -> GCStr

Return S right-justified in a string of width columns. Padding is done using the specified fill character (default is a space).

splitlines(keepends=False)

S.splitlines([keepends]) -> [GCStr]

Return a list of the lines in S, breaking at line boundaries. Line breaks are not included in the resulting list unless keepends is given and true.

Note

U+001C, U+001D and U+001E are not treated as line break characters.
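
For comparison, the built-in str.splitlines() does split at U+001C, U+001D and U+001E. The sketch below contrasts it with a hypothetical helper, split_lines_without_separators, that leaves those three characters alone. The exact set of break characters in its pattern is an assumption chosen for illustration, not taken from textseg.

```python
import re

# Assumed set of line boundaries: CRLF, LF, CR, VT, FF, NEL, LS and PS,
# but not the information separators U+001C..U+001E.
_LINE_BREAK = re.compile(r"\r\n|[\n\r\x0b\x0c\x85\u2028\u2029]")


def split_lines_without_separators(s):
    """Split at line boundaries, leaving U+001C..U+001E inside the lines."""
    parts = _LINE_BREAK.split(s)
    if parts and parts[-1] == "":
        parts.pop()  # no empty trailing line after a final line break
    return parts


sample = "a\x1cb\nc"
print(sample.splitlines())                     # splits at U+001C too
print(split_lines_without_separators(sample))  # splits only at "\n"
```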

startswith(prefix, start=0, end=None)

S.startswith(prefix[, start[, end]]) -> bool

Return True if S starts with the specified prefix, False otherwise. With optional start, test S beginning at that position. With optional end, stop comparing S at that position. prefix can also be a tuple of strings to try.

translate(table)

Deprecated since version 0.1.0: See “Methods not Supported”.

String Operations

Most string operations are also available on GCStr objects.

Operation           Result                                               Notes
x in s              True if s contains a grapheme cluster x, else False  (1)
x not in s          False if s contains a grapheme cluster x, else True  (1)
s + t               the concatenation of s and t                         (2) (3)
s * n, n * s        n copies of s concatenated                           (3)
s[i]                ith grapheme cluster of s, origin 0
s[i:j]              slice of s from i to j
s[i:j:k]            slice of s from i to j with step k
len(s)              number of grapheme clusters s contains               (4)
min(s)              smallest grapheme cluster of s
max(s)              largest grapheme cluster of s
s < t               strictly less than                                   (5)
s <= t              less than or equal                                   (5)
s > t               strictly greater than                                (5)
s >= t              greater than or equal                                (5)
s == t              equal                                                (5)
s != t              not equal                                            (5)
str(s), unicode(s)  string representation of the object (unicode() is used on Python 2.x)

Notes:

  1. x may be a Unicode string.
  2. One of the operands may be a Unicode string.
  3. Note that the number of columns (see cols) or grapheme clusters (see len()) of the resulting grapheme cluster string is not always equal to the sum of those of the two strings.
  4. See also the chars and cols attributes.
  5. Comparisons are performed on the Unicode string values, without regard to grapheme cluster boundaries.

A GCStr object cannot be an operand of re regular expression operations.

Methods not Supported

Some string methods are not supported since they break grapheme cluster boundaries. Instead, use methods of stringified objects. For example:

# For Python 3
result = gcs * 0 + str(gcs).translate(table)
# For Python 2
result = gcs * 0 + unicode(gcs).translate(table)

gcs * 0 + ... is a convenient way to recalculate grapheme clusters.

Instance Attributes

These attributes are read-only.

chars

Number of Unicode characters the grapheme cluster string contains, i.e. its length as a Unicode string.

cols

Total number of columns of the grapheme clusters, as defined by the built-in character database. For more details see the documentation of the LineBreak class.

lbc

Line breaking class of the first character of first grapheme cluster.

lbcext

Line breaking class of the last grapheme extender of the last grapheme cluster. If there are no grapheme extenders or its class is CM, the class of the last grapheme base is returned.

LineBreak class

class textseg.LineBreak(**kwds)

The LineBreak class performs the Line Breaking Algorithm described in Unicode Standard Annex #14 ([UAX14]). The informative East_Asian_Width properties defined by Annex #11 ([UAX11]) are taken into account when determining breaking positions.

__init__(**kwds)

LineBreak([options...]) -> LineBreak

Create a new LineBreak object. Optional named arguments may specify initial attribute values; see the documentation of the instance attributes. The initial defaults are:

break_indent=False, charmax=998, eastasian_context=False, eaw=None, format="SIMPLE", hangul_as_al=False, lbc=None, legacy_cm=True, minwidth=0, newline="\n", prep=[None], sizing="UAX11", urgent=None, virama_as_joiner=True, width=70

breakingRule(before, after)

S.breakingRule(before, after) -> int

Get the possible line breaking behavior between the strings before and after. The returned value is one of:

MANDATORY
Mandatory break.
DIRECT
Both direct break and indirect break are allowed.
INDIRECT
Indirect break is allowed but direct break is prohibited.
PROHIBITED
Breaking is prohibited.

Instance attributes of the LineBreak object S affect the result.

Note

This method gives only an approximate description of line breaking behavior. Use the wrap() method or other functions to fold actual text.

wrap(text)

S.wrap(text) -> [GCStr]

Break the Unicode string text and return a list of the resulting lines. Each item of the list is a grapheme cluster string (GCStr object).

Class Attributes

DEFAULTS

Dictionary containing default values of instance attributes.

MANDATORY
DIRECT
INDIRECT
PROHIBITED

Four values to specify line breaking behaviors: Mandatory break; Both direct break and indirect break are allowed; Indirect break is allowed but direct break is prohibited; Prohibited break.

Instance Attributes

For the default values of these attributes, see __init__().

break_indent

Always allow a break after SPACEs at the beginning of a line, a.k.a. indent. [UAX14] does not take such use of SPACE into account.

charmax

Maximum number of characters allowed in one line, not counting trailing SPACEs and the newline sequence. Note that the number of characters generally does not represent the length of a line. 0 means unlimited.

complex_breaking

Perform heuristic breaking in South East Asian complex context. If word segmentation for South East Asian writing systems is not enabled, this has no effect.

eastasian_context

Enable East Asian language/region context. If it is true, characters assigned to line breaking class AI will be treated as ideographic characters (ID) and East_Asian_Width A (ambiguous) will be treated as F (fullwidth). Otherwise, they are treated as alphabetic characters (AL) and N (neutral), respectively.
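
The effect of the context on ambiguous-width characters can be sketched with the standard unicodedata module. The columns helper below is hypothetical and only approximates column counting (combining marks and other zero-width characters are ignored); it is not how textseg computes widths.

```python
import unicodedata


def columns(s, eastasian_context=False):
    """Approximate column count; ambiguous (A) is wide only in East Asian context."""
    wide = {"W", "F"}
    if eastasian_context:
        wide.add("A")  # treat ambiguous-width characters like fullwidth ones
    return sum(2 if unicodedata.east_asian_width(ch) in wide else 1 for ch in s)


greek = "αβγ"  # Greek small letters have East_Asian_Width A (ambiguous)
print(columns(greek), columns(greek, eastasian_context=True))  # 3 6
```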

eaw

Tailor the classification of the East_Asian_Width property defined by [UAX11]. The value may be a dictionary whose keys are Unicode strings or UCS scalar values and whose values are any of the East_Asian_Width properties (see the documentation of the textseg.Consts module). If None is specified, all tailoring assigned before is canceled. By default, no tailoring is in effect. See also “Tailoring Character Properties”.

format

Specify the method used to format broken lines.

"SIMPLE"
Only insert a newline at arbitrary breaking positions.
"NEWLINE"
Insert newline sequences, or replace existing ones, with the sequence specified by the newline option, and remove SPACEs preceding newline sequences or end-of-text. Then append a newline at the end of the text if there is none.
"TRIM"
Insert a newline at arbitrary breaking positions and remove SPACEs preceding newline sequences.
None
Do nothing, not even insert newlines.
callable object
See “Formatting Lines”.

hangul_as_al

Treat hangul syllables and conjoining jamo as alphabetic characters (AL).

lbc

Tailor the classification of the line breaking property defined by [UAX14]. The value may be a dictionary whose keys are Unicode strings or UCS scalar values and whose values are any of the line breaking classes (see the Consts module). If None is specified, all tailoring assigned before is canceled. By default, no tailoring is in effect. See also “Tailoring Character Properties”.

legacy_cm

Treat a combining character preceded by a SPACE as an isolated combining character (ID). As of Unicode 5.0, such use of SPACE is not recommended.

minwidth

Minimum number of columns that an arbitrarily broken line may contain, not counting trailing spaces and newline sequences.

newline

Unicode string to be used for newline sequence. It may be None.

prep

Add user-defined line breaking behavior(s). The value shall be a list of the items described below.

"NONBREAKURI"
Won’t break URIs.
"BREAKURI"
Break URIs according to a rule suitable for printed materials. For more details see [CMOS], sections 6.17 and 17.11.
(regex, callable object[, flags])
The sequences matching regex will be broken by callable object. If regex is a string, not a regex object, flags may be specified. For more details see “User-Defined Breaking Behaviors”.
None
Cancel all methods assigned before.

sizing

Specify the method used to calculate the size of a string. The following options are available.

"UAX11"
Sizes are computed from the column widths of the individual characters.
None
The size is the number of grapheme clusters (see the documentation of the GCStr class) contained in the string.
callable object
See “Calculating String Size”.

See also eaw attribute.

urgent

Specify the method for handling lines that exceed the maximum width. The following options are available.

"RAISE"
Raise a LineBreakException exception.
"FORCE"
Force a break within the excess fragment.
None
Do not break the excess fragment.
callable object
See “User-Defined Breaking Behaviors”.

virama_as_joiner

A virama sign (“halant” in Hindi, “coeng” in Khmer) and the letter that follows it are not separated. The “default” grapheme cluster defined by [UAX29] does not include this feature.
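
The idea can be sketched by extending a naive cluster splitter so that a letter following a virama joins the preceding cluster. The clusters helper is hypothetical and relies on the fact that virama signs have canonical combining class 9; the actual segmentation rules of textseg and [UAX29] are more elaborate.

```python
import unicodedata

VIRAMA_CCC = 9  # canonical combining class shared by virama signs


def clusters(s, virama_as_joiner=True):
    """Naive cluster split; optionally keep virama + following letter together."""
    out = []
    for ch in s:
        if out and unicodedata.combining(ch):
            out[-1] += ch  # combining mark (including the virama itself)
        elif (virama_as_joiner and out
              and unicodedata.combining(out[-1][-1]) == VIRAMA_CCC):
            out[-1] += ch  # letter right after a virama stays attached
        else:
            out.append(ch)
    return out


word = "\u0915\u094d\u0937"  # DEVANAGARI KA + VIRAMA + SSA
print(len(clusters(word, True)), len(clusters(word, False)))  # 1 2
```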

width

Maximum number of columns a line may contain, not counting trailing spaces and the newline sequence. In other words, the recommended maximum length of a line.

Exception

exception textseg.LineBreakException

See the urgent attribute of the LineBreak class.
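
The three built-in strategies for an over-long fragment ("RAISE", "FORCE" and None) can be sketched as follows. Everything here is hypothetical: handle_overlong and OverlongFragment are illustrative stand-ins, and length is measured in characters rather than columns for simplicity.

```python
class OverlongFragment(Exception):
    """Hypothetical stand-in for an exception such as LineBreakException."""


def handle_overlong(fragment, width, urgent=None):
    """Return the pieces of fragment according to the chosen strategy."""
    if width <= 0 or len(fragment) <= width:
        return [fragment]  # nothing to do: no limit, or fragment fits
    if urgent == "RAISE":
        raise OverlongFragment("fragment exceeds %d characters" % width)
    if urgent == "FORCE":
        # Cut the fragment into pieces of at most `width` characters.
        return [fragment[i:i + width] for i in range(0, len(fragment), width)]
    return [fragment]  # None: leave the over-long fragment unbroken


print(handle_overlong("a" * 15, 10, urgent="FORCE"))
print(handle_overlong("a" * 15, 10))
```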

textseg.Consts module

Constants for textseg package.

textseg.Consts.eawNa
textseg.Consts.eawN
textseg.Consts.eawA
textseg.Consts.eawW
textseg.Consts.eawH
textseg.Consts.eawF
textseg.Consts.eawZ

Index values to specify the six East_Asian_Width properties defined by [UAX11], and eawZ to specify nonspacing.

Note

Property value Z is non-standard.

textseg.Consts.lbcBK
textseg.Consts.lbcCR
textseg.Consts.lbcLF
textseg.Consts.lbcNL
textseg.Consts.lbcSP
textseg.Consts.lbcOP
textseg.Consts.lbcCL
textseg.Consts.lbcCP
textseg.Consts.lbcQU
textseg.Consts.lbcGL
textseg.Consts.lbcNS
textseg.Consts.lbcEX
textseg.Consts.lbcSY
textseg.Consts.lbcIS
textseg.Consts.lbcPR
textseg.Consts.lbcPO
textseg.Consts.lbcNU
textseg.Consts.lbcAL
textseg.Consts.lbcHL
textseg.Consts.lbcID
textseg.Consts.lbcIN
textseg.Consts.lbcHY
textseg.Consts.lbcBA
textseg.Consts.lbcBB
textseg.Consts.lbcB2
textseg.Consts.lbcCB
textseg.Consts.lbcZW
textseg.Consts.lbcCM
textseg.Consts.lbcWJ
textseg.Consts.lbcH2
textseg.Consts.lbcH3
textseg.Consts.lbcJL
textseg.Consts.lbcJV
textseg.Consts.lbcJT
textseg.Consts.lbcSG
textseg.Consts.lbcAI
textseg.Consts.lbcCJ
textseg.Consts.lbcSA
textseg.Consts.lbcXX

Index values to specify the 39 line breaking properties (classes) defined by [UAX14].

Note

Property value CP was introduced by Unicode 5.2.0. Property values HL and CJ were introduced by Unicode 6.1.0.

textseg.Consts.sea_support

Flag to determine whether word segmentation for South East Asian writing systems is enabled. If this feature is enabled, the value is a non-empty string. Otherwise, it is None.

Note

Current release supports Thai script of modern Thai language only.

textseg.Consts.unicode_version

A string specifying the version of the Unicode Standard this module refers to.

See also “Tailoring Character Properties”.
