Letter Tokenizer

A tokenizer of type letter that divides text at non-letters. In other words, it defines tokens as maximal strings of adjacent letters. This works reasonably well for most European languages, but poorly for some Asian languages, where words are not separated by spaces.
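
As a rough illustration of the splitting rule described above, the following Python sketch groups a string into maximal runs of adjacent letters. It is not the tokenizer's actual implementation: the function name letter_tokenize is hypothetical, and Python's str.isalpha is assumed here as a stand-in for the tokenizer's own definition of a letter, which may differ for some characters.

```python
from itertools import groupby


def letter_tokenize(text):
    """Return the maximal runs of adjacent letters in `text`.

    Hypothetical helper for illustration only. `str.isalpha` is an
    assumed approximation of what the real tokenizer treats as a letter.
    """
    # groupby splits the string into consecutive runs of characters that
    # share the same isalpha() result; only the letter runs are kept.
    return [
        "".join(run)
        for is_letter, run in groupby(text, key=str.isalpha)
        if is_letter
    ]


if __name__ == "__main__":
    # Digits, spaces, hyphens and punctuation all act as token boundaries.
    print(letter_tokenize("The 2 QUICK Brown-Foxes jumped!"))
    # Text written without spaces between words stays in one piece.
    print(letter_tokenize("我喜欢搜索"))
```

The second call shows the limitation mentioned above: because the characters are all letters and no spaces separate the words, the whole phrase comes back as a single token.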
