Tokenize UK


Simple Python library to tokenize text into sentences and sentences into words. Small, fast and robust. Comes with a Ukrainian flavour.

Features

  • Tokenize given text into sentences
  • Tokenize given sentence into words
  • Works well with accented characters (like stresses) and apostrophes
  • Also suitable for other languages

API

Ukrainian tokenization script based on a standard tokenization algorithm.

2016 (c) Vsevolod Dyomkin <vseloved@gmail.com>, Dmitry Chaplinsky <chaplinsky.dmitry@gmail.com>

tokenize_uk.tokenize_uk.tokenize_words(string)[source]

Tokenize input text to words.

Parameters: string (str or unicode) – Text to tokenize
Returns: words
Return type: list of strings
tokenize_uk.tokenize_uk.tokenize_text(string)[source]

Tokenize input text to paragraphs, sentences and words.

Tokenization into paragraphs is done with a simple newline-based algorithm; for sentences and words, the tokenizers above are used.

Parameters: string (str or unicode) – Text to tokenize
Returns: text, tokenized into paragraphs, sentences and words
Return type: list of lists of lists of strings
tokenize_uk.tokenize_uk.tokenize_sents(string)[source]

Tokenize input text to sentences.

Parameters: string (str or unicode) – Text to tokenize
Returns: sentences
Return type: list of strings