tokenize_uk package

Submodules

tokenize_uk.tokenize_uk module

Ukrainian tokenization script based on a standard tokenization algorithm.

2016 (c) Vsevolod Dyomkin <vseloved@gmail.com>, Dmitry Chaplinsky <chaplinsky.dmitry@gmail.com>

tokenize_uk.tokenize_uk.tokenize_words(string)

Tokenize input text into words.

Parameters: string (str or unicode) – Text to tokenize
Returns: words
Return type: list of strings

tokenize_uk.tokenize_uk.tokenize_text(string)

Tokenize input text into paragraphs, sentences and words.

Paragraphs are split using a simple newline-based algorithm. Sentences and words are then tokenized with tokenize_sents and tokenize_words.

Parameters: string (str or unicode) – Text to tokenize
Returns: text, tokenized into paragraphs, sentences and words
Return type: list of lists of lists of strings (paragraphs, then sentences, then words)
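
The three-level structure described above can be sketched as follows. This is a minimal illustration, not the library's implementation: naive_sents and naive_words are hypothetical regex stand-ins for the real sentence and word tokenizers, and skipping blank lines is an assumption about how the newline split behaves.

```python
import re


def naive_sents(paragraph):
    # Hypothetical stand-in: split after '.', '!' or '?' followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", paragraph.strip()) if s]


def naive_words(sentence):
    # Hypothetical stand-in: runs of word characters, or single punctuation marks.
    return re.findall(r"\w+|[^\w\s]", sentence)


def tokenize_text_sketch(text, sent_tokenizer=naive_sents, word_tokenizer=naive_words):
    # Paragraphs: one per non-blank line (assumed behaviour of the newline split).
    paragraphs = [p for p in text.split("\n") if p.strip()]
    # Each paragraph becomes a list of sentences; each sentence a list of words.
    return [[word_tokenizer(s) for s in sent_tokenizer(p)] for p in paragraphs]


sample = "Привіт, світе! Як справи?\nЦе другий абзац."
nested = tokenize_text_sketch(sample)
# nested[0] holds the two sentences of the first paragraph,
# nested[1] holds the single sentence of the second paragraph.
```

Because the tokenizers are passed in as arguments, the documented tokenize_sents and tokenize_words functions could be supplied in place of the stand-ins when the package is installed.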

tokenize_uk.tokenize_uk.tokenize_sents(string)

Tokenize input text into sentences.

Parameters: string (str or unicode) – Text to tokenize
Returns: sentences
Return type: list of strings
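
The return shapes documented for these functions nest inside one another, so the result of tokenize_text can be flattened when only sentences or only words are needed. The sketch below uses hand-written data in the documented paragraph/sentence/word shape; it is illustrative and not actual output of the library.

```python
from itertools import chain

# Hand-written illustrative data shaped like the documented tokenize_text result:
# a list of paragraphs, each a list of sentences, each a list of words.
nested = [
    [["Привіт", ",", "світе", "!"], ["Як", "справи", "?"]],
    [["Це", "другий", "абзац", "."]],
]

# Drop the paragraph level: one token list per sentence.
sentences = list(chain.from_iterable(nested))

# Drop the sentence level as well: a single flat list of tokens.
words = list(chain.from_iterable(sentences))
```

Flattening one level at a time keeps sentence boundaries available for as long as they are needed.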

Module contents