Tokenize UK¶
A simple Python library to tokenize texts into sentences and sentences into words. Small, fast, and robust, with a Ukrainian flavour.
- Free software: MIT license
- Documentation: https://tokenize_uk.readthedocs.org.
Features¶
- Tokenize given text into sentences
- Tokenize given sentence into words
- Works well with accented characters (such as stress marks) and apostrophes
- Suitable also for other languages
API¶
Ukrainian tokenization script based on the standard tokenization algorithm.
2016 (c) Vsevolod Dyomkin <vseloved@gmail.com>, Dmitry Chaplinsky <chaplinsky.dmitry@gmail.com>
- tokenize_uk.tokenize_uk.tokenize_words(string)[source]¶

  Tokenize input text into words.

  Parameters: string (str or unicode) – Text to tokenize
  Returns: words
  Return type: list of strings
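To illustrate what apostrophe- and stress-aware word tokenization looks like, here is a minimal standalone sketch. This is not the library's actual implementation; the regex and the function name `tokenize_words_sketch` are assumptions for illustration only.

```python
import re

# A minimal sketch, NOT the library's implementation. Ukrainian words may
# contain an apostrophe (м'яч) and a combining stress mark (U+0301), so
# both are kept inside word tokens.
WORD_RE = re.compile(r"[\w\u0301]+(?:['\u2019][\w\u0301]+)*")

def tokenize_words_sketch(string):
    """Return the list of word tokens found in the input string."""
    return WORD_RE.findall(string)

print(tokenize_words_sketch("М'яч полетів у ворота!"))
# → ["М'яч", 'полетів', 'у', 'ворота']
```

Note that punctuation is dropped here; the real library may handle punctuation tokens differently.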
- tokenize_uk.tokenize_uk.tokenize_text(string)[source]¶

  Tokenize input text into paragraphs, sentences, and words.

  Tokenization into paragraphs is done using a simple newline algorithm. For sentences and words, the tokenizers above are used.

  Parameters: string (str or unicode) – Text to tokenize
  Returns: text, tokenized into paragraphs, sentences, and words
  Return type: list of list of list of words
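The nested paragraph → sentence → word structure can be sketched as follows. This is an illustrative standalone implementation, not the library's own code: the sentence-splitting regex and the name `tokenize_text_sketch` are assumptions.

```python
import re

# A minimal sketch, NOT the library's implementation: paragraphs are split
# on newlines, sentences on end-of-sentence punctuation, words via a simple
# apostrophe-aware regex.
WORD_RE = re.compile(r"[\w\u0301]+(?:['\u2019][\w\u0301]+)*")
SENT_RE = re.compile(r"(?<=[.!?…])\s+")

def tokenize_text_sketch(string):
    """Return a list (paragraphs) of lists (sentences) of word lists."""
    paragraphs = []
    for para in string.split("\n"):
        if not para.strip():
            continue  # skip blank lines between paragraphs
        sentences = [WORD_RE.findall(sent) for sent in SENT_RE.split(para)]
        paragraphs.append(sentences)
    return paragraphs

text = "Привіт, світе! Як справи?\nДруга частина."
print(tokenize_text_sketch(text))
# → [[['Привіт', 'світе'], ['Як', 'справи']], [['Друга', 'частина']]]
```

The triple nesting matches the documented return type: the outer list holds paragraphs, each paragraph is a list of sentences, and each sentence is a list of word strings.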