Tokenize UK


Simple Python library to tokenize text into sentences and sentences into words. Small, fast and robust. Comes with a Ukrainian flavour.

Features

  • Tokenize given text into sentences
  • Tokenize given sentence into words
  • Works well with accented characters (like stresses) and apostrophes
  • Also suitable for other languages

API

Ukrainian tokenization script based on a standard tokenization algorithm.

2016 (c) Vsevolod Dyomkin <vseloved@gmail.com>, Dmitry Chaplinsky <chaplinsky.dmitry@gmail.com>

tokenize_uk.tokenize_uk.tokenize_words(string)[source]

Tokenize input text to words.

Parameters: string (str or unicode) – Text to tokenize
Returns: words
Return type: list of strings
tokenize_uk.tokenize_uk.tokenize_text(string)[source]

Tokenize input text to paragraphs, sentences and words.

Tokenization into paragraphs is done with a simple newline-based algorithm; for sentences and words, the tokenizers above are used.

Parameters: string (str or unicode) – Text to tokenize
Returns: text, tokenized into paragraphs, sentences and words
Return type: list of lists of lists of strings
tokenize_uk.tokenize_uk.tokenize_sents(string)[source]

Tokenize input text to sentences.

Parameters: string (str or unicode) – Text to tokenize
Returns: sentences
Return type: list of strings