Tokenizer


The «Tokenizer» service identifies tokens in a text. The user submits the text to be tokenized as input and receives a list of the identified tokens as output.

 

Basic terms and concepts

Tokenization: the identification of tokens in a text by a computer program. It is the process of splitting an input sequence of characters into lexemes in order to produce a sequence of identified units, called «tokens». Lexical analysis is used in compilers and interpreters for programming languages and in various parsers of natural-language text.

Token (lexical analysis): in computer science, a sequence of characters that corresponds to a lexeme; an object created from a lexeme during lexical analysis (tokenization).

Token template: a formal description of a class of lexemes that can produce a given type of token.
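
The terms above can be illustrated with a minimal sketch in Python. It is not the implementation used by the service: the template names (NUMBER, WORD, SPACE, PUNCT, OTHER) and their regular expressions are assumptions chosen for the example. Each token template is written as a regular expression, and each match becomes a token that records its type, its lexeme and its position in the text.

```python
import re
from typing import Iterator, NamedTuple


class Token(NamedTuple):
    kind: str   # name of the template that matched
    text: str   # the lexeme, i.e. the matched characters
    start: int  # offset of the lexeme in the input text


# Hypothetical token templates (assumptions for this sketch, not the
# service's real ones). Earlier templates take priority when several
# could match at the same position.
TEMPLATES = [
    ("NUMBER", r"\d+(?:[.,]\d+)?"),                     # 42, 3.14, 2,5
    ("WORD", r"[^\W\d_]+(?:['’-][^\W\d_]+)*"),          # letters of any script
    ("SPACE", r"\s+"),                                  # whitespace, skipped
    ("PUNCT", r"[^\w\s]"),                              # one punctuation mark
    ("OTHER", r"."),                                    # anything else
]

# Combine all templates into one pattern with a named group per template.
MASTER = re.compile(
    "|".join(f"(?P<{name}>{pattern})" for name, pattern in TEMPLATES),
    re.DOTALL,
)


def tokenize(text: str) -> Iterator[Token]:
    """Yield the tokens found in text, skipping whitespace."""
    for match in MASTER.finditer(text):
        kind = match.lastgroup  # name of the template that matched
        if kind == "SPACE":
            continue
        yield Token(kind, match.group(), match.start())


if __name__ == "__main__":
    for token in tokenize("Hello, world! 42"):
        print(token.kind, repr(token.text), token.start)
```

In this sketch the order of the templates resolves ambiguity, and the final OTHER template guarantees that every character of the input ends up in some token or is skipped as whitespace.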

 

User Interface Description

The user interface of the service is shown in Figure 1.

 

Figure 1. User interface of the service «Tokenizer»

 

The interface has the following areas:

 

  • input field for the text to be tokenized;
  • language selection menu;
  • the button «Get list of tokens!», which starts tokenization;
  • output field where selected tokens are displayed.

User scenario for working with the service

  1. Enter the text that requires tokenization in the input field.
  2. Choose a language.
  3. Click the button «Get list of tokens!».
  4. View highlighted tokens (Figure 2).

 

Figure 2. «Tokenizer» results example: highlighted tokens

 

Links to sources

Service page: https://corpus.by/Tokenizer/?lang=en
