Part-of-Speech Tagger


The “Part-of-Speech Tagger” service identifies, online, the part of speech of each word in a text. The input is a text in Belarusian or Russian; the output is a list of the words of the text, each labelled with its part of speech.

 

Basic terms and concepts

Parts of speech are word classes characterized by a shared general meaning, morphological features, and syntactic role. A part of speech can be identified only on the basis of a set of criteria. The following factors are taken into account when classifying a word:

  • what it usually denotes (an object, an action, a quality, etc.);
  • in which grammatical form it may occur;
  • which word-formation means are typical for it;
  • what functions it performs in a sentence [1].

 

Practical value

Identifying the part of speech of each word in a text precisely matters when the meaning of a word depends on its part of speech. For example, translators can use the service when a text contains a word that may belong to different parts of speech. The service can also be used in translation software.

 

Service features

The service can use several dictionaries. The user selects or deselects each dictionary with the checkbox next to its name.

 

UI description

The UI of the service is shown in Figure 1.

Figure 1. UI of the service “Part-of-Speech Tagger”

On the service page, the user enters a text in one of two languages (Belarusian or Russian) whose words are to be assigned parts of speech. Separately, the user can add known words, that is, words whose part of speech is known for certain.

The UI has the following areas:

  • text input area;
  • input area for known words with their parts of speech;
  • output area for the words of the text with their parts of speech;
  • output area for unknown words.

To obtain the list of words with their parts of speech, click “Show the list of words with parts of speech!”.

 

Example of working with the service

  1. Enter text in the input field on the service page.
  2. In the “Known words” field, enter each known word with its part of speech, joined by the “_” symbol (Figure 1).
  3. In the dictionary selection area, select the necessary dictionaries (Figure 1).
  4. Click “Show the list of words with parts of speech!” to obtain the results (Figure 2).

Figure 2. Results of part-of-speech identification by the service
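The “Known words” format from step 2 can be sketched in code. The helper below is hypothetical (buildKnownList is not part of the service); it assumes that entries are separated by spaces, as in the API request example in this document.

```javascript
// Hypothetical helper (not part of the service): joins word/part-of-speech
// pairs into the "word_tag word_tag" string used for known words.
// Assumption: entries are separated by single spaces.
function buildKnownList(pairs, delimiter = "_") {
  return pairs
    .map(([word, partOfSpeech]) => `${word}${delimiter}${partOfSpeech}`)
    .join(" ");
}

const knownList = buildKnownList([
  ["груша", "назоўнік"],
  ["цвіла", "дзеяслоў"],
]);
console.log(knownList);
```

The resulting string can be pasted into the “Known words” field or passed as the knownList parameter of the API.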

 

Access to the service via the API

To find out the part of speech of each word of an input text, send a POST request to the AJAX address http://corpus.by/PartOfSpeechTagger/api.php. The data parameter array carries the input text (text parameter), a list of words with user-defined parts of speech (knownList parameter), the separator used in the resulting information (localDelimiter parameter), a flag that requests the names of the dictionaries from which the information was taken (dictionaryNames parameter), a flag that requests the output on a single line (horizontalFormat parameter), and flags that enable particular dictionaries.

The input array data has the following elements:

  • text: arbitrary input text.
  • knownList: a list of words with user-defined parts of speech.
  • localDelimiter: the separator used in the resulting information.
  • dictionaryNames: a flag that requests the names of the dictionaries from which the information was taken.
  • horizontalFormat: a flag that requests the output on a single line; if it is not set, the information on each word is placed on a separate line.
  • Dictionary usage flags:
    • sbm1987: «Слоўнік беларускай мовы. Арфаграфія. Арфаэпія. Акцэнтуацыя. Словазмяненне / пад рэд. М.В. Бірылы. – Мінск, 1987»;
    • sbm2012initial: «Слоўнік беларускай мовы. / навук. рэд. А.А. Лукашанец, В.П. Русак. – Мінск : Беларус. навука, 2012»;
    • zalizniak: «Грамматический словарь русского языка: Словоизменение / А.А. Зализняк. – Москва : Русский язык, 1980. – 880 c.»;
    • new: dictionary of the text-to-speech synthesis system;
    • S2016_01: user dictionary of the Belarusian language;
    • S2016_02: user dictionary of the Russian language;
    • S2016_03: user dictionary of the Belarusian language.
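As a rough sketch, the parameters listed above could be assembled into the data object like this. The helper buildRequestData, its option names, and its default values are assumptions made for illustration. It also assumes that a flag is switched on with the value 1 and that dictionaries that are not needed are simply left out, as in the request example in this document.

```javascript
// Dictionary flags named in the parameter list above.
const DICTIONARY_TOKENS = [
  "sbm1987", "sbm2012initial", "zalizniak", "new",
  "S2016_01", "S2016_02", "S2016_03",
];

// Hypothetical helper (not part of the service): builds the `data` object
// for the POST request. Default values are assumptions.
function buildRequestData(text, options = {}) {
  const {
    knownList = "",
    localDelimiter = "_",
    dictionaryNames = false,
    horizontalFormat = false,
    dictionaries = [],
  } = options;

  // Reject names that are not in the documented list.
  const unknown = dictionaries.filter((name) => !DICTIONARY_TOKENS.includes(name));
  if (unknown.length > 0) {
    throw new Error(`Unknown dictionary flag(s): ${unknown.join(", ")}`);
  }

  const data = {
    text,
    knownList,
    localDelimiter,
    dictionaryNames: dictionaryNames ? 1 : 0,
    horizontalFormat: horizontalFormat ? 1 : 0,
  };
  for (const name of dictionaries) {
    data[name] = 1; // assumption: 1 enables the dictionary
  }
  return data;
}

const data = buildRequestData("Груша цвіла апошні грод.", {
  knownList: "груша_назоўнік цвіла_дзеяслоў",
  dictionaryNames: true,
  dictionaries: ["sbm1987", "sbm2012initial"],
});
console.log(data);
```

Validating the dictionary names on the client side is a design choice of this sketch; how the service itself reacts to an unknown flag is not described here.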

Example of AJAX-request:

$.ajax({
   type: "POST",
   url: "http://corpus.by/PartOfSpeechTagger/api.php",
   data: {
      "text": "Груша цвіла апошні грод.",
      "knownList": "груша_назоўнік цвіла_дзеяслоў",
      "localDelimiter": "_",
      "dictionaryNames": 1,
      "horizontalFormat": 0,
      "sbm1987": 1,
      "sbm2012initial": 1
   },
   success: function(msg){ /* msg holds the server reply */ }
});
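The same request can also be sent without jQuery. The sketch below uses the standard fetch and URLSearchParams interfaces and encodes the parameters as a form, which is how jQuery sends the data option by default. The function names encodeForm and tagText are hypothetical, and the sketch assumes a runtime with a global fetch (modern browsers, Node.js 18 or later) and a reply body that is valid JSON.

```javascript
const API_URL = "http://corpus.by/PartOfSpeechTagger/api.php";

// Encodes a flat object as application/x-www-form-urlencoded form data.
function encodeForm(data) {
  const params = new URLSearchParams();
  for (const [key, value] of Object.entries(data)) {
    params.append(key, String(value));
  }
  return params;
}

// Hypothetical wrapper: sends the POST request and parses the reply.
// Assumption: the reply body is valid JSON.
async function tagText(data) {
  const response = await fetch(API_URL, {
    method: "POST",
    body: encodeForm(data), // fetch sets the form content type automatically
  });
  if (!response.ok) {
    throw new Error(`Request failed with HTTP status ${response.status}`);
  }
  return response.json();
}

// Usage (performs a network call, so it is left commented out):
// tagText({ text: "Груша цвіла апошні грод.", localDelimiter: "_", sbm1987: 1 })
//   .then((reply) => console.log(reply))
//   .catch((error) => console.error(error));
```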

The server returns a JSON array containing the input text (text parameter), the final list of words with their parts of speech (result parameter), and the list of words unknown to the service (unknownWords parameter). For example, the request above produces the following reply:

 

[
   {
      "text": "Груша цвіла апошні грод.",
      "result": "груша_назоўнік_known
цвіла_дзеяслоў_known
апо+шні_JJMO_sbm1987_апо+шні_JJMA_sbm1987_апо+шні_невядомаяКатэгорыя_sbm2012initial
грод_НевядомаяЧасц
._ЗнакПрыпынку",
      "unknownWords": "грод"
   }
]
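The result string can be split back into words and tags on the client side. The sketch below is a hypothetical parser (parseResult is not part of the service) whose rules are inferred from the sample reply only: each line describes one token, fields are joined by the delimiter, and with dictionaryNames set the fields come in groups of three (word form, tag, source). Lines that do not fit this pattern are treated as a word followed by a tag.

```javascript
// Hypothetical parser for the `result` string, inferred from the sample
// reply only; the real output format may have cases not covered here.
function parseResult(result, delimiter = "_") {
  return result
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.length > 0)
    .map((line) => {
      const fields = line.split(delimiter);
      const readings = [];
      if (fields.length % 3 === 0) {
        // Groups of three: word form, tag, source ("known" or a dictionary).
        for (let i = 0; i < fields.length; i += 3) {
          readings.push({ form: fields[i], tag: fields[i + 1], source: fields[i + 2] });
        }
      } else {
        // Fallback: a word followed by a tag, with no source given.
        readings.push({ form: fields[0], tag: fields.slice(1).join(delimiter), source: null });
      }
      return { token: fields[0], readings };
    });
}

const sample = [
  "груша_назоўнік_known",
  "цвіла_дзеяслоў_known",
  "апо+шні_JJMO_sbm1987_апо+шні_JJMA_sbm1987_апо+шні_невядомаяКатэгорыя_sbm2012initial",
  "грод_НевядомаяЧасц",
  "._ЗнакПрыпынку",
].join("\n");

for (const entry of parseResult(sample)) {
  console.log(entry.token, entry.readings.map((r) => r.tag));
}
```

A limitation of this approach: dictionary names such as S2016_01 contain the “_” character themselves, so with those dictionaries a different localDelimiter would be needed for the grouping to work.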

An example of using this API is the web service “Part-of-Speech Tagger via API” (http://corpus.by/PartOfSpeechTaggerViaApi/).

 

Links to sources

Service page: http://corpus.by/PartOfSpeechTagger/?lang=be

 

References

  1. Часціны мовы // Вікіпедыя [Electronic resource]. 2017. Access mode: https://be.wikipedia.org/wiki/Часціны_мовы. Date of access: 15.03.2017.
