The speech synthesis and recognition laboratory was created in 1974, initially as a department of the Central Scientific-Research Institute of Communications (CSRIC); since 1986 it has been a laboratory of the Institute of Technical Cybernetics of the NAS of Belarus. The laboratory works on the theory of speech recognition and synthesis and on the development of human-machine systems based on speech communication.
The main research fields of the laboratory include:
- High-quality text-to-speech synthesis;
- Computer-assisted personal voice cloning;
- Multilingual speech synthesis;
- Robust recognition of sequences of discrete and run-together words;
- Computer telephony integration;
- Computer-assisted rehabilitation systems for people who are deaf, hard of hearing or are visually impaired;
- Computational linguistics;
- Natural language processing;
- Text pre-processing.
Scientific approaches and research methodology
High-quality multilingual and multi-voiced text-to-speech synthesis is based on the use of natural-speech allophone elements (around 1000 allophones) and on close imitation of specific male and female voices. The task of synthetic speech “personalization” (computer cloning) has been successfully solved by satisfying the following conditions:
- Maximally accurate modelling of the acoustic, phonetic and prosodic features of an individual person’s voice and speech;
- Minimal distortion of the compilation elements during their recording, playback and prosodic modification;
- The absence of any additional transformation of speech elements of the PSOLA (Pitch Synchronous Overlap and Add) or FFT (Fast Fourier Transform) type.
To solve the problem of pre-processing natural-language texts, the linguistic development environment NooJ is used (http://www.nooj-association.org/). This software makes it possible to develop syntactic and morphological grammars, represented as so-called finite-state automata, and to test them on a large number of texts. For that purpose, a Belarusian module for NooJ was developed, which includes several texts, demo versions of grammars and a set of dictionaries (http://www.nooj4nlp.org/resources/be.zip).
The basic algorithms for speech recognition and verbal decision-making are implemented on the basis of dynamic matching of signals, a new method proposed by the laboratory and modified for word recognition in connected speech. The method dynamically aligns the time scale of a word’s reference description with that of its realization in the speech flow when the beginning and the end of the word are not known in advance.
The main advantage of this method is its ability to estimate the probability that a word is present in the running speech flow and to assess the word’s time position in the presence of various acoustic disturbances.
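The idea of aligning a reference word against a speech flow with unknown word boundaries can be illustrated with textbook subsequence dynamic time warping. This is only a minimal sketch under stated assumptions, not the laboratory's own dynamic matching method: the function name `subsequence_dtw`, the one-dimensional features and the absolute-difference distance are all hypothetical choices made for the illustration, and the result is an alignment cost rather than a probability.

```python
from math import inf


def subsequence_dtw(reference, stream):
    """Find the span of `stream` that best matches `reference`.

    Hypothetical illustration: both inputs are sequences of numbers
    (one feature per frame). Returns (cost, start, end), where `start`
    and `end` are inclusive frame indices in `stream`.
    """
    n, m = len(reference), len(stream)
    if n == 0 or m == 0:
        raise ValueError("both sequences must be non-empty")

    # cost[i][j]: cheapest alignment of reference[0..i] ending at stream[j]
    cost = [[inf] * m for _ in range(n)]
    # start[i][j]: stream index where that cheapest alignment begins
    start = [[0] * m for _ in range(n)]

    # The word may begin at any frame, so the first row has no history.
    for j in range(m):
        cost[0][j] = abs(reference[0] - stream[j])
        start[0][j] = j

    for i in range(1, n):
        cost[i][0] = cost[i - 1][0] + abs(reference[i] - stream[0])
        for j in range(1, m):
            # Standard DTW steps: vertical, horizontal, diagonal.
            prev_cost, prev_start = min(
                (cost[i - 1][j], start[i - 1][j]),
                (cost[i][j - 1], start[i][j - 1]),
                (cost[i - 1][j - 1], start[i - 1][j - 1]),
            )
            cost[i][j] = prev_cost + abs(reference[i] - stream[j])
            start[i][j] = prev_start

    # The word may also end at any frame: pick the cheapest end point.
    end = min(range(m), key=lambda j: cost[n - 1][j])
    return cost[n - 1][end], start[n - 1][end], end


# Usage: locate the pattern 1, 2, 3 inside a longer stream.
cost, begin, finish = subsequence_dtw([1, 2, 3], [0, 0, 1, 2, 2, 3, 0, 0])
print(cost, begin, finish)
```

Leaving the first row free of accumulated cost and taking the minimum over the last row is what lets the start and end of the word float; a low final cost suggests the word is present, and `start`/`end` give its estimated time position.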
The solution to the problem of robust speech recognition is based on the implementation of two basic approaches:
- The application of known techniques for robust estimation of statistical parameters to specific problems of analysis, feature extraction, training and speech recognition;
- The application of collective recognition methods, where the final decision combines the results of several recognition rules, each using a different set of speech signal characteristics.
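The two approaches above can be sketched in a few lines. This is a generic illustration under stated assumptions, not the laboratory's implementation: `robust_location_scale` uses the median and the median absolute deviation (MAD) as one common robust alternative to the mean and standard deviation, and `collective_decision` uses a plain majority vote as one simple way of combining recognizers. Both function names and the sample data are hypothetical.

```python
from collections import Counter
from statistics import median


def robust_location_scale(values):
    """Median and scaled MAD: estimates that resist outliers.

    The factor 1.4826 makes the MAD comparable to a standard
    deviation when the data are normally distributed.
    """
    centre = median(values)
    mad = median(abs(v - centre) for v in values)
    return centre, 1.4826 * mad


def collective_decision(votes):
    """Majority vote over the labels proposed by several recognizers.

    Returns (label, share_of_votes); label is None when the top
    two labels are tied, so the caller can reject the decision.
    """
    if not votes:
        raise ValueError("at least one vote is required")
    ranked = Counter(votes).most_common()
    best_label, best_count = ranked[0]
    share = best_count / len(votes)
    if len(ranked) > 1 and ranked[1][1] == best_count:
        return None, share
    return best_label, share


# Usage: one noisy frame barely moves the robust estimates,
# and two of three recognizers agree on the word.
print(robust_location_scale([1.0, 1.1, 0.9, 1.0, 50.0]))
print(collective_decision(["yes", "yes", "no"]))
```

In the robust estimate, the single outlier of 50.0 leaves the median at 1.0, whereas an ordinary mean would be pulled above 10. In the vote, each label stands for the output of a recognizer that, by assumption, was built on a different set of speech signal characteristics.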