SB: “Synthesis check by listening”


It is easier to deceive the eyes than the ears. When Juraś Hiecevič types the sentence «Mother washed dishes» on the keyboard, a “talking head” on the monitor immediately says it aloud. To the eye, the facial expressions on the screen are entirely convincing. But the ears still doubt the intonation. The differences between computer and natural speech become more evident when a large piece of text is played: there is not enough emotion. At first it is difficult to understand anything, but once you concentrate, the essence of what has been said is easily grasped.

Nevertheless, creating perfect, natural-sounding speech is one of the main problems being tackled today by scientists led by Juraś Hiecevič, Acting Head of the Speech Synthesis and Recognition Laboratory of the United Institute of Informatics Problems, National Academy of Sciences of Belarus.

Intonation is good, but not perfect

It would seem that a talking computer or mobile phone is nothing new. Anyone can get programs that read text aloud. However, commercial companies that adapt speech synthesis for applications in specific areas do not run ahead of the science. They use only proven scientific results.

Juraś Hiecevič notes that all synthesizers on the market are deficient in intonation: long sentences are pronounced in such a way that after one or two pages of text they become impossible to listen to. They have a relatively small number of intonation contours and rules for applying them, and as a result the spoken output is inaccurate. As a rule, companies wait for new scientific publications and only then put the latest developments into service. This is what happened, for example, with the processing and pronunciation of numbers in text, which was a rarity a few years ago but is a common option today.

Text-to-speech synthesis is one of the most difficult areas, and a great deal of work remains. For example, Juraś Hiecevič is developing a mechanism that lets the machine understand and correctly read aloud different combinations of numbers and letters, abbreviations and acronyms, and automatically place the stress in new words (unknown to the synthesizer), because not everyone adheres to the rules of writing. His thesis is devoted to linguistic text processing for a text-to-speech synthesizer: “We cannot even place the stress in unknown surnames. How do we teach the machine to find such solutions? There is an even more interesting task: homographs. There are not that many homographs in Russian and Belarusian, about 10 thousand, but they spoil the picture! How will a computer know which word is correct in «приобретает все бОльшую популярность или большУю» [“is gaining ever greater popularity”, where the stress alone decides between “greater” and “big”]? I know triple homographs in Belarusian. For example, «прыгожая казачка распавяла казачку свайму казачку» [“a beautiful Cossack woman told a little tale to her servant boy”]… We have a system that looks for homographs, but sometimes we still face the fact that the machine is not able to perceive the meaning, the context.” That is why improving speech synthesis is a problem on the same level as creating artificial intelligence.
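The linguistic processing described above can be illustrated with a minimal sketch. It is not the laboratory's system: the abbreviation table, the homograph table and the function names are all assumptions made for illustration, and the numbers are spelled out in English only to keep the example short. It shows the general idea of splitting mixed tokens such as "5min", spelling out numbers, expanding abbreviations, and flagging homographs that need context to resolve.

```python
import re

# Hypothetical tables; a real synthesizer would rely on a large curated dictionary.
ABBREVIATIONS = {"min": "minutes", "km": "kilometres"}
HOMOGRAPHS = {"большую": ["бОльшую", "большУю"]}  # stress variants from the article

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
        "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
        "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def number_to_words(n):
    """Spell out an integer from 0 to 999 in English."""
    if not 0 <= n <= 999:
        raise ValueError("this sketch only covers 0..999")
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] + ("-" + ONES[ones] if ones else "")
    hundreds, rest = divmod(n, 100)
    head = ONES[hundreds] + " hundred"
    return head + (" " + number_to_words(rest) if rest else "")


def normalize(text):
    """Turn raw text into a list of speakable words (punctuation is dropped)."""
    words = []
    # Split into runs of digits and runs of letters, so "5min" -> "5", "min".
    for token in re.findall(r"\d+|[^\W\d_]+", text):
        if token.isdigit():
            if len(token) <= 3:
                words.append(number_to_words(int(token)))
            else:
                # Fallback for long numbers: read them digit by digit.
                words.extend(ONES[int(d)] for d in token)
        elif token.lower() in ABBREVIATIONS:
            words.append(ABBREVIATIONS[token.lower()])
        else:
            words.append(token)
    return words


def find_homographs(words):
    """Return the positions of words whose stress depends on context."""
    return [i for i, w in enumerate(words) if w.lower() in HOMOGRAPHS]
```

For instance, `normalize("Train 42 leaves in 5min")` yields the words "Train", "forty-two", "leaves", "in", "five", "minutes". Finding a homograph is the easy half; choosing between its stress variants requires the contextual understanding that the article describes as still unsolved.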

Word composition

A lot has happened since the youth innovation forum of the National Academy of Sciences, where the project by Juraś Hiecevič and Dzmicier Pakladok, “Text-to-speech synthesizer for Russian and Belarusian for stationary and mobile platforms”, was judged the best. It was presented at the OSTIS-2012 conference on artificial intelligence, took part in the innovation week, and received a diploma at “TIBO-2012”. The authors are constantly invited to exhibitions. After all, these young scientists have taught the computer and the mobile phone to speak Belarusian. Previously, there were no Belarusian synthesizers at all!

Making a computer speak is a huge, painstaking job. You need to record the voice of a real person; the recording is then analysed in a special program that shows the smallest fluctuations in sound and is cut into “parts”: allophones (the smallest variants of phonemes), because the same letter “a” is pronounced differently in stressed and unstressed syllables. As a result, the database contains thousands of allophones. Then the algorithms are developed: they retrieve from the database the pieces needed for playback and join these smallest pieces into a word. It is important to note that the speaker does not have to record a large text. The scientists have developed a special, well-balanced text that takes six minutes to read and contains all the necessary phonemes.
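The step of fetching allophones from the database and joining them into a word can be sketched roughly as follows. This is an illustrative assumption, not the laboratory's algorithm: the allophone names, the toy database and the linear crossfade at each seam are all invented for the example, and real recordings would contain thousands of samples per unit.

```python
def crossfade_concat(units, overlap):
    """Join lists of audio samples, linearly blending `overlap` samples per seam."""
    if not units:
        return []
    out = list(units[0])
    for unit in units[1:]:
        n = min(overlap, len(out), len(unit))
        start = len(out) - n  # first sample of the blended region
        for i in range(n):
            w = (i + 1) / (n + 1)  # weight of the incoming unit grows across the seam
            out[start + i] = (1 - w) * out[start + i] + w * unit[i]
        out.extend(unit[n:])
    return out


def synthesize(allophone_names, database, overlap=2):
    """Look up each allophone in the database and join them into one waveform."""
    missing = [name for name in allophone_names if name not in database]
    if missing:
        raise KeyError("allophones not in database: " + ", ".join(missing))
    return crossfade_concat([database[name] for name in allophone_names], overlap)


# Toy database: constant "waveforms" stand in for recorded allophones.
# The names are hypothetical; stressed and unstressed "a" are separate entries.
toy_database = {
    "m": [1.0, 1.0, 1.0, 1.0],
    "a_stressed": [0.0, 0.0, 0.0, 0.0],
    "a_unstressed": [0.5, 0.5, 0.5, 0.5],
}
word = synthesize(["m", "a_stressed"], toy_database, overlap=2)
```

Here the two four-sample units overlap by two samples, so `word` has six samples, with the seam fading from the first unit into the second. Blending the seam is one common way to avoid audible clicks; the article does not say how the laboratory smooths its joins.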

Of course, software that converts text files into sound files must have a comprehensive dictionary and a system for replenishing it. This system, the brainchild of Juraś Hiecevič, handles more than two million words of Russian and Belarusian.

You don’t need iPhones

Forty years ago, Barys Labanaŭ, chief researcher at the Speech Synthesis and Recognition Laboratory of the United Institute of Informatics Problems, became the first in Eastern Europe to teach computers to pronounce typed text. He laid the foundation on which speech synthesis is now being improved in Belarus and even in Russia, for the most part, incidentally, by students of Barys Miefodzjevič. Juraś Hiecevič is one of them. He pulls out an old mobile phone with the words: “I keep it so that nobody thinks our programs need iPhones. This is an experimental model of a mobile text-to-speech synthesizer made in our laboratory. It requires only two megabytes of memory and can therefore work on basic devices.” And a synthesized voice starts reading “Zorka Vieniera”. It can also voice text messages and the caller’s name. All you need is text!

A computer system that creates audiobooks has also been developed. Recently, together with students, we voiced the textbook “Social Science” for the 10th grade. It took only about a week. The students said it was really good practice of a kind they had never had before. A “Talking Library” already exists. For example, it operates in the Molodechno school for children with visual impairments. In general, for those who have problems with vision, a speech synthesis program is a godsend. Braille books are expensive, not to mention that you will not find new literary releases among them. Our program will convert into audio any work whose electronic version is available online. The program is also useful for those who need to learn to speak again, for example after a stroke: the “talking head” pronounces the word on the monitor, and its facial movements can even be played in slow motion so that the patient can imitate them. The latest development makes the synthesizer suitable for announcement services: it is enough to enter the necessary information, and the voice will announce when and on which track the train is arriving, or what the next trolleybus stop is. Another new product is a phone robot. It calls dozens of subscribers’ numbers and reports their debt, stating the specific amount, as long as the data are in the computer.

The scientists’ nearest plan is to create an Internet version of the text-to-speech synthesizer. The first “talking” website is likely to be that of the National Library. Any visitor will then be able to search for a book by voice. All text that the mouse pointer passes over will be voiced, whether it is columns, tabs or sections. In general, there is a great number of practical applications for the text-to-speech synthesizer: in education, rehabilitation, the banking system, transport, and housing and communal services. It is now up to potential customers to take note of these scientific achievements and assess their benefits.

You say “the locomotive”, the machine writes “milk”

The dream of writers and journalists, however, is a more complicated matter: a computer that would recognize the voice and turn it into text, so that you could recite poems and articles while pacing around the room. Such programs are sold and advertised, but none of them can replace typing. As a rule, they recognize more or less reliably only the voices of their creators. A common situation is this: you say “the locomotive” and the machine writes “milk”. Juraś Hiecevič explains the low efficiency of such programs: “It is very difficult to single out words from the speech stream without confusing anything.” However, our scientists are looking for solutions.

Author: Julija Vasilišyna

Photo: Vitalij Hiĺ

Date of publication: 27.07.2012