The Phonemaphone–2000 text-to-speech system


The speech synthesizer is a knowledge-based software product that models the way a person reads an arbitrary text out loud. The product runs on Windows operating systems and works through a standard input/output device.

From the user’s perspective, the speech synthesizer is a new tool for voice output of information from a computer, which complements or, in some cases, replaces visual output. A computer user who applies the synthesizer can reduce eyestrain by receiving part of the information by voice. The user can also receive information while at some distance from the computer and, with an additional telephone interface, send and receive spoken information by phone. The speech synthesizer is therefore a unique tool for conveying information to blind users and is well suited to building computer-based speech-training systems.

The speech synthesizer (SS) is intended to read arbitrary text out loud rather than merely play previously recorded audio files. In effect, this technology opens one more data transmission channel, comparable to the one provided by the display. From the user’s perspective, the most practical way to use the SS is to integrate it into the operating system so as to make it multilingual and able to provide translation. Just as one can issue the Print command, one could issue a Talk command to invoke the SS. With speech synthesis, computers will be able to voice menu navigation, read screen messages and file catalogs, transmit voice messages by phone, and so on. These functions are especially significant for people with visual disabilities. For everyone else, the technology will add a new dimension of convenience to using a computer and will significantly reduce the strain on the eyes and nervous system.

The speech synthesizer is useful not only to PC users. The product can also be applied in automated systems with a voice interface; in household appliances, to announce commands and completed actions; in PDAs, electronic dictionaries, organizers, and mobile phones, to read screen messages; and in portable scanners, to read scanned text in real time.

The speech synthesizer works by creating a speech signal that corresponds to the input text. This requires an acoustic database of more than 2000 elementary sound units cut from natural speech recordings of a particular speaker. Synthesized speech therefore preserves the individual characteristics of that speaker’s voice, accent, and intonation. Creating several voice databases makes it possible to generate messages with different voices. Figuratively speaking, this approach amounts to computer cloning of a voice and speech.
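The core idea of unit concatenation can be illustrated with a minimal C++ sketch. The type and function names below (Unit, AllophoneDb, Synthesize) and the toy data are illustrative assumptions, not the product’s actual interfaces:

    #include <cstdint>
    #include <iostream>
    #include <map>
    #include <string>
    #include <vector>

    // One elementary sound unit cut from the recordings of a particular speaker.
    struct Unit {
        std::vector<int16_t> samples;   // PCM samples of the allophone
    };

    // A voice database: allophone name -> recorded unit.
    using AllophoneDb = std::map<std::string, Unit>;

    // Concatenate the units for a sequence of allophone names into one waveform.
    std::vector<int16_t> Synthesize(const AllophoneDb& db,
                                    const std::vector<std::string>& allophones) {
        std::vector<int16_t> waveform;
        for (const std::string& name : allophones) {
            auto it = db.find(name);
            if (it == db.end()) continue;   // unknown unit: skip (illustrative policy)
            const std::vector<int16_t>& s = it->second.samples;
            waveform.insert(waveform.end(), s.begin(), s.end());
        }
        return waveform;
    }

    int main() {
        AllophoneDb db;                            // in the product, filled from a voice database
        db["a'"] = Unit{ {0, 120, 240, 120} };     // toy unit with a few PCM samples
        std::vector<int16_t> wave = Synthesize(db, {"a'", "t"});
        std::cout << wave.size() << " samples\n";  // prints "4 samples": only the known unit is found
        return 0;
    }

In this picture, switching to a different voice amounts to loading a different AllophoneDb built from another speaker’s recordings.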

The speech synthesizer operates according to the following scheme. The input orthographic text is first handled by the text processing module, which assigns word stresses, performs letter-to-phoneme conversion, divides the text into syntagmas, and marks the intonation type of each syntagma. At the second stage, the resulting marked-up phonemic text is passed to the prosodic and phonetic processors. The phonetic processor generates positional and combinatorial allophones of the vowel and consonant phonemes. The prosodic processor computes the target prosodic parameters: the fundamental frequency, amplitude, and duration of each allophone.
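The order of these stages can be sketched in C++ as follows. The types and stub functions here are assumptions introduced only to show the data flow; they do not reproduce the product’s real interfaces:

    #include <string>
    #include <vector>

    // Data passed between the stages (illustrative types).
    struct Syntagma  { std::string phonemicText; int intonationType; };
    struct Allophone { std::string name; };
    struct Prosody   { double f0Hz; double amplitude; double durationMs; };

    // Stage 1: stress assignment, letter-to-phoneme conversion, segmentation
    // into syntagmas, intonation-type marking (stubbed here).
    std::vector<Syntagma> TextProcess(const std::string& text) {
        return { Syntagma{ text, 0 } };
    }

    // Stage 2a: positional and combinatorial allophones of the phonemes (stubbed).
    std::vector<Allophone> PhoneticProcess(const Syntagma&) { return {}; }

    // Stage 2b: target F0, amplitude and duration for each allophone (stubbed).
    std::vector<Prosody> ProsodicProcess(const Syntagma&,
                                         const std::vector<Allophone>& allophones) {
        return std::vector<Prosody>(allophones.size());
    }

    // Overall scheme: text processor first, then phonetic and prosodic processors;
    // the acoustic stage (waveform generation) would follow.
    void SynthesizeText(const std::string& text) {
        for (const Syntagma& s : TextProcess(text)) {
            std::vector<Allophone> allophones = PhoneticProcess(s);
            std::vector<Prosody>   targets    = ProsodicProcess(s, allophones);
            (void)targets;  // handed to the acoustic processor in the real system
        }
    }

    int main() { SynthesizeText("example input"); return 0; }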

The speech synthesizer was developed for Windows operating systems in the MS Visual C++ 6.0 development environment. Minimum system requirements: 20 MB of available hard-disk space for installation, a 166 MHz CPU, 32 MB of RAM, and a sound card.

The system consists of components corresponding to the text, phonetic, prosodic, and acoustic processors. Each processor is a COM object and operates independently of the others as data becomes available. The system resources are a database of word stresses and a database of allophones and multiphones. Both databases reside in dynamic link libraries, so they load quickly when the system is initialized and are unloaded when it shuts down. Word stresses are placed using a binary search algorithm. The input text is analyzed syntagma by syntagma: once a syntagma has been extracted, word stresses are placed and letter-to-phoneme and phoneme-to-allophone conversions are carried out. The prosodic (intonation) parameters of the syntagma are determined by its type. After being processed, the syntagma is pronounced. In parallel, the next syntagma is identified and processed in the same way. Thus a separate thread or processor is responsible for each action, making the delay between pronounced syntagmas hardly noticeable. The synthesized speech can be saved to a WAV file.
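As a rough illustration of the stress lookup, the following self-contained C++ sketch performs a binary search over a dictionary sorted by word. The StressEntry layout and the sample entries are assumptions made for illustration; the real word-stress database is stored in a dynamic link library:

    #include <algorithm>
    #include <iostream>
    #include <string>
    #include <vector>

    // One record of the word-stress database: a word and the index of its stressed vowel.
    struct StressEntry {
        std::string word;
        int stressedVowel;
    };

    // The database is kept sorted by word, so a stress can be found by binary search.
    int FindStress(const std::vector<StressEntry>& db, const std::string& word) {
        auto it = std::lower_bound(db.begin(), db.end(), word,
            [](const StressEntry& e, const std::string& w) { return e.word < w; });
        if (it != db.end() && it->word == word) return it->stressedVowel;
        return -1;  // word not in the dictionary
    }

    int main() {
        // A toy, pre-sorted dictionary (illustrative entries only).
        std::vector<StressEntry> db = {
            {"computer", 2}, {"synthesizer", 1}, {"telephone", 1}
        };
        std::cout << FindStress(db, "synthesizer") << "\n";  // prints 1
        return 0;
    }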
