TTS SYNTHESIS FOR RUSSIAN LANGUAGE
Olga Krivnova (email@example.com)
Nina Zinovieva (firstname.lastname@example.org)
Leonid Zaharov (email@example.com)
Aleksey Babkin (firstname.lastname@example.org)
Lomonosov Moscow State University, Moscow, Russia
ABSTRACT
This paper describes the main principles of the Russian text-to-speech synthesis system being developed by the Speech Group of the Philological Department, Lomonosov Moscow State University, Russia. The system combines two methods: concatenation at the segment level (linguistically motivated units, allophones, are spliced together to form the synthesized speech wave) and a rule-based method at the prosodic level (melodic and duration settings are generated to modify the speech wave created from the allophones according to the prosodic characteristics of the syntagma being synthesized).
1. GENERAL ARCHITECTURE OF THE SYSTEM
Our system consists of the following functional blocks (or modules):
o The automatic transcriber converts input texts into a sequence of phoneme symbols organized as phrases or syntagmas, with special marks (rhythmical, accentuation, intonation) attached for prosodic settings.
o The allophonic coding module converts the transcribed texts into a sequence of codes (names) of phoneme-in-context elements (allophones) for concatenation.
o The block of prosodic parameterization assigns duration (in msec) and melodic (in semi-tones and Hertz) values to the chosen allophones according to the intonation and prominence structure of the phrase and the phonetic characteristics of the allophones themselves.
o The allophonic data-base (acoustic inventory) contains a set of allophone files in wave format (prerecorded and stored segments of natural speech).
o The block of control file generation forms the representation of the synthesized phrase as a sequence of code names of concatenation elements with assigned duration and fundamental frequency values.
o The block of speech signal generation extracts the chosen concatenation units from the allophonic data-base, splices them together, smooths the junctures between them and transforms them according to the control data.
The described system architecture is shown in Fig. 1.
Each of the blocks specified above is described in more detail below (see also [1; 2]).
2. AUTOMATIC TRANSCRIBER (TRANS)
Input text representation
TRANS takes as input any sequence of orthographic words separated by spaces or punctuation marks. In each word the stressed vowel must be marked (if the word contains any vowels). Compound words may carry more than one stress mark, one of which is taken as primary. Word stress can be placed manually or automatically. In the latter case the input text is first processed by a morphological parser based on the dictionary presented in [3]. Besides placing stress, this parser determines the grammatical features of the analyzed words (syntactic class, number, gender, aspect, case, etc.). These grammatical features are to be used to disambiguate grammatical homograph forms (ru+ki ~ ruki+). We are working on this problem, but it is a complex one and the work is not yet complete. As far as other problems of preliminary text analysis are concerned, we can handle word strings with numbers and alphabetic abbreviations (by rule or on the basis of a special dictionary of the most frequently used items with their transcriptions), but these resources are not yet used in our system.
TRANS phoneme inventory
We use the following phoneme inventory of Russian (our transcription is based on the Russian alphabet; for convenience the Russian phoneme symbols are replaced here with their Latin equivalents):
1. Stressed vowels: [A], [U], [I], [Y], [O], [E];
2. Unstressed vowels of the first degree of reduction: [a], [u], [i], [y], [o], [e]. The last two unstressed vowels are not regularly used in standard Russian, but they are sometimes pronounced in loan words;
3. Unstressed vowels of the second degree of reduction: [ax], [ix], [ux];
4. Non-palatalized consonants: [p], [t], [k], [b], [d], [g], [f], [v], [s], [sh], [z], [zh], [x], [h], [c], [dz], [m], [n], [r], [l];
5. Palatalized consonants: [p'], [t'], [k'], [b'], [d'], [g'], [f'], [v'], [s'], [sh'], [z'], [zh'], [x'], [ch'], [dzh'], [m'], [n'], [r'], [l'], [J'], [j'].
One can see that the phoneme inventory used in our system differs slightly from the one prevalent in Russian phonetic descriptions. This is because, for the purpose of synthesis, we had to choose units that not only represent the phonemic relationships but also have acoustic and perceptual identity. This means that in some cases it is convenient to have different symbols in the transcription even for phone pairs that are not in meaningful contrast.
Phonological (phonemic) rules and special word lists
The standard phonological rules of TRANS implement the "letter - phoneme" and "phoneme - phoneme" mappings, which include such operations as elimination of spelling fictions, processing of the pronunciation of some consonant clusters, removal of hard and soft signs in spelling, processing of vowel letters with the corresponding interpretation of the hardness-softness of consonants, positional alternation of the voiced-unvoiced and hard-soft features of consonants, vowel reduction and so on. Processes operating both within a word and between words are taken into account. Irregular pronunciation of some word classes (e.g. loan words) and even of individual words is accounted for by special word lists. There are 54 such lists in TRANS.
Rhythm and accentuation
TRANS assigns a degree of prominence to each vowel in the synthesized phrase as its rhythmical feature. We distinguish three degrees of syllable prominence within a word and four degrees of lexically stressed syllable prominence within a phrase (1 for full clitics, 2 for functional words, 3 for non-nuclear meaningful words, 4 for the nuclear meaningful word). By default, the last meaningful word in the phrase is considered to carry the nuclear phrase stress.
Although we are able to synthesize phrases with different focus accents, we have no rules to determine their location automatically: this must be done manually with a special symbol (\) assigned to the lexically stressed vowel (instead of the ordinary lexical stress marker).
Prosodic word grouping and phrasing
Within a phrase this process is closely connected with the rhythmization of the word sequence and is realized in our system by a special feature, "degree of prosodic break", assigned to the blanks between words. There are three break levels: 0 - between a full clitic (e.g. a preposition) and a meaningful word; 1 - between a functional word and a meaningful word; 2 - between meaningful words. This information is taken into account by the phonological rules when processing external phoneme sandhi and vowel reduction. These types of breaks cannot be realized as pauses; they reflect only the degree of word autonomy within the sound pattern of the phrase, which is supposed to be coherent. Prosodic phrasing, on the contrary, assumes some pause after each phrase. We distinguish three degrees of pause: short (about 250 ms), moderate (about 400 ms) and long (about 800 ms). The location of phrase boundaries is still under investigation; at present it is fully determined by punctuation marks.
Intonation
TRANS assigns one of 7 intonation models to each phrase. In our system we use the following models: 2 models of finality; 2 models of non-finality for affirmative sentences, 3 interrogative models (general question, special question, comparative question), 1 model for exclamatory sentences. For all models the melodic center can occupy different positions in the phrase (in some cases its position is determined by the nuclear phrase stress, in others by the focus accent mark).
The choice of the appropriate model is based on the punctuation mark and some lexical information; first of all, we consider whether the phrase contains words of certain lexical classes (e.g. interrogative pronouns). Obviously these cues are not sufficient; moreover, the relation between punctuation and intonation models is rather complicated, especially in Russian. This problem is also under investigation, and it is clear to us that in the general case some semantic and syntactic analysis will be needed to solve it.
3. ALLOPHONIC CODING MODULE
This module converts the phonemic symbols used in the transcribed texts into a sequence of codes (names) of phoneme-in-context elements (allophones in our system) for concatenation. In defining the characteristics of the basic unit of concatenation we proceed from the following three assumptions:
1. the number of context-dependent variants is significantly larger for vowels than for consonants;
2. different consonants are affected by context to different degrees;
3. because the CV syllable type prevails in Russian, the left context is more important for vowels, while the right context is more important for consonants.
According to these assumptions and a large number of preliminary expert evaluations of phoneme-size wave segments taken from different contextual environments, we divided the set of phonemes into classes according to their contextual susceptibility. The names of the allophones derived from the transcribed texts and the names of the concatenation units in the database reflect this classification. The names of allophones are six-figure codes organized in the following way: the first figure refers to the contextual group of the coded phoneme, the second figure is the individual name of the phoneme within the group, the third and fourth figures reflect the left significant context, and the fifth and sixth figures reflect the right significant context of the given phoneme.
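The field layout of the six-figure names can be illustrated with a minimal sketch. The names `AllophoneCode` and `parse_allophone_code` are hypothetical and are not part of the system described here; the sketch only splits a name into the four fields listed above and does not interpret the values of those fields.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AllophoneCode:
    """Hypothetical container for the four fields of a six-figure name."""
    group: str    # first figure: contextual group of the phoneme
    phoneme: str  # second figure: phoneme within the group
    left: str     # third and fourth figures: left significant context
    right: str    # fifth and sixth figures: right significant context


def parse_allophone_code(code: str) -> AllophoneCode:
    """Split a six-figure allophone name into its fields."""
    # Names are kept as strings so that leading zeros are preserved.
    if len(code) != 6 or not (code.isascii() and code.isdigit()):
        raise ValueError(f"expected six digits, got {code!r}")
    return AllophoneCode(code[0], code[1], code[2:4], code[4:6])


print(parse_allophone_code("811010"))
```

Keeping every field as a string rather than an integer is a design choice of this sketch: some names begin with zeros, which an integer representation would drop.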
Example: the code 811010 means that the allophone represents a phoneme of the 8th contextual group, that the phoneme is a stressed /A/, and that it occurs in the position after and before an alveolar consonant. Allophone coding is the main but not the only procedure of the module described. There are two more operations: splitting of some phonemes (e.g. /d/ -> dPause + dBurst) and merging of some phonemes (e.g. /ix a/ -> 'A).
4. THE BLOCK OF PROSODIC PARAMETERIZATION
Duration
The duration rules were designed for, and are applied separately to, vowels (on the basis of the quantitative model presented in [4; 5]) and consonants. The duration patterns of vowels are formed in accordance with their prominence levels and phonetic quality. In addition, for the stressed vowel of the last meaningful word in a phrase, we take into consideration the number of syllables and the number of stressed vowels preceding it. We also apply rules of final vowel lengthening before a pause (regardless of the reduction level and phonetic quality of the vowel). There are also special duration rules for processing sequences of vowels. As far as the influence of consonants on vowel duration is concerned, we account for it only in the most prominent cases, such as the position before or after sonorants and unvoiced consonants.
The general rules for consonant duration are based on the following factors: the position of the consonant relative to the phrase boundaries; intervocalic versus non-intervocalic position; position in the consonant cluster; the prominence level of the following vowel; the simple or complex structure of the basic concatenation units used for the synthesis of the consonant. The phonetic quality of consonants and coarticulation effects on duration in clusters are also taken into account. In our system it is possible to control the overall tempo of pronunciation (evenly, only on consonants or only on vowels).
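The three tempo modes mentioned above can be illustrated with a short sketch. It assumes that tempo is controlled by a single multiplicative factor applied to already assigned durations; the text does not state how the system implements tempo control, and the function name `apply_tempo` and the tuple layout are hypothetical.

```python
def apply_tempo(segments, factor, mode="even"):
    """Scale segment durations by `factor` (values above 1 slow speech down).

    segments: list of (name, is_vowel, duration) tuples (hypothetical layout).
    mode: "even" scales every segment; "vowels" or "consonants" scales
          only that class and leaves the other untouched.
    """
    if factor <= 0:
        raise ValueError("factor must be positive")
    if mode not in ("even", "vowels", "consonants"):
        raise ValueError(f"unknown mode: {mode!r}")
    scaled = []
    for name, is_vowel, duration in segments:
        affected = (
            mode == "even"
            or (mode == "vowels" and is_vowel)
            or (mode == "consonants" and not is_vowel)
        )
        scaled.append((name, is_vowel, duration * factor if affected else duration))
    return scaled


# Illustrative values only, not taken from the system's rules.
phrase = [("d", False, 60.0), ("A", True, 120.0)]
print(apply_tempo(phrase, 1.25, mode="vowels"))
```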
Melodic and fundamental frequency contours
The modeling of melodic patterns is based on generating tone turning points (target points of tone inflection) and their parameters in the frequency and time domains. In this respect our approach is close to the so-called linear intonation models [6]. The rules for phrase melodic patterns usually assign two tonal values (in semi-tones) to every allophone, as its starting and final points. If necessary, a third value can be assigned to any point inside the allophone. Thus, on the whole, any melodic contour is approximated by linear tonal movements. The assigned values are calculated from left to right in syllabic cycles, that is, within the frame of a CnV sequence, where Cn is any number of consonants (including 0) preceding the current vowel. The breakdown of a given tone movement into allophones is more detailed for prominent stressed syllables (especially nuclei and foci) and less detailed for unstressed and weakly prominent stressed syllables. To assign the tone values we consider the following factors: the type of the intonation model; the position of the syllable relative to the melodic center (nucleus or focus) of the phrase (the center itself, to the right of it, to the left of it); the prominence level of the vowel (for stressed ones); the position of the syllable or syllable sequence relative to the phrase boundaries; the number and position of the syllable in the syllable chain (for unstressed and atonic syllables); the phonetic structure of the syllable; the position of the allophone relative to the beginning of the syllable and its vowel nucleus.
The melodic contour described in semi-tones is then transformed into Hz values, taking into account the voiced - unvoiced feature of the consonants. This procedure also takes into account the base tone (in Hz) characteristic of the speaker. As far as the global tone parameters are concerned, we can control the position of the contour within the whole voice range and the width of the frequency range.
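The two steps described above, linear movement between tonal target points and conversion from semi-tones to Hz, can be sketched as follows. The conversion uses the standard equal-tempered relation (12 semi-tones per octave) relative to a speaker base tone; the function names, the `voiced` flag and the convention of returning 0 for unvoiced segments are assumptions of this sketch, not a description of the system's actual rules.

```python
def interpolate_semitones(points, t_ms):
    """Linearly interpolate a tone value at time t_ms.

    points: list of (time_ms, semitones) target points sorted by time.
    Times outside the covered interval return the nearest end value.
    """
    if not points:
        raise ValueError("at least one target point is required")
    if t_ms <= points[0][0]:
        return points[0][1]
    for (t0, s0), (t1, s1) in zip(points, points[1:]):
        if t0 <= t_ms <= t1:
            if t1 == t0:
                return s1
            return s0 + (s1 - s0) * (t_ms - t0) / (t1 - t0)
    return points[-1][1]


def semitones_to_hz(semitones, base_hz, voiced=True):
    """Convert a tone value in semi-tones above the base tone to Hz."""
    if not voiced:
        return 0.0  # assumption: unvoiced segments carry no F0 value
    return base_hz * 2.0 ** (semitones / 12.0)


# Illustrative values only: a rise of 12 semi-tones over 100 ms
# for a hypothetical base tone of 150 Hz.
targets = [(0, 0.0), (100, 12.0)]
for t in (0, 50, 100):
    print(t, semitones_to_hz(interpolate_semitones(targets, t), 150.0))
```

Interpolating in semi-tones and converting afterwards keeps the movement linear on the tonal scale, which matches the description of contours as linear tonal movements.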
Energy
We can control the global and local trends of this parameter, but the rules for it are not yet incorporated in our system.
5. THE ALLOPHONIC DATA-BASE (ACOUSTIC INVENTORY)
The allophonic data-base is a set of allophone files in wave format, each file being named according to the allophonic coding principles described above. This acoustic inventory was derived from a Russian word list specially constructed and recorded for this purpose. Each allophone wave was cut manually from representative contexts that reflect the influence of the contextual group most prominently. We use two acoustic inventories: one for a male voice (SR 11025 Hz; SS 8-bit) with 158 consonant allophones and 530 vowels; the other for a female voice (SR 22050 Hz; SS 16-bit) with 200 consonant allophones and around 1000 vowels. All vocal sounds in the data-bases are marked semi-automatically according to their pitch periods for use in generating the output speech signal.
6. CONTROL FILE GENERATION
This module converts the transcribed phrase into a sequence of allophone code names with assigned duration and fundamental frequency values. Here is an example of the control file generated for the phrase "Zdravstvuyte, dorogie druz'ya!" (Hello, dear friends) for the female voice.
Allophone   Duration                F0
codes       (in % for cons.,
            in ms for vow.)
220301      85                      150 178
000100      85                      178 211
020001      85                      0 0
610101      90                      0 0
610707      90                      0 0
811510      135                     251 188 f265 40
210101      55                      0 0
100000      60                      0 0
120001      60                      0 0
510102      65                      173 165
991116      45                      165 158
710201      60                      158 157
100000      65                      0 0
150004      65                      0 0
981610      45                      156 154
000100      70                      154 153
020001      70                      0 0
971015      85                      153 152
610707      80                      0 0
911516      112                     151 150
000100      80                      150 150
060004      90                      0 0
831616      130                     150 150
911610      105                     154 158
000100      70                      158 155
020001      70                      0 0
610102      75                      0 0
610707      75                      0 0
921516      97                      152 150
320304      80                      150 150
710404      95                      150 150
811618      227                     150 133
100000      800                     0 0
7. SPEECH SIGNAL GENERATION
Signal generation is carried out according to the phrase control file, whose structure was described above. The necessary allophones are extracted from the database and spliced together. To transform the base allophones to the duration and fundamental frequency values given by the phrase control file, we use procedures that are close to the PSOLA technique in the time domain [8]. Demo examples of Russian speech synthesized by our system (in WAV format) can be found on the Internet at http://isabase.philol.msu.ru/SpeechGroup
8. REFERENCES
1. N.V.Zinovieva, O.F.Krivnova. Lingvisticheskoe obespechenie programmnogo sinteza rechi (Linguistic Support for Programmed Speech Synthesis) // Vestnik MGU, s.9. Philologia. N3. M., 1994.
2. N.V.Zinovieva, O.F.Krivnova, L.M.Zaharov. Programmniy sintez russkoy rechi (sintezator "Agafon") (Automatic Speech Synthesis for Russian Language: synthesizer "Agaphon") // Computational Linguistics and its Applications. International Workshop "Dialogue95". Kazan, May 31 - June 4, 1995.
3. A.A.Zaliznjak. Grammaticheskiy slovar' russkogo yazyka (Grammatical Dictionary of the Russian Language). Moscow, Russkij Yazyk, 1977.
4. O.F.Krivnova. Kolichestvennaya ocenka vozdeystviya suprasegmentnih faktorov na dlitel'nost' udarnih glasnih v sintagme (Quantitative Model of Stressed Vowel Duration under the Influence of Suprasegmental Factors) // Proceedings of 12th All-Union Seminar on Automatic Speech Recognition and Synthesis. Novosibirsk, 1984.
5. O.F.Krivnova. Durational Patterns of Russian Syntagma: The Standard Scheme and its Modifications // Proc. of the XI-th International Congress of Phonetic Sciences. Tallinn, 1987.
6. Session: The structure of Intonation - linear or superpositional // Proc. of the XIII-th International Congress of Phonetic Sciences. Stockholm, 1995.
7. N.V.Zinovieva. Phonetically Sufficient Allophonic Database for Concatenation Synthesis of Russian Speech // Proc. of the XIII-th International Congress of Phonetic Sciences. Stockholm, 1995.
8. Charpentier F., Moulines E. Pitch-synchronous waveform processing techniques for text-to-speech synthesis using diphones // Eurospeech89. Vol. 2. Paris, 1989.