TTS SYNTHESIS FOR RUSSIAN LANGUAGE

Olga Krivnova (okri@philol.msu.ru)
Nina Zinovieva (nina@philol.msu.ru)
Leonid Zaharov (leon@philol.msu.ru)
Grigoriy Strokin (grg@philol.msu.ru)
Aleksey Babkin (avb@science.park.ru)

Lomonosov Moscow State University, Moscow, Russia


        ABSTRACT

        This paper describes the main principles of the Russian text-to-
speech synthesis system developed by the Speech Group of the
Philological Department, Lomonosov Moscow State University, Russia. The
system combines two methods: concatenation on the segment level (using
linguistically motivated units - allophones - spliced together to form
the synthesized speech wave) and a rule-based method on the prosodic
level (generating melodic and duration settings that modify the speech
wave created from the allophones according to the prosodic
characteristics of the syntagma being synthesized).

        1. GENERAL ARCHITECTURE OF THE SYSTEM

        Our system consists of the following functional blocks (or
modules):

        o Automatic transcriber
        converts input texts into a sequence of phoneme symbols
organized as phrases or syntagmas with attached special marks
(rhythmical, accentuation, intonation) for prosodic settings.

        o Allophonic coding module
        converts the transcribed texts into the sequence of codes
(names) of phoneme-in-context (allophones) elements for concatenation.

        o The block of prosodic parameterization
        assigns duration (in ms) and melodic (in semi-tones and Hertz)
values to the chosen allophones according to the intonation and
prominence structure of a phrase and the phonetic characteristics of
the allophones themselves.

        o The allophonic data-base (acoustic inventory)
        contains a set of allophone files in wave format (prerecorded
and stored segments of natural speech).

        o The block of control file generation
        forms the representation of the synthesized phrase as a sequence
of concatenation elements' code names with assigned duration and
fundamental frequency values.

        o The block of speech signal generating
        extracts the chosen concatenation units from the allophonic
data-base and splices them together, smoothes the junctures between
these elements and transforms them according to controlled data.

        The system architecture described above is shown in Fig. 1. A
more detailed description of each of these blocks is given below (see
also [1; 2]).
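
        As an illustration only, the data flow between these blocks can be
sketched as a chain of placeholder functions (all names and return
values here are hypothetical, not the system's actual interfaces):

```python
# Sketch of the data flow between the processing blocks.
# Every function here is a placeholder standing in for a whole module.

def transcribe(text: str) -> list[str]:
    """Automatic transcriber: orthographic text -> phoneme symbols."""
    return [ch for ch in text if ch.isalpha()]          # placeholder

def code_allophones(phonemes: list[str]) -> list[str]:
    """Allophonic coding: phonemes-in-context -> six-figure code names."""
    return [f"{i:06d}" for i in range(len(phonemes))]   # placeholder

def assign_prosody(codes: list[str]) -> list[tuple[str, int, int]]:
    """Prosodic parameterization: attach duration (ms) and F0 (Hz)."""
    return [(c, 100, 150) for c in codes]               # placeholder values

def make_control_file(records: list[tuple[str, int, int]]) -> str:
    """Control file generation: one line per allophone code."""
    return "\n".join(f"{c}\t{dur}\t{f0}" for c, dur, f0 in records)

control = make_control_file(assign_prosody(code_allophones(transcribe("mir"))))
```

The real blocks carry much richer structures (prosodic marks, break
levels, intonation models); the sketch only fixes the order of the
stages.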

        2. AUTOMATIC TRANSCRIBER (TRANS)

        Input text representation
        TRANS takes as input any sequence of orthographic words separated
by spaces or punctuation marks. In each word the stressed vowel should
be marked (if the word contains any vowels).
        Compound words are allowed to have more than one stress mark, one
of which is taken as primary. Word stress placement can be performed
manually or automatically. In the latter case the input text is first
processed by a morphological parser based on the dictionary presented
in [3]. Besides stress placement, this parser determines the
grammatical features of the analyzed words (syntactic class, number,
gender, aspect, case, etc.). These grammatical features are intended to
be used to disambiguate grammatical homographs (ru+ki ~ ruki+). We are
working on this problem now, but it is a complex one and has not been
completed yet.
        As far as other text pre-processing problems are concerned, we
can handle word strings containing numbers and alphabetic abbreviations
(by rule or on the basis of a special dictionary of the most frequently
used items with their transcriptions), but these resources are not used
in our system yet.

        TRANS phoneme inventory
        We use the following phoneme inventory of Russian (our
transcription is based on the Russian alphabet; for convenience the
Russian phoneme symbols are replaced here with their Latin
counterparts):
        1. Stressed vowels: [A], [U], [I], [Y], [O], [E];
        2. Unstressed vowels of the first degree of reduction: [a], [u],
[i], [y], [o], [e]. The last two unstressed vowels are not regularly
used in standard Russian, but they are sometimes pronounced in loan
words;
        3. Unstressed vowels of the second degree of reduction: [ax],
[ix], [ux];
        4. Non-palatalized consonants: [p], [t], [k], [b], [d], [g],
[f], [v],[s], [sh], [z], [zh], [x], [h], [c], [dz], [m], [n], [r],
[l];
        5. Palatalized consonants: [p'], [t'], [k'], [b'], [d'], [g'],
[f'], [v'],[s'], [sh'],[z'], [zh'], [x'], [ch'], [dzh'], [m'], [n'],
[r'], [l'], [J'], [j'].
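
        Encoded as data, the inventory above looks as follows (Latin
replacements of the Russian symbols, exactly as listed in the text; the
Python form itself is only an illustration):

```python
# The TRANS phoneme inventory, encoded as Python sets.

STRESSED_VOWELS = {"A", "U", "I", "Y", "O", "E"}
UNSTRESSED_V1 = {"a", "u", "i", "y", "o", "e"}   # first degree of reduction
UNSTRESSED_V2 = {"ax", "ix", "ux"}               # second degree of reduction
HARD_CONSONANTS = {"p", "t", "k", "b", "d", "g", "f", "v", "s", "sh",
                   "z", "zh", "x", "h", "c", "dz", "m", "n", "r", "l"}
SOFT_CONSONANTS = {"p'", "t'", "k'", "b'", "d'", "g'", "f'", "v'", "s'",
                   "sh'", "z'", "zh'", "x'", "ch'", "dzh'", "m'", "n'",
                   "r'", "l'", "J'", "j'"}

def is_vowel(symbol: str) -> bool:
    """True for any of the 15 vowel symbols of the inventory."""
    return symbol in STRESSED_VOWELS | UNSTRESSED_V1 | UNSTRESSED_V2
```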

        One can see that the phonemic inventory used in our system differs
slightly from that prevalent in Russian phonetic descriptions. This is
because, for the purposes of synthesis, we had to choose units that not
only represent the phonemic relationships but also have acoustic and
perceptual identity. This means that in some cases it is convenient to
have different transcription symbols even for phone pairs that are in
no meaningful contrast.

        Phonological (phonemic) rules and special word lists
        The standard phonological rules of TRANS implement the
"letter - phoneme" and "phoneme - phoneme" mappings, which include such
operations as elimination of spelling fictions, processing of the
pronunciation of some consonant clusters, removal of the hard and soft
signs of spelling, vowel-letter processing with the corresponding
interpretation of hardness/softness of consonants, positional
alternations of the voiced/unvoiced and hard/soft features of
consonants, vowel reduction, and so on. Processes operating both within
a word and between words are taken into account.
        Irregular pronunciation of some word classes (e.g. loan words)
and even of individual words is accounted for by special word lists.
There are 54 such lists in TRANS.
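
        One of the phoneme-to-phoneme operations listed above, the
positional voiced/unvoiced alternation, can be illustrated by word-
final devoicing of obstruents. The sketch below uses the paper's Latin
transcription; it is an illustrative rule, not the actual TRANS code:

```python
# Word-final devoicing: a voiced obstruent at the end of a word is
# replaced by its unvoiced counterpart (illustrative rule only).

DEVOICE = {"b": "p", "d": "t", "g": "k", "v": "f", "z": "s", "zh": "sh",
           "b'": "p'", "d'": "t'", "g'": "k'", "v'": "f'", "z'": "s'"}

def devoice_final(phonemes: list[str]) -> list[str]:
    """Apply word-final devoicing to a word given as a phoneme list."""
    if phonemes and phonemes[-1] in DEVOICE:
        return phonemes[:-1] + [DEVOICE[phonemes[-1]]]
    return phonemes

# e.g. "gorod" (city): the final /d/ is realized as [t]
```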

        Rhythm and accentuation
        TRANS assigns a degree of prominence to each vowel in the
synthesized phrase as its rhythmical feature. We distinguish three
degrees of syllable prominence within a word and four degrees of
lexically stressed syllable prominence within a phrase (1 for full
clitics, 2 for functional words, 3 for non-nuclear meaningful words, 4
for the nuclear meaningful word). The last meaningful word in the
phrase carries the nuclear phrase stress by default. Although we can
synthesize phrases with different focus accents, we have no rules to
determine their location automatically: it must be done manually with a
special symbol (\) assigned to the lexically stressed vowel (instead of
the ordinary lexical stress marker).

        Prosodic word grouping and phrasing
        Within a phrase this process is closely connected with the
rhythmization of the word sequence and is realized in our system by a
special feature, the "degree of prosodic break", assigned to the blanks
between words. There are three break levels: 0 - between a full clitic
(e.g. a preposition) and a meaningful word; 1 - between a functional
word and a meaningful word; 2 - between meaningful words. This
information is taken into account by the phonological rules when
processing external phoneme sandhi and vowel reduction. These breaks
cannot be realized as pauses; they reflect only the degree of word
autonomy in the phrase's sound pattern, which is supposed to be
coherent.
        Prosodic phrasing, on the contrary, implies a pause after each
phrase. We distinguish three degrees of pause: short (about 250 ms),
moderate (about 400 ms), and long (about 800 ms). As far as the
localization of phrase boundaries is concerned, this problem is under
investigation; at present boundary positions are fully determined by
punctuation marks.
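
        The break levels and pause degrees above can be summarized in two
small tables (values taken from the text; the Python form is only an
illustration):

```python
# Degrees of prosodic break between words (no pause is realized):
BREAK_LEVELS = {
    0: "between a full clitic (e.g. a preposition) and a meaningful word",
    1: "between a functional word and a meaningful word",
    2: "between meaningful words",
}

# Degrees of pause after a phrase (approximate durations in ms):
PAUSE_MS = {"short": 250, "moderate": 400, "long": 800}
```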

        Intonation
        TRANS assigns one of 7 intonation models to each phrase. In our
system we use the following models: 2 models of finality and 2 models
of non-finality for affirmative sentences, 3 interrogative models
(general question, special question, comparative question), and 1 model
for exclamatory sentences. All models allow different positions of the
melodic center in the phrase (in some cases its position is determined
by the nuclear phrase stress, in others by the focus accent mark). The
choice of the appropriate model is based on the punctuation mark and
some lexical information, first of all whether the phrase contains
words of certain lexical classes (e.g. interrogative pronouns).
Obviously these cues are not sufficient; moreover, the relation between
punctuation and intonation models is rather complicated, especially in
Russian. This problem is also under investigation, and it is clear to
us that in the general case some semantic and syntactic analysis will
be needed to solve it.

        3. ALLOPHONIC CODING MODULE

        This module converts the phonemic symbols used in the transcribed
texts into a sequence of codes (names) of phoneme-in-context elements
(allophones in our system) for concatenation.
        In defining the characteristics of the basic unit of
concatenation, we proceed from the following three assumptions [7]:
        1. the number of context-dependent variants is significantly
larger for vowels than for consonants;
        2. different consonants are affected by context to different
degrees;
        3. because of the prevalent CV type of the Russian syllable, the
left context is more important for vowels while the right context is
more important for consonants.
        According to these assumptions, and on the basis of a vast amount
of preliminary expert estimation of phoneme-sized wave segments taken
from different contextual environments, we divided the set of phonemes
into classes according to the contextual susceptibility of the
different phonemes. The names of allophones derived from the
transcribed texts and the names of concatenation units in the database
reflect these specifics. Allophone names are six-figure codes organized
in the following way: the first figure refers to the contextual group
of the coded phoneme; the second figure is the individual name of the
phoneme within the group; the third and fourth figures encode the left
significant context; and the fifth and sixth the right significant
context of the given phoneme.
        Example: 811010 means that this allophone represents a phoneme of
the 8th contextual group, the phoneme is a stressed /A/, in the
position after and before an alveolar consonant.
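
        The six-figure naming scheme can be decoded mechanically. The
sketch below splits a code into its four fields; the meaning of the
individual digit values (which group is "8", which context is "10") is
defined by the system's own tables, not shown here:

```python
def parse_allophone_code(code: str) -> dict:
    """Split a six-figure allophone code name into its four fields."""
    if len(code) != 6 or not code.isdigit():
        raise ValueError("allophone codes are exactly six digits")
    return {
        "context_group": int(code[0]),    # contextual group of the phoneme
        "phoneme": int(code[1]),          # phoneme's name within the group
        "left_context": int(code[2:4]),   # significant left context
        "right_context": int(code[4:6]),  # significant right context
    }

fields = parse_allophone_code("811010")   # the example from the text
```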

        Allophone coding is the main but not the only procedure of this
module. There are two more operations: splitting of some phonemes
(e.g. /d/ -> dPause + dBurst) and merging of some phonemes
(e.g. /ix a/ -> 'A).

        4. THE BLOCK OF PROSODIC PARAMETRIZATION

        Duration
        The duration rules were designed for, and are applied separately
to, vowels (on the basis of the quantitative model presented in [4; 5])
and consonants. The duration patterns of vowels are formed in
accordance with their prominence levels and phonetic quality. Besides,
for the stressed vowel of the last meaningful word in a phrase, we also
take into consideration the number of syllables and the number of
stressed vowels preceding it. We also apply rules of vowel final
lengthening (regardless of the reduction level and vowel phonetic
quality) before a pause. There are also special duration rules for
processing sequences of vowels.
        As far as the influence of consonants on vowel duration is
concerned, we account for it only in the most prominent cases, such as
the position before or after sonorants and unvoiced consonants.
        The general rules for consonant duration are based on the
following factors: position of the consonant with respect to the phrase
boundaries; intervocalic vs. non-intervocalic position; position in a
consonant cluster; the prominence level of the following vowel; and
simple vs. complex structure of the basic concatenation units used for
the synthesis of the consonants. The phonetic quality of consonants and
coarticulation effects on duration in clusters are also taken into
account.
        In our system it is possible to control the overall tempo of
pronunciation (evenly, only on consonants, or only on vowels).
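
        The tempo control mentioned above can be sketched as a scaling of
segment durations applied evenly, to vowels only, or to consonants only
(a hypothetical helper, not the system's code):

```python
# Scale segment durations; `factor` < 1 speeds the phrase up.
# Each segment is (is_vowel, duration_ms).

def apply_tempo(segments, factor, mode="even"):
    """mode: "even", "vowels", or "consonants"."""
    out = []
    for is_vowel, dur in segments:
        scale = (mode == "even"
                 or (mode == "vowels" and is_vowel)
                 or (mode == "consonants" and not is_vowel))
        out.append((is_vowel, dur * factor if scale else dur))
    return out
```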

        Melodic and fundamental frequency contours
        The modeling of melodic patterns is based on generating tone
turning points (or target points of tone inflection) and their
parameters in the frequency and time domains. In this respect our
approach is close to the so-called linear intonation models [6].
        The rules for phrase melodic patterns usually assign two tonal
values (in semi-tones) to every allophone, as its starting and final
points. If necessary, a third value can be assigned to any point inside
the allophone. Thus, on the whole, any melodic contour is approximated
by linear tonal movements.
        The assigned values are calculated from left to right in syllabic
cycles, that is, within the frame of the CnV sequence, where Cn is any
number of consonants (including 0) preceding the current vowel. The
allophonic resolution of the same tone movement is more detailed for
prominent stressed syllables (especially nuclear and focus syllables)
and less detailed for unstressed and weakly prominent stressed
syllables.
        To assign the tone values we consider the following factors: the
type of the intonation model; the position of the syllable with respect
to the melodic center (nucleus or focus) of the phrase (the center
itself, to the right of it, to the left of it); the prominence level of
the vowel (for stressed ones); the position of the syllable or syllable
sequence with respect to the phrase boundaries; the number and position
of the syllable in the syllable chain (for unstressed and atonic
syllables); the phonetic structure of the syllable; and the position of
the allophone with respect to the beginning of the syllable and its
vowel nucleus.
        The melodic contour described in semi-tones is then transformed
into Hz values, taking into account the voiced/unvoiced feature of the
consonants. This procedure also takes into account the base tone of the
speaker in Hz.
        As far as the global tone parameters are concerned, we can
control the position of the contour within the whole voice range and
the width of the frequency range.
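
        The conversion from semi-tones to Hz follows the standard
exponential relation between musical intervals and frequency. A minimal
sketch, with the speaker's base tone taken as an illustrative 150 Hz
parameter:

```python
def semitones_to_hz(semitones: float, base_hz: float = 150.0) -> float:
    """Map a tone value in semi-tones relative to the speaker's base
    tone to Hz: one octave (12 semi-tones) doubles the frequency."""
    return base_hz * 2.0 ** (semitones / 12.0)
```

Shifting `base_hz` moves the whole contour within the voice range, and
scaling the semi-tone values widens or narrows the frequency range,
which corresponds to the two global tone parameters mentioned above.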

        Energy
        We can control the global and local trends of this parameter, but
the corresponding rules are not incorporated into our system yet.

        5. THE ALLOPHONIC DATA-BASE (ACOUSTIC INVENTORY)

        The allophonic data-base is a set of allophone files in wave
format, each file being named according to the allophonic coding
conventions described above. This acoustic inventory was derived from a
Russian word list specially constructed and recorded for this purpose.
Each allophone wave was cut manually from contextually representative
surroundings which reflect the contextual group influence in the most
prominent way.
        We use two acoustic inventories: one for a male voice (SR 11025
Hz; SS 8 bit) with 158 consonant allophones and 530 vowels; the other
for a female voice (SR 22050 Hz; SS 16 bit) with 200 consonant
allophones and around 1000 vowels.
        All vocal sounds in the data-bases are marked semi-automatically
according to their pitch periods in order to generate the output speech
signal.

        6. CONTROL FILE GENERATION

        This module converts the transcribed phrase into a sequence of
allophone code names with assigned duration and fundamental frequency
values. Here is an example of the control file used to generate the
phrase "Zdravstvuyte, dorogie druz'ya!" (Hello, dear friends!) for the
female voice.




        Allophone       Duration                F0 (Hz)
        code            (in % for cons.,
                        in ms for vow.)

        220301          85              150     178
        000100          85              178     211
        020001          85              0       0
        610101          90              0       0
        610707          90              0       0
        811510          135             251     188     f265 40
        210101          55              0       0
        100000          60              0       0
        120001          60              0       0
        510102          65              173     165
        991116          45              165     158
        710201          60              158     157
        100000          65              0        0
        150004          65              0       0
        981610          45              156     154
        000100          70              154     153
        020001          70              0        0
        971015          85              153     152
        610707          80              0       0
        911516          112             151     150
        000100          80              150     150
        060004          90              0       0
        831616          130             150     150
        911610          105             154     158
        000100          70              158     155
        020001          70              0       0
        610102          75              0       0
        610707          75              0       0
        921516          97              152     150
        320304          80              150     150
        710404          95              150     150
        811618          227             150     133
        100000          800             0       0
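
        A record of this control file can be parsed as in the sketch
below. Each line carries an allophone code, a duration, and two F0
values (zeros for unvoiced allophones); the interpretation of optional
trailing fields, such as the internal F0 target "f265 40", follows the
example above:

```python
def parse_control_line(line: str) -> dict:
    """Split one control-file record into its fields (a sketch)."""
    parts = line.split()
    return {
        "code": parts[0],                  # six-figure allophone name
        "duration": int(parts[1]),         # % (cons.) or ms (vow.)
        "f0": [int(p) for p in parts[2:4]],  # start and end F0, in Hz
        "extra": parts[4:],                # e.g. an internal F0 target
    }

rec = parse_control_line("811510  135  251  188  f265 40")
```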

        7. SPEECH SIGNAL GENERATION

        Signal generation is carried out according to the phrase control
file, whose structure was described above. The necessary allophones are
extracted from the database and spliced together.
        To transform the base allophones to the duration and fundamental
frequency values given by the phrase control file, we use procedures
close to the PSOLA technique in the time domain [8].
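
        The splicing-and-smoothing step can be illustrated by a linear
crossfade at the juncture of two wave chunks. This is only a sketch of
the concatenation idea; the system's actual duration and F0 transforms
use a PSOLA-like technique, which is not shown here:

```python
def splice(a: list[float], b: list[float], fade: int = 4) -> list[float]:
    """Concatenate two sample chunks, crossfading `fade` samples at the
    juncture to smooth the transition (linear fade-out/fade-in)."""
    fade = min(fade, len(a), len(b))
    mix = [a[len(a) - fade + i] * (1 - (i + 1) / fade)
           + b[i] * ((i + 1) / fade)
           for i in range(fade)]
    return a[:len(a) - fade] + mix + b[fade:]
```

In practice the crossfade length would be chosen relative to the
sampling rate and aligned with the pitch marks stored in the data-base.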

        Demo examples of Russian speech synthesized by our system (in WAV
format) can be found on the Internet at
        http://isabase.philol.msu.ru/SpeechGroup


        8. REFERENCES

        1. N.V.Zinovieva, O.F.Krivnova. Lingvisticheskoe obespechenie
programmnogo sinteza rechi (Linguistic Support for Programmed Speech
Synthesis) // Vestnik MGU, s. 9. Philologia. N3. M., 1994.

        2. N.V.Zinovieva, O.F.Krivnova, L.M.Zaharov. Programmniy sintez
russkoy rechi (sintezator "Agafon") (Automatic Speech Synthesis for
Russian: the "Agafon" Synthesizer) // Computational Linguistics and its
Applications. International Workshop "Dialogue'95". Kazan, May 31 -
June 4, 1995.

        3. A. A. Zaliznjak. Grammaticheskiy slovar' russkogo yazyka
(Grammatical Dictionary of the Russian Language). Moscow, Russkij
Yazyk, 1977.

        4. O.F.Krivnova. Kolichestvennaya ocenka vozdeystviya
suprasegmentnih faktorov na dlitel'nost' udarnih glasnih v sintagme
(Quantitative Model of Stressed Vowel Duration under the Influence of
Suprasegmental Factors) // Proceedings of the 12th All-Union Seminar on
Automatic Speech Recognition and Synthesis. Novosibirsk, 1984.

        5. O.F.Krivnova. Durational Patterns of Russian Syntagma: The
Standard Scheme and its Modifications // Proc. of the XIth
International Congress of Phonetic Sciences. Tallinn, 1987.

        6. Session: The Structure of Intonation - Linear or
Superpositional // Proc. of the XIIIth International Congress of
Phonetic Sciences. Stockholm, 1995.

        7. N.V.Zinovieva. Phonetically Sufficient Allophonic Database for
Concatenation Synthesis of Russian Speech // Proc. of the XIIIth
International Congress of Phonetic Sciences. Stockholm, 1995.

        8. Charpentier F., Moulines E. Pitch-synchronous waveform
processing techniques for text-to-speech synthesis using diphones //
Eurospeech'89. Vol. 2. Paris, 1989.