Dialogue Speech Recognition by Simultaneous Spotting of Phonemes, Words and Phrases
Yasuo ARIKI
Department of Electronics and Informatics, Ryukoku University
Seta, Otsu, JAPAN
e-mail: ariki@rins.ryukoku.ac.jp
In human speech recognition, listeners can discriminate between known
and unknown words in natural speech. This indicates two points.
First, we can easily locate, or spot, important known words and then
understand the meaning of the sentence while skipping unknown words.
Second, we can learn the phoneme sequence of an unknown word from its
utterance and sometimes learn its meaning from the context.
This word spotting function should also be effective in a man-machine
dialogue system, because human speech is neither rigid nor complete
enough to be analyzed computationally with a natural language grammar.
The purpose of this study is to clarify techniques for discriminating
known words from unknown words in natural speech without using any
language grammar, and for recognizing dialogue speech based on the
spotted words and phrases.
To discriminate known words, word knowledge such as word duration
should be utilized. For example, word duration can be predicted from
the word dictionary and from phoneme durations, which can be estimated
by segmenting the input utterance into a phoneme sequence. Therefore,
we first segment the input utterance into a phoneme sequence and then
use this phoneme information together with the word dictionary to
discriminate known words.
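The duration prediction described above can be sketched as follows.
This is a minimal illustration, not the procedure used in the study:
it assumes that a phoneme's duration is estimated as its mean segment
length in the input utterance and that a word's duration is the sum of
its phonemes' mean durations. The function names, the segment format
and the toy data are hypothetical.

```python
from collections import defaultdict

def mean_phoneme_durations(segments):
    """Mean duration (in frames) of each phoneme in a segmented utterance.

    segments: list of (phoneme, start_frame, end_frame), end exclusive.
    """
    totals = defaultdict(int)
    counts = defaultdict(int)
    for phoneme, start, end in segments:
        totals[phoneme] += end - start
        counts[phoneme] += 1
    return {p: totals[p] / counts[p] for p in totals}

def predict_word_duration(word, dictionary, durations):
    """Assumed model: word duration = sum of its phonemes' mean durations."""
    return sum(durations[p] for p in dictionary[word])

# Toy data (hypothetical): one utterance segmented into four phonemes.
segments = [("k", 0, 5), ("a", 5, 15), ("w", 15, 20), ("a", 20, 32)]
dictionary = {"kawa": ["k", "a", "w", "a"]}

durations = mean_phoneme_durations(segments)
predicted = predict_word_duration("kawa", dictionary, durations)
print(predicted)
```

A spotted candidate whose length differs greatly from the predicted
duration could then be rejected; the rejection threshold would have to
be tuned and is not specified here.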
In this paper, three word spotting techniques are compared, two of
them are implemented, and their results are reported.
Word spotting techniques can be classified into three main groups.
The first is the probability ratio method. The probability ratio is
the ratio of the word probability to the probability of the best
phoneme sequence contained in the input utterance. If the input
utterance is a known word, the probability ratio approaches unity.
Using this property, known words are searched for and extracted from
continuous speech.
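The probability ratio test can be sketched as follows, working in the
log domain. This is a minimal illustration under stated assumptions:
the log-likelihoods are taken as already computed for each candidate
segment, and the acceptance threshold of 0.8 is a hypothetical value,
not one reported in this paper. All names are illustrative.

```python
import math

def probability_ratio(word_loglik, best_phoneme_loglik):
    """Ratio of word probability to best phoneme sequence probability."""
    return math.exp(word_loglik - best_phoneme_loglik)

def spot_by_ratio(candidates, threshold=0.8):
    """Keep candidates whose probability ratio is close to unity.

    candidates: list of (word, start, end, word_loglik, phoneme_loglik).
    threshold: hypothetical acceptance level, to be tuned on real data.
    """
    hits = []
    for word, start, end, word_ll, phon_ll in candidates:
        ratio = probability_ratio(word_ll, phon_ll)
        if ratio >= threshold:
            hits.append((word, start, end, ratio))
    return hits

# Toy scores (hypothetical): the first segment is well explained by its
# word model, the second is explained far better by free phonemes.
candidates = [
    ("tokyo", 10, 42, -120.0, -119.9),
    ("kyoto", 50, 80, -150.0, -140.0),
]
hits = spot_by_ratio(candidates)
print([h[0] for h in hits])
```

Subtracting log-likelihoods before exponentiating avoids underflow
when the individual probabilities are very small.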
The second is the a posteriori probability method. The probability
ratio can be shown to have the meaning of an a posteriori probability.
Given the input utterance, the a posteriori probability of each word
is computed at every time t as gamma in the forward and backward
probability computation. Where the a posteriori probability shows a
local peak, a known word is located at that peak.
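The computation of gamma and the peak search can be sketched as
follows. This is a minimal illustration, not the implementation used
in the study: it runs an unscaled forward-backward pass on a toy
two-state model in which state 1 stands in for "word present", and the
peak floor of 0.5 is a hypothetical parameter. The transitions are
deliberately uniform so that the result can be checked by hand.

```python
def forward_backward(init, trans, emit_probs):
    """Return gamma[t][s] = P(state s at time t | all observations).

    init[s]: initial state probability.
    trans[r][s]: transition probability from state r to state s.
    emit_probs[t][s]: likelihood of the frame at time t given state s.
    No scaling is applied, so this suits short toy sequences only.
    """
    T, S = len(emit_probs), len(init)
    alpha = [[0.0] * S for _ in range(T)]
    beta = [[1.0] * S for _ in range(T)]
    for s in range(S):
        alpha[0][s] = init[s] * emit_probs[0][s]
    for t in range(1, T):
        for s in range(S):
            prev = sum(alpha[t - 1][r] * trans[r][s] for r in range(S))
            alpha[t][s] = emit_probs[t][s] * prev
    for t in range(T - 2, -1, -1):
        for r in range(S):
            beta[t][r] = sum(
                trans[r][s] * emit_probs[t + 1][s] * beta[t + 1][s]
                for s in range(S)
            )
    gamma = []
    for t in range(T):
        norm = sum(alpha[t][s] * beta[t][s] for s in range(S))
        gamma.append([alpha[t][s] * beta[t][s] / norm for s in range(S)])
    return gamma

def local_peaks(values, floor=0.5):
    """Interior indices where values has a local maximum of at least floor."""
    return [
        t for t in range(1, len(values) - 1)
        if values[t] > values[t - 1]
        and values[t] >= values[t + 1]
        and values[t] >= floor
    ]

# Toy model (hypothetical): state 0 = background, state 1 = word present.
init = [0.5, 0.5]
trans = [[0.5, 0.5], [0.5, 0.5]]
emit_probs = [[0.9, 0.1], [0.6, 0.4], [0.2, 0.8], [0.6, 0.4], [0.9, 0.1]]

gamma = forward_backward(init, trans, emit_probs)
word_posterior = [row[1] for row in gamma]
peaks = local_peaks(word_posterior)
print(peaks)
```

Because a word is reported only at a local peak, neighbouring frames
with a high but non-maximal posterior do not produce extra detections.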
The last is the N-best method. This approach basically spots one
word by Viterbi decoding. To catch more than one word in an utterance,
the N best words are searched by changing the word.
We implemented two techniques: the probability ratio method and the a
posteriori probability method. In the probability ratio method, the
spotting algorithm is summarized as follows;
In the a posteriori probability method, the spotting algorithm is
summarized as follows;
Word spotting experiments were carried out on Japanese continuous
speech. There are 57 different words to be spotted. Their acoustic
models are constructed by concatenating phoneme HMMs according to a
dictionary. There are 47 phoneme HMMs, trained on half of the ATR
5240 important words spoken by one speaker. The continuous speech in
which the words are to be spotted consists of the ATR 25 sentences
spoken by the same speaker.
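Building a word model by concatenating phoneme HMMs can be sketched as
follows. This is a minimal illustration only: it assumes strictly
left-to-right phoneme HMMs in which each state has a self-loop and a
transition to the next state, and the three-state topology, the
probabilities and all names are hypothetical rather than taken from
the experiments.

```python
def concat_phoneme_hmms(word, dictionary, phoneme_hmms):
    """Chain per-phoneme left-to-right HMMs into one word-level state list.

    phoneme_hmms[p]: list of (self_loop_prob, emission_id), one per state.
    The probability of advancing to the next state is 1 - self_loop_prob;
    the last state of one phoneme advances into the first state of the
    next phoneme in the dictionary entry.
    """
    states = []
    for phoneme in dictionary[word]:
        for idx, (self_loop, emission_id) in enumerate(phoneme_hmms[phoneme]):
            states.append({
                "name": f"{phoneme}{idx}",
                "self_loop": self_loop,
                "next": 1.0 - self_loop,
                "emission": emission_id,
            })
    return states

# Toy inventory (hypothetical): three states per phoneme.
phoneme_hmms = {
    "a": [(0.6, "a_0"), (0.7, "a_1"), (0.5, "a_2")],
    "k": [(0.4, "k_0"), (0.5, "k_1"), (0.4, "k_2")],
    "i": [(0.6, "i_0"), (0.6, "i_1"), (0.5, "i_2")],
}
dictionary = {"aki": ["a", "k", "i"]}

word_states = concat_phoneme_hmms("aki", dictionary, phoneme_hmms)
print([s["name"] for s in word_states])
```

With this construction, adding a word to the vocabulary needs only a
new dictionary entry, since the phoneme models are shared.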
The spotting result based on the probability ratio showed a CR of 86.7
percent and an FA of 47.3 times when no phoneme information was used.
When phoneme information was used, it showed a CR of 45.6 percent and
an FA of 3.1 times. Here CR is the ratio of the number of correctly
spotted words to the total number of words to be spotted. False alarm
(FA) is the ratio of the number of falsely spotted words to the total
number of words to be spotted. Using phoneme information together with
phoneme boundaries thus greatly reduces FA (from 47.3 to 3.1 times),
although CR also falls (from 86.7 to 45.6 percent).
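The two scores defined above can be computed as in the following
sketch. The matching rule is an assumption made for illustration: a
spotted word counts as correct if it has the same label as a reference
word not yet matched and its interval overlaps that reference. The
criterion used in the experiments is not stated here, and the names
and toy data are hypothetical.

```python
def overlaps(a, b):
    """True if two (word, start, end) items overlap in time."""
    return a[1] < b[2] and b[1] < a[2]

def spotting_scores(hypotheses, references):
    """Return (CR, FA), both relative to the number of reference words."""
    matched = set()
    false_alarms = 0
    for hyp in hypotheses:
        hit = next(
            (i for i, ref in enumerate(references)
             if i not in matched and ref[0] == hyp[0] and overlaps(hyp, ref)),
            None,
        )
        if hit is None:
            false_alarms += 1
        else:
            matched.add(hit)
    total = len(references)
    return len(matched) / total, false_alarms / total

# Toy data (hypothetical): two words to be spotted, four detections.
references = [("tokyo", 10, 40), ("eki", 60, 80)]
hypotheses = [
    ("tokyo", 12, 41),   # correct
    ("eki", 100, 120),   # right label, wrong place
    ("kyo", 20, 35),     # wrong label
    ("tokyo", 15, 38),   # duplicate of an already matched word
]
cr, fa = spotting_scores(hypotheses, references)
print(cr, fa)
```

Because FA is normalized by the number of words to be spotted rather
than by the number of detections, it can exceed 1, which is why the
results are reported in "times".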
The spotting result based on the a posteriori probability showed a CR
of 37.8 percent and an FA of 1.12 times. Its FA is very small because
known words are detected only at local peaks of the a posteriori
probability of word existence.
In this paper, word spotting techniques are described that extract
phonemes and words simultaneously and discriminate known words using
phoneme information to reduce false alarms.
Keywords: spotting, known words, posteriori probability, phoneme information