Dialogue Speech Recognition by Simultaneous Spotting of Phonemes, Words and Phrases


Department of Electronics and Informatics, Ryukoku University
Seta, Otsu, JAPAN
e-mail: ariki@rins.ryukoku.ac.jp

In human speech recognition, listeners can discriminate between known and unknown words in natural speech. This indicates two things. First, we can easily locate, or spot, important known words and then understand the meaning of a sentence while skipping the unknown words. Second, we can learn the phoneme sequence of an unknown word from its utterances and sometimes learn its meaning from the context.
This word spotting function also seems effective in a man-machine dialogue system, because human speech is neither rigid nor complete enough to be analyzed computationally with a natural language grammar. The purpose of this study is to clarify techniques for discriminating known words from unknown words in natural speech without using any language grammar, and for recognizing dialogue speech based on the spotted words and phrases.
To discriminate known words, word knowledge such as word duration should be used. For example, the duration of a word can be predicted from the word dictionary and from phoneme durations, which can be estimated by segmenting the input utterance into a phoneme sequence. We therefore first segment the input utterance into a phoneme sequence and then use this phoneme information together with the word dictionary to discriminate known words.
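As a rough illustration of how a word duration might be predicted from a dictionary and phoneme durations, the following minimal sketch sums per-phoneme duration estimates and checks a candidate against that prediction. The function names, the example dictionary entry, the duration values and the 30 percent tolerance are all hypothetical and are not taken from the experiments reported here.

```python
def predict_word_duration(phonemes, phoneme_duration):
    """Sum the per-phoneme duration estimates (in frames) for one dictionary entry."""
    return sum(phoneme_duration[p] for p in phonemes)


def duration_is_plausible(observed, predicted, tolerance=0.3):
    """Accept a candidate whose length is within +/- tolerance of the prediction.

    The relative tolerance is an assumed acceptance rule, chosen for illustration.
    """
    return abs(observed - predicted) <= tolerance * predicted


# Hypothetical dictionary entry and duration estimates (frames per phoneme),
# standing in for values obtained by segmenting the input utterance.
dictionary = {"kyoto": ["k", "y", "o", "t", "o"]}
phoneme_duration = {"k": 6, "y": 5, "o": 9, "t": 6}

predicted = predict_word_duration(dictionary["kyoto"], phoneme_duration)
```

A candidate whose observed length is far from the predicted length would be rejected before any probability-based test is applied.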
In this paper, three word spotting techniques are compared, two of them are implemented, and their results are reported.
Word spotting techniques can be classified into three main groups. The first is the probability ratio method. The probability ratio is the ratio of the word probability to the probability of the best phoneme sequence for the input utterance. If the input utterance is a known word, the probability ratio approaches unity. Using this property, known words are searched for and extracted from continuous speech.
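One way to write the probability ratio down is sketched here. The notation is an assumption introduced for illustration: O is the observation sequence between a start time t_s and an end time t_e, w is a word model, and q ranges over phoneme sequences.

```latex
R(w; t_s, t_e) = \frac{P(O_{t_s:t_e} \mid w)}{\max_{q} P(O_{t_s:t_e} \mid q)}
```

If the word model is built from the same phoneme models that appear in the denominator, the phoneme sequence of w is one of the sequences q, so under that assumption R does not exceed 1 and comes close to 1 only when the segment really is the word w.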
The second is the a posteriori probability method. The probability ratio can be shown to have the meaning of an a posteriori probability. Given the input utterance, the a posteriori probability of each word is computed at every time t as the quantity gamma in the forward-backward probability computation. Where the a posteriori probability shows local peaks, known words are located at those peaks.
The last is the N-best method. This approach basically spots one word by Viterbi decoding. To catch more than one word in an utterance, the N best words are searched for by changing the word.
We implemented two of these techniques: the probability ratio method and the a posteriori probability method. In the probability ratio method, the spotting algorithm is summarized as follows:

  • The best words are selected as candidates at every time frame.
  • The candidates that satisfy conditions on the following are picked up: word duration, probability ratio, word probability, number of phonemes, difference of word probabilities, and difference of probability ratios.
  • To reduce false alarms, when the same word survives as a candidate over consecutive frames, only the position with the highest probability is kept.
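The last step of this list can be sketched as follows. This is a minimal illustration, not the implementation used in the experiments: it assumes the earlier steps have already produced one surviving candidate per frame as a hypothetical `(word, score)` pair, with `None` for frames where no candidate passed the conditions.

```python
from itertools import groupby


def keep_run_maxima(best_per_frame):
    """Collapse runs of the same word over consecutive frames to a single detection.

    best_per_frame: list with one entry per frame, either None (no candidate)
    or a (word, score) tuple. Returns a list of (frame_index, word, score),
    keeping only the highest-scoring frame of each run.
    """
    detections = []
    indexed = enumerate(best_per_frame)
    # Group consecutive frames that carry the same word (or no candidate at all).
    for word, run in groupby(indexed, key=lambda item: item[1][0] if item[1] else None):
        if word is None:
            continue  # frames without a surviving candidate
        frame, (_, score) = max(run, key=lambda item: item[1][1])
        detections.append((frame, word, score))
    return detections


# Hypothetical per-frame candidates after the condition checks.
frames = [("a", 0.2), ("a", 0.9), ("a", 0.5), None, ("b", 0.4), ("b", 0.3), ("a", 0.6)]
detections = keep_run_maxima(frames)
```

Because a run is broken whenever the candidate word changes or disappears, the same word can still be detected more than once in an utterance, but never more than once within one uninterrupted run.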

    In the a posteriori probability method, the spotting algorithm is summarized as follows:

  • Forward and backward probabilities are computed for the input speech.
  • Using the forward and backward probabilities, the gamma probability, which represents the probability that a word exists, is computed.
  • After the gamma of each word is smoothed in time, its local peaks are picked up.
  • The peaks that exceed a threshold are determined to be known words.
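The smoothing and peak-picking steps above can be sketched for a single word as follows. The gamma values are taken as given (the forward-backward computation is not shown), and the moving-average smoother, the window width and the threshold are assumptions chosen for illustration; the smoothing actually used is not specified here.

```python
def smooth(values, half_width=1):
    """Moving average over a window of up to 2 * half_width + 1 frames."""
    n = len(values)
    out = []
    for t in range(n):
        window = values[max(0, t - half_width):min(n, t + half_width + 1)]
        out.append(sum(window) / len(window))
    return out


def spot_peaks(gamma, threshold, half_width=1):
    """Return the frames where the smoothed gamma has a local peak above threshold."""
    s = smooth(gamma, half_width)
    peaks = []
    for t in range(1, len(s) - 1):
        is_local_peak = s[t] > s[t - 1] and s[t] >= s[t + 1]
        if is_local_peak and s[t] > threshold:
            peaks.append(t)
    return peaks


# Hypothetical gamma trajectory of one word over eight frames.
gamma = [0.0, 0.1, 0.9, 0.1, 0.0, 0.0, 0.3, 0.0]
peaks = spot_peaks(gamma, threshold=0.2)
```

In this example the weaker bump near the end of the trajectory is flattened by the smoothing and falls below the threshold, so only the strong peak is reported as a known word.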

    Word spotting experiments were carried out on Japanese continuous speech. There are 57 different words to be spotted. Their acoustic models are constructed by concatenating phoneme HMMs according to a dictionary. There are 47 phoneme HMMs, trained on half of the ATR 5240 important words spoken by one person. The continuous speech in which the words are to be spotted consists of the ATR 25 sentences spoken by the same person.
    The spotting result based on the probability ratio showed a CR of 86.7 percent and an FA of 47.3 (false alarms per word to be spotted) when no phoneme information was used. When phoneme information was used, it showed a CR of 45.6 percent and an FA of 3.1. Here CR is the ratio of the number of correctly spotted words to the total number of words to be spotted, and FA (false alarm) is the ratio of the number of falsely spotted words to the total number of words to be spotted. Using phoneme information, including phoneme boundaries, thus reduces false alarms, although CR also falls.
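    The two measures defined above can be computed as in the following minimal sketch. The function name and the counts in the example are hypothetical and are not the counts behind the reported figures.

```python
def spotting_rates(n_correct, n_false, n_targets):
    """Return (CR in percent, FA as false alarms per word to be spotted)."""
    cr_percent = 100.0 * n_correct / n_targets
    fa = n_false / n_targets
    return cr_percent, fa


# Hypothetical counts: 100 words to be spotted, 40 found, 110 false detections.
cr, fa = spotting_rates(n_correct=40, n_false=110, n_targets=100)
```

    Both measures share the same denominator, so FA is not bounded by 1: a spotter that fires often can produce many false detections for every word to be spotted.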
    The spotting result based on the a posteriori probability showed a CR of 37.8 percent and an FA of 1.12. Its FA is very small because known words are detected only at local peaks of the a posteriori probability of word existence.
    In this paper, word spotting techniques are described that extract phonemes and words simultaneously and discriminate known words using phoneme information to reduce false alarms.

    Keywords: spotting, known words, a posteriori probability, phoneme information