Cognitive Model of Speech Dialogue
Collection and Analysis of Dialogue Data for Development of Dialogue System

Katsuhiko SHIRAI, Shu NAKAZATO and Shigeki OKAWA

School of Science and Engineering, Waseda University

{1. Introduction}
The main objective of this study is to develop a model of the speech dialogue process. In our previous studies, we focused on the collection and analysis of dialogue data and discussed basic problems in modeling the dialogue process. By connecting our speech recognition module to the dialogue management module, we have started basic experiments.
As for the dialogue model, we investigated the conditions of turn-taking and the mechanism of interruption. The model is being further extended to cover other phenomena of the dialogue process.

{2. Collection and Analysis of Dialogue Data}
To construct a model of the communication mechanism in speech dialogue, we are collecting and analyzing dialogue data recorded in a spontaneous and cooperative situation.
In this study, we first analyzed erroneous utterances, stumbles, fillers, acknowledgments, and overlaps.

{2.1 Car-navigation Task}
The first data set is man-machine dialogue collected with the car-navigation system we developed, using the Wizard of Oz method. There were two groups of subjects: one group was told they were talking to a machine, and the other was told they were talking to a human. Differences between the two groups were found in the dialogue data. In the man-machine dialogue, it is easy to identify the ends of utterances and the turn-taking points, and both structures are simple. However, in both experiments we observed ``restatements'' and ``fillers''; these phenomena appear even in such simple dialogues. We did not find ``acknowledgments,'' ``interruptions,'' or ``overlaps'' in this task.

{2.2 Cross-word puzzle Task}
The second task is the cooperative solution of a crossword puzzle by two persons, who cooperate only through speech communication. The crossword puzzle is a very familiar task that requires no training or special knowledge before the experiment. We observed spontaneous speech between two speakers who were given the keys to the solution separately.
The total number of utterances over 8 pairs was 1583, with an average of 99.8 utterances per task. In this task, phenomena such as self-correction, fillers, and omission were frequently observed. We found that the two speakers used cues to control the dialogue and to indicate the significance of information.
The crossword puzzle task thus proved well suited for observing spontaneous speech.

{3. Estimation of Statistical Phoneme Center}
As a front-end of the dialogue system, we propose the new concept of a statistical phoneme center and describe several of its properties that are effective for realizing highly reliable phoneme extraction in continuous speech.
The novelty of this study lies in assuming a fictitious center for each phoneme. The problem of associating the speech signal with phoneme categories has been considered many times over the past years. Especially in the 1970s, many researchers investigated the process of speech perception through hearing experiments, but they could not reach an exact solution.
Therefore, the phoneme center defined in this study is not necessarily the most remarkable point exhibiting the special property of each phoneme in the classical sense. First, it is defined and determined by a statistical procedure. Second, it is not directly related to particular physical characteristics such as those seen in the spectrum; rather, it reflects complex mixed properties found in the speech sound over a considerably long time interval.
In our method, the distribution of the phoneme center is estimated by an iterative training process using a large amount of speech data. To determine the optimal distributions of the phoneme center, the probabilities of the acoustic features of neighboring frames are considered. We use two kinds of probability distribution: (i) discrete and (ii) continuous.
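The discrete variant of this likelihood can be illustrated with a minimal sketch. All names, the toy codebook, and the numbers below are hypothetical, not the paper's actual models: the likelihood that a frame is a phoneme center is taken as the product of the conditional probabilities of the quantized features observed at neighboring frame offsets.

```python
import math

def center_log_likelihood(features, t, offset_dists, context=2):
    """Log-likelihood that frame t is a phoneme center, using discrete
    distributions of the quantized features at neighboring frame offsets."""
    ll = 0.0
    for d in range(-context, context + 1):
        i = t + d
        if 0 <= i < len(features):
            p = offset_dists[d].get(features[i], 1e-6)  # floor for unseen codes
            ll += math.log(p)
    return ll

# Toy codebook of two codes; the distribution is sharpest at the center (d=0).
dists = {
    -2: {0: 0.5, 1: 0.5}, -1: {0: 0.6, 1: 0.4},
     0: {0: 0.9, 1: 0.1},
     1: {0: 0.6, 1: 0.4},  2: {0: 0.5, 1: 0.5},
}
frames = [1, 0, 0, 0, 1, 1]
# Evaluate only interior frames so every offset falls inside the utterance.
interior = range(2, len(frames) - 2)
best = max(interior, key=lambda t: center_log_likelihood(frames, t, dists))
```

In this toy run, `best` lands on the frame whose neighborhood best matches the offset distributions; in training, each center estimate would be moved toward such local maxima and the distributions re-estimated.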
Next, the likelihoods of the phoneme centers are calculated using conditional probabilities. The iterative process moves each center to the local maximum of the likelihood. Phoneme or word recognition can then be realized by optimizing the likelihood of the statistical phoneme centers with the DTW (dynamic time warping) technique.
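The recognition step can be sketched as a standard DTW alignment between a reference phoneme sequence and the frame-wise center scores. The sketch below is a generic textbook DTW recursion over a cost matrix (e.g. negative log-likelihoods), not the paper's specific implementation; the cost values are made up for illustration.

```python
def dtw(cost):
    """cost[i][j]: local cost of aligning reference phoneme i with frame j.
    Returns the minimum accumulated alignment cost (standard DTW recursion)."""
    n, m = len(cost), len(cost[0])
    INF = float("inf")
    acc = [[INF] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            if i == 0 and j == 0:
                acc[i][j] = cost[i][j]
                continue
            best_prev = min(
                acc[i - 1][j] if i > 0 else INF,                  # stay on frame
                acc[i][j - 1] if j > 0 else INF,                  # stay on phoneme
                acc[i - 1][j - 1] if i > 0 and j > 0 else INF,    # advance both
            )
            acc[i][j] = cost[i][j] + best_prev
    return acc[-1][-1]

# Toy example: 2 reference phonemes against 4 frames of local cost.
cost = [[0.1, 0.2, 0.9, 0.8],
        [0.9, 0.8, 0.1, 0.2]]
total = dtw(cost)
```

Word recognition then amounts to running this alignment for each candidate word's reference phoneme sequence and picking the word with the lowest accumulated cost (highest likelihood).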
For the experiments, we used multi-speaker word data from the ATR Japanese Speech Database. The first experiment is phoneme center detection, which tests the validity of the statistical phoneme center. It is noticeable that when the detection rate exceeds 95%, the erroneous insertion of phonemes is less than 42%.
The second experiment evaluates the method by phoneme recognition. The results show that the percentage correct over all phonemes is above 90%, confirming the effectiveness of the phoneme-center likelihood.
The proposed method is not only very effective for statistical phoneme recognition; the phoneme center is also very suggestive for clarifying the basic characteristics of phonemes.

{4. Extracting User's Intention from Key-word Lattice}
We investigated a method of extracting the task-oriented user's intention from a key-word lattice, which is the recognition result of spontaneous speech obtained by word-spotting.
Possible word sequences are evaluated by the degree of co-occurrence between word categories, and the most reliable one for the user utterance is determined. The word sequence is then classified into a user intention according to the degree of relation between the word categories and the intentions.
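A minimal sketch of this two-stage selection follows. The categories, co-occurrence degrees, and relation table are all hypothetical illustrations, not the paper's actual values: each candidate sequence from the lattice is scored by the co-occurrence of adjacent word categories, and the winning sequence's categories then vote for an intention through the relation table.

```python
from itertools import product

# Hypothetical co-occurrence degrees between adjacent word categories.
cooccur = {
    ("place", "verb_go"): 0.9, ("place", "verb_eat"): 0.2,
    ("food", "verb_go"): 0.3,  ("food", "verb_eat"): 0.8,
}
# Hypothetical relation degrees between word categories and intentions.
relation = {
    "place":    {"navigate": 0.8, "order": 0.1},
    "food":     {"navigate": 0.2, "order": 0.8},
    "verb_go":  {"navigate": 0.9, "order": 0.1},
    "verb_eat": {"navigate": 0.1, "order": 0.9},
}

def best_sequence(lattice):
    """lattice: list of per-slot candidate lists of (word, category) pairs.
    Picks the sequence maximizing summed adjacent-category co-occurrence."""
    def score(seq):
        return sum(cooccur.get((a[1], b[1]), 0.0) for a, b in zip(seq, seq[1:]))
    return max(product(*lattice), key=score)

def classify_intention(seq):
    """Sums the relation degrees of the sequence's categories per intention."""
    totals = {}
    for _, cat in seq:
        for intent, deg in relation.get(cat, {}).items():
            totals[intent] = totals.get(intent, 0.0) + deg
    return max(totals, key=totals.get)

lattice = [[("station", "place"), ("ramen", "food")],
           [("go", "verb_go"), ("eat", "verb_eat")]]
seq = best_sequence(lattice)
intent = classify_intention(seq)
```

Here the co-occurrence stage rejects implausible category pairs left in the lattice by word-spotting errors, and the relation stage maps the surviving sequence to an intention, mirroring the two degrees described above.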
We found an improvement in word recognition compared with the general method that simply sums the scores of the words.
As for intention recognition, our method achieved 97.2% correct interpretation in an evaluation with text input. In another evaluation with ideal speech input, it achieved 83.3% correct interpretation.