Yasuhisa NIIMI and Yutaka KOBAYASHI

Department of Electronics and Information Science,
Kyoto Institute of Technology

Matsugasaki, Sakyo-ku, Kyoto, 606 JAPAN

e-mail: niimi@dj.kit.ac.jp

A number of attempts have been made to study spoken dialogue systems.
However, current speech recognition technology, despite remarkable
progress, is still insufficient for complete recognition of
utterances in spoken dialogue, so dialogue systems need to confirm
recognized utterances. This paper considers three dialogue control
strategies to mitigate speech recognition errors: the prompt
to speak again, the direct confirmation and the indirect confirmation.
Assume that the dialogue system has recognized an utterance as the
sentence, ``Please tell me the entrance fee of Kinkakuji temple.'' If the
system cannot accept the sentence reliably, it has three options: it
prompts the user to speak again, confirms directly by saying, ``You mean
the entrance fee of Kinkakuji temple?'', or makes an indirect
confirmation by answering, ``You can enter Kinkakuji temple for 500 yen,''
instead of answering, ``It's 500 yen.''

The purpose of modeling the dialogue control strategies is to estimate two
quantities, $P_{ac}$ and $N$: $P_{ac}$ is the probability that the
information included in the user's utterance is conveyed to the system
correctly, and $N$ is the average number of turns taken between the user
and the system until the subdialogue on the user's first utterance terminates.

The first dialogue control strategy, the simplest of the three, is that
the dialogue system accepts the user's utterances when their recognition scores
are greater than a threshold value, but rejects them otherwise and prompts
the user to speak again. The dialogue system using this strategy is called
model 0. Now assume we know the probability, denoted by $a$, that the user's
utterances are accepted, and the probability, denoted by $p$, that accepted
utterances have been recognized correctly. How to estimate these two
parameters will be explained later. Then $P_{ac}^{(0)}$ and $N^{(0)}$ (the
superscripts indicate the model index) are given by the following formulae:
\[ P_{ac}^{(0)}=p, \makebox[30mm]{and} N^{(0)}=\frac{2}{a}-1. \]
Since $p$ is expected to decrease as $a$ increases, $a$ must be
made small in order to increase $P_{ac}^{(0)}=p$.
This, however, makes $N^{(0)}$ large, so some tradeoff is needed between
$P_{ac}^{(0)}$ and $N^{(0)}$.
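
One way to read these formulae is sketched below. It rests on two assumptions that are not stated explicitly above: successive attempts are accepted independently with probability $a$, and every user utterance and every system prompt counts as one turn. Let $K$ be the number of utterances the user makes until one is accepted. Then
\[ P(K=k)=(1-a)^{k-1}a, \qquad E[K]=\sum_{k\geq 1}k(1-a)^{k-1}a=\frac{1}{a}. \]
Each of the $K-1$ rejected utterances is followed by one prompt, so the subdialogue takes $2K-1$ turns and
\[ N^{(0)}=E[2K-1]=\frac{2}{a}-1. \]
The accepted utterance is correct with probability $p$ whatever the number of rejections, which gives $P_{ac}^{(0)}=p$. For example, $a=0.8$ gives $N^{(0)}=1.5$, while $a=0.5$ gives $N^{(0)}=3$.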

Now we consider how to estimate $a$ and $p$. Let $A$ denote the acoustic data
stream of an utterance, and $W$ denote a string of words. We can adopt
the conditional probability $P(W/A)$ of $W$ given $A$ as a recognition score.
The recognized string of words is the string that maximizes $P(W/A)$
under the given linguistic constraint. By Bayes' theorem,
\[ P(W/A)=P(A/W)P(W)/P(A).\]
The quantity $P(A/W)P(W)$, which is used as a conventional criterion in
speech recognition, is computed by using the hidden Markov model and the
language model. Two methods can be considered for estimating $P(A)$:
the first is to approximate $P(A)$ by $\max\{P(X)P(A/X)\}$, where $X$ is
a string of phonemes, and the second is to use the HMM to compute $P(A)$
directly. Using this scheme to compute $P(W/A)$ for many training
utterances, we can create a distribution of $P(W/A)$. Selecting a
threshold value $\theta$, we can estimate $a$ as the area of the portion
of the distribution in which the inequality $P(W/A)\geq\theta$ is
satisfied. The probability $p$ is estimated in a similar way by using separate
distributions created from correct recognitions and incorrect
recognitions.
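
The estimation procedure above can be illustrated with a minimal Python sketch. It assumes the training utterances are available as a list of scores $P(W/A)$ with a flag saying whether each recognition was correct, and it replaces the areas under the distributions by empirical fractions. The function and argument names are hypothetical and introduced only for illustration.

```python
from typing import Sequence, Tuple


def log_posterior_score(log_joint_w: float, log_joint_x: Sequence[float]) -> float:
    """Return log P(W/A) = log[P(A/W)P(W)] - log P(A).

    P(A) is approximated by the largest P(X)P(A/X) among the phoneme
    strings X whose log joint scores are given in log_joint_x
    (the first of the two methods described in the text).
    """
    return log_joint_w - max(log_joint_x)


def estimate_a_and_p(
    scores: Sequence[float], correct: Sequence[bool], theta: float
) -> Tuple[float, float]:
    """Estimate a and p from training scores for a threshold theta.

    a: fraction of utterances whose score is at least theta (accepted).
    p: fraction of the accepted utterances that were recognized correctly.
    Empirical fractions stand in for areas under the score distributions.
    """
    if len(scores) != len(correct) or not scores:
        raise ValueError("scores and correct must be non-empty and equally long")
    accepted = [c for s, c in zip(scores, correct) if s >= theta]
    if not accepted:
        raise ValueError("no utterance is accepted at this threshold")
    a = len(accepted) / len(scores)
    p = sum(1 for c in accepted if c) / len(accepted)
    return a, p
```

Sweeping `theta` over a range of values with `estimate_a_and_p` would trace out the tradeoff between $a$ and $p$ on the training data.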

Now we return to the dialogue control strategies. The second strategy
is the direct confirmation. The system using this strategy is called
model 1. Under this strategy the system confirms recognized utterances
when their recognition scores are less than the threshold value
$\theta$, while it accepts them otherwise. The user's response to this
confirmation is assumed to be either ``yes'' or ``no'' for simplicity.
When the response cannot be accepted, the user is asked to repeat
the original utterance. Assuming we know the probability, denoted by
$q$, of having recognized correctly the utterances for which the
confirmation is made, $P_{ac}^{(1)}$ and $N^{(1)}$ of model 1 are
given by the following formulae:
\[ P_{ac}^{(1)} = \frac{p\{1+(1-\alpha)q\}}{1+(1-\alpha)\beta} \]
and
\[ N^{(1)} = \frac{\alpha + (1-\alpha)(4-\alpha\beta)}
{\alpha + (1-\alpha)\alpha\beta} \]
where $\beta=1+2pq-p-q$ and $\alpha$ is the acceptance probability denoted by $a$ above.
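
As a numerical illustration, the following Python sketch evaluates the model 0 and model 1 formulae side by side. It assumes that $\alpha$ is the same acceptance probability as $a$ in model 0; the function names and the sample parameter values are hypothetical.

```python
from typing import Tuple


def beta(p: float, q: float) -> float:
    """beta = 1 + 2pq - p - q, as defined in the text."""
    return 1.0 + 2.0 * p * q - p - q


def model0(alpha: float, p: float) -> Tuple[float, float]:
    """Return (P_ac, N) for model 0: prompt to speak again."""
    return p, 2.0 / alpha - 1.0


def model1(alpha: float, p: float, q: float) -> Tuple[float, float]:
    """Return (P_ac, N) for model 1: direct confirmation."""
    b = beta(p, q)
    r = 1.0 - alpha  # probability that the utterance is not accepted outright
    p_ac = p * (1.0 + r * q) / (1.0 + r * b)
    n = (alpha + r * (4.0 - alpha * b)) / (alpha + r * alpha * b)
    return p_ac, n


if __name__ == "__main__":
    # Hypothetical parameter values, chosen only for illustration.
    for q in (0.4, 0.8):
        print(q, model0(0.7, 0.9), model1(0.7, 0.9, q))
```

With $\alpha=1$ both models coincide ($P_{ac}=p$, $N=1$), which is a quick sanity check on the formulae.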

A simple calculation shows that $N^{(1)}>N^{(0)}$, and that
$P_{ac}^{(1)}>P_{ac}^{(0)}$ if $q>1/2$.
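
The calculation can be sketched as follows, writing $\alpha$ for the acceptance probability (assumed here to be the quantity $a$ of model 0, so that $N^{(0)}=2/\alpha-1$). Subtracting the model 0 values from the model 1 formulae gives
\[ N^{(1)}-N^{(0)}=\frac{2(1-\alpha)(1-\beta)}{\alpha\{1+(1-\alpha)\beta\}} \]
and
\[ P_{ac}^{(1)}-P_{ac}^{(0)}=\frac{p(1-\alpha)(q-\beta)}{1+(1-\alpha)\beta}, \qquad q-\beta=(2q-1)(1-p). \]
Since $\beta=pq+(1-p)(1-q)$ lies between 0 and 1, the first difference is positive whenever $\alpha<1$ and $\beta<1$, and the second is positive whenever $\alpha<1$, $0<p<1$ and $q>1/2$.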

Finally we consider a more complex strategy: indirect confirmation. The dialogue
system using this strategy is called model 2. Model 2 uses indirect
confirmations as well as direct confirmations. Assume the following for the
performance of the system and the user's response to the indirect confirmation.

Under these assumptions, $P_{ac}^{(2)}$ and $N^{(2)}$ of model 2 are
given by the following formulae:
\[ P_{ac}^{(2)} = \frac{p\{1+(1-\alpha)[1+(q-1)\gamma]\}}
{1+(1-\alpha)\{1+(\beta-1)\gamma\}} \]
and
\[ N^{(2)} = \frac{\alpha+(1-\alpha) [(4-\alpha\beta)\gamma+
(4-\alpha-2\alpha pq)(1-\gamma)]}
{\alpha+(1-\alpha)[\alpha\beta\gamma+\alpha(1-\gamma)]} \]
In this case it can be shown that $P_{ac}^{(2)}>P_{ac}^{(0)}$ if $q>1/2$,
and $P_{ac}^{(2)}=P_{ac}^{(0)}$ and $N^{(2)}<N^{(0)}$ if $\gamma=0$,
that is, if the system adopts only the indirect confirmation.
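
The model 2 formulae can be checked numerically with a short Python sketch. It makes two assumptions that should be read as such: $\alpha$ is the acceptance probability $a$ of model 0, and the denominator of $P_{ac}^{(2)}$ contains the term $1+(\beta-1)\gamma$, the form for which $\gamma=1$ reduces the expression to the model 1 formula. The function names and sample values are hypothetical.

```python
from typing import Tuple


def beta(p: float, q: float) -> float:
    """beta = 1 + 2pq - p - q, as defined for model 1."""
    return 1.0 + 2.0 * p * q - p - q


def model2(alpha: float, p: float, q: float, gamma: float) -> Tuple[float, float]:
    """Return (P_ac, N) for model 2: direct and indirect confirmation.

    gamma = 1 corresponds to direct confirmation only (model 1) and
    gamma = 0 to indirect confirmation only.
    """
    b = beta(p, q)
    r = 1.0 - alpha  # probability that the utterance is not accepted outright
    # Assumed denominator term: 1 + (beta - 1) * gamma.
    p_ac = p * (1.0 + r * (1.0 + (q - 1.0) * gamma)) / (
        1.0 + r * (1.0 + (b - 1.0) * gamma)
    )
    n_num = alpha + r * (
        (4.0 - alpha * b) * gamma
        + (4.0 - alpha - 2.0 * alpha * p * q) * (1.0 - gamma)
    )
    n_den = alpha + r * (alpha * b * gamma + alpha * (1.0 - gamma))
    return p_ac, n_num / n_den


if __name__ == "__main__":
    # Hypothetical parameter values, chosen only for illustration.
    for gamma in (0.0, 0.5, 1.0):
        print(gamma, model2(0.7, 0.9, 0.8, gamma))
```

For the sample values $\alpha=0.7$, $p=0.9$, $q=0.8$, the case $\gamma=0$ gives $P_{ac}^{(2)}=p$ and fewer turns than $2/\alpha-1$, in line with the conclusion that indirect confirmation reduces the number of turns.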

This paper has reported three dialogue control strategies to mitigate errors
in speech recognition, and analyzed them mathematically. The analysis has
shown that the direct confirmation can increase the probability that the
information included in the user's utterances is conveyed to the system
correctly, and the indirect confirmation can reduce the average number of
turns exchanged between the user and the system.