Data Collection and Linguistic Analysis of Spoken Dialogue
Akira KUREMATSU, Jun SUWA, and Hideaki YAMAMOTO
Department of Electronics Engineering,
Univ. of Electro-Communications
Chofu-shi, Tokyo 182, JAPAN
e-mail: kure@electra.ee.uec.ac.jp
{1. Introduction}
Collecting spoken-language data is the basis for developing speech and
language processing for dialogue systems. A database of spontaneous
speech is especially important for developing a robust spoken dialogue
system or speech translation system.
It is worthwhile to study methods for collecting spontaneous speech in
terms of vocabulary, linguistic expression, task selection, style of
conversation (human-machine or human-human), and speakers. As for
vocabulary, the data will need to cover several thousand words within
the constraints of a specific task domain. As for linguistic
expression, spoken language has many characteristics that differ from
written language: ill-formed expressions such as hesitations,
deletions, and corrections are uttered frequently. Task selection is
important in data collection because the style of conversation is
influenced by the task. It is desirable to constrain the conversation
with a framework so that it does not range too widely. Utterances in
human-machine question-answering dialogues differ from those in
human-to-human negotiation dialogues. As for the style of
conversation, human-to-human dialogues serve as good examples for the
ultimate human-to-machine dialogues. Data from as many speakers as
possible are required to develop a speaker-independent acoustic
model.
{2. Data Collection of Spoken Dialogue}
We have selected the scheduling task as an appropriate task for
collecting spontaneous spoken dialogue. In the scheduling task, two
speakers talk to each other while each looks at his or her own
calendar. The speakers are told to assume that they will make an
appointment to meet for some specific purpose. They converse to settle
on a convenient date and time for their appointment. The calendar
covers the next two weeks or one month. The topic of the negotiation
is free, and the speakers choose their own topics, such as arranging a
meeting or an event. The conversation concludes when an agreement is
reached.
{3. Data Recording Methods}
Data are collected as follows. All recordings of the two speakers are
made with a DAT recorder on two separate channels. The speakers sit in
the same room but do not face each other, to avoid non-verbal
communication. Both parties use Sennheiser HMD 410 microphones for all
recordings. There are 13 calendar scenarios, and the speakers are
given arbitrarily selected scenarios. The two speakers' utterances
sometimes overlap in time; two-channel recording accommodates such
overlapped speech.
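
As an illustration of how overlapped speech could be located in a
two-channel recording, the following minimal Python sketch intersects
the utterance time spans of the two channels. It assumes that start
and end times (in seconds) are already available for each utterance on
each channel, and that utterances within one channel do not overlap
each other; the paper does not describe such timing annotation, so the
function name `overlapping_spans` and the toy times are hypothetical.

```python
from typing import List, Tuple

Segment = Tuple[float, float]  # (start_sec, end_sec) of one utterance


def overlapping_spans(chan_a: List[Segment],
                      chan_b: List[Segment]) -> List[Segment]:
    """Return the time spans during which both speakers are talking.

    Assumes segments within a single channel do not overlap each other.
    """
    a = sorted(chan_a)
    b = sorted(chan_b)
    overlaps: List[Segment] = []
    i = j = 0
    while i < len(a) and j < len(b):
        start = max(a[i][0], b[j][0])
        end = min(a[i][1], b[j][1])
        if start < end:
            overlaps.append((start, end))
        # Advance whichever segment ends first.
        if a[i][1] <= b[j][1]:
            i += 1
        else:
            j += 1
    return overlaps


# Toy utterance times, invented for illustration only.
speaker_a = [(0.0, 2.5), (4.0, 6.0)]
speaker_b = [(2.0, 3.0), (5.5, 7.0)]
print(overlapping_spans(speaker_a, speaker_b))
```

Because each speaker is kept on a separate channel, the overlapped
regions found this way can still be transcribed per speaker.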
{4. Transcription}
After collection, the dialogues are transcribed in three ways: (1)
Japanese Kana-Kanji characters, (2) Roman characters, and (3) Kana
characters. In the Roman-character and Kana-character transcriptions,
spaces are inserted at Bunsetsu (phrase) boundaries. The transcription
covers words, human noises, non-human noises, silence, false starts,
and transcriber comments.
{5. Part-Of-Speech Tagging}
Part-of-speech information is useful for the language model of spoken
dialogue. Parts of speech are tagged using a morphological analyzer.
The 27 part-of-speech categories follow the "Daijirin" dictionary.
{6. Analysis of Spoken Dialogue}
The linguistic characteristics of spoken dialogue in the scheduling
task are analyzed. There are 13 dialogues, with a total of 1491
utterances. There are 779 different words among a total of 9961
words. The five most frequent parts of speech at the beginning and at
the end of utterances are extracted. The frequency of each part of
speech is also calculated. Characteristics peculiar to spontaneous
dialogue are observed, such as false starts, hesitations, insertions,
and stoppages in the middle of a sentence.
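
The counts described in this section can be sketched in Python as
follows. This is a minimal illustration, not the procedure actually
used: it assumes each utterance is already available as a list of
(word, part-of-speech) pairs, and the function name
`corpus_statistics`, the romanized toy utterances, and the
part-of-speech labels are all hypothetical (they are not the 27
"Daijirin" categories).

```python
from collections import Counter
from typing import Dict, List, Tuple

Token = Tuple[str, str]      # (word, part_of_speech)
Utterance = List[Token]


def corpus_statistics(utterances: List[Utterance],
                      top_n: int = 5) -> Dict[str, object]:
    """Count words and part-of-speech frequencies over tagged utterances."""
    words = [word for utt in utterances for (word, _) in utt]
    # Part of speech of the first and last token of each non-empty utterance.
    first_pos = Counter(utt[0][1] for utt in utterances if utt)
    last_pos = Counter(utt[-1][1] for utt in utterances if utt)
    all_pos = Counter(pos for utt in utterances for (_, pos) in utt)
    return {
        "utterances": len(utterances),
        "total_words": len(words),
        "different_words": len(set(words)),
        "top_initial_pos": first_pos.most_common(top_n),
        "top_final_pos": last_pos.most_common(top_n),
        "pos_frequency": all_pos,
    }


# Toy tagged utterances, invented for illustration only.
toy = [
    [("eeto", "interjection"), ("raishuu", "noun"), ("wa", "particle"),
     ("dou", "adverb"), ("desu", "auxiliary"), ("ka", "particle")],
    [("hai", "interjection"), ("getsuyoubi", "noun"), ("wa", "particle"),
     ("aite", "verb"), ("imasu", "auxiliary")],
    [("getsuyoubi", "noun"), ("desu", "auxiliary"), ("ne", "particle")],
]

stats = corpus_statistics(toy)
print(stats["total_words"], stats["different_words"])
print(stats["top_initial_pos"])
print(stats["top_final_pos"])
```

Applied to the full tagged corpus, the same counts would correspond to
the number of utterances, the total and different word counts, and the
top-five lists reported above.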
{7. Future Work}
Future work will address the following items.
(1) More data will be collected in the scheduling task.
(2) Part-of-speech tagging methods aimed at semi-automatic tagging
will be studied.
(3) Acoustic modeling will be improved using the collected spontaneous
speech database.
(4) The stochastic features of the linguistic characteristics will be
analyzed in more detail.
(5) Prosodic analysis will be undertaken and prosodic information will be
attached to the collected spontaneous data.
(6) The intention of each utterance will be analyzed, and
intention-type information will be attached to the database.