Data Collection and Linguistic Analysis of Spoken Dialogue


Department of Electronics Engineering, Univ. of Electro-Communications
Chofu-shi, Tokyo 182, JAPAN

{1. Introduction}
Collecting spoken language data is the basis for developing speech and language processing for dialogue systems. A database of spontaneous speech is widely regarded as essential for developing a robust spoken dialogue system or speech translation system.
It is worthwhile to study methods for collecting spontaneous speech in terms of vocabulary, linguistic expression, task selection, style of conversation (human-machine or human-human), and speakers. As for vocabulary, a collection needs to cover several thousand words under the constraint of a specific task domain. As for linguistic expression, spoken language has many characteristics that differ from written language: ill-formed expressions such as hesitations, deletions, and corrections are frequently uttered. Task selection is important in data collection because the style of conversation is influenced by the task. It is desirable to constrain the conversation with a framework so that it does not range too widely. Utterances in human-machine question-answering dialogues differ from those in human-to-human negotiation dialogues. As for the style of conversation, human-to-human dialogues serve as good examples for the ultimate human-to-machine dialogues. Data must be collected from as many speakers as possible in order to develop a speaker-independent acoustic model.

{2. Data Collection of Spoken Dialogue}
We selected a scheduling task as an appropriate task for collecting spontaneous spoken dialogue. In the scheduling task, two speakers talk to each other while each looks at an individual calendar. The speakers are told to assume that they will make an appointment to meet for some specific purpose. They converse to settle on a convenient date and time for the appointment. The calendar covers the next two weeks or one month. The topic of the negotiation is free, and the speakers choose their own topics, such as a meeting or an event. The conversation concludes when the speakers reach an agreement.

{3. Data Recording Methods}
The data are recorded as follows. All recordings of the two speakers are made with a DAT recorder on two separate channels. The speakers sit in the same room but do not face each other, in order to avoid non-verbal communication. Both speakers use Sennheiser HMD 410 microphones for all recordings. There are 13 calendar scenarios, and the speakers are given an arbitrarily selected scenario. The two speakers' utterances sometimes overlap in time; two-channel recording allows such overlapped speech to be captured.
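Because the speakers are recorded on separate channels, overlapped speech can be located once each channel has been segmented into utterances with start and end times. The following is a minimal sketch of that idea, not the procedure used for this corpus: the `Segment` type, the `find_overlaps` function, and the timings are all hypothetical and serve only as an illustration.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """One utterance on one channel (hypothetical representation)."""
    speaker: str
    start: float  # seconds from the start of the recording
    end: float    # seconds from the start of the recording


def find_overlaps(channel_a, channel_b):
    """Return (seg_a, seg_b, overlap_start, overlap_end) for every pair
    of utterances, one from each channel, that overlap in time."""
    overlaps = []
    for a in channel_a:
        for b in channel_b:
            start = max(a.start, b.start)
            end = min(a.end, b.end)
            if start < end:  # strictly positive overlap duration
                overlaps.append((a, b, start, end))
    return overlaps


# Invented timings, for illustration only.
channel_a = [Segment("A", 0.0, 2.5), Segment("A", 5.0, 7.0)]
channel_b = [Segment("B", 2.0, 4.0), Segment("B", 7.0, 9.0)]

overlaps = find_overlaps(channel_a, channel_b)
for a, b, start, end in overlaps:
    print(f"{a.speaker} and {b.speaker} overlap from {start:.1f}s to {end:.1f}s")
```

Utterances that merely touch at a boundary (one ends exactly when the other begins) are not counted as overlapping in this sketch.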

{4. Transcription}
After collection, the dialogues are transcribed in three ways: (1) Japanese Kana-Kanji characters, (2) Roman characters, and (3) Kana characters. In the Roman-character and Kana-character transcriptions, spaces are inserted at Bunsetsu boundaries. The transcriptions include words, human noises, non-human noises, silences, false starts, and transcriber comments.

{5. Part-Of-Speech Tagging}
Part-of-speech information is useful for the language model of spoken dialogue. Parts of speech are tagged using a morphological analyzer. The 27 part-of-speech categories follow the "Daijirin" dictionary.

{6. Analysis of Spoken Dialogue}
The linguistic characteristics of spoken dialogue in the scheduling task are analyzed. The number of dialogues is 13, and the total number of utterances is 1491. There are 779 different words among 9961 total words. The five most frequent parts of speech at the beginning and at the end of utterances are extracted, and the frequency of each part of speech is calculated. Characteristics peculiar to spontaneous dialogue are observed, such as false starts, hesitations, insertions, and sentences broken off in the middle.
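The counts described above (total words, different words, part-of-speech frequencies, and the most frequent parts of speech at utterance boundaries) can be computed from a tagged corpus with a few lines of code. The sketch below assumes each utterance is already available as a list of (word, part-of-speech) pairs; the function name `corpus_statistics`, the toy utterances, and the tag names are hypothetical and are not taken from the collected data or from the 27 Daijirin categories.

```python
from collections import Counter


def corpus_statistics(utterances, top_n=5):
    """Summarize a part-of-speech-tagged corpus.

    `utterances` is a list of utterances; each utterance is a list of
    (word, part_of_speech) pairs. Empty utterances are ignored.
    """
    utterances = [utt for utt in utterances if utt]
    words = [word for utt in utterances for word, _ in utt]
    pos_frequency = Counter(pos for utt in utterances for _, pos in utt)
    initial_pos = Counter(utt[0][1] for utt in utterances)
    final_pos = Counter(utt[-1][1] for utt in utterances)
    return {
        "utterances": len(utterances),
        "total_words": len(words),
        "distinct_words": len(set(words)),
        "pos_frequency": pos_frequency,
        "initial_pos": initial_pos.most_common(top_n),
        "final_pos": final_pos.most_common(top_n),
    }


# Toy romanized utterances with invented tags, for illustration only.
toy_corpus = [
    [("ano", "interjection"), ("getsuyoubi", "noun"), ("wa", "particle"),
     ("dou", "adverb"), ("desu", "auxiliary"), ("ka", "particle")],
    [("getsuyoubi", "noun"), ("wa", "particle"), ("chotto", "adverb")],
    [("hai", "interjection")],
]

stats = corpus_statistics(toy_corpus)
print(stats["total_words"], stats["distinct_words"])
print(stats["initial_pos"])
print(stats["final_pos"])
```

Applied to the figures reported above, the same kind of summary gives a ratio of different words to total words of 779/9961, roughly 0.078, and an average of 9961/1491, roughly 6.7 words per utterance.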

{7. Future Work}
Future work will address the following items.
(1) More data will be collected in the scheduling task.
(2) Part-of-speech tagging methods aimed at semi-automatic tagging will be studied.
(3) Acoustic modeling will be improved using the collected spontaneous speech database.
(4) The stochastic features of the linguistic characteristics will be analyzed in more detail.
(5) Prosodic analysis will be undertaken, and prosodic information will be attached to the collected spontaneous speech data.
(6) The intention of each utterance will be analyzed, and intention-type information will be attached to the database.