Creation and Application of Dialogue Corpora
-- Designing, Recording and Statistical Analysis of Spoken Dialogue Corpora --

Shuichi ITAHASHI, Kazuyuki TAKAGI and Naoko HOURA

Institute of Information Sciences and Electronics, University of Tsukuba
1-1-1 Tennodai, Tsukuba, Ibaraki 305 Japan

For years, many researchers have pointed out that spoken dialogue corpora are indispensable for promoting spoken dialogue research. However, there has been much less discussion of how to design such a corpus or which one to choose when several corpora are available.
It is very difficult to discuss the contents of spoken dialogue corpora in general terms, so they have to be dealt with individually. In terms of formal properties, however, it should be possible to arrive at design principles or guidelines for selecting the most suitable corpus.
For instance, when transcribed texts of simulated dialogues are available, analyzing the statistical characteristics of the texts can provide reference data for deciding which material is more suitable for a given objective. As a first step, we investigated several occurrence statistics of phonemes and moras in the transcribed texts and compared the results among the corpora.
The speech materials used for statistical analysis were taken from the ASJ (Acoustical Society of Japan) continuous speech corpus, volumes one to six, and the ATR dialogue corpus. The ASJ corpus includes the ATR 503 phonetically balanced (PB) sentences, 3129 simulated dialogue sentences and 1027 sentences from various guide tasks. The ATR dialogue corpus comprises 10,610 sentences.
We investigated the occurrence frequencies of single phonemes, two-phoneme sequences, single moras and two-mora sequences, using 55 phonemic units, which include palatalized phonemes such as /ky, sy, ny, .../, and 217 moraic units, which include CwV for transcribing loan words, where C stands for a consonant and V for a vowel. The phonemes with the highest occurrence frequencies were /a, o, i, u, e/, in this order, in all four corpora, followed by /k/ and the syllabic nasal /N/.
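The counting described above can be sketched as follows. This is a minimal illustration, not the procedure actually used in the study: it assumes each utterance has already been converted into a list of unit symbols (phonemes or moras), and it counts pairs only within an utterance. The function name `count_units` and the toy input are hypothetical.

```python
from collections import Counter


def count_units(utterances):
    """Count single units and adjacent unit pairs.

    `utterances` is a list of utterances, each given as a list of unit
    symbols (phonemes or moras). Pairs are counted within an utterance
    only, never across utterance boundaries (an assumption of this sketch).
    """
    unigrams = Counter()
    bigrams = Counter()
    for units in utterances:
        unigrams.update(units)
        # zip pairs each unit with its successor in the same utterance
        bigrams.update(zip(units, units[1:]))
    return unigrams, bigrams


# Toy, hand-made input (not corpus data): "desu" and "masu" as phoneme lists.
toy = [["d", "e", "s", "u"], ["m", "a", "s", "u"]]
unigrams, bigrams = count_units(toy)

# List the most frequent units and pairs.
print(unigrams.most_common(3))
print(bigrams.most_common(3))
```

The same function works for moras if each utterance is given as a list of mora symbols instead of phoneme symbols.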
The occurrence frequencies of single phonemes and moras differed little among the four corpora. Some differences were seen in the joint occurrence frequencies of two phonemes. The elongated vowel /o-/ had the highest occurrence frequency in the ATR dialogue corpus, which is quite different from the other corpora. The joint occurrence frequencies of two moras in the ATR PB sentences were high for /aru, nai, Qte, Qta/, reflecting the characteristics of written text. The three other corpora exhibited high occurrence frequencies for /desu, masu, Nde/, reflecting the characteristics of spoken dialogue. The ATR PB sentences had the highest entropy, which is consistent with their phonetic balance. The ASJ various guide task sentences had the lowest coverage of phonemes and moras. These results suggest that the characteristics of a text can be inferred from the mora pairs with high joint occurrence frequencies.
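As a rough illustration of the two summary measures mentioned above, the sketch below computes the entropy of a unit distribution and the coverage of a unit inventory from frequency counts. The definitions are assumptions made for this example: entropy is taken as the Shannon entropy in bits of the relative frequencies, and coverage as the fraction of the inventory (for example, the 55 phonemic units) that occurs at least once. The abstract does not state the exact definitions used, and the function names are hypothetical.

```python
import math
from collections import Counter


def entropy_bits(counts):
    """Shannon entropy (in bits) of the relative frequencies in `counts`."""
    total = sum(counts.values())
    return -sum(
        (c / total) * math.log2(c / total)
        for c in counts.values()
        if c > 0
    )


def coverage(counts, inventory):
    """Fraction of `inventory` units that occur at least once in `counts`.

    `inventory` must be non-empty.
    """
    inventory = set(inventory)
    seen = {unit for unit, c in counts.items() if c > 0}
    return len(seen & inventory) / len(inventory)


# Toy counts (not corpus data): two units, equally frequent.
toy_counts = Counter({"a": 2, "b": 2})
print(entropy_bits(toy_counts))               # 1.0 bit for two equiprobable units
print(coverage(toy_counts, ["a", "b", "c", "d"]))  # 0.5: two of four units observed
```

Under these definitions, a more even distribution over more units gives a higher entropy, which is why a phonetically balanced sentence set would be expected to score highest.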
Next, we took up telephone shopping dialogues as an example of spoken dialogue recording. The reasons are that many people have recently become familiar with telephone shopping, and that, because a customer orders merchandise from a catalogue, it is rather easy to avoid a situation in which one speaker does most of the talking. We used high-quality microphones instead of telephones, with the speakers separated by a shield; the dialogues were recorded on DAT.
Speakers playing the role of a customer were asked to read aloud the merchandise list of 158 items, both to practice pronunciation and to check the recording level. Customer speakers placed an order using a data sheet that included the customer's name, customer number, address, telephone number and credit card number, as well as the name or company name, address and telephone number for delivery.
The customer's name is the speaker's real one, while the name or company name and the address for delivery are imaginary. The address is a combination of real names of prefectures, cities, towns and streets. Telephone and credit card numbers are, of course, imaginary but have the same number of digits as real ones. The area codes of the telephone numbers were chosen to match the corresponding addresses. The merchandise comprised 224 items chosen from a catalogue. Each customer speaker recorded at least two dialogues, the first without a customer number and the second and subsequent ones with it. A trained speaker played a receptionist accepting orders according to a manual. An assistant helped her calculate the total amount. Forty-seven telephone shopping dialogues were recorded by 13 male and six female speakers.
The speakers were university students in their twenties. Of the 47 dialogues, 35 were considered satisfactory, with utterance times ranging from 3 minutes 2 seconds to 8 minutes 36 seconds and averaging 4 minutes 30 seconds. So far, we have transcribed 10 dialogues into text. The number of utterances per dialogue ranged from 89 to 216, with an average of 123. The number of characters per utterance ranged from two to 71, with an average of 11. Most of the dialogues will be made available on CD-ROM.
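Summary figures of this kind (minimum, maximum and average number of utterances per dialogue and of characters per utterance) can be obtained from transcriptions with a few lines of code. The sketch below is only an illustration under stated assumptions: each dialogue is represented as a list of utterance strings, character counts are plain string lengths, and the function name `length_summary` and the toy romanized input are hypothetical.

```python
from statistics import mean


def length_summary(dialogues):
    """Return (min, max, mean) for utterances per dialogue and for
    characters per utterance.

    `dialogues` is a list of dialogues, each a list of utterance strings.
    """
    utt_counts = [len(d) for d in dialogues]
    char_counts = [len(u) for d in dialogues for u in d]
    return {
        "utterances_per_dialogue": (
            min(utt_counts), max(utt_counts), mean(utt_counts)),
        "chars_per_utterance": (
            min(char_counts), max(char_counts), mean(char_counts)),
    }


# Toy input (not the recorded dialogues): two short romanized dialogues.
toy_dialogues = [
    ["hai", "moshimoshi"],
    ["hai", "ee", "arigatou"],
]
summary = length_summary(toy_dialogues)
print(summary)
```

For Japanese text in kana and kanji, `len` counts characters rather than moras, so the result depends on the orthography used in the transcription.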

Keywords: speech corpus, database, statistics, telephone shopping, spoken dialogue