Acoustic Analysis of Dialogue Speech

Hidetsugu ICHIKAWA and Junichi NISHIYAMA

Department of Computer Science, Shizuoka University
5-1, 3 Chome, Hamamatsu, 432, Japan

Nonverbal communication plays an important role in human dialogue, where participants use natural, so-called spontaneous speech. The nonverbal information exchanged between people is called "paralanguage". Paralanguage has a number of aspects. Some are subjective, sensory features that are difficult to measure. Others, however, are measurable as concrete acoustic features. Here we describe how to measure the speech rate. Another aspect of dialogue research is the transcription into text of paralinguistic features such as voice loudness, tone, and speech rate. We made various measurements of the validity and consistency of descriptions across different transcribers.
The speech rate is the number of moras per second. A mora is a prosodic unit formed by a consonant and a short vowel. One technical measurement of the speech rate is achievable through phoneme recognition: the time of each phoneme is marked along the time axis, and the resulting phoneme (hence mora) lengths are averaged over some interval, yielding a number of moras per second. This measurement, however, is difficult to realize with automatic speech recognition technology and is possible only by hand labeling. There are several possible hypotheses about how the rate of speech is perceived. We estimate the tempo independently of phonemes and do not use speech recognition technology, since we assume that the rate of speech can be recognized without recognizing the content of the speech. For example, we can perceive the utterance speed from narrow-band filtered speech sound. This suggests that our sense of tempo can be derived from the envelope of the waveform.
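
The hand-labeled definition can be illustrated with a minimal sketch. The function name `mora_rate` and the label times are hypothetical, and the sketch assumes the labels give the start time of each mora plus the end time of the last one.

```python
def mora_rate(boundaries):
    """Return moras per second from hand-labeled boundary times (seconds).

    Assumption: `boundaries` lists the start of each mora followed by the
    end of the last mora, so n boundaries delimit n - 1 moras.
    """
    if len(boundaries) < 2:
        raise ValueError("need at least two boundaries")
    duration = boundaries[-1] - boundaries[0]
    if duration <= 0:
        raise ValueError("boundaries must be increasing")
    return (len(boundaries) - 1) / duration


# Hypothetical labels: 8 moras, each 0.125 s long.
labels = [i * 0.125 for i in range(9)]
print(mora_rate(labels))
```

Averaging over a shorter interval only requires passing the boundaries that fall inside that interval.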

In a preceding study, rhythm was correlated with the interval between the centers of energy of adjacent syllables.
Because Japanese has the CV-syllable-timed feature, dips in the envelope appear at every consonant segment at almost equal intervals. The speech envelope changes dynamically from a consonant to a vowel and then to the next consonant, forming peaks and valleys. The intervals between peaks and valleys are expected to be approximately equal, or periodic, because of the syllable-timed feature. If this is true, we can extract this periodicity by taking the DFT (Discrete Fourier Transform) of the Hamming-windowed envelope pattern. We used a window size of about one second throughout our experiments. The window includes local pauses and non-lexical voicing due to nonverbal or paralinguistic expressions. To obtain an envelope of the speech waveform, we first half-wave rectified the signal and then low-pass filtered the result to obtain an approximate envelope. We designed a low-pass filter with a cutoff frequency of 80 Hz to keep the envelope details. This cutoff is about ten times the average number of moras per second.
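
The procedure described here (half-wave rectification, low-pass filtering at 80 Hz, Hamming windowing, DFT, peak picking) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the first-order filter, the 2 to 20 Hz search band, the mean removal before windowing, and all function names are assumptions, and the input is a synthetic amplitude-modulated tone rather than speech.

```python
import cmath
import math


def half_wave_rectify(x):
    """Keep positive samples, set negative samples to zero."""
    return [max(v, 0.0) for v in x]


def one_pole_lowpass(x, fs, cutoff_hz):
    """First-order IIR low-pass filter (an assumed filter type)."""
    a = 1.0 - math.exp(-2.0 * math.pi * cutoff_hz / fs)
    y, out = 0.0, []
    for v in x:
        y += a * (v - y)
        out.append(y)
    return out


def estimate_rate(x, fs, cutoff_hz=80.0, fmin=2.0, fmax=20.0):
    """Return the frequency (Hz) of the strongest envelope periodicity."""
    env = one_pole_lowpass(half_wave_rectify(x), fs, cutoff_hz)
    n = len(env)
    mean = sum(env) / n
    # Remove the mean, then apply a Hamming window.
    w = [(e - mean) * (0.54 - 0.46 * math.cos(2.0 * math.pi * i / (n - 1)))
         for i, e in enumerate(env)]
    best_f, best_p = 0.0, -1.0
    # Evaluate only the DFT bins inside the assumed search band.
    for k in range(math.ceil(fmin * n / fs), math.floor(fmax * n / fs) + 1):
        step = -2j * math.pi * k / n
        power = abs(sum(v * cmath.exp(step * i) for i, v in enumerate(w))) ** 2
        if power > best_p:
            best_f, best_p = k * fs / n, power
    return best_f


# Synthetic one-second test: a 500 Hz tone whose amplitude pulses 8 times per second.
fs = 8000
signal = [0.5 * (1.0 - math.cos(2.0 * math.pi * 8.0 * i / fs))
          * math.sin(2.0 * math.pi * 500.0 * i / fs) for i in range(fs)]
rate = estimate_rate(signal, fs)
print(rate)
```

With a one-second window the DFT bins are 1 Hz apart, so the estimate is quantized to whole moras per second unless the window is lengthened or the peak is interpolated.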
The speaking rate is observed as a dominant spectral peak in the frequency domain, and it is visualized in the frequency-time plane like the formant pattern of a spectrogram. The frequency and time axes are scaled down to one two-hundredth of those of a normal spectrogram with an 8 kHz sampling rate. We could observe gray-scale monochrome patterns in the 20 Hz frequency region with a time window of one second.
We used bandpass-filtered speech from an auditory model as the source of the speech envelope. For both the bandpass-filtered and the full-band speech waves, the spectrogram of the waveform envelope showed spectral energy concentrated around the frequency corresponding to the speech rate.
First, we examined synthesized sounds: a stationary half-wave rectified signal with short silent gaps corresponding to consonant intervals, and then a signal with gradually decreasing intervals between envelope peaks. These test signals were processed according to the described procedure to obtain spectrograms of the envelope. We could observe dark bars corresponding to the speech rates.
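
Test signals of this kind could be generated roughly as below. The carrier frequency, gap length, interval values, and the name `gapped_tone` are all hypothetical choices for illustration; the source does not give the parameters of its synthesized sounds.

```python
import math


def gapped_tone(intervals, fs=8000, carrier_hz=200.0, gap_s=0.03):
    """Half-wave rectified tone in which each interval ends with a silent gap.

    `intervals` gives the duration (seconds) of each mora-like segment;
    the silent gap stands in for a consonant interval.
    """
    samples = []
    n_gap = round(gap_s * fs)
    for dur in intervals:
        n_total = round(dur * fs)
        for i in range(n_total):
            if i >= n_total - n_gap:
                samples.append(0.0)  # silent gap
            else:
                t = len(samples) / fs
                samples.append(max(math.sin(2.0 * math.pi * carrier_hz * t), 0.0))
    return samples


# Stationary signal: 8 segments per second.
steady = gapped_tone([0.125] * 8)
# Gradually decreasing intervals between envelope peaks (0.20 s down to 0.11 s).
accelerating = gapped_tone([0.20 - 0.01 * i for i in range(10)])
print(len(steady), len(accelerating))
```

Feeding such signals through the envelope and DFT procedure gives a known target rate against which the estimate can be checked.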
We then examined real dialogue speech taken from TV programs. The actual speech rate was measured manually by segmenting individual phonemes. We then computed DFTs of the envelope waveforms and took the frequency of the peak energy as an estimate of the speech rate. The manual and DFT estimates correlated with a coefficient of 0.57.
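
A comparison of this kind can be computed as a Pearson correlation coefficient. The sketch below assumes that this is the coefficient meant, which the source does not state. The two lists are made-up placeholder values, not the data behind the reported 0.57.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical rates in moras per second, one pair per analysis window.
manual_rates = [6.8, 7.5, 8.1, 9.0, 7.2]
dft_rates = [7.0, 6.0, 9.0, 8.0, 8.0]
print(pearson(manual_rates, dft_rates))
```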
Spectrograms of the envelope of real speech show a more complicated texture than those of the test signals. It was therefore difficult to recognize the speech rate as a single dark bar pattern.
Text-encoded dialogue is a useful form of analysis data; however, inconsistency between transcribers, omissions, and misleading descriptions are unavoidable. We estimated these errors and inconsistencies by cross-checking between different transcribers. Their descriptions follow the TEI encoding scheme for utterances and paralinguistic descriptions of real dialogue. We found 92% agreement between different transcribers for phonetic transcriptions, but 50% disagreement for nonverbal transcriptions. These results suggest that acoustic parameters will help achieve consistent description of nonverbal features.
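
One simple way to quantify such a cross-check is item-by-item percent agreement. The sketch below assumes the two transcriptions have already been aligned into equal-length lists of labels. The source does not describe its exact scoring method, and the labels shown are hypothetical.

```python
def percent_agreement(a, b):
    """Percentage of aligned items on which two transcribers agree."""
    if len(a) != len(b) or not a:
        raise ValueError("need two non-empty, aligned sequences")
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return 100.0 * matches / len(a)


# Hypothetical paralinguistic labels from two transcribers for four utterances.
transcriber_1 = ["loud", "fast", "laugh", "soft"]
transcriber_2 = ["loud", "fast", "pause", "soft"]
print(percent_agreement(transcriber_1, transcriber_2))
```

Raw percent agreement does not correct for agreement expected by chance, so a chance-corrected statistic may be preferable when the label set is small.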

Keywords: dialogue speech, speech rate, TEI, evaluation of description