Acoustic Analysis of Dialog Speech

Fumitada ITAKURA and Shoji KAJITA

School of Engineering, Nagoya University
Furo-cho 1, Chikusa-ku, Nagoya 464-01, JAPAN

In dialog speech recognition systems, the speech analysis stage is an essential front end based on acoustic signal processing. Because acoustic features lost at this stage cannot easily be recovered in later stages, deciding which features to extract from the acoustic signal, and how to extract them, is one of the most important problems in speech recognition. More intensive research on acoustic analysis is therefore required to achieve robust speech recognition in realistic acoustic environments.
To tackle this problem, we have proposed a technique based on subband processing and autocorrelation analysis, namely subband-autocorrelation (SBCOR) analysis. SBCOR analysis is designed to extract, in each subband, the periodicity at a delay time equal to the inverse of that subband's center frequency. SBCOR has previously been shown to be robust under multiplicative signal-dependent white noise, which has a constant SNR at every point of the signal.
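The idea can be illustrated with a minimal sketch, which is not the implementation used in this work. It assumes a Gaussian band shape whose width is set by Q = (center frequency) / (bandwidth), and it evaluates the subband autocorrelation through the band-weighted power spectrum of a single frame. The function names `power_spectrum` and `sbcor`, the band shape, and the frame-based (circular) autocorrelation are all assumptions made for illustration.

```python
import math

def power_spectrum(frame):
    """One-sided power spectrum by a naive DFT (adequate for short frames)."""
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        re = sum(x * math.cos(2 * math.pi * k * t / n) for t, x in enumerate(frame))
        im = sum(x * math.sin(2 * math.pi * k * t / n) for t, x in enumerate(frame))
        # Interior bins stand for both +f and -f, so they are counted twice.
        fold = 1.0 if k == 0 or 2 * k == n else 2.0
        spectrum.append(fold * (re * re + im * im))
    return spectrum

def sbcor(frame, fs, center_freqs, q=1.0):
    """Normalized subband autocorrelation at lag 1/fc for each center frequency fc."""
    n = len(frame)
    power = power_spectrum(frame)
    freqs = [k * fs / n for k in range(len(power))]
    values = []
    for fc in center_freqs:
        bandwidth = fc / q      # from Q = fc / bandwidth
        sigma = bandwidth / 2   # assumed Gaussian band shape
        tau = 1.0 / fc          # lag equal to the inverse of the center frequency
        num = 0.0
        den = 0.0
        for f, p in zip(freqs, power):
            w = math.exp(-0.5 * ((f - fc) / sigma) ** 2)
            num += w * p * math.cos(2 * math.pi * f * tau)  # autocorrelation at lag tau
            den += w * p                                    # autocorrelation at lag 0
        values.append(num / den if den > 0 else 0.0)
    return values
```

For a pure tone at the center frequency the value approaches 1, and it falls as the dominant periodicity in the band moves away from 1/fc. A larger Q narrows the band around each center frequency.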
In this paper, we investigate to what extent SBCOR analysis is robust against severe waveform distortion and noise.
First, it is shown that SBCOR is robust against severe waveform distortion such as zero-crossing distortion. Although zero-crossing distortion degrades the performance of conventional recognition systems, such distorted signals remain intelligible to humans. Analysis examples of SBCOR and the smoothed group delay spectrum (SGDS) show that the SBCOR spectrum is stable under such distortion, whereas zero-crossing distortion significantly affects the formant structure extracted by SGDS.
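As an illustration only, the sketch below assumes that zero-crossing distortion means replacing each sample by its sign (infinite peak clipping), so that the zero-crossing positions are the only waveform information that survives. This interpretation and the helper names `zero_crossing_distort` and `zero_crossings` are assumptions; the exact distortion procedure is not specified here.

```python
def zero_crossing_distort(signal, amplitude=1.0):
    """Keep only the sign of each sample (assumed form of the distortion)."""
    return [amplitude if x >= 0 else -amplitude for x in signal]

def zero_crossings(signal):
    """Indices at which the sign of the signal changes."""
    signs = [x >= 0 for x in signal]
    return [i for i in range(1, len(signs)) if signs[i] != signs[i - 1]]
```

Under this assumption all amplitude information is discarded, which severely alters the spectral envelope, while the periodicity carried by the zero-crossing intervals is preserved.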
In the recognition experiments, a standard speaker-dependent isolated word recognizer based on DTW is used. The recognition task is the discrimination of 68 word pairs. Each pair consists of two phonetically similar city names, selected from a database of 550 Japanese city names recorded twice by 5 Japanese male speakers. The first set is used as the reference patterns and the second set, which was spoken a week later, is used as the test patterns. The test signals are distorted by zero-crossing distortion. The experimental results show that SBCOR (Q=1.0) achieves a recognition rate about 19% higher than SGDS when the test signals are distorted in this way. These results indicate that the speech features extracted by SBCOR are much more robust against zero-crossing distortion.
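The pair discrimination step can be sketched as follows, assuming a textbook DTW with a Euclidean local distance and symmetric step pattern. The path constraints, local distance, and normalization of the recognizer actually used are not stated here, and the names `dtw_distance` and `discriminate_pair` are hypothetical.

```python
import math

def dtw_distance(a, b):
    """Accumulated DTW distance between two sequences of feature vectors."""
    inf = float("inf")
    n, m = len(a), len(b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(a[i - 1], b[j - 1])  # Euclidean local distance (assumed)
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]

def discriminate_pair(test, ref_a, ref_b):
    """Return 'a' or 'b' according to which reference pattern is closer."""
    return "a" if dtw_distance(test, ref_a) <= dtw_distance(test, ref_b) else "b"
```

In this sketch each pattern is a list of per-frame feature vectors, for example one SBCOR or SGDS spectrum per frame, and a test word is assigned to whichever of the two reference words in its pair gives the smaller accumulated distance.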
Second, it is shown that SBCOR is more robust than SGDS against multiplicative signal-dependent white noise, Gaussian white noise, and a human speech-like noise. The human speech-like noise was generated by superposing independent speech waveforms of 3200 phrases spoken by 30 male and 34 female speakers in the Continuous Speech Corpus for Research edited by the ASJ. The experimental results, obtained with the same DTW word recognition task as above, show that the SBCOR spectrum performs as well as SGDS under clean conditions and better than SGDS under noisy conditions, for all three noises. In addition, the best Q for the white noises is 1.5, while the best Q for the human speech-like noise is 2.0. The reason seems to be that narrowing the bandwidth attenuates the noise effects due to the low frequencies. The advantage of SBCOR is larger for the white noises than for the human speech-like noise.
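The construction of a speech-like noise by superposition, and the mixing of a noise into speech at a given SNR, can be sketched as below. The sketch assumes an SNR defined on average power over the whole utterance and a simple sample-wise sum of equally weighted waveforms; the helper names `superpose`, `power`, and `add_noise_at_snr` are hypothetical, and the actual level alignment used for the corpus phrases is not specified here.

```python
import math

def superpose(waveforms):
    """Sum independent waveforms sample by sample, up to the shortest length."""
    length = min(len(w) for w in waveforms)
    return [sum(w[i] for w in waveforms) for i in range(length)]

def power(x):
    """Average power of a waveform."""
    return sum(v * v for v in x) / len(x)

def add_noise_at_snr(speech, noise, snr_db):
    """Scale the noise so the mixture has the requested average SNR in dB."""
    segment = noise[:len(speech)]  # the noise must be at least as long as the speech
    gain = math.sqrt(power(speech) / (power(segment) * 10 ** (snr_db / 10)))
    return [s + gain * v for s, v in zip(speech, segment)]
```

As more independent talkers are superposed, the sum loses the structure of any single utterance while its long-term spectrum stays speech-like, so its energy is concentrated at low frequencies rather than spread evenly as in white noise.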
Finally, we evaluate the robustness at the phonemic level. The task is speaker-dependent recognition of the 23 phonemes /a,i,u,e,o,b,d,g,m,n,N,p,t,k,s,h,r,y,w,z,ts,ch,sh/ using HMMs. Each phoneme is modeled by a left-to-right HMM with seven mixtures. The parameters were estimated using the 2620 even-numbered words in the ATR Japanese 5240-word speech database (two male and two female speakers). The test data were taken from the 2620 odd-numbered words. The sampling frequency is 10 kHz. To examine the robustness against noise, multiplicative signal-dependent white noise is added to the test data. The experimental results show that the best Q decreases gradually as the SNR falls. Because the best Q for low SNR is not the best for high SNR and vice versa, the best overall choice is Q=1.5. Moreover, although the performance of SBCOR (Q=1.5) is slightly worse than that of SGDS under clean conditions, SBCOR performs much better than SGDS at SNRs of 20 dB (+6%) and 10 dB (+15%).
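One way to generate a noise whose SNR is the same at every point is to make the noise amplitude proportional to the instantaneous signal amplitude. The sketch below assumes the form y[n] = x[n](1 + a e[n]), with e[n] unit-variance Gaussian white noise and a = 10^(-SNR/20). This specific form and the name `add_multiplicative_noise` are assumptions for illustration, not the documented procedure.

```python
import math
import random

def add_multiplicative_noise(signal, snr_db, rng=None):
    """Add white noise whose amplitude follows the signal, giving a constant local SNR."""
    rng = rng or random.Random(0)  # fixed seed by default so runs are repeatable
    a = 10 ** (-snr_db / 20)       # noise-to-signal amplitude ratio
    return [x * (1.0 + a * rng.gauss(0.0, 1.0)) for x in signal]
```

With this form the noise vanishes wherever the signal is zero, so the zero-crossing positions of the signal are left largely intact even at low SNR.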
In this paper, using a DTW recognizer, we showed that SBCOR is robust against severe waveform distortion such as zero-crossing distortion and against three types of noise. These results indicate that SBCOR extracts speech features that are not captured sufficiently by conventional speech analyses. At the phonemic level, the robustness has so far been verified only for multiplicative signal-dependent white noise; the other noises require further investigation.

Keywords: SBCOR analysis, waveform distortion, noise, DTW, HMM