K. Sekiguchi, Y. Bando, A. A. Nugraha, K. Yoshii, and T. Kawahara
Abstract: This paper describes a semi-supervised multichannel speech enhancement method that only uses clean speech data for prior training. Although multichannel nonnegative matrix factorization (MNMF) and its constrained variant called independent low-rank matrix analysis (ILRMA) have successfully been used for unsupervised speech enhancement, the low-rank assumption on the power spectral densities (PSDs) of all sources (speech and noise) does not hold in reality. To solve this problem, we replace a low-rank model of speech with a deep generative model in the framework of MNMF or ILRMA, i.e., formulate a probabilistic model of noisy speech by integrating a deep speech model, a low-rank noise model, and a full-rank or rank-1 model of spatial characteristics of speech and noise. The deep speech model is trained from clean speech data in an unsupervised auto-encoding variational Bayesian manner. Given multichannel noisy speech spectra, the full-rank or rank-1 spatial covariance matrices and PSDs of speech and noise are estimated in an unsupervised maximum-likelihood manner. Experimental results showed that the full-rank version of the proposed method was significantly better than MNMF, ILRMA, and the rank-1 version. We confirmed that the initialization-sensitivity and local-optimum problems of MNMF with many spatial parameters can be solved by incorporating the precise speech model.
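The mixture model described in the abstract (a deep speech PSD paired with a speech SCM, plus a low-rank NMF noise PSD paired with a noise SCM) can be sketched as follows. This is a minimal illustration under assumed shapes and names, not the paper's implementation; in the actual method the speech PSD would come from the trained VAE decoder rather than being passed in directly.

```python
import numpy as np

def noisy_covariance(speech_psd, noise_W, noise_H, G_speech, G_noise):
    """Per-(frequency, time) spatial covariance of the observed mixture.

    speech_psd : (F, T) speech PSD (in the paper, output of a VAE decoder)
    noise_W    : (F, K) NMF basis spectra of the noise
    noise_H    : (K, T) NMF activations of the noise
    G_speech   : (F, M, M) spatial covariance matrices of speech
    G_noise    : (F, M, M) spatial covariance matrices of noise
    Returns Y with Y[f, t] = speech_psd[f, t] * G_speech[f]
                           + (noise_W @ noise_H)[f, t] * G_noise[f].
    """
    noise_psd = noise_W @ noise_H                      # low-rank noise PSD
    Y = (speech_psd[..., None, None] * G_speech[:, None]
         + noise_psd[..., None, None] * G_noise[:, None])
    return Y                                           # shape (F, T, M, M)
```

Under this model, each observed multichannel spectrum would be a zero-mean complex Gaussian with covariance `Y[f, t]`; maximizing that likelihood over the PSD and SCM parameters corresponds to the unsupervised estimation step described above.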
As the outputs of all methods are 5-channel source images, we show the fifth channel. When the number of noise sources is larger than 1, we show the sum over all noise sources.
MNMF with a Deep Speech Prior (MNMF-DP, semi-supervised) (Number of noise = 1, Number of noise bases = 64)
ILRMA with a Deep Speech Prior (ILRMA-DP, semi-supervised) (Number of noise = 4, Number of noise bases = 2)
MNMF (Number of noise = 1, Number of speech bases = 8, Number of noise bases = 256)
ILRMA (Number of noise = 4, Number of speech bases = 8, Number of noise bases = 1)
MNMF-NMF (semi-supervised) (Number of noise = 1, Number of speech bases = 4, Number of noise bases = 256)
ILRMA-NMF (semi-supervised) (Number of noise = 4, Number of speech bases = 16, Number of noise bases = 1)
MNMF-VQ (semi-supervised) (Number of noise = 1, Number of speech bases = 8, Number of noise bases = 256)
ILRMA-VQ (semi-supervised) (Number of noise = 4, Number of speech bases = 256, Number of noise bases = 2)
MNMF with a Deep Speech Prior (MNMF-DP, semi-supervised) (Number of noise = 1, Number of noise bases = 64)
ILRMA with a Deep Speech Prior (ILRMA-DP, semi-supervised) (Number of noise = 4, Number of noise bases = 2)
MNMF (Number of noise = 1, Number of speech bases = 8, Number of noise bases = 256)
ILRMA (Number of noise = 4, Number of speech bases = 8, Number of noise bases = 1)
MNMF-NMF (semi-supervised) (Number of noise = 1, Number of speech bases = 4, Number of noise bases = 256)
ILRMA-NMF (semi-supervised) (Number of noise = 4, Number of speech bases = 16, Number of noise bases = 1)
MNMF-VQ (semi-supervised) (Number of noise = 1, Number of speech bases = 8, Number of noise bases = 256)
ILRMA-VQ (semi-supervised) (Number of noise = 4, Number of speech bases = 256, Number of noise bases = 2)
MNMF with a Deep Speech Prior (MNMF-DP, semi-supervised) (Number of noise = 1, Number of noise bases = 64)
ILRMA with a Deep Speech Prior (ILRMA-DP, semi-supervised) (Number of noise = 4, Number of noise bases = 2)
MNMF (Number of noise = 1, Number of speech bases = 8, Number of noise bases = 256)
ILRMA (Number of noise = 4, Number of speech bases = 8, Number of noise bases = 1)
MNMF-NMF (semi-supervised) (Number of noise = 1, Number of speech bases = 4, Number of noise bases = 256)
ILRMA-NMF (semi-supervised) (Number of noise = 4, Number of speech bases = 16, Number of noise bases = 1)
MNMF-VQ (semi-supervised) (Number of noise = 1, Number of speech bases = 8, Number of noise bases = 256)
ILRMA-VQ (semi-supervised) (Number of noise = 4, Number of speech bases = 256, Number of noise bases = 2)
MNMF with a Deep Speech Prior (MNMF-DP, semi-supervised) (Number of noise = 1, Number of noise bases = 64)
ILRMA with a Deep Speech Prior (ILRMA-DP, semi-supervised) (Number of noise = 4, Number of noise bases = 2)
MNMF (Number of noise = 1, Number of speech bases = 8, Number of noise bases = 256)
ILRMA (Number of noise = 4, Number of speech bases = 8, Number of noise bases = 1)
MNMF-NMF (semi-supervised) (Number of noise = 1, Number of speech bases = 4, Number of noise bases = 256)
ILRMA-NMF (semi-supervised) (Number of noise = 4, Number of speech bases = 16, Number of noise bases = 1)
MNMF-VQ (semi-supervised) (Number of noise = 1, Number of speech bases = 8, Number of noise bases = 256)
ILRMA-VQ (semi-supervised) (Number of noise = 4, Number of speech bases = 256, Number of noise bases = 2)
To confirm that the estimated SCMs or separation matrices encode the correct spatial information, we calculated the TDOAs both from the estimated SCMs or separation matrices and from the microphone-array geometry given on the CHiME-3 web page [1]. Each dot indicates the TDOA calculated from the SCM at one frequency (for each channel, dots are sorted in ascending order of frequency from left to right). Horizontal black bars indicate the TDOAs calculated from the microphone positions. The TDOAs estimated from the SCMs (dots) are close to those calculated from the microphone positions (bars), which shows that all methods correctly extract spatial information. (Note that the geometrically calculated TDOAs are not necessarily exact, because the speaker position is unknown.)
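A common way to obtain a TDOA from an SCM is to take its principal eigenvector as an approximate steering vector and convert the inter-channel phase differences into delays; the geometric TDOA follows from the microphone and source positions. The sketch below shows this under that assumption (function names and shapes are ours, and a single narrowband SCM per frequency is assumed; in practice phase wrapping limits this to sufficiently low frequencies or short delays).

```python
import numpy as np

def tdoa_from_scm(R, freq_hz, ref=0):
    """TDOA (seconds) of each channel relative to channel `ref`,
    estimated from one Hermitian SCM R (M x M) at frequency freq_hz.
    The steering vector is approximated by the principal eigenvector."""
    eigvals, eigvecs = np.linalg.eigh(R)      # eigenvalues in ascending order
    h = eigvecs[:, -1]                        # principal eigenvector
    phase = np.angle(h * np.conj(h[ref]))     # inter-channel phase differences
    return -phase / (2 * np.pi * freq_hz)     # phase -> delay

def tdoa_from_geometry(mic_pos, src_pos, ref=0, c=343.0):
    """Geometric TDOA (seconds) from microphone positions (M x 3)
    and an assumed source position (3,), with speed of sound c (m/s)."""
    d = np.linalg.norm(mic_pos - src_pos, axis=1)   # mic-to-source distances
    return (d - d[ref]) / c
```

The relative phase `h[m] * conj(h[ref])` cancels the arbitrary global phase of the eigenvector, so the per-channel delays are well defined up to the choice of reference channel.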
F05_447C020Q_BUS
F06_446C0204_STR
[1] J. Barker et al., "The third 'CHiME' speech separation and recognition challenge: Dataset, task and baselines," in Proc. IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), 2015, [Link].