Demo Page

Semi-supervised Multichannel Speech Enhancement with a Deep Speech Prior   (GitHub)

K. Sekiguchi, Y. Bando, A. A. Nugraha, K. Yoshii, and T. Kawahara

Abstract: This paper describes a semi-supervised multichannel speech enhancement method that only uses clean speech data for prior training. Although multichannel nonnegative matrix factorization (MNMF) and its constrained variant called independent low-rank matrix analysis (ILRMA) have successfully been used for unsupervised speech enhancement, the low-rank assumption on the power spectral densities (PSDs) of all sources (speech and noise) does not hold in reality. To solve this problem, we replace a low-rank model of speech with a deep generative model in the framework of MNMF or ILRMA, i.e., formulate a probabilistic model of noisy speech by integrating a deep speech model, a low-rank noise model, and a full-rank or rank-1 model of spatial characteristics of speech and noise. The deep speech model is trained from clean speech data in an unsupervised auto-encoding variational Bayesian manner. Given multichannel noisy speech spectra, the full-rank or rank-1 spatial covariance matrices and PSDs of speech and noise are estimated in an unsupervised maximum-likelihood manner. Experimental results showed that the full-rank version of the proposed method was significantly better than MNMF, ILRMA, and the rank-1 version. We confirmed that the initialization-sensitivity and local-optimum problems of MNMF with many spatial parameters can be solved by incorporating the precise speech model.
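The model outlined in the abstract can be summarized in equations. The block below is a sketch in notation assumed for this page, not copied from the paper: x_{ft} is the multichannel observation at frequency f and frame t, n = 0 indexes speech and n >= 1 the noise sources, lambda_{nft} is a PSD, G_{nf} is a spatial covariance matrix (SCM), z_t is the latent variable of the deep speech model with decoder sigma_theta^2, and w, h are the low-rank (NMF) noise parameters. The paper's exact parameterization (for example, additional scaling factors on the speech PSD) may differ.

```latex
\begin{align}
\mathbf{x}_{ft} &\sim \mathcal{N}_{\mathbb{C}}\Bigl(\mathbf{0},\ \sum_{n=0}^{N} \lambda_{nft}\,\mathbf{G}_{nf}\Bigr), \\
\lambda_{0ft} &= \bigl[\boldsymbol{\sigma}_{\theta}^{2}(\mathbf{z}_{t})\bigr]_{f},
\qquad \mathbf{z}_{t} \sim \mathcal{N}(\mathbf{0}, \mathbf{I}), \\
\lambda_{nft} &= \sum_{k=1}^{K} w_{nkf}\, h_{nkt} \qquad (n \ge 1).
\end{align}
```

In the full-rank (MNMF-style) version each G_{nf} is an unconstrained positive definite matrix, whereas in the rank-1 (ILRMA-style) version G_{nf} = a_{nf} a_{nf}^H for a steering vector a_{nf}.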

Enhancement results for CHiME-3 evaluation set (BUS: F06_445C0211_BUS)

Since every method outputs 5-channel source images, we show the image at the fifth channel. When the number of noise sources is larger than 1, we show the sum over all noise sources.
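For readers who want to see how source images can be obtained from estimated PSDs and SCMs, the following is a minimal sketch of multichannel Wiener filtering, the standard way to do this under a local Gaussian model. It is not the authors' code: the function name, array layouts, the convention that source 0 is speech, and the regularization constant `eps` are all assumptions made for illustration.

```python
import numpy as np

def multichannel_wiener_filter(x, psd, scm, eps=1e-10):
    """Estimate source images with a multichannel Wiener filter.

    x:   (F, T, M) complex mixture STFT
    psd: (N, F, T) nonnegative PSDs, one per source
    scm: (N, F, M, M) Hermitian positive semidefinite SCMs
    Returns an (N, F, T, M) array of complex source images.
    """
    n_mics = x.shape[-1]
    # Covariance of each source image: lambda_nft * G_nf -> (N, F, T, M, M)
    cov = psd[..., None, None] * scm[:, :, None, :, :]
    # Mixture covariance, lightly regularized for numerical stability (assumed)
    mix_cov = cov.sum(axis=0) + eps * np.eye(n_mics)
    # Solve mix_cov @ y = x for every time-frequency bin -> (F, T, M, 1)
    y = np.linalg.solve(mix_cov, x[..., None])
    # Source image n = cov_n @ mix_cov^{-1} @ x -> (N, F, T, M)
    return (cov @ y[None])[..., 0]
```

With this layout, and assuming source 0 is speech, the panels shown on this page would correspond to `images[0, :, :, 4]` for the separated speech at channel 5 and `images[1:].sum(axis=0)[:, :, 4]` for the sum of all noise sources at channel 5.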

Ground truth source image

Observation (channel 5)

MNMF with a Deep Speech Prior (MNMF-DP, semi-supervised) (Number of noise = 1, Number of noise bases = 64)
The estimated speech PSD

The separated speech (ch.5)

The separated speech (ch.5)
The estimated noise PSD

The separated noise (ch.5)

The separated noise (ch.5)
ILRMA with a Deep Speech Prior (ILRMA-DP, semi-supervised) (Number of noise = 4, Number of noise bases = 2)
The estimated speech PSD

The separated speech (ch.5)

The separated speech (ch.5)
The estimated noise PSD

The separated noise (ch.5)

The separated noise (ch.5)
MNMF (Number of noise = 1, Number of speech bases = 8, Number of noise bases = 256)
The estimated speech PSD

The separated speech (ch.5)

The separated speech (ch.5)
The estimated noise PSD

The separated noise (ch.5)

The separated noise (ch.5)
ILRMA (Number of noise = 4, Number of speech bases = 8, Number of noise bases = 1)
The estimated speech PSD

The separated speech (ch.5)

The separated speech (ch.5)
The estimated noise PSD

The separated noise (ch.5)

The separated noise (ch.5)
MNMF-NMF (semi-supervised) (Number of noise = 1, Number of speech bases = 4, Number of noise bases = 256)
The estimated speech PSD

The separated speech (ch.5)

The separated speech (ch.5)
The estimated noise PSD

The separated noise (ch.5)

The separated noise (ch.5)
ILRMA-NMF (semi-supervised) (Number of noise = 4, Number of speech bases = 16, Number of noise bases = 1)
The estimated speech PSD

The separated speech (ch.5)

The separated speech (ch.5)
The estimated noise PSD

The separated noise (ch.5)

The separated noise (ch.5)
MNMF-VQ (semi-supervised) (Number of noise = 1, Number of speech bases = 8, Number of noise bases = 256)
The estimated speech PSD

The separated speech (ch.5)

The separated speech (ch.5)
The estimated noise PSD

The separated noise (ch.5)

The separated noise (ch.5)
ILRMA-VQ (semi-supervised) (Number of noise = 4, Number of speech bases = 256, Number of noise bases = 2)
The estimated speech PSD

The separated speech (ch.5)

The separated speech (ch.5)
The estimated noise PSD

The separated noise (ch.5)

The separated noise (ch.5)

Enhancement results for CHiME-3 evaluation set (CAF: M06_445C0205_CAF)

Ground truth source image

Observation (channel 5)

MNMF with a Deep Speech Prior (MNMF-DP, semi-supervised) (Number of noise = 1, Number of noise bases = 64)
The estimated speech PSD

The separated speech (ch.5)

The separated speech (ch.5)
The estimated noise PSD

The separated noise (ch.5)

The separated noise (ch.5)
ILRMA with a Deep Speech Prior (ILRMA-DP, semi-supervised) (Number of noise = 4, Number of noise bases = 2)
The estimated speech PSD

The separated speech (ch.5)

The separated speech (ch.5)
The estimated noise PSD

The separated noise (ch.5)

The separated noise (ch.5)
MNMF (Number of noise = 1, Number of speech bases = 8, Number of noise bases = 256)
The estimated speech PSD

The separated speech (ch.5)

The separated speech (ch.5)
The estimated noise PSD

The separated noise (ch.5)

The separated noise (ch.5)
ILRMA (Number of noise = 4, Number of speech bases = 8, Number of noise bases = 1)
The estimated speech PSD

The separated speech (ch.5)

The separated speech (ch.5)
The estimated noise PSD

The separated noise (ch.5)

The separated noise (ch.5)
MNMF-NMF (semi-supervised) (Number of noise = 1, Number of speech bases = 4, Number of noise bases = 256)
The estimated speech PSD

The separated speech (ch.5)

The separated speech (ch.5)
The estimated noise PSD

The separated noise (ch.5)

The separated noise (ch.5)
ILRMA-NMF (semi-supervised) (Number of noise = 4, Number of speech bases = 16, Number of noise bases = 1)
The estimated speech PSD

The separated speech (ch.5)

The separated speech (ch.5)
The estimated noise PSD

The separated noise (ch.5)

The separated noise (ch.5)
MNMF-VQ (semi-supervised) (Number of noise = 1, Number of speech bases = 8, Number of noise bases = 256)
The estimated speech PSD

The separated speech (ch.5)

The separated speech (ch.5)
The estimated noise PSD

The separated noise (ch.5)

The separated noise (ch.5)
ILRMA-VQ (semi-supervised) (Number of noise = 4, Number of speech bases = 256, Number of noise bases = 2)
The estimated speech PSD

The separated speech (ch.5)

The separated speech (ch.5)
The estimated noise PSD

The separated noise (ch.5)

The separated noise (ch.5)

Enhancement results for CHiME-3 evaluation set (PED: M05_441C0211_PED)

Ground truth source image

Observation (channel 5)

MNMF with a Deep Speech Prior (MNMF-DP, semi-supervised) (Number of noise = 1, Number of noise bases = 64)
The estimated speech PSD

The separated speech (ch.5)

The separated speech (ch.5)
The estimated noise PSD

The separated noise (ch.5)

The separated noise (ch.5)
ILRMA with a Deep Speech Prior (ILRMA-DP, semi-supervised) (Number of noise = 4, Number of noise bases = 2)
The estimated speech PSD

The separated speech (ch.5)

The separated speech (ch.5)
The estimated noise PSD

The separated noise (ch.5)

The separated noise (ch.5)
MNMF (Number of noise = 1, Number of speech bases = 8, Number of noise bases = 256)
The estimated speech PSD

The separated speech (ch.5)

The separated speech (ch.5)
The estimated noise PSD

The separated noise (ch.5)

The separated noise (ch.5)
ILRMA (Number of noise = 4, Number of speech bases = 8, Number of noise bases = 1)
The estimated speech PSD

The separated speech (ch.5)

The separated speech (ch.5)
The estimated noise PSD

The separated noise (ch.5)

The separated noise (ch.5)
MNMF-NMF (semi-supervised) (Number of noise = 1, Number of speech bases = 4, Number of noise bases = 256)
The estimated speech PSD

The separated speech (ch.5)

The separated speech (ch.5)
The estimated noise PSD

The separated noise (ch.5)

The separated noise (ch.5)
ILRMA-NMF (semi-supervised) (Number of noise = 4, Number of speech bases = 16, Number of noise bases = 1)
The estimated speech PSD

The separated speech (ch.5)

The separated speech (ch.5)
The estimated noise PSD

The separated noise (ch.5)

The separated noise (ch.5)
MNMF-VQ (semi-supervised) (Number of noise = 1, Number of speech bases = 8, Number of noise bases = 256)
The estimated speech PSD

The separated speech (ch.5)

The separated speech (ch.5)
The estimated noise PSD

The separated noise (ch.5)

The separated noise (ch.5)
ILRMA-VQ (semi-supervised) (Number of noise = 4, Number of speech bases = 256, Number of noise bases = 2)
The estimated speech PSD

The separated speech (ch.5)

The separated speech (ch.5)
The estimated noise PSD

The separated noise (ch.5)

The separated noise (ch.5)

Enhancement results for CHiME-3 evaluation set (STR: F06_446C0204_STR)

Ground truth source image

Observation (channel 5)

MNMF with a Deep Speech Prior (MNMF-DP, semi-supervised) (Number of noise = 1, Number of noise bases = 64)
The estimated speech PSD

The separated speech (ch.5)

The separated speech (ch.5)
The estimated noise PSD

The separated noise (ch.5)

The separated noise (ch.5)
ILRMA with a Deep Speech Prior (ILRMA-DP, semi-supervised) (Number of noise = 4, Number of noise bases = 2)
The estimated speech PSD

The separated speech (ch.5)

The separated speech (ch.5)
The estimated noise PSD

The separated noise (ch.5)

The separated noise (ch.5)
MNMF (Number of noise = 1, Number of speech bases = 8, Number of noise bases = 256)
The estimated speech PSD

The separated speech (ch.5)

The separated speech (ch.5)
The estimated noise PSD

The separated noise (ch.5)

The separated noise (ch.5)
ILRMA (Number of noise = 4, Number of speech bases = 8, Number of noise bases = 1)
The estimated speech PSD

The separated speech (ch.5)

The separated speech (ch.5)
The estimated noise PSD

The separated noise (ch.5)

The separated noise (ch.5)
MNMF-NMF (semi-supervised) (Number of noise = 1, Number of speech bases = 4, Number of noise bases = 256)
The estimated speech PSD

The separated speech (ch.5)

The separated speech (ch.5)
The estimated noise PSD

The separated noise (ch.5)

The separated noise (ch.5)
ILRMA-NMF (semi-supervised) (Number of noise = 4, Number of speech bases = 16, Number of noise bases = 1)
The estimated speech PSD

The separated speech (ch.5)

The separated speech (ch.5)
The estimated noise PSD

The separated noise (ch.5)

The separated noise (ch.5)
MNMF-VQ (semi-supervised) (Number of noise = 1, Number of speech bases = 8, Number of noise bases = 256)
The estimated speech PSD

The separated speech (ch.5)

The separated speech (ch.5)
The estimated noise PSD

The separated noise (ch.5)

The separated noise (ch.5)
ILRMA-VQ (semi-supervised) (Number of noise = 4, Number of speech bases = 256, Number of noise bases = 2)
The estimated speech PSD

The separated speech (ch.5)

The separated speech (ch.5)
The estimated noise PSD

The separated noise (ch.5)

The separated noise (ch.5)

Estimated Time Differences of Arrival (TDOAs)

To confirm that the estimated SCMs or separation matrices encode the correct spatial information, we calculated the TDOAs from the estimated SCMs or separation matrices and, for comparison, from the microphone-array geometry shown on the CHiME-3 web page. Each dot indicates the TDOA calculated from the SCM at one frequency (for each channel, dots are sorted in ascending order of frequency from left to right). Horizontal black bars indicate the TDOAs calculated from the microphone positions. The TDOAs estimated from the SCMs (dots) are close to those calculated from the microphone positions (horizontal bars), which suggests that all methods extract the spatial information correctly. Note, however, that the geometrically calculated TDOAs are not necessarily correct because the speaker position is unknown.
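One common way to make such a comparison is sketched below. It is an illustrative assumption, not the authors' procedure: the TDOA at one frequency is read off the phase of the principal eigenvector of the SCM (treated as a steering vector), and the geometric TDOA is computed from microphone positions and an assumed source position. The function names, the reference-microphone convention, the speed of sound, and the sign convention (a delay of tau corresponds to a phase factor exp(-j 2 pi f tau)) are all assumptions. For ILRMA, the steering vectors could instead be taken from the columns of the inverse separation matrix.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s, assumed room-temperature value

def tdoa_from_scm(scm, freq_hz, ref=0):
    """TDOAs (seconds) relative to microphone `ref` from one (M, M) SCM.

    Only valid without spatial aliasing, i.e. when the true phase
    difference 2*pi*freq_hz*TDOA stays within (-pi, pi).
    """
    # eigh returns eigenvalues in ascending order; take the principal eigenvector
    _, eigvecs = np.linalg.eigh(scm)
    steer = eigvecs[:, -1]
    # Phase relative to the reference mic (removes the arbitrary global phase)
    phase = np.angle(steer * np.conj(steer[ref]))
    return -phase / (2.0 * np.pi * freq_hz)

def tdoa_from_geometry(mic_pos, src_pos, ref=0, c=SPEED_OF_SOUND):
    """TDOAs (seconds) relative to microphone `ref` from (M, 3) positions."""
    dist = np.linalg.norm(mic_pos - src_pos, axis=1)
    return (dist - dist[ref]) / c
```

Applying `tdoa_from_scm` to the speech SCM at every frequency bin would give one dot per frequency and channel, while `tdoa_from_geometry` would give a single horizontal bar per channel.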
F05_447C020Q_BUS
MNMF with a Deep Speech Prior (MNMF-DP)

ILRMA with a Deep Speech Prior (ILRMA-DP)

MNMF

ILRMA

F06_446C0204_STR
MNMF with a Deep Speech Prior (MNMF-DP)

ILRMA with a Deep Speech Prior (ILRMA-DP)

MNMF

ILRMA

Reference

[1] J. Barker et al., "The third 'CHiME' speech separation and recognition challenge: Dataset, task and baselines," IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 2015.