Abstract
We present a new class of tensor factorization called positive semidefinite tensor factorization (PSDTF), which decomposes a set of positive semidefinite (PSD) matrices (observations) into convex combinations of fewer PSD matrices (bases). PSDTF can be viewed as a mathematically natural extension of nonnegative matrix factorization (NMF), which decomposes a set of nonnegative vectors (observations) into convex combinations of fewer nonnegative vectors (bases).
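The parallel between the two factorizations can be written out as follows. This is a sketch in notation chosen here for illustration: the observation index n, basis index k, dimension M, and the symbols for observations, bases, and activations are assumptions, not definitions taken from this page.

```latex
\begin{align}
\text{NMF:}\quad
  & \mathbf{x}_n \approx \sum_{k=1}^{K} h_{kn}\,\mathbf{w}_k,
  && \mathbf{x}_n,\ \mathbf{w}_k \in \mathbb{R}_{\ge 0}^{M},\quad h_{kn} \ge 0, \\
\text{PSDTF:}\quad
  & \mathbf{X}_n \approx \sum_{k=1}^{K} h_{kn}\,\mathbf{V}_k,
  && \mathbf{X}_n,\ \mathbf{V}_k \in \mathbb{R}^{M \times M},\quad
     \mathbf{X}_n \succeq \mathbf{0},\ \mathbf{V}_k \succeq \mathbf{0},\quad h_{kn} \ge 0.
\end{align}
```

If every observation and every basis matrix is restricted to be diagonal, the PSDTF model reduces to the NMF model applied to the diagonal entries, which is one way to see PSDTF as an extension of NMF.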
References
MATLAB source code is available (2-clause BSD license; possibly compatible with Octave) [Code]
Demonstration
We used three mixture audio signals, each synthesized from piano sounds (011PFNOM), electric guitar sounds (131EGLPM), or clarinet sounds (311CLNOM) recorded in the RWC Music Database: Musical Instrument Sound. Each mixture signal was made by concatenating seven 2.0-s isolated or mixture sounds (C4, E4, G4, C4+E4, C4+G4, E4+G4, and C4+E4+G4). The resulting 14.0-s signals were sampled at 16 kHz. The task was to separate each mixture signal into three source signals corresponding to C4, E4, and G4, respectively. Each signal was analyzed using a Gaussian window with a width of 512 samples (M=512) and a shifting interval of 160 samples, which yields N=1400 frames. The basis PSD matrices and their activations were estimated using the multiplicative update (MU) algorithm for PSDTF based on the log-determinant divergence (LD-PSDTF) with K=3. For comparison, we used NMF based on the Kullback-Leibler divergence (KL-NMF) for amplitude-spectrogram decomposition and NMF based on the Itakura-Saito divergence (IS-NMF) for power-spectrogram decomposition. We evaluated the quality of the separated signals in terms of source-to-distortion ratio (SDR), source-to-interferences ratio (SIR), and sources-to-artifacts ratio (SAR) using the BSS Eval toolbox.
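The frame counts quoted on this page follow from the signal length, the sampling rate, and the shifting interval: 14.0 s × 16000 Hz / 160 = 1400. The sketch below reproduces that arithmetic and shows one possible way to cut a signal into Gaussian-windowed frames. It is not the authors' MATLAB code: the function names, the window spread `sigma_ratio`, and the zero-padding of the tail are all assumptions made for illustration.

```python
import math

def gaussian_window(M, sigma_ratio=0.125):
    """Gaussian window of M samples; sigma_ratio (std. dev. / M) is an assumed value."""
    center = (M - 1) / 2.0
    sigma = sigma_ratio * M
    return [math.exp(-0.5 * ((m - center) / sigma) ** 2) for m in range(M)]

def num_frames(num_samples, hop):
    """Number of frames when a new frame starts every `hop` samples."""
    return math.ceil(num_samples / hop)

def frame_signal(x, M=512, hop=160):
    """Split x into overlapping Gaussian-windowed frames of M samples.

    The tail is zero-padded (an assumption) so that every frame is complete.
    """
    N = num_frames(len(x), hop)
    pad = max(0, (N - 1) * hop + M - len(x))
    padded = list(x) + [0.0] * pad
    w = gaussian_window(M)
    return [[padded[n * hop + m] * w[m] for m in range(M)] for n in range(N)]

fs = 16000  # sampling rate in Hz
print(num_frames(round(14.0 * fs), 160))  # 1400
print(num_frames(round(8.4 * fs), 160))   # 840
```

Each row returned by `frame_signal` is one windowed frame of M samples; how a frame is turned into an observed PSD matrix is left to the released source code.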
In these experiments, LD-PSDTF clearly outperformed both NMF variants for source separation. The average SDR, SIR, and SAR were 17.7 dB, 22.2 dB, and 19.7 dB for KL-NMF, 19.1 dB, 24.0 dB, and 21.0 dB for IS-NMF, and 23.0 dB, 27.7 dB, and 25.1 dB for LD-PSDTF. We found it effective to initialize LD-PSDTF with the basis vectors and activations obtained by IS-NMF, both to reduce the computational cost and to avoid poor local optima.
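To make the IS-NMF initialization step concrete, here is a minimal plain-Python sketch of IS-NMF with the multiplicative updates commonly given in the NMF literature, run on a toy matrix with K=3. It is an illustration only: the helper names, the random initialization range, the iteration count, and the toy data are assumptions, and the step that converts the resulting basis vectors into initial PSD matrices for LD-PSDTF is not shown because this page does not describe it.

```python
import math
import random

def matmul(A, B):
    """Matrix product of two nested-list matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def is_divergence(V, Y):
    """Itakura-Saito divergence between V and Y, summed over all entries."""
    return sum(v / y - math.log(v / y) - 1.0
               for v_row, y_row in zip(V, Y) for v, y in zip(v_row, y_row))

def scaled(X, num, den):
    """Element-wise X * num / den."""
    return [[x * n / d for x, n, d in zip(xr, nr, dr)]
            for xr, nr, dr in zip(X, num, den)]

def ratios(V, Y):
    """Return the element-wise matrices V / Y^2 and 1 / Y used by both updates."""
    v_over_y2 = [[v / y ** 2 for v, y in zip(vr, yr)] for vr, yr in zip(V, Y)]
    inv_y = [[1.0 / y for y in yr] for yr in Y]
    return v_over_y2, inv_y

def is_nmf(V, K=3, n_iter=100, seed=0):
    """Factorize a strictly positive F x N matrix V as W (F x K) times H (K x N)."""
    rng = random.Random(seed)
    F, N = len(V), len(V[0])
    W = [[rng.uniform(0.5, 1.5) for _ in range(K)] for _ in range(F)]
    H = [[rng.uniform(0.5, 1.5) for _ in range(N)] for _ in range(K)]
    history = [is_divergence(V, matmul(W, H))]  # divergence before any update
    for _ in range(n_iter):
        # Update the activations H with the bases W held fixed.
        v_over_y2, inv_y = ratios(V, matmul(W, H))
        Wt = transpose(W)
        H = scaled(H, matmul(Wt, v_over_y2), matmul(Wt, inv_y))
        # Update the bases W with the new activations held fixed.
        v_over_y2, inv_y = ratios(V, matmul(W, H))
        Ht = transpose(H)
        W = scaled(W, matmul(v_over_y2, Ht), matmul(inv_y, Ht))
        history.append(is_divergence(V, matmul(W, H)))
    return W, H, history

# Toy "power spectrogram": 8 frequency bins, 12 frames, generated from 3 bases.
rng = random.Random(1)
W_true = [[rng.uniform(0.1, 2.0) for _ in range(3)] for _ in range(8)]
H_true = [[rng.uniform(0.1, 2.0) for _ in range(12)] for _ in range(3)]
V = matmul(W_true, H_true)

W, H, history = is_nmf(V, K=3)
print("IS divergence before and after:", history[0], history[-1])
```

Because the updates only multiply positive entries by positive ratios, W and H stay nonnegative throughout, which is what makes them usable as a starting point for a subsequent factorization.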
Observed mixture signal (piano: 011PFNOM)
Method | Source signals of pitch C | Source signals of pitch E | Source signals of pitch G
KL-NMF | SDR 17.4 dB, SIR 20.6 dB, SAR 19.8 dB | SDR 16.6 dB, SIR 21.1 dB, SAR 18.5 dB | SDR 19.9 dB, SIR 23.6 dB, SAR 22.3 dB
IS-NMF | SDR 18.9 dB, SIR 22.5 dB, SAR 21.5 dB | SDR 19.0 dB, SIR 24.6 dB, SAR 20.5 dB | SDR 20.4 dB, SIR 22.9 dB, SAR 23.9 dB
LD-PSDTF | SDR 21.5 dB, SIR 25.2 dB, SAR 24.0 dB | SDR 23.8 dB, SIR 28.7 dB, SAR 25.5 dB | SDR 21.5 dB, SIR 23.0 dB, SAR 27.1 dB
Observed mixture signal (electric guitar: 131EGLPM)
Method | Source signals of pitch C | Source signals of pitch E | Source signals of pitch G
KL-NMF | SDR 11.5 dB, SIR 16.0 dB, SAR 13.5 dB | SDR 14.5 dB, SIR 19.6 dB, SAR 16.2 dB | SDR 16.9 dB, SIR 22.0 dB, SAR 18.5 dB
IS-NMF | SDR 11.4 dB, SIR 17.2 dB, SAR 12.8 dB | SDR 15.4 dB, SIR 21.0 dB, SAR 16.8 dB | SDR 16.2 dB, SIR 20.8 dB, SAR 18.0 dB
LD-PSDTF | SDR 14.6 dB, SIR 19.2 dB, SAR 16.5 dB | SDR 24.2 dB, SIR 32.4 dB, SAR 24.9 dB | SDR 17.1 dB, SIR 20.2 dB, SAR 20.0 dB
Observed mixture signal (clarinet: 311CLNOM)
Method | Source signals of pitch C | Source signals of pitch E | Source signals of pitch G
KL-NMF | SDR 17.6 dB, SIR 24.5 dB, SAR 18.6 dB | SDR 17.5 dB, SIR 24.8 dB, SAR 18.5 dB | SDR 20.1 dB, SIR 27.8 dB, SAR 20.9 dB
IS-NMF | SDR 20.8 dB, SIR 25.7 dB, SAR 23.1 dB | SDR 23.0 dB, SIR 32.6 dB, SAR 23.6 dB | SDR 27.2 dB, SIR 31.6 dB, SAR 29.1 dB
LD-PSDTF | SDR 25.4 dB, SIR 30.8 dB, SAR 26.9 dB | SDR 27.5 dB, SIR 33.5 dB, SAR 29.3 dB | SDR 31.1 dB, SIR 36.8 dB, SAR 32.5 dB
We also tested LD-PSDTF on an audio signal synthesized from MIDI. The total length was 8.4 s (N=840 frames). LD-PSDTF again outperformed both NMF variants, this time by a wider margin: the average SDR, SIR, and SAR were 16.7 dB, 21.1 dB, and 18.7 dB for KL-NMF, 18.9 dB, 24.1 dB, and 20.5 dB for IS-NMF, and 26.7 dB, 33.2 dB, and 27.8 dB for LD-PSDTF. LD-PSDTF works particularly well for audio signals that satisfy its assumption that the basis signals are stationary.
Observed mixture signal (MIDI piano)
Method | Source signals of pitch C | Source signals of pitch E | Source signals of pitch G
KL-NMF | SDR 17.4 dB, SIR 21.9 dB, SAR 19.4 dB | SDR 15.5 dB, SIR 21.0 dB, SAR 18.5 dB | SDR 16.2 dB, SIR 20.6 dB, SAR 18.2 dB
IS-NMF | SDR 18.3 dB, SIR 23.9 dB, SAR 19.7 dB | SDR 20.5 dB, SIR 26.2 dB, SAR 21.9 dB | SDR 17.9 dB, SIR 22.4 dB, SAR 19.8 dB
LD-PSDTF | SDR 25.5 dB, SIR 33.7 dB, SAR 26.2 dB | SDR 30.2 dB, SIR 36.4 dB, SAR 31.4 dB | SDR 24.2 dB, SIR 29.4 dB, SAR 25.8 dB
Frequency-domain amplitude-spectrogram decomposition by KL-NMF
Frequency-domain power-spectrogram decomposition by IS-NMF
Time-domain signal-covariance decomposition by LD-PSDTF