Demo Page

This is the accompanying web page for the following article:

Audio-to-Score Singing Transcription Based on a CRNN-HSMM Hybrid Model

Abstract

This paper describes an automatic singing transcription (AST) method that estimates a human-readable musical score of a sung melody from an input music signal. Because of the considerable pitch and temporal variation of the singing voice, a naive cascading approach that estimates an F0 contour and quantizes it with estimated tatum times cannot avoid many pitch and rhythm errors. To solve this problem, we formulate a unified generative model of a music signal that consists of a semi-Markov language model representing the generative process of latent musical notes conditioned on musical keys and an acoustic model based on a convolutional recurrent neural network (CRNN) representing the generative process of an observed music signal from the notes. The resulting CRNN-HSMM hybrid model enables us to estimate the most likely musical notes from a music signal with the Viterbi algorithm, while leveraging both the grammatical knowledge about musical notes and the expressive power of the CRNN. The experimental results showed that the proposed method outperformed the conventional state-of-the-art method and that the integration of the musical language model with the acoustic model had a positive effect on the AST performance.
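As a rough illustration of the decoding step, the sketch below (illustrative only, not the authors' implementation) runs a plain first-order Viterbi decode over a small discrete note vocabulary, combining a toy note-transition prior, standing in for the semi-Markov language model, with random per-frame acoustic log-likelihoods, standing in for the CRNN output. The paper's HSMM additionally models note durations and key conditioning, which a first-order sketch like this omits.

    # Illustrative Viterbi decode (Python/NumPy). The transition matrix is a
    # stand-in for the language model, and the random acoustic scores for the
    # CRNN output; the paper's HSMM also models note durations and keys.
    import numpy as np

    def viterbi(log_prior, log_trans, log_acoustic):
        # log_prior: (S,), log_trans: (S, S), log_acoustic: (T, S)
        T, S = log_acoustic.shape
        delta = log_prior + log_acoustic[0]       # best score ending in each state
        back = np.zeros((T, S), dtype=int)        # backpointers
        for t in range(1, T):
            scores = delta[:, None] + log_trans   # scores[i, j]: state i -> state j
            back[t] = scores.argmax(axis=0)
            delta = scores.max(axis=0) + log_acoustic[t]
        path = np.empty(T, dtype=int)             # backtrace the best state path
        path[-1] = int(delta.argmax())
        for t in range(T - 1, 0, -1):
            path[t - 1] = back[t, path[t]]
        return path

    rng = np.random.default_rng(0)
    S, T = 4, 12                                  # 4 toy note states, 12 frames
    log_prior = np.log(np.full(S, 1.0 / S))
    trans = np.where(np.eye(S, dtype=bool), 0.9, 0.1 / (S - 1))  # notes persist
    log_acoustic = np.log(rng.dirichlet(np.ones(S), size=T))
    print(viterbi(log_prior, np.log(trans), log_acoustic))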

Examples

The following examples were transcribed with the proposed method, the CRNN-based method, the HHSMM-based method [1], and the majority-vote methods, using the singing voices separated by Spleeter [2] and the tatum times estimated by madmom [3]. For copyright reasons, each input mixture signal contains only the first verse [4].
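For reference, this preprocessing can be sketched as follows (illustrative only: the input file name is hypothetical, and the paper operates on tatum times, i.e., beat subdivisions, rather than the raw beat times shown here). Spleeter's two-stem model writes vocals.wav and accompaniment.wav under the output directory, and madmom decodes an RNN beat-activation function with a dynamic Bayesian network.

    # Illustrative preprocessing: vocal separation with Spleeter [2] and
    # beat tracking with madmom [3]. The input file name is hypothetical.
    from spleeter.separator import Separator
    from madmom.features.beats import RNNBeatProcessor, DBNBeatTrackingProcessor

    mixture = "mixture.wav"                      # hypothetical input file

    # Two-stem separation: writes vocals.wav and accompaniment.wav
    # under output/<basename>/.
    Separator("spleeter:2stems").separate_to_file(mixture, "output")

    # Beat tracking: an RNN beat-activation function decoded by a DBN.
    activations = RNNBeatProcessor()(mixture)
    beat_times = DBNBeatTrackingProcessor(fps=100)(activations)
    print(beat_times[:8])                        # beat positions in seconds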

Example 1: RWC-MDB-P-2001 No. 7

Input mixture signal
Transcribed musical scores [link]

Example 2: RWC-MDB-P-2001 No. 8

Input mixture signal
Transcribed musical scores [link]

Example 3: RWC-MDB-P-2001 No. 18

Input mixture signal
Transcribed musical scores [link]

Example 4: RWC-MDB-P-2001 No. 20

Input mixture signal
Transcribed musical scores [link]

References

[1] R. Nishikimi, E. Nakamura, M. Goto, K. Itoyama, and K. Yoshii, "Bayesian Singing Transcription Based on a Hierarchical Generative Model of Keys, Musical Notes, and F0 Trajectories," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 28, pp. 1678-1691, 2020.

[2] R. Hennequin, A. Khlif, F. Voituret, and M. Moussallam, "Spleeter: A Fast and Efficient Music Source Separation Tool with Pre-trained Models," Journal of Open Source Software, vol. 5, p. 2154, 2020.

[3] S. Böck, F. Korzeniowski, J. Schlüter, F. Krebs, and G. Widmer, "madmom: A New Python Audio and Music Signal Processing Library," in ACM International Conference on Multimedia, pp. 1174-1178, 2016.

[4] M. Goto, "AIST Annotation for the RWC Music Database," in International Conference on Music Information Retrieval, pp. 359-360, 2006.