F0 Modeling with Additive Models for Corpus-based Speech Synthesis


What   We propose a novel multi-layer approach to fundamental frequency (F0) modeling for speech synthesis, based on a statistical learning technique called additive models [2]. In the two-layer version of the model, the whole F0 contour consists of a long-term, intonational-phrase-level component and a short-term, accentual-phrase-level component. The model is trained automatically from a corpus by a backfitting algorithm, which optimizes a regularized least-squares criterion. We applied the proposed method to Japanese speech and achieved state-of-the-art accuracy in F0 contour prediction.

Why   In recent years, corpus-based concatenative methods for speech synthesis have received increasing attention because of their ability to generate natural-sounding speech output [3]. We have developed an FST-based framework for corpus-based speech synthesis and successfully deployed it in our spoken dialog systems [5]. So far we have been able to synthesize good-quality speech without explicitly handling prosodic aspects such as F0 contour or duration, because our synthesizers were developed for particular task domains with naturally constrained responses. We now aim to make our synthesis framework applicable to unrestricted domains. In such domains, for synthesized speech to be natural and understandable, it is crucial to have a proper F0 contour that is compatible with the linguistic information in the input text, such as lexical accent (or stress) and phrasing.


Figure 1: A schematic diagram of the additive F0 model.

How   In our two-layer model, the F0 contour Y is expressed as the output of a statistical model that is a superposition of a long-term, intonational-phrase-level component g and a shorter-term, accentual-phrase-level component h,
Y = \alpha + g_I(U) + h_A(V) + \epsilon   (1)
where \alpha is a constant and I is a symbolic-valued variable that represents the type of intonational phrase and indexes the relevant function g_I. U is a continuous variable representing a time point relative to the starting point of the phrase of type I. Similarly, A designates the type of accentual phrase and V represents a time point relative to the starting point of the accentual phrase of type A. The random error term \epsilon has mean zero. Figure 1 shows how the three terms sum to form the whole F0 contour function. A distinctive characteristic of this approach is that we do not have to assume any parameterized functional form; we only assume smoothness, defined in terms of curvature. The estimation scheme is derived from the least-squares criterion with a roughness penalty [2]. The penalized residual sum of squares has the following form:
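PRSS(\alpha, \{g_i\}, \{h_a\}) = \sum_{n=1}^{N} \{ y_n - \alpha - g_{i_n}(u_n) - h_{a_n}(v_n) \}^2 + \lambda_g \sum_{i \in r(I)} \int \{ g_i''(u) \}^2 du + \lambda_h \sum_{a \in r(A)} \int \{ h_a''(v) \}^2 dv   (2)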

where {(i_n, u_n, a_n, v_n, y_n) | n = 1, ..., N} is the set of training data, corresponding to the variables (I, U, A, V, Y), and \lambda_g and \lambda_h are fixed smoothing parameters. r(I) and r(A) represent the sets of possible values for I and A, respectively. The first term measures closeness to the data, while the second and third terms penalize curvature in the functions; \lambda_g and \lambda_h establish the tradeoff between the two. It can be shown that the minimizer of (2) is an additive cubic spline model, in which each g_I and h_A is a natural cubic spline, and the solution can be found with a simple iterative procedure called backfitting [2].
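The backfitting idea can be illustrated with a minimal sketch. It is not the implementation used in this work: it assumes, for simplicity, that time points are discretized into integer bins, that phrase types are integer-coded, and that the curvature integral is replaced by a sum of squared second differences (a discrete stand-in for the natural cubic spline smoother). The names `smooth`, `backfit`, and `predict`, the fixed iteration count, and the tiny ridge term are all illustrative choices.

```python
import numpy as np

def second_diff_penalty(T):
    """Matrix P such that f @ P @ f is the sum of squared second differences of f."""
    D = np.diff(np.eye(T), n=2, axis=0)  # shape (T-2, T)
    return D.T @ D

def smooth(t, r, T, lam):
    """Fit grid values f[0..T-1] minimizing sum (r_n - f[t_n])^2 + lam * roughness."""
    counts = np.bincount(t, minlength=T).astype(float)
    sums = np.bincount(t, weights=r, minlength=T)
    # Tiny ridge keeps the system solvable when a type has very little data.
    A = np.diag(counts) + lam * second_diff_penalty(T) + 1e-9 * np.eye(T)
    return np.linalg.solve(A, sums)

def backfit(i, u, a, v, y, T_u, T_v, lam_g, lam_h, n_iter=50):
    """Alternately refit g (given h) and h (given g) on partial residuals."""
    y = np.asarray(y, dtype=float)
    alpha = y.mean()  # simplification: the constant is fixed at the data mean
    g = {int(k): np.zeros(T_u) for k in np.unique(i)}
    h = {int(k): np.zeros(T_v) for k in np.unique(a)}
    g_fit = np.zeros_like(y)
    h_fit = np.zeros_like(y)
    for _ in range(n_iter):
        r = y - alpha - h_fit            # residual with h held fixed
        for k in g:
            m = i == k
            g[k] = smooth(u[m], r[m], T_u, lam_g)
            g_fit[m] = g[k][u[m]]
        r = y - alpha - g_fit            # residual with g held fixed
        for k in h:
            m = a == k
            h[k] = smooth(v[m], r[m], T_v, lam_h)
            h_fit[m] = h[k][v[m]]
    return alpha, g, h

def predict(alpha, g, h, i, u, a, v):
    """Evaluate the additive model alpha + g_i(u) + h_a(v) at each data point."""
    return np.array([alpha + g[int(ii)][uu] + h[int(aa)][vv]
                     for ii, uu, aa, vv in zip(i, u, a, v)])
```

Each pass is a block-coordinate step on the discretized version of criterion (2) with the constant held fixed, so the penalized objective cannot increase from one pass to the next. A spline-based smoother could be substituted for `smooth` without changing the outer loop.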

Progress   We first applied the proposed method to F0 modeling of Japanese speech. In this implementation, the intonational phrase type I represents the number of moras (or syllables) in the intonational phrase. The accentual phrase type A is a pair (m, n), where m is the number of moras in the accentual phrase and n is the position of the accent nucleus. We estimated the component functions g_i and h_a in the log-frequency domain using a corpus of Japanese utterances read by a female speaker. The corpus comprised 7,282 utterances.
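To make the training tuples (i_n, u_n, a_n, v_n, y_n) concrete, the following sketch builds them from one annotated utterance. The input layout is hypothetical: the `Phrase` record, times in seconds, an F0 value of 0 marking unvoiced frames, and the use of the natural logarithm are all assumptions made for illustration, not details taken from the corpus described above.

```python
import math
from typing import NamedTuple

class Phrase(NamedTuple):
    start: float   # phrase start time in seconds (assumed unit)
    end: float     # phrase end time in seconds
    label: object  # phrase type: mora count for I, (moras, nucleus) pair for A

def find_phrase(phrases, t):
    """Return the phrase whose time span contains t, or None if there is none."""
    return next((p for p in phrases if p.start <= t < p.end), None)

def training_rows(f0_samples, int_phrases, acc_phrases):
    """Turn (time, F0 in Hz) samples into (i, u, a, v, y) tuples with y = log F0."""
    rows = []
    for t, f0 in f0_samples:
        if f0 <= 0:
            continue  # skip unvoiced frames (assumed to be coded as 0)
        ip = find_phrase(int_phrases, t)
        ap = find_phrase(acc_phrases, t)
        if ip is None or ap is None:
            continue  # sample falls outside the annotated phrases
        rows.append((ip.label, t - ip.start, ap.label, t - ap.start, math.log(f0)))
    return rows

# Hypothetical utterance: one 7-mora intonational phrase, two accentual phrases.
int_phrases = [Phrase(0.0, 1.0, 7)]
acc_phrases = [Phrase(0.0, 0.4, (3, 1)), Phrase(0.4, 1.0, (4, 0))]
samples = [(0.1, 200.0), (0.5, 0.0), (0.6, 180.0)]
rows = training_rows(samples, int_phrases, acc_phrases)
```

Here the second sample is dropped as unvoiced, and the third is assigned to the accentual phrase of type (4, 0) with a relative time of about 0.2 s.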


Figure 2: F0 contour from the model, displayed with the actual F0 contour. The orange dots are the actual F0 values from the test data, and the blue dots are the F0 contour derived from the additive F0 model.

Figure 2 shows an example of the estimated F0 contour plotted with the actual F0 from the test data. As a preliminary evaluation, we measured the goodness of fit in terms of root mean square error (RMSE) and correlation coefficient (Corr). The RMSE was 29.8 Hz and the Corr was 0.77 on the test data. The standard deviation of the corpus F0 was 48.2 Hz. Although it is difficult to compare performance across different speech corpora and languages, we believe these results are comparable to the state-of-the-art results of 33--34 Hz RMSE and 0.6--0.72 Corr reported on a female-speaker English radio news corpus [1,4], where the standard deviation was reported as, e.g., 53 Hz in [1].
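The two measures can be computed as in the sketch below. It assumes that predicted and actual F0 values are compared in Hz over the same voiced frames, and that a model fitted in the log domain is mapped back with the exponential (the natural logarithm is an assumption); the function names are illustrative.

```python
import numpy as np

def rmse_hz(pred_hz, actual_hz):
    """Root mean square error between predicted and actual F0, both in Hz."""
    pred_hz = np.asarray(pred_hz, dtype=float)
    actual_hz = np.asarray(actual_hz, dtype=float)
    return float(np.sqrt(np.mean((pred_hz - actual_hz) ** 2)))

def corr(pred_hz, actual_hz):
    """Pearson correlation coefficient between predicted and actual F0."""
    return float(np.corrcoef(pred_hz, actual_hz)[0, 1])

def log_to_hz(log_f0):
    """Map natural-log F0 predictions back to Hz (assumes natural log was used)."""
    return np.exp(np.asarray(log_f0, dtype=float))

# Toy values for illustration only; these are not data from the corpus.
pred = log_to_hz(np.log([100.0, 200.0, 300.0]))
actual = np.array([110.0, 190.0, 310.0])
error = rmse_hz(pred, actual)   # every frame is off by 10 Hz
r = corr(pred, actual)
```

Comparing the RMSE with the standard deviation of the corpus F0 gives a rough sense of scale, since a model that always predicted the mean F0 would have an RMSE close to that standard deviation.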

Future   We plan to incorporate the F0 values predicted by the model into our speech synthesis system. We also plan to apply this framework to F0 modeling of English, for more general-purpose concatenative speech synthesis.


  1. K. Dusterhoff, A. Black, and P. Taylor, "Using Decision Trees within the Tilt Intonation Model to Predict F0 Contours", EUROSPEECH-99, 1999.
  2. T. Hastie and R. Tibshirani, "Generalized Additive Models", Chapman and Hall, 1990.
  3. A. Hunt and A. Black, "Unit Selection in a Concatenative Speech Synthesis System Using a Large Speech Database", ICASSP-96, 1996, pp.373--376.
  4. X. Sun, "F0 Generation for Speech Synthesis Using a Multi-Tier Approach", Proc. ICSLP-2002, pp.2077--2080, Denver, 2002.
  5. J. Yi, J. Glass, and L. Hetherington, "A Flexible, Scalable Finite-State Transducer Architecture for Corpus-Based Concatenative Speech Synthesis", Proc. ICSLP-2000, pp.322--325, Beijing, Oct. 2000.
Last update: 7/30/2004