Stochastic Language Design for Speech Recognition

Toru HITAKA and Yoichi TOMIURA

Department of Computer and Communication Engineering, Faculty of Engineering,
Kyushu University
6-10-1 Hakozaki, Higashi-ku, Fukuoka 810, JAPAN

Stochastic Context Free Grammars are widely used as a convenient probabilistic language model for speech recognition. The stochastic grammar has parameters to be set accordingly to the set ${F}$ of sample sentences with grammatical structure. In previous studies of the parameter estimation, as known as most likelihood estimation, some evaluation function on parameters and on {F} is introduced first, then the values of parameters which maximize the value of the function are chosen. But this type of parameter estimation theory so far lacks something very important, that is {consistency}. Iff estimated values of parameters converge upon true values in probability when the number of samples becomes large, that estimation is called {consistent}. One of our study objectives is to find a consistent parameter estimation for stochastic context free grammars.
A stochastic grammar is called inconsistent, iff the sum of probabilities of all sentences is smaller than 1. In our further study, it will be proved that sample set of grammatical structures of sentences don't reflect on true values of parameters. This means that for an inconsistent grammar, the expected applied number of a rewriting rule is not proportionate to the true value of the parameter attached to the rule, so it is hard to come up with a consistent parameter estimation for {inconsistent} stochastic context free grammars. Therefore, characterization of {consistent} (or { inconsistent}) grammar is very important for our research.
Existence of a inconsistent stochastic context free grammar was first introduced by T.L.Booth and J.D.Thompson in 1973 and they gave a sufficient condition on which a stochastic context free grammar is {consistent}. But this condition is proved to be only a sufficient one, not a necessary one, in our research and they offered no procedure to compute the sum of probabilities of all sentences.
We concentrated our research mainly on the characterization of { consistency} (or {inconsistency}) of stochastic context free grammar this year and solved almost all problems related to the subject. Our main results are as follows.

(1) Characteristic Equations for which the sum of probabilities of all sentences is one of the solutions are introduced from the definition of stochastic context free grammars.

(2) The sum of probabilities of all sentences is characterized as the smallest positive solution of the associated characteristic equations.

(3) Asymptotic Expansion Formula for the smallest positive solution of characteristic equations is given, so that the sum of probabilities of all sentences can be computed with adequate accuracy.

(4) All non-ill-formed stochastic regular grammars, in which every nonterminal symbol derives at least a string of terminal symbols and is derived in a sentential form from the starting symbol {S}, are consistent.

Keywords: parameter estimation, consistent estimation, stochastic grammar