Toru HITAKA and Yoichi TOMIURA

Department of Computer and Communication Engineering,
Faculty of Engineering,

Kyushu University

6-10-1 Hakozaki, Higashi-ku, Fukuoka 810, JAPAN

e-mail: hitaka@lang.ai.kyushu-u.ac.jp

Stochastic Context Free Grammars are widely used as a convenient
probabilistic language model for speech recognition. The stochastic
grammar has parameters to be set accordingly to the set ${F}$ of
sample sentences with grammatical structure. In previous studies of
the parameter estimation, as known as most likelihood estimation, some
evaluation function on parameters and on {F} is introduced
first, then the values of parameters which maximize the value of the
function are chosen. But this type of parameter estimation theory so
far lacks something very important, that is {consistency}. Iff
estimated values of parameters converge upon true values in
probability when the number of samples becomes large, that estimation
is called {consistent}. One of our study objectives is to find a
consistent parameter estimation for stochastic context free grammars.

A stochastic grammar is called inconsistent, iff the sum of
probabilities of all sentences is smaller than 1. In our further
study, it will be proved that sample set of grammatical structures of
sentences don't reflect on true values of parameters. This means that
for an inconsistent grammar, the expected applied number of a
rewriting rule is not proportionate to the true value of the parameter
attached to the rule, so it is hard to come up with a consistent
parameter estimation for {inconsistent} stochastic context free
grammars. Therefore, characterization of {consistent} (or {
inconsistent}) grammar is very important for our research.

Existence of a inconsistent stochastic context free grammar was first
introduced by T.L.Booth and J.D.Thompson in 1973 and they gave a
sufficient condition on which a stochastic context free grammar is
{consistent}. But this condition is proved to be only a
sufficient one, not a necessary one, in our research and they offered
no procedure to compute the sum of probabilities of all sentences.

We concentrated our research mainly on the characterization of {
consistency} (or {inconsistency}) of stochastic context free grammar
this year and solved almost all problems related to the subject. Our
main results are as follows.

(1) Characteristic Equations for which the sum of probabilities of all
sentences is one of the solutions are introduced from the definition
of stochastic context free grammars.

(2) The sum of probabilities of all sentences is characterized as the
smallest positive solution of the associated characteristic equations.

(3) Asymptotic Expansion Formula for the smallest positive solution of
characteristic equations is given, so that the sum of probabilities of
all sentences can be computed with adequate accuracy.

(4) All non-ill-formed stochastic regular grammars, in which every
nonterminal symbol derives at least a string of terminal symbols and
is derived in a sentential form from the starting symbol {S},
are consistent.

Keywords: parameter estimation, consistent estimation, stochastic grammar