An Experiment in the Application of Semantic Feedback for Speech Understanding
Nigel WARD
Mech-Info Engineering, University of Tokyo
Bunkyo-ku, Tokyo 113 Japan
e-mail: nigel@sanpo.t.u-tokyo.ac.jp
Since the early days of speech understanding research, many
``fully integrated'' speech understanding systems have been proposed
and built. The hope is that applying various kinds of knowledge
together will give better results than applying the knowledge sources
sequentially and in isolation.
This paper reports on yet another approach to fully integrated
speech understanding. It fairly directly models a key human ability:
that of being able to ``fill in the missing words'' of a
half-understood input based on semantics. One novelty is that the
semantic feedback is provided by a neural network, and so the semantic
module is trainable from data.
The system consists of a word spotter, a parser, and a semantic
interpreter. The parser works from the lattice of word hypotheses,
and outputs some ``clues'' to a semantic interpretation. The semantic
interpreter takes these clues and comes up with a refined
interpretation.
The parser also supports feedback, meaning that it can use the
information in the refined semantic interpretation to re-score the
word hypotheses. From this once again the parser can be invoked to
arrive at a re-revised semantic interpretation.
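The spot--parse--interpret loop with feedback can be sketched, very
schematically, as follows. The function names and signatures here are
hypothetical stand-ins for illustration, not the system's actual
interfaces:

```python
def understand(signal, spot_words, parse, interpret, rescore, rounds=2):
    """Run the parse/interpret loop with semantic feedback.

    All component functions are hypothetical placeholders:
      spot_words  -- produces a lattice of scored word hypotheses
      parse       -- extracts ``clues'' to a semantic interpretation
      interpret   -- refines the clues into an interpretation
      rescore     -- feeds the interpretation back to re-score the lattice
    """
    lattice = spot_words(signal)            # word hypotheses with scores
    interpretation = None
    for _ in range(rounds):
        clues = parse(lattice)              # clues from the word lattice
        interpretation = interpret(clues)   # refined interpretation
        lattice = rescore(lattice, interpretation)  # semantic feedback
    return interpretation
```

With `rounds=2` this performs exactly one feedback pass: parse,
interpret, re-score, then parse and interpret once more to arrive at
the re-revised interpretation.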
This means that the system can use semantic information to re-reason
about syntactic and word hypotheses. For example, if the semantic
interpreter provides the information that ``{Batman}'' was likely
responsible (the agent) in this input, then all hypothesized
appearances of the word ``{Batman}'' can be scored as more or less
likely, depending on whether the time span where they appear is (for
syntactic reasons) scored as likely or unlikely to be a region where
agents occur.
On an 11-utterance test set, the summed semantic errors of the final
interpretations were as shown in the table. The point to note is that
the use of feedback based on semantic knowledge is effective
(decreasing the error from 14.41 to 11.35).
The parser is the key to the success of this system; it is what
computes the implications of semantic knowledge for the likelihoods of
word hypotheses.
The key idea is that it is
possible to represent syntactic knowledge as an inventory of
constructions, analogous to the representation of lexical knowledge
as an inventory of words. Each construction is a pairing of form and
meaning. The form is a sequence of constituents. Examples of
constructions are the Subject-Predicate Construction, the Transitive Construction,
the Adjective-Noun Construction, and the Passive Construction.
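As an illustration, such an inventory of form--meaning pairings might
be represented as follows. The constituent slot labels and role names
below are invented for this sketch and are not taken from the system:

```python
from dataclasses import dataclass

@dataclass
class Construction:
    """A pairing of form and meaning: the form is an ordered sequence
    of constituent slots; the meaning maps slot indices to semantic
    roles (labels here are purely illustrative)."""
    name: str
    constituents: tuple   # the form: ordered constituent slots
    roles: dict           # slot index -> semantic role

# Two illustrative inventory entries.
SUBJ_PRED = Construction("Subject-Predicate", ("np", "vp"),
                         {0: "responsible"})
ADJ_NOUN = Construction("Adjective-Noun", ("adj", "noun"),
                        {0: "property", 1: "entity"})
```

An inventory of such records is analogous to a lexicon: parsing then
amounts to matching constructions against the input, just as word
spotting matches lexical entries.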
Constructions appear to be a representational mechanism adequate for
writing complete grammars; doing so is the enterprise of
``Construction Grammar''.
Adopting this view of grammar allows a parser to use ``construction
hypotheses'', that is, hypotheses of the form ``constituent X of
construction Y is present over time span Z''. Such ``construction
hypotheses'' have the advantages of being simple, being suited to
consideration in parallel, being easily scorable based on word
hypothesis scores, and relating directly to semantics.
A construction hypothesis can be spawned when there is a good match
between the constituents of the construction and the hypothesized
words in some time range.
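A minimal sketch of such a spawning rule for the two-constituent case
follows. The matching criterion (slots filled in temporal order, score
averaged from the word scores, a fixed threshold) is an assumption made
for this sketch, not the system's actual rule:

```python
def spawn_hypotheses(word_hyps, name, constituents, threshold=0.5):
    """Propose construction hypotheses wherever hypothesized words can
    fill both constituent slots in order.  Two-slot case only; the
    matching rule and threshold are invented for illustration.

    word_hyps: list of (word, category, start_frame, end_frame, score).
    Returns a list of (construction_name, start, end, score).
    """
    hyps = []
    for w1 in word_hyps:
        for w2 in word_hyps:
            slots_match = (w1[1], w2[1]) == tuple(constituents)
            in_order = w1[3] <= w2[2]        # w1 ends before w2 begins
            if slots_match and in_order:
                score = (w1[4] + w2[4]) / 2  # scored from word scores
                if score >= threshold:
                    hyps.append((name, w1[2], w2[3], score))
    return hyps
```

Because each hypothesis is scored independently from the word scores,
many incompatible hypotheses can coexist and be considered in parallel.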
A construction hypothesis which spans a certain time range can be used
for interpreting that part of the input. To give just one example,
suppose that: A. there is a hypothesized occurrence of the
Subject-Predicate Construction for which the first constituent spans
the time span from the 10th to the 22nd frame, and B. an occurrence of
the word ``{John}'' is hypothesized in the time span from the 11th to
the 19th frame. Since the time spans
overlap, there is evidence for ``{John}'' being the subject and,
therefore, evidence for its being the topic, active, and so on.
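The span-overlap computation behind this can be sketched as below. The
evidence measure (fraction of the word's span covered by the
constituent slot) is a crude stand-in, assumed for illustration, for
however the system actually combines span evidence:

```python
def overlap(a, b):
    """Number of frames shared by two (start, end) frame spans."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))

def slot_evidence(constituent_span, word_span):
    """Fraction of the word's span covered by the constituent slot --
    an invented evidence measure for this sketch."""
    return overlap(constituent_span, word_span) / (word_span[1] - word_span[0])
```

For the example above, the first constituent spans frames 10 to 22 and
``{John}'' spans frames 11 to 19, so the word lies entirely inside the
slot and the evidence is 8/8 = 1.0.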
Conversely, for feedback, semantic rescoring of such clues causes
rescoring of construction hypotheses and lexical hypotheses. For
example, for the Batman example of \S 2, the information about the semantic
role of ``{Batman}'' (that it is ``responsible''), plus the placements of the
highly-scored construction hypotheses pertaining to responsibility
(for example, the first constituent of the subject-predicate construction
marks responsibility), gives information on where in the input, if at all,
the word ``{Batman}'' is likely to have appeared. Note that the parser
at any time can have many incompatible construction hypotheses; it
never arrives at a single consistent interpretation. This means that
it is easy to rescore the various hypotheses as more information comes
in; and the most highly scored hypotheses dominate computations.
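A sketch of this feedback rescoring follows. The agent regions stand
for the spans that highly-scored construction hypotheses mark as agent
positions; the multiplicative boost and penalty factors are invented
for this sketch:

```python
def rescore_word(word_hyps, word, agent_regions, boost=2.0, penalty=0.5):
    """Rescore hypothesized occurrences of `word` given semantic
    feedback that its referent is the agent (``responsible'').

    word_hyps:     list of (word, start_frame, end_frame, score)
    agent_regions: spans where construction hypotheses mark agents
    Occurrences overlapping an agent region are boosted; others are
    penalized.  Factors are illustrative, not the system's values.
    """
    out = []
    for w, start, end, score in word_hyps:
        if w == word:
            in_agent = any(max(start, s) < min(end, e)
                           for s, e in agent_regions)
            score *= boost if in_agent else penalty
        out.append((w, start, end, score))
    return out
```

Because all hypotheses simply carry scores, this rescoring never
forces a single consistent analysis; it merely shifts which hypotheses
dominate subsequent computation.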
In that it makes syntactic hypotheses parallel, scored, and
independent, this approach combines the best points from chart
parsing, probabilistic parsing, and partial parsing, respectively.
Directions:
1. The current approach should be useful for simple tasks where rich
semantic knowledge is available even if recognition results are poor;
therefore I plan to use it to build a voice-input-enhanced graphical
user interface to a simple program.
2. So far the system provides quantitatively good results for only one of
the possible feedback pathways; more work on this is needed.
3. The parser (and semantic interpreter) need on-line versions, so that
it is possible to exploit the left context to guide the recognizer as
it goes along.
4. It is necessary to investigate whether these ideas are also applicable to
larger-vocabulary, more complex tasks.
Keywords: integrated, speech understanding, syntax, clues, feedback