An Experiment in the Application of Semantic Feedback for Speech Understanding

Nigel WARD

Mech-Info Engineering, University of Tokyo
Bunkyo-ku, Tokyo 113 Japan
e-mail: nigel@sanpo.t.u-tokyo.ac.jp

Since the early days of speech understanding research, many ``fully integrated'' speech understanding systems have been proposed and built. The hope is that applying various kinds of knowledge together will give better results than applying the knowledge sources sequentially and in isolation.
This paper reports on yet another approach to fully integrated speech understanding. It fairly directly models a key human ability: the ability to ``fill in the missing words'' of a half-understood input based on semantics. One novelty is that the semantic feedback is provided by a neural network, so the semantic module is trainable from data.
The system consists of a word spotter, a parser, and a semantic interpreter. The parser works from the lattice of word hypotheses, and outputs some ``clues'' to a semantic interpretation. The semantic interpreter takes these clues and comes up with a refined interpretation.
The parser also supports feedback, meaning that it can use the information in the refined semantic interpretation to re-score the word hypotheses. The parser can then be invoked once again, to arrive at a re-revised semantic interpretation.
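Schematically, and assuming hypothetical interfaces for the three modules (the names below are mine, not the system's actual ones), the loop is:

    def understand(speech, spotter, parser, interpreter, feedback_rounds=1):
        """Spot words, parse, interpret; then feed the refined interpretation
        back to re-score the word hypotheses, and parse and interpret again."""
        lattice = spotter.spot(speech)                  # lattice of word hypotheses
        clues = parser.parse(lattice)                   # clues to an interpretation
        interpretation = interpreter.refine(clues)      # refined interpretation
        for _ in range(feedback_rounds):
            parser.rescore(lattice, interpretation)     # semantic feedback
            clues = parser.parse(lattice)
            interpretation = interpreter.refine(clues)  # re-revised interpretation
        return interpretation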
This means that the system can use semantic information to re-reason about syntactic and word hypotheses. For example, if the semantic interpreter provides the information that ``{Batman} was probably the responsible agent in this input'', then all hypothesized appearances of the word ``{Batman}'' can be re-scored as more or less likely, depending on whether the time spans where they appear are (for syntactic reasons) likely or unlikely regions for agents to occur.
On an 11-utterance test set, the summed semantic error of the final interpretations was 14.41 without semantic feedback and 11.35 with it. The point to note is that the use of feedback based on semantic knowledge is effective.

The parser is the key to the success of this system; it is what computes the implications of semantic knowledge for the likelihoods of word hypotheses.
The key idea is that it is possible to represent syntactic knowledge as an inventory of constructions, analogous to the representation of lexical knowledge as an inventory of words. Each construction is a pairing of form and meaning. The form is a sequence of constituents. Examples of constructions are the Subject-Predicate Construction, the Transitive Construction, the Adjective-Noun Construction, and the Passive Construction. It seems that constructions are an adequate representational mechanism for writing complete grammars; doing so is the enterprise of ``Construction Grammar''.
Adopting this view of grammar allows a parser to use ``construction hypotheses'', that is, hypotheses of the form ``constituent X of construction Y is present over time span Z''. Such ``construction hypotheses'' have the advantages of being simple, being suited to consideration in parallel, being easily scorable based on word hypothesis scores, and relating directly to semantics.
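To make this concrete, here is a minimal sketch of how a construction and a construction hypothesis might be represented; the field names and types are my assumptions, not the system's actual data structures.

    from dataclasses import dataclass
    from typing import Tuple

    @dataclass
    class Construction:
        """A pairing of form and meaning, e.g. the Subject-Predicate Construction."""
        name: str
        constituents: Tuple[str, ...]  # the form: a sequence of constituents
        meaning: str                   # the associated semantics

    @dataclass
    class ConstructionHypothesis:
        """``Constituent X of construction Y is present over time span Z''."""
        construction: Construction     # Y
        constituent: int               # X: an index into construction.constituents
        span: Tuple[int, int]          # Z: (start frame, end frame)
        score: float                   # derived from word-hypothesis scores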
A construction hypothesis can be spawned when there is a good match between the constituents of the construction and the hypothesized words in some time range.
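The following sketch illustrates one way spawning could work: a hypothesis is spawned wherever a chain of word hypotheses matches the construction's constituent categories in order. The adjacency test, scoring, and threshold are illustrative assumptions.

    from dataclasses import dataclass
    from typing import List, Tuple

    @dataclass
    class WordHyp:
        word: str
        category: str          # lexical category, e.g. "NP" or "V"
        span: Tuple[int, int]  # (start frame, end frame)
        score: float

    def adjacent(a: WordHyp, b: WordHyp, slack: int = 3) -> bool:
        """True if b begins roughly where a ends."""
        return 0 <= b.span[0] - a.span[1] <= slack

    def spawn(constituents: List[str], word_hyps: List[WordHyp],
              threshold: float = 0.5):
        """Yield a (span, score) construction hypothesis for each good match
        between the constituent sequence and a chain of word hypotheses."""
        def extend(chain, remaining):
            if not remaining:
                score = sum(w.score for w in chain) / len(chain)
                if score >= threshold:
                    yield (chain[0].span[0], chain[-1].span[1]), score
                return
            for w in word_hyps:
                if w.category == remaining[0] and (not chain or adjacent(chain[-1], w)):
                    yield from extend(chain + [w], remaining[1:])
        return extend([], constituents)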
A construction hypothesis which spans a certain time range can be used for interpreting that part of the input. To give just one example, suppose that: A. there is a hypothesized occurrence of the Subject-Predicate Construction for which the first constituent spans the time span from the 10th to the 22nd frame, and B. an occurrence of the word ``{John}'' is hypothesized in the time span from the 11th to the 19th frame. From this, since the time spans overlap, there is evidence for ``{John}'' being the subject, and therefore evidence for it being the topic, active, and so on.
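In code, the overlap test and the resulting clue might look like this, using the frame numbers from the example above (the clue format is an assumption):

    def overlaps(a, b):
        """True if two (start, end) frame spans intersect."""
        return a[0] < b[1] and b[0] < a[1]

    subject_span = (10, 22)  # A: first constituent of the Subject-Predicate hypothesis
    john_span = (11, 19)     # B: hypothesized occurrence of "John"

    clues = []
    if overlaps(subject_span, john_span):
        # evidence that "John" is the subject, hence topic, active, and so on
        clues.append(("John", "subject"))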

Conversely, for feedback, semantic rescoring of such clues causes rescoring of construction hypotheses and lexical hypotheses. For example, for the Batman example above, the information about the semantic role of ``{Batman}'' (namely ``responsible''), plus the placements of the highly-scored construction hypotheses pertaining to responsibility (for example, the first constituent of the Subject-Predicate Construction marks responsibility), gives information on where in the input, if at all, the word ``{Batman}'' is likely to have appeared. Note that the parser can at any time have many incompatible construction hypotheses; it never arrives at a single consistent interpretation. This means that it is easy to rescore the various hypotheses as more information comes in, and the most highly scored hypotheses dominate the computation.
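A rough sketch of the feedback rescoring just described, reusing the word-hypothesis representation from the sketches above (the multiplicative boost and penalty are illustrative assumptions, not the system's actual scoring):

    def rescore_by_role(word_hyps, agent_spans, word="Batman",
                        boost=1.3, penalty=0.8):
        """Re-score hypothesized occurrences of `word`, semantically likely
        to be the responsible agent, by whether their spans overlap spans
        that syntactically mark responsibility (e.g. first constituents of
        highly scored Subject-Predicate hypotheses)."""
        def overlaps(a, b):
            return a[0] < b[1] and b[0] < a[1]
        for w in word_hyps:
            if w.word == word:
                if any(overlaps(w.span, s) for s in agent_spans):
                    w.score *= boost    # falls in a likely agent region
                else:
                    w.score *= penalty  # falls outside all likely agent regions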
In that it makes syntactic hypotheses parallel, scored, and independent, this approach combines the best points from chart parsing, probabilistic parsing, and partial parsing, respectively.
Directions:
1. The current approach should be useful for simple tasks where there is rich semantic knowledge and possibly poor recognition results; I therefore plan to use it to build a voice-input-enhanced graphical user interface to a simple program.
2. So far the system provides quantitatively good results for only one of the possible feedback pathways; more work on this is needed.
3. The parser (and semantic interpreter) need on-line versions, so that the left context can be exploited to guide the recognizer as it goes along.
4. It remains to be investigated whether these ideas are also applicable to larger-vocabulary, more complex tasks.

Keywords: integrated, speech understanding, syntax, clues, feedback