ASRU 2007 Demonstrations
- Dialogue
Navigator for Kyoto City: An Interactive Information Guidance System using
Question-Answering Technique
Teruhisa Misu and Tatsuya Kawahara (Kyoto Univ., Japan)
We propose an interactive framework for information navigation based on
a document knowledge base. In conventional audio guidance systems, such as
those deployed in museums, the information flow is one-way and the content
is fixed. To make the guidance interactive, we prepare two modes, a user-initiative
retrieval/QA mode (pull-mode) and a system-initiative recommendation mode
(push-mode), and switch between them according to the user's state. In the
user-initiative retrieval/QA mode, the user can ask questions about specific
facts in the documents in addition to general queries. In the system-initiative
recommendation mode, the system actively provides the information the user
would be interested in. We implemented a navigation system containing Kyoto
city information.
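
A minimal sketch of how such pull/push mode switching might be arbitrated
(the idle-time test and its threshold below are illustrative assumptions,
not details of the actual system):

    from enum import Enum

    class Mode(Enum):
        PULL = "user-initiative retrieval/QA"
        PUSH = "system-initiative recommendation"

    def select_mode(user_asked_question: bool, idle_seconds: float) -> Mode:
        """Answer in pull-mode whenever the user asks something; fall back
        to push-mode recommendations when the user has been idle.
        The 3-second threshold is an illustrative assumption."""
        if user_asked_question:
            return Mode.PULL
        return Mode.PUSH if idle_seconds > 3.0 else Mode.PULL

    print(select_mode(False, 5.0))  # Mode.PUSH: system recommends content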
- Speech
Recognition for Closed-captioning Live Broadcasts
Toru Imai, Akio Kobayashi, Shoei Sato, Shinichi Homma, Takahiro Oku, and
Tohru Takagi (NHK, Japan)
Since 2000, NHK has been operating Japanese large-vocabulary continuous
speech recognition systems to closed-caption some of its news, sports, and
other live TV programs for hearing-impaired and elderly viewers. The first
implementation was for news programs, where anchorpersons’ read speech in a
studio was recognized directly and any recognition errors were promptly
corrected manually by operators using touch panels and keyboards. The second
was a so-called “re-speak” method, in which another speaker listens to the
original program audio and rephrases the commentary so that it can be
recognized; this enables captioning of live programs such as music shows,
baseball games, the Grand Sumo Tournaments, the Olympic Games, and World
Cup football games. The captioning systems have been in daily operation, and
we have received a large number of positive responses about them from
hearing-impaired viewers. In the demonstration session, we will recognize
sports commentary for a Major League Baseball (MLB) game directly from the
broadcasting studio using a laptop PC, and synchronously show the actual
broadcast video with its error-corrected closed captions. Notable technical
features of the demonstration are our very low-latency decoder, which is
suited to real-time captioning; an acoustic model adapted to the
commentator; a language model adapted to MLB; and closed captions produced
after immediate manual error correction.
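
A highly simplified sketch of the caption flow just described, with
recognition followed by prompt manual correction before airing (all function
names are hypothetical stand-ins for the real low-latency decoder and
operator consoles):

    def caption_pipeline(audio_chunks, recognize, operator_fix, broadcast):
        """Toy caption flow: each recognized hypothesis passes through a
        manual correction step before it is aired as a closed caption."""
        for chunk in audio_chunks:
            hypothesis = recognize(chunk)       # low-latency ASR result
            caption = operator_fix(hypothesis)  # prompt manual correction
            broadcast(caption)                  # synchronized with the video

    caption_pipeline(
        audio_chunks=["..."],
        recognize=lambda a: "the batter hits a gound ball",
        operator_fix=lambda h: h.replace("gound", "ground"),
        broadcast=print,
    )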
- A Korean Point-of-Interest
Recognition System on an Embedded Device
Hoon Chung, Jeon Gue Park, Yun Keun Lee and Ikjoo Chung (ETRI, Korea)
We have developed a speech recognition system that recognizes hundreds of
thousands of item names on a resource-limited embedded device without
serious degradation of N-best accuracy. To implement such an efficient
speech recognition system, we used subspace distribution clustering hidden
Markov model (SDCHMM) based acoustic models for memory efficiency and
proposed a multistage fast-search scheme. The proposed algorithm is composed
of a two-stage HMM-based coarse match and a detailed match. The two-stage
HMM-based coarse match rapidly selects a small set of candidates that is
assumed to contain a correct hypothesis with high probability, and the
detailed match re-ranks the candidates by acoustic rescoring. In principle,
the algorithm shares the architecture of human speech recognition (HSR) and
the multi-layered framework, in that recognition is completed through a
three-stage decoding procedure: acoustic-feature-to-phoneme conversion,
phoneme-to-word conversion, and word-level rescoring. The contribution of
our work is a statistical framework for the first two steps that is
specifically optimized for search speed. The proposed system is implemented
on an in-car navigation system with a 32-bit fixed-point processor operating
at 620 MHz. Experimental results show that the proposed method runs at a
maximum of 1.74 times real time on the embedded device while consuming
7.5 MB of working memory for a 220K-entry Korean Point-of-Interest (POI)
recognition domain.
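
A minimal sketch of the multistage search idea, with a cheap coarse match
pruning the large lexicon before a costlier detailed rescoring (the
phoneme-overlap and length-based scorers below are toy placeholders for the
two-stage HMM-based coarse match and the acoustic rescoring):

    def coarse_match(hyp_phones, lexicon, beam):
        """Coarse stage stand-in: cheaply score every entry and keep a
        short list of `beam` candidates."""
        def overlap(name):
            ref = set(lexicon[name])
            return len(ref & set(hyp_phones)) / max(len(ref), 1)
        return sorted(lexicon, key=overlap, reverse=True)[:beam]

    def detailed_match(candidates, score, nbest):
        """Detailed stage stand-in: re-rank the short list with a costlier
        scorer (acoustic rescoring in the real system)."""
        return sorted(candidates, key=score, reverse=True)[:nbest]

    lexicon = {"NAMSAN TOWER": list("namsan"),
               "NAMDAEMUN": list("namdemun"),
               "GYEONGBOKGUNG": list("kjonbokkun")}
    short_list = coarse_match(list("namsan"), lexicon, beam=2)
    print(detailed_match(short_list,
                         score=lambda n: -abs(len(lexicon[n]) - 6),
                         nbest=1))  # ['NAMSAN TOWER']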
- Sensei: A web-enabled tool for assessment and improvement of spoken
English skills
Abhishek Chandel, Abhinav Parate, Maymon Madathingal, Himanshu Pant, Nitendra
Rajput, Shajith Ikbal, Om Deshmukh, Ashish Verma (IBM, India)
At IBM's India Research Lab we have developed an interactive web-enabled
tool, called Sensei, to evaluate various parameters of spoken English skills
including articulation of phones, lexical stress patterns of syllables
within words, and spoken grammar. A paper on Sensei has been accepted at
this conference [1]. The demonstration will provide a live experience of
all three main modules of Sensei: the user interaction and interface module,
the speech processing module and the content and configuration management
module. The user interaction and interface module delivers the audio data
from the server to the user’s web browser, transfers the audio recordings
from the user to the server and guides the user through the various stages
of the tool. The speech processing module uses the speech recognition engine
to recognize the spoken utterance and to obtain phonetic alignments and
confidence scores. The phonetic alignments, together with a phone-to-syllable
mapping, are used to compute syllable-level prosodic features that classify
the syllables of the spoken word as correct or incorrect. For articulation evaluation,
the speech processing module combines the confidence scores obtained during
the phonetic alignment to compute an articulation score for the spoken utterance.
The module also computes a combined score for the overall assessment of
the user. The content and configuration management module controls the nature
and the difficulty level of the tool by altering the database used for evaluating
the various parameters, changing the time allotted for each evaluation
type, changing the number of attempts allowed for recording the user’s
input, and so on. The demonstration will also give the audience an
opportunity to use the tool to evaluate their spoken English skills and to
receive scores on the individual parameters as well as a combined score.
We will also showcase our ongoing work on a learning component for Sensei
that points out the mistakes committed by the user and provides feedback
to improve the user's spoken English skills.
[1] A. Chandel et al., “Sensei: Spoken Language Assessment for Call
Center Agents”, [paper number 1098]
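
As one plausible reading of how the per-phone confidence scores might be
combined into an articulation score (the abstract does not specify the
combination; a duration-weighted average is assumed here):

    def articulation_score(aligned_phones):
        """Combine per-phone ASR confidence scores into one utterance-level
        articulation score. A duration-weighted average is one plausible
        (assumed) combination."""
        total = sum(dur for _, dur in aligned_phones)
        return sum(conf * dur for conf, dur in aligned_phones) / total

    # (confidence, duration in frames) per phone from the forced alignment;
    # the numbers are purely illustrative
    print(round(articulation_score([(0.92, 8), (0.41, 5), (0.85, 12)]), 3))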
- A Speech-enabled
Card Game for Language Learning
Ian McGraw and Stephanie Seneff (MIT, USA)
Acquiring a second language as an adult is a monumental task regardless
of circumstance. Compounding the difficulty is the fact that many language
learners do not have the opportunity to speak in meaningful contexts outside
of the classroom. In recent years, MIT’s Spoken Language Systems (SLS)
group has been developing dialogue systems for language learners to address
this need. These systems provide a non-threatening environment for students
to practice speaking in a one-on-one setting. One drawback to such dialogue
systems, however, is that they inherently must cover a small domain to ensure
that the natural language processing (NLP) and automatic speech recognition
(ASR) components are provided with enough constraints to perform robustly.
The second language acquisition theory community, however, consistently
presents a case for learner-centered classrooms, in effect giving students
significant freedom in choosing course content. This necessitates
one of two solutions: 1) develop many different narrow-domain dialogue systems
from which students can choose, or 2) give the user the ability to personalize
the content loaded into a single system. At ASRU 2007, we would like to
demonstrate a speech-enabled card game that takes a step in the direction
of the second solution presented above. Using the same technology that
underlies such popular online applications as Gmail, we have constructed a website
where a student of Mandarin Chinese can easily create and save a deck of
image-based flash-cards from within an ordinary Internet browser. Subsequently,
the student can load their flash-cards into “Word-War”, a simple card
game environment built directly into the site. A speech-based system is
automatically configured that allows the user to manipulate the cards
entirely through spoken commands uttered in Mandarin. “Word-War” makes
two additional contributions to the community interested in ASR for second
language learners. First, our framework supports the ability to recognize,
understand, and react to partial utterances in real-time. Using this feature,
“Word-War” provides immediate visual feedback, allowing users
to simultaneously speak and check that their utterances are understood,
while side-stepping the issues associated with verbal barge-in. Second,
an engaging, multi-player mode connects two students via the web in a head-to-head,
vocabulary-building “battle of words”. These three features,
personalization, real-time visual feedback, and multi-party interaction,
are combined into a single, web-based application that we believe presents
a particularly compelling use of speech technology in education.
[1] Short video demonstration available at http://people.csail.mit.edu/imcgraw/cardsdemo
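
A minimal sketch of how partial-utterance feedback of this kind might be
handled (the card structure and matching rule are hypothetical, not the
actual Word-War implementation):

    def on_partial_hypothesis(partial, cards):
        """Hypothetical handler for an incremental ASR result: highlight
        every card whose Mandarin label occurs in the partial hypothesis,
        giving the learner immediate visual confirmation while they are
        still speaking."""
        for card in cards:
            card["highlight"] = card["label"] in partial
        return [c["label"] for c in cards if c["highlight"]]

    cards = [{"label": "苹果", "highlight": False},   # "apple"
             {"label": "马", "highlight": False}]     # "horse"
    print(on_partial_hypothesis("把苹果放在", cards))  # ['苹果']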
- Pseudo-morpheme
and Confusion Network based Korean-English Statistical Spoken Language Translation
System
Donghyeon Lee, Jonghoon Lee, Gary Geunbae Lee (POSTECH, Korea)
In this demonstration, we present POSSLT (POSTECH Spoken Language
Translation), a Korean-English statistical spoken language translation (SLT)
system based on pseudo-morpheme units and confusion networks (CNs). As in
most other SLT systems, automatic speech recognition (ASR) and machine
translation (MT) are coupled in a cascading manner. We use a confusion
network based approach to couple ASR and MT; it yields better translation
quality and faster decoding than an N-best approach. In Korean ASR and SMT,
the choice of processing unit affects performance, and the pseudo-morpheme
is the best choice of unit for Korean-English SLT. The models used in the
SLT system are trained on a travel-domain conversational corpus.
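
To make the confusion network coupling concrete, the sketch below shows the
CN data structure and the 1-best collapse that the CN-based decoder
generalizes (the Korean words and posteriors are invented):

    # One slot per word position; each slot holds (word, posterior) arcs,
    # with "*eps*" marking a possible deletion. Values are illustrative.
    cn = [[("나는", 0.7), ("내가", 0.3)],
          [("학교", 0.6), ("학구", 0.3), ("*eps*", 0.1)],
          [("간다", 0.9), ("*eps*", 0.1)]]

    def one_best(cn):
        """Collapse the CN to its single best path. A CN-coupled SLT decoder
        instead keeps all arcs and lets the translation model choose among
        them, weighted by the ASR posteriors."""
        picks = [max(slot, key=lambda arc: arc[1])[0] for slot in cn]
        return [w for w in picks if w != "*eps*"]

    print(one_best(cn))  # ['나는', '학교', '간다'] ("I go to school")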
- Agile Development
Framework for Voice-enabled Web Application
Masahiro Araki (Kyoto Inst. Tech., Japan)
This demo shows Vrails, our very rapid prototyping system for voice-enabled
Web applications. It is based on Grails, one of the Rails family of Web
application frameworks. Rails frameworks follow the MVC
(Model-View-Controller) model for interactive system development, which
clearly separates the application logic (model) from the user interface
(view), mediated by a controller. In contrast to ordinary state-based
prototyping tools for spoken dialogue systems, such as the CSLU Toolkit,
Vrails starts from a definition of the data structure and then generates
all remaining components automatically. The controller and model parts are
generated automatically following the ‘Convention over Configuration’
strategy, which means that basic operations such as creating, reading,
updating, and deleting data come ready-made, and the burdensome definition
of bindings between scripting-language objects and database records can be
omitted. Our contributions to this Rails framework are (1) automatically
adding a voice interaction part to the view files (as XHTML+Voice), (2)
generating grammar definitions from the data definition, and (3) adding
more system-directed interaction patterns so the framework suits mobile devices.
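
A toy illustration of the ‘start from the data definition, generate the
rest’ idea (the output format is invented; the real system emits
XHTML+Voice views and grammar files):

    from dataclasses import dataclass, fields

    @dataclass
    class Book:            # the developer writes only this data definition
        title: str = ""
        author: str = ""

    def scaffold_grammar(model):
        """Derive a trivial slot grammar from the data definition, mimicking
        Vrails' 'generate grammar from data' step."""
        return {"rule": model.__name__.lower(),
                "slots": [f.name for f in fields(model)]}

    print(scaffold_grammar(Book))
    # {'rule': 'book', 'slots': ['title', 'author']}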
- Quizmaster
Mushrooms: “Who is this” Quiz Dialogue System
Yasuhiro Minami, Minako Sawaki, Ryuichiro Higashinaka, Kohji Dohsaka, Takeshi
Yamada,Tatsushi Matsubayashi, Hideki Isozaki, and Eisaku Maeda (NTT, Japan)
Our new research project, called “ambient intelligence,” concentrates
on the creation of new lifestyles through research on communication science
and intelligence integration. It is premised on the creation of such virtual
communication partners as fairies and goblins that can serve constantly
at our side. We call these virtual communication partners mushrooms. To
show the essence of ambient intelligence, we demonstrate a multimodal system:
a quizmaster mushroom. The purpose of the quizmaster mushroom is to transmit
knowledge from the system to users while they play a quiz game with the
system. The system conducts a “who is this” quiz about people selected
from the Internet. It works in real time using speech,
dialogue, and vision technologies [1].
[1] Y. Minami et al., “The World of Mushrooms: Human-Computer Interaction
Prototype Systems for Ambient Intelligence,” Proc. ICMI 2007, Nagoya, 2007
(to appear).
- Handheld Multi-lingual
Speech-to-Speech Translation System
Tohru Shimizu, Yutaka Ashikari, Eiichiro Sumita, Satoshi Nakamura (ATR, Japan)
In this demo, we introduce recent progress on the NICT-ATR speech-to-speech
translation system. Corpus-based approaches to recognition, translation,
and synthesis enable coverage of a wide variety of topics and portability
to other languages. In this system, the basic component modules for
Japanese, English, and Chinese are implemented on the terminal, and the
system also has an interface for accessing other speech-to-speech
translation resources (e.g., component modules for other language pairs)
located on the Internet. The system is organized around a module manager
that has access to speech recognition (ASR), machine translation (MT),
speech synthesis (SS), and user interface (UI) modules. The module manager
controls the flow of speech data, recognized or translated text, and system
messages between the component modules. This architecture and its
event-based processing make it easy to extend the configuration to new
source and target languages, tasks, and domains. To realize connections
between internal and external speech-to-speech translation resources
(e.g., ASR, MT, and SS servers for other languages or language pairs) on
the Internet, we defined a first draft of the Speech Translation Markup
Language (STML) and implemented Web services using it. The speech-to-speech
translation system is designed for use with mobile terminals; the PC
measures W150 mm x D32 mm x H95 mm. For use in noisy environments, a
unidirectional microphone is employed. Speech-to-speech translation can be
performed for any combination of Japanese, English, and Chinese. Since the
entire speech-to-speech translation function is implemented in a single
terminal, the system provides a real-time, location-free speech-to-speech
translation service.
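
A minimal sketch of the module-manager cascade (the module interfaces below
are assumptions for illustration, not the STML API):

    class ModuleManager:
        """Routes speech data, recognized text, and translated text between
        the ASR, MT, and SS modules, as in the cascade described above."""
        def __init__(self, asr, mt, ss):
            self.asr, self.mt, self.ss = asr, mt, ss

        def translate_speech(self, audio, src, tgt):
            text = self.asr(audio, src)           # speech -> source text
            translated = self.mt(text, src, tgt)  # source -> target text
            return self.ss(translated, tgt)       # target text -> speech

    # Toy stand-ins for the real modules:
    mgr = ModuleManager(asr=lambda a, l: "こんにちは",
                        mt=lambda t, s, g: "Hello",
                        ss=lambda t, l: f"<{l} audio: {t}>")
    print(mgr.translate_speech(b"\x00...", "ja", "en"))  # <en audio: Hello>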
- SGStudio Web
Application
Ye-Yi Wang (Microsoft, USA)
Speech applications need grammars for language modeling and spoken language
understanding. In industrial applications, context-free grammars are often
used. The W3C has recommended the Speech Recognition Grammar Specification
(SRGS) as a grammar standard, and it is supported by many speech
recognizers. However, creating a customized grammar in SRGS is still a
challenging task for many speech application developers. They have to
become familiar with the grammar specification, anticipate the possible
expressions that users may use to refer to a concept, and script the
semantic interpretation tags that map users’ utterances to the
corresponding canonical semantic representations. Because of these
difficulties, many developers choose to use a generic library grammar
instead of creating a customized one, which leads to high perplexity and
hence high recognition error rates. SGStudioWA (Semantic Grammar Studio) is
a web application that helps speech application developers rapidly create a
semantic grammar in SRGS customized for their applications. It takes as
input high-level specifications, such as regular expressions for
alphanumeric concepts or cardinal/ordinal numbers in a range, and
automatically generates grammars with appropriate semantic interpretations.
In addition, it can build recognition grammars from user-provided examples.
The example-based grammars are robust enough to cover unseen expressions.
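
As a sketch of the kind of output such a tool produces, the following
generates a small SRGS grammar with semantic tags for cardinals in a range
(the number-to-words table is a toy; the real tool's expansions and
semantic tags are richer):

    WORDS = {1: "one", 2: "two", 3: "three", 4: "four", 5: "five"}

    def srgs_cardinal(lo, hi, rule_id="num"):
        """Emit an SRGS XML grammar accepting the cardinals lo..hi, tagging
        each item with its canonical value."""
        items = "\n".join(
            f'      <item>{WORDS[n]}<tag>out={n};</tag></item>'
            for n in range(lo, hi + 1))
        return ('<grammar xmlns="http://www.w3.org/2001/06/grammar" '
                f'version="1.0" root="{rule_id}">\n'
                f'  <rule id="{rule_id}">\n    <one-of>\n{items}\n'
                '    </one-of>\n  </rule>\n</grammar>')

    print(srgs_cardinal(1, 3))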
- Games for Eliciting
Human-Transcribed Data for Automated Directory Assistance
Tim Paek, Yun-Cheng Ju, Christopher Meek (Microsoft, USA)
Automated Directory Assistance (ADA) allows users to request telephone or
address information for residential and business listings using speech
recognition [1]. Because callers usually do not know the exact name of a
listing, ADA systems require transcriptions of alternative phrasings of
directory listings as training data for building language models. Unfortunately,
such data can be very costly and time-consuming to acquire. Since the introduction
of The ESP Game [2], researchers have sought to use games to tackle machine
learning problems such as image classification and paraphrasing journalistic
sentences for machine translation.
In this demo, we introduce two computer games, one text-based (People Watcher
[3]) and the other telephony-based (Marketeur), that elicit transcribed,
alternative user phrasings for directory listings while at the same time
entertaining players. Both games are framed as marketing games that test
people’s ability to “spot social trends” by having them
identify who they think would be likely customers of various businesses.
In this demo, we will describe how these two games work, the user interface
design, the technical challenges, and the potential applications of the
data collected, and we will summarize how data collected from these games
has so far helped improve performance.
[1] Levin, E. & Mane, A.M. (2005). “Voice User Interface Design
for Automated Directory Assistance”, in Proc. Interspeech, pp. 2509-2512.
[2] von Ahn, L. & Dabbish, L. (2004). “Labeling Images with a
Computer Game”, in Proc. CHI, pp. 319-326.
[3] Paek, T., Ju, Y.-C. & Meek, C. (2007). “People Watcher:
A Game for Eliciting Human-Transcribed Data for Automated Directory Assistance”,
in Proc. Interspeech.