Mon-Ses2-P3:
Special-purpose speech applications

This is the final program for this session. For oral sessions, the timing on the left is the current presentation order, but this may still change, so please check at the conference itself.
Time: Monday 13:30 Place: International Conference Room C Type: Poster
Chair: Su-Youn Yoon
#1Evaluation of a Silent Speech Interface Based on Magnetic Sensing
Robin Hofe (Department of Computer Science, University of Sheffield, UK)
Stephen R. Ell (Department of Engineering, University of Hull, UK)
Michael J. Fagan (Department of Engineering, University of Hull, UK)
James M. Gilbert (Department of Engineering, University of Hull, UK)
Phil D. Green (Department of Computer Science, University of Sheffield, UK)
Roger K. Moore (Department of Computer Science, University of Sheffield, UK)
Sergey I. Rybchenko (Department of Engineering, University of Hull, UK)
This paper reports on isolated word recognition experiments using a novel silent speech interface. The interface consists of magnetic pellets that are fixed to relevant speech articulators, and a set of magnetic field sensors that measure changes in the overall magnetic field created by these pellets during speech. The reported experiments demonstrate the effectiveness of this technique and show the suitability of the system, even at its early stages of development, for small vocabulary speech recognition.
#2Advanced Speech Communication System for Deaf People
Rubén San-Segundo (Speech Technology Group at Universidad Politécnica de Madrid)
Verónica López (Speech Technology Group at Universidad Politécnica de Madrid)
Raquel Martín (Speech Technology Group at Universidad Politécnica de Madrid)
Syaheerah Lufti (Speech Technology Group at Universidad Politécnica de Madrid)
Javier Ferreiros (Speech Technology Group at Universidad Politécnica de Madrid)
Ricardo Cordoba (Speech Technology Group at Universidad Politécnica de Madrid)
José Manuel Pardo (Speech Technology Group at Universidad Politécnica de Madrid)
This paper describes the development and field evaluation of an Advanced Speech Communication System for Deaf People. The system has two modules. The first one is a Spanish into Spanish Sign Language (LSE: Lengua de Signos Española) translation module made up of a speech recognizer, a natural language translator (for converting a word sequence into a sequence of signs), and a 3D avatar animation module (for playing back the signs). The second module is a Spoken Spanish generator from sign-writing composed of a visual interface (for specifying a sign sequence), a language translator (for generating a Spanish sentence), and finally, a text to speech converter. The system integrates three translation technologies: an example-based strategy, a rule-based translation method and a statistical translator. The field evaluation was carried out in the Local Traffic Office in the city of Toledo (Spain) involving real government employees and deaf people.
#3Unsupervised Acoustic Model Adaptation for Multi-Origin Non Native ASR
Sethserey Sam (Laboratoire d'Informatique de Grenoble (LIG)-France / MICA Research Center-Vietnam)
Eric Castelli (MICA Research Center-Vietnam)
Laurent Besacier (Laboratoire d'Informatique de Grenoble (LIG)-France)
To date, the performance of speech and language recognition systems is poor on non-native speech. The challenge for non-native speech recognition is to maximize the accuracy of a speech recognition system when only a small amount of non-native data is available. We report on acoustic model adaptation for improving the recognition of non-native speech in English, French and Vietnamese, spoken by speakers of different origins. Using online unsupervised acoustic model adaptation without any additional data for adaptation purposes, we investigate how an unsupervised multilingual acoustic model interpolation method can help to improve the phone accuracy of the system. An absolute improvement of 7% in phone level accuracy (PLA) obtained in the experiments demonstrates the feasibility of the method.
#4Speech-Based Automated Cognitive Status Assessment
Dilek Hakkani-Tür (International Computer Science Institute (ICSI))
Dimitra Vergyri (SRI International, Speech Technology and Research Lab)
Gokhan Tur (SRI International, Speech Technology and Research Lab)
Verbal interviews performed by trained clinicians are a common form of assessment to measure cognitive decline. The aim of this paper is to study the usability of automated methods for evaluating verbal cognitive status assessment tests for the elderly. If reliable, such methods for cognitive assessment can be used for frequent, non-intrusive, low-cost screenings and provide objective and longitudinal cognitive status monitoring data that can complement regular clinical visits and would be useful for early detection of conditions associated with language and communication impairments. This study focuses on two types of tests: a story-recall test, used for memory and language functioning assessment, and a picture description test, used to assess the information content in speech. A data collection was designed for this study involving recordings of about 100 people, mostly over 70 years old, performing these tests. The speech samples were manually transcribed and annotated with semantic units in order to obtain manual evaluation scores. We explore the use of automatic speech recognition and language processing methods to derive objective, automatically extracted metrics of cognitive status that are highly correlated with the manual scores. We use recall- and precision-based metrics computed over semantic content units associated with the tests. Our experiments show high correlation between manually obtained scores and the automatic metrics obtained using either manual or automatic speech transcriptions.
#5Speech Recognition with a Seamlessly Updated Language Model for Real-Time Closed-Captioning
Toru Imai (NHK Science & Technology Research Laboratories)
Shinichi Homma (NHK Science & Technology Research Laboratories)
Akio Kobayashi (NHK Science & Technology Research Laboratories)
Takahiro Oku (NHK Science & Technology Research Laboratories)
Shoei Sato (NHK Science & Technology Research Laboratories)
For online applications such as real-time closed-captioning, it is desirable to update the language model of a speech recognizer consistently and seamlessly, without stopping the recognizer. This paper proposes a novel speech recognition system that enables the model to be updated at any time, even while it is running. It can run a second decoder with the latest model in parallel; which decoder has priority is controlled at a non-speech portion by an additional job process, which sends acoustic features only to the active target decoder with the latest model and sends recognized words to the backend manual error correction for closed-captioning. The system seamlessly updates the model and ensures endless speech recognition with the latest model at any time. Our new practical real-time closed-captioning system reduced word errors by two thirds with the proposed language model update mechanism in the speech recognition and captioning experiments for Japanese broadcast news programs.
#6The comparison between the deletion-based methods and the mixing-based methods for audio CAPTCHA systems
Takuya Nishimoto (Graduate School of Information Science and Technology, the University of Tokyo)
Takayuki Watanabe (Department of Communication, Division of Human Science, School of Arts and Sciences, Tokyo Woman's Christian University)
Audio CAPTCHA systems, which distinguish between software agents and human beings, are especially important for persons with visual disabilities. The popular approach is based on mixing-based methods (MBM), which use mixed sounds of target speech and noise. We have proposed a deletion-based method (DBM), which uses the phonemic restoration effect. Our approach can control the difficulty of tasks simply through the masking ratio. In this paper, we propose a design principle for CAPTCHA, according to which tasks should be designed so that there is a large difference in performance between machines and human beings. We also show experimental results that support the following hypotheses: (1) using only MBM, the degree of task difficulty cannot be controlled easily; (2) using DBM, the degree of task difficulty and the safeness of the CAPTCHA system can be controlled easily.
#7Comparing mono- and multilingual acoustic seed models for a low e-resourced language: a case-study of Luxembourgish
Martine Adda-Decker (LIMSI-CNRS)
Lori Lamel (LIMSI-CNRS)
Natalie Snoeren (LIMSI-CNRS)
Luxembourgish is embedded in a multilingual context on the divide between Romance and Germanic cultures and has often been viewed as one of Europe's under-resourced languages. We focus on the acoustic modeling of Luxembourgish. By taking advantage of monolingual acoustic seeds selected from German, French or English model sets via IPA symbol correspondences, we investigated whether Luxembourgish spoken words were globally better represented by one of these languages. Although speech in Luxembourgish is frequently interspersed with French words, forced alignments on these data showed a clear preference for Germanic acoustic models with only a limited usage of French. German models provided the best match with 54% of the data, 35% for English and only 11% for French models. A further set of multilingual acoustic models, estimated from the pooled German, French, and English audio data, captured between 27% and 48% of the data depending on conditions.
#8Manipulating Tracheoesophageal Speech
R.J.J.H. van Son (Netherlands Cancer Institute/ACLC)
Irene Jacobi (Netherlands Cancer Institute)
Frans J. M. Hilgers (Netherlands Cancer Institute)
Speech therapy aiming at improving voice quality and speech intelligibility is often hampered by the lack of knowledge of the underlying deficits. One way to help speech therapists treating patients would be to supply synthetic bench-marks for pathological speech. In a listening experiment testing perceived intelligibility, three types of manipulations of tracheoesophageal speech were evaluated by experienced speech therapists. It was found that modeling the intensity contour of the voice source signal improved speech quality over plain analysis-synthesis. Replacing the voicing source with fully synthetic source periods decreased the perceived intelligibility markedly. Making the source fully periodic with a regular pitch had no effect on perceived intelligibility. Low quality speech benefited more from manipulations, or deteriorated less, than high quality speech.
#9Towards mixed language speech recognition systems
David Imseng (Idiap Research Institute, Martigny, Switzerland)
Hervé Bourlard (Idiap Research Institute, Martigny, Switzerland)
Mathew Magimai Doss (Idiap Research Institute, Martigny, Switzerland)
Multilingual speech recognition involves numerous research challenges, including common phoneme sets, adaptation on limited amounts of training data, as well as mixed language recognition (common in many countries, like Switzerland). In this latter case, it is not even possible to assume that one knows in advance the language being spoken. This is the context and motivation of the present work. We investigate how current state-of-the-art speech recognition systems can be exploited in multilingual environments, where the language (from an assumed set of 5 possible languages, in our case) is not known a priori during recognition. We combine monolingual systems and extensively develop and compare different features and acoustic models. On SpeechDat(II) datasets, and in the context of isolated words, we show that it is actually possible to approach the performance of monolingual systems even if the identity of the spoken language is not known a priori.
#10Voice Search for Development
Etienne Barnard (Human Language Technologies Research Group, Meraka Institute, CSIR)
Johan Schalkwyk (Google Research)
Charl van Heerden (Human Language Technologies Research Group, Meraka Institute, CSIR)
Pedro J Moreno (Google Research)
In light of the serious problems with both illiteracy and information access in the developing world, there is a widespread belief that speech technology can play a significant role in improving the quality of life of developing-world citizens. We review the main reasons why this impact has not occurred to date, and propose that voice-search systems may be a useful tool in delivering on the original promise. The challenges that must be addressed to realize this vision are analyzed, and initial experimental results in developing voice search for two languages of South Africa (Zulu and Afrikaans) are summarized.
#11Cross-cultural Investigation of Prosody in Verbal Feedback in Interactional Rapport
Gina-Anne Levow (University of Chicago)
Susan Duncan (University of Chicago)
Edward King (University of Chicago)
Aspects of speech and non-verbal behavior allow conversational partners to establish and maintain rapport by signaling engagement or endorsement. In the verbal channel, these factors encompass requests for and production of vocal feedback, as well as lexical and grammatical mirroring. However, these cues are often subtle and culture-specific. Here, we present a preliminary investigation of the differences in elicitation and provision of vocal feedback across three diverse language/cultural groups: American English, Gulf/Iraqi Arabic, and Mexican Spanish. Based on a fully-transcribed and aligned sub-corpus of 80 interactions, we identify fundamental contrasts in production of vocal feedback. We identify dramatic differences in the rates of listener verbal feedback across the groups. However, we find both similarities and differences in the use of prosodic cues across these groups. These differences will inform the development of culturally-sensitive conversational agents.
#12Using Oriented Optical Flow Histograms for Multimodal Speaker Diarization
Mary Tai Knox (International Computer Science Institute)
Gerald Friedland (International Computer Science Institute)
Speaker diarization is the task of partitioning an input stream into speaker-homogeneous regions, or in other words, determining “who spoke when.” While approaches to this problem have traditionally relied entirely on the audio stream, the availability of accompanying video streams in recent diarization corpora has prompted the study of methods based on multimodal audio-visual features. In this work, we propose the use of robust video features based on oriented optical flow histograms. Using the state-of-the-art ICSI diarization system, we show that, when combined with standard audio features, these features improve the diarization error rate by 14% over an audio-only baseline.
#13Towards an ASR-free objective analysis of pathological speech
Catherine Middag (ELIS, Ghent University, Belgium)
Yvan Saeys (VIB, Ghent University, Belgium)
Jean-Pierre Martens (ELIS, Ghent University, Belgium)
Nowadays, intelligibility is a popular measure of the severity of the articulatory deficiencies of a pathological speaker. Usually, this measure is obtained by means of a perceptual test, consisting of nonconventional and/or nonconnected words. In previous work, we developed a system incorporating two Automatic Speech Recognizers (ASR) that could fairly accurately estimate phoneme intelligibility (PI). In the present paper, we propose a novel method that aims to assess the running speech intelligibility (RSI) as a more relevant indicator of the communication efficiency of a speaker in a natural setting. The proposed method computes a phonological characterization of the speaker by means of a statistical analysis of frame-level phonological features. Importantly, this analysis requires no knowledge of what the speaker was supposed to say. The new characterization is demonstrated to predict PI and to provide valuable information about the nature and severity of the pathology.
