| Time: | Wednesday 10:00 | Place: | International Conference Room D | Type: | Poster |
| Chair: | Bhiksha Raj | ||||
| #1 | An Empirical Comparison of the T3, Juicer, HDecode and Sphinx3 Decoders |
| (Tokyo Institute of Technology) (National Institute of Information and Communications Technology) (Tokyo Institute of Technology) | |
| In this paper we perform a cross-comparison of the T3 WFST decoder against three different speech recognition decoders on three separate tasks of variable difficulty. We show that the T3 decoder performs favorably against several established veterans in the field, including the Juicer WFST decoder, Sphinx3, and HDecode in terms of RTF versus Word Accuracy. In addition to comparing decoder performance, we evaluate both Sphinx and HTK acoustic models on a common footing inside T3, and show that the speed benefits that typically accompany the WFST approach increase with the size of the vocabulary and other input knowledge sources. In the case of T3, we also show that GPU acceleration can significantly extend these gains. | |
| #2 | Tracter: A lightweight Dataflow Framework |
| (Idiap Research Institute) (Idiap Research Institute) | |
| Tracter is introduced as a dataflow framework particularly useful for speech recognition. It is designed to work on-line in real-time as well as off-line, and is the feature extraction means for the Juicer transducer-based decoder. This paper places Tracter in context amongst the dataflow literature and other commercial and open source packages. Some design aspects and capabilities are discussed. Finally, a fairly large processing graph incorporating voice activity detection and feature extraction is presented as an example of Tracter's capabilities. | |
| #3 | Verifying Pronunciation Dictionaries using Conflict Analysis |
| (CSIR South Africa) (CSIR South Africa) | |
| We describe a new technique for automatically identifying errors in an electronic pronunciation dictionary which analyzes the source of conflicting patterns directly. We evaluate the effectiveness of this technique in two ways: we perform a controlled experiment using artificially corrupted data (allowing us to measure precision and recall exactly); and then apply the technique to a real-world pronunciation dictionary, demonstrating its effectiveness in practice. We also introduce a new freely available pronunciation resource (the RCRL Afrikaans Pronunciation Dictionary), the largest such dictionary that is currently available. | |
| #4 | Automatic Estimation of Transcription Accuracy and Difficulty |
| (MIT) (MIT) (MIT) | |
| Managing a large-scale speech transcription task with a team of human transcribers requires effective quality control and workload distribution. As it becomes easier and cheaper to collect massive audio corpora the problem is magnified. Relying on expert review or transcribing all speech multiple times is impractical. Furthermore, speech that is difficult to transcribe may be better handled by a more experienced transcriber or skipped entirely. We present a fully automatic system to address these issues. First, we use the system to estimate transcription accuracy from a single transcript and show that it correlates well with inter-transcriber agreement. Second, we use the system to estimate the transcription "difficulty" of a speech segment and show that it is strongly correlated with transcriber effort. This system can help a transcription manager determine when speech segments may require review, track transcriber performance, and efficiently manage the transcription process. | |
| #5 | Creating a semantic coherence dataset with non-expert annotators |
| (Language Technologies Institute, Carnegie Mellon University) (Language Technologies Institute, Carnegie Mellon University) (Language Technologies Institute, Carnegie Mellon University) | |
| We describe the creation of a linguistic plausibility dataset that contains annotated examples of language judged to be linguistically plausible, implausible, and everything in between. To create the dataset we randomly generate sentences and have them annotated by crowdsourcing on Amazon Mechanical Turk. Obtaining inter-annotator agreement is a difficult problem because linguistic plausibility is highly subjective. The annotations obtained depend, among other factors, on the manner in which annotators are questioned about the plausibility of sentences. We describe our experiments on posing a number of different questions to the annotators, in order to elicit the responses with greatest agreement, and present several methods for analyzing the resulting responses. The generated dataset and annotations are being made available to the public. | |
| #6 | Construction and Evaluations of an Annotated Chinese Conversational Corpus in Travel Domain for Language Model of Speech Recognition |
| (National Institute of Information and Communications Technology, Japan) (National Institute of Information and Communications Technology, Japan) (National Institute of Information and Communications Technology, Japan) (National Institute of Information and Communications Technology, Japan) | |
| In this paper we describe the development of an annotated Chinese conversational textual corpus for speech recognition in a speech-to-speech translation system in the travel domain. A total of 515,000 manually checked utterances were constructed, which provided a 3.5 million word Chinese corpus with word segmentation and part-of-speech tagging. The annotation is conducted with careful manual checking. The specifications on word segmentation and POS-tagging are designed to follow the main existing Chinese corpora that are widely accepted by researchers of Chinese natural language processing. Many particular features of conversational texts are also taken into account. With this corpus, parallel corpora are obtained together with the corresponding pairs of Japanese and English texts from which the Chinese was translated. To evaluate the corpus, the language models built from it are evaluated using perplexity and speech recognition accuracy as criteria. The perplexity of the Chinese language model is verified as having reached a reasonably low level. Recognition performance is also found to be comparable to that of the other two languages, even though the quantity of training data for Chinese is only half that of the other two languages. | |
| #7 | Building transcribed speech corpora quickly and cheaply for many languages |
| (Google) (Google) (Google) (Google) (Google) (Google) | |
| We present a system for quickly and cheaply building transcribed speech corpora containing utterances from many speakers in a variety of acoustic conditions. The system consists of a client application running on an Android mobile device with an intermittent Internet connection to a server. The client application collects demographic information about the speaker, fetches textual prompts from the server for the speaker to read, records the speaker’s voice, and uploads the audio and associated metadata to the server. The system has so far been used to collect over 3000 hours of transcribed audio in 17 languages around the world. | |
| #8 | The CHiME corpus: a resource and a challenge for Computational Hearing in Multisource Environments |
| (University of Sheffield) (University of Sheffield) (University of Sheffield) (University of Sheffield) | |
| We present a new corpus designed for noise-robust speech processing research, CHiME. Our goal was to produce material which is both natural (derived from reverberant domestic environments with many simultaneous and unpredictable sound sources) and controlled (providing an enumerated range of SNRs spanning 20 dB). The corpus includes around 40 hours of background recordings from a head and torso simulator positioned in a domestic setting, and a comprehensive set of binaural impulse responses collected in the same environment. These have been used to add target utterances from the Grid speech recognition corpus into the CHiME domestic setting. Data has been mixed in a manner that produces a controlled and yet natural range of SNRs over which speech separation, enhancement and recognition algorithms can be evaluated. The paper motivates the design of the corpus, and describes the collection and post-processing of the data. We also present a set of baseline recognition results. | |
| #9 | Developing A Chinese L2 Speech Database of Japanese Learners With Narrow-Phonetic Labels For Computer Assisted Pronunciation Training |
| (Center of Studies of Chinese as a Second Language, Beijing Language and Culture University, P. R. China) (Center of Studies of Chinese as a Second Language, Beijing Language and Culture University, P. R. China) (Center of Studies of Chinese as a Second Language, College of Information Science, Beijing Language and Culture University, P. R. China) (Institute of Linguistics, Chinese Academy of Social Sciences) | |
| For the purpose of developing Computer Assisted Pronunciation Training (CAPT) technology with more informative feedback, we propose to use a set of narrow-phonetic labels to annotate a Chinese L2 speech database of Japanese learners. The labels include the basic units of “Initials” and “Finals” for Chinese phonemes, and diacritics for erroneous articulation tendencies. Pilot investigations were made on the annotation consistency of two sets of phonetic transcriptions in 17 speakers’ data. The results indicate that the consistency is moderately good, suggesting that the annotation procedure is practical, though there is room for further improvement. | |
| #10 | How Children Acquire Situation Understanding Skills?: A Developmental Analysis Utilizing Multimodal Speech Behavior Corpus |
| (Shizuoka University) (Shizuoka University) (Shizuoka University) (Shizuoka University) | |
| We have developed a multimodal speech behavior corpus which includes metadata annotated from various viewpoints, such as utterances, actions, emotions and intention, for analyzing behavioral factors of thinking processes from various perspectives in everyday life. Utilizing the corpus, we analyzed child development of situation understanding skills, focusing on "attention-catching", which serves as a signal when communicating with other people. We formulated a hypothesis of the developmental process that there is a connection between physical expression skills and mental conditions such as utterances, gestures and the attention ability. The analysis results showed that the situation understanding skills follow a similar course of development, namely a change from object-centric to person-centric, although the age at which the developmental change occurs differs. Furthermore, the analysis results provided us with a more in-depth construction of the corpus. | |
| #11 | The Influence of Expertise and Efficiency on Modality Selection Strategies and Perceived Mental Effort |
| (Deutsche Telekom Laboratories, Quality & Usability Lab, TU Berlin, Germany) (Research training group prometei, TU Berlin, Germany) (Deutsche Telekom Laboratories, Quality & Usability Lab, TU Berlin, Germany) (Deutsche Telekom Laboratories, Quality & Usability Lab, TU Berlin, Germany) (Deutsche Telekom Laboratories, Quality & Usability Lab, TU Berlin, Germany) | |
| This paper describes a user study investigating the influence of expertise and efficiency on modality selection (speech vs. virtual keyboard) and perceived mental effort. Efficiency was varied in terms of interaction steps. The goal was to investigate whether the number of necessary interaction steps determines the preference for a specific modality. It is shown that the threshold for changing the modality selection strategy is three interaction steps for experts and four for novices. | |
| #12 | Parameters Describing Multimodal Interaction - Definitions and Three Usage Scenarios |
| (Quality and Usability Lab, Technische Universität Berlin, Germany) (Quality and Usability Lab, Technische Universität Berlin, Germany) (Quality and Usability Lab, Technische Universität Berlin, Germany) | |
| While multimodal systems are an active research field, there is no agreed-upon set of multimodal interaction parameters that would make it possible to quantify the performance of such systems and their underlying modules, and that would therefore be necessary for a systematic evaluation. In this paper we propose an extension to established parameters describing the interaction with spoken dialog systems [Möller 2005] so that they can be used for multimodal systems. Focussing on the evaluation of a multimodal system, three usage scenarios for these parameters are given. | |
| #13 | Repair Strategies on Trial: Which Error Recovery Do Users Like Best? |
| (University of Ulm) (University of Ulm) (University of Ulm) (University of Ulm) | |
| Extensive research about recovery strategies for misunderstandings and non-understandings within the context of spoken dialogue systems (SDS) has been undertaken in the past and is still going on. Many scientists focus on optimizing the recovery rate using various strategies. However, how different strategies relate to user satisfaction, and how confused users get with simple strategies such as a reprompt, has not yet been sufficiently explored. We carried out an empirical analysis with some of the most promising strategies. In addition to the two common strategies, help and reprompt, we also evaluated an adapted version of the promising MoveOn strategy. We found that the reactions to our different mockup dialogues vary considerably, especially between computer experts and novices. |