Mon-Ses2-P2:
ASR: Search, Decoding and Confidence Measures I

This is the final program for this session. For oral sessions, the timing on the left is the current presentation order, but this may still change, so please check at the conference itself.
Time: Monday 13:30 Place: International Conference Room B Type: Poster
Chair: Takaaki Hori
#1 Phone Mismatch Penalty Matrices for Two-Stage Keyword Spotting Via Multi-Pass Phone Recognizer
Han Chang Woo (Seoul National University)
Kang Shin Jae (Seoul National University)
Lee Chul Min (Seoul National University)
Kim Nam Soo (Seoul National University)
In this paper, we propose a novel approach to estimating three types of phone mismatch penalty matrices for two-stage keyword spotting. Given the output of a phone recognizer, a specific keyword is detected by text matching against the phone sequence of that keyword using the proposed phone mismatch penalty matrices. The penalty matrices, which are estimated from the training data through deliberate error generation, account for substitution, insertion and deletion errors. In comparative experiments on a Korean continuous speech recognition task, the proposed approach has shown a significant improvement.
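The matching step described in this abstract can be pictured as a weighted edit distance between the keyword's phone sequence and the recognized phone sequence. The following is only a minimal sketch under stated assumptions, not the authors' method: the function name `match_cost`, the toy phone set, and all penalty values are hypothetical, and the insertion and deletion penalties are simplified to per-phone tables rather than full matrices.

```python
# Toy penalty tables (hypothetical values, not estimated from any data).
PHONES = ["k", "a", "e", "t"]
SUB = {p: {q: 0.0 if p == q else 2.0 for q in PHONES} for p in PHONES}
SUB["a"]["e"] = 0.5              # assume /a/ is often misrecognized as /e/
INS = {p: 1.0 for p in PHONES}   # penalty for a spurious recognized phone
DEL = {p: 1.0 for p in PHONES}   # penalty for a missing keyword phone


def match_cost(keyword, recognized, sub, ins, dele):
    """Minimum total penalty for aligning recognized phones to keyword phones."""
    n, m = len(keyword), len(recognized)
    # cost[i][j]: best penalty for aligning keyword[:i] with recognized[:j]
    cost = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = cost[i - 1][0] + dele[keyword[i - 1]]
    for j in range(1, m + 1):
        cost[0][j] = cost[0][j - 1] + ins[recognized[j - 1]]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            k, r = keyword[i - 1], recognized[j - 1]
            cost[i][j] = min(
                cost[i - 1][j - 1] + sub[k][r],  # match or substitution
                cost[i - 1][j] + dele[k],        # keyword phone deleted
                cost[i][j - 1] + ins[r],         # extra phone inserted
            )
    return cost[n][m]


if __name__ == "__main__":
    print(match_cost(["k", "a", "t"], ["k", "e", "t"], SUB, INS, DEL))
```

In such a scheme a keyword would be reported when the cost falls below a threshold; the threshold and its tuning are outside what the abstract states.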
#2 English Spoken Term Detection in Multilingual Recordings
Petr Motlicek (Idiap Research Institute, Martigny, Switzerland)
Fabio Valente (Idiap Research Institute, Martigny, Switzerland)
Philip Garner (Idiap Research Institute, Martigny, Switzerland)
This paper investigates the automatic detection of English spoken terms in a multi-language scenario over real lecture recordings. Spoken Term Detection (STD) is based on an LVCSR system whose output is represented in the form of word lattices. The lattices are then used to search for the required terms. The processed lectures are mainly composed of English, French and Italian recordings, where the language can also change within one recording. Therefore, the English STD system uses an Out-Of-Language (OOL) detection module to filter out non-English input segments. OOL detection is evaluated w.r.t. various confidence measures estimated from word lattices. Experimental studies of OOL detection followed by English STD are performed on several hours of multilingual recordings. A significant improvement of OOL+STD over a stand-alone STD system is achieved (more than 50% relative in EER). Finally, an additional modality (text slides in the form of PowerPoint presentations) is exploited to improve STD.
#3 A Hybrid Approach to Robust Word Lattice Generation Via Acoustic-Based Word Detection
Icksang Han (Samsung Advanced Institute of Technology, Samsung Electronics Co., Ltd.)
Chiyoun Park (Samsung Advanced Institute of Technology, Samsung Electronics Co., Ltd.)
Jeongmi Cho (Samsung Advanced Institute of Technology, Samsung Electronics Co., Ltd.)
Jeongsu Kim (Samsung Advanced Institute of Technology, Samsung Electronics Co., Ltd.)
A large-vocabulary continuous speech recognition (LVCSR) system usually utilizes a language model in order to reduce the complexity of the algorithm. However, this constraint also produces side effects, including low accuracy on out-of-grammar sentences and error propagation from misrecognized words. In order to compensate for these side effects of the language model, this paper proposes a novel lattice generation method that adopts an idea from keyword detection. By combining the word candidates detected mainly from the acoustic aspect of the signal with the word lattice from the ordinary speech recognizer, a hybrid lattice is constructed. The hybrid lattice shows a 33% improvement in lattice accuracy at the same lattice density. In addition, it is observed that the proposed model is less sensitive to out-of-grammar sentences and to error propagation due to misrecognized words.
#4 Direct Observation of Pruning Errors (DOPE): A Search Analysis Tool
Volker Steinbiss (RWTH Aachen University)
Martin Sundermeyer (RWTH Aachen University)
Hermann Ney (RWTH Aachen University)
The search for the optimal word sequence can be performed efficiently even in a speech recognizer with a very large vocabulary and complex models. This is achieved using pruning methods with empirically chosen parameters and the willingness to accept a certain amount of pruning errors. Unsatisfyingly, the current state of the art does not detect such pruning errors directly but only their indirect consequences, which provides only a rough picture of what happens during search. With the tool Direct Observation of Pruning Errors (DOPE), described in this paper, pruning errors are detected on the state hypothesis level, which is a very fine level of granularity, several orders of magnitude finer than the sentence level. This allows much more exact analyses, including the analysis of pruning methods and of the effects of pruning parameters.
#5 Direct Construction of Compact Context-Dependency Transducers From Data
David Rybach (RWTH Aachen University, Germany)
Michael Riley (Google Inc., USA)
This paper describes a new method for building compact context-dependency transducers for finite-state transducer-based ASR decoders. Instead of the conventional phonetic decision-tree growing followed by FST compilation, this approach incorporates the phonetic context splitting directly into the transducer construction. The objective function of the split optimization is augmented with a regularization term that measures the number of transducer states introduced by a split. We give results on a large spoken-query task for various n-phone orders and other phonetic features that show this method can greatly reduce the size of the resulting context-dependency transducer with no significant impact on recognition accuracy. This permits using context sizes and features that might otherwise be unmanageable.
#6 Incremental composition of static decoding graphs with label pushing
Miroslav Novak (IBM)
We present new results achieved in the application of the incremental graph composition algorithm, in particular using the label pushing method to further reduce the final graph size. In our previous work we have shown that incremental composition is an efficient alternative to the conventional finite state transducer (FST) determinization-composition-minimization approach, with some limitations. One of the limitations was that the word labels must stay aligned with the actual word ends. We describe an updated version of the algorithm which allows us to push the word labels relative to the word ends to increase the effect of the minimization. The size of the resulting graph is now very close to that of graphs produced by the conventional FST approach with label pushing.
#7 A Novel Path Extension Framework Using Steady Segment Detection for Mandarin Speech Recognition
Zhanlei Yang (National Laboratory of Pattern Recognition (NLPR) Institute of Automation, Chinese Academy of Sciences, Beijing, China)
Wenju Liu (National Laboratory of Pattern Recognition (NLPR) Institute of Automation, Chinese Academy of Sciences, Beijing, China)
Frame-based decoders make little use of long-span temporal knowledge, while segment-based decoders are often burdened by complex computation. This paper proposes a novel decoding framework that integrates steady speech segment information into the path extension procedure. First, as the baseline decoding system, a dynamic lexicon-tree copy recognizer is developed, which aims to accelerate the popular frame-based recognizer HTK. Steady segments, where the spectrum is stable, are extracted using landmark detection, and the detection results are then provided to the subsequent decoding module. At the decoding stage, the traditional inter-HMM token spreading framework is modified using steady segment knowledge, based on the observation that a steady frame and an inter-HMM extension cannot coexist. Experiments conducted on Mandarin broadcast speech show relative reductions of 22.1% in character error rate and 5.24% in run time.
#8 On the relation of Bayes Risk, Word Error, and Word Posteriors in ASR
Ralf Schlueter (Lehrstuhl fuer Informatik 6 - Computer Science Department, RWTH Aachen University)
Markus Nussbaum-Thom (Lehrstuhl fuer Informatik 6 - Computer Science Department, RWTH Aachen University)
Hermann Ney (Lehrstuhl fuer Informatik 6 - Computer Science Department, RWTH Aachen University)
In automatic speech recognition, we are faced with a well-known inconsistency: Bayes decision rule is usually used to minimize sentence (word sequence) error, whereas in practice we want to minimize word error, which also is the usual evaluation measure. Recently, a number of speech recognition approaches to approximate Bayes decision rule with word error (Levenshtein/edit distance) cost were proposed. Nevertheless, experiments show that the decisions often remain the same and that the effect on the word error rate is limited, especially at low error rates. In this work, further analytic evidence for these observations is provided. A set of conditions is presented, for which Bayes decision rule with sentence and word error cost function leads to the same decisions. Furthermore, the case of word error cost is investigated and related to word posterior probabilities. The analytic results are verified experimentally on several large vocabulary speech recognition tasks.
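The contrast between the two cost functions can be illustrated on a toy N-best list. The sketch below is not taken from the paper: the function names and the posterior values are hypothetical, and the hypothesis space is restricted to a three-entry list. It compares the decision under sentence error cost (pick the most probable sentence) with the decision under word error cost (pick the sentence with the lowest expected Levenshtein distance).

```python
def edit_distance(a, b):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (x != y)))  # match or substitution
        prev = cur
    return prev[-1]


def map_decision(posteriors):
    """Bayes decision under sentence (0-1) error cost."""
    return max(posteriors, key=posteriors.get)


def min_word_error_decision(posteriors):
    """Bayes decision under Levenshtein cost, restricted to the N-best list."""
    def risk(hyp):
        return sum(p * edit_distance(hyp.split(), ref.split())
                   for ref, p in posteriors.items())
    return min(posteriors, key=risk)


# Hypothetical sentence posteriors over toy N-best lists.
flat = {"x y": 0.40, "a b": 0.35, "a c": 0.25}    # posterior mass is spread out
peaked = {"x y": 0.60, "a b": 0.25, "a c": 0.15}  # one hypothesis dominates

if __name__ == "__main__":
    print(map_decision(flat), "|", min_word_error_decision(flat))
    print(map_decision(peaked), "|", min_word_error_decision(peaked))
```

On the flat list the two rules disagree ("x y" versus "a b"), while on the peaked list both choose "x y", which is in line with the abstract's remark that the decisions often remain the same.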
#9 Time Conditioned Search in Automatic Speech Recognition Reconsidered
David Nolden (RWTH Aachen)
Hermann Ney (RWTH Aachen)
Ralf Schlueter (RWTH Aachen)
In this paper we re-investigate the time conditioned search (TCS) method in comparison to the well-known word conditioned search (WCS), and analyze its applicability to state-of-the-art large vocabulary continuous speech recognition tasks. In contrast to current standard approaches, time conditioned search offers theoretical advantages, particularly in combination with huge vocabularies and huge language models, but it is difficult to combine with across-word modelling, which has proven to be an important technique in automatic speech recognition. Our novel contributions for TCS are a pruning step during the recombination called Early Word End Pruning, an additional recombination technique called Context Recombination, the idea of a Startup Interval to reduce the number of started trees, and a mechanism to combine TCS with across-word modelling. We show that, with these techniques, TCS can outperform WCS on a current task.
#10 Efficient Data Selection for Speech Recognition Based on Prior Confidence Estimation Using Speech and Context Independent Models
Satoshi KOBASHIKAWA (NTT Cyber Space Laboratories, NTT Corporation)
Taichi ASAMI (NTT Cyber Space Laboratories, NTT Corporation)
Yoshikazu YAMAGUCHI (NTT Cyber Space Laboratories, NTT Corporation)
Hirokazu MASATAKI (NTT Cyber Space Laboratories, NTT Corporation)
Satoshi TAKAHASHI (NTT Cyber Space Laboratories, NTT Corporation)
This paper proposes an efficient data selection technique to identify well-recognized texts in massive volumes of speech data. Conventional confidence measure techniques can be used to obtain this accurate data, but they require speech recognition results to estimate confidence. Without a significant level of confidence, considerable computer resources are wasted since inaccurate recognition results are generated only to be rejected later. The technique proposed herein rapidly estimates the prior confidence based on just an acoustic likelihood calculation by using speech and context independent models before speech recognition processing; it then selectively recognizes data with high confidence. Simulations show that it matches the data selection performance of the conventional posterior confidence measure with less than 2% of the computation time.
#11 A Novel Confidence Measure Based on Marginalization of Jointly Estimated Error Cause Probabilities
Atsunori Ogawa (NTT Corporation)
Atsushi Nakamura (NTT Corporation)
We propose a novel confidence measure based on the marginalization of jointly estimated error cause probabilities. Conventional confidence measures directly score the reliability of recognition results. In contrast, our method first calculates joint confidence and error cause probabilities and then sums them with respect to the error cause patterns to obtain the marginal confidence probability. We show experimentally that the confidence estimation accuracy obtained with the proposed method is significantly improved compared with that obtained with the conventional confidence measure.
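The marginalization step can be sketched in a few lines. This is only an illustration of summing a joint distribution over error cause patterns; the cause labels, the probability values, and the function name `marginal_confidence` are hypothetical, and how the joint probabilities are estimated is not covered here.

```python
# Hypothetical joint probabilities P(label, cause) for one recognized word.
# The cause patterns "none", "noise" and "oov" are illustrative only.
JOINT = {
    ("correct", "none"): 0.55,
    ("correct", "noise"): 0.10,
    ("correct", "oov"): 0.05,
    ("incorrect", "none"): 0.05,
    ("incorrect", "noise"): 0.15,
    ("incorrect", "oov"): 0.10,
}


def marginal_confidence(joint):
    """Sum P(correct, cause) over all cause patterns to obtain P(correct)."""
    return sum(p for (label, _cause), p in joint.items() if label == "correct")


if __name__ == "__main__":
    print(round(marginal_confidence(JOINT), 2))
```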