| Time: | Wednesday 13:30 | Place: | Hall A/B | Type: | Oral |
| Chair: | Michael Riley | ||||
| 13:30 | CRF-based Combination of Contextual Features to Improve A Posteriori Word-level Confidence Measures |
| (IRISA/INRIA Rennes, France) (University of Rennes 2/IRISA Rennes, France) (IRISA/INSA Rennes, France) (IRISA/CNRS Rennes, France) (IRISA/INRIA Rennes, France) | |
| This paper addresses the reliability of the confidence measures provided by automatic speech recognition systems for use in various spoken language processing applications. We propose a method based on conditional random fields that combines contextual features to improve word-level confidence measures. The method combines various knowledge sources (acoustic, lexical, linguistic, phonetic and morphosyntactic) to enhance confidence measures, explicitly exploiting context information. Experiments were conducted on a large French broadcast news corpus from the ESTER benchmark. Results demonstrate the added value of our method, with a significant improvement in both the normalized cross entropy and the equal error rate. | |
| 13:50 | Recognition of Spontaneous Conversational Speech using Long Short-Term Memory Phoneme Predictions |
| (Technische Universitaet Muenchen) (Technische Universitaet Muenchen) (Technische Universitaet Muenchen) (Technische Universitaet Muenchen) | |
| We present a novel continuous speech recognition framework designed to unite the principles of triphone and Long Short-Term Memory (LSTM) modeling. The LSTM principle allows a recurrent neural network to store and retrieve information over long time periods, which has been shown to be well suited to modeling co-articulation effects in human speech. Our system uses a bidirectional LSTM network to generate a phoneme prediction feature that is observed by a triphone-based large-vocabulary continuous speech recognition (LVCSR) decoder, together with conventional MFCC features. We evaluate both the phoneme prediction error rates of various network architectures and the word recognition performance of our Tandem approach on the COSINE database, a large corpus of conversational and noisy speech, and show that incorporating LSTM phoneme predictions into an LVCSR system leads to significantly higher word accuracies. | |
| 14:10 | Improving ASR error detection with non-decoder based features |
| (INESC-ID) (INESC-ID IST) | |
| This study reports error detection experiments in large vocabulary automatic speech recognition (ASR) systems using statistical classifiers. We explored new features gathered from knowledge sources other than the decoder itself: a binary feature that compares the outputs of two different ASR systems word by word; a feature based on the number of hits for the hypothesized bigrams, obtained by querying a very popular Web search engine; and a feature related to automatically inferred topics at the sentence and word levels. Experiments were conducted on a European Portuguese broadcast news corpus. Compared to a baseline using only decoder-based features, the combination of the baseline decoder-based features and two of these additional features led to significant improvements: from 13.87% to 12.16% classification error rate (CER) with a maximum entropy model, and from 14.01% to 12.39% CER with linear-chain conditional random fields. | |
| 14:30 | Phoneme Classification and Lattice Rescoring Based on a k-NN Approach |
| (INRS) (INRS) | |
| In this paper we propose a k-NN/SASH phoneme classification algorithm that competes favourably with state-of-the-art methods. We apply a similarity search algorithm (SASH) that has been used successfully for the classification of high-dimensional texts and images. Unlike other search algorithms, the computational time of SASH is not affected by the dimensionality of the data. Therefore, we generate fixed-length but high-dimensional feature vectors for phonemes using their underlying frames and those of their boundaries. The k-NN/SASH phoneme classifier is fast and efficient, and achieves a classification rate of 79.2% on the TIMIT test database. Finally, we apply this algorithm to rescore phoneme lattices generated by a GMM-HMM monophone recognizer for both context-independent and context-dependent tasks. In both cases, the k-NN/SASH classifier leads to improvements in the recognition rate. | |
| 14:50 | Online Adaptive Learning for Speech Recognition Decoding |
| (University of Washington) (University of Washington) | |
| We describe a new method for pruning in dynamic models, based on running an adaptive filtering algorithm online during decoding to predict aspects of the scores in the near future. These predictions are used to make well-informed pruning decisions during model expansion. We apply this idea to the case of dynamic graphical models and test it on a speech recognition database derived from Switchboard. Results show that significant speedups (approximately a factor of 2) can be obtained without any decrease in word accuracy or increase in memory usage. | |
| 15:10 | Improvements of Search Error Risk Minimization in Viterbi Beam Search for Speech Recognition |
| (NTT Corporation) (NTT Corporation) (NTT Corporation) | |
| This paper describes improvements in a search error risk minimization approach to fast beam search for speech recognition. In our previous work, we proposed this approach to reduce search errors by optimizing the pruning criterion. While conventional methods use heuristic criteria to prune hypotheses, our proposed method employs a pruning function that makes a more precise decision using rich features extracted from each hypothesis. The parameters of the function can be estimated to minimize a loss function based on the search error risk. In this paper, we improve this method by introducing a modified loss function, arc-averaged risk, which potentially has a higher correlation with actual error rate than the original one. We also investigate various combinations of features. Experimental results show that further search error reduction over the original method is obtained in a 100K-word vocabulary lecture speech transcription task. |
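
For orientation, the conventional heuristic pruning that the last abstract contrasts with its learned pruning function can be sketched roughly as follows. This is only a minimal illustration of standard score-based beam pruning, not the method of any paper in this session; the function name `beam_prune`, the example hypotheses, and the beam width are all hypothetical.

```python
from typing import Dict


def beam_prune(hyps: Dict[str, float], beam_width: float) -> Dict[str, float]:
    """Keep hypotheses whose log-score lies within beam_width of the best one.

    This is the usual heuristic criterion: a single score threshold relative
    to the current best hypothesis, with no other features considered.
    """
    if not hyps:
        return {}
    best = max(hyps.values())
    return {h: s for h, s in hyps.items() if s >= best - beam_width}


# Hypothetical partial hypotheses with made-up log-scores.
active = {"the cat": -10.0, "the cap": -12.5, "a cat": -18.0}
survivors = beam_prune(active, beam_width=5.0)
print(survivors)
```

A hypothesis pruned here can never be recovered, so an overly narrow beam causes search errors, which is the risk a learned pruning criterion aims to reduce.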