| Time: | Tuesday 16:00 | Place: | Hall A/B | Type: | Oral |
| Chair: | Mark J. F. Gales | ||||
| 16:00 | Boosting Systems for LVCSR |
| (IBM T.J. Watson Research Center) (IBM T.J. Watson Research Center) | |
| We employ a variant of the popular AdaBoost algorithm to train multiple acoustic models such that the aggregate system exhibits improved performance over the individual recognizers. Each model is trained sequentially on re-weighted versions of the training data. At each iteration, the weights are decreased for the frames that are correctly decoded by the current system. These weights are then multiplied by the frame-level statistics for the decision trees and Gaussian mixture components of the next-iteration system. The composite system uses a log-linear combination of HMM state observation likelihoods. We report experimental results on several broadcast news transcription setups that differ in the language being spoken (English and Arabic) and in the amount of training data. Our findings suggest that significant gains can be obtained for small amounts of training data even after feature-space and model-space discriminative training. | |
| 16:20 | Incorporating Sparse Representation Phone Identification Features in Automatic Speech Recognition using Exponential Families |
| (IBM T.J. Watson Research Center) (IBM T.J. Watson Research Center) (IBM T.J. Watson Research Center) (IBM T.J. Watson Research Center) (IBM T.J. Watson Research Center) (IBM T.J. Watson Research Center) | |
| Sparse representation phone identification features (SPIF) are a recently developed technique for estimating phone posterior probabilities conditioned on an acoustic feature vector. In this paper, we explore incorporating SPIF phone posterior probability estimates in a large vocabulary continuous speech recognition (LVCSR) task by including them as additional features of exponential densities that model the HMM state emission likelihoods. We compare our proposed approach to a number of other well-known methods of combining feature streams or multiple LVCSR systems. Our experiments show that using exponential models to combine features results in a word error rate reduction of 0.5% absolute (18.7% down to 18.2%); this is comparable to the best error rate reduction obtained from system combination methods, but without having to build multiple systems or tune the system combination weights. | |
| 16:40 | Integrating MLP Features and Discriminative Training in Data Sampling Based Ensemble Acoustic Modeling |
| (Univ. of Missouri) (Univ. of Missouri) | |
| In this paper, we propose to incorporate the widely used multilayer perceptron (MLP) features and discriminative training (DT) into our recent data-sampling based ensemble acoustic models to further improve the quality of the individual models as well as the diversity among the models. We also propose applying speaker-model distance based speaker clustering for data sampling to construct ensembles of acoustic models for speaker-independent speech recognition. By using these methods on the speaker-independent TIMIT phone recognition task, we have obtained a phoneme recognition accuracy of 77.1% on the TIMIT complete test set, an absolute improvement of 5.4% over our conventional HMM baseline system, making this one of the best reported results on the TIMIT continuous phoneme recognition task. | |
| 17:00 | Semi-Supervised Training of Gaussian Mixture Models by Conditional Entropy Minimization |
| (University of Illinois at Urbana-Champaign) (University of Illinois at Urbana-Champaign) | |
| In this paper, we propose a new semi-supervised training method for Gaussian Mixture Models. We add a conditional entropy minimizer to the maximum mutual information criterion, which makes it possible to incorporate unlabeled data in a discriminative training fashion. The training method is simple but surprisingly effective. The preconditioned conjugate gradient method provides a reasonable convergence rate for the parameter updates. The phonetic classification experiments on the TIMIT corpus demonstrate significant improvements due to unlabeled data via our training criterion. | |
| 17:20 | A Study of Irrelevant Variability Normalization Based Training and Unsupervised Online Adaptation for LVCSR |
| (Microsoft Research Asia, and Shanghai Jiao Tong University) (Microsoft Research Asia) (Microsoft Research Asia) | |
| This paper presents an experimental study of a maximum likelihood (ML) approach to irrelevant variability normalization (IVN) based training and unsupervised online adaptation for large vocabulary continuous speech recognition. A moving-window based frame labeling method is used for acoustic sniffing. The IVN-based approach achieves a 10% relative word error rate reduction over an ML-trained baseline system on a Switchboard-1 conversational telephone speech transcription task. | |
| 17:40 | Improvements to Generalized Discriminative Feature Transformation for Speech Recognition |
| (Language Technologies Institute, Carnegie Mellon University) (Language Technologies Institute, Carnegie Mellon University) (Language Technologies Institute, Carnegie Mellon University) | |
| Generalized Discriminative Feature Transformation (GDFT) is a feature space discriminative training algorithm for automatic speech recognition (ASR). GDFT uses Lagrange relaxation to transform the constrained maximum likelihood linear regression (CMLLR) algorithm for feature space discriminative training. This paper presents recent improvements to GDFT, which are achieved by regularizing the optimization problem. The resulting algorithm is called regularized GDFT (rGDFT), and we show that many regularization and smoothing techniques developed for model space discriminative training are also applicable to feature space training. We evaluated rGDFT on a real-time Iraqi ASR system and also on a large-scale Arabic ASR task.