Tue-Ses3-O1:
ASR: Acoustic Models II

This is the final program for this session. For oral sessions, the timing on the left is the current presentation order, but this may still change, so please check at the conference itself.
Time: Tuesday 16:00 Place: Hall A/B Type: Oral
Chair: Mark J. F. Gales
16:00 Boosting Systems for LVCSR
George Saon (IBM T.J. Watson Research Center)
Hagen Soltau (IBM T.J. Watson Research Center)
We employ a variant of the popular Adaboost algorithm to train multiple acoustic models such that the aggregate system exhibits improved performance over the individual recognizers. Each model is trained sequentially on re-weighted versions of the training data. At each iteration, the weights are decreased for the frames that are correctly decoded by the current system. These weights are then multiplied by the frame-level statistics for the decision trees and Gaussian mixture components of the next iteration's system. The composite system uses a log-linear combination of HMM state observation likelihoods. We report experimental results on several broadcast news transcription setups which differ in the language being spoken (English and Arabic) and in the amount of training data. Our findings suggest that significant gains can be obtained for small amounts of training data even after feature and model-space discriminative training.
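The two mechanisms named in this abstract, frame re-weighting and log-linear combination of state likelihoods, can be illustrated with a minimal sketch. The abstract does not give the exact update rule, so the multiplicative factor `beta`, the renormalization step, and the function names below are assumptions for illustration only, not the authors' implementation.

```python
import math

def reweight_frames(weights, correct, beta=0.5):
    """Decrease the weight of correctly decoded frames.

    `beta` (< 1) is a hypothetical down-weighting factor; the weights are
    renormalized so that their mean stays at 1.
    """
    updated = [w * beta if ok else w for w, ok in zip(weights, correct)]
    scale = len(updated) / sum(updated)
    return [w * scale for w in updated]

def combine_log_likelihoods(per_model_loglikes, lambdas):
    """Log-linear combination: sum over models k of lambda_k * log p_k(x | s)."""
    return sum(lam * ll for lam, ll in zip(lambdas, per_model_loglikes))

# Four frames, the first two decoded correctly by the current model.
weights = reweight_frames([1.0, 1.0, 1.0, 1.0], [True, True, False, False])

# Two models scoring the same frame for the same HMM state.
score = combine_log_likelihoods([math.log(0.2), math.log(0.4)], [0.5, 0.5])
```

After re-weighting, the misrecognized frames carry more weight than the correctly decoded ones, so the next model's statistics focus on them.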
16:20 Incorporating Sparse Representation Phone Identification Features in Automatic Speech Recognition using Exponential Families
Vaibhava Goel (IBM T.J. Watson Research Center)
Tara Sainath (IBM T.J. Watson Research Center)
Bhuvana Ramabhadran (IBM T.J. Watson Research Center)
Peder Olsen (IBM T.J. Watson Research Center)
David Nahamoo (IBM T.J. Watson Research Center)
Dimitri Kanevsky (IBM T.J. Watson Research Center)
Sparse representation phone identification features (SPIF) is a recently developed technique for estimating phone posterior probabilities conditioned on an acoustic feature vector. In this paper, we explore incorporating SPIF phone posterior probability estimates in a large vocabulary continuous speech recognition (LVCSR) task by including them as additional features of exponential densities that model the HMM state emission likelihoods. We compare our proposed approach to a number of other well-known methods of combining feature streams or multiple LVCSR systems. Our experiments show that using exponential models to combine features results in a word error rate reduction of 0.5% absolute (18.7% down to 18.2%); this is comparable to the best error rate reduction obtained from system combination methods, but without having to build multiple systems or tune the system combination weights.
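For readers comparing this figure with relative reductions quoted elsewhere in the session, the short sketch below converts the absolute numbers into a relative one. The relative value is derived here from the two error rates in the abstract and is not a figure reported by the authors; the function name is an illustrative choice.

```python
def relative_reduction(baseline_wer, new_wer):
    """Relative word error rate reduction, in percent of the baseline."""
    return (baseline_wer - new_wer) / baseline_wer * 100.0

absolute = 18.7 - 18.2                    # about 0.5 points absolute
relative = relative_reduction(18.7, 18.2)  # about 2.67% relative
```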
16:40 Integrating MLP Features and Discriminative Training in Data Sampling Based Ensemble Acoustic Modeling
Xin Chen (Univ. of Missouri)
Yunxin Zhao (Univ. of Missouri)
In this paper, we propose to incorporate the widely used multilayer perceptron (MLP) features and discriminative training (DT) into our recent data-sampling based ensemble acoustic models to further improve the quality of the individual models as well as the diversity among the models. We also propose applying speaker-model distance based speaker clustering for data sampling to construct ensembles of acoustic models for speaker-independent speech recognition. By using these methods on the speaker-independent TIMIT phone recognition task, we have obtained a phoneme recognition accuracy of 77.1% on the TIMIT complete test set, an absolute improvement of 5.4% over our conventional HMM baseline system, making this one of the best reported results on the TIMIT continuous phoneme recognition task.
17:00 Semi-Supervised Training of Gaussian Mixture Models by Conditional Entropy Minimization
Jui-Ting Huang (University of Illinois at Urbana-Champaign)
Mark Hasegawa-Johnson (University of Illinois at Urbana-Champaign)
In this paper, we propose a new semi-supervised training method for Gaussian Mixture Models. We add a conditional entropy minimizer to the maximum mutual information criterion, which makes it possible to incorporate unlabeled data in a discriminative training fashion. The training method is simple but surprisingly effective. The preconditioned conjugate gradient method provides a reasonable convergence rate for parameter updates. Phonetic classification experiments on the TIMIT corpus demonstrate significant improvements due to unlabeled data under our training criterion.
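The shape of such a criterion can be sketched as a labeled-data term (log posterior of the correct class, as in maximum mutual information training) minus a weighted conditional entropy of the class posteriors on unlabeled data. The sketch below is only an illustration under simplifying assumptions: one 1-D Gaussian per class instead of a full mixture, a hypothetical trade-off weight `alpha`, and no optimizer. It is not the authors' formulation.

```python
import math

def gaussian_logpdf(x, mean, var):
    """Log density of a 1-D Gaussian."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def class_posteriors(x, params, priors):
    """P(class | x) via Bayes' rule, computed stably in the log domain."""
    logs = [math.log(p) + gaussian_logpdf(x, m, v)
            for (m, v), p in zip(params, priors)]
    mx = max(logs)
    exps = [math.exp(l - mx) for l in logs]
    total = sum(exps)
    return [e / total for e in exps]

def objective(labeled, unlabeled, params, priors, alpha=1.0):
    """Labeled log posteriors minus alpha times unlabeled conditional entropy."""
    mmi = sum(math.log(class_posteriors(x, params, priors)[y])
              for x, y in labeled)
    entropy = 0.0
    for x in unlabeled:
        post = class_posteriors(x, params, priors)
        entropy -= sum(p * math.log(p) for p in post if p > 0.0)
    return mmi - alpha * entropy

labeled = [(-2.0, 0), (2.0, 1)]   # (observation, class index)
unlabeled = [-1.5, 1.8]
priors = [0.5, 0.5]
separated = objective(labeled, unlabeled, [(-2.0, 1.0), (2.0, 1.0)], priors)
overlapping = objective(labeled, unlabeled, [(-0.1, 1.0), (0.1, 1.0)], priors)
```

Parameters that place the unlabeled points far from the decision boundary yield low posterior entropy and therefore a higher objective, which is the intuition behind using unlabeled data this way.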
17:20 A Study of Irrelevant Variability Normalization Based Training and Unsupervised Online Adaptation for LVCSR
Guangchuan Shi (Microsoft Research Asia, and Shanghai Jiao Tong University)
Yu Shi (Microsoft Research Asia)
Qiang Huo (Microsoft Research Asia)
This paper presents an experimental study of a maximum likelihood (ML) approach to irrelevant variability normalization (IVN) based training and unsupervised online adaptation for large vocabulary continuous speech recognition. A moving-window based frame labeling method is used for acoustic sniffing. The IVN-based approach achieves a 10% relative word error rate reduction over an ML-trained baseline system on a Switchboard-1 conversational telephone speech transcription task.
17:40 Improvements to Generalized Discriminative Feature Transformation for Speech Recognition
Roger Hsiao (Language Technologies Institute, Carnegie Mellon University)
Florian Metze (Language Technologies Institute, Carnegie Mellon University)
Tanja Schultz (Language Technologies Institute, Carnegie Mellon University)
Generalized Discriminative Feature Transformation (GDFT) is a feature space discriminative training algorithm for automatic speech recognition (ASR). GDFT uses Lagrange relaxation to transform the constrained maximum likelihood linear regression (CMLLR) algorithm for feature space discriminative training. This paper presents recent improvements to GDFT, which are achieved by adding regularization to the optimization problem. The resulting algorithm is called regularized GDFT (rGDFT), and we show that many regularization and smoothing techniques developed for model space discriminative training are also applicable to feature space training. We evaluated rGDFT on a real-time Iraqi ASR system and also on a large scale Arabic ASR task.
