| Time: | Tuesday 13:30 | Place: | International Conference Room B | Type: | Poster |
| Chair: | Tetsuya Takiguchi | ||||
| #1 | Improved Phoneme Recognition by Integrating Evidence from Spectro-temporal and Cepstral Features |
| (Graduate Institute of Communication Engineering, National Taiwan University, Taiwan) (Graduate Institute of Communication Engineering, National Taiwan University, Taiwan) (Graduate Institute of Communication Engineering, National Taiwan University, Taiwan) | |
| Gabor features have been proposed for extracting spectro-temporal modulation information, and they yield significant improvements in recognition performance. In this paper, we propose the integration of Gabor posteriors with MFCC posteriors, yielding a relative improvement of 14.3% over an MFCC Tandem system. For different types of acoustic units, we analyze the complementarity between Gabor features, which carry long-term spectro-temporal modulation information from the mel-spectrogram, and MFCC features, which carry short-term temporal information from the cepstral domain. It is found that Gabor features are better for vowel recognition while MFCCs are better for consonants. This explains why their integration offers improvements. | |
| #2 | Using Spectro-Temporal Features to Improve AFE Feature Extraction for ASR |
| (International Computer Science Institute/University of California - Berkeley) (International Computer Science Institute/University of California - Berkeley) | |
| Previous work has shown that spectro-temporal features reduce WER for automatic speech recognition under noisy conditions. The spectro-temporal framework, however, is not the only way to process features in order to reduce errors due to noise in the signal. The two-stage mel-warped Wiener filtering method used in the "Advanced Front End" (AFE), now a standard front end for robust recognition, is another way. Since the spectro-temporal approach can be applied to a noise-reduced spectrum, we wanted to explore whether spectro-temporal features could improve the performance of AFE for ASR. We show that computing spectro-temporal features after AFE processing results in a 45% relative improvement compared to AFE in clean conditions and a 6% to 30% improvement in noisy conditions on the Aurora2 clean training setup. | |
| #3 | Using Harmonic Phase Information to Improve ASR Rate |
| (Aholab Signal Processing Laboratory, University of the Basque Country) (Aholab Signal Processing Laboratory, University of the Basque Country) (Aholab Signal Processing Laboratory, University of the Basque Country) (Aholab Signal Processing Laboratory, University of the Basque Country) (Aholab Signal Processing Laboratory, University of the Basque Country) (Aholab Signal Processing Laboratory, University of the Basque Country) | |
| Spectral phase information is usually discarded in automatic speech recognition (ASR). The Relative Phase Shift (RPS), a novel representation of the phase information of speech, has properties that appear well suited to improving the recognition rate. In this paper we describe the RPS representation, discuss different ways to parameterize this information in a form suitable for HMM modelling, and present the results of the evaluation experiments. WER improvements ranging from 12% to 22% open promising perspectives for the use of this information jointly with the classical MFCC parameterization. Index Terms: ASR, phase spectrum, harmonic analysis | |
| #4 | Speech Recognition using Long-Term Phase Information |
| (Toyohashi University of Technology) (Toyohashi University of Technology) (Toyohashi University of Technology) | |
| Current speech recognition systems mainly use amplitude spectrum-based features such as MFCC as acoustic feature parameters, while discarding phase spectral information. The results of perceptual experiments, however, suggest that phase spectral information based on long-term analysis includes certain linguistic information. In this paper, we propose the use of phase features based on long-term analysis for speech recognition. We use two types of parameters: the delta phase parameter as a group delay and analytic group delay features. Isolated word and continuous digit recognition experiments were performed, resulting in a word or digit accuracy greater than 90% in each experiment. The experimental results confirmed that a long-term phase spectrum includes sufficient information for recognizing speech. Furthermore, combining the likelihoods of MFCC and the long-term group delay cepstrum outperformed the MFCC baseline by 20% relative for clean speech. | |
| #5 | Low-dimensional Space Transforms of Posteriors in Speech Recognition |
| (Department of Cybernetics, University of West Bohemia) (Department of Cybernetics, University of West Bohemia) (Department of Cybernetics, University of West Bohemia) | |
| In this paper we present three novel posterior transforms whose primary goal is a large reduction in feature vector size. The presented methods transform the posteriors to a 1-D or 2-D space. At such a high reduction ratio, the methods usually applied fail to keep the discriminative information. In contrast, the presented methods were specifically designed to retain most of the discriminative information. In our experiments, we used several different combinations of feature extraction methods in common use today, i.e. the PLP features (augmented with delta and acceleration coefficients) and two kinds of MLP-ANN features: the bottleneck (BN) and posterior estimates (PE). The experiments were designed with special attention to assessing possible performance improvements when the PLP features are combined either with the BN features or with the PE features whose dimensionality was reduced using the proposed feature transforms. The performance of the designed transforms was tested on two different speech corpora: the telephone speech SpeechDat-East corpus and the multi-modal Czech Audio-Visual corpus. | |
| #6 | Hierarchical Bottle Neck Features for LVCSR |
| (RWTH Aachen University) (RWTH Aachen University) (RWTH Aachen University) | |
| This paper investigates the combination of different neural network topologies for probabilistic feature extraction. On the one hand, a five-layer neural network used in bottle neck feature extraction makes it possible to obtain an arbitrary feature size without dimensionality reduction by transform, independently of the training targets. On the other hand, a hierarchical processing technique is effective and robust over several conditions. Even though the hierarchical and bottle neck processing perform equally well, the combination of both topologies improves the system by 5% relative. Furthermore, the MFCC baseline system is improved by up to 20% relative. This behaviour could be confirmed on two different tasks. In addition, we analyse the influence of multi-resolution RASTA filtering and long-term spectral features as input for the neural network feature extraction. | |
| #7 | Hierarchical Neural Net Architectures for Feature Extraction in ASR |
| (Brno University of Technology, Brno, Czech Republic) (Brno University of Technology, Brno, Czech Republic) | |
| This paper presents the use of a neural net hierarchy for feature extraction in ASR. The recently proposed Bottle-Neck feature extraction is extended and used in hierarchical structures to enhance the discriminative property of the features. Although many ways of hierarchical classification/feature extraction have been proposed, we restrict ourselves to using the outputs of the first-stage neural network together with its inputs. This approach is evaluated on meeting speech recognition using the RT'05 and RT'07 test sets. The evaluated hierarchical feature extraction brings consistent improvement over the use of just the first-level neural net. | |
| #8 | Mutual Information analysis for feature and sensor subset selection in surface electromyography based speech recognition |
| (Raytheon BBN Technologies) (Raytheon BBN Technologies) (Raytheon BBN Technologies) | |
| In this paper, we investigate the use of surface electromyographic (sEMG) signals collected from articulatory muscles on the face and neck for performing automatic speech recognition. We present a systematic information-theoretic analysis for feature selection and optimal sensor subset selection. Our results indicate that Mel-cepstral frequency features are best suited for sEMG-based discrimination. Further, the sensor subset ranking obtained through the mutual information (MI) experiments is consistent with the results obtained from hidden Markov model based recognition. The framework presented here can be used for determining the best feature and sensor subset for a given speaker a priori, instead of determining them a posteriori from recognition experiments. We achieve a mean recognition accuracy of 80.6% with the best five-sensor subset chosen by the MI analysis, compared with 79.6% obtained using all the sensors. | |
| #9 | Learning from human errors: Prediction of phoneme confusions based on modified ASR training |
| (University of Oldenburg) (University of Oldenburg) | |
| In an attempt to improve models of human perception, the recognition of phonemes in nonsense utterances was predicted with automatic speech recognition (ASR) in order to analyze its applicability for modeling human speech recognition (HSR) in noise. In the first experiments, several feature types are used as input for an ASR system; the resulting phoneme scores are compared to listening experiments using the same speech data. With conventional training, the highest correlation between predicted and measured recognition was observed for perceptual linear prediction features (r = 0.84). Secondly, a new training paradigm for ASR is proposed with the aim of improving the prediction of phoneme intelligibility. For this ‘perceptual training’, the original utterance labels are modified based on the confusions measured in HSR tests. The modified ASR training improved the overall prediction, with the best models (r = 0.89) exceeding those obtained with conventional training (r = 0.80). |
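
The r values quoted in abstract #9 are correlations between ASR-predicted and listener-measured phoneme recognition. As a rough illustration only, the sketch below computes a Pearson correlation coefficient over per-phoneme scores. The function name `pearson_r` and the score lists are hypothetical and are not taken from the paper, and the abstract does not state which correlation measure or which per-phoneme scores the authors used.

```python
from math import sqrt


def pearson_r(predicted, measured):
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(predicted) != len(measured) or len(predicted) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    n = len(predicted)
    mean_p = sum(predicted) / n
    mean_m = sum(measured) / n
    # Covariance and variances, all left unnormalised (the 1/n factors cancel).
    cov = sum((p - mean_p) * (m - mean_m) for p, m in zip(predicted, measured))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_m = sum((m - mean_m) ** 2 for m in measured)
    if var_p == 0 or var_m == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_p * var_m)


# Hypothetical per-phoneme recognition rates (illustrative, NOT data from the paper).
asr_predicted = [0.92, 0.75, 0.60, 0.81, 0.43]  # ASR phoneme scores
hsr_measured = [0.95, 0.70, 0.66, 0.85, 0.50]   # listening-test scores
r = pearson_r(asr_predicted, hsr_measured)
print(f"r = {r:.2f}")
```

A value close to 1 would mean the ASR system ranks phonemes by difficulty much as human listeners do, which is the sense in which the abstract compares training schemes.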