| Time: | Thursday 10:00 | Place: | International Conference Room C | Type: | Poster |
| Chair: | Jean-Francois Bonastre | ||||
| #1 | Improved N-gram Phonotactic Models For Language Recognition |
| (LIMSI-CNRS) (LIMSI-CNRS) (LIMSI-CNRS) | |
| This paper investigates various techniques to improve the estimation of n-gram phonotactic models for language recognition using single-best phone transcriptions and phone lattices. More precisely, we first report on the impact of the so-called acoustic scale factor on the system accuracy when using lattice-based training, and then we report on the use of n-gram cutoff and pruning techniques. Several system configurations are explored, such as the use of context-independent and context-dependent phone models, the use of single-best phone hypotheses versus phone lattices, and the use of various n-gram orders. Experiments are conducted using the LRE 2007 evaluation data and the results are reported using the a posteriori EER. The results show that the impact of these techniques on the system accuracy is highly dependent on the training conditions and that careful optimization can lead to performance improvements. | |
| #2 | A Study of Term Weighting in Phonotactic Approach to Spoken Language Recognition |
| (Chulalongkorn University,Thailand) (Institute for Infocomm Research, Singapore) (Institute for Infocomm Research, Singapore) (Chulalongkorn University,Thailand) (Chulalongkorn University,Thailand) (National Electronics and Computer Technology Center (NECTEC), Thailand) | |
| In the spoken language recognition approach that models phonetic lattices with a Support Vector Machine (SVM), term weighting on the supervector of N-gram probabilities is critical to recognition performance because the weighting prevents the SVM kernel from being dominated by a few large probabilities. We investigate several term weighting functions used in text retrieval, which can incorporate long-term semantic modeling into short-term N-gram modeling. The functions are evaluated on the NIST 2007 Language Recognition Evaluation (LRE) task. Results favor term weighting with redundancy of term frequency (rd), which eliminates the redundancy of unit frequency co-occurrence across languages, and the combination of rd and logtf, which demonstrates the effectiveness of combining local and global weighting functions. | |
| #3 | Exploiting Context-Dependency and Acoustic Resolution of Universal Speech Attribute Models in Spoken Language Recognition |
| (Universita' degli Studi di Enna "Kore") (Gatech) (NTNU) (Gatech) | |
| This paper expands a previously proposed universal acoustic characterization approach to spoken language identification (LID) by studying different ways of modeling attributes to improve language recognition. The motivation is to describe any spoken language with a common set of fundamental units. Thus, a spoken utterance is first tokenized into a sequence of universal attributes. Then a vector space modeling approach delivers the final LID decision. Context-dependent attribute models are now used to better capture spectral and temporal characteristics. Also, an approach to expand the set of attributes to increase the acoustic resolution is studied. Our experiments show that the tokenization accuracy positively affects LID results by producing a 2.8% absolute improvement over our previous 30-second NIST 2003 performance. This result also compares favorably with the best results on the same task known by the authors when the tokenizers are trained on language-dependent OGI-TS data. | |
| #4 | Hierarchical Multilayer Perceptron based Language Identification |
| (Idiap Research Institute) (Idiap Research Institute) (Idiap Research Institute) | |
| Automatic language identification (LID) systems generally exploit acoustic knowledge, possibly enriched by explicit language specific phonotactic or lexical constraints. This paper investigates a new LID approach based on hierarchical multilayer perceptron (MLP) classifiers, where the first layer is a "universal phoneme set MLP classifier". The resulting (multilingual) phoneme posterior sequence is fed into a second MLP taking a larger temporal context into account. The second MLP can learn/exploit implicitly different types of patterns/information such as confusion between phonemes and/or phonotactics for LID. We investigate the viability of the proposed approach by comparing it against 2 standard approaches which use phonotactic and lexical constraints with the universal phoneme set MLP classifier as emission probability estimator. On SpeechDat(II) datasets of 5 European languages, the proposed approach yields significantly better performance compared to the 2 standard approaches. | |
| #5 | The NIST 2010 Speaker Recognition Evaluation |
| (NIST) (NIST) | |
| The 2010 NIST Speaker Recognition Evaluation continues a series of evaluations of text independent speaker detection begun in 1996. It utilizes the newly collected Mixer-6 and Greybeard Corpora from the Linguistic Data Consortium. Major test conditions to be examined include variations in channel, speech style, vocal effort, and the effect of speaker aging over a multi-year period. A new primary evaluation metric giving increased weight to false alarm errors compared to misses is being used. A small evaluation test with a limited number of trials is also being offered for systems that include human expertise in their processing. | |
| #6 | Bayesian Speaker Recognition Using Gaussian Mixture Model and Laplace Approximation |
| (Institute of Information Science, Academia Sinica, Taipei, Taiwan) (Institute of Information Science, Academia Sinica, Taipei, Taiwan) (Institute of Information Science, Academia Sinica, Taipei, Taiwan) | |
| This paper presents a Bayesian approach for Gaussian mixture model (GMM)-based speaker identification. Instead of evaluating the speaker score of a test speech utterance using a single data likelihood over the GMM learned by point estimation methods according to the maximum likelihood or maximum a posteriori criteria, the Bayesian approach evaluates the score using the expectation of the data likelihood over the posterior distribution of the model parameters, which is expressed as a Bayesian integration. However, the integration cannot be derived analytically. Therefore, we apply the Laplace approximation to the derivations. Theoretically, we show that the proposed Bayesian approach is equivalent to the GMM-UBM approach when infinite training data is available for each speaker. The results of speaker identification experiments on the TIMIT corpus show that the proposed Bayesian approach consistently outperforms GMM-UBM under very limited training data conditions. | |
| #7 | What Else is New Than the Hamming Window? Robust MFCCs for Speaker Recognition via Multitapering |
| (University of Eastern Finland) (University of Eastern Finland) (Lund University) (Lund University) | |
| Usually the mel-frequency cepstral coefficients (MFCCs) are derived via a Hamming-windowed DFT spectrum. In this paper, we advocate using a so-called multitaper method instead. Multitaper methods form a spectrum estimate using multiple window functions and frequency-domain averaging. Multitapers provide a robust spectrum estimate but have not received much attention in speech processing. Our speaker recognition experiment on NIST 2002 yields equal error rates (EERs) of 9.66 % (clean data) and 16.41 % (-10 dB SNR) for the conventional Hamming method and 8.13 % (clean data) and 14.63 % (-10 dB SNR) using multitapers. Multitapering is a simple and robust alternative to the Hamming window method. | |
| #8 | Fast Computation of Speaker Characterization Vector using MLLR and Sufficient Statistics in Anchor Model Framework |
| (Indian Institute of Technology Madras) (Indian Institute of Technology Madras) | |
| The anchor modeling technique has been shown to be useful in reducing computational complexity for speaker identification and indexing of large audio databases, where speakers are projected onto a talker space spanned by a set of pre-defined anchor models represented by GMMs. The characterization of each speaker involves likelihood calculation with each anchor model and is therefore expensive even in the GMM-UBM framework using top-C mixture scoring. A computationally efficient method is proposed here to calculate the likelihood of speech utterances using anchor speaker-specific MLLR matrices and sufficient statistics estimated from the utterance. Since anchor models use distance measures to identify speakers, they are used as a first stage to select N probable speakers and are then cascaded with a conventional GMM-UBM system, which finally identifies the speaker from this reduced set. The proposed method is 4.21x faster than the conventional cascade anchor system with comparable performance on NIST-04 SRE. | |
| #9 | Graph-Embedding for Speaker Recognition |
| (DSPG, RLE at MIT / MIT Lincoln Laboratory) (MIT Lincoln Laboratory) | |
| Popular methods for speaker classification perform speaker comparison in a high-dimensional space; however, recent work has shown that most of the speaker variability is captured by a low-dimensional subspace of that space. In this paper we examine whether additional structure in terms of nonlinear manifolds exists within the high-dimensional space. We use graph embedding as a proxy to the manifold and show the use of the embedding in data visualization and exploration. ISOMAP is used to explore the existence and dimension of the space. We also examine whether the manifold assumption can help in two classification tasks: data-mining and standard NIST speaker recognition evaluations (SRE). Our results show that the data lives on a manifold and that exploiting this structure can yield significant improvements on the data-mining task. The improvement in preliminary experiments on all trials of the NIST SRE Eval-06 core task is smaller but still significant. | |
| #10 | A Hybrid Modeling Strategy for GMM-SVM Speaker Recognition System with Adaptive Relevance factor |
| (Institute for Infocomm Research) (Institute for Infocomm Research) (Institute for Infocomm Research) | |
| In the Gaussian mixture model (GMM) approach to speaker recognition, it has been found that maximum a posteriori (MAP) estimation is greatly affected by undesired variability due to varying utterance duration as well as other hidden factors related to recording devices, session environment, and phonetic content. We propose an adaptive relevance factor (RF) to compensate for this variability. On the other hand, in realistic applications, different channels are likely to correspond to different training and test conditions in terms of the quantity and quality of the speech signals. In this connection, we develop a hybrid model that combines multiple complementary systems, each of which focuses on specific condition(s). We show the effectiveness of the proposed method on the core task of the National Institute of Standards and Technology (NIST) speaker recognition evaluation (SRE) 2008. | |
| #11 | ROBUST MIXTURE MODELING USING T-DISTRIBUTION: APPLICATION TO SPEAKER ID |
| (Indian Institute of Science) (Indian Institute of Science) | |
| Robust stochastic modeling of speech is an important issue for the performance of practical applications. The Gaussian mixture model (GMM) is widely used in speaker ID, but its performance is limited in the presence of unseen noise and distortions. Such noisy data, referred to as "outliers" for the original distribution, can be better represented by heavy-tailed distributions such as Student's t-distribution. It provides a natural choice in which the heavy tail can be controlled using the degrees-of-freedom parameter. We explore a finite mixture of t-distributions model (tMM) to represent noisy speech data and show its robustness for speaker ID compared to the GMM. Using the TIMIT and NTIMIT databases, the recognition accuracies obtained with a 34-mixture tMM are 100% and 79.68%, respectively, much better than those reported in the literature. | |
| #12 | A variable frame length and rate algorithm based on the spectral kurtosis measure for speaker verification |
| (School of Electrical and Electronic Engineering, Yonsei University, Korea) (Ming Hsieh Department of Electrical Engineering, University of Southern California, USA) (School of Electrical and Electronic Engineering, Yonsei University, Korea) (Ming Hsieh Department of Electrical Engineering, University of Southern California, USA) (School of Electrical and Electronic Engineering, Yonsei University, Korea) | |
| In this paper, we propose a spectral kurtosis based approach to extract features with a variable frame length and rate for speaker verification. Since the speaker-specific information of features in each frame changes depending upon the characteristics of speech, it is important to determine the appropriate frame length and rate to extract the salient feature frames. In order to distinctively represent the characteristics of vowels and consonants both in time and frequency domains, we introduce a variable frame length and rate (VFLR) method based on spectral kurtosis, which provides a local measure of time-frequency concentration. Experimental results verify that the proposed VFLR method improves the performance of the speaker verification system on the NIST SRE-06 database by 9.725% (relative) compared to the feature extraction method with the fixed length and rate. |
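
Abstract #7 describes multitaper spectrum estimation as frequency-domain averaging over several window functions. The idea can be sketched as below. This is only an illustration under stated assumptions: the abstract does not name the taper family or the averaging weights, so sine tapers with uniform weights are used here as a simple stand-in, a naive DFT is used because the demo frame is short, and all function names are hypothetical rather than the authors' code.

```python
import cmath
import math


def sine_tapers(n, k):
    """Return k orthonormal sine tapers of length n (assumed taper family)."""
    scale = math.sqrt(2.0 / (n + 1))
    return [
        [scale * math.sin(math.pi * j * (t + 1) / (n + 1)) for t in range(n)]
        for j in range(1, k + 1)
    ]


def power_spectrum(frame):
    """Naive DFT power spectrum for bins 0..n//2; adequate for short frames."""
    n = len(frame)
    spectrum = []
    for f in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * f * t / n)
                  for t, x in enumerate(frame))
        spectrum.append(abs(acc) ** 2)
    return spectrum


def multitaper_spectrum(frame, num_tapers=4):
    """Average the power spectra of the frame under each taper (uniform weights)."""
    averaged = [0.0] * (len(frame) // 2 + 1)
    for taper in sine_tapers(len(frame), num_tapers):
        tapered = [w * x for w, x in zip(taper, frame)]
        for i, p in enumerate(power_spectrum(tapered)):
            averaged[i] += p / num_tapers
    return averaged


# Demo: a 64-sample sinusoid centred on DFT bin 8.
frame = [math.sin(2 * math.pi * 8 * t / 64) for t in range(64)]
mt = multitaper_spectrum(frame, num_tapers=4)
peak_bin = max(range(len(mt)), key=lambda i: mt[i])
```

In an MFCC front end, `multitaper_spectrum` would take the place of the single Hamming-windowed power spectrum before the mel filterbank. Averaging over tapers lowers the variance of the estimate at the cost of some frequency smearing around each spectral peak.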