Mon-Ses3-O2:
Speaker characterization and recognition I

This is the final program for this session. For oral sessions, the timing on the left is the current presentation order, but this may still change, so please check at the conference itself.
Time: Monday 16:00 Place: 201A Type: Oral
Chair: William Campbell
16:00 Simple and Efficient Speaker Comparison using Approximate KL Divergence
William Campbell (MIT Lincoln Laboratory)
Zahi Karam (MIT Lincoln Laboratory, DSPG Research Laboratory of Electronics at MIT)
We describe a simple, novel, and efficient system for speaker comparison with two main components. First, the system uses a new approximate KL divergence distance extending earlier GMM parameter vector SVM kernels. The approximate distance incorporates data-dependent mixture weights as well as the standard MAP-adapted GMM mean parameters. Second, the system applies a weighted nuisance projection method for channel compensation. A simple eigenvector method of training is presented. The resulting speaker comparison system is straightforward to implement and is computationally simple: only two low-rank matrix multiplies and an inner product are needed for comparison of two GMM parameter vectors. We demonstrate the approach on a NIST 2008 speaker recognition evaluation task. We provide insight into what methods, parameters, and features are critical for good performance.
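The cost quoted in the abstract can be illustrated with a minimal sketch. It assumes a plain (unweighted) nuisance projection P = I - U Uᵀ with an orthonormal, low-rank basis U; the weighted projection and the approximate KL kernel from the paper are not reproduced here, and the names `compare_with_projection`, `x`, `y`, and `U` are hypothetical. Under that assumption, the inner product of the two projected vectors needs only two low-rank multiplies (Uᵀx and Uᵀy) and inner products:

```python
import numpy as np

def compare_with_projection(x, y, U):
    """Inner product of x and y after removing the nuisance subspace span(U).

    Assumes the columns of U are orthonormal, so that
    (Px) . (Py) = x . y - (U^T x) . (U^T y)  with  P = I - U U^T.
    This is an illustrative, unweighted projection, not the paper's method.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    U = np.asarray(U, dtype=float)
    ux = U.T @ x  # first low-rank multiply: k values for a d x k basis
    uy = U.T @ y  # second low-rank multiply
    return float(x @ y - ux @ uy)

if __name__ == "__main__":
    # Toy example: 3-dimensional vectors, 1-dimensional nuisance subspace.
    U = np.array([[1.0], [0.0], [0.0]])
    print(compare_with_projection([1.0, 2.0, 3.0], [4.0, 5.0, 6.0], U))
```

The full d x d projection matrix is never formed, which is what keeps the comparison cheap when the number of nuisance directions is much smaller than the vector dimension.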
16:20 The IIR NIST SRE 2008 and 2010 Summed Channel Speaker Recognition Systems
Hanwu Sun (Institute for Infocomm Research)
Bin Ma (Institute for Infocomm Research)
Chien-Lin Huang (Institute for Infocomm Research)
Trung Hieu Nguyen (Institute for Infocomm Research)
Haizhou Li (Institute for Infocomm Research)
This paper describes the IIR speaker recognition system for the summed channel evaluation tasks in the 2008 and 2010 NIST SREs. The system includes three main modules: voice activity detection, speaker diarization and speaker recognition. The front-end process employs a spectral subtraction based voice activity detection algorithm for effective speech frame selection. The speaker diarization system applied for the 2007 and 2009 NIST RTs is adopted for the summed channel speech segmentation. A hybrid purifying and clustering algorithm is used to cluster the summed channel speech into two speaker clusters. The GMM-SVM speaker recognition system is adopted to evaluate the performance with both MFCC and LPCC features. The system achieves competitive overall EERs of 3.46% in the 1conv-summed task and 1.87% in the 8conv-summed task when only the English trials are considered.
16:40 Speaker Characterization Using Long-Term and Temporal Information
Chien-Lin Huang (Department of Human Language Technology, Institute for Infocomm Research, Singapore)
Hanwu Sun (Department of Human Language Technology, Institute for Infocomm Research, Singapore)
Bin Ma (Department of Human Language Technology, Institute for Infocomm Research, Singapore)
Haizhou Li (Department of Human Language Technology, Institute for Infocomm Research, Singapore)
This paper presents new techniques for front-end analysis using long-term and temporal information for speaker recognition. We propose a long-term feature analysis strategy that averages short-time spectral features over a period of time in an effort to capture the speaker traits that are manifested over a speech segment longer than a spectral frame. We found that the moving averages of temporal information are effective in speaker recognition as well. The experiments on the 2008 NIST Speaker Recognition Evaluation dataset show that the long-term and temporal information contribute to substantial EER reductions.
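The idea of averaging short-time spectral features over a longer span can be sketched as a sliding-window mean over frames. This is only an illustration of a generic moving average: the window length, the feature type, and the function name `long_term_average` are assumptions, and the abstract does not specify the authors' exact procedure.

```python
import numpy as np

def long_term_average(features, window):
    """Moving average of frame-level features over `window` consecutive frames.

    `features` has shape (num_frames, num_coeffs); the result has shape
    (num_frames - window + 1, num_coeffs). Window length is a free choice here.
    """
    features = np.asarray(features, dtype=float)
    if features.ndim != 2:
        raise ValueError("features must be a 2-D array (frames x coefficients)")
    if window < 1 or window > features.shape[0]:
        raise ValueError("window must be between 1 and the number of frames")
    # Prefix sums with a leading row of zeros let each window sum be
    # computed as a difference of two rows.
    csum = np.vstack([np.zeros((1, features.shape[1])),
                      np.cumsum(features, axis=0)])
    return (csum[window:] - csum[:-window]) / window

if __name__ == "__main__":
    frames = np.arange(10, dtype=float).reshape(5, 2)  # 5 frames, 2 coefficients
    print(long_term_average(frames, window=3))
```

Each output row summarizes a segment longer than a single spectral frame, which is the property the abstract attributes to long-term features.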
17:00 Score-level Compensation of Extreme Speech Duration Variability in Speaker Verification
Sergio Perez-Gomez (Universidad Autonoma de Madrid)
Daniel Ramos-Castro (Universidad Autonoma de Madrid)
Javier Gonzalez-Dominguez (Universidad Autonoma de Madrid)
Joaquin Gonzalez-Rodriguez (Universidad Autonoma de Madrid)
In this work we aim to compensate for the degrading effects of utterance length variability on speaker verification systems, which appear in many typical applications such as forensics. The paper concentrates on the score misalignments due to different utterance lengths, proposing several algorithms for their normalization. In order to test the proposed methods, we have built two corpora from NIST SRE 2006 and 2008 data to simulate high utterance length variability. Results show an improvement of the overall system performance for all the algorithms proposed, which is significant even when score normalization techniques such as T-Norm are used.
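For context, T-Norm, the baseline score normalization named in the abstract, is commonly computed by scoring the test utterance against a cohort of impostor models and standardizing the raw score with the cohort mean and standard deviation. The sketch below shows only that standard form; it is not one of the duration-compensation algorithms proposed in the paper, and the function and argument names are hypothetical.

```python
import numpy as np

def t_norm(raw_score, cohort_scores):
    """Standardize a verification score with test-utterance cohort statistics.

    `cohort_scores` are the scores of the same test utterance against a set
    of impostor (cohort) models. Uses the population standard deviation.
    """
    cohort = np.asarray(cohort_scores, dtype=float)
    if cohort.size < 2:
        raise ValueError("need at least two cohort scores")
    mu = cohort.mean()
    sigma = cohort.std()
    if sigma == 0.0:
        raise ValueError("cohort scores have zero variance")
    return float((raw_score - mu) / sigma)

if __name__ == "__main__":
    # Toy numbers, chosen only to make the arithmetic easy to follow.
    print(t_norm(3.0, [1.0, 3.0]))
```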
17:20 Speaker Recognition Experiments using Connectionist Transformation Network Features
Alberto Abad (INESC-ID Lisboa, Portugal)
Isabel Trancoso (IST/INESC-ID Lisboa, Portugal)
The use of adaptation transforms common in speech recognition systems as features for speaker recognition is an appealing alternative approach to conventional short-term cepstral modelling of speaker characteristics. Recently, we have shown that it is possible to use transformation weights derived from adaptation techniques applied to the Multi-Layer Perceptrons that form a connectionist speech recognizer. The proposed method, named Transformation Network features with SVM modelling (TN-SVM), showed promising results on a subset of NIST SRE 2008 and allowed further improvements when it was combined with baseline systems. In this paper, we summarize the recently proposed TN-SVM approach and present new results. First, we explore two alternative approaches that may be used in the absence of high quality speech transcriptions. Second, we present results of the proposed approach with Nuisance Attribute Projection for session variability compensation.
17:40 Speaker Recognition using Supervised Probabilistic Principal Component Analysis
Yun Lei (University of Texas at Dallas)
John Hansen (University of Texas at Dallas)
In this study, a supervised probabilistic principal component analysis (SPPCA) model is proposed in order to integrate the speaker label information into a factor analysis approach using the well-known probabilistic principal component analysis (PPCA) model under a support vector machine (SVM) framework. The latent factor from the proposed model is believed to be more discriminative than one from the PPCA model. The proposed model, combined with different types of intersession compensation techniques in the back-end, is evaluated using the National Institute of Standards and Technology (NIST) Speaker Recognition Evaluation (SRE) 2008 data corpus, along with a comparison to the PPCA model.
