| Time: | Tuesday 13:30 | Place: | 201A | Type: | Oral |
| Chair: | Haizhou Li | ||||
| 13:30 | Looking for relevant features for speaker role recognition |
| (IRIT- Université de Toulouse) (IRIT- Université de Toulouse) (IRIT- Université de Toulouse) (IRIT- Université de Toulouse) | |
| When listening to foreign radio or TV programs, we are able to pick up some information from the way people interact with each other and easily identify the most dominant speaker or the person being interviewed. Our work relies on the existence of clues about speaker roles in acoustic and prosodic low-level features extracted from audio files and from speaker segmentations. In this paper we describe an original language-independent method that recognizes 5 roles (Anchor, Journalist, Other, Punctual Journalist, Punctual Other) with an accuracy of 85% on a 13-hour corpus of 46 documents that includes different radio shows. A feature selection method is used to highlight the most relevant features for each speaker role. | |
| 13:50 | Prosodic Speaker Verification using Subspace Multinomial Models with Intersession Compensation |
| (Brno University of Technology) (Brno University of Technology) (Brno University of Technology) (Speech Technology and Research Laboratory, SRI International) (Brno University of Technology) | |
| We propose a novel approach to modeling prosodic features. Inspired by the Joint Factor Analysis (JFA) model, our model is based on the same idea of introducing a subspace of model parameters. However, the underlying Gaussian mixture distribution of JFA is replaced by a multinomial distribution to model sequences of discrete units rather than continuous features. In this work, we use the subspace model as a feature extractor for support vector machines (SVMs), similar to the recently proposed JFA in total variability space. We show that high-dimensional count vectors can be reduced to a low dimension while keeping system performance stable. With additional intersession compensation, we achieve a 30% relative improvement over the baseline system and reach an equal error rate of 8.8% on the NIST 2006 SRE dataset. | |
| 14:10 | The Estimation and Kernel Metric of Spectral Correlation for Text-Independent Speaker Verification |
| (iFly Speech Lab, University of Science and Technology of China, China & Institute for Infocomm Research, Agency for Science, Technology and Research (A*STAR), Singapore) (Institute for Infocomm Research, Agency for Science, Technology and Research (A*STAR), Singapore) (Institute for Infocomm Research, Agency for Science, Technology and Research (A*STAR), Singapore) (Institute for Infocomm Research, Agency for Science, Technology and Research (A*STAR), Singapore) (iFly Speech Lab, University of Science and Technology of China, China) (iFly Speech Lab, University of Science and Technology of China, China) | |
| Gaussian mixture models (GMMs) are commonly used in text-independent speaker verification for modeling the spectral distribution of speech. Recent studies have shown the effectiveness of characterizing speaker information using just the mean vectors of the GMM in conjunction with a support vector machine (SVM). This paper advocates the use of spectral correlation captured by covariance matrices, and investigates its effectiveness both compared to and as a complement to the mean vectors. We examine two approaches to estimating the spectral correlation, namely homoscedastic and heteroscedastic modeling. We introduce two kernel metrics, the Frobenius angle and the log-Euclidean inner product, for measuring the similarity between speech utterances in terms of spectral correlation. Experiments conducted on the NIST 2006 speaker verification task show that an improvement of approximately 10% is achieved by using the spectral correlation in conjunction with the mean vectors. | |
| 14:30 | Improving Monaural Speaker Identification by Double-Talk Detection |
| (School of Computing, University of Eastern Finland) (Dept. of Electronic Systems, Aalborg University, Denmark) (School of Computing, University of Eastern Finland) (Dept. of Electronic Systems, Aalborg University, Denmark) (Dept. of Media Technology, Aalborg University, Denmark) (Dept. of Electronic Systems, Aalborg University, Denmark) (School of Computing, University of Eastern Finland) | |
| This paper describes a novel approach to improving monaural speaker identification where two speakers are present in a single-microphone recording. The goal is to identify both of the underlying speakers in the given mixture. The proposed approach is composed of a double-talk detector (DTD) as a pre-processor and a speaker identification back-end. We demonstrate that including the double-talk detector improves speaker identification accuracy. Experiments on the GRID corpus show that including the DTD improves average recognition accuracy from 96.53% to 97.43%. | |
| 14:50 | Exploring subsegmental and suprasegmental features for a text-dependent speaker verification in distant speech signals |
| (International Institute of Information Technology, Hyderabad, India) (Department of Computer Science and Engineering, Indian Institute of Technology Madras, India) (International Institute of Information Technology, Hyderabad, India) | |
| Existing automatic speaker verification (ASV) systems perform with high accuracy when the speech signal is collected close to the mouth of the speaker (< 1 ft). However, the performance of these systems degrades significantly when speech signals are collected at a distance from the speaker (2-6 ft). The objective of this paper is to address some issues in the processing of speech signals collected at a distance from the speaker, for a text-dependent ASV system. An acoustic feature derived from short segments of speech signals is proposed for the ASV task. The key idea is to exploit the high signal-to-noise nature of short segments of speech in the vicinity of impulse-like excitations. We show that the proposed feature yields better speaker verification performance than mel-frequency cepstral coefficients (MFCCs). In addition, regions of high signal-to-reverberation ratio, duration and pitch information are used to improve the performance of the ASV system for distant speech. | |
| 15:10 | A Fast Implementation of Factor Analysis for Speaker Verification |
| (University of Science and Technology of China) (Shanda Innovation Institute) (Shanda Innovation Institute) (Shanda Innovation Institute) (University of Science and Technology of China) | |
| The problem of session variability in text-independent speaker verification has been tackled actively for a few years. The factor analysis approach has been successfully applied to solving the session variability problem. However, it suffers from a large amount of computational overhead. In this paper, a fast implementation of the factor analysis approach with GMM Gaussian pre-selection is proposed. In our method, the EM statistics are calculated using only the Gaussians selected by a cluster UBM, which speeds up estimation of the factor analysis model. Experimental results on the NIST SRE 2006 evaluation show that the presented approach can provide as much as a 7 to 8x speedup over the baseline factor analysis system with similar performance.
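
The 15:10 abstract does not say how the cluster UBM is built or how many Gaussians it selects, so the following is only an illustrative sketch of the general idea of Gaussian pre-selection, not the authors' implementation. It assumes diagonal covariances, a small "cluster" model whose components each own a fixed list of full-UBM Gaussians (the hypothetical `members` table), and selection of the single best cluster per frame. All names and the toy parameters are assumptions made for this example.

```python
import math

def log_gauss(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at frame x."""
    return -0.5 * sum(math.log(2 * math.pi * v) + (xi - m) ** 2 / v
                      for xi, m, v in zip(x, mean, var))

def preselect(x, cluster_means, cluster_vars, members):
    """Pick the best-scoring cluster and return the full-UBM indices it owns."""
    best = max(range(len(cluster_means)),
               key=lambda c: log_gauss(x, cluster_means[c], cluster_vars[c]))
    return members[best]

def sufficient_stats(frames, weights, means, variances, selector=None):
    """Zeroth-order (N) and first-order (F) statistics.

    If selector is given, posteriors are computed only over the Gaussians
    it returns for each frame; all other Gaussians get zero occupancy.
    """
    n_comp, dim = len(means), len(means[0])
    N = [0.0] * n_comp
    F = [[0.0] * dim for _ in range(n_comp)]
    for x in frames:
        idx = list(selector(x)) if selector else list(range(n_comp))
        logp = [math.log(weights[k]) + log_gauss(x, means[k], variances[k])
                for k in idx]
        mx = max(logp)  # subtract the max for numerical stability
        denom = sum(math.exp(l - mx) for l in logp)
        for k, l in zip(idx, logp):
            post = math.exp(l - mx) / denom
            N[k] += post
            for d in range(dim):
                F[k][d] += post * x[d]
    return N, F

# Toy 1-D model (assumed values): four UBM Gaussians grouped into two clusters.
weights = [0.25, 0.25, 0.25, 0.25]
means = [[-10.5], [-9.5], [9.5], [10.5]]
variances = [[1.0], [1.0], [1.0], [1.0]]
cluster_means = [[-10.0], [10.0]]
cluster_vars = [[2.0], [2.0]]
members = [[0, 1], [2, 3]]
frames = [[-10.0], [10.2]]

N_full, F_full = sufficient_stats(frames, weights, means, variances)
N_fast, F_fast = sufficient_stats(
    frames, weights, means, variances,
    selector=lambda x: preselect(x, cluster_means, cluster_vars, members))
```

In this toy setup the pre-selected statistics match the full ones almost exactly, because the skipped Gaussians have negligible posterior mass. Each frame evaluates two cluster Gaussians plus two UBM Gaussians rather than all four, a saving that only becomes meaningful with a realistically large UBM.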