Tue-Ses3-O3:
Speech and audio segmentation

This is the final program for this session. For oral sessions, the timing on the left is the current presentation order, but this may still change, so please check at the conference itself.
Time: Tuesday 16:00 Place: 201B Type: Oral
Chair: Yasunari Obuchi
16:00 Fully Automatic Segmentation for Prosodic Speech Corpora
Sarah Hoffmann (Speech Processing Group, ETH Zurich, Switzerland)
Beat Pfister (Speech Processing Group, ETH Zurich, Switzerland)
While automatic methods for phonetic segmentation of speech can help with rapid annotation of corpora, most methods rely either on manually segmented data for initial training or on manual post-processing. This is very time-consuming and slows down the porting of speech systems to new languages. In the context of prosody corpora for text-to-speech (TTS) systems, we investigated methods for fully automatic phoneme segmentation using only the corpora to be segmented and an automatically generated transcription. We present a new method that improves the performance of HMM-based segmentation by correcting the boundaries between the training stages of the phoneme models with high precision. We show that, while initially aimed at single-speaker corpora, it performs equally well for multi-speaker corpora.
16:20 A Novel text-independent phonetic segmentation algorithm based on the Microcanonical Multiscale Formalism
Vahid Khanagha (INRIA Bordeaux Sud-Ouest)
Khalid Daoudi (INRIA Bordeaux Sud-Ouest)
Oriol Pont (INRIA Bordeaux Sud-Ouest)
Hussein Yahia (INRIA Bordeaux Sud-Ouest)
We propose a radically novel approach to analyze speech signals from a statistical physics perspective. Our approach relies on a new framework, the Microcanonical Multiscale Formalism (MMF), which is based on the computation of singularity exponents defined at each point in the signal domain. The MMF allows nonlinear analysis of complex dynamics and, in particular, characterizes the intermittent signature. We study the validity of the MMF for the speech signal and show that singularity exponents indeed convey valuable information about its local dynamics. We define an accumulative measure on the exponents which reveals phoneme boundaries as the breaking points of a piecewise linear-like curve. We then develop a simple automatic phonetic segmentation algorithm using piecewise linear curve fitting. We present experiments on the full TIMIT database. The results show that our algorithm yields considerably better accuracy than recently published ones.
16:40 Phone Boundary Detection Using Sample-based Acoustic Parameters
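The idea of reading boundaries off the breaking points of an accumulated curve can be sketched as follows. This is only an illustration under stated assumptions, not the authors' algorithm: the input values stand in for singularity exponents (whose computation is not shown), the helper names `line_sse` and `best_knot` are hypothetical, and only a single breaking point is searched for by brute force.

```python
from itertools import accumulate

def line_sse(ys, start, stop):
    """Sum of squared residuals of a least-squares line fitted to ys[start:stop]."""
    xs = list(range(start, stop))
    seg = ys[start:stop]
    n = len(seg)
    mean_x = sum(xs) / n
    mean_y = sum(seg) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, seg))
    slope = sxy / sxx if sxx else 0.0
    return sum((y - (mean_y + slope * (x - mean_x))) ** 2 for x, y in zip(xs, seg))

def best_knot(ys, min_len=3):
    """Index k where two fitted lines sharing the point ys[k] give the lowest total error."""
    candidates = range(min_len - 1, len(ys) - min_len + 1)
    return min(
        candidates,
        key=lambda k: line_sse(ys, 0, k + 1) + line_sse(ys, k, len(ys)),
    )

# Placeholder "exponents": one constant level, then another (assumed data).
exponents = [1.0] * 10 + [3.0] * 10
curve = list(accumulate(exponents))  # accumulated measure, piecewise linear
knot = best_knot(curve)              # slope changes after this index
```

A full segmenter would need several knots per utterance; repeating this search recursively on each side of a found knot is one simple, assumed extension.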
You-Yu Lin (Institute of Communication, National Chiao Tung University, Hsinchu, Taiwan, ROC)
Yih-Ru Wang (Institute of Communication, National Chiao Tung University, Hsinchu, Taiwan, ROC)
Yuan-Fu Liao (Department of Electronic Engineering, National Taipei University of Technology, Taipei, Taiwan, ROC)
A sample-based phone boundary detection algorithm is proposed in this paper. Some sample-based acoustic parameters are first extracted in the proposed method, including six sub-band signal envelopes, sample-based KL distance and spectral entropy. Then, the sample-based KL distance is used for boundary candidates pre-selection. Last, a supervised neural network is employed for final boundary detection. Experimental results using the TIMIT speech corpus showed that EERs of 13.2% and 15.1% were achieved for the training and test data sets, respectively. Moreover, 43.5% and 88.2% of boundaries detected were within 80- and 240-sample error tolerance from manual labeling results at the EER operating point.
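Two of the parameters named in this abstract, spectral entropy and a KL distance, can be illustrated with a minimal sketch. The abstract does not give the exact definitions used in the paper, so the formulas below are the textbook ones (Shannon entropy of a normalized power spectrum, and the symmetric Kullback-Leibler divergence between two normalized spectra); the function names and the `eps` smoothing constant are assumptions.

```python
import math

def normalize(spectrum):
    """Scale a non-negative power spectrum so that it sums to 1."""
    total = sum(spectrum)
    if total <= 0:
        raise ValueError("spectrum must have positive total power")
    return [v / total for v in spectrum]

def spectral_entropy(power_spectrum):
    """Shannon entropy (in bits) of the normalized power spectrum."""
    p = normalize(power_spectrum)
    return -sum(v * math.log2(v) for v in p if v > 0)

def symmetric_kl(spec_a, spec_b, eps=1e-12):
    """Symmetric KL divergence KL(p||q) + KL(q||p) between two spectra.

    eps is an assumed smoothing constant that avoids log(0).
    """
    p = normalize([v + eps for v in spec_a])
    q = normalize([v + eps for v in spec_b])
    return sum((pi - qi) * math.log(pi / qi) for pi, qi in zip(p, q))

flat = spectral_entropy([1.0, 1.0, 1.0, 1.0])    # flat spectrum: maximal entropy
peaked = spectral_entropy([1.0, 0.0, 0.0, 0.0])  # single peak: zero entropy
```

A large divergence between the spectra on either side of a time point suggests a spectral change there, which is the general reason such a distance is useful for pre-selecting boundary candidates.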
17:00 HMM-based Automatic Visual Speech Segmentation Using Facial Data
Utpala Musti (Université Nancy 2, LORIA)
Asterios Toutios (Université Nancy 2, LORIA)
Slim Ouni (Université Nancy 2, LORIA)
Vincent Colotte (Université Henri Poincaré Nancy 1, LORIA)
Brigitte Wrobel-Dautcourt (Université Henri Poincaré Nancy 1, LORIA)
Marie-Odile Berger (INRIA, LORIA)
We describe automatic visual speech segmentation using facial data captured by a stereo-vision technique. The segmentation is performed using an HMM-based forced alignment mechanism widely used in automatic speech recognition. The idea is based on the assumption that using visual speech data alone for the training might capture the uniqueness in the facial component of speech articulation, asynchrony (time lags) in visual and acoustic speech segments and significant coarticulation effects. This should provide valuable information that helps to show the extent to which a phoneme may visually affect surrounding phonemes, and that is useful for labeling the visual speech segments based on dominant coarticulatory contexts.
17:20 Bayes Factor Based Speaker Segmentation for Speaker Diarization
David Wang (Queensland University of Technology)
Robert Vogt (Queensland University of Technology)
Sridha Sridharan (Queensland University of Technology)
This paper proposes the use of the Bayes Factor as a distance metric for speaker segmentation within a speaker diarization system. The proposed approach uses a pair of constant-sized sliding windows to compute the value of the Bayes Factor between the adjacent windows over the entire audio. Results obtained on the 2002 Rich Transcription Evaluation dataset show an improved segmentation performance compared to previous approaches reported in the literature using the Generalized Likelihood Ratio. When applied in a speaker diarization system, this approach results in a 5.1% relative improvement in the overall Diarization Error Rate compared to the baseline.
17:40 Using High-level Information to Detect Key Audio Events in a Tennis Game
Qiang Huang (University of East Anglia)
Stephen Cox (University of East Anglia)
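The sliding-window scheme can be sketched with the Generalized Likelihood Ratio that the abstract names as the comparison point. This is not the Bayes Factor metric of the paper, whose details the abstract does not give. The sketch assumes one-dimensional features and a single Gaussian per window; the function names, the variance floor and the toy data are all hypothetical.

```python
import math

def _variance(xs, floor=1e-8):
    """Maximum-likelihood variance with an assumed floor to keep the log finite."""
    mean = sum(xs) / len(xs)
    return max(sum((x - mean) ** 2 for x in xs) / len(xs), floor)

def glr_distance(left, right):
    """Log GLR between 'one Gaussian for both windows' and 'one Gaussian each'."""
    both = left + right
    return 0.5 * (
        len(both) * math.log(_variance(both))
        - len(left) * math.log(_variance(left))
        - len(right) * math.log(_variance(right))
    )

def glr_curve(features, win):
    """Distance between adjacent windows of size win at every valid position t."""
    return [
        glr_distance(features[t - win:t], features[t:t + win])
        for t in range(win, len(features) - win + 1)
    ]

def pick_change_point(features, win):
    """Position t (index of the first sample after the change) with the largest distance."""
    curve = glr_curve(features, win)
    best = max(range(len(curve)), key=curve.__getitem__)
    return best + win

# Toy one-dimensional features: "speaker A" near 0, "speaker B" near 5.
speaker_a = [0.1 * ((i % 3) - 1) for i in range(30)]
speaker_b = [5.0 + 0.1 * ((i % 3) - 1) for i in range(30)]
change = pick_change_point(speaker_a + speaker_b, win=9)
```

A practical system would work on multivariate features such as cepstral vectors and would detect multiple peaks above a threshold instead of taking a single maximum.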
This paper describes how the detection of key audio events in a sports game (tennis) can be enhanced by the use of high-level information. High-level features are able to provide useful constraints on the detection procedure, and thus to improve detection performance. We define two types of event-based information: event dependency and inter-event timing. These respectively characterize the identity of the next event and the time at which the next event will occur. Probabilistic models of high-level constraints are developed, and then integrated into our event detection framework. We test this approach on audio tracks extracted from two different tennis games. The results show that significant improvements in both accuracy and computational efficiency are obtained when applying high-level information.
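One way to picture the two kinds of high-level information is a bigram model over event labels (event dependency) combined with a Gaussian over the gap between consecutive events (inter-event timing). The abstract does not specify the models used in the paper, so everything below is an assumed illustration: the event labels, the Gaussian timing model, the `floor` probability and the function names are hypothetical.

```python
import math
from collections import Counter, defaultdict

def train_dependency(sequences):
    """Estimate P(next event | previous event) from labelled event sequences."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for prev, nxt in zip(seq, seq[1:]):
            counts[prev][nxt] += 1
    return {
        prev: {event: c / sum(ctr.values()) for event, c in ctr.items()}
        for prev, ctr in counts.items()
    }

def log_gauss(x, mean, std):
    """Log density of a univariate Gaussian."""
    return -0.5 * math.log(2 * math.pi * std * std) - 0.5 * ((x - mean) / std) ** 2

def rescore(prev, gap, candidates, dependency, timing, floor=1e-6):
    """Combine positive acoustic scores with dependency and timing log-probabilities."""
    scores = {}
    for event, acoustic in candidates.items():
        p_dep = dependency.get(prev, {}).get(event, floor)
        if (prev, event) in timing:
            mean, std = timing[(prev, event)]
            log_time = log_gauss(gap, mean, std)
        else:
            log_time = math.log(floor)  # unseen pair: assumed small constant
        scores[event] = math.log(acoustic) + math.log(p_dep) + log_time
    return max(scores, key=scores.get), scores

# Illustrative training data and timing parameters (seconds); all assumed.
dependency = train_dependency([
    ["serve", "bounce", "hit", "bounce", "hit"],
    ["serve", "bounce", "hit"],
    ["serve", "hit"],
])
timing = {("serve", "bounce"): (0.6, 0.1), ("serve", "hit"): (1.2, 0.2)}
best, scores = rescore("serve", 0.6, {"hit": 0.5, "bounce": 0.4}, dependency, timing)
```

Here the acoustic score alone would favour "hit", while the added constraints favour "bounce", which is the kind of correction the abstract attributes to high-level information.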
