Wed-Ses2-P4:
Detection, classification, and segmentation

This is the final program for this session. For oral sessions, the timing on the left is the current presentation order, but this may still change, so please check at the conference itself.
Time: Wednesday 13:30 Place: International Conference Room D Type: Poster
Chair: Giuseppe Riccardi
#1 Audio-based Sports Highlight Detection by Fourier Local Auto-Correlations
Jiaxing Ye (Department of Computer Science, University of Tsukuba, Japan)
Takumi Kobayashi (National Institute of Advanced Industrial Science and Technology)
Tetsuya Higuchi (National Institute of Advanced Industrial Science and Technology)
In this paper, we present a novel methodology for sports highlight detection based on audio information. For processing the sounds of sports events, we propose a time-frequency feature extraction method that computes local auto-correlations on complex Fourier values (FLAC). For highlight detection, we apply the (complex) subspace method to the extracted FLAC features to detect the “exciting” scenes which occur sparsely in a background of “ordinary” periods. As an unsupervised learning algorithm, the subspace method has the advantage that neither prior knowledge nor expensive computation is required. To evaluate the proposed method, we conducted experiments on a soccer match. The experimental results show the effectiveness of the proposed approach, including robustness to environmental noise, low computational burden and promising performance.
#2 Automatic Excitement-Level Detection for Sports Highlights Generation
Hynek Boril (Center for Robust Speech Systems (CRSS), University of Texas at Dallas (UTD))
Abhijeet Sangwan (Center for Robust Speech Systems (CRSS), University of Texas at Dallas (UTD))
Taufiq Hasan (Center for Robust Speech Systems (CRSS), University of Texas at Dallas (UTD))
John Hansen (Center for Robust Speech Systems (CRSS), University of Texas at Dallas (UTD))
The problem of automatic excitement detection in baseball videos is considered and applied to highlights generation. This paper focuses on detecting exciting events in the video using complementary information from the audio and video domains. First, a new measure for non-stationarity which is extremely effective in separating background from speech is proposed. This new feature is employed in an unsupervised GMM-based segmentation algorithm that identifies the commentators' speech in the crowd background. Thereafter, the “level-of-excitement” is measured using features such as pitch, F1-F3 center frequencies, and spectral center of gravity extracted from the commentators' speech. Our experiments show that these features are well correlated with human assessment of excitability. Furthermore, slow-motion replay and pitching scenes from the video are also detected to estimate scene end-points. Finally, audio/video information is fused to rank-order scenes by “excitability” and generate highlights of user-defined time-lengths. The techniques described in this paper are generic and applicable to a variety of domains.
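The final step, ranking scenes by excitability and filling a user-defined highlight length, could be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the `Scene` fields, the greedy selection policy, and the example scores are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class Scene:
    start: float         # scene start time in seconds
    end: float           # scene end time in seconds
    excitability: float  # fused audio/video score (hypothetical scale)

    @property
    def duration(self) -> float:
        return self.end - self.start


def select_highlights(scenes, max_seconds):
    """Greedily pick the most exciting scenes that fit in the time budget."""
    chosen = []
    remaining = max_seconds
    # Visit scenes from most to least exciting.
    for scene in sorted(scenes, key=lambda s: s.excitability, reverse=True):
        if scene.duration <= remaining:
            chosen.append(scene)
            remaining -= scene.duration
    # Return the highlights in chronological order for playback.
    return sorted(chosen, key=lambda s: s.start)


# Hypothetical scenes with made-up scores.
scenes = [
    Scene(0.0, 30.0, 0.2),
    Scene(100.0, 120.0, 0.9),
    Scene(300.0, 340.0, 0.7),
    Scene(500.0, 515.0, 0.5),
]
highlights = select_highlights(scenes, max_seconds=40.0)
```

With a 40-second budget, the 20-second top-ranked scene is taken first, the 40-second scene no longer fits, and the 15-second scene fills most of the remaining time.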
#3 Detecting novel objects in acoustic scenes through classifier incongruence
Jörg-Hendrik Bach (University of Oldenburg)
Jörn Anemüller (University of Oldenburg)
In this study, a new generic framework for the detection and interpretation of disagreement (“incongruence”) between different classifiers [15] is applied to the problem of detecting novel acoustic objects in an office environment. Using a general model that detects generic acoustic objects (standing out from a stationary background) and specific models tuned to particular sounds expected in the office, a novel object is detected as an incongruence between the models: the general model detects it as a generic object, but the specific models cannot identify it as any of the known office-related sources. The detectors are realized using amplitude modulation spectrogram and RASTA-PLP features with support vector machine classification. Data considered are speech and non-speech sounds embedded in real office background at signal-to-noise ratios (SNR) from +20 dB to -20 dB. Our approach yields approximately 90% hit rate for novel events at 20 dB SNR, 75% at 0 dB and reaches chance level below -10 dB.
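The incongruence rule described in this abstract can be illustrated with a small sketch. It assumes each detector outputs a score in [0, 1] and that a single shared threshold decides whether a detector fires; the function name, the class names, and the threshold are hypothetical and do not come from the paper.

```python
def detect_novel_object(general_score, specific_scores, threshold=0.5):
    """Return True when the general detector fires but no specific model does.

    general_score: score of the generic acoustic-object detector.
    specific_scores: mapping from known sound class to its detector score.
    """
    general_fires = general_score >= threshold
    any_specific_fires = any(s >= threshold for s in specific_scores.values())
    # Incongruence: "something is there" without "it is a known source".
    return general_fires and not any_specific_fires


# A known sound: the general model and the "speech" model both fire.
known = detect_novel_object(0.9, {"speech": 0.8, "keyboard": 0.1, "door": 0.2})

# A novel sound: the general model fires, all specific models stay silent.
novel = detect_novel_object(0.9, {"speech": 0.2, "keyboard": 0.1, "door": 0.3})

# Background only: nothing fires, so nothing is reported as novel.
background = detect_novel_object(0.1, {"speech": 0.1, "keyboard": 0.0, "door": 0.1})
```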
#4 A Multidomain Approach for Automatic Home Environmental Sound Classification
Stavros Ntalampiras (University of Patras)
Ilyas Potamitis (Technological Educational Institute of Crete)
Nikos Fakotakis (University of Patras)
This article presents a multidomain approach which addresses the problem of automatic home environmental sound recognition. The proposed system will be part of a human activity monitoring system which will be based on heterogeneous sensors. This work concerns the audio classification component, whose primary role is to detect anomalous sound events. We compare the discriminative capabilities of three feature sets (MFCC, MPEG-7 low-level descriptors and a novel set based on wavelet packets) with respect to the classification of ten sound classes. These are combined with state-of-the-art generative techniques (GMM and HMM) for estimating the density function of each class. The highest average recognition rate is 95.7% and is achieved by the vector formed by all the feature sets juxtaposed.
#5 Content-Based Advertisement Detection
Patrick Cardinal (CRIM)
Vishwa Gupta (CRIM)
Gilles Boulianne (CRIM)
Television advertising is widely used by companies to promote their products among the public, but it is hard for an advertiser to know whether its advertisements are broadcast as they should be. For this reason, some companies specialize in monitoring audio/video streams to validate that ads are broadcast according to what was requested and paid for by the advertiser. The procedure for searching for specific ads in an audio stream is very similar to the copy detection task, for which we have developed very efficient algorithms. This work reports results of applying our copy detection algorithms to the advertisement detection task. Compared to commercial software, we detected 18% more advertisements and the system runs at 0.003x of real-time.
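To put the reported speed in perspective, the short calculation below converts a real-time factor into processing time. It assumes that "0.003x of real-time" means processing time equals 0.003 times the audio duration, which is the usual reading of a real-time factor; the one-day recording is an illustrative choice, not a figure from the paper.

```python
def processing_seconds(audio_seconds, real_time_factor):
    """Time needed to process a recording at the given real-time factor."""
    return audio_seconds * real_time_factor


one_day_of_audio = 24 * 60 * 60  # 86,400 seconds of broadcast
seconds_needed = processing_seconds(one_day_of_audio, 0.003)
print(f"{seconds_needed:.1f} s")  # roughly 259 s, a little over four minutes
```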
#6 Identification of Abnormal Audio Events Based on Probabilistic Novelty Detection
Stavros Ntalampiras (University of Patras)
Ilyas Potamitis (Technological Educational Institute of Crete)
Nikos Fakotakis (University of Patras)
This paper applies novelty detection to the identification of hazardous situations. The proposed system elaborates on the audio part of the PROMETHEUS database, which includes heterogeneous recordings and was captured under real-world conditions. Three types of environments were used: smart-home, indoor public space and outdoor public space. The multidomain set of descriptors was formed by the following features: MFCCs, MPEG-7 descriptors, Teager energy operator parameters and wavelet packets. We report detection results using three types of probabilistic novelty detection algorithms: universal GMM, universal HMM and GMM clustering. We conclude that the results are encouraging and demonstrate the superiority of the novelty detection approach over the classification one.
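The general idea of probabilistic novelty detection, modelling only normal data and flagging inputs that the model finds unlikely, can be sketched as below. Note the simplifications: a single one-dimensional Gaussian stands in for the paper's universal GMM/HMM models, and the threshold rule (the least likely training sample) and the sample values are assumptions made for illustration.

```python
import math
from statistics import NormalDist


def fit_normal_model(training_values):
    """Fit a Gaussian to normal-condition data and derive a threshold."""
    model = NormalDist.from_samples(training_values)
    # Use the least likely training sample as the acceptance threshold.
    threshold = min(math.log(model.pdf(v)) for v in training_values)
    return model, threshold


def is_novel(model, threshold, value):
    """Flag a value whose log-likelihood falls below the threshold."""
    density = model.pdf(value)
    # Guard against log(0) for values extremely far from the training data.
    return density == 0.0 or math.log(density) < threshold


# Hypothetical one-dimensional feature measured under normal conditions.
normal_data = [0.9, 1.1, 1.0, 1.2, 0.8, 1.05, 0.95]
model, threshold = fit_normal_model(normal_data)

typical = is_novel(model, threshold, 1.0)   # close to the training mean
abnormal = is_novel(model, threshold, 3.0)  # far outside the training range
```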
#7 Lightly supervised recognition for automatic alignment of large coherent speech recordings
Norbert Braunschweiler (Toshiba Research Europe Ltd., Cambridge Research Laboratory, United Kingdom)
Mark J.F. Gales (Toshiba Research Europe Ltd., Cambridge Research Laboratory, United Kingdom)
Sabine Buchholz (Toshiba Research Europe Ltd., Cambridge Research Laboratory, United Kingdom)
Large quantities of audio data with associated text such as audiobooks are nowadays available. These data are attractive for a range of research areas as they include features that go beyond the level of single sentences. The proposed approach allows high quality transcriptions and associated alignments of this form of data to be automatically generated. It combines information from lightly supervised recognition and the original text to yield the final transcription. The scheme is fully automatic and has been successfully applied to a number of audiobooks. Performance measurements show low word/sentence error rates as well as high sentence boundary accuracy.
#8 Incremental Diarization of Telephone Conversations
Oshry Ben-Harush (Department of Electrical and Computers Engineering Ben-Gurion University of the Negev, Beer-Sheva, Israel)
Itshak Lapidot (Department of Electrical and Electronics Engineering Sami Shamoon College of Engineering, Ashdod, Israel)
Hugo Guterman (Department of Electrical and Computers Engineering Ben-Gurion University of the Negev, Beer-Sheva, Israel)
Speaker diarization systems attempt segmentation and labeling of a conversation between R speakers, while no prior information is given regarding the conversation. Most state-of-the-art diarization systems require the full body of the conversation data prior to the application of some diarization approach. However, for some applications such as forensics, which handle vast amounts of data, on-line or incremental diarization is of high importance. For that purpose, a two-stage algorithm for incremental diarization of telephone conversations is suggested. In the first stage, a fully unsupervised diarization algorithm is applied over an initial training segment from the conversation. The second stage is composed of time-series clustering of increments of the conversation. Applying incremental diarization over 1802 telephone conversations from NIST 2005 SER generated an increase in diarization error of approximately 2% compared to the diarization error of an off-line diarization system.
#9 Audio analytics by template modeling and 1-pass DP based decoding
Srikanth Cherla (Siemens Corporate Research & Technologies - India)
V Ramasubramanian (Siemens Corporate Research & Technologies - India)
We propose a novel technique for audio analytics and audio indexing using template based modeling of audio classes set in a one-pass dynamic programming continuous decoding framework. We propose the use of concatenation costs in the one-pass DP recursions to reduce so-called incursion errors; we also propose selection of variable length templates for modeling indefinite duration audio classes using the segmental K-means (SKM) algorithm. Based on detailed decoding results with long audio streams, we demonstrate the effectiveness of template based modeling, SKM based template selection, 1-pass DP based decoding and the use of concatenation constraints therein. We show that an average (%Hit, %False-alarm) of (66%, 4.9%) is achievable with the proposed decoding technique.
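The role of a concatenation cost in one-pass DP decoding can be illustrated with a heavily simplified sketch. Here precomputed per-frame class costs replace the paper's template matching, and a single `switch_cost` penalizes every change of class between consecutive frames. The reading of "incursion errors" as brief spurious class intrusions, the cost values, and all names are assumptions.

```python
def decode(local_costs, switch_cost):
    """One-pass DP over frames; returns the minimum-cost class label sequence.

    local_costs[t][c] is the cost of assigning frame t to class c.
    switch_cost is added whenever consecutive frames change class.
    """
    n_classes = len(local_costs[0])
    best = list(local_costs[0])  # best[c]: cheapest path ending in class c
    backpointers = []
    for frame in local_costs[1:]:
        new_best, pointers = [], []
        for c in range(n_classes):
            # Staying in class c is free; arriving from another class pays switch_cost.
            candidates = [best[p] + (0.0 if p == c else switch_cost)
                          for p in range(n_classes)]
            prev = min(range(n_classes), key=candidates.__getitem__)
            new_best.append(candidates[prev] + frame[c])
            pointers.append(prev)
        best = new_best
        backpointers.append(pointers)
    # Trace back from the cheapest final class.
    label = min(range(n_classes), key=best.__getitem__)
    path = [label]
    for pointers in reversed(backpointers):
        label = pointers[label]
        path.append(label)
    return path[::-1]


# Two classes, seven frames; frame 2 is a noisy frame that briefly favors class 1.
costs = [[0.1, 1.0], [0.2, 1.0], [0.9, 0.6], [0.1, 1.0],
         [1.0, 0.1], [1.0, 0.2], [1.0, 0.1]]
framewise = decode(costs, switch_cost=0.0)  # no penalty: follows each frame's minimum
smoothed = decode(costs, switch_cost=1.0)   # penalty suppresses the one-frame intrusion
```

Without the penalty the decoder switches class for the single noisy frame; with it, the path stays in class 0 until the sustained change at frame 4.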
#10 Perceptual Wavelet Decomposition for Speech Segmentation
Mariusz Ziolko (Department of Electronics, AGH University of Science and Technology, Krakow)
Jakub Galka (Department of Electronics, AGH University of Science and Technology, Krakow)
Bartosz Ziolko (Department of Electronics, AGH University of Science and Technology, Krakow)
Tomasz Drwiega (Faculty of Applied Mathematics, AGH University of Science and Technology, Krakow)
A non-uniform speech segmentation method based on wavelet packet transform is used for the localisation of phoneme boundaries. Eleven subbands are chosen by applying the mean best basis algorithm. Perceptual scale is used for decomposition of speech via Meyer wavelet in the wavelet packet structure. A real valued vector representing the digital speech signal is decomposed into phone-like units by placing segment borders according to the result of the multiresolution analysis. The final decision on localisation of the boundaries is made by analysis of the energy flows among the decomposition levels.
#11 A comparative study of constrained and unconstrained approaches for segmentation of speech signal
Venkatesh Keri (International Institute of Information Technology, Hyderabad, India.)
Kishore Prahallad (International Institute of Information Technology, Hyderabad, India.)
In this work, we compare different approaches for speech segmentation, of which some are constrained and the remaining are unconstrained by phone transcript. High-accuracy speech segmentation can be obtained by approaches constrained by phone transcript, such as HMM forced-alignment, when the exact phone transcript is known. But such approaches have to make do with the canonical phone transcript, as the exact phone transcript is difficult to obtain. Our experiments on the TIMIT corpus demonstrate that ANN and HMM phone-loop based unconstrained approaches perform better than the HMM forced-alignment based approach constrained by the canonical phone transcript. Finally, a detailed error analysis of these approaches is reported.
#12 Automatic discriminative measurement of voice onset time
Morgan Sonderegger (University of Chicago)
Joseph Keshet (Toyota Technological Institute at Chicago)
We describe a discriminative algorithm for automatic VOT measurement, considered as an application of predicting structured output from speech. In contrast to previous studies which use customized rules, in our approach a function is trained on manually labeled examples, using an online algorithm to predict the burst and voicing onsets (and hence VOT). The feature set used is customized for detecting the burst and voicing onsets, and the loss function used in training is the difference between predicted and actual VOT. Applied to initial voiceless stops from two corpora, the algorithm compares favorably to previous work, and the agreement between automatic and manual measurements is near human inter-judge reliability.
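The quantities involved in this loss can be made concrete with a short sketch. VOT is taken here as the voicing onset time minus the burst onset time, and the loss as the absolute difference between predicted and manually labeled VOT. The use of the absolute value, the millisecond units, and the example onset times are assumptions; the abstract only states that the loss is the difference between predicted and actual VOT.

```python
def vot_ms(burst_onset_ms, voicing_onset_ms):
    """Voice onset time: interval from the burst to the onset of voicing."""
    return voicing_onset_ms - burst_onset_ms


def vot_loss(predicted, actual):
    """Absolute VOT error; each argument is a (burst_ms, voicing_ms) pair."""
    return abs(vot_ms(*predicted) - vot_ms(*actual))


# Hypothetical onsets in milliseconds from the start of the recording.
predicted = (102, 160)  # predicted VOT = 58 ms
actual = (100, 165)     # labeled VOT   = 65 ms
loss = vot_loss(predicted, actual)  # 7 ms

# Shifting both predicted onsets by the same amount leaves the VOT,
# and therefore this loss, unchanged.
shifted_loss = vot_loss((110, 175), actual)  # 0 ms
```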
#13 Selective Gammatone Filterbank Feature for Robust Sound Event Recognition
Yiren Leng (Institute for Infocomm Research, A*STAR, Singapore)
Huy Dat Tran (Institute for Infocomm Research, A*STAR, Singapore)
Norihide Kitaoka (Nagoya University, Japan)
Haizhou Li (Institute for Infocomm Research, A*STAR, Singapore)
This paper introduces a novel feature based on the raw output of the gammatone filterbank. Channel selection is used to enhance robustness over a range of signal-to-noise ratios (SNR) of additive noise. The recognition accuracy of the proposed feature is tested on a sound event database using a Hidden Markov Model (HMM) recogniser. A comparison with a series of similar features and the conventional Mel-Frequency Cepstral Coefficients (MFCC) shows that the proposed feature offers significant improvement in low SNR conditions.
