Thu-Ses1-P4:
Source localization and separation

This is the final program for this session. For oral sessions, the timing on the left is the current presentation order, but this may still change, so please check at the conference itself.
Time: Thursday 10:00 Place: International Conference Room D Type: Poster
Chair: Hiroshi G. Okuno
#1 Near Field Sound Source Localization Based on Cross-power Spectrum Phase Analysis with Multiple Channel Microphones
Kohei Hayashida (Graduate School of Science and Engineering, Ritsumeikan University)
Masanori Morise (College of Information and Science, Ritsumeikan University)
Takanobu Nishiura (College of Information and Science, Ritsumeikan University)
We study sound source localization in the near field. 2D-MUSIC has already been developed for this task, but its performance degrades in diffuse noisy environments. A near-field localization method based on cross-power spectrum phase (CSP) analysis has also been developed, but its localization accuracy depends on the accuracy of the time delay estimated between paired microphones. We propose 2D-CSP with multiple channel microphones for robust localization. In an evaluation experiment carried out in a conference room, we confirmed that the proposed method localizes a sound source more robustly than conventional methods.
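The time-delay estimate between a pair of microphones that the abstract refers to can be illustrated with a minimal sketch of conventional two-microphone CSP analysis (also known as GCC-PHAT). This is not the 2D-CSP method proposed in the paper. The function names `dft` and `csp_delay` are hypothetical, the delay is assumed to be a whole number of samples, and the naive transform is only suitable for short frames.

```python
import cmath
import math


def dft(x, inverse=False):
    """Naive O(N^2) discrete Fourier transform (adequate for short frames)."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        acc = 0j
        for t in range(n):
            acc += x[t] * cmath.exp(sign * 2j * math.pi * k * t / n)
        out.append(acc / n if inverse else acc)
    return out


def csp_delay(x1, x2):
    """Estimate by how many samples x2 lags x1 (negative if x2 leads)."""
    n = len(x1)
    spec1, spec2 = dft(x1), dft(x2)
    whitened = []
    for a, b in zip(spec1, spec2):
        cross = b * a.conjugate()          # cross-power spectrum
        mag = abs(cross)
        # Keep only the phase; skip bins with (almost) no energy.
        whitened.append(cross / mag if mag > 1e-12 else 0j)
    # CSP coefficients: inverse transform of the phase-only spectrum.
    coeffs = [c.real for c in dft(whitened, inverse=True)]
    peak = max(range(n), key=lambda i: coeffs[i])
    # Indices above n/2 correspond to negative (circular) lags.
    return peak if peak <= n // 2 else peak - n
```

Discarding the magnitude of the cross-power spectrum leaves a sharp peak at the inter-microphone lag, which is why the localization accuracy hinges on how reliably that peak can be found.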
#2 A Maximum a Posteriori Sound Source Localization in Reverberant and Noisy Conditions
Choi Jinho (KAIST)
Yoo Chang D. (KAIST)
In this paper, a maximum a posteriori sound source localization (MAP-SSL) algorithm is proposed for reverberant and noisy conditions. The algorithm is derived by incorporating a sparse prior on the source location into the existing maximum likelihood sound source localization (ML-SSL) framework. The direction of an active source is assumed to be sparse in the space of all possible finite source directions, so the criterion used to derive MAP-SSL is similar to the one behind the ML-SSL framework, except that a sparse source prior is added to enforce a sparse source-direction solution. The sparse source prior plays a key role in improving the SSL performance. The experimental results show that the proposed MAP-SSL algorithm outperforms the variants of the ML-SSL framework.
#3 Multichannel Source Separation Based on Source Location Cue with Log-Spectral Shaping by Hidden Markov Source Model
Tomohiro Nakatani (NTT Corporation)
Shoko Araki (NTT Corporation)
Takuya Yoshioka (NTT Corporation)
Masakiyo Fujimoto (NTT Corporation)
This paper proposes a multichannel source separation approach that exploits statistical characteristics of source location cues represented by inter-channel phase differences (IPD) and those of source log spectra represented by hidden Markov models (HMM). With this approach, source separation is achieved by iterating two simple sub-procedures, namely the clustering of the time-frequency (TF) bins into individual sources and the independent updating of the model parameters of each source. An advantage of this approach is that we can update the model parameters of each source independently of those of the other sources in each iteration, and thus the update can be computationally very efficient. We show by simulation experiments that the proposed method can greatly improve, in a computationally efficient manner, the quality of each source signal from sound mixtures in terms of cepstral distortion using a speaker-independent HMM composed of a very small number of states.
#4 A DOA Estimation algorithm based on Equalization-Cancellation Theory
Duc Chau (School of Information Science, Japan Advanced Institute of Science and Technology)
Junfeng Li (School of Information Science, Japan Advanced Institute of Science and Technology)
Akagi Masato (School of Information Science, Japan Advanced Institute of Science and Technology)
Direction of arrival (DOA) estimation plays an important role in binaural hearing systems. Recent methods usually require a large array of microphones or do not adapt to special conditions, e.g., a humanoid robot subject to the effect of the head-related transfer function. In this paper, we propose a two-microphone DOA estimation algorithm, namely EC-BEAM, which applies the equalization-cancellation (EC) model to DOA estimation through a beamforming-based technique. Specifically, the EC model is integrated into beamforming to remove the signal components from a given direction and yield the energy of the remaining signals from other directions. By searching over several DOA candidates, the true DOA is determined as the direction at which the energy of the remaining signals reaches its minimum. Experimental results showed that EC-BEAM not only adapts well to binaural hearing systems but also estimates the DOA of the target signal with high accuracy in various noise conditions with only two microphones.
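The candidate search described in this abstract can be sketched in a heavily simplified form. The sketch below is not EC-BEAM itself: it assumes a free-field, far-field pair of microphones with no head-related transfer function, whole-sample delays, and equal gain on both channels. All function names and parameters (`ec_residual_power`, `ec_delay_search`, `delay_to_angle_deg`, `mic_spacing`) are hypothetical.

```python
import math


def ec_residual_power(left, right, delay):
    """Mean power remaining after delaying `left` by `delay` samples
    (equalization) and subtracting it from `right` (cancellation)."""
    total, count = 0.0, 0
    for t in range(len(right)):
        s = t - delay
        if 0 <= s < len(left):
            diff = right[t] - left[s]
            total += diff * diff
            count += 1
    return total / count if count else float("inf")


def ec_delay_search(left, right, candidate_delays):
    """Pick the candidate delay whose cancellation leaves the least power."""
    return min(candidate_delays,
               key=lambda d: ec_residual_power(left, right, d))


def delay_to_angle_deg(delay, sample_rate, mic_spacing, speed_of_sound=343.0):
    """Assumed far-field mapping: sin(theta) = c * tau / spacing."""
    ratio = speed_of_sound * delay / (sample_rate * mic_spacing)
    # Clamp so that rounding or an impossible delay cannot break asin().
    return math.degrees(math.asin(max(-1.0, min(1.0, ratio))))
```

When the candidate matches the true inter-microphone delay, the target component cancels and only energy from other directions remains, so the residual power is smallest there.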
#5 Concurrent Speaker Localization using Multi-band Position-Pitch (M-PoPi) Algorithm with Spectro-Temporal Pre-Processing
Tania Habib (Signal Processing and Speech Communication Lab, Graz University of Technology, Austria)
Harald Romsdorfer (Signal Processing and Speech Communication Lab, Graz University of Technology, Austria)
Accurate, microphone-based speaker localization in real-world environments, like office spaces or meeting rooms, must be able to track a single speaker and multiple concurrent speakers in the presence of reverberations and background noise. Our Multiband Joint Position-Pitch (M-PoPi) algorithm for circular microphone arrays already shows a frame-wise localization estimation score of about 95% for tracking a single speaker in a noisy, reverberant setting. In this paper, we present two extensions of the M-PoPi algorithm to improve the localization estimation accuracy also for multiple concurrent speakers. These extensions are a weighted spectro-temporal fragment analysis as a pre-processing step for the M-PoPi algorithm and a particle filter-based tracking as a post-processing step. Experiments using real-world recordings of two concurrent speakers in a typically reverberant meeting room show an improvement of the frame-wise localization estimation score from 43% using the plain M-PoPi algorithm to 66% using the M-PoPi algorithm with both extensions.
#6 On Using Gaussian Mixture Model for Double-Talk Detection in Acoustic Echo Suppression
Ji-Hyun Song (Inha University, Korea)
Kyu-Ho Lee (Inha University, Korea)
Yun-Sik Park (Inha University, Korea)
Sang-Ick Kang (Inha University, Korea)
Joon-Hyuk Chang (Inha University, Korea)
In this paper, we propose a novel frequency-domain approach to double-talk detection based on the Gaussian mixture model. In contrast to a previous approach based on a simple and heuristic decision rule utilizing time-domain cross-correlations, GMM is applied to a set of feature vectors extracted from the frequency-domain cross-correlation coefficients. Performance of the proposed approach is evaluated through objective tests under various environments, and better results are obtained as compared to the time-domain method.
#7 Catalog-Based Single-Channel Speech-Music Separation
Cemil Demir (TUBİTAK-UEKAE)
Ali Taylan Cemgil (Computer Engineering Department, Bogazici University)
Murat Saraçlar (Electrical and Electronics Engineering Department, Bogazici University)
We propose a new catalog-based speech-music separation method for background music removal. Assuming that we know a catalog of the background music, we develop a generative model for the superposed speech and music spectrograms. We represent the speech spectrogram by a Non-negative Matrix Factorization (NMF) model and the music spectrogram by a conditional Poisson Mixture Model (PMM). By choosing the size of the catalog, i.e., the number of mixture components, we can trade off speed against accuracy. The combined hierarchical model leads to a mixture of multinomial distributions as the joint posterior of music and speech. Separation and hyper-parameter adaptation can be achieved via an Expectation Maximization algorithm. Experimental results show that the separation performance of the algorithm is promising. Furthermore, we show that incorporating prior information such as a volume adjustment parameter can enhance the separation performance.
#8 Unvoiced Speech Segregation Based on CASA and Spectral Subtraction
Ke Hu (Department of Computer Science and Engineering, The Ohio State University)
DeLiang Wang (Department of Computer Science and Engineering, The Ohio State University)
Unvoiced speech separation is an important and challenging problem that has not received much attention. We propose a CASA-based approach to segregate unvoiced speech from nonspeech interference. As unvoiced speech does not contain periodic signals, we first remove the periodic portions of a mixture, including voiced speech. With periodic components removed, the remaining interference becomes more stationary. We estimate the noise energy in unvoiced intervals on the basis of segregated voiced speech. Spectral subtraction is employed to extract time-frequency segments in unvoiced intervals, and we group the segments dominated by unvoiced speech by simple thresholding or Bayesian classification. Systematic evaluation and comparison show that the proposed method considerably improves the unvoiced speech segregation performance under various SNR conditions.
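The spectral subtraction and thresholding steps mentioned in this abstract can be illustrated per time-frequency unit. This is a generic sketch, not the authors' system: it works on individual units rather than segments, and the function names, the zero floor, and the default `threshold` of 1.0 (a local SNR of 0 dB) are assumptions made for illustration.

```python
def subtract_noise(mixture_energy, noise_energy):
    """Estimate speech energy per unit as mixture minus noise, floored at 0."""
    return [max(m - n, 0.0) for m, n in zip(mixture_energy, noise_energy)]


def speech_dominant_mask(mixture_energy, noise_energy, threshold=1.0):
    """Label a unit 1 when its estimated speech energy exceeds
    `threshold` times the estimated noise energy, else 0."""
    speech = subtract_noise(mixture_energy, noise_energy)
    return [1 if s > threshold * n else 0
            for s, n in zip(speech, noise_energy)]
```

For example, with mixture energies `[4.0, 1.5, 0.5]` and a flat noise estimate of `1.0`, only the first unit is kept as speech-dominant.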
#9 Unsupervised sequential organization for cochannel speech separation
Ke Hu (Department of Computer Science and Engineering, The Ohio State University)
DeLiang Wang (Department of Computer Science and Engineering, The Ohio State University)
The problem of sequential organization in the cochannel speech situation has previously been studied using speaker-model based methods. A major limitation of these methods is that they require the availability of pretrained speaker models and prior knowledge (or detection) of participating speakers. We propose an unsupervised clustering approach to cochannel speech sequential organization. Given enhanced cepstral features, we search for the optimal assignment of simultaneous speech streams by maximizing the between- and within-cluster scatter matrix ratio penalized by concurrent pitches within individual speakers. A genetic algorithm is employed to speed up the search. Our method does not require trained speaker models, and experiments with both ideal and estimated simultaneous streams show the proposed method outperforms a speaker-model based method in both speech segregation and computational efficiency.
