Tue-Ses1-O4:
Emotional Speech

This is the final program for this session. For oral sessions, the timing on the left is the current presentation order, but this may still change, so please check at the conference itself.
Time: Tuesday 10:00 Place: 302 Type: Oral
Chair: Laurence Devillers
10:00 Analysis of Excitation Source Information in Emotional Speech
S R M Prasanna (Indian Institute of Technology Guwahati)
D Govind (Indian Institute of Technology Guwahati)
The objective of this work is to analyze the effect of emotions on the excitation source of speech production. Five emotions are considered in the study: neutral, anger, happiness, boredom and fear. Initially, the electroglottogram (EGG) and its derivative signals are compared across the different emotions. The mean, standard deviation and contour of the instantaneous pitch and of the strength of excitation are derived by processing both the derivative of the EGG and the speech signal using the zero-frequency filtering (ZFF) approach. The comparative study of these features across emotions reveals that the effect of emotions on the excitation source is distinct and significant. The comparison of the parameters obtained from the derivative of the EGG and from the speech waveform indicates that both show the same trend and range, implying that either may be used. The computed parameters are found to be effective in the prosodic modification task.
10:20 Acoustic Feature Analysis in Speech Emotion Primitives Estimation
Dongrui Wu (University of Southern California)
Thomas Parsons (University of Southern California)
Shrikanth Narayanan (University of Southern California)
We recently proposed a family of robust linear and nonlinear estimation techniques for recognizing the three emotion primitives--valence, activation, and dominance--from speech. These were based on both local and global speech duration, energy, MFCC and pitch features. This paper aims to study the relative importance of these four categories of acoustic features in this emotion estimation context. Three measures are considered: the number of features from each category when all features are used in selection, the mean absolute error (MAE) when each category is used separately, and the MAE when a category is excluded from feature selection. We find that the relative importance is in the order of MFCC > Energy = Pitch > Duration. Additionally, estimator fusion almost always improves performance, and locally weighted fusion always outperforms average fusion regardless of the number of features used.
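As a rough illustration of the error measure and the simplest fusion scheme named in this abstract, the sketch below computes the mean absolute error (MAE) of two estimators and of their average fusion. The ratings, the estimator outputs and the function names are hypothetical toy values chosen so that the two estimators err in opposite directions; they are not taken from the paper, and the locally weighted fusion it reports is not reproduced here.

```python
def mean_absolute_error(predicted, reference):
    """Average of |prediction - reference| over all utterances."""
    if len(predicted) != len(reference):
        raise ValueError("sequences must have the same length")
    return sum(abs(p - r) for p, r in zip(predicted, reference)) / len(reference)


def average_fusion(estimates):
    """Fuse several estimators by averaging their outputs per utterance."""
    return [sum(values) / len(values) for values in zip(*estimates)]


# Hypothetical valence ratings and two hypothetical estimator outputs.
reference = [0.2, -0.4, 0.6, 0.0]
estimator_a = [0.4, -0.2, 0.4, 0.2]
estimator_b = [0.0, -0.6, 0.8, -0.2]

mae_a = mean_absolute_error(estimator_a, reference)
mae_b = mean_absolute_error(estimator_b, reference)
fused = average_fusion([estimator_a, estimator_b])
mae_fused = mean_absolute_error(fused, reference)

print(f"MAE A: {mae_a:.3f}, MAE B: {mae_b:.3f}, MAE fused: {mae_fused:.3f}")
```

Because the toy errors cancel exactly, the fused MAE is close to zero here; real estimators would cancel only partially.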
10:40 Spectro-Temporal Modulations for Robust Speech Emotion Recognition
Lan-Ying Yeh (Department of Electrical Engineering, National Chiao Tung University, Hsinchu, Taiwan 300, R.O.C.)
Tai-Shih Chi (Department of Electrical Engineering, National Chiao Tung University, Hsinchu, Taiwan 300, R.O.C.)
Speech emotion recognition is mostly studied on clean speech. In this paper, joint spectro-temporal features (RS features) are extracted from an auditory model and applied to detect the emotional state of noisy speech. The noisy speech is derived from the Berlin Emotional Speech database by adding white and babble noise at various SNR levels. The clean-train/noisy-test scenario is investigated to simulate conditions with unknown noise sources. The sequential forward floating selection (SFFS) method is adopted to demonstrate the redundancy of the RS features, and further dimensionality reduction is conducted. Compared to conventional MFCCs plus prosodic features, RS features show higher recognition rates, especially in low-SNR conditions.
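Constructing noisy test data at a chosen SNR can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' procedure: the sine wave stands in for a clean utterance, the noise is Gaussian white noise only (babble noise would need a recorded noise file), and `power` and `add_white_noise` are hypothetical helper names.

```python
import math
import random


def power(signal):
    """Mean squared amplitude of a signal."""
    return sum(x * x for x in signal) / len(signal)


def add_white_noise(clean, snr_db, rng):
    """Return (noisy, noise) with the noise scaled to the requested SNR in dB."""
    noise = [rng.gauss(0.0, 1.0) for _ in clean]
    # Scale the noise so that 10*log10(P_clean / P_noise) equals snr_db.
    target_noise_power = power(clean) / (10.0 ** (snr_db / 10.0))
    gain = math.sqrt(target_noise_power / power(noise))
    scaled = [gain * n for n in noise]
    noisy = [c + n for c, n in zip(clean, scaled)]
    return noisy, scaled


rng = random.Random(0)  # fixed seed so the sketch is repeatable
# Stand-in for a clean utterance: 0.1 s of a 200 Hz tone at 16 kHz.
clean = [math.sin(2 * math.pi * 200 * t / 16000) for t in range(1600)]
noisy, noise = add_white_noise(clean, snr_db=5.0, rng=rng)

measured_snr_db = 10.0 * math.log10(power(clean) / power(noise))
print(f"measured SNR: {measured_snr_db:.2f} dB")
```

Scaling the noise rather than the speech keeps the clean signal level unchanged across SNR conditions, which is one common convention.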
11:00 Quantification of Prosodic Entrainment in Affective Spontaneous Spoken Interactions of Married Couples
Chi-Chun Lee (Signal Analysis and Interpretation Laboratory (SAIL), University of Southern California, Los Angeles, California, USA)
Matthew Black (Signal Analysis and Interpretation Laboratory (SAIL), University of Southern California, Los Angeles, California, USA)
Athanasios Katsamanis (Signal Analysis and Interpretation Laboratory (SAIL), University of Southern California, Los Angeles, California, USA)
Adam Lammert (Signal Analysis and Interpretation Laboratory (SAIL), University of Southern California, Los Angeles, California, USA)
Brian Baucom (Department of Psychology, University of Southern California, Los Angeles, CA, USA)
Andrew Christensen (Department of Psychology, University of California, Los Angeles, Los Angeles, CA, USA)
Panayiotis G. Georgiou (Signal Analysis and Interpretation Laboratory (SAIL), University of Southern California, Los Angeles, California, USA)
Shrikanth Narayanan (Signal Analysis and Interpretation Laboratory (SAIL) and Department of Psychology, University of Southern California, Los Angeles, California, USA)
Interaction synchrony among interlocutors happens naturally as people gradually adapt their speaking style to promote efficient communication. In this work, we quantify one aspect of interaction synchrony, prosodic entrainment (specifically of pitch and energy), in married couples' problem-solving interactions using measures derived from the speech signal. Statistical tests demonstrate that some of these measures capture useful information: they show higher values in interactions of couples with a highly positive attitude than in those of couples with a highly negative attitude. Further, by using quantized entrainment measures with statistical symbol sequence matching in a maximum likelihood framework, we obtained 76% accuracy in predicting positive vs. negative affect.
11:20 A Cluster-Profile Representation of Emotion Using Agglomerative Hierarchical Clustering
Emily Mower (University of Southern California)
Kyu Han (University of Southern California)
Sungbok Lee (University of Southern California)
Shrikanth Narayanan (University of Southern California)
The proper representation of emotion is critical in classification systems. In previous research, we demonstrated that emotion profile (EP) based representations are effective for this task. In EP-based representations, emotions are expressed in terms of underlying affective components from the subset of anger, happiness, neutrality, and sadness. The current study explores cluster profiles (CP), an alternate profile representation in which the components are no longer semantic labels, but clusters inherent in the feature space. This unsupervised clustering of the feature space permits the application of a system-level semi-supervised learning paradigm. The results demonstrate that CPs are similarly discriminative to EPs (EP classification accuracy: 68.37% vs. 69.25% for the CP-based classification). This suggests that exhaustive labeling of a representative training corpus may not be necessary for emotion classification tasks.
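The clustering step named in the title can be sketched with a small bottom-up (agglomerative) procedure. This is an illustrative sketch only: the 2-D points are hypothetical stand-ins for acoustic feature vectors, average linkage with Euclidean distance is an assumed choice, and how the paper turns clusters into cluster profiles is not reproduced here.

```python
import math


def average_linkage(a, b, points):
    """Mean Euclidean distance between all pairs drawn from clusters a and b."""
    total = sum(math.dist(points[i], points[j]) for i in a for j in b)
    return total / (len(a) * len(b))


def agglomerate(points, n_clusters):
    """Merge the two closest clusters repeatedly until n_clusters remain."""
    clusters = [[i] for i in range(len(points))]  # start with singletons
    while len(clusters) > n_clusters:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                d = average_linkage(clusters[x], clusters[y], points)
                if best is None or d < best[0]:
                    best = (d, x, y)
        _, x, y = best
        clusters[x] = clusters[x] + clusters[y]  # merge y into x
        del clusters[y]
    return clusters


# Hypothetical feature vectors: two tight groups and one outlier.
points = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.3), (5.0, 5.1), (5.2, 4.9), (9.0, 0.0)]
clusters = agglomerate(points, 3)
print(sorted(sorted(c) for c in clusters))
```

No emotion labels are used at any point, which is what makes the clustering unsupervised.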
11:40 Incremental Acoustic Valence Recognition: an Inter-Corpus Perspective on Features, Matching, and Performance in a Gating Paradigm
Bjoern Schuller (CNRS-LIMSI)
Laurence Devillers (CNRS-LIMSI)
It is not fully known how long it takes a human to reliably recognize emotion in speech from the beginning of a phrase. However, many technical applications demand very quick system responses, e.g. to prepare different feedback alternatives before the end of a speaker turn in a dialog system. We therefore investigate this ‘gating paradigm’ employing two spoken language resources in a cross-corpus and combined manner with a focus on valence: we determine how quickly a reliable estimate can be obtained and whether matching by models trained on the same length of speech prevails. In addition, we analyze how individual feature groups, by type and derived functionals, respond, and find considerably different behavior. The language resources have been chosen to cover both manually segmented and automatically segmented speech. As a result, one second of speech is found to be sufficient on the datasets considered.
