| Time: | Monday 16:00 | Place: | International Conference Room A | Type: | Poster |
| Chair: | Kikuo Maekawa | ||||
| #1 | Rhythm and Formant Features for Automatic Alcohol Detection |
| (Bavarian Archive for Speech Signals, Ludwig-Maximilians-Universitaet Muenchen) (Bavarian Archive for Speech Signals, Ludwig-Maximilians-Universitaet Muenchen) (Bavarian Archive for Speech Signals, Ludwig-Maximilians-Universitaet Muenchen) | |
| Two speech feature sets, RMS rhythmicity and formant frequencies F1-F4, are analyzed for their ability to distinguish alcoholized from sober speech. We describe the statistical framework based on the Alcohol Language Corpus (ALC), including other factors such as gender, age and speaking style, and its application to our case. Rhythm features are calculated using a new method based solely on the short-time energy function; formant features are derived using the standard formant tracker SNACK. Our findings indicate that 3 rhythm and 3 formant features have a high potential to detect intoxication within the speech data of a subject. We also tested the hypothesis that vowels are more centralized in the F1/F2 space for alcoholized speech, but found that, on the contrary, subjects tend to hyper-articulate when being tested for intoxication. | |
| #2 | An exploration of voice source correlates of focus |
| (Trinity College Dublin) (Trinity College Dublin) (Trinity College Dublin) (Trinity College Dublin) | |
| This pilot study explores how the voice source parameters vary in focally accented syllables. It examines the dynamics of the voice source parameters in an all-voiced short declarative utterance in which the focus placement was varied. The voice source parameters F0, EE, UP, OQ, RG, RA, RK and RD were obtained through inverse filtering and subsequent parameterisation using the LF-model. The results suggest that the focally accented syllables are marked not only by increased F0 but also by boosted EE, RG and UP. The non-focal realisations show reduced values for the above parameters along with a tendency towards higher OQ values, suggesting a more lax mode of phonation. | |
| #3 | Modeling perceived vocal age in American English |
| (University of Florida) (University of Florida) (University of Florida) | |
| An acoustic analysis of voice, articulatory, and prosodic cues to perceived age was completed for a speech database of 150 American English speakers. Perceived ages were submitted to multiple linear regression analyses with measures of acoustic correlates of: voice quality, articulation, fundamental frequency, and prosody. The fit between predicted and actual perceived ages from the resulting models varied by speech material and gender, with female vocal ages being the easiest to predict. Articulation, pitch, and speaking rate measures were the most predictive in female voices, while, for male voices, the observed ranking was: speaking rate, voice quality, and pitch. | |
| #4 | Multivariate Analysis of Vocal Fatigue in Continuous Reading |
| (Paris Descartes University - LIPADE) (Paris Sorbonne University - STIH) | |
| We present an experimental paradigm to measure changes in characteristics of speech under vocal fatigue. For speech corpora, we have chosen a vocal load (3 hours) and a cognitive process (reading aloud continuously) that can induce some fatigue of the reader. Fatigue is verified using an analysis of reading errors and disfluencies. A multivariate analysis of 169,042 phoneme occurrences, based on Wilks' lambda test, is used to analyze spectral and prosodic changes in each phonetic class. Based on six readers, the results show that nasals (vowels and consonants) are the most discriminant phonemes in vocal fatigue. | |
| #5 | Frequency-Domain Delexicalization using Surrogate Vowels |
| (Center for Spoken Language Understanding, Oregon Health & Science University, Portland, Oregon, USA) (Center for Spoken Language Understanding, Oregon Health & Science University, Portland, Oregon, USA) | |
| We propose a delexicalization algorithm that renders the lexical content of an utterance unintelligible, while preserving important acoustic prosodic cues, as well as naturalness and speaker identity. This is achieved by replacing voiced regions by spectral slices from a surrogate vowel, and by averaging the magnitude spectrum during unvoiced regions. Perceptual tests were carried out comparing sentences that were either unprocessed or delexicalized, using a baseline or the proposed method. An intelligibility test resulted in a keyword recall rate of 92% for the unprocessed sentences, and near complete unintelligibility for both delexicalization methods. Affect recognition was at 65% for unprocessed sentences, and 46% and 49% for the baseline and the proposed method, respectively. Preference tests showed that the proposed method preserved drastically more speaker identity, and sounded more natural than the baseline. | |
| #6 | Emotion Recognition using Imperfect Speech Recognition |
| (Carnegie Mellon University) (Friedrich-Alexander-Universitaet Erlangen-Nuernberg) (Technische Universitaet Muenchen) (Technische Universitaet Berlin) (Technische Universitaet Muenchen) (Friedrich-Alexander-Universitaet Erlangen-Nuernberg) | |
| This paper investigates the use of speech-to-text methods for assigning an emotion class to a given speech utterance. Previous work shows that an emotion extracted from text can convey complementary evidence to the information extracted by classifiers based on spectral, or other non-linguistic features. As speech-to-text usually requires significantly more computational effort, in this study we investigate the degree of speech-to-text accuracy needed for reliable detection of emotions from an automatically generated transcription of an utterance. We evaluate the use of hypotheses in both training and testing, and compare several classification approaches on the same task. Our results show that emotion recognition performance stays roughly constant as long as word accuracy does not fall below a reasonable value, making the use of speech-to-text viable for training of emotion classifiers based on linguistics. | |
| #7 | A Novel Feature Extraction Strategy for Multi-stream Robust Emotion Identification |
| (CRSS: Center for Robust Speech Systems,Erik Jonsson School of Engineering and Computer Science,University of Texas at Dallas) (CRSS: Center for Robust Speech Systems,Erik Jonsson School of Engineering and Computer Science,University of Texas at Dallas) (CRSS: Center for Robust Speech Systems,Erik Jonsson School of Engineering and Computer Science,University of Texas at Dallas) | |
| In this study, we investigate an effective feature extraction front-end for improved emotion identification from speech in clean and noisy conditions. First, we explore the application of the PMVDR feature for emotion characterization. Originally developed for accent/dialect and language identification (LID), PMVDR features are less sensitive to noise. Also developed for LID, the shifted delta cepstral (SDC) approach can be used as a means of incorporating additional temporal information about the speech into the feature vectors. As is already known, suprasegmental characteristics, such as pitch and intensity, can provide beneficial information for emotion recognition, and we believe that improvement can be obtained from improved features. We performed an evaluation on the Berlin database of emotional speech. The proposed system, PMVDR-SDC, outperforms the baseline system by 10.1% absolute, which supports the validity of the approach. Furthermore, we find that both PMVDR and SDC offer much better robustness in noisy conditions than other features, which is critical for real applications. | |
| #8 | Setup for Acoustic-Visual Speech Synthesis by Concatenating Bimodal Units |
| (Université Nancy 2, LORIA) (Université Nancy 2, LORIA) (Université Nancy 2, LORIA) (Université Henri Poincaré Nancy 1, LORIA) (Université Henri Poincaré Nancy 1, LORIA) (INRIA, LORIA) | |
| This paper presents preliminary work on building a system able to synthesize concurrently the speech signal and a 3D animation of the speaker's face. This is done by concatenating bimodal diphone units, that is, units that comprise both acoustic and visual information. The latter is acquired using a stereovision technique. The proposed method addresses the problems of asynchrony and incoherence inherent in classic approaches to audiovisual synthesis. Unit selection is based on classic target and join costs from acoustic-only synthesis, which are augmented with a visual join cost. Preliminary results indicate the benefits of the approach, since both the synthesized speech signal and the face animation are of good quality. Planned improvements and enhancements to the system are outlined. | |
| #9 | Towards Affective State Modeling in Narrative and Conversational Settings |
| (University of Twente) (Delft University of Technology) (University of Twente) (University of Twente) (University of Twente) | |
| We carry out two studies on affective state modeling for communication settings that involve unilateral intent on the part of one participant (the evoker) to shift the affective state of another participant (the experiencer). The first investigates viewer response in a narrative setting using a corpus of documentaries annotated with viewer-reported narrative peaks. The second investigates affective triggers in a conversational setting using a corpus of recorded interactions, annotated with continuous affective ratings, between a human interlocutor and an emotionally colored agent. In each case, we build a “one-sided” model using indicators derived from the speech of one participant. Our classification experiments confirm the viability of our models and provide insight into useful features. | |
| #10 | Detection of anger emotion in dialog speech using prosody feature and temporal relation of utterances |
| (NTT Cyber Space Laboratories, NTT Corporation) (NTT Cyber Space Laboratories, NTT Corporation) (NTT Cyber Space Laboratories, NTT Corporation) (NTT Cyber Space Laboratories, NTT Corporation) | |
| This paper proposes a novel feature for detecting anger in dialog speech. Anger is classified into two types: loud HotAnger and calm ColdAnger. Prosody can reliably detect the former but not the latter. We analyze both types of anger dialog in the two-party setting, and discover that they exhibit some differences in the temporal relation of utterances from neutral dialog. We create a dialog feature that reflects these differences, and investigate its effectiveness in detecting both types of anger. Tests show that the proposed feature combination improves the F-measure of ColdAnger and HotAnger by 24.4 points and 8.8 points, respectively, against a baseline technique that uses only prosody. | |
| #11 | Gesture and Speech Coordination: The Influence of the Relationship Between Manual Gesture and Speech |
| (gipsa-lab, UMR5216 CNRS) (gipsa-lab, UMR5216 CNRS) | |
| Communication is multimodal. In particular, speech is often accompanied by manual gestures. Moreover, their coordination with speech has often been related to prosody. The aim of this study was to further explore the coordination between prosodic focus and different manual gestures (pointing, beat and control gestures) on ten speakers using motion capture. | |
| #12 | Analysis and Detection of Cognitive Load and Frustration in Drivers’ Speech |
| (Center for Robust Speech Systems (CRSS), The University of Texas at Dallas) (Center for Robust Speech Systems (CRSS), The University of Texas at Dallas) (Speech and Audio Research Laboratory, Queensland University of Technology, Brisbane, Australia) (Center for Robust Speech Systems (CRSS), The University of Texas at Dallas) | |
| Non-driving related cognitive load and variations of emotional state may impact the driver’s capability to control a vehicle and introduce driving errors. Availability of reliable cognitive load and emotion detection in drivers would benefit the design of active safety systems and other intelligent in-vehicle interfaces. In this study, speech produced by 68 subjects while driving in urban areas is analyzed. A particular focus is on speech production differences in two secondary cognitive tasks, interactions with a co-driver and calls to automated spoken dialog systems (SDS), and two emotional states during the SDS interactions (neutral and negative). A number of speech parameters are found to vary across the cognitive/emotion classes. Suitability of selected spectral- and production-based features for automatic cognitive task/emotion classification is investigated. A fusion of GMM/SVM classifiers yields an accuracy of 89% in cognitive task and 76% in emotion classification. | |
| #13 | Acoustic-Based Recognition of Head Gestures Accompanying Speech |
| (Advanced Industrial Science and Technology, AIST) (Advanced Industrial Science and Technology, AIST) (Advanced Industrial Science and Technology, AIST) | |
| Head movements are linked not only to symbolic gestures, such as head-nodding to represent “yes” or head-shaking to represent “no,” but also to the production of suprasegmental features of speech, such as stress, prominence, and other aspects of prosody. Recent studies have shown that head movements play a more direct role in the perception of speech. In this paper, we propose a novel method for recognizing head gestures that accompany speech. The proposed method tracks head movements that accompany speech by localizing the mouth position with a microphone array system. We also propose a recognition method for the mouth-position trajectory, in which Higher-Order Local Cross Correlation is applied to the trajectory. The recognition accuracy of the proposed method averaged 90.25% for nineteen kinds of head gesture recognition tasks conducted in an open test manner, which outperformed the Hidden Markov Model-based method. | |
| #14 | Multimodal Dialog in the Car: Combining Speech and Turn-And-Push Dial to Control Comfort Functions |
| (German Research Center for Artificial Intelligence) (German Research Center for Artificial Intelligence) (University of the Saarland) (German Research Center for Artificial Intelligence) | |
| In this paper, we address the question of how speech and tangible interfaces can be combined in order to provide effective multimodal interaction in vehicles, taking into account the special requirements induced by the circumstances of driving. Speech is used to set the interaction context and a turn-and-push dial is used to manipulate/adjust. An experimental study is presented that measures the distraction induced by manual, speech-only, and multimodal interaction (combination of speech and turn-and-push dial). Results show that while subjects were able to perform more tasks in the manual condition, their driving was significantly safer when using speech-only or multimodal dialog. Supplemental contributions of this paper are descriptions of how a multimodal dialog manager and driving simulation software are connected to the CAN vehicle bus, and of how driver distraction caused by interacting with a system is measured using the standardized lane change task. | |
| #15 | Hands Free Audio Analysis from Home Entertainment |
| (Idiap Research Institute) (Idiap Research Institute) (Idiap Research Institute) | |
| In this paper, we describe a system developed for hands free audio analysis for a living room environment. It comprises detection and localisation of verbal and paralinguistic events, which can augment the behaviour of a virtual director and improve the overall experience of interactions between spatially separated families and friends. The results show good performance in reverberant environments and fulfil real-time requirements. | |
| #16 | Affective Story Teller: A TTS System for Emotional Expressivity |
| (Department of Information and Communication Engineering, University of Tokyo, Japan) (Department of Information and Communication Engineering, University of Tokyo, Japan) (Department of Information and Communication Engineering, University of Tokyo, Japan) | |
| This paper describes a system, Affective Story Teller (AST), as an example of an emotionally expressive speech synthesizer. Our technique uses several linguistic resources that recognize emotions in the input text according to its emotional affinity and assigns appropriate prosodic parameters as well as pitch accents by XML-based tagging to generate a synthesized speech sample. Then the synthesized sample is re-synthesized through TD-PSOLA based pitch manipulation in accordance with emotional connotation. The system employed the MARY TTS system to read out a folk tale. The preliminary perceptual test results are encouraging: human judges, by listening to the re-synthesized speech samples of AST, could perceive “happy”, “sad”, and “fear” emotions much better than when they listened to non-affective synthesized speech. |