Wed-Ses2-O3:
Speech Production III: Analysis

This is the final program for this session. For oral sessions, the timing on the left is the current presentation order, but this may still change, so please check at the conference itself.
Time: Wednesday 13:30 Place: 201B Type: Oral
Chair: Shrikanth Narayanan
13:30 Relying on critical articulators to estimate vocal tract spectra in an articulatory-acoustic database
Daniel Felps (Department of Computer Science and Engineering, Texas A&M University)
Christian Geng (Department of Linguistics and English Language, University of Edinburgh)
Michael Berger (Centre for Speech Technology Research, University of Edinburgh)
Korin Richmond (Centre for Speech Technology Research, University of Edinburgh)
Ricardo Gutierrez-Osuna (Department of Computer Science and Engineering, Texas A&M University)
We present a new phone-dependent feature weighting scheme that can be used to map articulatory configurations (e.g. EMA) onto vocal tract spectra (e.g. MFCC) through table lookup. The approach consists of assigning feature weights according to a feature’s ability to predict the acoustic distance between frames. Since an articulator’s predictive accuracy is phone-dependent (e.g., lip location is a better predictor for bilabial sounds than for palatal sounds), a unique weight vector is found for each phone. Inspection of the weights reveals a correspondence with the expected critical articulators for many phones. The proposed method reduces overall cepstral error by 6% when compared to a uniform weighting scheme. Vowels show the greatest benefit, though improvements occur for 80% of the tested phones.
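The table lookup described in this abstract can be illustrated with a minimal sketch. It assumes a weighted Euclidean distance over articulatory features and a nearest-neighbour search; the function names, the two-feature toy database, and the weight values are hypothetical and are not taken from the paper.

```python
from math import sqrt

def weighted_distance(a, b, weights):
    """Weighted Euclidean distance between two articulatory frames."""
    return sqrt(sum(w * (x - y) ** 2 for w, x, y in zip(weights, a, b)))

def lookup_spectrum(query, phone, database, phone_weights):
    """Return the spectral frame paired with the closest articulatory frame.

    The distance uses the weight vector assigned to `phone`, so the same
    query can match different database entries for different phones.
    """
    weights = phone_weights[phone]
    best = min(database, key=lambda entry: weighted_distance(query, entry[0], weights))
    return best[1]

# Hypothetical toy database: (articulatory frame, spectral frame) pairs.
# Feature 0 stands in for a lip measurement, feature 1 for a tongue measurement.
database = [
    ((0.0, 1.0), (10.0, 0.0)),
    ((1.0, 0.0), (0.0, 10.0)),
]

# Hypothetical per-phone weights: "p" relies on the lips, "k" on the tongue.
phone_weights = {"p": (1.0, 0.0), "k": (0.0, 1.0)}

query = (0.2, 0.1)
print(lookup_spectrum(query, "p", database, phone_weights))  # (10.0, 0.0)
print(lookup_spectrum(query, "k", database, phone_weights))  # (0.0, 10.0)
```

The same query frame retrieves a different spectrum depending on the phone, which is the effect a phone-dependent weight vector is meant to produce.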
13:50 Investigating Articulatory Setting - Pauses, Ready Position, and Rest - Using Real-Time MRI
Vikram Ramanarayanan (University of Southern California)
Dani Byrd (University of Southern California)
Louis Goldstein (University of Southern California)
Shrikanth Narayanan (University of Southern California)
We present a novel automatic procedure to analyze 'articulatory setting (AS)' or 'basis of articulation' using real-time magnetic resonance images (rt-MRI) of the human vocal tract recorded for read and spontaneously spoken speech. We extract relevant frames of inter-speech pauses (ISPs) and rest positions from MRI sequences of read and spontaneous speech and use automatically extracted features to quantify areas of different regions of the vocal tract as well as the angle of the jaw. Significant differences were found between the ASs adopted for ISPs in read and spontaneous speech, as well as between ISPs and absolute rest positions. We further contrast the ASs adopted when the person is ready to speak with those adopted at an absolute rest position.
14:10 Articulatory inversion of American English /r/ by conditional density modes
Chao Qin (University of California, Merced)
Miguel Carreira-Perpiñán (University of California, Merced)
Although many algorithms have been proposed for articulatory inversion, they are often tested on synthetic models, or on real data that shows very small proportions of nonuniqueness. We focus on data from the Wisconsin X-ray microbeam database for the American English /ɹ/ displaying multiple, very different articulations (retroflex and bunched). We propose a method based on recovering the set of all possible vocal tract shapes as the modes of a conditional density of articulators given acoustics, and then selecting feasible trajectories from this set. This method accurately recovers the correct /ɹ/ shape, while a neural network has errors twice as large.
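To illustrate what recovering several modes of a conditional density means, here is a minimal sketch. It assumes, purely for illustration, that the conditional density of a single articulatory coordinate given one acoustic frame is a one-dimensional Gaussian mixture with a shared standard deviation, and it finds modes by a fixed-point (mean-shift style) iteration started from each component mean. The function name and all numbers are hypothetical; the abstract does not specify the density model or the mode-finding procedure.

```python
from math import exp

def find_modes(weights, means, sigma, iters=200, tol=1e-6):
    """Find modes of a 1-D Gaussian mixture with shared std `sigma`.

    Starting from each component mean, repeatedly move x to the
    responsibility-weighted average of the means until it stops moving.
    Converged points closer than 1e-3 are treated as the same mode.
    """
    modes = []
    for x in means:
        for _ in range(iters):
            resp = [w * exp(-0.5 * ((x - m) / sigma) ** 2)
                    for w, m in zip(weights, means)]
            x_new = sum(r * m for r, m in zip(resp, means)) / sum(resp)
            converged = abs(x_new - x) < tol
            x = x_new
            if converged:
                break
        if all(abs(x - m) > 1e-3 for m in modes):
            modes.append(x)
    return sorted(modes)

# Well-separated components: two distinct modes, loosely analogous to two
# very different articulations that produce similar acoustics.
bimodal = find_modes([0.5, 0.5], [-2.0, 2.0], 0.5)

# Heavily overlapping components: the density has a single mode.
unimodal = find_modes([0.5, 0.5], [-0.2, 0.2], 1.0)

print(len(bimodal), len(unimodal))
```

A regression that predicts a single value would tend to land between the two modes in the first case, which is one reason a multi-modal representation can matter when the inverse mapping is not unique.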
14:30 Can tongue be recovered from face? The answer of data-driven statistical models
Atef Ben Youssef (GIPSA-lab (Dept Parole & Cognition / ICP), UMR 5216, CNRS – Grenoble University, France)
Pierre Badin (GIPSA-lab (Dept Parole & Cognition / ICP), UMR 5216, CNRS – Grenoble University, France)
Gérard Bailly (GIPSA-lab (Dept Parole & Cognition / ICP), UMR 5216, CNRS – Grenoble University, France)
This study revisits the face-to-tongue articulatory inversion problem in speech. We compare the Multi Linear Regression method (MLR) with two more sophisticated methods based on Hidden Markov Models (HMMs) and Gaussian Mixture Models (GMMs), using the same French corpus of articulatory data acquired by ElectroMagnetoGraphy. GMMs give better overall results than HMMs, while MLR performs poorly. GMMs and HMMs maintain the original phonetic class distribution, though with some centralisation effects; these effects are much stronger with MLR. A detailed analysis shows that, although the jaw / lips / tongue tip synergy helps recover front high vowels and coronal consonants, the velars are not recovered at all. It is therefore not possible to reliably recover the tongue from the face.
14:50 Phrase-medial vowel devoicing in spontaneous French
Francisco Torreira (Radboud Universiteit Nijmegen & Max Planck Institute for Psycholinguistics)
Mirjam Ernestus (Radboud Universiteit Nijmegen & Max Planck Institute for Psycholinguistics)
This study investigates phrase-medial vowel devoicing in European French (e.g. /ty po/ [typo] 'you can'). Our spontaneous speech data confirm that French phrase-medial devoicing is a frequent phenomenon affecting high vowels preceded by voiceless consonants. We also found that devoicing is more frequent in temporally reduced and coarticulated vowels. Complete and partial devoicing were conditioned by the same variables (speech rate, consonant type and distance from the end of the AP). Given these results, we propose that phrase-medial vowel devoicing in French arises mainly from the temporal compression of vocalic gestures and the aerodynamic conditions imposed by high vowels.
15:10 Exploring the Mechanism of Tonal Contraction in Taiwan Mandarin
Chierh Cheng (Department of Speech, Hearing and Phonetic Sciences, University College London, UK)
Yi Xu (Department of Speech, Hearing and Phonetic Sciences, University College London, UK)
Michele Gubian (Centre for Language & Speech Technology, Radboud University, Nijmegen, NL)
This study investigates the mechanism of tonal contraction when a disyllabic unit is merged into a monosyllable at fast speech rate in Taiwan Mandarin. Various degrees of contraction of bi-tonal sequences were elicited by manipulating speech rates. Functional Data Analysis was performed to compare trajectories of F0 and velocity in the contracted and non-contracted syllables. Preliminary results show that speakers always make an effort to produce the original tones, even in cases of extreme degrees of reduction. This finding militates against phonology-based accounts like the Edge-in model, according to which contraction is a process of deleting adjacent tonemes while leaving the non-adjacent tonemes intact.
