Tue-Ses1-P3:
Speech Production I: Various Approaches

This is the final program for this session. For oral sessions, the timing on the left is the current presentation order, but this may still change, so please check at the conference itself.
Time: Tuesday 10:00 Place: International Conference Room C Type: Poster
Chair: G Ananthakrishnan
#1 Speaking style dependency of formant targets
Akiko Amano-Kusumoto (Oregon Health & Science University)
John-Paul Hosom (Oregon Health & Science University)
Alexander Kain (Oregon Health & Science University)
Previous work on formant targets has assumed that these targets are independent of the speaking style. In this paper, we estimate consonant and vowel targets in a database of “clear” and “conversational” speech, using both style-independent and style-dependent models. The test-set errors and clustering of the estimated target values indicate that for this corpus, formant targets depend on the speaking style. As an application, vowel classification accuracy was tested both style-independently and style-dependently, based on observed formant values and estimated target values. Token-based style-independent classification shows greater accuracy for conversational speech (82.19%) than observed-value classification (73.97%).
#2 Similarity of effects of emotions on the speech organ configuration with and without speaking
Tatsuya Kitamura (Konan University)
In this work we propose and verify a hypothesis on emotional speech production: emotions induce physical and physiological changes in the whole body including the speech organs, regardless of whether or not the person is speaking, and as a side effect, this changes the voice quality. To verify this hypothesis, we measured the speech organ configuration of actors simulating four emotions (neutral, hot anger, joy, and sadness) with and without speaking by MRI. The results showed that emotions affect the speech organ configuration, and the same tendency of changes was found regardless of whether or not the person was speaking.
#3 A Study of Intra-Speaker and Inter-Speaker Affective Variability using Electroglottograph and Inverse Filtered Glottal Waveforms
Daniel Bone (Viterbi School of Engineering, University of Southern California, CA, USA)
Samuel Kim (Viterbi School of Engineering, University of Southern California, CA, USA)
Sungbok Lee (Department of Linguistics, University of Southern California, CA, USA)
Shrikanth Narayanan (Viterbi School of Engineering, University of Southern California, CA, USA)
It is well known that different speakers utilize their vocal instruments in diverse ways to express linguistic intention with some paralinguistic coloring such as emotional quality. The study of voice source features, which describe the action of the vocal folds, is important for a deeper understanding of emotion encoding in speech. In this study we investigate inter- and intra-speaker differences in voicing activities as a function of emotion using electroglottography (EGG) and inverse filtering. Results demonstrate that while voice quality features are good indicators of affective state, voice source descriptors vary in affective information across speakers. Glottal ratio measurements taken directly from the EGG signal are more reliable than measurements from the inverse-filtered glottal airflow signal, but the spectral harmonic amplitude differences of EGG are less useful than those from inverse filtering.
#4 Modal analysis of vocal fold vibrations using laryngotopography
Ken-Ichi Sakakibara (Department of Communication Disorders, Health Sciences University of Hokkaido)
Hiroshi Imagawa (Department of Otolaryngology, University of Tokyo)
Miwako Kimura (Department of Otolaryngology, University of Tokyo)
Hisayuki Yokonishi (Department of Otolaryngology, University of Tokyo)
Niro Tayama (Department of Otolaryngology, Head and Neck Surgery, National Center for Global Health and Medicine)
In this paper, we propose a method for analyzing spatial characteristics of the larynx during phonation by high-speed digital imaging. Laryngotopography was applied to high-speed digital images of normal subjects and of patients with paralysis and cysts. The results show various modes of vocal fold vibration particular to the patients with paralysis and cysts, and demonstrate the usefulness of laryngotopography for clinical purposes.
#5 Laryngeal Voice Quality in the Expression of Focus
Martti Vainio (University of Helsinki, Institute of Behavioural Sciences)
Matti Airas (Nokia Corp.)
Juhani Järvikivi (Max Planck Institute for Psycholinguistics, Nijmegen, The Netherlands)
Paavo Alku (Department of Signal Processing and Acoustics, Aalto University, Finland)
Prominence relations in speech are signaled in various ways, including such phonetic means as voice fundamental frequency, intensity, and duration. A less studied acoustic feature affecting prominence is the so-called voice quality, which is determined by changes in the airflow caused by different laryngeal settings. We investigated the changes in voice quality with respect to linguistic prosodic signaling of focus in simple three-word utterances. We used inverse-filtering-based methods for calculating and parametrizing the glottal flow in several different vowels and focus conditions. The results supported our hypothesis (formed by an earlier study of voice quality changes in running speech) that more prominent syllables are produced with a less tense voice quality and less prominent ones with a more tense phonation. We provide both physiological and linguistic explanations for the phenomena.
#6 Laryngeal Characteristics during the Production of Geminate Consonants
Masako Fujimoto (Center for Corpus Development, National Institute for Japanese Language and Linguistics, Japan)
Kikuo Maekawa (Center for Corpus Development, National Institute for Japanese Language and Linguistics, Japan)
Seiya Funatsu (Science Information Center, Prefectural University of Hiroshima, Japan)
Analysis of high-speed digital video images showed that no apparent constriction or tension appeared in the larynx or glottis during the production of geminate consonants. Glottal width for geminate consonants is only slightly wider than for their singleton counterparts. Rather, the degree of opening depends largely on consonant type. However, analysis of photo-electric glottograms suggested that an interruption of the glottal opening movement and/or an abrupt cessation of the preceding vowel is involved in the production of geminate consonants.
#7 Numerical study of turbulent flow-induced sound production in presence of a tooth-shaped obstacle: towards sibilant [s] physical modeling
Julien Cisonni (The Center for Advanced Medical Engineering and Informatics, Osaka University, Japan)
Kazunori Nozaki (The Center for Advanced Medical Engineering and Informatics, Osaka University, Japan)
Annemie Van Hirtum (GIPSA-lab, UMR CNRS 5216, Grenoble Universities, France)
Shigeo Wada (The Center for Advanced Medical Engineering and Informatics, Osaka University, Japan)
The sound generated during the production of the sibilant [s] results from the impact of a turbulent jet on the incisors. Physical modeling of this phenomenon depends on the characterization of the properties of the turbulent flow within the vocal tract and of the acoustic sources resulting from the presence of an obstacle in the path of the flow. The properties of the flow-induced noise strongly depend on several geometric parameters whose influence has to be determined. In this paper, a simplified vocal tract/tooth geometric model is used to carry out a numerical study of the flow-induced noise generated by a tooth-shaped obstacle placed in a channel. The simulations reveal a link between the level of the generated noise and the aperture of the constriction formed by the obstacle.
#8 Morphological and predictability effects on schwa reduction: The case of Dutch word-initial syllables
Iris Hanique (Radboud University Nijmegen, The Netherlands; Max Planck Institute for Psycholinguistics, The Netherlands)
Barbara Schuppler (Radboud University Nijmegen, The Netherlands)
Mirjam Ernestus (Radboud University Nijmegen, The Netherlands; Max Planck Institute for Psycholinguistics, The Netherlands)
This corpus-based study shows that the presence and duration of schwa in Dutch word-initial syllables are affected by a word’s predictability and its morphological structure. Schwa is less reduced in words that are more predictable given the following word. In addition, schwa may be longer if the syllable forms a prefix, and in prefixes the duration of schwa is positively correlated with the frequency of the word relative to its stem. Our results suggest that the conditions which favor reduced realizations are more complex than one would expect on the basis of the current literature.
#9 Acoustic-to-Articulatory Inversion based on Local Regression
Samer Al Moubayed (Centre for Speech Technology, Royal Institute of Technology (KTH), Stockholm, Sweden)
Ananthakrishnan G (Centre for Speech Technology, Royal Institute of Technology (KTH), Stockholm, Sweden)
This paper presents an acoustic-to-articulatory inversion method based on local regression. Two types of local regression, a non-parametric regression and a local linear regression, were applied to a corpus containing simultaneous recordings of the positions of articulators and the corresponding acoustics. A maximum likelihood trajectory smoothing using the estimated dynamics of the articulators is also applied to the regression estimates. The average root mean square error in estimating articulatory positions, given the acoustics, is 1.56 mm for the non-parametric regression and 1.52 mm for the local linear regression. The local linear regression is found to perform significantly better than regression using Gaussian Mixture Models with the same acoustic and articulatory features.
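As a rough illustration of the two kinds of local regression named in this abstract, the Python sketch below predicts a single output value from a single input feature. It is not the authors' implementation: the Gaussian kernel, the `bandwidth` parameter, the one-dimensional setting, and all function names are assumptions made for the example. The non-parametric variant is shown here as a kernel-weighted average, and the trajectory smoothing step is not reproduced.

```python
import math


def gaussian_weights(x_query, xs, bandwidth):
    """Weight each training point by its closeness to the query (assumed Gaussian kernel)."""
    return [math.exp(-0.5 * ((x - x_query) / bandwidth) ** 2) for x in xs]


def kernel_regression(x_query, xs, ys, bandwidth):
    """Non-parametric estimate: kernel-weighted average of the training outputs."""
    w = gaussian_weights(x_query, xs, bandwidth)
    return sum(wi * yi for wi, yi in zip(w, ys)) / sum(w)


def local_linear_regression(x_query, xs, ys, bandwidth):
    """Local linear estimate: weighted least-squares line fitted around the query."""
    w = gaussian_weights(x_query, xs, bandwidth)
    sw = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, xs)) / sw  # weighted mean of inputs
    my = sum(wi * yi for wi, yi in zip(w, ys)) / sw  # weighted mean of outputs
    sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, xs))
    sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, xs, ys))
    slope = sxy / sxx if sxx > 0 else 0.0
    return my + slope * (x_query - mx)


def rmse(predicted, actual):
    """Root mean square error, the error measure quoted in the abstract."""
    n = len(actual)
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / n)
```

On exactly linear data the local linear fit recovers the line at any query point, whereas the kernel-weighted average is biased wherever the neighbouring points are not symmetric around the query. That is one common motivation for the local linear variant.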
#10 Korean lenis, fortis, and aspirated stops: Effect of place of articulation on acoustic realization
Mirjam Broersma (Max Planck Institute for Psycholinguistics, Nijmegen, The Netherlands)
Unlike most of the world's languages, Korean distinguishes three types of voiceless stops, namely lenis, fortis, and aspirated stops. All occur at three places of articulation. In previous work, acoustic measurements are mostly collapsed over the three places of articulation. This study therefore provides acoustic measurements of Korean lenis, fortis, and aspirated stops at all three places of articulation separately. Clear differences are found among the acoustic characteristics of the stops at the different places of articulation.
#11 Speech Synthesis by Modeling Harmonics Structure with Multiple Function
Toru Nakashika (Kobe University)
Ryuki Tachibana (IBM Research - Tokyo)
Masafumi Nishimura (IBM Research - Tokyo)
Tetsuya Takiguchi (Kobe University)
Yasuo Ariki (Kobe University)
In this paper, we present a new approach to speech synthesis, in which speech utterances are synthesized using the parameters of a spectro-modeling function (the Multiple function). With this approach, only the harmonic parts are extracted from the phoneme spectrum, and the time-varying spectrum corresponding to the harmonics or sinusoidal components is modeled using the Multiple function. We introduce two types of function, and present a method for estimating the parameters of each function from the observed phoneme spectrum. In the synthesis stage, speech signals are generated from the parameters of the Multiple function. The advantage of this method is that it requires only a few speech synthesis parameters. We discuss the effectiveness of our proposed method through experimental results.
#12 Physics of Body-Conducted Silent Speech – Production, Propagation and Representation of Non-Audible Murmur
Makoto Otani (Faculty of Engineering, Shinshu University)
Tatsuya Hirahara (Faculty of Engineering, Toyama Prefectural University)
The physical nature of weak body-conducted vocal-tract resonance signals called non-audible murmur (NAM) was investigated using numerical simulation and acoustic analysis of NAM signals. Computational fluid dynamics simulation reveals that a weak vortex flow occurs in the supraglottal region when uttering NAM; a source of NAM is turbulent noise produced by this vortex flow. Furthermore, computational acoustics simulation reveals that NAM signals are attenuated by 50 dB at 1 kHz, consisting of a 30-dB full-range attenuation due to air-to-body transmission loss and a –10-dB/octave spectral decay due to sound propagation loss within the body, which roughly matches the measurement results.
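The attenuation figures in this abstract can be combined in a small worked calculation. The sketch below assumes a simple model that the abstract does not state: a constant 30-dB loss plus a 10-dB-per-octave decay counted upward from a reference frequency at which the decay term is zero. The reference frequency, the function names, and the clamping below the reference are all hypothetical.

```python
import math

FULL_RANGE_LOSS_DB = 30.0    # air-to-body transmission loss quoted in the abstract
DECAY_DB_PER_OCTAVE = 10.0   # magnitude of the spectral decay quoted in the abstract


def total_attenuation_db(freq_hz, ref_freq_hz):
    """Attenuation under the assumed model: flat loss plus per-octave decay above ref_freq_hz."""
    octaves = math.log2(freq_hz / ref_freq_hz)
    # Assumption: no decay is applied below the reference frequency.
    return FULL_RANGE_LOSS_DB + DECAY_DB_PER_OCTAVE * max(octaves, 0.0)


def implied_reference_hz(freq_hz, total_db):
    """Reference frequency at which the assumed model yields total_db at freq_hz."""
    octaves = (total_db - FULL_RANGE_LOSS_DB) / DECAY_DB_PER_OCTAVE
    return freq_hz / 2.0 ** octaves
```

Under this assumed model, the quoted 50 dB at 1 kHz leaves 20 dB for the decay term, that is, two octaves, so `implied_reference_hz(1000.0, 50.0)` gives 250 Hz. This is a consequence of the assumed model, not a value reported in the abstract.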
