| Time: | Wednesday 16:00 | Place: | International Conference Room C | Type: | Poster |
| Chair: | Toshio Irino | ||||
| #1 | A Feature Extraction Method for Automatic Speech Recognition Based on the Cochlear Nucleus |
| (University of Western Australia) (University of Western Australia) | |
| Motivated by the human auditory system, a feature extraction method for automatic speech recognition (ASR) based on the differential processing strategy of the AVCN, PVCN and DCN of the cochlear nucleus is proposed. The method utilizes a zero-crossing with peak amplitudes (ZCPA) auditory model as a synchrony detector to discriminate the low-frequency formants. It utilizes the mean rate information in the synapse processing to capture the very rapidly changing dynamic nature of speech. Additionally, a temporal companding method is utilized for spectral enhancement through two-tone suppression. We propose to separate synchrony detection from synaptic processing, as observed in the parallel processing methodology of the cochlear nucleus. HMM recognition using isolated digits showed higher recognition rates in clean and in non-stationary noise conditions than the existing auditory model. | |
| #2 | A Phoneme Recognition Framework based on Auditory Spectro-Temporal Receptive Fields |
| (Johns Hopkins University) (Johns Hopkins University) (Johns Hopkins University) (Johns Hopkins University) (Johns Hopkins University) | |
| In this paper we propose to incorporate features derived using spectro-temporal receptive fields (STRFs) of neurons in the auditory cortex for the task of phoneme recognition. Each of these STRFs is tuned to different auditory frequencies, scales and modulation rates. We select different sets of STRFs which are specific for phonemes in different broad phonetic classes (BPC) of sounds. These STRFs are then used as spectro-temporal filters on spectrograms of speech to extract features for phoneme recognition. For the phoneme recognition task on the TIMIT database, the proposed features show an improvement of about 5% over conventional feature extraction techniques. | |
| #3 | Perceptual compensation for effects of reverberation in speech identification: A computer model based on auditory efferent processing |
| (Department of Computer Science, University of Sheffield, UK) (Department of Computer Science, University of Sheffield, UK) | |
| Human speech perception is remarkably robust to the effects of reverberation, due in part to mechanisms of perceptual constancy that compensate for the characteristics of the acoustic environment. A computer model of this phenomenon is described, which shows compensation for the effects of reverberation in a word identification task. The presence of reverberation is detected as a change in the mean-to-peak ratio of the simulated auditory nerve response. In turn, this leads to attenuation of peripheral auditory activity, which is achieved through an efferent feedback loop. The computer model provides a qualitative match to a range of perceptual data, suggesting that auditory mechanisms under efferent control might contribute to compensation for reverberation in particular speech identification tasks. | |
| #4 | Predicting human perception and ASR classification of word-final [t] by its acoustic sub-segmental properties |
| (Center for Language and Speech Technology) (Max Planck Institute for Psycholinguistics, Center for Language and Speech Technology) (Department of Language and Communication Studies, NTNU) (Department of Language and Communication Studies, NTNU) | |
| This paper presents a study on the acoustic sub-segmental properties of word-final /t/ in conversational standard Dutch and how these properties contribute to whether humans and an ASR system classify the /t/ as acoustically present or absent. In general, humans and the ASR system use the same cues (presence of a constriction, a burst, and alveolar frication), but the ASR system is less sensitive than human listeners to fine cues (weak bursts, smoothly starting friction) and is misled by the presence of glottal vibration. These data inform the further development of models of human and automatic speech processing. | |
| #5 | A speech in noise test based on spoken digits: Comparison of normal and impaired listeners using a computer model |
| (Department of Computer Science, University of Sheffield, UK) (Department of Computer Science, University of Sheffield, UK) (Department of Psychology, Essex University, UK) (Department of Psychology, Essex University, UK) (Department of Psychology, Essex University, UK) | |
| This paper describes a speech-in-noise test which is suitable for testing both human and machine speech recognition in noise. The test uses spoken digit triplets, presented in a range of babble backgrounds and signal-to-noise ratios (SNRs). The performance of a normal-hearing (NH) listener and a hearing-impaired (HI) listener has been assessed using the test. Both listeners show a fall in performance with decreasing SNR, as well as a decrease in performance with an increase in the number of talkers in the babble background. A physiologically accurate computational auditory model has been tuned to match the NH and HI listeners, allowing their performance in the test to be modelled using a missing data-based automatic speech recognition (ASR) system. For the NH model we show a good match to the behaviour of the human listener. However, the computer model underestimates the digit test performance of the specific HI listener considered here. | |
| #6 | Evaluation of bone-conducted ultrasonic hearing-aid regarding transmission of paralinguistic information: A comparison with cochlear implant simulator |
| (Health Research Institute, National Institute of Advanced Industrial Science and Technology (AIST), Japan) (Health Research Institute, National Institute of Advanced Industrial Science and Technology (AIST), Japan) | |
| Human listeners can perceive speech signals in a voice-modulated ultrasonic carrier from a bone-conduction stimulator, even if the listeners are patients with sensorineural hearing loss. Considering this fact, we have been developing a bone-conducted ultrasonic hearing aid (BCUHA). The purpose of this study is to evaluate the usability of BCUHA regarding transmission of paralinguistic information. For this purpose, two series of listening experiments were conducted: one is a speaker’s intention identification experiment, and the other is a speaker discrimination experiment. To compare the performance of BCUHA to that of air conduction (AC) and a cochlear implant, both experiments were conducted under three conditions: BCUHA, AC, and cochlear implant simulator (CIsim). The results show that BCUHA can transmit the speaker’s intentions as well as CIsim can. BCUHA can also transmit speaker information better than CIsim. | |
| #7 | Challenging the Speech Intelligibility Index: Macroscopic vs. Microscopic Prediction of Sentence Recognition in Normal and Hearing-impaired Listeners |
| (Medizinische Physik, Carl-von-Ossietzky Universität Oldenburg, Germany) (Medizinische Physik, Carl-von-Ossietzky Universität Oldenburg, Germany) (Medizinische Physik, Carl-von-Ossietzky Universität Oldenburg, Germany) (Medizinische Physik, Carl-von-Ossietzky Universität Oldenburg, Germany) (Medizinische Physik, Carl-von-Ossietzky Universität Oldenburg, Germany) | |
| A “microscopic” model of phoneme recognition, which includes an auditory model and a simple speech recognizer, is adapted to model the recognition of single words within whole German sentences. “Microscopic” in terms of this model is defined in two ways: first, as analyzing the particular spectro-temporal structure of the speech waveforms, and second, as basing the recognition of whole sentences on the recognition of single words. This approach is evaluated on a large database of speech recognition results from normal-hearing and sensorineural hearing-impaired listeners. Individual audiometric thresholds are accounted for by implementing a spectrally shaped noise that simulates the hearing threshold. Furthermore, a comparative challenge between the microscopic model and the “macroscopic” Speech Intelligibility Index (SII) is performed using the same listeners’ data. Both models show similar correlations between modeled Speech Reception Thresholds (SRTs) and observed SRTs. | |
| #8 | Does sentence complexity interfere with intelligibility in noise? Evaluation of the Oldenburg Linguistically and Audiologically Controlled Sentence Test (OLACS) |
| (Institute of Physics, CvO University of Oldenburg, Germany) (Institute of Physics, CvO University of Oldenburg, Germany) (Department of Modern Languages, CvO University of Oldenburg, Germany) (Department of Modern Languages, CvO University of Oldenburg, Germany) (Department of Modern Languages, CvO University of Oldenburg, Germany) (Department of Modern Languages, CvO University of Oldenburg, Germany) (Institute of Physics, CvO University of Oldenburg, Germany) | |
| The Oldenburg Linguistically and Audiologically Controlled Sentence Test (OLACS), which contains sentences with seven different grades of linguistic complexity, is introduced. The evaluation of this new German speech material was performed by presenting each sentence at three different SNRs to 36 normal-hearing listeners. Sentence-specific discrimination functions were calculated, and for each sentence type 40 sentences were selected. Differences of up to 3 dB occurred between the different grades of linguistic complexity. Interindividual differences of up to 30% occurred in speech recognition rates. Thus, on the one hand, the material seems to be appropriate for examining the influence of sentence complexity on speech recognition both qualitatively and quantitatively. On the other hand, the OLACS might be used for diagnostic purposes, for example to differentiate between individual listeners. | |
| #9 | Intelligibility Predictions for Speech against Fluctuating Masker |
| (AIPA and Quality and Usability, Deutsche Telekom Laboratories, Berlin Institute of Technology, Germany) (AIPA, Deutsche Telekom Laboratories, Berlin Institute of Technology, Germany) (Quality and Usability Lab, Deutsche Telekom Laboratories, Berlin Institute of Technology, Germany) | |
| The effect of masking due to fluctuating sources on speech intelligibility is difficult to predict. Intelligibility scores vary with the efficiency of the energetic masking, while the linguistic content of the message and the listener’s cognitive performance add to the general uncertainty, which peaks when the masker is itself speech. The present contribution proposes a signal-based assessment of the energetic masking at the sentence level. A mapping onto the scale of the speech intelligibility index is established for stationary noise. Predictions are quantitatively compared with the results of an intelligibility test for speech-modulated noise. The model is independent of voice similarities and semantic features, two important sources of informational masking. | |
| #10 | An Effect of Formant Amplitude in Vowel Perception |
| (Department of Electrical and Intelligent Systems, Tohoku Institute of Technology, Japan) (Research Institute of Electrical Communication, Tohoku University, Japan) (Graduate School of Engineering, Tohoku University, Japan) (Research Institute of Electrical Communication, Tohoku University, Japan) | |
| A psycho-acoustical experiment was conducted using synthetic vowel-like stimuli to examine the effect of formant amplitude in vowel perception. Nine combinations of formant frequencies were examined. For each combination, the amplitude of the third formant relative to the second was varied in seven steps. In eight of the nine combinations, the perceived vowel changed with the formant amplitude although every formant frequency was kept constant. Furthermore, this amplitude effect was observed even when the frequency separation of the neighboring formants was greater than 3.5 Bark. The result suggests that formant amplitude, as well as formant frequency, is an effective cue for vowel perception. | |
| #12 | Functional neuroimaging of brain regions sensitive to communication sounds in primates: A comparative summary |
| (Newcastle University) (Newcastle University) | |
| There is considerable brain imaging evidence on the neural substrates of speech in humans, but only recently have data for comparison become available on the brain regions that process communication signals in other primates. To obtain insights into the relationship between the substrates for communication in primates, we compared the results from several neuroimaging studies in humans with those that have recently been obtained from macaque monkeys and chimpanzees. We note a striking general correspondence across these primates in the pattern of brain regions that process species-specific vocalizations and the acoustics related to voice identity. |