| Time: | Monday 16:00 | Place: | International Conference Room C | Type: | Poster |
| Chair: | Helen Meng | ||||
| #1 | Integration of Multilayer Regression with Structure-based Pronunciation Assessment |
| (The University of Tokyo) (Shenzhen Institutes of Advanced Technology) (The University of Tokyo) (The University of Tokyo) | |
| Automatic pronunciation assessment has several difficulties. Adequacy in controlling the vocal organs is often estimated from the spectral envelopes of input utterances but the envelope patterns are also affected by other factors such as speaker identity. Recently, a new method of speech representation was proposed where these non-linguistic variations are effectively removed through modeling only the contrastive aspects of speech features. This speech representation is called speech structure. However, the often excessively high dimensionality of the speech structure can degrade the performance of structure-based pronunciation assessment. To deal with this problem, we integrate multilayer regression analysis with the structure-based assessment. The results show higher correlation between human and machine scores and also show much higher robustness to speaker differences compared to widely used GOP-based analysis. | |
| #2 | Using Non-Native Error Patterns to Improve Pronunciation Verification |
| (Centre for Language and Speech Technology, Radboud University Nijmegen) (Centre for Language and Speech Technology, Radboud University Nijmegen) (Centre for Language and Speech Technology, Radboud University Nijmegen) | |
| In this paper we show how a pronunciation quality measure can be improved by making use of information on frequent pronunciation errors made by non-native speakers. We propose a new measure, called weighted Goodness of Pronunciation (wGOP), and compare it to the much used GOP measure. We applied this measure to the task of discriminating correctly from incorrectly realized Dutch vowels produced by non-native speakers and observed a substantial increase in performance when sufficient training material is available. | |
| #3 | Regularized-MLLR Speaker Adaptation for Computer-Assisted Language Learning System |
| (University of Tokyo) (University of Tokyo) (University of Tokyo) (Tokyo International University) (University of Tokyo) | |
| In this paper, we propose a novel speaker adaptation technique, regularized-MLLR, for Computer Assisted Language Learning (CALL) systems. This method represents each target learner’s transformation matrix as a linear combination of a group of teachers’ transformation matrices. It thereby avoids the over-adaptation problem, in which erroneous pronunciations come to be judged as good pronunciations after conventional MLLR speaker adaptation, which uses learners’ “imperfect” speech as target utterances of adaptation. Experiments on automatic scoring and error detection using public databases show that the proposed method outperforms conventional MLLR adaptation in pronunciation evaluation and can avoid the problem of over-adaptation. | |
| #4 | Automatic Evaluation of English Pronunciation by Japanese Speakers Using Various Acoustic Features and Pattern Recognition Techniques |
| (Toyohashi University of Technology, Japan) (Toyohashi University of Technology, Japan) | |
| In this paper, we propose a method for estimating a score for English pronunciation. Scores estimated by the proposed method were evaluated by correlating them with the teacher's pronunciation score. The average correlation between the estimated pronunciation scores and the teacher's pronunciation scores over 1, 5, and 10 sentences was 0.807, 0.873, and 0.921, respectively. When the text of the spoken sentence was unknown, we obtained a correlation of 0.878 for 10 utterances. For English phonetic evaluation, we classified English phoneme pairs that are difficult for Japanese speakers to pronounce, using SVM, NN, and HMM classifiers. The correct classification ratios for native English and Japanese English phonemes were 94.6% and 92.3% for SVM, 96.5% and 87.4% for NN, and 85.0% and 69.2% for HMM, respectively. We then investigated the relationship between the classification rate and a native English teacher's pronunciation score, and obtained a high correlation of 0.6 to 0.7. | |
| #5 | Decision Tree Based Tone Modeling with Corrective Feedbacks for Automatic Mandarin Tone Assessment |
| (Information and Communications Research Laboratories, Industrial Technology Research Institute) (Information and Communications Research Laboratories, Industrial Technology Research Institute) (Information and Communications Research Laboratories, Industrial Technology Research Institute) (Department of Applied Chinese Language and Literature, National Taiwan Normal University) (School of Electrical and Computer Engineering, Georgia Institute of Technology) | |
| We propose a novel decision tree based approach to Mandarin tone assessment. In most conventional computer assisted pronunciation training (CAPT) scenarios, a tone production template is prepared as a reference, with only numeric scores as feedback for tone learning. In contrast, decision trees trained with an annotated tone-balanced corpus make use of a collection of questions related to important cues in categories of tone production. By traversing the corresponding paths and nodes associated with a test utterance, a sequence of corrective comments can be generated to guide the learner toward potential improvement. Therefore, a detailed pronunciation indication or a comparison between two paths can be provided to learners, which is usually unavailable in score-based CAPT systems. | |
| #6 | CASTLE: a Computer-Assisted Stress Teaching and Learning Environment for Learners of English as a Second Language |
| (Massey University, New Zealand) (Massey University, New Zealand) (University of Brunei Darussalam, Brunei Darussalam) (State Key Laboratory for Novel Software Technology, Nanjing University, China) (Tsinghua University, China) | |
| In this paper, we describe the principle and functionality of the Computer-Assisted Stress Teaching and Learning Environment (CASTLE) that we have proposed and developed to help learners of English as a Second Language (ESL) learn the stress patterns of the English language. There are three modules in the CASTLE system. The first module, the individualised speech learning material providing module, can provide learners with individualised speech material that possesses their preferred voice features, e.g., gender, pitch and speech rate. The second module, the perception assistance module, is intended to help learners correctly perceive English stress patterns; it can automatically exaggerate the differences between stressed and unstressed syllables in a teacher’s voice. The third module, the production assistance module, is developed to make learners aware of the rhythm of the English language and to provide them with feedback in order to improve their production of stress patterns. | |
| #7 | Automatic reference independent evaluation of prosody quality using multiple knowledge fusions |
| (Digital Content Technology Research Center, Institute of Automation, Chinese Academy of Science) (Digital Content Technology Research Center, Institute of Automation, Chinese Academy of Science) (Digital Content Technology Research Center, Institute of Automation, Chinese Academy of Science) (Digital Content Technology Research Center, Institute of Automation, Chinese Academy of Science) (Digital Content Technology Research Center, Institute of Automation, Chinese Academy of Science) | |
| Automatic evaluation of GOR (Goodness Of pRosody) is a more advanced and challenging task in CALL (Computer Aided Language Learning) systems. Going beyond traditional prosodic features, we develop a method based on multiple knowledge sources that does not require the reading text to be known in advance. After speech recognition, in addition to most state-of-the-art features in prosodic analysis, we derive a more concise and effective feature set from the generation of prosody, based on the Fujisaki model, and from the influence of tempo on prosody, namely the variability of prosodic components based on the PVI method. We also propose methods for boosting training without any annotation by mining a larger corpus. Experiments investigate the GOR score on 1297 speech samples from an excellent group of Chinese students aged 14 to 16, and we draw several conclusions. On the one hand, adding the knowledge sources from the generation and impact of prosody yields a 1.76% reduction in EER and a 0.036 increase in correlation compared with prosodic features alone; on the other hand, the final result can be considerably improved by the boosting training approach and a topic-dependent scheme. | |
| #8 | Landmark-based Automated Pronunciation Error Detection |
| (Educational Testing Service) (University of Illinois at Urbana-Champaign) (Oregon Health and Science University) | |
| We present a pronunciation error detection method for second language learners of English (L2 learners). The method is a combination of confidence scoring at the phone level and landmark-based Support Vector Machines (SVMs). Landmark-based SVMs were implemented to focus the method on targeting specific phonemes in which L2 learners make frequent errors. The method was trained on the phonemes that are difficult for Korean learners and tested on intermediate Korean learners. In the data where non-phonemic errors occurred in a high proportion, the SVM method achieved a significantly higher F-score (0.67) than confidence scoring (0.60). However, the combination of the two methods without the appropriate training data did not lead to improvement. Even for intermediate learners, a high proportion of errors (40%) was related to these difficult phonemes. Therefore, a method that is specialized for these phonemes would be beneficial for both beginners and intermediate learners. | |
| #9 | HMM based TTS for Mixed language text |
| (University of Science and Technology of China) (Department of Computer Science, Tsinghua University, China) (IBM Research China) (University of Science and Technology of China) (Department of Computer Science, Tsinghua University) | |
| Current text content, especially web content, often mixes languages, e.g. Mandarin text mixed with English words. To make the synthesized speech of mixed-language content sound natural, we need to synthesize it with a single voice. However, this task is very challenging because we can hardly find a voice talent who can speak both languages well enough. The synthesized speech will sound unnatural if the HMM based TTS is directly built with the non-native speakers’ training corpus. In this paper, we propose to use speaker adaptation technology to leverage the native speaker’s data to generate more natural speech for the non-native speaker. Evaluation results show that the proposed method can significantly improve the speaker consistency and naturalness of synthesized speech for mixed language text. | |
| #10 | An Analysis of Language Mismatch in HMM State Mapping-Based Cross-Lingual Speaker Adaptation |
| (Idiap Research Institute & Ecole Polytechnique Fédérale de Lausanne) (Idiap Research Institute) | |
| This paper provides an in-depth analysis of the impact of language mismatch on the performance of cross-lingual speaker adaptation. Our work confirms the influence of language mismatch between average voice distributions for synthesis and for transform estimation, and the necessity of eliminating this mismatch in order to effectively utilize multiple transforms for cross-lingual speaker adaptation. Specifically, we show that language mismatch introduces unwanted language-specific information when estimating multiple transforms, thus making these transforms detrimental to adaptation performance. Our analysis demonstrates that speaker characteristics should be separated from language characteristics in order to improve cross-lingual adaptation performance. | |
| #11 | Classroom Note-taking System for Hearing Impaired Students using Automatic Speech Recognition Adapted to Lectures |
| (Kyoto University) (Kyoto University) (Kyoto University) (Kyoto University) | |
| We are developing a real-time lecture transcription system for hearing impaired students in university classrooms. The automatic speech recognition (ASR) system is adapted to individual lecture courses and lecturers to enhance the recognition accuracy. The ASR results are selectively corrected by a human editor, through a dedicated interface, before being presented to the students. An efficient adaptation scheme for the ASR modules has been investigated in this work. The system was tested for a hearing-impaired student in a lecture course on civil engineering. Compared with the current manual note-taking scheme offered by two volunteers, the proposed system generated almost double the amount of text with one human editor. | |
| #12 | Exploring Web-Browser based Runtimes Engines for Creating Ubiquitous Speech Interfaces |
| (National Institute of Information and Communications Technology) (Tokyo Institute of Technology) | |
| This paper describes an investigation into current browser-based runtimes, including Adobe’s Flash and Microsoft’s Silverlight, as platforms for delivering web-based speech interfaces. The key difference here is that the browser plugin is used to perform all the computation without any server-side processing. The first application is an HMM based text-to-speech engine running in the Adobe Flash plugin. The second application is a WFST based large vocabulary speech recognition decoder written in C# running inside the Silverlight plugin.