Tue-Ses1-S1:
Special Session: Open Vocabulary Spoken Document Retrieval

This is the final program for this session. For oral sessions, the timing on the left is the current presentation order, but this may still change, so please check at the conference itself.
Time: Tuesday 10:00 Place: 301 Type: Special
Chair: Seiichi Nakagawa & Kiyoaki Aikawa
10:00 Constructing Japanese Test Collections for Spoken Term Detection
Yoshiaki Itoh (Iwate Prefectural University)
Hiromitsu Nishizaki (University of Yamanashi)
Xinhui Hu (NICT)
Hiroaki Nanjo (Ryukoku University)
Tomoyosi Akiba (Toyohashi University of Technology)
Tatsuya Kawahara (Kyoto University)
Seiichi Nakagawa (Toyohashi University of Technology)
Tomoko Matsui (The Institute of Statistical Mathematics)
Yoichi Yamashita (Ritsumeikan University)
Kiyoaki Aikawa (Tokyo University of Technology)
Spoken Document Retrieval (SDR) and Spoken Term Detection (STD) have been among the hottest topics in the spoken document processing community. TREC (Text Retrieval Conference) has dealt with SDR since 1996 [1], and NIST has already set up STD test collections and collected the results of participants [2]. Japanese spoken document processing has also needed such test collections for SDR and STD. We set up a working group for this purpose in SIG-SLP (Spoken Language Processing) of the Information Processing Society of Japan. The working group has constructed and released a test collection for SDR [3]. We are now constructing new test collections for STD that will be made available to researchers. The paper introduces the policy, outline, and schedule of the new test collections, and compares them with the NIST STD tasks.
10:15 Japanese Spoken Term Detection Using Syllable Transition Network Derived from Multiple Speech Recognizers' Outputs
Satoshi Natori (University of Yamanashi)
Hiromitsu Nishizaki (University of Yamanashi)
Yoshihiro Sekiguchi (University of Yamanashi)
This paper proposes a spoken term detection (STD) method using a syllable transition network (STN) derived from the outputs of multiple speech recognizers. An STN is similar to a sub-word based confusion network, which is derived from the output of a speech recognizer. The one we propose is derived from the outputs of multiple speech recognition systems, a combination that is well known to be robust to certain recognition errors and the out-of-vocabulary problem. Therefore, the STN should also be robust to recognition errors in STD. Our experiment showed that the STN was very effective at detecting out-of-vocabulary terms, improving the detection rate to 83%, which was as high as the in-vocabulary term detection performance.
10:30 Combining Chinese Spoken Term Detection Systems via Side-information Conditioned Linear Logistic Regression
Sha Meng (Spoken Language Processing Group, LIMSI-CNRS, France)
Wei-Qiang Zhang (Tsinghua University)
Jia Liu (Tsinghua University)
This paper examines the task of Spoken Term Detection (STD) for the Chinese language. We propose to use Linear Logistic Regression (LLR) to combine various Chinese STD systems built with different decoding units, detection units, features and phone sets. To solve the missing-sample problem in STD system combination, side-information reflecting the reliability of the scores for fusion is used to condition the parameters of the standard LLR model. In addition, a two-stage combination solution is proposed to overcome the data sparsity problem. The experimental results show that the proposed methods improve the overall detection performance significantly. Compared with the best single system, a relative improvement of 11.3% is achieved.
10:45 Metric Subspace Indexing for Fast Spoken Term Detection
Taisuke Kaneko (Toyohashi University of Technology)
Tomoyosi Akiba (Toyohashi University of Technology)
In this paper, we propose a novel indexing method for Spoken Term Detection (STD). The proposed method can be considered as using metric space indexing for the approximate string-matching problem, where the distance between a phoneme and a position in the target spoken document is defined. The proposed method does not require the use of thresholds to limit the output, instead being able to output the results in increasing order of distance. It can also deal easily with the multiple candidates obtained via Automatic Speech Recognition (ASR). The results of preliminary experiments show promise for achieving fast STD.
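The search problem that such an index accelerates can be illustrated with a brute-force baseline. The following is a minimal sketch, not the indexing method of the paper: it assumes each document is a single phoneme sequence (one ASR hypothesis), uses unit-cost edit distance, and returns results in increasing order of distance without any threshold. The function names and the toy phoneme strings are hypothetical.

```python
def min_substring_edit_distance(query, doc):
    """Smallest edit distance between `query` and any substring of `doc`.

    Both arguments are sequences of phoneme symbols. Unit costs for
    insertion, deletion and substitution are an assumption of this sketch.
    """
    # Row 0 is all zeros: a match may start at any position in the document.
    prev = [0] * (len(doc) + 1)
    for i, q in enumerate(query, 1):
        cur = [i] + [0] * len(doc)
        for j, p in enumerate(doc, 1):
            cur[j] = min(
                prev[j] + 1,             # query phoneme has no counterpart
                cur[j - 1] + 1,          # document phoneme has no counterpart
                prev[j - 1] + (q != p),  # match or substitution
            )
        prev = cur
    # A match may also end at any position in the document.
    return min(prev)


def rank_documents(query, docs):
    """Return (distance, name) pairs in increasing order of distance."""
    scored = [(min_substring_edit_distance(query, phones), name)
              for name, phones in docs.items()]
    return sorted(scored)


docs = {
    "d1": ["s", "k", "a", "t", "o"],
    "d2": ["k", "o", "t"],
}
ranking = rank_documents(["k", "a", "t"], docs)
```

This baseline scans every document for every query, which is the cost an index is meant to avoid.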
11:00 Unsupervised Spoken-Term Detection with Spoken Queries Using Segment-based Dynamic Time Warping
Chun-an Chan (National Taiwan University)
Lin-shan Lee (National Taiwan University)
Spoken term detection is important for retrieval of multimedia and spoken content over the Internet. Because it is difficult to have acoustic/language models well matched to the huge quantities of spoken documents produced under various conditions, unsupervised approaches using frame-based dynamic time warping (DTW) have been proposed to compare the spoken query with spoken documents frame by frame. In this paper, we propose a new approach to unsupervised spoken term detection using segment-based DTW. Speech signals are segmented into sequences of acoustically similar segments using hierarchical agglomerative clustering, and a DTW procedure is formulated for segment sequences along with the clustering tree structures. In this way, the number of highly redundant parameters can be reduced, and the relatively unstable feature vectors can be replaced by more stable segments which describe the sequence of vocal tract stages during the uttering process. Preliminary experiments indicate a substantial reduction in computation time compared to frame-based DTW, although the slightly degraded detection performance leaves much room for further improvement.
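For reference, the frame-based DTW baseline mentioned in the abstract can be sketched as follows. This is a minimal illustration of classic DTW, not the segment-based procedure proposed in the paper: it assumes each frame is a tuple of floats, uses Euclidean frame distance, and aligns the whole query to the whole document sequence rather than searching for a sub-sequence. The function name is hypothetical.

```python
import math


def dtw_distance(query, doc):
    """Accumulated cost of the best frame-by-frame alignment of two sequences.

    `query` and `doc` are lists of equal-dimension feature vectors (tuples).
    The Euclidean local distance is an assumption of this sketch.
    """
    n, m = len(query), len(doc)
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = math.dist(query[i - 1], doc[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # query frame advances alone
                cost[i][j - 1],      # document frame advances alone
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]


# A query that is a time-compressed copy of the document aligns at zero cost.
stretched = dtw_distance([(0.0,), (1.0,)], [(0.0,), (0.0,), (1.0,)])
```

The table has one cell per pair of frames, so the cost grows with the product of the two sequence lengths; replacing frames with a smaller number of segments shrinks that table.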
11:15 Contextual Verification for Open Vocabulary Spoken Term Detection
Daniel Schneider (Fraunhofer IAIS, Germany)
Timo Mertens (Norwegian University of Science and Technology, Norway)
Martha Larson (Delft University of Technology, Netherlands)
Joachim Köhler (Fraunhofer IAIS, Germany)
In spoken term detection, subword speech recognition is a viable means of addressing the out-of-vocabulary (OOV) problem at query time. Fuzzy error compensation techniques are needed to cope with inevitable recognition errors, but can lead to high false alarm rates, especially for short queries. We propose two novel methods which reject false alarms based on the context of the hypothesized result and the distance to phonetically similar queries. Using the proposed methods, we obtain an increase in precision of 11% absolute at equal recall.
11:30 Augmented set of features for confidence estimation in spoken term detection
Javier Tejedor (HCTLab-UAM)
Doroteo Torre (ATVS-UAM)
Miguel Bautista (ATVS-UAM)
Simon King (CSTR-University of Edinburgh)
Dong Wang (CSTR-University of Edinburgh)
Jose Colas (HCTLab-UAM)
Discriminative confidence estimation, together with confidence normalisation, has been shown to yield robust decision-maker modules in spoken term detection (STD) systems. Discriminative confidence estimation, making use of term-dependent features, has been shown to improve on the widely used lattice-based confidence estimation in STD. In this work, we augment the set of these term-dependent features and show a significant improvement in STD performance, both in terms of ATWV and DET curves, in experiments conducted on a Spanish geographical corpus. This work also proposes a multiple linear regression analysis to carry out the feature selection. The most informative features derived from it are then used within the discriminative confidence estimation in the STD system.
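For readers unfamiliar with ATWV, the metric can be sketched as below. This assumes the term-weighted value definition used in the NIST 2006 STD evaluation, TWV = 1 - average over terms of (P_miss + beta * P_FA) with beta = 999.9, computed at the system's actual decision threshold. The function name and the input layout are illustrative and not taken from the paper.

```python
def twv(term_stats, speech_seconds, beta=999.9):
    """Term-weighted value from per-term detection counts.

    `term_stats` is a list of (n_true, n_correct, n_spurious) tuples, one per
    query term; `speech_seconds` is the duration of the searched audio.
    Terms with no true occurrence are skipped, since P_miss is undefined.
    """
    penalties = []
    for n_true, n_correct, n_spurious in term_stats:
        if n_true == 0:
            continue
        p_miss = 1.0 - n_correct / n_true
        # Non-target trials: one per second of speech, minus true occurrences.
        p_fa = n_spurious / (speech_seconds - n_true)
        penalties.append(p_miss + beta * p_fa)
    if not penalties:
        raise ValueError("no term has a true occurrence")
    return 1.0 - sum(penalties) / len(penalties)


perfect = twv([(2, 2, 0), (1, 1, 0)], speech_seconds=3600.0)
half_missed = twv([(2, 1, 0)], speech_seconds=3600.0)
```

Because beta is large, a single false alarm on a rare term can cost about as much as missing that term entirely, which is why confidence estimation matters for this metric.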
11:45 Cluster-Based Language Model for Spoken Document Retrieval Using NMF-based Document Clustering
Xinhui Hu (National Institute of Information and Communications Technology, Japan)
Ryosuke Isotani (National Institute of Information and Communications Technology, Japan)
Hisashi Kawai (National Institute of Information and Communications Technology, Japan)
Satoshi Nakamura (National Institute of Information and Communications Technology, Japan)
In this paper, a non-negative matrix factorization (NMF)-based document clustering approach is proposed for the cluster-based language model for spoken document retrieval. The retrieval language model comprises three different unigram models: a unigram based on the whole collection, a document-based unigram, and a document-clustering-based unigram. They are combined with double linear interpolations. Document clustering is realized via the NMF method; each document is assigned to the axis on which it has the maximum projection in the latent semantic space derived by the NMF. The initialization of NMF, which is an important factor influencing NMF performance, is based on the results of the K-means clustering approach. Using these approaches, retrieval experiments are conducted on a test collection from the Corpus of Spontaneous Japanese (CSJ). The proposed method significantly outperforms the conventional vector space model (VSM): the maximum improvement in retrieval performance (mean average precision, MAP) exceeds 36%, outstripping the conventional query likelihood model, which achieves an improvement of 7.4%. It is also found that the proposed method surpasses the K-means clustering method when adequate initialization of NMF is used.
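The combination of three unigram models by two linear interpolations can be sketched as follows. The abstract does not give the exact form, so the nesting used here (cluster and collection models interpolated first, then mixed with the document model), the weights `lam` and `mu`, and all function names are assumptions made for illustration only.

```python
import math
from collections import Counter


def unigram(tokens):
    """Maximum-likelihood unigram distribution over a token list."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def interpolated_prob(word, p_doc, p_cluster, p_corpus, lam=0.5, mu=0.5):
    """P(word) under a doubly interpolated model (assumed nesting).

    Inner interpolation: cluster model with whole-collection model.
    Outer interpolation: document model with the inner result.
    """
    background = (mu * p_cluster.get(word, 0.0)
                  + (1.0 - mu) * p_corpus.get(word, 0.0))
    return lam * p_doc.get(word, 0.0) + (1.0 - lam) * background


def query_log_likelihood(query, p_doc, p_cluster, p_corpus, lam=0.5, mu=0.5):
    """Log-likelihood of the query words; documents are ranked by this score.

    Assumes every query word occurs somewhere in the collection, so that
    no probability is zero.
    """
    return sum(
        math.log(interpolated_prob(w, p_doc, p_cluster, p_corpus, lam, mu))
        for w in query
    )


p_doc = unigram(["a"])
p_cluster = unigram(["a", "b"])
p_corpus = unigram(["a", "b", "c", "c"])
score = query_log_likelihood(["a", "b"], p_doc, p_cluster, p_corpus)
```

In this toy setting the word "b" never occurs in the document itself, yet it still receives probability mass from the cluster and collection models, which is the purpose of the smoothing.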
