Gesellschaft für Informatik e.V.

Lecture Notes in Informatics


10th International Conference on Innovative Internet Community Systems (I2CS) - Jubilee Edition 2010 - P-165, 247-256 (2010).

Gesellschaft für Informatik, Bonn
2010


Copyright © Gesellschaft für Informatik, Bonn


Improving ASR for continuous Thai words using ANN/HMM

Maleerat Sodanil, Supot Nitsuwat and Choochart Haruechaiyasak

Abstract


The baseline system for automatic speech recognition normally uses Mel-frequency cepstral coefficients (MFCC) as feature vectors. However, for a tonal language such as Thai, tone information is an important feature that can be used to improve recognition accuracy. This paper proposes a method for building an acoustic model for Thai-ASR using a combination of MFCC and tone information as the input feature vector. In addition, we apply Artificial Neural Network (ANN) multilayer perceptrons to estimate the posterior probabilities of a class model given a sequence of input observations. The performance of the ANN approach is compared with that of the Gaussian Mixture Model (GMM) used in the Hidden Markov Model Toolkit (HTK). The experiments were carried out with 2-gram and 3-gram language models. The training and test data sets were prepared from read speech of ten Aesop's stories by 5 male and 5 female speakers. The results show that the proposed method can improve the performance of Thai-ASR in terms of reducing the word error rate.

The challenge in automatic speech recognition (ASR) is how to improve the accuracy of speech recognition in terms of the performance of the algorithm. An ASR system has three main parts: the first is feature extraction, which extracts distinguishing features of the speech utterance; the second is model training, which is typically based on the Hidden Markov Model (HMM) framework; and the third is the decoder, which finds the best probabilistic match between the speech utterance and the text transcription. In tonal languages such as Mandarin or Thai, tone is important for specifying the meaning of an utterance, so tone information can be incorporated into the system to improve the accuracy of speech recognition. Several studies have proposed tone recognition or classification [SM1999, SY1995, NB2002] for improving the accuracy of speech recognition [CT2006].
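Recognition accuracy in this setting is reported as word error rate (WER), the metric named in the abstract. As a minimal sketch of the standard definition (substitutions, deletions and insertions divided by the number of reference words), the following Python function computes WER by word-level edit distance. The function name and the assumption that both transcripts are already segmented into space-separated words are illustrative and not taken from the paper; Thai text in particular would need word segmentation first.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER = (substitutions + deletions + insertions) / reference length.

    Assumption (not from the paper): both strings are already split into
    words by whitespace.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)
```

For example, scoring the hypothesis "the fox ate grapes" against the reference "the fox ate the grapes" counts one deletion over five reference words. Note that WER can exceed 1.0 when the hypothesis contains many insertions.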
Although the well-known Mel-frequency cepstral coefficient (MFCC) features and HMMs are widely used, there is some concern about testing, combining and adapting them to improve the accuracy or performance of a system which may not depend on speakers or languages [XM2006, PA2008]. The HMM is a very powerful statistical method for characterizing the observed data samples of a discrete time series. In an HMM, the states are not directly visible, but variables influenced by the states are visible. Each state has a probability distribution over the possible output observations. The state transitions are also probabilistic. The complete HMM is denoted as $\lambda = (A, B, \pi)$. The HMM training procedure estimates the state transition probability distribution ($A$), the observation symbol probability density, or emission probability ($B$), and the initial state distribution ($\pi$). The emission probability density function (PDF) gives the probability with which a given observation has been generated. However, the standard HMM based on the maximum likelihood (ML) criterion has some weaknesses caused by several assumptions which reduce
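To make the notation $\lambda = (A, B, \pi)$ concrete, the following sketch evaluates the likelihood of an observation sequence with the standard forward algorithm. It assumes a small HMM with discrete observation symbols, so $B$ is a table rather than a density; the function name and the toy parameter values are hypothetical and are not taken from the paper, whose baseline uses Gaussian mixture emission densities.

```python
from typing import List, Sequence


def forward_likelihood(
    A: Sequence[Sequence[float]],   # A[r][s]: probability of moving from state r to s
    B: Sequence[Sequence[float]],   # B[s][k]: probability that state s emits symbol k
    pi: Sequence[float],            # pi[s]: probability of starting in state s
    observations: Sequence[int],    # observed symbol indices
) -> float:
    """Return P(observations | lambda) for a discrete HMM lambda = (A, B, pi)."""
    if not observations:
        raise ValueError("observations must not be empty")
    n_states = len(pi)

    # Initialisation: start in state s and emit the first symbol.
    alpha: List[float] = [pi[s] * B[s][observations[0]] for s in range(n_states)]

    # Induction: sum over all predecessor states, then emit the next symbol.
    for obs in observations[1:]:
        alpha = [
            sum(alpha[r] * A[r][s] for r in range(n_states)) * B[s][obs]
            for s in range(n_states)
        ]

    # Termination: the sequence may end in any state.
    return sum(alpha)


# Toy two-state, two-symbol model (illustrative values only).
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]
pi = [0.6, 0.4]
likelihood = forward_likelihood(A, B, pi, [0, 1])
```

The hidden states never appear in the result: the recursion sums over every state path, which is why only the observations and the three parameter sets are needed. For long sequences a practical implementation would work in the log domain or rescale `alpha` at each step to avoid numerical underflow.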



Gesellschaft für Informatik, Bonn
ISBN 978-3-88579-259-8

