Robust speech recognition using fusion techniques and adaptive filtering
American Journal of Applied Sciences, Feb, 2009 by Haddad S.A.R.Al-, Samad S.A., Hussain A., Ishak K.A., Noor A.O.A.
INTRODUCTION
Speech Recognition (SR) is a technique for converting a speaker's spoken utterance into a text string or into input for other applications. SR is still far from a solved problem. The best reported word-error rates on English broadcast news and conversational telephone speech are about 10 and 20%, respectively (1). Error rates on conversational meeting speech are about 50% higher, and much higher still under noisy conditions (2).
Robustness has been a key research topic in speech recognition for the past 50 years. The main issues in robustness are invariance to extraneous background noise and channel conditions as well as to speaker and accent variations (3). In this work the Recursive Least Squares (RLS) algorithm is used to enhance speech in the presence of background noise. The RLS algorithm performs well for models with accurate initial information on the parameter or state to be estimated (4). In many noise cancellation applications the signal characteristics can change quite fast, which requires adaptive algorithms that converge rapidly. From this perspective the best choice is RLS (5). After noise cancellation, the system must detect the beginning and end of each word before processing it.
Pattern recognition techniques can be fused, for example Dynamic Time Warping (DTW) with the Hidden Markov Model (HMM). DTW was popular in speech recognition in the 1970s and 1980s (6),(7), and HMM has been popular from the 1990s until now (8). Fusion techniques came into use in the mid-1990s so that the methods could complement each other (9),(10). Several types of fusion are used in speech recognition, among them HMM with Artificial Neural Network (ANN) (11) and HMM with Bayesian Network (BN) (12).
The algorithm is tested on a Malay digit speech corpus. A hundred speakers were involved in this project, and each spoke 10 repetitions of each digit. The Malay isolated digits are 0-9, spoken as KOSONG, SATU, DUA, TIGA, EMPAT, LIMA, ENAM, TUJUH, LAPAN and SEMBILAN.
MATERIALS AND METHODS
The system begins with speech recording, followed by RLS noise cancellation, endpoint detection, framing, normalization, filtering, MFCC extraction, signal weighting, time normalization, Vector Quantization (VQ) and labeling. HMM is then used to calculate the reference patterns, and DTW is used to normalize the training data against the reference patterns, as shown in Fig. 1. In this paper the Mel-Frequency Cepstral Coefficient (MFCC) is chosen as the feature because of the sensitivity of the low-order cepstral coefficients to overall spectral slope and the sensitivity properties of the high-order cepstral coefficients (13).
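The DTW step can be illustrated with a generic textbook alignment between two feature sequences of different lengths. This is only a sketch under assumptions: the function name `dtw_distance`, the Euclidean local distance and the unconstrained step pattern are illustrative choices, not details taken from the paper.

```python
import numpy as np

def dtw_distance(a, b):
    """Accumulated DTW cost between two sequences of feature vectors.

    a, b: arrays of shape (frames,) or (frames, features).
    Local distance and step pattern are assumptions for illustration.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    if a.ndim == 1:
        a = a[:, None]          # treat scalars as 1-dimensional features
    if b.ndim == 1:
        b = b[:, None]
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = np.linalg.norm(a[i - 1] - b[j - 1])
            # allow a match, an insertion or a deletion at each step
            cost[i, j] = local + min(cost[i - 1, j - 1],
                                     cost[i - 1, j],
                                     cost[i, j - 1])
    return cost[n, m]
```

A sequence that differs from a reference only by a repeated frame aligns at zero cost, which is the time-normalizing property the pipeline relies on.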
[FIGURE 1 OMITTED]
[FIGURE 2 OMITTED]
WAV files were recorded for 60 speakers. Each speaker said KOSONG, SATU, DUA, TIGA, EMPAT, LIMA, ENAM, TUJUH, LAPAN and SEMBILAN with a one-second pause between numbers.
The RLS algorithm was used in preprocessing for noise cancellation, as shown in Fig. 2 (14). The symbols in Fig. 2 are as follows:
n = Background noise of any type
n̂ (n-hat) = Noise correlated to n
s = Speech signal
d = Desired signal
W = Optimum filter weight matrix
y = Output of adaptive process
e = Error signal in ideal case (clean speech)
Figure 3 shows the result of applying RLS adaptive filtering to the noisy signal. Figure 3a shows the amplitude of the noisy speech and Fig. 3b shows the amplitude after RLS processing.
[FIGURE 3a OMITTED]
[FIGURE 3b OMITTED]
After the filtered speech sample is obtained, the first process is endpoint detection. Two basic parameters are used for detection: Zero Crossing Rate (ZCR) and short-time energy. The energy parameter has been used in endpoint detection since the 1970s (15). Combined with the ZCR, it makes the speech detection process very accurate (16).
To label the segmented speech frames, the zero crossing rate and energy were applied to each frame. Unfortunately, the segments still contained some background noise, because the energy of breath and surrounding sounds can quite easily be confused with the energy of a fricative sound (17).
As a result, this algorithm performs almost perfect segmentation for voice recorded by male speakers. For recordings made in noisy places, segmentation problems occur because in some cases the background noise changes the values the algorithm produces. The cut-off for silence must then be raised, since the silent regions are not quite zero and noise may be interpreted as speech. For clean speech, on the other hand, both the zero crossing rate and the short-time energy should be zero in silent regions.
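A simplified energy-plus-ZCR endpoint detector might look like the sketch below. The paper gives no thresholds, so everything numeric here is an assumption: the frame length and hop, the relative energy threshold `energy_ratio`, the use of the first `n_silence` frames as a silence reference, and the `max_extend` limit on how far ZCR may widen the boundaries. The function names are likewise hypothetical.

```python
import numpy as np

def frame_signal(x, frame_len, hop):
    """Split x into overlapping frames of frame_len samples."""
    n_frames = 1 + max(0, (len(x) - frame_len) // hop)
    return np.stack([x[i * hop:i * hop + frame_len] for i in range(n_frames)])

def short_time_energy(frames):
    return np.sum(frames ** 2, axis=1)

def zero_crossing_rate(frames):
    signs = np.sign(frames)
    signs[signs == 0] = 1            # count exact zeros as positive
    return np.mean(signs[:, 1:] != signs[:, :-1], axis=1)

def detect_endpoints(x, frame_len=256, hop=128, energy_ratio=0.1,
                     n_silence=5, max_extend=10):
    """Return (start_sample, end_sample) of the detected word, or None."""
    x = np.asarray(x, dtype=float)
    frames = frame_signal(x, frame_len, hop)
    energy = short_time_energy(frames)
    zcr = zero_crossing_rate(frames)
    # 1) high-energy frames form the core of the word
    core = np.flatnonzero(energy > energy_ratio * energy.max())
    if core.size == 0:
        return None
    first, last = int(core[0]), int(core[-1])
    # 2) widen the boundaries while ZCR is above the silence level,
    #    which catches weak fricatives at the word edges
    zcr_thr = zcr[:n_silence].mean() + 2 * zcr[:n_silence].std()
    steps = 0
    while first > 0 and steps < max_extend and zcr[first - 1] > zcr_thr:
        first -= 1
        steps += 1
    steps = 0
    while last < len(zcr) - 1 and steps < max_extend and zcr[last + 1] > zcr_thr:
        last += 1
        steps += 1
    return first * hop, last * hop + frame_len
```

With clean speech the silent frames have zero energy and zero ZCR, so the boundaries stay tight. With background noise the silence ZCR level rises, which corresponds to the raised cut-off discussed above.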
Feature extraction: Mel Frequency Cepstral Coefficients (MFCC) are chosen because of the sensitivity of the low-order cepstral coefficients to overall spectral slope and the sensitivity properties of the high-order cepstral coefficients (18). MFCC is currently the most popular feature extraction method (18),(19). The MFCCs are produced after the recorded signal is pre-emphasized, framed and Hamming windowed. The signal is then normalized and lowpass filtered. The lowpass filter removes the potential artificial high frequencies that appear in the modulation spectrum due to transmission errors.
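The first three front-end steps named above (pre-emphasis, framing and Hamming windowing) can be sketched as follows. The pre-emphasis coefficient `alpha`, the frame length and the hop size are assumed values, since the paper does not state them, and the later mel filterbank and cepstral steps are not shown.

```python
import numpy as np

def preemphasize(x, alpha=0.97):
    """First-order high-pass: y[n] = x[n] - alpha * x[n-1] (alpha assumed)."""
    x = np.asarray(x, dtype=float)
    return np.append(x[0], x[1:] - alpha * x[:-1])

def frame_and_window(x, frame_len=256, hop=128):
    """Split x into overlapping frames and apply a Hamming window to each.

    Assumes len(x) >= frame_len; trailing samples that do not fill
    a whole frame are dropped.
    """
    x = np.asarray(x, dtype=float)
    n_frames = 1 + (len(x) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[idx] * np.hamming(frame_len)
```

Each row of the result is one windowed frame, ready for the spectral analysis that precedes the mel filterbank.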

