Signal processing and acoustic modelling of speech signals for speech recognition systems
|dc.contributor.author||Abdulla, Waleed H||en_NZ|
|dc.identifier.citation||Abdulla, W. H. (2002, March). Signal processing and acoustic modelling of speech signals for speech recognition systems (Thesis). Retrieved from http://hdl.handle.net/10523/1491||en|
|dc.description.abstract||Natural man-machine interaction remains one of the largest unfulfilled promises of automatic speech recognition (ASR). The purpose of an ASR system is to transcribe accurately, or act on, what has been said. State-of-the-art speech recognition systems consist of four basic modules: signal processing, acoustic modelling, language modelling, and the search engine. This thesis addresses the signal processing and acoustic modelling modules. We aim to model spoken signals as effectively as possible, so that the outputs of these two modules can feed the subsequent language modelling and search modules. Because the first-order hidden Markov model (HMM) is a highly successful, mathematically well-founded paradigm and the prevailing technique in current speech recognition systems, this dissertation bases all its studies and experiments on HMMs. The HMM is a statistical framework that supports both acoustic and temporal modelling. It is widely used despite making a number of suboptimal modelling assumptions that limit its full potential. We investigate how the model design strategy and the algorithms can be adapted to HMMs, and report large suites of experimental results to show the relative effectiveness of each component within the HMM paradigm. This dissertation presents several strategies for improving the overall performance of baseline speech recognition systems. The implementation of these strategies was optimised in a series of experiments. We also investigate the selection of optimal feature sets for improving recognition. Moreover, the reliability of human speech recognition is attributed to the specific properties of the auditory representation of speech. Thus, in this dissertation, we explore the use of perceptually inspired signal processing strategies, such as critical band frequency analysis. 
The resulting speech representation, called Gammatone cepstral coefficients (GTCC), provides a relative improvement over the baseline recogniser. We also investigate multiple signal representations within an ASR system to improve the recognition rate. Additionally, we developed fast techniques for evaluating and comparing different signal processing paradigms. The main contributions of this dissertation are:
• Speech/background discrimination.
• HMM initialisation techniques.
• Multiple signal representation with multi-stream paradigms.
• Gender-based modelling.
• Feature vector dimensionality reduction.
• Perceptually motivated feature sets.
• ASR training and recognition packages for research and development.
Many of these methods can be used in practical applications. The proposed techniques can be used directly in more complex speech recognition systems by passing their outputs to the language modelling and search engine modules.||en_NZ|
|dc.subject||automatic speech recognition||en_NZ|
|dc.subject||signal processing strategies||en_NZ|
|dc.subject||critical band frequency analysis||en_NZ|
|dc.subject||speech recognition systems||en_NZ|
|dc.subject.lcsh||T Technology (General)||en_NZ|
|dc.subject.lcsh||Q Science (General)||en_NZ|
|dc.title||Signal processing and acoustic modelling of speech signals for speech recognition systems||en_NZ|
|dc.description.references||Abdulla, W. H. and N. K. Kasabov (1999a). Speech recognition enhancement via robust CHMM speech background discrimination. Proc. ICONIP/ANZIIS/ANNES'99 International Workshop, New Zealand. Abdulla, W. H. and N. K. Kasabov (1999b). Two pass hidden Markov model for speech recognition systems. Proc. ICICS'99, Singapore. Abdulla, W. H. and N. K. Kasabov (1999c). The concepts of hidden Markov model in speech recognition. IJCNN'99, N. K. Kasabov, W. H. Abdulla (Ed.), Washington, DC, July 10-16, Chapter 4. Abdulla, W. H. and N. K. Kasabov (2000). Feature selection for parallel CHMM speech recognition systems. Proc. of the Fifth Joint Conference on Information Sciences, vol.2, pp 874-878, Atlantic City, New Jersey, USA. Abdulla, W. H. and N. K. Kasabov (2001). Improving speech recognition performance through gender separation. Proc. Artificial Neural Networks and Expert Systems International Conference (ANNES), pp 218-222, Dunedin, New Zealand. Abramson, N. (1963). Information Theory and Coding. New York, McGraw-Hill. Abrash, V., H. Franco, M. Cohen, N. Morgan, et al. (1992). "Connectionist gender adaptation in hybrid neural network/hidden Markov model speech recognition system." Proc. ICSLP'92. Aertsen, A. and P. Johannesma (1980). "Spectro-temporal receptive fields of auditory neurons in the grass frog. I. Characterization of tonal and natural stimuli." Biol. Cybern. 38: 223 - 234. Allen, J. B. (1995). Speech and hearing in communication. New York, ASA edition, Acoustical Society of America. Alsabti, K., S. Ranka and V. Singh (1999). An efficient K-means clustering algorithm. IPPS/SPDP Workshop on High Performance Data Mining, San Juan, Puerto Rico. Arai, T. and S. Greenberg (1998). Speech intelligibility in the presence of cross channel spectral asynchrony. Proc. IEEE ICASSP'98. Atal, B. S. (1972). "Automatic speaker recognition based on pitch contours." J. Acoust. Soc. Am. 52: 1687-1697. Atal, B. S. and L. R. Rabiner (1976). 
"A pattern recognition approach to voiced-unvoiced-silence classification with application to speech recognition." IEEE Trans. ASSP 24(4): 201-212. Bahl, L. R., P. F. Brown, P. V. de Souza and R. L. Mercer (1988a). Speech recognition with continuous parameter hidden Markov models. Proc. IEEE ICASSP'88, New York, NY. Bahl, L. R., P. F. Brown, P. V. de Souza, R. L. Mercer, et al. (1988b). Acoustic Markov models used in Tangora speech recognition system. Proc. IEEE ICASSP'88, New York, USA. Baker, J. K. (1975a). "The DRAGON system - an overview." IEEE Trans. ASSP 23: 24-29. Baker, J. K. (1975b). "The Dragon system - an overview." IEEE Trans. ASSP 23(1): 24-29. Baker, J. K., Ed. (1975c). Stochastic modeling for automatic speech understanding. Speech Recognition: Invited paper presented at the 1974 IEEE symposium. New York, Academic Press. Barnwell-III, T. P. (1980). A comparison of parametrically different objective speech quality measures using correlation analysis with subjective quality results. Proc. IEEE ICASSP'80, Denver. Bateman, D. C., D. K. Bye and M. J. Hunt (1992). Spectral contrast normalization and other techniques for speech recognition in noise. Proc. IEEE ICASSP'92, San Francisco, USA. Baum, L. E. (1972). "An inequality and associated maximization technique in statistical estimation for probabilistic functions of Markov processes." Proc. Symp. On Inequalities 3: 1-7. Baum, L. E. and J. A. Eagon (1967). "An inequality with applications to statistical estimation for probabilistic functions of Markov processes and to a model for ecology." Bull. Amer. Meteorol. Soc. 73: 360-363. Baum, L. E. and T. Petrie (1966). "Statistical inference for probabilistic functions of finite state Markov chains." Ann. Math. Stat. 37: 1554-1563. Baum, L. E., T. Petrie, G. Soules and N. Weiss (1970). "A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains." Annals of Mathematical Statistics 41(1): 164-171. Becchetti, C. and L. P. 
Ricotti (1999). Speech recognition theory and C++ implementation, John Wiley & Sons. Bellegarda, J. and D. Nahamoo (1989). Tied mixture continuous parameter models for large vocabulary isolated speech recognition. Proc. ICASSP'89, Glasgow, Scotland. Bellegarda, J. and D. Nahamoo (1990). "Tied mixture continuous parameter modeling for speech recognition." IEEE Trans. ASSP 38(12): 2033-2045. Bengio, S. and Y. Bengio (2000a). "Taking on the curse of dimensionality in joint distributions using neural networks." IEEE Trans. Neural Networks 11(3): 550-557. Bengio, Y. (1996). Neural Networks for Speech and Sequence Processing, International Thomson Computer Press. Bengio, Y. and S. Bengio (2000b). Modeling high-dimensional discrete data with multi-layer neural networks. Advances in Neural Information Processing Systems 12. S. A. Solla, T. K. Leen and K.-R. Müller, MIT Press: 400-406. Billa, J., T. Colthurst, A. El-Jaroudi, R. Iyer, et al. (1999). Recent experiments in large vocabulary conversational speech recognition. Proc. IEEE ICASSP'99, Phoenix. Bilmes, J. A. (1998). A gentle tutorial of the EM algorithm and its application to parameter estimation for Gaussian mixture and hidden Markov models. Berkeley, CA, International Computer Science Institute. Bocchieri, E. and B. Mak (1997). Subspace distribution clustering for continuous observation density hidden Markov models. Proc. Eurospeech. Boll, S. F. (1979). "Suppression of acoustic noise in speech using spectral subtraction." IEEE Trans. ASSP 27(2): 113-120. Bou-Ghazale, S. E. and A. O. Asadi (2000). Hands-free voice activation of personal communication devices. Proc. IEEE ICASSP'2000, Istanbul, Turkey. Bourlard, H., S. Bengio and K. Weber (2001). New approaches towards robust and adaptive speech recognition. Advances in Neural Information Processing Systems 13. T. K. Leen, T. G. Dietterich and V. Tresp, MIT Press: 751-757. Bourlard, H., S. Dupont and C. Ris (1996). Multi-stream speech recognition. Bourlard, H. and N. 
Morgan (1993). Connectionist Speech Recognition. A Hybrid Approach. Boston, Kluwer Academic Publishers. Bourlard, H. and N. Morgan (1994). Connectionist Speech Recognition, Kluwer Academic Publishers. Bourlard, H., C. J. Wellekens and H. Ney (1984). Connected digit recognition using vector quantization. Proc. IEEE ICASSP'84, San Diego, USA. Burton, D. K. and J. E. Shore (1985). "Speaker-dependent isolated word recognition using speaker-independent vector quantization codebooks augmented with speaker-specific data." IEEE Trans. ASSP 33(2): 440-443. Chan, A. K. and S. J. Liu (1998). Wavelet Toolware: Software for Wavelet Training, Academic Press. Chen, S. S. and R. A. Gopinath (2001). Gaussianization. Advances in Neural Information Processing Systems 13. T. K. Leen, T. G. Dietterich and V. Tresp, MIT Press: 821-827. Chow, Y. L., M. O. Dunham, O. A. Kimball, M. A. Krasner, et al. (1987). BYBLOS: The BBN continuous speech recognition system. Proc. IEEE ICASSP'87, Dallas, USA. Cooke, M. (1993). Modelling Auditory Processing and Organization. U.K., Cambridge University Press. Cosi, P., Ed. (1999). Auditory modeling and neural networks. Speech Processing, Recognition and Artificial Neural Networks. London, Springer-Verlag. Cover, T. M. (1977). "On the possible ordering in the measurement selection problem." IEEE Trans. on Systems, Man, and Cybernetics 7(9): 657-661. Crochiere, R. E. and L. Rabiner (1983). Multirate Digital Signal Processing. Englewood Cliffs, NJ, Prentice Hall. Das, S., R. Bakis, A. Nadas, D. Nahamoo, et al. (1993). Influence of background noise and microphone on the performance of the IBM TANGORA speech recognition system. Proc. IEEE ICASSP'93. Das, S., A. Nadas, D. Nahamoo and M. Picheny (1994). Adaptation techniques for ambience and microphone compensation in the IBM Tangora speech recognition system. Proc. IEEE ICASSP'94, Adelaide, Australia. Daubechies, I. (1990). "The wavelet transform, time-frequency localization and signal analysis." IEEE Trans. 
IT 36(5): 961-1005. Daubechies, I., Ed. (1992a). Ten Lectures on Wavelets. CBMS-NSF Regional Conference Series in Applied Mathematics. Philadelphia, Pennsylvania, SIAM Press. Daubechies, I. (1992b). Ten lectures on wavelets, SIAM. Davis, B. and P. Mermelstein (1980). "Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences." IEEE Trans. ASSP 28(4): 357-366. Davis, S. B., Ed. (1990). Comparison of parametric representations for monosyllabic word recognition in continuous spoken sentences. Readings in Speech Recognition. de-Boer, E. and H. R. de Jongh (1978). "On cochlea encoding: Potentialities and limitations of the reverse-correlation technique." J. Acoust. Soc. Am. 63(1): 115-135. de-Boer, E. and P. Kuyper (1968). "Triggered correlation." IEEE Trans. Biomed. Eng. BME-15: 169 - 179. Deller, J. R., J. G. Proakis and J. H. Hansen (1993). Discrete-Time Processing of Speech Signals. New York, Macmillan Publishing. Dempster, A. P., N. M. Laird and D. B. Rubin (1977). "Maximum likelihood from incomplete data via the EM algorithm." J. Royal Statist. Soc. Ser. B 39(1): 1-38. Devijver, P. A. and J. Kittler (1982). Pattern Recognition: A Statistical Approach. Englewood Cliffs, NJ, Prentice-Hall. Donoho, D. L., Ed. (1993). Nonlinear wavelet methods for recovery of signals, densities, and spectra from indirect and noisy data. Proceedings of the Symposia in Applied Mathematics, American Mathematical Society. Donoho, D. L. (1995). "Denoising by soft-thresholding." IEEE Trans. IT 41(3): 613-627. Donoho, D. L. and I. M. Johnstone (1994). "Ideal spatial adaptation by wavelet shrinkage." Biometrika 81: 425-455. Draper, N. R. and H. Smith (1981). Applied Regression Analysis. New York, Wiley. Duda, R. O. and P. E. Hart (1973). Pattern Classification and Scene Analysis. New York, Wiley. Duncan Luce, R. (1993). Sound & Hearing: A Conceptual Introduction, Lawrence Erlbaum Associates. Elenius, K. and M. Blomberg (1982). 
Effects of emphasizing transitional or stationary parts of the speech signal in a discrete utterance recognition system. Proc. ICASSP'86. Ellermann, C., S. V. Even, C. Huang and L. Manganaro (1993). "Dragon systems' experiences in small to large vocabulary multi-lingual speech recognition applications." Proc. Eurospeech 3: 2077-2080. Ellis, D. (2001). TANDEM acoustic modeling in large-vocabulary recognition. Proc. ICASSP'2001, Salt Lake City. Ellis, D. and J. A. Bilmes (2000). Using mutual information to design feature combinations. Proc. ICSLP-2000, Beijing. Ellis, D. P. (2000a). Improved recognition by combining different features and different systems. Proc. AVIOS-2000, San Jose. Ellis, D. P. (2000b). Using mutual information to design feature combinations. Proc. ICSLP-2000, Beijing. Ellis, D. P., R. Singh and S. Sivadas (2001). Tandem acoustic modelling in large-vocabulary recognition. Proc. IEEE ICASSP'2001, Salt Lake City. Elman, J. L. (1990). "Finding structure in time." Cognitive Science 14(2): 179-211. Ferguson, J. D. (1980). Hidden Markov Analysis: An Introduction. Princeton, NJ, Institute of Defence Analyses. Fermin, C. D. (2000). Very Detailed Tutorial on the ear, http://www1.omi.tulane.edu/departments/pathology/fermin/Hearing.html. 6000. Forney, G. D. (1973). "The Viterbi algorithm." Proc. IEEE 61: 268-278. Furui, S. (1986a). Speaker independent isolated word recognition based on emphasised spectral dynamics. Proc. ICASSP'86, Tokyo, Japan. Furui, S. (1986b). Speaker independent isolated word recognition based on emphasized spectral dynamics. Proc. IEEE ICASSP'86, Tokyo, Japan. Furui, S. (1986c). "Speaker independent isolated word recognition using dynamic features of speech recognition." IEEE Trans. ASSP 34(2): 52-59. Furui, S. (1986d). "Speaker-independent isolated word recognition using dynamic features of speech spectrum." IEEE Trans. ASSP 34: 52-59. Furui, S. (1988). 
"A VQ-based preprocessor using cepstral dynamic features for speaker-independent large vocabulary word recognition." IEEE Trans. ASSP 36(7): 980-987. Garofolo, J. S., L. F. Lamel, W. M. Fisher, J. G. Fiscus, et al. (1990). DARPA TIMIT Acoustic-Phonetic Continuous Speech Corpus CD-ROM. NISTIR 4930. Gauvain, J.-L. and C.-H. Lee (1994). "Maximum a posteriori estimation for multivariate Gaussian mixture observations of Markov chains." IEEE Trans. SAP 2(2): 291-298. Ghitza, O., Ed. (1992). Auditory nerve representation as a basis for speech processing. Advances in Speech Signal Processing. New York, Marcel Dekker. Glasberg, B. R. and B. C. Moore (1990). "Derivation of auditory filter shapes from notched-noise data." Hearing Research 47: 103 - 108. Glinski, S. C. (1985). "On the use of vector quantization for connected digit recognition." The AT & T Tech. J. 64(5): 1033-1045. Gong, Y. (1995). "Speech recognition in noisy environments: A survey." Speech Communication 16: 261-291. Graps, A. (1995). "An introduction to wavelets." IEEE Computational Science and Engineering 2(2). Gravier, G., M. Sigelle and G. Chollet (1998). "Markov random field modelling for speech recognition." Australian J. of Intelligent Information Processing Systems 5(4): 245-251. Gray, J. and J. D. Markel (1976). "Distance measures for speech processing." IEEE Trans. ASSP 24(5): 380-391. Gray Jr, A. H. and J. D. Markel (1974). "A spectral flatness measure for studying the autocorrelation method of linear prediction of speech analysis." IEEE Trans. ASSP ASSP-22: 207 - 217. Gray, R. M. (1984). "Vector quantization." IEEE ASSP Magazine 1(2): 4 - 29. Gray, R. M., A. Buzo, J. Gray and Matsuyama (1980). "Distortion measures for speech processing." IEEE Trans. ASSP 68(4): 367-376. Greenberg, S., T. Arai and R. Silipo (1998). Speech derivation from exceedingly sparse spectral information. Proc. IEEE ICSLP'98. Greenwood, D. (1961). "Critical bandwidth and the frequency coordinates of the basilar membrane." J. 
Acoust. Soc. Am. 33: 1344 - 1356. Greenwood, D. (1990). "A cochlear frequency-position function for several species--29 years later." J. Acoust. Soc. Am. 87(6): 2592 - 2605. Gupta, V. N., M. Lennig and P. Mermelstein (1987a). Integration of acoustic information in a large vocabulary word recognizer. Proc. ICASSP'87, Dallas, USA. Gupta, V. N., M. Lennig and P. Mermelstein (1987b). Integration of acoustic information in a large vocabulary word recognizer. Proc. IEEE ICASSP'87. Haeb-Umbach, R. (1999a). Investigations on interspeaker variability in the feature space. Proc. ICASSP'99, Phoenix, Arizona. Haeb-Umbach, R. (1999b). Investigations on inter-speaker variability in the feature space. Proc. IEEE ICASSP'99, Arizona, USA. Hansen, J. H. and B. L. Pellom (1998). An effective quality evaluation protocol for speech enhancement algorithms. Proc. ICSLP'98, Sydney, Australia. Hanson, B. A. and T. H. Applebaum (1990a). Features for noise-robust speaker-independent word recognition. Proc. Int. Conf. Spoken Language Processing (ICSLP), Kobe, Japan. Hanson, B. A. and T. H. Applebaum (1990b). Robust speaker independent word recognition using static, dynamic, and acceleration features: experiments with Lombard and noisy speech. Proc. IEEE ICASSP'90, Albuquerque, NM. Hanson, B. A. and T. H. Applebaum (1990c). Robust speaker independent word recognition using static, dynamic, and acceleration features: experiments with Lombard and noisy speech. Proc. ICASSP'90, Albuquerque, NM. Hanson, B. A., T. H. Applebaum and J.-C. Junqua (1996a). Spectral dynamics for speech recognition under adverse conditions. Automatic Speech and Speaker Recognition Advanced Topics. C.-H. Lee, F. K. Soong and K. K. Paliwal. Hanson, B. A., T. H. Applebaum and J.-C. Junqua, Eds. (1996b). Spectral dynamics for speech recognition under adverse conditions. Automatic Speech and Speaker Recognition, Kluwer Academic Publishers. Harborg, E. (1990). Hidden Markov Models Applied to Automatic Speech Recognition. 
PhD Thesis, Norwegian Institute of Technology (Trondheim). Harrington, J. and S. Cassidy (1999). Techniques in Speech Acoustics, Kluwer Academic Publishers. Hartmann, W. M. (1998). Signals, Sound, and Sensation, Springer-Verlag. Hermansky, H. (1990a). "Perceptual linear predictive (PLP) analysis for speech." J. Acoust. Soc. Am. 87: 1738-1752. Hermansky, H. (1990b). "Perceptual linear predictive (PLP) analysis of speech." J. Acoust. Soc. Am. 87(4): 1738-1752. Hermansky, H. (1997). Should recognizers have ears? Proc. ESCA Tutorial and Research Workshop on Robust Speech Recognition for Unknown Communication Channels, France. Hermansky, H. (1999). Analysis in automatic recognition of speech. Speech Processing, Recognition and Artificial Neural Networks. G. Chollet, D. Di Benedetto, A. Esposito and M. Marinaro, Springer-Verlag: 115-137. Hermansky, H., D. Ellis and S. Sharma (2000). TANDEM connectionist feature extraction for conventional HMM systems. Proc. ICASSP2000, Istanbul. Hermansky, H. and N. Malayath (1998). Spectral basis functions from discriminant analysis. ICSLP'98, Sydney, Australia. Hermansky, H. and S. Sharma (1999). Temporal patterns (TRAPS) in ASR of noisy speech. Proc. ICASSP'99, Phoenix, AZ. Hertz, J., A. Krogh and R. G. Palmer (1991). Introduction To The Theory Of Neural Computing, Addison-Wesley Publishing Company. Hess, W. (1983). Pitch determination of speech signals. New York, Springer-Verlag. Hochberg, M., S. Renals, A. J. Robinson and G. D. Cook (1995). Recent improvements to the ABBOT large vocabulary CSR system. Proc. IEEE ICASSP'95. Huang, L.-S. and C.-H. Yang (2000). A novel approach to robust speech endpoint detection in car environment. Proc. IEEE ICASSP'2000, Istanbul, Turkey. Huang, X., A. Acero, F. Alleva, M. Y. Hwang, et al. (1995). "Microsoft Windows highly intelligent speech recognizer: Whisper." Proc. IEEE ICASSP'95 1: 93-97. Huang, X. D. (1992). Minimizing speaker variation effects for speaker-independent speech recognition. 
Proceedings of Speech and Natural Language Workshop. Huang, X. D., Y. Ariki and M. A. Jack (1990a). Hidden Markov Models for Speech Recognition. Edinburgh, Edinburgh University Press. Huang, X. D., Y. Ariki and M. A. Jack (1990b). Hidden Markov Models for Speech Recognition, Edinburgh University Press. Huang, X. D., H. W. Hon, M. Y. Hwang and K. F. Lee (1993). "A comparative study of discrete, semi-continuous, and continuous hidden Markov models." Computer Speech and Language 7: 359-368. Huang, X. D. and M. A. Jack, Eds. (1990). Semi-continuous hidden Markov models for speech signals. Readings in Speech Recognition, Morgan Kaufmann. Huang, X. D., K.-F. Lee, H. W. Hon and M. Y. Hwang (1991). Improved acoustic modeling for the SPHINX speech recognition system. Proc. IEEE ICASSP'91, Toronto, Canada. Humphries, J. J. and P. C. Woodland (1997). Using accent-specific pronunciation modelling for improved large vocabulary continuous speech recognition. Proc. Eurospeech. Hunt, M. J., S. M. Richardson, D. C. Bateman and A. Piau (1991). An investigation of PLP and IMELDA acoustic representations and of their potential for combination. Proc. IEEE ICASSP'91, Toronto, Canada. Huo, Q., C. Chan and C.-H. Lee (1995). "Bayesian adaptive learning of the parameters of hidden Markov models for speech recognition." IEEE Trans. SAP 3(5): 334-345. Hwang, M. Y. (1993). Subphonetic acoustic modeling for speaker independent continuous speech recognition, CMU. Hwang, M. Y. and X. D. Huang (1992). Subphonetic modeling with Markov states-senone. Proc. IEEE ICASSP'92. Itakura, F. (1975a). "Minimum prediction residual principle applied to speech recognition." IEEE Trans. ASSP 23(1): 67-72. Itakura, F. (1975b). "Minimum prediction residual principle applied to speech recognition." IEEE Trans. ASSP 23(1): 67-72. Janin, A., D. Ellis and N. Morgan (1999a). Multi-stream speech recognition: Ready for prime time? Proc. Eurospeech'99, Budapest. Janin, A., D. P. Ellis and N. Morgan (1999b). 
Multi-stream speech recognition ready for prime time? Proc. Eurospeech, Budapest. Jankowski Jr., C. R., H.-D. H. Vo and R. P. Lippmann (1995). "A comparison of signal processing front ends for automatic word recognition." IEEE Trans. Speech and Audio Processing 3(4): 286 - 293. Jankowski Jr., C. R. (1995). "A comparison of signal processing front ends for automatic word recognition." IEEE Trans. SAP 3(4): 286-293. Jayant, N. S. and P. Noll (1984). Digital coding of waveforms, Prentice Hall. Jelinek, F. (1976). "Continuous recognition by statistical methods." Proc. IEEE 64: 532-555. Jelinek, F. (1998). Statistical Methods for Speech Recognition, The MIT Press. Jelinek, F., L. R. Bahl and R. L. Mercer (1975). "Design of a linguistic statistical decoder for the recognition of continuous speech." IEEE Trans. on Information Theory 21: 250-256. Jordan, M. I. (1986). Attractor dynamics and parallelism in a connectionist sequential machine. Proceedings of The Eighth Annual Conference of the Cognitive Science Society, Amherst, MA. Jordan, M. I., Ed. (1989). Serial Order: A Parallel, Distributed Processing Approach. Hillsdale: Erlbaum. Juang, B. H. (1991). "Speech recognition in adverse environments." Computer Speech & Language 5: 275-294. Juang, B. H., L. R. Rabiner, S. E. Levinson and M. M. Sondhi (1985). Recent developments in the application of hidden Markov models to speaker independent isolated word recognition. Proc. IEEE ICASSP'85, Tampa. Kadambe, S. and G. F. Boudreaux-Bartels (1992). "Application of the wavelet transform for pitch detection of speech signals." IEEE Trans. Information Theory 38(2): 917 - 964. Kailath, T. (1967). "The divergence and Bhattacharyya distance measure in signal selection." IEEE Trans. COM 15: 52-60. Kasabov, N. K. (1996). Foundations of Neural Networks, Fuzzy Systems, and Knowledge Engineering. Cambridge, MIT Press. Kingsbury, B. E. D. (1999). 
Perceptually-inspired signal processing strategies for robust speech recognition in reverberant environments. PhD Thesis, University of California, Berkeley, CA. Kullback, S. (1959). Information Theory and Statistics. New York, Dover. Lamel, L. F., L. R. Rabiner, A. E. Rosenberg and J. G. Wilpon (1981). "An improved end points detector for isolated word recognition." IEEE Trans. ASSP 29(4): 777-785. Lee, C.-H. and J.-L. Gauvain (1993). "Speaker adaptation based on MAP estimation of HMM parameters." Proc. IEEE ICASSP'93: 652-655. Lee, K.-F. (1989). Automatic Speech Recognition, Kluwer Academic Publishers. Lee, K.-F., H. W. Hon, M. Y. Hwang, S. Mahajan, et al. (1989). The Sphinx speech recognition system. Proc. IEEE ICASSP'89, Glasgow. Levinson, S. E., L. R. Rabiner and M. M. Sondhi (1983). "An introduction to the application of the theory of probabilistic functions of a Markov process to automatic speech recognition." Bell Sys. Tech. J. 62(4): 1035-1074. Linde, Y., A. Buzo and R. M. Gray (1980). "An algorithm for vector quantizer design." IEEE Trans. Commun. 28(1): 84-95. Liu, F.-H., R. M. Stern, A. Acero and P. J. Moreno (1994). Environment normalization for robust speech recognition using direct cepstral comparison. Proc. IEEE ICASSP'94, Adelaide, Australia. Lockwood, P., C. Baillargeat, J. M. Gillot, J. Boudy, et al. (1991). "Noise reduction for speech enhancement in cars: Non-linear spectral subtraction - Kalman filtering." Proc. Eurospeech. Lockwood, P. and J. Boudy (1992). "Experiments with a non-linear spectral subtractor (NSS), hidden Markov models and the projection, for robust speech recognition in cars." Speech Communication 11(2-3): 215-228. Looney, C. G. (1997). Pattern Recognition Using Neural Networks, Oxford University Press. Lyon, R. F. (1997). All-pole models of auditory filtering. Diversity in auditory mechanics. Lewis. Singapore, World Scientific Publishing: 205 - 211. Makhoul, J. (1975). "Linear Prediction: A Tutorial Review." 
Proceedings of the IEEE 63(4): 561 - 580. Makhoul, J. and Cossel (1976). LPCW: An LPC vocoder with linear predictive spectral warping. Proc. ICASSP'76, Philadelphia, USA. Makhoul, J., S. Roucos and H. Gish (1985). "Vector quantization in speech coding." Proceedings of the IEEE 73(11): 1551-1588. Mallat, S. (1989). "A theory for multiresolution signal decomposition: the wavelet representation." IEEE Trans. PAMI 11(7): 674 - 693. Manning, C. D. (1999). Foundations of Statistical Natural Language Processing, The MIT Press. Markel, J. D. and A. H. Gray Jr. (1976). Linear prediction of speech, Springer-Verlag. McLachlan, G. J. and T. Krishnan (1997). The EM Algorithm and Extensions, John Wiley & Sons, Inc. Meilijson, I. (1989). "A fast improvement to the EM algorithm on its own terms." J. Royal Statist. Soc. Ser. B 51(1): 127-138. Mermelstein, P., Ed. (1976). Distance measures for speech recognition, psychological and instrumental. Pattern Recognition and Artificial Intelligence. Ming, J. and F. J. Smith (2000). A probabilistic union model for subband based robust speech recognition. Proc. IEEE ICASSP2000, Istanbul, Turkey. Misiti, M., Y. Misiti, G. Oppenheim and J. M. Poggi (1996). Wavelet Toolbox, Math Works Inc. Mokbel, C. and G. Chollet (1991). Speech recognition in adverse environments: speech enhancement and spectral transformations. Proc. IEEE ICASSP'91, Toronto, Canada. Moon, T. K. (1996). "The expectation maximization algorithm." Signal Processing 13(6): 47-60. Moon, T. K. and W. C. Stirling (2000). Mathematical Methods and Algorithms for Signal Processing, Prentice-Hall, Inc. Moore, B. C. (1977). Introduction to the Psychology of Hearing. Baltimore, Md., University Park Press. Moore, B. C. J. and B. R. Glasberg (1983). "Suggested formulae for calculating auditory filter bandwidths and excitation patterns." J. Acoust. Soc. Am. 74(3): 750 - 753. Morgan, N. and H. Bourlard (1995). 
"Continuous speech recognition: an introduction to the hybrid HMM/connectionist approach." Signal Processing Magazine May: 25-42. Murphy, K. M. (1996). Sound and auditory system lecture, http://www.science.mcmaster.ca/Psychology/psych2e03/lecture9/sound.audsys.lecture.html. 2000. Myers, C., L. R. Rabiner and A. E. Rosenberg (1980). "Performance tradeoffs in dynamic time warping algorithms for isolated word recognition." IEEE Trans. ASSP 28(6): 623-635. Nadas, A. (1983). "A decision theoretic formulation of a training problem in speech recognition and a comparison of training by unconditional versus conditional maximum likelihood." IEEE Trans. ASSP 31: 814-817. Ney, H. (1990). Experiments on mixture-density phoneme modeling for the speaker-independent 1000-word speech recognition DARPA task. Proc. IEEE ICASSP'90, Albuquerque, NM. Nguyen, L. and R. M. Schwartz (1997). Efficient 2-pass N-best decoder. Proc. Eurospeech. Normandin, Y. (1996). Maximum mutual information estimation of hidden Markov models. Automatic Speech and Speaker Recognition Advanced Topics. C.-H. Lee, F. K. Soong and K. K. Paliwal. Oppenheim, A. V. and R. W. Schafer (1989). Digital signal processing, Prentice Hall. O'Shaughnessy, D. (1987). Speech communication: human and machines. New York, USA, Addison Wesley. Owens, F. J. (1993a). Signal processing of speech, Macmillan. Owens, F. J. (1993b). Signal Processing of Speech, Macmillan Press Ltd., London. Padmanabhan, M., L. R. Bahl, D. Nahamoo and P. V. de Souza (1997). Decision tree based quantization of the feature space of a speech recognizer. Proc. Eurospeech. Paliwal, K. K. (1992). "Dimensionality reduction of the enhanced feature set for the HMM-based speech recognizer." Digital Signal Processing 2: 157-173. Papamichalis, P. (1987). Practical approaches to speech coding. New Jersey, USA, Prentice-Hall, Inc., Englewood Cliffs. Patterson, R. D. (1976). "Auditory filter shapes derived with noise stimuli." J. Acoust. Soc. Am. 59: 640 - 654. 
Patterson, R. D. (1994). "The sound of a sinusoid: Spectral models." J. Acoust. Soc. Am. 96(3): 1409-1418. Patterson, R. D., M. H. Allerhand and C. Giguere (1995). "Time-domain modelling of peripheral auditory processing: A modular architecture and a software platform." J. Acoust. Soc. Am. 98: 1890-1894. Patterson, R. D., I. Nimmo-Smith, D. L. Weber and R. Milroy (1982). "The deterioration of hearing with age: Frequency selectivity, the critical ratio, the audiogram and speech threshold." J. Acoust. Soc. Am. 72: 1788-1803. Paul, D. B. (1990). "Speech recognition using hidden Markov models." The Lincoln Laboratory Journal 3: 41-62. Pickles, J. O. (1988). An introduction to the physiology of hearing. New York, Academic Press. Picone, J. W. (1993). "Signal modeling techniques in speech recognition." Proceedings of the IEEE 81(9): 1215-1247. Polikar, R. (1999). The wavelet tutorial, http://www.public.iastate.edu/~rpolikar/WAVELETS. Poritz, A. B. (1988). "Hidden Markov models: A guided tour." Proc. ICASSP'88 1: 7-13. Poritz, A. B. and A. G. Richter (1986). On hidden Markov models in isolated word recognition. Proc. IEEE ICASSP'86, Tokyo. Price, R. (1958). "A useful theorem for nonlinear devices having Gaussian inputs." IRE Trans. Inf. Theory IT-4: 69-72. Pruzansky, S. (1964). "Talker recognition procedure based on analysis of variance." J. Acoust. Soc. Am. 36: 2041-2047. Purvis, M. (2001). Performance improvement due to multi-streaming. Information Science Department, University of Otago, New Zealand (Personal Communication). Quackenbush, S. R., T. P. Barnwell III and M. A. Clements (1988a). Objective Measures of Speech Quality, Prentice-Hall, Englewood Cliffs, NJ. Quackenbush, S. R., T. P. Barnwell III and M. A. Clements (1988b). Objective Measures of Speech Quality. NJ, Prentice Hall. Rabiner, L. (1977). "On the use of autocorrelation analysis for pitch detection." IEEE Trans. ASSP 25(1): 24-33. Rabiner, L., M. J. Cheng and A. E. Rosenberg (1976). 
"A comparative performance study of several pitch detection algorithms." IEEE Trans. ASSP 24(5): 399-417. Rabiner, L. and B.-H. Juang (1993a). Fundamentals of speech recognition. New Jersey, Prentice Hall. Rabiner, L. R. (1989). "A tutorial on hidden Markov models and selected applications in speech recognition." Proc. IEEE 77(2): 257-286. Rabiner, L. R. and B. H. Juang (1986). "An introduction to hidden Markov models." IEEE ASSP Magazine 3(1): 4-16. Rabiner, L. R. and B. H. Juang (1993b). Fundamentals of Speech Recognition, Prentice Hall, Englewood Cliffs, New Jersey. Rabiner, L. R. and B.-H. Juang (1993c). Fundamentals of Speech Recognition. Englewood Cliffs, NJ, Prentice Hall. Rabiner, L. R., B. H. Juang, S. E. Levinson and M. M. Sondhi (1985a). "Recognition of isolated digits using hidden Markov models with continuous mixture densities." AT & T Technical Journal 64(6): 1211-1234. Rabiner, L. R., B. H. Juang, S. E. Levinson and M. M. Sondhi (1985b). "Recognition of isolated digits using hidden Markov models with continuous mixture densities." The AT & T Tech. J. 64(6): 1211-1233. Rabiner, L. R., S. E. Levinson and M. M. Sondhi (1983). "On the application of vector quantization and hidden Markov models to speaker-independent isolated word recognition." The Bell System Technical Journal 62(4): 1075-1105. Rabiner, L. R., K. C. Pan and F. K. Soong (1984a). "On the performance of isolated word speech recognizers using vector quantization and temporal energy contours." The AT & T Tech. J. 63(7): 1245-1260. Rabiner, L. R. and M. R. Sambur (1975). "An algorithm for determining the endpoints of isolated utterances." Bell Syst. Tech. J. 54(2): 297-315. Rabiner, L. R., M. M. Sondhi and S. E. Levinson (1984b). "A vector quantizer combining energy and LPC parameters and its application to isolated word recognition." The AT & T Tech. J. 63(5): 721-735. Rabiner, L. R., J. G. Wilpon and B. H. Juang (1986). 
"A model-based connected-digit recognition system using either hidden Markov models or templates." Computer Speech & Language 1(1): 167-197. Ravishankar, M., R. Bisiani and E. Thayer (1997). "Sub-vector clustering to improve memory and speed performance of acoustic likelihood computation." Proc. Eurospeech 151-154. Rumelhart, D. E., G. E. Hinton and R. J. Williams, Eds. (1986). Learning Internal Representations by Error Propagation. Parallel Distributed Processing: Explorations in the Microstructure of Cognition. Cambridge, MA, MIT Press. Saito, S. and K. Nakata (1985). Fundamentals of speech signal processing, Academic Press. Sakoe, H. and S. Chiba (1978a). "Dynamic programming algorithm optimization for spoken word recognition." IEEE Trans. ASSP 26: 43-49. Sakoe, H. and S. Chiba (1978b). "Dynamic programming algorithm optimization for spoken word recognition." IEEE Trans. ASSP 26(1): 43-49. Sambur, M. R. (1975). "Selection of acoustic features for speaker identification." IEEE Trans. ASSP 23: 176-182. Schalkoff, R. J. (1992). Pattern Recognition: Statistical, Structural and Neural Approaches, John Wiley & Sons, Inc. Schroeder, M. R. (1999). Computer Speech: Recognition, Compression, Synthesis, Springer-Verlag. Scott, D. W. (1992). Multivariate Density Estimation. New York, Wiley. Seneff, S. (1986). A computational model for the peripheral auditory system: application to speech recognition research. Proc. IEEE ICASSP'86. Sharma, S., D. P. Ellis, S. Kajarekar, P. Jain, et al. (2000). Feature extraction using non-linear transformation for robust speech recognition on the Aurora database. Proc. IEEE ICASSP'2000, Istanbul. Shin, W.-H., B.-S. Lee, Y.-K. Lee and J.-S. Lee (2000). Speech/nonspeech classification using multiple features for robust endpoint detection. Proc. IEEE ICASSP'2000, Istanbul, Turkey. Silverman, B. W. (1986). Density Estimation for Statistics and Data Analysis, Chapman and Hall. Slaney, M. (1988). Lyon's cochlear model, Apple Computer Inc. Slaney, M. 
(1993). An efficient implementation of the Patterson-Holdsworth auditory filter bank, Apple Computer Inc. Slaney, M. and R. F. Lyon (1990). A perceptual pitch detector. ICASSP'1990. Somervuo, P. (1996). Speech recognition using context vectors and multiple feature streams. MSc Thesis, Helsinki University of Technology. Smith, L. S., Ed. (1997). Extracting features from the short-term time structure of cochlear filtered sound. Proceedings of the Fourth Neural Computation and Psychology Workshop: Connectionist Representations. London, Springer-Verlag. Solbach, L. (1998). An architecture for robust partial tracking and onset localization in single channel audio signal mixes. PhD thesis, Technical University of Hamburg-Harburg (Germany). Strang, G. and T. Nguyen (1996). Wavelets and Filter Banks. Wellesley, MA, Wellesley-Cambridge Press. Tchorz, J., K. Kasper, H. Reininger and B. Kollmeier (1997). On the interplay between auditory-based and locally recurrent neural networks for robust speech recognition in noise. Proc. Eurospeech'97, ESCA, Patras, Greece. Tchorz, J., M. Kleinschmidt and B. Kollmeier, Eds. (2001). Noise suppression based on neurophysiologically-motivated SNR estimation for robust speech recognition. Advances in Neural Information Processing Systems - 13, MIT Press. Tchorz, J. and B. Kollmeier (1999). A psychoacoustical model of the auditory periphery as front end for ASR. Proc. ASA/EAA/DEGA Joint Meeting on Acoustics, Berlin. Tibrewala, S. and H. Hermansky (1997). Multi-stream approach in acoustic modelling. Proc. LVCSR-Hub5 Workshop. Tou, J. T. and R. C. Gonzalez (1974). Pattern Recognition Principles, Addison-Wesley Publishing Company. Trentin, E. (2001). Robust Combination of Neural Networks and Hidden Markov Models for Speech Recognition. PhD Thesis, University of Florence (Italy). Tribolet, J. M., L. R. Rabiner and M. M. Sondhi (1979). "Statistical properties of an LPC distance measure." IEEE Trans. ASSP 27(5): 550-558. Tsakalidis, S., V. 
Digalakis and L. Neumeyer (1999). Efficient speech recognition using subvector quantization and discrete-mixture HMMs. Proc. IEEE ICASSP'99, Phoenix, Arizona. Umesh, S., L. Cohen and D. Nelson (1999). Fitting the Mel scale. Proc. IEEE ICASSP'99, Phoenix, Arizona. Valens, C. (1999). A really friendly guide to wavelets, http://perso.wanadoo.fr/polyvalens/clemens/clemens.html. Valtchev, V. (1995). Discriminative Methods in HMM-based Speech Recognition. PhD Thesis, University of Cambridge. Wand, M. P. and M. C. Jones (1995). Kernel Smoothing, Chapman & Hall. Warren, R. M. (1982). Auditory Perception: A New Synthesis, Pergamon Press. Wilpon, J. G. and L. R. Rabiner (1985). "A modified K-means clustering algorithm for use in isolated word recognition." IEEE Trans. ASSP 33(3): 587-594. Wilpon, J. G., L. R. Rabiner and T. B. Martin (1984). "An improved word-detection algorithm for telephone quality speech incorporating both syntactic and semantic constraints." AT & T Tech. J. 63(3): 479-498. Wolf, J. J. (1972). "Efficient acoustic parameters for speaker recognition." J. Acoust. Soc. Am. 51: 2044-2056. Wolpert, D. H. (1992). "Stacked generalization." Neural Networks 5(2): 241-259. Woodland, P. C., M. J. F. Gales and D. Pye (1996). Improving environmental robustness in large vocabulary speech recognition. Proc. ICASSP'96. Wu, C. F. J. (1983). "On the convergence properties of the EM algorithm." The Annals of Statistics 11(1): 95-103. Wu, S., B. Kingsbury, N. Morgan and S. Greenberg (1998). Performance improvements through combining phone- and syllable-length information in automatic speech recognition. Proc. ICSLP-98, Sydney. Young, S. (1996). "Large vocabulary continuous speech recognition: a review." Young, S., D. Kershaw, J. Odell, D. Ollason, et al. (2000). The HTK Book (version 3.0), Microsoft Corporation. Zhu, Q. and A. Alwan (2000). On the use of variable frame rate analysis in speech recognition. Proc. IEEE ICASSP'2000, Istanbul, Turkey. Zweig, G. and S. 
Russell (1998). "Probabilistic modeling with Bayesian networks for automatic speech recognition." Australian J. of Intelligent Information Processing Systems 54(4): 253-259. Zwicker, E. (1961). "Subdivision of the audible frequency range into critical bands." J. Acoust. Soc. Am. 33: 248.||en_NZ|