Top automatic speech recognition (ASR) methodologies

Speech Recognition

For human beings, speech is the most natural mode of communication. Speech recognition is the process of using a computer program to convert speech into a sequence of words. Speech recognition software allows users to interact with applications more easily and effectively using speech as an additional input mode.

Automatic speech recognition (ASR) has advanced to the point where more challenging applications are becoming a reality, thanks to the exponential growth of big data and computing power. Voice search and interactions with mobile devices (e.g., Siri on iPhone, Bing voice search on Windows Phone, and Google Now on Android), voice control in home entertainment systems (e.g., Kinect on Xbox), and speech-centric information processing applications based on ASR outputs are examples.

Typical applications include dictation systems, voice user interfaces, voice dialing, call routing, domestic appliance control, command and control, voice-enabled search, simple data entry, hands-free and eyes-free applications, and learning systems for disabled people.

ASR converts a speech signal to a sequence of words (i.e., spoken words to text) utilizing an algorithm implemented as a computer program. This post will explore the popular ASR methodologies that revolutionized speech recognition in recent years.

Acoustic-phonetic approach

The acoustic-phonetic approach is based on acoustic phonetics, which postulates that spoken language consists of a finite set of distinct phonetic units. The acoustic properties of these units are manifested in the speech signal, or its spectrum, over time. The approach begins with a spectral analysis of the speech, followed by feature detection, which converts the spectral measurements into features that describe the various phonetic units’ broad acoustic properties. The segmentation and labeling phase comes next: the speech signal is segmented into stable acoustic regions, and each region is assigned one or more phonetic labels, yielding a phoneme lattice characterization of the speech. In the final step, the approach attempts to determine a valid string of words from the phonetic label sequences produced by segmentation and labeling.

Pattern recognition approach

The pattern recognition approach involves two crucial steps: pattern training and pattern comparison. The approach employs a well-defined mathematical framework to create consistent speech pattern representations from a set of labeled training samples via a formal training algorithm, allowing for reliable pattern comparison. A speech pattern representation can take the form of a speech template or a statistical model (e.g., a hidden Markov model), and it can represent a sound smaller than a word, a word, or a phrase. In the pattern-comparison stage, the unknown speech is compared directly to each pattern learned in the training stage, and its identity is determined by the goodness of match. Over the last six decades, pattern matching has become the most popular method for speech recognition.

Artificial intelligence approach

The artificial intelligence approach is a hybrid that combines ideas and concepts from the acoustic-phonetic and pattern recognition approaches. In ASR, there are two main approaches to pattern matching: deterministic pattern matching using dynamic time warping (DTW) and stochastic pattern matching using hidden Markov models (HMMs).

In DTW, each class to be recognized is represented by one or more templates. Using more than one reference template per class may be preferable because it improves the modeling of pronunciation and speaker variability. During recognition, a distance is calculated between the observed speech sequence and each class pattern. Stretched and warped versions of the reference patterns are also used in the distance calculation to eliminate the impact of the duration mismatch between test and reference patterns. The recognized word corresponds to the path through the model that minimizes the total accumulated distance. Increasing the number of class pattern variants and loosening the warping constraints may improve DTW-based recognition performance, at the expense of storage space and computational demands. Modern systems prefer HMM-based pattern matching over DTW because of its better generalization properties and lower memory requirements.
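The DTW matching described above can be sketched in a few lines of Python. This is a minimal illustration rather than a production recognizer: it assumes each utterance is already a sequence of feature vectors, uses Euclidean distance as the local distance, and allows unconstrained warping (match, insertion, or deletion at every step). The function names `dtw_distance` and `recognize` and the toy one-dimensional "features" are made up for the example.

```python
import math


def dtw_distance(test, ref):
    """Accumulated DTW distance between two sequences of feature vectors."""
    n, m = len(test), len(ref)
    # cost[i][j] = best accumulated distance aligning test[:i] with ref[:j]
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = math.dist(test[i - 1], ref[j - 1])  # Euclidean local distance
            cost[i][j] = local + min(
                cost[i - 1][j],      # test frame repeated against the same ref frame
                cost[i][j - 1],      # ref frame repeated against the same test frame
                cost[i - 1][j - 1],  # both sequences advance
            )
    return cost[n][m]


def recognize(test, templates):
    """Return (label, distance) of the closest template.

    templates maps each class label to a list of reference sequences,
    so a class may have more than one template.
    """
    best_label, best_dist = None, math.inf
    for label, refs in templates.items():
        for ref in refs:
            d = dtw_distance(test, ref)
            if d < best_dist:
                best_label, best_dist = label, d
    return best_label, best_dist


# Toy usage: the test utterance is a time-stretched copy of the "yes" template.
templates = {
    "yes": [[(0.0,), (1.0,), (2.0,)]],
    "no": [[(5.0,), (5.0,), (4.0,)]],
}
test = [(0.0,), (0.0,), (1.0,), (2.0,), (2.0,)]
print(recognize(test, templates))
```

Because the test sequence only repeats frames of the "yes" template, warping absorbs the duration mismatch and the accumulated distance to that template is zero. The nested loops also show where the cost comes from: every extra template adds another full pass over the cost table.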

Generative learning approach

Hidden Markov models (HMMs) based on Gaussian mixture models (GMMs) are the most common generative learning approach in ASR. Long-established speech recognition systems rely on GMM-based HMMs to represent the sequential structure of speech signals. HMMs are used in speech recognition because a speech signal can be viewed as a piecewise stationary or short-time stationary signal: on a short time scale, speech can be approximated as a stationary process. For many stochastic purposes, speech can be thought of as a Markov model. Typically, each HMM state models a spectral representation of the sound wave using a Gaussian mixture.
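To make the structure concrete, here is a minimal sketch of how a GMM-based HMM scores an observation sequence with the forward algorithm. It is an illustration under simplifying assumptions, not a description of any particular toolkit: the features are scalars instead of spectral vectors, the probabilities are multiplied directly (real systems work in the log domain to avoid underflow), and the names `gaussian_pdf`, `gmm_likelihood`, and `forward_likelihood` are invented for this example.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a one-dimensional Gaussian."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def gmm_likelihood(x, components):
    """Emission likelihood of x under a mixture.

    components is a list of (weight, mean, variance); weights sum to 1.
    """
    return sum(w * gaussian_pdf(x, mu, var) for w, mu, var in components)


def forward_likelihood(observations, initial, transitions, state_gmms):
    """p(observations | model) computed with the forward algorithm.

    initial[s]        : probability of starting in state s
    transitions[r][s] : probability of moving from state r to state s
    state_gmms[s]     : GMM components attached to state s
    """
    n_states = len(initial)
    alpha = [
        initial[s] * gmm_likelihood(observations[0], state_gmms[s])
        for s in range(n_states)
    ]
    for x in observations[1:]:
        alpha = [
            gmm_likelihood(x, state_gmms[s])
            * sum(alpha[r] * transitions[r][s] for r in range(n_states))
            for s in range(n_states)
        ]
    return sum(alpha)


# Toy two-state left-to-right model: state 0 emits values near 0, state 1 near 3.
initial = [1.0, 0.0]
transitions = [[0.6, 0.4], [0.0, 1.0]]
state_gmms = [
    [(0.5, -0.5, 1.0), (0.5, 0.5, 1.0)],
    [(1.0, 3.0, 1.0)],
]
matching = forward_likelihood([0.1, -0.2, 2.9, 3.1], initial, transitions, state_gmms)
mismatch = forward_likelihood([3.0, 3.1, 0.0, -0.1], initial, transitions, state_gmms)
print(matching > mismatch)
```

The sequence that starts near 0 and ends near 3 follows the left-to-right state order and therefore scores higher than the reversed sequence. In a recognizer, one such model would be kept per word or sub-word unit and the best-scoring model would be selected.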

HMMs are popular because they can easily handle variable-length data sequences caused by changes in word order, speaking rate, and accent. The HMM-GMM approach has become the industry standard in ASR, but it has both benefits and drawbacks. On the one hand, HMM-based speech recognition systems can be trained automatically and are simple and computationally feasible. On the other hand, one of the most significant disadvantages of Gaussian mixture models is that they are statistically inefficient for modeling data that lie on or near a nonlinear manifold in the data space.

Discriminative learning

The paradigm of discriminative learning consists of either using a discriminative model or applying discriminative training to a generative model. In the 1990s, it was fashionable to use neural networks in the form of multilayer perceptrons (MLPs) with a softmax nonlinearity at the final layer. Because the MLP output can be interpreted as a conditional probability, feeding it into an HMM yields a good discriminative sequence model, known as a hybrid MLP-HMM.
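The hand-off from MLP to HMM can be sketched as follows. The softmax turns the final-layer activations into state posteriors; a common convention in hybrid systems, assumed here rather than taken from the text above, is to divide each posterior by the prior of its state so that the HMM receives a scaled likelihood in place of a GMM score. The function names and the example numbers are hypothetical.

```python
import math


def softmax(logits):
    """Convert final-layer activations into probabilities that sum to 1."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_likelihoods(logits, state_priors):
    """Posterior p(state | frame) divided by prior p(state), for each state.

    By Bayes' rule this is proportional to p(frame | state), so it can
    stand in for the emission likelihood of an HMM state (assumed convention).
    """
    posteriors = softmax(logits)
    return [p / prior for p, prior in zip(posteriors, state_priors)]


# Toy usage: three HMM states, one frame of MLP output (made-up numbers).
frame_logits = [2.0, 0.5, -1.0]
priors = [0.5, 0.3, 0.2]  # e.g. relative state frequencies in the training data
print(softmax(frame_logits))
print(scaled_likelihoods(frame_logits, priors))
```

Dividing by the prior matters when some states are much more frequent than others in the training data: without it, frequent states would be favored twice, once by the MLP and once by the HMM's own transition structure.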

Owing to the difficulty of training MLPs, this line of research later shifted in a new direction, in which the MLP simply produces a subset of feature vectors that are combined with the traditional features for use in the generative HMM. Neural networks trained by back-propagating error derivatives had already become a popular acoustic modeling approach for speech recognition in the late 1980s. Unlike HMMs, neural networks make no assumptions about the statistical properties of the features.

When used to estimate the probabilities of a speech feature segment, neural networks allow for natural and efficient discriminative training; however, despite their success in classifying short-time units like individual phones and isolated words, neural networks are rarely successful for continuous recognition tasks, owing to their inability to model temporal dependencies. Many simple or well-constrained problems have been successfully solved using shallow architectures. However, when dealing with more complex real-world applications involving human speech, their limited modeling and representational power can cause problems. As an alternative, neural networks can be used as a pre-processing tool for HMM-based recognition, such as feature transformation and dimensionality reduction.

Deep learning

Deep learning, sometimes referred to as unsupervised feature learning or representation learning, is a relatively new branch of machine learning. It is quickly becoming a standard technology for speech recognition, having successfully replaced Gaussian mixtures for speech recognition and feature coding on a large scale. Deep architectures can be grouped into three types. The first type, deep generative architectures, characterizes the high-order correlation properties of the data or the joint statistical distributions of the visible data and their associated classes. Bayes' rule can be used to make this type of architecture discriminative. Examples of this type include deep auto-encoders, deep Boltzmann machines, sum-product networks, and the original Deep Belief Network (DBN) together with its extension to the factored higher-order Boltzmann machine in its bottom layer.
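The step from a generative to a discriminative use of such a model is a direct application of Bayes' rule. Written out, with x denoting the visible data and c a class label (both symbols are notation chosen for this illustration), the class posterior follows from the joint distribution the generative model provides:

```latex
p(c \mid x) = \frac{p(x, c)}{\sum_{c'} p(x, c')}
            = \frac{p(x \mid c)\, p(c)}{\sum_{c'} p(x \mid c')\, p(c')}
```

Classification then amounts to choosing the class c with the largest posterior. The denominator is the same for every class, so comparing the joint probabilities p(x, c) is sufficient.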

The second type is the deep discriminative architecture. It provides discriminative power for pattern classification by characterizing the posterior distributions of class labels conditioned on the visible data. Examples include the deep-structured CRF, the tandem-MLP architecture, the deep convex or stacking network and its tensor version, and the detection-based ASR architecture. The third type, the deep hybrid architecture, also has discrimination as its ultimate goal, but it uses the results of a generative component to aid that discrimination.