Speech Recognition Engine (STT)
Speech Recognition Engine
Speech recognition, also called STT (Speech-to-Text), is the process by which a computer interprets speech uttered by a person and converts it into text data.
The speech recognition engine is a system that provides speech recognition to the various services built on a speech interface.
In particular, AI Suite’s speech recognition engine is pre-trained on a large amount of data, and it uses transfer learning to adapt quickly to a specific domain, which makes it possible to provide high-quality services with only a small amount of domain data.
< Dialogue interface-based service construction >
Deep neural network-based speech recognition learning
< Deep neural network-based speech recognition learning summary >
Large-capacity multilingual speech DB learning
Main functions and specifications
The speech recognition engine can be divided into a RESTful speech recognition service and learning management features for acoustic models and language models. The speech recognition service produces recognition results by preprocessing the input speech data, extracting features, converting them to text through the models, and correcting the result. Learning management takes paired speech-text data as training data and trains the acoustic models and language models.
< Speech recognition engine block diagram >
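The four stages of the speech recognition service (preprocessing, feature extraction, text conversion through a model, and result correction) can be pictured as a chain of functions. The sketch below only illustrates that structure. Every function name is hypothetical, and each stage body is a toy placeholder (frame energy instead of real acoustic features, a threshold instead of acoustic and language models), not the engine's actual processing.

```python
from typing import List, Sequence


def preprocess(samples: Sequence[float]) -> List[float]:
    """Toy preprocessing: remove the DC offset and peak-normalize."""
    mean = sum(samples) / len(samples)
    centered = [s - mean for s in samples]
    peak = max(abs(s) for s in centered) or 1.0  # avoid dividing by zero on silence
    return [s / peak for s in centered]


def extract_features(samples: Sequence[float], frame_size: int = 4) -> List[float]:
    """Toy feature extraction: mean energy of each non-overlapping frame."""
    return [
        sum(x * x for x in samples[i:i + frame_size]) / frame_size
        for i in range(0, len(samples) - frame_size + 1, frame_size)
    ]


def convert_to_text(features: Sequence[float], threshold: float = 0.1) -> str:
    """Placeholder for the acoustic and language models: label each frame."""
    return " ".join("speech" if f > threshold else "sil" for f in features)


def correct_result(text: str) -> str:
    """Placeholder result correction: drop the silence tokens."""
    return " ".join(token for token in text.split() if token != "sil")


def run_pipeline(samples: Sequence[float]) -> str:
    """Run the four stages in the order the text describes."""
    if not samples:
        return ""
    return correct_result(convert_to_text(extract_features(preprocess(samples))))
```

Keeping each stage as a separate function mirrors the description in the text: a retrained acoustic or language model would replace only the text conversion stage, leaving the others unchanged.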
Speech recognition service
The speech recognition engine exposes its speech recognition feature through a RESTful API endpoint.
Services that use this API can access it regardless of their system environment, and the features it provides allow a variety of speech recognition-based AI services to be implemented.
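As an illustration of what calling such a RESTful endpoint might look like from a client, here is a minimal Python sketch that uses only the standard library. The endpoint URL, the `language` query parameter, the bearer-token header, the `audio/wav` content type, and the `text` field in the response are all assumptions made for the example. The source does not document the actual API, so the real service may differ on every one of these points.

```python
import json
import urllib.parse
import urllib.request

# Hypothetical endpoint: the real URL and path are not given in the source.
STT_ENDPOINT = "https://stt.example.com/v1/recognize"


def build_stt_request(audio_bytes: bytes, api_key: str,
                      language: str = "en-US",
                      endpoint: str = STT_ENDPOINT) -> urllib.request.Request:
    """Build (but do not send) a POST request carrying raw audio."""
    query = urllib.parse.urlencode({"language": language})  # assumed parameter
    headers = {
        "Content-Type": "audio/wav",            # assumed audio format
        "Authorization": f"Bearer {api_key}",   # assumed auth scheme
        "Accept": "application/json",
    }
    return urllib.request.Request(
        f"{endpoint}?{query}", data=audio_bytes, headers=headers, method="POST"
    )


def transcribe(audio_bytes: bytes, api_key: str, timeout: float = 10.0) -> str:
    """Send the request and return the recognized text."""
    request = build_stt_request(audio_bytes, api_key)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        payload = json.loads(response.read().decode("utf-8"))
    return payload["text"]  # assumed response field
```

Because the client needs nothing beyond an HTTP library, the same call can be made from any platform or language.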
Acoustic model adaptive learning
Language model learning
Providing a high-quality speech recognition model
The following table shows the results of the adaptive learning-based speech recognition quality evaluation. Corr (Correct) is the number of correctly recognized syllable units; Acc (Accuracy) is the number of correct answers after accounting for insertion and deletion errors; H (hit) is the number of correctly recognized words; D (deletion) is the number of cases recognized as silence; S (substitution) is the number of cases recognized as a different word; and I (insertion) is the number of cases in which silence was recognized as another syllable.
For the baseline before adaptive learning, both the acoustic and language models had a correct answer rate of 70% or less; after adaptive learning, the rate rose to 97% for both models.
The developed speech recognition technology is used in a variety of settings, including speech recognition, text analysis, and the building of callbot systems for chatbots and call centers.
< Quality evaluation of adaptive learning-based speech recognition technology >
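The source does not give formulas for Corr and Acc. A common convention (the one used by HTK-style scoring tools) derives both from the H, D, S, and I counts, and the sketch below assumes that convention: the number of reference units is N = H + D + S, Corr = H / N, and Acc = (H - I) / N, each expressed as a percentage. The function name is hypothetical, and the evaluation reported here may have been computed differently.

```python
from typing import Tuple


def recognition_scores(hits: int, deletions: int,
                       substitutions: int, insertions: int) -> Tuple[float, float]:
    """Return (Corr, Acc) in percent under the assumed HTK-style definitions."""
    total = hits + deletions + substitutions  # N: units in the reference transcript
    if total == 0:
        raise ValueError("reference transcript is empty")
    corr = 100.0 * hits / total                  # ignores insertion errors
    acc = 100.0 * (hits - insertions) / total    # also penalizes insertions
    return corr, acc
```

Under these definitions Acc can never exceed Corr, and the gap between them shows how many spurious units the recognizer inserted. Deletions and substitutions lower both scores because they reduce H.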