Speech recognition Engine


    Speech Recognition Engine (STT)

    Speech recognition is the process by which a computer interprets speech uttered by a person and converts it into text data; it is also called STT (Speech-to-Text).
    The speech recognition engine is a system that provides speech recognition to services built on a speech interface.
    In particular, AI Suite's speech recognition engine is pre-trained on a large amount of data and uses transfer learning to adapt quickly to a specific domain, so it can provide high-quality service with only a small amount of domain data.


    < Dialogue interface-based service construction >

    Main Features

    Deep neural network-based speech recognition learning
    AI Suite's speech recognition engine is based on adaptive learning of acoustic models, refined through deep learning. Its baseline acoustic model uses Long Short-Term Memory (LSTM), which is more advanced than the commonly used HMM (Hidden Markov Model) or conventional fully connected DNN (Deep Neural Network) acoustic models.

    < Deep neural network-based speech recognition learning summary >

    Large-capacity multilingual speech DB learning
    Saltlux maintains its own multi-speaker speech data covering different situations in various languages. AI Suite's speech recognition engine is built on high-quality multilingual baseline speech recognition models trained with this data, which makes it possible to provide high-quality speech recognition services.

    Main features and specifications

    The speech recognition engine can be divided into a RESTful speech recognition service and learning management features for acoustic models and language models. The speech recognition service produces recognition results by preprocessing the input speech data, extracting features, converting them to text through the models, and correcting the result. Learning management takes paired speech and text as learning data and trains the acoustic and language models.
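
    The first two stages of the service (preprocessing and feature extraction) can be illustrated with a minimal sketch. The source does not say which preprocessing or features the engine uses; pre-emphasis and per-frame log energy are common stand-ins chosen here as assumptions, and the frame sizes assume 16 kHz audio (400 samples = 25 ms, 160 samples = 10 ms).

```python
import math

def pre_emphasize(samples, coeff=0.97):
    """Preprocessing (assumed): y[n] = x[n] - coeff * x[n-1], boosts high frequencies."""
    if not samples:
        return []
    out = [samples[0]]  # the first sample has no predecessor
    for prev, cur in zip(samples, samples[1:]):
        out.append(cur - coeff * prev)
    return out

def frame_log_energy(samples, frame_len=400, hop=160):
    """Feature extraction (assumed): one log-energy value per overlapping frame."""
    feats = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = samples[start:start + frame_len]
        energy = sum(x * x for x in frame)
        feats.append(math.log(energy + 1e-10))  # small offset avoids log(0)
    return feats
```

    A real engine would pass richer features (for example, filter-bank or MFCC vectors) to the acoustic model; the sketch only shows the shape of the data flow from samples to per-frame features.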


    < Speech recognition engine block diagram >

    Speech recognition service
    Speech recognition is invoked when other service applications that require speech recognition capabilities call the speech recognition engine's API.
    The speech recognition engine provides its speech recognition feature through a RESTful API endpoint.
    Because the feature is exposed this way, applications can access it regardless of their system environment, and the features provided allow various speech recognition-based AI services to be implemented.
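
    A client call to such an endpoint might look like the sketch below. The URL, the header name, the raw-audio request body, and the JSON response are all hypothetical, since the source does not document the actual API; only the Python standard library is used.

```python
import json
import urllib.request

# Hypothetical endpoint: the real URL and path are not given in the source.
STT_URL = "http://localhost:8080/stt/recognize"

def build_stt_request(audio_bytes, url=STT_URL, sample_rate=16000):
    """Build a POST request carrying raw audio to the (assumed) STT endpoint."""
    return urllib.request.Request(
        url,
        data=audio_bytes,
        method="POST",
        headers={
            "Content-Type": "application/octet-stream",
            "X-Sample-Rate": str(sample_rate),  # assumed header name
        },
    )

def recognize(audio_bytes, url=STT_URL, timeout=30):
    """Send the audio and decode the response, assumed here to be JSON."""
    req = build_stt_request(audio_bytes, url)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

    Because the interface is plain HTTP, the same call can be made from any platform or language that has an HTTP client.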
    Acoustic model adaptive learning
    Speech recognition turns speech data into text using pre-trained models, which can be divided into an acoustic model (AM) and a language model (LM). The AM learns by statistically modeling the acoustic properties of speech data, and it supports adaptive learning, in which new speech characteristics are added on top of a baseline model provided by the speech recognition engine. Adaptive learning can be performed on an existing baseline model by using recorded speech and its transcriptions, collected in a specific field (a call center, for example), as learning data. The acoustic model trained with Long Short-Term Memory (LSTM) delivers better speech recognition performance than HMM- and DNN-based models and can provide speech recognition specialized for that field.
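
    To show what distinguishes an LSTM from a fully connected network, here is a minimal sketch of one step of a standard LSTM cell, reduced to scalar inputs and states for readability. It is a textbook formulation, not the engine's implementation; the weight layout `w` is an illustrative assumption.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, w):
    """One step of a scalar LSTM cell.

    w maps each gate name ("i", "f", "o", "g") to a tuple (w_x, w_h, bias).
    """
    def pre(gate):
        w_x, w_h, b = w[gate]
        return w_x * x + w_h * h_prev + b

    i = sigmoid(pre("i"))     # input gate: how much new information to write
    f = sigmoid(pre("f"))     # forget gate: how much old cell state to keep
    o = sigmoid(pre("o"))     # output gate: how much cell state to expose
    g = math.tanh(pre("g"))   # candidate cell value
    c = f * c_prev + i * g    # new cell state
    h = o * math.tanh(c)      # new hidden state
    return h, c
```

    The cell state `c` is carried from one audio frame to the next, which lets the model use context across time; a fully connected DNN scores each input window independently.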
    Language model learning
    By reflecting the characteristics of the linguistic expressions used in a specific field (finance, call centers, etc.), language model learning provides speech recognition tailored to the service and improves its quality. The language model learns the regularities, such as vocabulary choice and sentence structure, of the text that speech is converted into. It can be learned statistically from a large collected corpus, or defined as arbitrary rules using a formal language.
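
    The statistical approach can be illustrated with a bigram model estimated from counts. This is a minimal sketch of the general idea, not the engine's language model; the whitespace tokenization and the example sentences are made up for illustration.

```python
from collections import Counter

def train_bigram(sentences):
    """Count histories and bigrams over whitespace-tokenized sentences."""
    unigrams, bigrams = Counter(), Counter()
    for s in sentences:
        tokens = ["<s>"] + s.split() + ["</s>"]
        unigrams.update(tokens[:-1])             # each token that has a successor
        bigrams.update(zip(tokens, tokens[1:]))  # adjacent token pairs
    return unigrams, bigrams

def bigram_prob(unigrams, bigrams, prev, word):
    """Maximum-likelihood estimate of P(word | prev); 0.0 for unseen histories."""
    if unigrams[prev] == 0:
        return 0.0
    return bigrams[(prev, word)] / unigrams[prev]

corpus = ["check my balance", "check my card", "cancel my card"]
uni, bi = train_bigram(corpus)
p_card = bigram_prob(uni, bi, "my", "card")  # 2 of the 3 occurrences of "my"
```

    Training on a corpus from the target field shifts these probabilities toward the phrases that field actually uses, which is how a domain language model helps the recognizer choose between acoustically similar words.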
    Providing high-quality speech recognition models
    The acoustic and language models provided by the speech recognition engine include a baseline model that achieves high performance through training on 1,200 hours of Korean data.

    Main Features

    The following table shows the results of the quality evaluation of adaptive learning-based speech recognition. Corr (Correct) is the number of correctly recognized syllable units; Acc (Accuracy) is the number of correct answers after accounting for insertion and deletion errors; H (hit) is the number of correctly recognized words; D (deletion) is the number of cases recognized as silence; S (substitution) is the number of cases recognized as a different word; and I (insertion) is the number of cases in which silence was recognized as another syllable.
    For the baseline before adaptive learning, both the acoustic and language models had a correct answer rate of 70% or less; after adaptive learning, the rate rose to 97% for both models. The developed speech recognition technology is used in various settings, such as speech recognition, text analysis, and the construction of chatbots and call bot systems for call centers.
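
    The source does not give formulas for Corr and Acc. A minimal sketch, assuming the widely used HTK-style scoring convention (Corr = H / N and Acc = (H - I) / N, with N = H + D + S reference units), is shown below; the counts in the usage line are invented for illustration and are not taken from the evaluation table.

```python
def recognition_scores(h, d, s, i):
    """Return (Corr, Acc) in percent from hit, deletion, substitution, and insertion counts."""
    n = h + d + s  # total units in the reference transcript
    if n == 0:
        raise ValueError("reference transcript is empty")
    corr = 100.0 * h / n        # ignores insertions
    acc = 100.0 * (h - i) / n   # penalizes insertions as well
    return corr, acc

corr, acc = recognition_scores(h=90, d=4, s=6, i=2)  # made-up counts
```

    Under this convention Acc is never higher than Corr, because insertion errors are subtracted only in Acc.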


    < Quality evaluation of adaptive learning-based speech recognition technology >

    Main engine screen

    < Main engine screenshots (two images) >