Speech Synthesis Engine


    Speech Synthesis Engine TTS

    Speech synthesis is the artificial generation of human speech. Because it converts text into speech, it is also called text-to-speech (TTS).

    AI Suite’s speech synthesis engine learns from recordings of human voices and, given a sentence, generates speech whose tone and intonation resemble the voices it learned. In particular, the engine can build on models that have already been trained and additionally learn, in real time, from the voice files of specific individuals or domains, producing speech synthesis models that reflect the characteristics of each user. Each trained model is provided as an individual service through an End-Point, so the engine can be used in a variety of AI services.

    Main Features

    Natural and fast speech synthesis
    Many recent speech synthesis products use deep learning to make up for the shortcomings of earlier methods, but it remains difficult to satisfy both quality and performance. Saltlux's speech synthesis engine uses a Hybrid-Tacotron deep learning model: it uses the Tacotron model to secure performance during training, and applies Tacotron2 during transfer learning to ensure natural-sounding speech.
    The model is trained on voice data from real people, which gives it stable, high-quality output.
    In addition, the speech synthesis process is parallelized to reach speeds suitable for commercial services.
    Highly efficient personalized speech synthesis
    Saltlux's speech synthesis engine uses transfer learning: starting from an existing, well-trained model, the engine additionally learns a new speaker's speech data. With transfer learning, a speaker's voice can be synthesized from only about 30 minutes of training data.
    This makes the engine highly efficient, since it reduces the cost of large-scale voice transcription work.
    Domain-specific Korean notation conversion and speech synthesis
    Most Korean speech synthesis engines have difficulty synthesizing non-Hangul notations such as English words, numbers, and units. Since many service fields still mix Hangul with non-Hangul text, an engine must be able to generate pronunciations for a variety of non-Hangul notations in order to provide a smooth speech synthesis service in the field. Saltlux's speech synthesis engine provides features for converting non-Hangul notations and for generating phonetic symbols for English words.

    Main features and specifications

    The functions of the speech synthesis engine fall broadly into two areas: learning management and service management. The learning management area creates and manages new speech synthesis models by training on specific voice data. The service management area turns the trained speech synthesis models into services, then deploys and manages them so that other service applications can access and use them.


    < Speech synthesis engine system architecture >

    Learning data management
    This function registers and manages the learning data needed to train speech synthesis models. Learning data consists of a voice file and a transcription file that contains the spoken content as text. You can upload data sets that differ in length, volume, and speaker, and apply them selectively when training speech synthesis models.
    Dictionary management
    The engine uses language dictionaries to preprocess the input text during speech synthesis. You can register and manage words that need to be converted in advance, or words whose pronunciation varies by domain. Different types of dictionaries can be set up and applied for each speech service.
    Learning management
    You can train speech synthesis models through the learning management features. By selecting the learning data and adjusting the training parameters, you can generate a new speech synthesis model, or perform transfer learning on top of a previously trained model. You can check the progress of a model while it is training, check its quality by testing speech synthesis, keep a list of the versions of each trained model, and deploy a version to the services that need it.
    Service management
    Speech synthesis is commonly used by other service applications, which call the APIs provided by the speech synthesis engine when they need to convert text to speech. The service management function creates and manages a RESTful service interface for speech synthesis. Through it, you can set whether the service for a speech synthesis model is active, set the system resources (processes) available to it, and create an End-point that other services call to use the model.
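
    As a rough illustration of how another application might call such an End-point, the sketch below builds an HTTP POST request with Python's standard library. The URL, the JSON field name `text`, and the assumption that the response body contains audio bytes are all hypothetical. The actual request and response format is defined by the End-point created in service management.

```python
import json
import urllib.request


def build_synthesis_request(endpoint_url: str, text: str) -> urllib.request.Request:
    """Build a POST request carrying the sentence to synthesize.

    The payload shape {"text": ...} is an assumption for illustration only.
    """
    payload = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        endpoint_url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def synthesize(endpoint_url: str, text: str, timeout: float = 10.0) -> bytes:
    """Send the request and return the raw response body.

    Assumes (hypothetically) that the End-point answers with audio bytes.
    """
    request = build_synthesis_request(endpoint_url, text)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()
```

    A caller would pass the End-point address issued for a deployed model, for example `synthesize("http://localhost:8080/tts", "안녕하세요")`, where the address shown is a placeholder.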

    Main technology

    Saltlux uses speaker adaptation based on transfer learning to provide personalized speech synthesis. Transfer learning is a method in which a new model for a similar problem is trained starting from an existing, well-trained model. Because transfer learning makes training the new model more efficient, fine-tuning the weights of the already trained model can achieve high performance even with a small amount of data. Starting from model A, which has been trained well on a sufficient amount of data, transfer learning can efficiently learn voice B, for which only a small amount of data is available.
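
    The idea of starting from model A to learn voice B can be shown with a toy example. The sketch below is not a speech model and not Saltlux's method: it fits a two-parameter linear model with plain gradient descent, and the data for "speaker A" and "speaker B" are made-up numbers. It only demonstrates that, with the same small training budget on B, starting from A's weights ends closer to B than starting from zero.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (input x, target y)


def mse(w: float, b: float, data: List[Sample]) -> float:
    """Mean squared error of the linear model y = w * x + b."""
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)


def train(w: float, b: float, data: List[Sample], steps: int,
          lr: float = 0.1) -> Tuple[float, float]:
    """Run full-batch gradient descent from the given starting weights."""
    n = len(data)
    for _ in range(steps):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# "Speaker A": plenty of data, generated from y = 2.0x + 1.0 (made up).
speaker_a = [(i / 10, 2.0 * (i / 10) + 1.0) for i in range(21)]
# "Speaker B": only five samples, from a similar relation y = 2.2x + 0.8.
speaker_b = [(x, 2.2 * x + 0.8) for x in (0.0, 0.5, 1.0, 1.5, 2.0)]

# Pre-train model A on the large data set.
pretrained = train(0.0, 0.0, speaker_a, steps=500)

# Same small budget on B: fine-tune from A's weights vs. start from zero.
fine_tuned = train(*pretrained, speaker_b, steps=20)
from_scratch = train(0.0, 0.0, speaker_b, steps=20)

loss_fine_tuned = mse(*fine_tuned, speaker_b)
loss_from_scratch = mse(*from_scratch, speaker_b)
```

    In this toy setting `loss_fine_tuned` comes out lower than `loss_from_scratch`, because A's weights already sit near B's. The benefit shrinks as the two data distributions drift apart.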

    If the data distributions of the pre-trained model A and the model B to be trained additionally differ greatly, speech synthesis performance drops sharply. To address this data mismatch, Saltlux adds a semi-supervised learning step that pre-trains only a portion of the speech synthesis network on tens of hours of data comprising multiple human voices. This semi-supervised learning uses voice data only, without transcription data, which is similar in principle to the way humans learn to speak before they learn to write. By applying semi-supervised learning together with transfer learning, Saltlux dramatically reduces training time and maximizes speech synthesis performance. As a result, Saltlux’s speech synthesis engine can generate high-quality speech after learning a specific person’s voice from 30 minutes of voice data.

    Main engine screen

    < Two screenshots of the main engine screen >