Speech Synthesis Engine
Speech synthesis is the artificial generation of human speech. Because it converts text into speech, it is also called text-to-speech (TTS).
AI Suite’s speech synthesis engine learns from recordings of human voices and, given a sentence, generates speech with a tone and intonation similar to the voices it has learned. In particular, it can build speech synthesis models that reflect the characteristics of each user by learning, in real time, the voice files of specific individuals or domains on top of models that have already been trained. The engine exposes each trained model as an individual service through an endpoint, so it can be used in a variety of AI services.
Natural and fast speech synthesis
The engine is trained on voice data from real people, which gives stable voice quality and high-quality synthesis.
In addition, the synthesis process is parallelized to reach speeds at which commercial services can be implemented.
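The text does not say how the synthesis process is parallelized. One possible reading, sketched here purely as an assumption, is to split the input into sentences and synthesize them concurrently while keeping the original order. The function names are hypothetical, and synthesize_sentence is a placeholder that returns the encoded text instead of audio.

```python
import re
from concurrent.futures import ThreadPoolExecutor


def split_sentences(text):
    """Split text after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def synthesize_sentence(sentence):
    """Placeholder for a real TTS call (assumption): returns fake 'audio' bytes."""
    return sentence.encode("utf-8")


def synthesize_parallel(text, max_workers=4):
    """Synthesize each sentence in a worker thread and join the results in order."""
    sentences = split_sentences(text)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in input order, whichever worker finishes first.
        chunks = list(pool.map(synthesize_sentence, sentences))
    return b"".join(chunks)


if __name__ == "__main__":
    print(synthesize_parallel("Hello there. How are you? Fine!"))
```

Sentence-level splitting keeps each unit independent, so the pieces can be concatenated in order once all workers finish.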
Highly efficient personalized speech synthesis
This approach is highly efficient because it reduces the cost of large-scale voice transcription work.
Domain-specific Korean notation conversion and speech synthesis
Main features and specifications
The functions of the speech synthesis engine fall into two broad areas: learning management and service management. The learning management area creates and manages new speech synthesis models by learning from specific voice data. The service management area turns a trained speech synthesis model into a service, then distributes and manages it so that other service applications can access and use it.
< Speech recognition engine system construction >
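The split between learning management and service management described above can be pictured with a small sketch. This is not the engine's actual API: every class, method, and path name below is hypothetical, training is reduced to recording metadata, and synthesis returns a tagged string instead of audio.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class VoiceModel:
    name: str
    base_model: Optional[str]  # model this one was adapted from, if any
    num_files: int             # number of voice files used for learning


class LearningManager:
    """Creates and keeps track of speech synthesis models (hypothetical)."""

    def __init__(self) -> None:
        self.models: Dict[str, VoiceModel] = {}

    def train(self, name: str, voice_files: List[str],
              base_model: Optional[str] = None) -> VoiceModel:
        if not voice_files:
            raise ValueError("at least one voice file is required")
        if base_model is not None and base_model not in self.models:
            raise KeyError(f"unknown base model: {base_model}")
        model = VoiceModel(name, base_model, len(voice_files))
        self.models[name] = model
        return model


class ServiceManager:
    """Publishes trained models as endpoints (hypothetical)."""

    def __init__(self, learning: LearningManager) -> None:
        self.learning = learning
        self.endpoints: Dict[str, VoiceModel] = {}

    def deploy(self, model_name: str, path: str) -> None:
        self.endpoints[path] = self.learning.models[model_name]

    def synthesize(self, path: str, text: str) -> str:
        model = self.endpoints[path]
        return f"[{model.name}] {text}"  # stand-in for generated audio


if __name__ == "__main__":
    learning = LearningManager()
    learning.train("base", ["a.wav", "b.wav"])
    learning.train("user1", ["u.wav"], base_model="base")
    service = ServiceManager(learning)
    service.deploy("user1", "/tts/user1")
    print(service.synthesize("/tts/user1", "hello"))
```

Keeping the two roles in separate objects mirrors the description: a model exists once it has been learned, but other applications can reach it only after it has been deployed to an endpoint.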
Learning data management
Saltlux uses speaker adaptation based on transfer learning to provide personalized speech synthesis. Transfer learning trains a new model for a similar problem by starting from an existing, well-trained model. Because this makes training the new model more efficient, fine-tuning the weights of the already trained model can achieve high performance even with a small amount of data. Starting from model A, which has been trained well on a sufficient amount of data, transfer learning can efficiently learn voice B, for which only a small amount of data is available.
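The idea of starting from a well-trained model A and fine-tuning on a small dataset B can be illustrated with a toy example. This is only a sketch under strong assumptions: a two-parameter linear model stands in for a speech synthesis network, the data is synthetic, and none of it reflects Saltlux's actual implementation.

```python
# Toy illustration of transfer learning. A one-variable linear model stands in
# for a speech synthesis network; all data and names are hypothetical.

def mse(params, data):
    """Mean squared error of y = w * x + b over (x, y) pairs."""
    w, b = params
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)


def train(params, data, steps, lr=0.05):
    """Plain gradient descent on the MSE, starting from `params`."""
    w, b = params
    n = len(data)
    for _ in range(steps):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / n
        w, b = w - lr * grad_w, b - lr * grad_b
    return w, b


# "Voice A": plenty of data, generated from y = 2x + 1.
data_a = [(i / 10, 2 * (i / 10) + 1) for i in range(20)]
# "Voice B": only three samples, from the nearby relation y = 2.2x + 0.8.
data_b = [(x, 2.2 * x + 0.8) for x in (0.0, 1.0, 2.0)]

pretrained = train((0.0, 0.0), data_a, steps=2000)  # long training on A
fine_tuned = train(pretrained, data_b, steps=20)    # short adaptation to B
from_scratch = train((0.0, 0.0), data_b, steps=20)  # same budget, no transfer

print("fine-tuned loss on B:  ", mse(fine_tuned, data_b))
print("from-scratch loss on B:", mse(from_scratch, data_b))
```

With the same 20-step budget, the fine-tuned model ends with a lower loss on B than the model trained from scratch, because its starting weights are already close to what B requires.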
If the data distribution of the pre-trained model A differs greatly from that of model B, which is trained additionally during transfer learning, speech synthesis performance drops sharply. To address this data mismatch, Saltlux adds a semi-supervised learning step that pre-trains only a portion of the speech synthesis network on tens of hours of data containing multiple human voices. Semi-supervised learning here uses voice data only, without transcriptions, which is similar in principle to the way humans learn to speak before they learn to write. By applying this semi-supervised model together with transfer learning, Saltlux greatly reduces training time and improves speech synthesis performance. Saltlux’s speech synthesis engine can now generate high-quality voices from 30 minutes of a specific person’s voice data.