Speech-to-speech translation (S2ST) is key to breaking down language barriers between people all over the world. Automatic S2ST systems are typically composed of a cascade of speech recognition, machine translation, and speech synthesis subsystems. However, such cascade systems may suffer from longer latency, loss of information (especially paralinguistic and non-linguistic information), and compounding errors between subsystems.
In 2019, Google AI introduced Translatotron, the first model able to directly translate speech between two languages. This direct S2ST model could be trained efficiently end-to-end and had the unique capability of retaining the source speaker’s voice (which is non-linguistic information) in the translated speech. However, despite its ability to produce natural-sounding, high-fidelity translated speech, it still underperformed a strong baseline cascade S2ST system.
In “Translatotron 2: Robust direct speech-to-speech translation”, Google describes an improved version of Translatotron that significantly improves performance while also applying a new method for transferring the source speakers’ voices to the translated speech. The revised approach to voice transfer succeeds even when the input speech contains multiple speakers speaking in turns, while also reducing the potential for misuse and better aligning with Google’s AI Principles. Experiments on three different corpora consistently showed that Translatotron 2 outperforms the original Translatotron by a large margin in translation quality, speech naturalness, and speech robustness.
Translatotron 2 is composed of four major components:
- A speech encoder
- A target phoneme decoder
- A target speech synthesizer
- An attention module that connects them together.
The combination of the encoder, the attention module, and the decoder is similar to a typical direct speech-to-text translation (ST) model. The synthesizer is conditioned on the output from both the decoder and the attention.
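The data flow described above can be sketched in a few lines. This is a minimal toy illustration, not the actual architecture: all function bodies, dimensions, and names (`speech_encoder`, `attend`, `phoneme_decoder_step`, `synthesizer`) are stand-ins. The structural point it shows is real, though: the synthesizer is conditioned on both the phoneme decoder's output and the attention context.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions; all shapes are illustrative, not from the paper.
T_SRC, D = 20, 8      # source frames, hidden size
N_PHON = 6            # number of target phonemes to decode

def speech_encoder(spectrogram):
    """Encode source speech into hidden states (stand-in: random linear map)."""
    W = rng.standard_normal((spectrogram.shape[1], D))
    return spectrogram @ W

def attend(query, keys):
    """Single-head dot-product attention; returns a context vector."""
    scores = keys @ query
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ keys

def phoneme_decoder_step(context):
    """Predict one target phoneme id from the attention context (stand-in)."""
    return int(np.argmax(context[:N_PHON])), context

def synthesizer(decoder_outputs, contexts):
    """Synthesize spectrogram frames conditioned on BOTH the phoneme
    decoder's output and the attention context (the acoustic information)."""
    cond = np.concatenate([decoder_outputs, contexts], axis=1)
    W = rng.standard_normal((cond.shape[1], 16))
    return cond @ W  # mel-like frames

src = rng.standard_normal((T_SRC, 10))
enc = speech_encoder(src)

phonemes, dec_states, contexts = [], [], []
query = np.zeros(D)
for _ in range(N_PHON):
    ctx = attend(query, enc)
    ph, state = phoneme_decoder_step(ctx)
    phonemes.append(ph)
    dec_states.append(state)
    contexts.append(ctx)
    query = ctx  # attention is driven by the decoder, not the synthesizer

mel = synthesizer(np.stack(dec_states), np.stack(contexts))
print(mel.shape)  # (6, 16)
```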
The key changes made in Translatotron 2 are listed below:
- The output from the target phoneme decoder is one of the inputs to the spectrogram synthesizer in Translatotron 2. This strong conditioning makes the synthesizer easier to train and improves its performance.
- The spectrogram synthesizer used in Translatotron 2 is duration-based, which remarkably improves the robustness of the synthesized speech.
- The attention-based connection in Translatotron 2 is driven by the phoneme decoder instead of the spectrogram synthesizer. This aligns the acoustic information the spectrogram synthesizer sees with the translated material it’s synthesizing, allowing each speaker’s voice to be preserved throughout speaker turns.
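The duration-based synthesis mentioned in the second point can be illustrated with a toy upsampler. This is a hedged sketch of the general technique (as used in non-attentive TTS models), not Translatotron 2's actual synthesizer: each phoneme representation is repeated for its predicted number of frames, so output pacing is explicit rather than left to attention, which is what makes the synthesized speech more robust to skipping or babbling failures.

```python
import numpy as np

def duration_upsample(phoneme_reprs, durations):
    """Repeat each phoneme representation for its predicted duration
    (in frames), instead of relying on attention to pace the output."""
    frames = [np.tile(p, (d, 1)) for p, d in zip(phoneme_reprs, durations)]
    return np.concatenate(frames, axis=0)

reprs = np.arange(12, dtype=float).reshape(3, 4)   # 3 phonemes, dim 4
durs = [2, 1, 3]                                   # predicted frame counts
frames = duration_upsample(reprs, durs)
print(frames.shape)  # (6, 4): total frames = sum of durations
```

Because the frame count is fixed by the duration predictions, the synthesizer cannot get "stuck" on a phoneme or skip one, failure modes common in purely attention-paced synthesis.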
Powerful and Responsible Voice Retention
The original Translatotron retained the source speaker’s voice in the translated speech by conditioning its decoder on a speaker embedding generated from a separately trained speaker encoder. However, this approach also enabled it to generate the translated speech in a different speaker’s voice if a clip of the target speaker’s recording was used as the reference audio for the speaker encoder, or if the target speaker’s embedding was directly available. While this capability was powerful, it had the potential to be misused to spoof audio with arbitrary content, which posed a concern for production deployment.
To address this, Google designed Translatotron 2 to use only a single speech encoder, which is responsible for both linguistic understanding and voice capture. In this way, the trained models cannot be directed to reproduce non-source voices. This approach can also be applied to the original Translatotron.
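The contrast between the two interfaces can be made concrete with toy signatures. Everything here is illustrative (the function names and stand-in bodies are hypothetical); the point is that the original design exposes a reference-audio input that can inject any voice, while the single-encoder design exposes no such input.

```python
import numpy as np

rng = np.random.default_rng(0)

def speaker_encoder(audio):
    """Stand-in for a separately trained speaker encoder (mean-pool)."""
    return audio.mean(axis=0)

def synthesize(hidden, speaker_embedding):
    """Toy conditioning: add the voice embedding to the hidden states."""
    return hidden + speaker_embedding

# Original Translatotron-style interface (illustrative): ANY reference
# clip can set the output voice -- the vector for potential misuse.
def translate_v1(source_hidden, reference_audio):
    return synthesize(source_hidden, speaker_encoder(reference_audio))

# Translatotron-2-style interface (illustrative): one speech encoder
# carries both linguistic content and the source voice; there is no
# input through which a different speaker's voice can be injected.
def translate_v2(source_speech):
    hidden = source_speech * 0.5  # stand-in single speech encoder
    return hidden

src = rng.standard_normal((4, 3))
ref = rng.standard_normal((5, 3))
out1 = translate_v1(src, ref)   # voice chosen by the reference clip
out2 = translate_v2(src)        # voice fixed to the source speaker
print(out1.shape, out2.shape)
```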
To retain speakers’ voices across translation, researchers generally prefer to train S2ST models on parallel utterances with the same speaker’s voice on both sides. Such a dataset with human recordings on both sides is extremely difficult to collect, because it requires a large number of fluent bilingual speakers. To avoid this difficulty, Google uses a modified version of PnG NAT, a TTS model capable of cross-lingual voice transfer, to synthesize such training targets. The modified PnG NAT model incorporates a separately trained speaker encoder in the same way as in Google’s previous TTS work (the same strategy used for the original Translatotron), so it is capable of zero-shot voice transfer.
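The target-side data synthesis described above can be sketched as a small pipeline. This is a toy illustration under stated assumptions: `speaker_embedding` and `tts_voice_transfer` are hypothetical stand-ins for the speaker encoder and the modified PnG NAT model, and the shapes are arbitrary. The real point it mirrors is that each training target is the target-language text synthesized in the source speaker's voice, giving same-voice pairs without bilingual human recordings.

```python
import numpy as np

rng = np.random.default_rng(0)

def speaker_embedding(audio):
    """Stand-in for the separately trained speaker encoder."""
    return audio.mean(axis=0)

def tts_voice_transfer(target_text, spk_emb):
    """Stand-in for the modified PnG NAT TTS: synthesize the target-language
    text in the voice described by spk_emb (zero-shot voice transfer)."""
    base = rng.standard_normal((len(target_text), spk_emb.shape[0]))
    return base + spk_emb

def make_training_target(source_audio, target_text):
    """Build one S2ST training pair with the SAME voice on both sides."""
    emb = speaker_embedding(source_audio)
    return tts_voice_transfer(target_text, emb)

src_audio = rng.standard_normal((50, 8))       # toy source recording
tgt_audio = make_training_target(src_audio, "hola mundo")
print(tgt_audio.shape)  # (10, 8)
```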