Microsoft’s plan to integrate ChatGPT into its core products is still under way. There are already reports of integrations coming to Bing in the first quarter of this year, as well as to the Office suite. But ChatGPT is not alone: Microsoft has a new ace up its sleeve.
Its name is VALL-E, and it is a language model for text-to-speech (TTS) synthesis. Microsoft promises that the system needs only a three-second audio recording to emulate your voice.
Microsoft wants AI in everything
Microsoft has a new text-to-speech model that can emulate any voice from just three seconds of recording. One of the most interesting points the company shared in its documentation is that it is developing VALL-E to work with other generative AI models such as GPT-3.
In other words, ChatGPT itself could give us audio results once this model is integrated. “Imitating the voice of Chiquito de la Calzada” will be possible as long as the necessary pre-training is done.
The examples Microsoft shows are simply amazing. They present the underlying audio input, the intermediate steps, and VALL-E’s final result. The model imitates not only the sound of the voice, but also the original rhythm of speech and the pitch at which the audio input was recorded.
This is nothing particularly new; Google was already showing off similar models years ago. But Google’s most powerful AI models are not as widely available in popular products as Microsoft plans its own to be. We will have AI in the browser and in office applications and, as detailed here, that AI will also have a voice.