Microsoft Introduces VALL-E, a Text-to-Speech AI Capable of Cloning Voices with Only Three Seconds of Audio

The powerful new technology raises concerns about potential misuse and the need for proper regulation.

Microsoft has recently developed a new text-to-speech AI called VALL-E, which can clone your voice, tone and all, from just a three-second snippet of audio. The technology behind the system is complex, but in practice, using it is simple: plug in an audio sample and some text, and the AI generates realistic-sounding speech.

Many text-to-speech apps already exist, but most of them require a large amount of input and still do not produce AI voices that sound particularly human. With VALL-E, Microsoft has managed to generate human-sounding speech from minimal input. The potential applications of VALL-E include "zero-shot TTS, speech editing, and content creation."

While VALL-E has many potential uses, there are also concerns about its misuse, such as the creation of fake and misleading sound bites. Microsoft has addressed this concern by refraining from making the code open-source, and it is working on a system that detects whether audio was created using VALL-E. That, however, raises the question of how easy such audio is to detect.