Google DeepMind has launched a new text-to-speech model designed to give users greater control over AI-generated voice output. The Gemini 3.1 Flash TTS system introduces granular audio tags, which allow precise direction of speech characteristics such as tone, emphasis, and pacing. This marks a shift from generic synthetic voices to more customizable audio generation.
The update lets developers embed delivery instructions directly in the input text. For example, tags like <whisper>, <slow>, or <loud> can be inserted to change how the model speaks the content. According to the company, these tags function as direct commands rather than suggestions, which is meant to keep results consistent across different voices and languages.
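As a rough illustration of what embedding tags in text could look like, the Python sketch below assembles a tagged script as a plain string. Only the tag names <whisper>, <slow>, and <loud> come from the article. The helper `tag`, the `ALLOWED_TAGS` set, and the choice to place each tag in front of the text it affects are assumptions made for this example, since the article does not describe the exact syntax or whether tags need a closing counterpart. No call to any Google API is shown.

```python
# Hypothetical helper for building tagged TTS input as a plain string.
# Tag names are taken from the article; placement before the text
# (with no closing tag) is an assumption, not documented behavior.
ALLOWED_TAGS = {"whisper", "slow", "loud"}


def tag(name: str, text: str) -> str:
    """Prefix `text` with an inline audio tag such as <whisper>."""
    if name not in ALLOWED_TAGS:
        raise ValueError(f"unknown audio tag: {name}")
    return f"<{name}> {text}"


# Assemble a short script that mixes untagged and tagged sentences.
script = " ".join([
    "The door creaked open.",
    tag("whisper", "Is anyone there?"),
    tag("loud", "Nobody answered."),
])

print(script)
```

Validating tag names before building the string catches typos early, before any text would be sent for synthesis.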
According to the company’s blog post, the technology addresses a longstanding limitation in AI speech synthesis. Traditional TTS systems often produce flat or robotic output unless manually adjusted. The new system automates this process by interpreting embedded tags in real time, producing speech that aligns with user intent without requiring post-processing.
The model supports multiple languages and is optimized for low-latency applications. Google DeepMind states that the model has been tested across various use cases, including audiobooks, virtual assistants, and accessibility tools. Early adopters report that its voices sound more natural than those of previous versions.
The release follows Google’s pattern of iterative updates to its AI models. While the company has not disclosed a date for broader availability, the technology is already available to select partners.
Source: deepmind.google