Google has released Gemini 3.1 Flash TTS, a text-to-speech model designed to eliminate the robotic quality of synthetic voices. The update, announced Wednesday, introduces new features aimed at making AI-generated audio sound more natural.
The model lets users embed natural language tags in a script to control tone, pace, and emotional delivery. A tag such as “<emotion=excited>”, for example, could signal the model to raise its pitch and speaking rate. The feature targets content creators who need expressive narration without manual audio editing.
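As a rough illustration of tag-driven scripting, the Python sketch below builds a tagged script as plain text. The angle-bracket syntax simply mirrors the “<emotion=excited>” example above and may not match the tag format the model actually accepts; the helper name `tag_line` and the `pace` tag are hypothetical, and nothing here calls Google's API.

```python
def tag_line(text: str, **directions: str) -> str:
    """Prefix a line with one tag per direction, e.g. <emotion=excited>.

    The tag syntax is an assumption taken from the article's example,
    not a documented format.
    """
    tags = "".join(f"<{key}={value}>" for key, value in directions.items())
    return f"{tags} {text}" if tags else text


# Assemble a short script, one tagged line per sentence.
script = "\n".join([
    tag_line("Welcome back to the show!", emotion="excited"),
    tag_line("Today's topic is a serious one.", emotion="calm", pace="slow"),
])
print(script)
```

Keeping the delivery cues in a small helper like this makes it easy to change the tag format in one place once the real syntax is confirmed.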
Gemini 3.1 Flash TTS also supports multi-character dialogue within a single audio file. A podcast host, for instance, could simulate a conversation between two speakers without recording each one separately. The model currently supports 70 languages, with an emphasis on fluency and clarity.
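One way to picture a multi-character script is as an ordered list of speaker turns rendered into a single transcript. The Python sketch below assumes a simple "Speaker: text" layout; that layout, the `Turn` class, and the helper names are illustrative assumptions rather than the input format Google documents.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    speaker: str  # hypothetical speaker label, e.g. "Host"
    text: str     # what that speaker says in this turn


def render_dialogue(turns: list[Turn]) -> str:
    """Render turns as 'Speaker: text' lines, one line per turn."""
    return "\n".join(f"{t.speaker}: {t.text.strip()}" for t in turns)


def speakers(turns: list[Turn]) -> list[str]:
    """Return the distinct speakers in order of first appearance."""
    return list(dict.fromkeys(t.speaker for t in turns))


turns = [
    Turn("Host", "Thanks for joining us today."),
    Turn("Guest", "Happy to be here."),
    Turn("Host", "Let's get started."),
]
print(render_dialogue(turns))
print(speakers(turns))
```

The list of distinct speakers is the piece a caller would map to voices, one voice per speaker, before sending the transcript for synthesis.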
To combat misinformation, Google has integrated SynthID, a system that embeds an imperceptible watermark in generated audio. The watermark helps verify whether a clip was produced by an AI model, which reduces the risk posed by deepfakes. Watermarking runs automatically during synthesis.
Developers can access the model through Google’s API or Google AI Studio. Google presents it as a cost-effective way to produce high-quality voiceovers for videos, apps, and podcasts, and as a faster alternative to traditional voice recording pipelines.
Resources: blog.google