Google Launches Gemini 3.1 Flash TTS With Audio Tags for Granular Voice Control
Summary
Google released Gemini 3.1 Flash TTS on April 15, a text-to-speech model that introduces inline audio tags allowing developers to control vocal style, pacing, accent, and emotional delivery at the sentence level. The model is available in preview through the Gemini API, Google AI Studio, Vertex AI, and Google Vids.
The key differentiator is the tag system, which lets developers embed natural language commands directly into text input to steer speech output mid-sentence, for example shifting from excited to calm or from a whisper to full voice, without separate API calls. Google frames this as a “director’s chair” approach with three layers of control: scene direction for setting environmental context, speaker-level profiles for assigning distinct voices and accents, and an export step that locks those parameters into reproducible API code.
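To make the layering concrete, here is a minimal sketch of how a tagged input might be assembled before being sent to a TTS model. It only builds a string and calls no API. The square-bracket tag syntax, the "Scene:" and "Voice profile" labels, and every name in the code (Speaker, DialogueLine, build_prompt) are illustrative assumptions, not Google's documented format.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Speaker:
    name: str
    profile: str  # free-text voice/accent description (hypothetical field)


@dataclass
class DialogueLine:
    speaker: str
    # Each segment is (tag or None, text); the tag applies to the text after it.
    segments: List[Tuple[Optional[str], str]]


def render_line(line: DialogueLine) -> str:
    """Render one line, prefixing each tagged segment with an inline [tag]."""
    parts = []
    for tag, text in line.segments:
        # ASSUMPTION: tags are written in square brackets; check the real docs.
        parts.append(f"[{tag}] {text}" if tag else text)
    return f"{line.speaker}: " + " ".join(parts)


def build_prompt(scene: str, speakers: List[Speaker], lines: List[DialogueLine]) -> str:
    """Combine scene direction, speaker profiles, and tagged dialogue into one text input."""
    header = [f"Scene: {scene}"]
    header += [f"Voice profile for {s.name}: {s.profile}" for s in speakers]
    body = [render_line(line) for line in lines]
    return "\n".join(header) + "\n\n" + "\n".join(body)


prompt = build_prompt(
    scene="A quiet library late at night",
    speakers=[Speaker("Ana", "warm, low-pitched, slight Spanish accent")],
    lines=[
        DialogueLine(
            "Ana",
            [("whispers", "I think I found it."), ("excited", "This is the missing page!")],
        )
    ],
)
print(prompt)
```

The point of the sketch is that a mid-sentence change in delivery travels inside a single text input, so one request can carry the whole performance rather than one request per vocal style.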
On the Artificial Analysis TTS leaderboard, which aggregates thousands of blind human preference ratings, 3.1 Flash TTS scored an Elo of 1,211, placing it second behind Inworld TTS 1.5 Max at 1,215 and ahead of ElevenLabs v3 at 1,179. The model supports over 70 languages with native multi-speaker dialogue.
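For a sense of how close these scores are, the sketch below applies the standard Elo expected-score formula to the ratings quoted above. It assumes the leaderboard uses the conventional 400-point logistic scale; the source does not state the exact rating method, so the percentages are an approximation.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Standard Elo expected score: probability that A is preferred over B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


# Ratings as reported in the article.
ratings = {
    "Inworld TTS 1.5 Max": 1215,
    "Gemini 3.1 Flash TTS": 1211,
    "ElevenLabs v3": 1179,
}

gemini = ratings["Gemini 3.1 Flash TTS"]
for rival in ("Inworld TTS 1.5 Max", "ElevenLabs v3"):
    p = expected_score(gemini, ratings[rival])
    print(f"Gemini 3.1 Flash TTS preferred over {rival}: {p:.1%}")
```

Under that assumption, a 4-point gap works out to roughly a 49% expected preference rate against Inworld TTS 1.5 Max, and a 32-point lead to roughly 55% against ElevenLabs v3. In other words, listeners in blind comparisons would split close to evenly among the top models.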
The release signals a shift in how TTS models compete: raw voice quality is converging across providers, pushing differentiation toward fine-grained controllability and developer tooling. Whether audio tags can handle real-time intent shifts during live conversation, where a speaker pivots from statement to sarcasm mid-stream, remains an open question for production deployments.
Enterprise availability through Vertex AI positions the model for integration into customer service, media production, and accessibility workflows at scale.
Source: https://blog.google/innovation-and-ai/models-and-research/gemini-models/gemini-3-1-flash-tts/


