Conversational AI applications depend on natural, real-time speech responses. Amazon has introduced the Bidirectional Streaming API for Amazon Polly, which accepts text input and returns audio output simultaneously during speech synthesis. Applications, particularly those built on large language models (LLMs), can begin audio playback before the full text response has been generated, which significantly reduces latency.
Addressing Limitations of Traditional Text-to-Speech
Traditional text-to-speech (TTS) services operate on a request-response model, requiring the entire text input before initiating synthesis. While Amazon Polly previously supported streaming audio back to users incrementally, the input side remained a bottleneck, as text had to be fully available before being sent for synthesis. This model causes delays in conversational AI scenarios where LLMs generate text token-by-token, forcing users to wait for complete responses before hearing audio.
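To make the input-side bottleneck concrete, here is a toy model of time-to-first-audio. It is a sketch under stated assumptions, not a measurement: the token count, token interval, synthesis startup time, and first-chunk size are all hypothetical values chosen for illustration, and the code does not call Amazon Polly.

```python
def time_to_first_audio_batch(num_tokens: int, token_interval_s: float,
                              synth_startup_s: float) -> float:
    """Request-response model: synthesis starts only after the last token arrives."""
    return num_tokens * token_interval_s + synth_startup_s


def time_to_first_audio_streaming(first_chunk_tokens: int, token_interval_s: float,
                                  synth_startup_s: float) -> float:
    """Streaming-input model: synthesis starts once the first chunk of tokens arrives."""
    return first_chunk_tokens * token_interval_s + synth_startup_s


# Hypothetical numbers, chosen only to illustrate the shape of the difference.
NUM_TOKENS = 200          # tokens in the full LLM response
TOKEN_INTERVAL_S = 0.05   # seconds between generated tokens
SYNTH_STARTUP_S = 0.3     # seconds before the first audio bytes come back
FIRST_CHUNK_TOKENS = 10   # tokens needed before synthesis can begin

batch = time_to_first_audio_batch(NUM_TOKENS, TOKEN_INTERVAL_S, SYNTH_STARTUP_S)
streaming = time_to_first_audio_streaming(FIRST_CHUNK_TOKENS, TOKEN_INTERVAL_S, SYNTH_STARTUP_S)
print(f"batch: {batch:.2f}s, streaming: {streaming:.2f}s")
```

In this model the wait under the request-response approach grows with the length of the response, while the streaming wait depends only on the size of the first chunk.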
Features of the Bidirectional Streaming API
The new StartSpeechSynthesisStream API introduces true duplex communication over a single HTTP/2 connection, enabling clients to:
- Stream text incrementally as it is generated
- Receive synthesized audio bytes in real time
- Control synthesis timing through flush configurations
Key components include:
- TextEvent: client sends text to Polly
- CloseStreamEvent: signals end of text input
- AudioEvent: Polly sends back synthesized audio chunks
- StreamClosedEvent: confirms stream completion
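The event flow listed above can be sketched as a small simulation. This is plain Python that does not call Amazon Polly or any AWS SDK. The four class names mirror the event names in the text, but their fields, the `flush` flag, the sentence-boundary rule, and the `fake_polly_stream` and `fake_synthesize` helpers are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Union


@dataclass
class TextEvent:
    text: str
    flush: bool = False  # hypothetical flag: synthesize buffered text immediately


@dataclass
class CloseStreamEvent:
    pass


@dataclass
class AudioEvent:
    audio_chunk: bytes


@dataclass
class StreamClosedEvent:
    pass


ClientEvent = Union[TextEvent, CloseStreamEvent]
ServerEvent = Union[AudioEvent, StreamClosedEvent]


def fake_synthesize(text: str) -> bytes:
    """Stand-in for real synthesis: returns the text's UTF-8 bytes."""
    return text.encode("utf-8")


def fake_polly_stream(events: Iterable[ClientEvent]) -> Iterator[ServerEvent]:
    """Toy server: buffers text and emits audio at sentence ends or on flush."""
    buffer = ""
    for event in events:
        if isinstance(event, CloseStreamEvent):
            break
        buffer += event.text
        at_sentence_end = buffer.rstrip().endswith((".", "!", "?"))
        if buffer and (event.flush or at_sentence_end):
            yield AudioEvent(fake_synthesize(buffer))
            buffer = ""
    if buffer:
        yield AudioEvent(fake_synthesize(buffer))  # drain leftover text
    yield StreamClosedEvent()


def llm_tokens() -> Iterator[ClientEvent]:
    """Simulates an LLM producing text piece by piece, then closing the stream."""
    for piece in ["Hello ", "there. ", "How can ", "I help"]:
        yield TextEvent(piece)
    yield CloseStreamEvent()


responses = list(fake_polly_stream(llm_tokens()))
for response in responses:
    print(response)
```

Because both sides are generators, the toy server pulls text events one at a time and can emit an audio event before the client has finished sending, which is the property the duplex design is meant to provide.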
This approach removes the need for the application server to split text into chunks and issue a separate API call for each one, which simplifies infrastructure and lowers latency.
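For contrast, the text-separation workaround that the streaming API replaces might look like the sketch below. The naive sentence splitter and the `stub_synthesize` function are hypothetical. A real implementation would make one SynthesizeSpeech request per chunk where the stub is called.

```python
import re
from typing import Callable, List, Tuple


def split_into_sentences(text: str) -> List[str]:
    """Naive splitter: breaks after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


def synthesize_per_sentence(text: str,
                            synthesize_chunk: Callable[[str], bytes]) -> Tuple[bytes, int]:
    """Calls the synthesizer once per sentence; returns the audio and the call count."""
    audio = b""
    calls = 0
    for sentence in split_into_sentences(text):
        audio += synthesize_chunk(sentence)  # one request per sentence
        calls += 1
    return audio, calls


def stub_synthesize(sentence: str) -> bytes:
    """Hypothetical stand-in for a per-chunk synthesis request."""
    return sentence.encode("utf-8")


text = "Streaming cuts latency. Each sentence costs one call! Does that add up? Yes."
audio, calls = synthesize_per_sentence(text, stub_synthesize)
print(calls)
```

The number of requests grows with the number of sentences, and the application owns the splitting logic, including its edge cases such as abbreviations and decimal numbers.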
Performance Improvements
Benchmark tests comparing the traditional SynthesizeSpeech API to the new bidirectional streaming method showed:
- A 39% reduction in total processing time (from ~115 seconds to ~70 seconds for 7,045 characters)
- A 27-fold decrease in API calls (from 27 to 1)
- Similar audio output sizes
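The reported figures can be checked with simple arithmetic. The inputs below are the approximate values quoted above. The characters-per-call figure is a derived average, not a number from the benchmark.

```python
traditional_s = 115.0   # approximate total time with SynthesizeSpeech
streaming_s = 70.0      # approximate total time with bidirectional streaming
calls_before, calls_after = 27, 1
chars = 7045

reduction_pct = (traditional_s - streaming_s) / traditional_s * 100
call_ratio = calls_before / calls_after
chars_per_call = chars / calls_before  # average chunk size in the traditional run

print(f"{reduction_pct:.1f}% less time, {call_ratio:.0f}x fewer calls, "
      f"~{chars_per_call:.0f} chars per call")
```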
By streaming text word-by-word as it is generated, Amazon Polly can begin synthesizing audio immediately, reducing end-to-end latency and enhancing user experience in conversational AI applications.
Implementation and Availability
The bidirectional streaming API is available through the AWS SDKs for Java, JavaScript, .NET, C++, Go, Kotlin, PHP, Ruby, Rust, and Swift. It is not currently supported in the SDK for Python, version 3 of the .NET SDK, or the AWS CLI. Developers can integrate this feature to improve real-time speech synthesis in virtual assistants, chatbots, and other conversational platforms.
Read more: aws.amazon.com