Amazon Polly Launches Bidirectional Streaming API for Real-Time Conversational AI Speech Synthesis

In conversational AI, users expect spoken responses to start almost immediately. Amazon has introduced the Bidirectional Streaming API for Amazon Polly, which lets a client send text and receive synthesized audio at the same time. Applications, particularly those built on large language models (LLMs), can therefore begin audio playback before the full text response has been generated, shortening the wait before the user hears the first audio.

Addressing Limitations of Traditional Text-to-Speech

Traditional text-to-speech (TTS) services operate on a request-response model, requiring the entire text input before initiating synthesis. While Amazon Polly previously supported streaming audio back to users incrementally, the input side remained a bottleneck, as text had to be fully available before being sent for synthesis. This model causes delays in conversational AI scenarios where LLMs generate text token-by-token, forcing users to wait for complete responses before hearing audio.

Features of the Bidirectional Streaming API

The new StartSpeechSynthesisStream API introduces true duplex communication over a single HTTP/2 connection, enabling clients to:

Stream text incrementally as it is generated
Receive synthesized audio bytes in real time
Control synthesis timing through flush configurations

Key components include:

TextEvent: client sends text to Polly
CloseStreamEvent: signals end of text input
AudioEvent: Polly sends back synthesized audio chunks
StreamClosedEvent: confirms stream completion

This approach eliminates the need for complex server-side text separation and multiple API calls, simplifying infrastructure and lowering latency.
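The event flow described above can be illustrated with a small, self-contained model. This is only a sketch: it does not call Amazon Polly or any AWS SDK, the field names (`text`, `flush`, `audio_chunk`) are assumptions made for illustration, and `fake_synthesizer` is a hypothetical stand-in that returns the text bytes as placeholder "audio". Its sentence-boundary buffering is likewise an assumption, not a description of how the service decides when to synthesize.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Union


# Inbound events (client -> service). Field names are illustrative assumptions.
@dataclass
class TextEvent:
    text: str
    flush: bool = False  # hypothetical flag: synthesize buffered text now


@dataclass
class CloseStreamEvent:
    pass


# Outbound events (service -> client).
@dataclass
class AudioEvent:
    audio_chunk: bytes


@dataclass
class StreamClosedEvent:
    pass


Inbound = Union[TextEvent, CloseStreamEvent]
Outbound = Union[AudioEvent, StreamClosedEvent]


def fake_synthesizer(events: Iterable[Inbound]) -> Iterator[Outbound]:
    """Hypothetical stand-in for the service: buffers text, emits placeholder audio."""
    buffer = ""
    for event in events:
        if isinstance(event, TextEvent):
            buffer += event.text
            # Emit "audio" at a sentence boundary or when the client flushes.
            at_boundary = buffer.rstrip().endswith((".", "!", "?"))
            if buffer and (event.flush or at_boundary):
                yield AudioEvent(audio_chunk=buffer.encode("utf-8"))
                buffer = ""
        elif isinstance(event, CloseStreamEvent):
            # Synthesize whatever is left, then confirm completion.
            if buffer:
                yield AudioEvent(audio_chunk=buffer.encode("utf-8"))
            yield StreamClosedEvent()
            return


def llm_tokens() -> Iterator[Inbound]:
    """Simulates an LLM emitting text token by token, then closing the stream."""
    for token in ["Hello ", "there. ", "How ", "can ", "I ", "help"]:
        yield TextEvent(text=token)
    yield CloseStreamEvent()


received = list(fake_synthesizer(llm_tokens()))
for event in received:
    print(event)
```

Because both sides are generators, the first AudioEvent is produced as soon as the first sentence is complete, before the remaining tokens have been read. That interleaving is the property the bidirectional API is meant to provide.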

Performance Improvements

Benchmark tests comparing the traditional SynthesizeSpeech API to the new bidirectional streaming method showed:

A 39% reduction in total processing time (from ~115 seconds to ~70 seconds for 7,045 characters)
A 27-fold decrease in API calls (from 27 to 1)
Similar audio output sizes
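The quoted figures can be checked with a few lines of arithmetic. The inputs are the approximate values from the list above; the average characters per call is derived here and is not a number reported in the benchmark.

```python
# Approximate figures quoted in the benchmark.
baseline_seconds = 115.0   # SynthesizeSpeech
streaming_seconds = 70.0   # bidirectional streaming
characters = 7045
baseline_calls = 27
streaming_calls = 1

reduction = (baseline_seconds - streaming_seconds) / baseline_seconds
call_ratio = baseline_calls / streaming_calls
chars_per_call = characters / baseline_calls  # derived, not reported

print(f"time reduction: {reduction:.1%}")            # 39.1%
print(f"API call ratio: {call_ratio:.0f}x")          # 27x
print(f"avg characters per baseline call: {chars_per_call:.0f}")  # 261
```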

By streaming text word-by-word as it is generated, Amazon Polly can begin synthesizing audio immediately, reducing end-to-end latency and enhancing user experience in conversational AI applications.
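A rough way to see why streaming the input helps is to compare the time until the first audio can start under each model. The sketch below is a simplified model, not a measurement: the function names, the token rate, the startup cost, and the token counts are all made-up assumptions chosen only to show the shape of the difference.

```python
def time_to_first_audio_batch(token_count: int,
                              seconds_per_token: float,
                              synthesis_startup: float) -> float:
    """Request-response model: wait for every token, then start synthesis."""
    return token_count * seconds_per_token + synthesis_startup


def time_to_first_audio_streaming(tokens_before_first_chunk: int,
                                  seconds_per_token: float,
                                  synthesis_startup: float) -> float:
    """Streaming-input model: synthesis starts once enough text has arrived."""
    return tokens_before_first_chunk * seconds_per_token + synthesis_startup


# Hypothetical numbers for illustration only.
batch = time_to_first_audio_batch(
    token_count=200, seconds_per_token=0.05, synthesis_startup=0.3)
streaming = time_to_first_audio_streaming(
    tokens_before_first_chunk=10, seconds_per_token=0.05, synthesis_startup=0.3)

print(f"batch: {batch:.2f} s, streaming: {streaming:.2f} s")
```

In this model the batch delay grows with the length of the whole response, while the streaming delay depends only on how much text is needed for the first chunk.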

Implementation and Availability

The bidirectional streaming API is available through the AWS SDKs for Java, JavaScript, .NET, C++, Go, Kotlin, PHP, Ruby, Rust, and Swift. It is not currently supported in the AWS SDK for Python, the AWS SDK for .NET v3, or the AWS CLI. Developers can use it to add real-time speech synthesis to virtual assistants, chatbots, and other conversational platforms.

Read more: aws.amazon.com

