Amazon Nova Sonic.
Most voice stacks chain three boxes: speech-to-text, then an LLM, then text-to-speech. Every hop adds latency and throws away prosody, so the model never hears how you said it. Amazon Nova Sonic collapses the chain into a single speech-to-speech model on Bedrock. You open one persistent bidirectional stream, push raw microphone audio in as it's captured, and the model streams spoken audio back while you're still talking: understanding and generation in one model, one connection, no orchestration glue.
client.invoke_model_with_bidirectional_stream(model_id="amazon.nova-sonic-v1:0")
01 What it is, in one breath
Nova Sonic is a model that "provides real-time, conversational interactions through bidirectional audio streaming," processing and responding to "real-time speech as it occurs." The docs frame it as a unified speech understanding and generation architecture: the same model that hears the audio also produces the reply, rather than a transcribe-then-synthesize pipeline.
That unification buys things a pipeline can't. Because the model hears the audio directly, it offers adaptive speech response that dynamically adjusts delivery based on the prosody of the input speech: match the energy of an excited caller, slow down for a hesitant one. And it gives you graceful handling of user interruptions without dropping conversational context: the caller can talk over the model and the thread survives.
One model, one socket. There is no separate ASR box and no separate TTS box to wire together. Speech goes in and speech comes out over a single open connection, and the model keeps the prosody a transcript would have discarded.
02 The bidirectional contract
Nova Sonic uses the InvokeModelWithBidirectionalStream API.
Unlike a request-response call, it "maintains an open channel for continuous audio
streaming in both directions." The architecture is explicitly event-driven: client and
model exchange structured JSON events over a persistent connection, with
three things happening at once on the wire:
- continuous audio streaming from the user to the model,
- concurrent speech processing and generation, and
- real-time model responses without waiting for complete utterances.
The transport is HTTP/2 (the SDK examples default to it), authenticated with SigV4
against the bedrock-runtime endpoint. The bidirectional API is supported
across the AWS SDKs for .NET, C++, Java, JavaScript, Kotlin, Ruby, Rust, and Swift;
Python developers get a dedicated experimental SDK for the streaming calls.
03 The event lifecycle, in order
Every conversation walks the same scripted sequence. Get the order wrong and the model loses state, so it's worth memorising:
- sessionStart: carries the inferenceConfiguration (maxTokens, topP, temperature).
- promptStart: defines the audio output format and tool config, and assigns a unique promptName that must appear on every subsequent event.
- contentStart → content → contentEnd: a three-part pattern repeated for each interaction type, where contentStart declares the content type and a role of SYSTEM, USER, ASSISTANT, or TOOL, and carries its own unique contentName.
- promptEnd, then sessionEnd to close.
The two identifiers form a hierarchy: promptName ties the whole conversation
together, while each contentName marks the boundaries of one content block.
Conversation history, if you supply it, goes in exactly once, after the
system prompt and before audio streaming begins, using the same
contentStart/textInput/contentEnd pattern with
USER and ASSISTANT roles per message.
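The ordering and the promptName/contentName hierarchy described above can be sketched as a plain list of event dictionaries. This is a minimal sketch, not AWS's sample code: the helper names (`text_block`, `opening_events`, `event_name`) are hypothetical, the inference values are illustrative, and fields the text does not name (`type`, `textInputConfiguration`, `mediaType`) are assumptions to check against the input-events reference.

```python
import uuid


def text_block(prompt_name, role, text):
    """Return the contentStart -> textInput -> contentEnd triple for one text block."""
    ids = {"promptName": prompt_name, "contentName": str(uuid.uuid4())}
    return [
        # "type" and "textInputConfiguration" are assumed field names.
        {"event": {"contentStart": {**ids, "type": "TEXT", "role": role,
                                    "textInputConfiguration": {"mediaType": "text/plain"}}}},
        {"event": {"textInput": {**ids, "content": text}}},
        {"event": {"contentEnd": dict(ids)}},
    ]


def opening_events(system_prompt, history=()):
    """Build every event that precedes audio streaming, in protocol order."""
    prompt_name = str(uuid.uuid4())  # shared by every event in the conversation
    events = [
        # Illustrative values only; tune them for your application.
        {"event": {"sessionStart": {"inferenceConfiguration": {
            "maxTokens": 1024, "topP": 0.9, "temperature": 0.7}}}},
        # A real promptStart also carries the audio output format and tool config.
        {"event": {"promptStart": {"promptName": prompt_name}}},
    ]
    events += text_block(prompt_name, "SYSTEM", system_prompt)
    for role, text in history:  # history goes in once, after the system prompt
        events += text_block(prompt_name, role, text)
    return prompt_name, events


def event_name(event):
    """Return the single key under "event", e.g. "contentStart"."""
    return next(iter(event["event"]))


prompt_name, events = opening_events(
    "You are a concise voice assistant.",
    history=[("USER", "Hi"), ("ASSISTANT", "Hello! How can I help?")],
)
for e in events:
    print(event_name(e))
```

Every event after sessionStart carries the same promptName, while each of the three text blocks gets a fresh contentName. The audio contentStart would follow this list.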
04 Streaming audio frames
Once the audio contentStart is open, you stream the microphone in. The docs
are specific: audio frames are approximately 32 ms each, captured
directly from the mic and sent immediately as audioInput events that reuse
the same contentName. They should be streamed "in real-time as they're
captured, maintaining the natural microphone sampling cadence", so don't batch them.
All audio frames share a single content container until the conversation
ends and it's explicitly closed.
```json
{
  "event": {
    "audioInput": {
      "promptName": "<uuid>",
      "contentName": "<uuid>",
      "content": "<base64EncodedAudioData>"
    }
  }
}
```
The audio itself is LPCM, 16-bit, mono (channelCount: 1),
base64-encoded, with sampleRateHertz of 8000, 16000,
or 24000. AWS's own console example captures the mic at 16 kHz
and renders the model's voice at 24 kHz; input and
output sample rates are configured independently.
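To make the frame arithmetic concrete, here is a minimal sketch that works out how many bytes a roughly 32 ms frame holds and wraps raw PCM in audioInput events. The function names are hypothetical, and slicing a pre-recorded buffer is only for illustration; a live client would send each frame as the microphone delivers it.

```python
import base64

SAMPLE_WIDTH_BYTES = 2  # 16-bit samples
CHANNELS = 1            # mono
FRAME_MS = 32           # approximate frame length from the docs


def frame_size_bytes(sample_rate_hz, frame_ms=FRAME_MS):
    """Bytes of raw LPCM in one frame at the given sample rate."""
    samples = sample_rate_hz * frame_ms // 1000
    return samples * SAMPLE_WIDTH_BYTES * CHANNELS


def audio_input_events(pcm, prompt_name, content_name, sample_rate_hz=16000):
    """Yield one audioInput event per frame, all under the same contentName."""
    step = frame_size_bytes(sample_rate_hz)
    for start in range(0, len(pcm), step):
        chunk = pcm[start:start + step]
        yield {"event": {"audioInput": {
            "promptName": prompt_name,
            "contentName": content_name,
            "content": base64.b64encode(chunk).decode("ascii"),
        }}}


print(frame_size_bytes(16000))  # 512 samples * 2 bytes = 1024
```

At 16 kHz a frame is 512 samples, or 1024 bytes before base64 encoding; at 8 kHz and 24 kHz the same calculation gives 512 and 1536 bytes.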
05 What the model streams back
Output is events too, arriving while the user is still mid-sentence. The ones you handle in the receive loop:
- textOutput: text transcription of the user's speech (ASR, role USER) and the model's text reply (role ASSISTANT), so you get a live transcript for free.
- audioOutput: base64 audio chunks for the spoken reply; decode and play them as they land.
- toolUse: a function-call request naming the tool and carrying a toolUseId.
Two subtleties live in the output stream. A barge-in (the user talking
over the model) is surfaced by the model sending a content notification; in the
documented event schema it shows up as stopReason: "INTERRUPTED"
on a text contentEnd, your cue to stop playback and yield the floor. (AWS's
Python sample also watches for an { "interrupted" : true } marker in the
streamed text to flip its own barge-in flag.) And contentStart can carry an
additionalModelFields flag with
generationStage: SPECULATIVE, marking text the model is
generating ahead of confirmation; the sample code uses it to decide whether to display
assistant text yet.
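A receive loop built on the events above might dispatch them along these lines. This is a sketch under stated assumptions, not AWS's sample: `ReceiveState` and `handle_event` are hypothetical names, the `content` and `role` field names are assumed, and `additionalModelFields` is handled as either a JSON-encoded string or a dictionary because the text above does not say which form arrives.

```python
import base64
import json
from dataclasses import dataclass, field


@dataclass
class ReceiveState:
    """Hypothetical per-conversation state for a receive loop."""
    transcript: list = field(default_factory=list)          # (role, text) pairs
    playback: bytearray = field(default_factory=bytearray)  # decoded reply audio
    role: str = ""
    speculative: bool = False
    interrupted: bool = False


def handle_event(event, state):
    """Route one decoded output event and return the event name it handled."""
    name, body = next(iter(event["event"].items()))
    if name == "contentStart":
        state.role = body.get("role", "")
        extra = body.get("additionalModelFields") or {}
        if isinstance(extra, str):  # assumed: may arrive JSON-encoded
            extra = json.loads(extra)
        state.speculative = extra.get("generationStage") == "SPECULATIVE"
    elif name == "textOutput":
        state.transcript.append((state.role, body["content"]))
    elif name == "audioOutput":
        state.playback += base64.b64decode(body["content"])
    elif name == "contentEnd" and body.get("stopReason") == "INTERRUPTED":
        state.interrupted = True
        state.playback.clear()  # barge-in: drop audio that has not played yet
    return name


state = ReceiveState()
for ev in [
    {"event": {"contentStart": {
        "role": "ASSISTANT",
        "additionalModelFields": json.dumps({"generationStage": "SPECULATIVE"})}}},
    {"event": {"textOutput": {"content": "Sure, one moment."}}},
    {"event": {"audioOutput": {"content": base64.b64encode(b"\x00\x01").decode("ascii")}}},
    {"event": {"contentEnd": {"stopReason": "INTERRUPTED"}}},
]:
    handle_event(ev, state)
```

Clearing the playback buffer on an interrupted contentEnd is one way to yield the floor quickly. A real player would also need to stop the audio device.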
06 Tools, RAG, and agentic flows
Nova Sonic isn't limited to its pretrained knowledge. It supports tool use
(function calling) for "integration with external functions, APIs, and data
sources," declared in the toolConfiguration on promptStart with
a name, description, and JSON input schema. You can steer which tool fires with the
toolChoice parameter. On the same machinery, the docs describe
Retrieval-Augmented Generation (RAG) for knowledge grounding with
enterprise data and agentic flows that compose multiple tool calls, all
driven through the TOOL role and toolResult events without ever
leaving the voice stream.
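As a rough illustration of declaring a tool, the sketch below assembles a toolConfiguration with a name, description, and JSON input schema. The `toolSpec` wrapper, the schema being passed as a JSON string under `json`, and the `toolChoice` shape are assumptions to confirm against the tool-use documentation; `get_weather` is a hypothetical tool.

```python
import json


def tool_spec(name, description, input_schema):
    """Return one tool entry: name, description, and a JSON input schema."""
    return {"toolSpec": {
        "name": name,
        "description": description,
        # Assumed: the schema is passed as a JSON string under "json".
        "inputSchema": {"json": json.dumps(input_schema)},
    }}


def tool_configuration(tools, force_tool=None):
    """Build the toolConfiguration block that rides on promptStart."""
    config = {"tools": list(tools)}
    if force_tool is not None:
        # Assumed toolChoice shape for pinning one specific tool.
        config["toolChoice"] = {"tool": {"name": force_tool}}
    return config


weather = tool_spec(
    "get_weather",  # hypothetical tool
    "Look up the current weather for a city.",
    {"type": "object",
     "properties": {"city": {"type": "string"}},
     "required": ["city"]},
)
config = tool_configuration([weather], force_tool="get_weather")
print(json.dumps(config, indent=2))
```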
The loop is the natural one: the model emits a toolUse event, your code runs
the function, and you feed the answer back via a contentStart (role
TOOL) → toolResult → contentEnd triple bearing the
original toolUseId. The model folds the result into its spoken reply.
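The tool loop above could be sketched as follows. The registry and its hard-coded reply are hypothetical stand-ins for a real API call, and the `toolName` field, the JSON-string arguments, and the `toolResultInputConfiguration` shape are assumptions. The part taken from the text is the TOOL-role triple that echoes the original toolUseId.

```python
import json
import uuid

# Hypothetical local tool registry; the fixed reply stands in for a real lookup.
TOOLS = {
    "get_weather": lambda args: {"city": args["city"], "forecast": "sunny"},
}


def tool_result_events(prompt_name, tool_use):
    """Run the requested tool and wrap its answer in a TOOL content triple."""
    args = json.loads(tool_use.get("content") or "{}")  # assumed: JSON-string arguments
    result = TOOLS[tool_use["toolName"]](args)          # assumed field name: toolName
    ids = {"promptName": prompt_name, "contentName": str(uuid.uuid4())}
    return [
        {"event": {"contentStart": {
            **ids, "type": "TOOL", "role": "TOOL",
            # Assumed configuration shape; the key point is echoing toolUseId.
            "toolResultInputConfiguration": {"toolUseId": tool_use["toolUseId"]},
        }}},
        {"event": {"toolResult": {**ids, "content": json.dumps(result)}}},
        {"event": {"contentEnd": dict(ids)}},
    ]


request = {"toolName": "get_weather", "toolUseId": "tu-123",
           "content": json.dumps({"city": "Lisbon"})}
triple = tool_result_events("prompt-1", request)
for e in triple:
    print(next(iter(e["event"])))
```

The three events share one new contentName, separate from the audio container, so the tool result can be sent while the audio block is still open.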
07 Limits worth knowing
- Fixed audio shape. Input and output are LPCM, 16-bit, single channel only; sample rate must be one of 8000 / 16000 / 24000 Hz. There's no MP3/Opus path in the stream, so encode to raw PCM first.
- Event order is load-bearing. sessionStart → promptStart → content triples → promptEnd → sessionEnd. The docs warn that skipping any closing event "can result in incomplete conversations or orphaned resources."
- History goes in once. Conversation history is allowed only after the system prompt and before audio begins; you can't splice it in mid-stream.
- One audio container. All audio frames live in a single content block keyed by one contentName; you open it once and close it once at the end.
- Voices are tied to language. Each locale has a fixed, named voice set; there's no arbitrary voice cloning, and English (GB) ships a single listed voice.
- A newer generation exists. The V1 model documented here is amazon.nova-sonic-v1:0; AWS now also publishes a separate Amazon Nova 2 Sonic guide, so confirm which generation you're targeting before wiring model IDs.
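Because event order matters so much, a local pre-flight check can catch mistakes before anything is sent. The sketch below is a hypothetical client-side helper, not something the service or SDK provides: it confirms the opening and closing events and that every content block, tracked by contentName, is opened before use and closed before promptEnd.

```python
CONTENT_EVENTS = {"textInput", "audioInput", "toolResult"}


def event_parts(event):
    """Split an event into (name, body), e.g. ("contentStart", {...})."""
    return next(iter(event["event"].items()))


def check_event_order(events):
    """Return None if the events follow the lifecycle, else a short reason."""
    names = [event_parts(e)[0] for e in events]
    if names[:2] != ["sessionStart", "promptStart"]:
        return "must open with sessionStart then promptStart"
    if len(names) < 4 or names[-2:] != ["promptEnd", "sessionEnd"]:
        return "must close with promptEnd then sessionEnd"
    open_blocks = set()  # contentNames opened but not yet closed
    for event in events[2:-2]:
        name, body = event_parts(event)
        content_name = body.get("contentName")
        if name == "contentStart" and content_name not in open_blocks:
            open_blocks.add(content_name)
        elif name == "contentEnd" and content_name in open_blocks:
            open_blocks.remove(content_name)
        elif name in CONTENT_EVENTS and content_name in open_blocks:
            continue
        else:
            return f"{name} is out of place for contentName {content_name!r}"
    if open_blocks:
        return "content block never closed"
    return None


def ev(name, content_name=None):
    """Build a bare event for the demonstration below (illustration only)."""
    body = {} if content_name is None else {"contentName": content_name}
    return {"event": {name: body}}


good = [ev("sessionStart"), ev("promptStart"),
        ev("contentStart", "sys"), ev("textInput", "sys"), ev("contentEnd", "sys"),
        ev("contentStart", "mic"), ev("audioInput", "mic"), ev("contentEnd", "mic"),
        ev("promptEnd"), ev("sessionEnd")]
print(check_event_order(good))                   # None
print(check_event_order(good[:-3] + good[-2:]))  # content block never closed
```

Tracking open blocks as a set rather than a single flag allows a tool-result block to open and close while the audio container is still open.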
08 Voices and languages
Nova Sonic ships expressive voices across five languages (with two
English locales), each voice tied to a language. You pick one with voiceId in
the audioOutputConfiguration on promptStart.
| Language | Feminine-sounding | Masculine-sounding |
|---|---|---|
| English (US) | tiffany | matthew |
| English (GB) | amy | none listed |
| French | ambre | florian |
| Italian | beatrice | lorenzo |
| German | greta | lennart |
| Spanish | lupe | carlos |
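The table translates directly into a lookup that refuses combinations the table does not list. This is a minimal sketch: the dictionary keys and the `audio_output_configuration` helper are hypothetical, and a real audioOutputConfiguration carries further fields that are omitted here.

```python
VOICES = {
    "English (US)": {"feminine": "tiffany", "masculine": "matthew"},
    "English (GB)": {"feminine": "amy"},  # no masculine-sounding voice listed
    "French": {"feminine": "ambre", "masculine": "florian"},
    "Italian": {"feminine": "beatrice", "masculine": "lorenzo"},
    "German": {"feminine": "greta", "masculine": "lennart"},
    "Spanish": {"feminine": "lupe", "masculine": "carlos"},
}
ALLOWED_SAMPLE_RATES = (8000, 16000, 24000)


def audio_output_configuration(language, style, sample_rate_hz=24000):
    """Return the voice-related part of audioOutputConfiguration, or raise ValueError."""
    try:
        voice_id = VOICES[language][style]
    except KeyError:
        raise ValueError(f"no {style} voice listed for {language}") from None
    if sample_rate_hz not in ALLOWED_SAMPLE_RATES:
        raise ValueError(f"unsupported sample rate: {sample_rate_hz}")
    return {"voiceId": voice_id, "sampleRateHertz": sample_rate_hz, "channelCount": 1}


print(audio_output_configuration("German", "feminine"))
```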
09 Try it in five minutes
Stand up the AWS console-Python sample and have a spoken conversation end to end. The key constants and model ID come straight from the docs:
```python
# pip install pyaudio and the experimental Bedrock streaming SDK
import uuid

# Import path as used in AWS's sample; check it against your installed SDK version.
from aws_sdk_bedrock_runtime.client import InvokeModelWithBidirectionalStreamOperationInput

INPUT_SAMPLE_RATE = 16000   # mic capture
OUTPUT_SAMPLE_RATE = 24000  # model voice
CHANNELS = 1                # mono LPCM, 16-bit


class SimpleNovaSonic:
    def __init__(self, model_id="amazon.nova-sonic-v1:0", region="us-east-1"):
        self.model_id = model_id
        self.region = region
        self.client = None  # the full sample builds the Bedrock runtime client here
        self.prompt_name = str(uuid.uuid4())         # ties every event together
        self.audio_content_name = str(uuid.uuid4())  # one container for all frames

    async def start_session(self):
        # Opens the bidirectional stream; self.client must be initialized first.
        self.stream = await self.client.invoke_model_with_bidirectional_stream(
            InvokeModelWithBidirectionalStreamOperationInput(model_id=self.model_id))
        # then: sessionStart -> promptStart -> contentStart(SYSTEM) ...
        #       -> contentStart(AUDIO) -> stream 32ms audioInput frames
        #       -> contentEnd -> promptEnd -> sessionEnd
```
Send sessionStart and promptStart (set voiceId:
"matthew", sampleRateHertz: 24000 for output), open an audio
contentStart, then pump 32 ms mic frames as audioInput events.
Decode audioOutput chunks back to your speakers, and watch for the documented
stopReason: "INTERRUPTED" notification (AWS's sample also checks for an
{ "interrupted" : true } marker) so you yield the floor when the user talks
over the model. The full runnable file is in the amazon-nova-samples GitHub
repo under speech-to-speech/.
Sources: Using the Amazon Nova Sonic Speech-to-Speech model, Using the Bidirectional Streaming API, Handling input events with the bidirectional API, Speech-to-speech Example, Voices available for Amazon Nova Sonic.
If the docs change, this tip is a snapshot of that day, so check the sources for current behaviour.
This page (research, writing, verification, and deployment) was built by Claude Cowork. No human touched the prose, the layout, or the upload pipeline. The tip was generated this morning, cross-checked against the official AWS docs by an independent verification pass, and published to Cloudflare R2 on a schedule.