AWS AI Daily · 2026-06-14 · ≈ 7 min read · Amazon Nova · Nova Sonic

Amazon Nova Sonic.

Most voice stacks chain three boxes: speech-to-text, then an LLM, then text-to-speech. Every hop adds latency and throws away prosody, so the model never hears how you said it. Amazon Nova Sonic collapses the chain into a single speech-to-speech model on Bedrock. You open one persistent bidirectional stream, push raw microphone audio in as it's captured, and the model streams spoken audio back while you're still talking: understanding and generation in one model, one connection, no orchestration glue.

› client.invoke_model_with_bidirectional_stream(model_id="amazon.nova-sonic-v1:0")

01 What it is, in one breath

Nova Sonic is a model that "provides real-time, conversational interactions through bidirectional audio streaming," processing and responding to "real-time speech as it occurs." The docs frame it as a unified speech understanding and generation architecture: the same model that hears the audio also produces the reply, rather than a transcribe-then-synthesize pipeline.

That unification buys things a pipeline can't. Because the model hears the audio directly, it offers adaptive speech response that dynamically adjusts delivery based on the prosody of the input speech: it can match the energy of an excited caller or slow down for a hesitant one. It also handles user interruptions gracefully without dropping conversational context: the caller can talk over the model and the thread survives.

Key insight

One model, one socket. There is no separate ASR box and no separate TTS box to wire together. Speech goes in and speech comes out over a single open connection, and the model keeps the prosody a transcript would have discarded.

02 The bidirectional contract

Nova Sonic uses the InvokeModelWithBidirectionalStream API. Unlike a request-response call, it "maintains an open channel for continuous audio streaming in both directions." The architecture is explicitly event-driven: client and model exchange structured JSON events over a persistent connection, with three things happening at once on the wire:

The transport is HTTP/2 (the SDK examples default to it), authenticated with SigV4 against the bedrock-runtime endpoint. The bidirectional API is supported across the AWS SDKs for .NET, C++, Java, JavaScript, Kotlin, Ruby, Rust, and Swift; Python developers get a dedicated experimental SDK for the streaming calls.

03 The event lifecycle, in order

Every conversation walks the same scripted sequence. Get the order wrong and the model loses state, so it's worth memorising:

  1. sessionStart: carries the inferenceConfiguration (maxTokens, topP, temperature).
  2. promptStart: defines the audio output format and tool config, and assigns a unique promptName that must appear on every subsequent event.
  3. contentStart → content → contentEnd: a three-part pattern repeated for each interaction type, where contentStart declares the content type and a role of SYSTEM, USER, ASSISTANT, or TOOL, and carries its own unique contentName.
  4. promptEnd, then sessionEnd to close.

The two identifiers form a hierarchy: promptName ties the whole conversation together, while each contentName marks the boundaries of one content block. Conversation history, if you supply it, goes in exactly once, after the system prompt and before audio streaming begins, using the same contentStart/textInput/contentEnd pattern with USER and ASSISTANT roles per message.
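The ordering above can be sketched as a small event builder. This is a minimal illustration rather than AWS's sample code: the helper names (`event`, `text_block`, `opening_events`, `closing_events`) are hypothetical, the `type` field name and the inference values are assumptions, and the output and tool configuration that promptStart normally carries is omitted.

```python
import json
import uuid


def event(name, body):
    """Wrap a payload in the {"event": {name: body}} envelope."""
    return {"event": {name: body}}


def text_block(prompt_name, role, text):
    """One contentStart -> textInput -> contentEnd triple with its own contentName."""
    ids = {"promptName": prompt_name, "contentName": str(uuid.uuid4())}
    return [
        # "type" is an assumed field name for the declared content type.
        event("contentStart", {**ids, "type": "TEXT", "role": role}),
        event("textInput", {**ids, "content": text}),
        event("contentEnd", dict(ids)),
    ]


def opening_events(system_prompt, history=()):
    """Events to send, in order, before any microphone audio is streamed."""
    prompt_name = str(uuid.uuid4())  # reused on every later event
    events = [
        event("sessionStart", {"inferenceConfiguration": {
            "maxTokens": 1024, "topP": 0.9, "temperature": 0.7}}),  # illustrative values
        event("promptStart", {"promptName": prompt_name}),  # output/tool config omitted
    ]
    events += text_block(prompt_name, "SYSTEM", system_prompt)
    for role, text in history:  # USER / ASSISTANT turns, supplied exactly once
        events += text_block(prompt_name, role, text)
    return prompt_name, events


def closing_events(prompt_name):
    """promptEnd, then sessionEnd, to close the conversation."""
    return [event("promptEnd", {"promptName": prompt_name}), event("sessionEnd", {})]


if __name__ == "__main__":
    name, opening = opening_events("You are a friendly agent.", [("USER", "Hi")])
    for e in opening + closing_events(name):
        print(json.dumps(e))
```

The audio contentStart and its audioInput frames would sit between the opening and closing events.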

04 Streaming audio frames

Once the audio contentStart is open, you stream the microphone in. The docs are specific: audio frames are approximately 32 ms each, captured directly from the mic and sent immediately as audioInput events that reuse the same contentName. They should be streamed "in real-time as they're captured, maintaining the natural microphone sampling cadence," so don't batch them. All audio frames share a single content container until the conversation ends and it's explicitly closed.

{
  "event": {
    "audioInput": {
      "promptName": "<uuid>",
      "contentName": "<uuid>",
      "content": "<base64EncodedAudioData>"
    }
  }
}

The audio itself is LPCM, 16-bit, mono (channelCount: 1), base64-encoded, with sampleRateHertz of 8000, 16000, or 24000. AWS's own console example captures the mic at 16 kHz in and renders the model's voice at 24 kHz out; input and output sample rates are configured independently.
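To make the numbers concrete: at 16 kHz, a 32 ms frame is 512 samples, which is 1024 bytes of 16-bit mono LPCM before base64 encoding. The sketch below slices a PCM buffer into frames of that size and wraps each one in the audioInput envelope shown above. The function names are hypothetical, and a live client would send each frame as the microphone delivers it rather than slicing a pre-recorded buffer.

```python
import base64

SAMPLE_RATE_HZ = 16000   # one of the documented rates: 8000, 16000, 24000
BYTES_PER_SAMPLE = 2     # 16-bit LPCM
CHANNELS = 1             # mono
FRAME_MS = 32            # approximate frame duration from the docs


def frame_size_bytes(sample_rate_hz=SAMPLE_RATE_HZ, frame_ms=FRAME_MS):
    """Bytes of raw PCM in one frame at the given sample rate."""
    samples = sample_rate_hz * frame_ms // 1000
    return samples * BYTES_PER_SAMPLE * CHANNELS


def audio_input_events(pcm, prompt_name, content_name, sample_rate_hz=SAMPLE_RATE_HZ):
    """Yield one audioInput event per frame, all sharing the same contentName."""
    size = frame_size_bytes(sample_rate_hz)
    for start in range(0, len(pcm), size):
        chunk = pcm[start:start + size]  # the final chunk may be shorter
        yield {"event": {"audioInput": {
            "promptName": prompt_name,
            "contentName": content_name,
            "content": base64.b64encode(chunk).decode("ascii"),
        }}}


if __name__ == "__main__":
    one_second_of_silence = bytes(SAMPLE_RATE_HZ * BYTES_PER_SAMPLE)
    frames = list(audio_input_events(one_second_of_silence, "prompt-id", "audio-id"))
    print(frame_size_bytes(), len(frames))
```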

05 What the model streams back

Output is events too, arriving while the user is still mid-sentence. The ones you handle in the receive loop:

Two subtleties live in the output stream. A barge-in (the user talking over the model) is surfaced by the model sending a content notification; in the documented event schema it shows up as stopReason: "INTERRUPTED" on a text contentEnd, your cue to stop playback and yield the floor. (AWS's Python sample also watches for an { "interrupted" : true } marker in the streamed text to flip its own barge-in flag.) And contentStart can carry an additionalModelFields flag with generationStage: SPECULATIVE, marking text the model is generating ahead of confirmation; the sample code uses it to decide whether to display assistant text yet.
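A receive loop might route these signals roughly as follows. This is a sketch under stated assumptions, not AWS's implementation: `handle_output_event` and the `state` dictionary are hypothetical, events are assumed to arrive already parsed into dictionaries, and additionalModelFields is handled both as a JSON-encoded string and as a plain object because the exact wire form is not confirmed here.

```python
import base64
import json


def generation_stage(content_start):
    """Return generationStage from additionalModelFields, or None if absent."""
    fields = content_start.get("additionalModelFields")
    if fields is None:
        return None
    if isinstance(fields, str):  # assumption: may arrive as a JSON-encoded string
        fields = json.loads(fields)
    return fields.get("generationStage")


def handle_output_event(message, state):
    """Update `state` from one output event and return a label for what happened."""
    body = message.get("event", {})
    if "contentStart" in body:
        state["speculative"] = generation_stage(body["contentStart"]) == "SPECULATIVE"
        return "content_start"
    if "contentEnd" in body:
        if body["contentEnd"].get("stopReason") == "INTERRUPTED":
            state["playback"].clear()  # barge-in: drop queued audio, yield the floor
            return "barge_in"
        return "content_end"
    if "audioOutput" in body:
        pcm = base64.b64decode(body["audioOutput"]["content"])
        state["playback"].append(pcm)  # queue decoded LPCM for the speaker
        return "audio"
    return "ignored"


if __name__ == "__main__":
    state = {"playback": [], "speculative": False}
    chunk = base64.b64encode(b"\x00\x01").decode("ascii")
    print(handle_output_event({"event": {"audioOutput": {"content": chunk}}}, state))
    print(handle_output_event({"event": {"contentEnd": {"stopReason": "INTERRUPTED"}}}, state))
```

Clearing the playback queue on a barge-in matters because audio that is already buffered would otherwise keep playing over the caller.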

06 Tools, RAG, and agentic flows

Nova Sonic isn't limited to its pretrained knowledge. It supports tool use (function calling) for "integration with external functions, APIs, and data sources," declared in the toolConfiguration on promptStart with a name, description, and JSON input schema. You can steer which tool fires with the toolChoice parameter. On the same machinery, the docs describe Retrieval-Augmented Generation (RAG) for knowledge grounding with enterprise data and agentic flows that compose multiple tool calls, all driven through the TOOL role and toolResult events without ever leaving the voice stream.
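A tool declaration could be assembled along these lines. Treat the exact nesting as an assumption to check against the docs: the `toolSpec` wrapper, the `inputSchema.json` field holding a JSON-encoded string, and the `{"tool": {"name": ...}}` shape for toolChoice are written from memory of the Bedrock tool format, and the `get_weather` tool is purely hypothetical.

```python
import json


def tool_spec(name, description, input_schema):
    """Describe one tool by name, description, and JSON input schema."""
    return {"toolSpec": {
        "name": name,
        "description": description,
        # Assumption: the schema is passed as a JSON-encoded string.
        "inputSchema": {"json": json.dumps(input_schema)},
    }}


def tool_configuration(tools, force_tool=None):
    """Build the toolConfiguration block that goes on promptStart."""
    config = {"tools": list(tools)}
    if force_tool is not None:
        # Assumption: naming a tool here steers the model toward that tool.
        config["toolChoice"] = {"tool": {"name": force_tool}}
    return config


if __name__ == "__main__":
    weather = tool_spec(
        "get_weather",  # hypothetical tool
        "Look up the current weather for a city.",
        {"type": "object",
         "properties": {"city": {"type": "string"}},
         "required": ["city"]},
    )
    print(json.dumps(tool_configuration([weather], force_tool="get_weather"), indent=2))
```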

The loop is the natural one: the model emits a toolUse event, your code runs the function, and you feed the answer back via a contentStart (role TOOL) → toolResult → contentEnd triple bearing the original toolUseId. The model folds the result into its spoken reply.
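That round trip might look like the sketch below. It is illustrative only: `tool_result_events` and the `run_tool` callback are hypothetical, and the field names on the incoming toolUse event (`toolName`, `toolUseId`, a JSON-encoded `content`) and the `toolResultInputConfiguration` wrapper are assumptions to verify against the event schema.

```python
import json
import uuid


def tool_result_events(prompt_name, tool_use, run_tool):
    """Run the requested tool and build the contentStart -> toolResult -> contentEnd triple."""
    arguments = json.loads(tool_use["content"])         # assumption: JSON-encoded arguments
    result = run_tool(tool_use["toolName"], arguments)  # your own dispatch function
    ids = {"promptName": prompt_name, "contentName": str(uuid.uuid4())}
    return [
        {"event": {"contentStart": {
            **ids,
            "type": "TOOL",
            "role": "TOOL",
            # Echo the original toolUseId so the model can match result to request.
            "toolResultInputConfiguration": {"toolUseId": tool_use["toolUseId"]},
        }}},
        {"event": {"toolResult": {**ids, "content": json.dumps(result)}}},
        {"event": {"contentEnd": dict(ids)}},
    ]


if __name__ == "__main__":
    def run_tool(name, arguments):
        # Stand-in for a real lookup; returns canned data.
        return {"tool": name, "city": arguments["city"], "tempC": 21}

    request = {"toolName": "get_weather", "toolUseId": "abc-123",
               "content": json.dumps({"city": "Perth"})}
    for e in tool_result_events("prompt-id", request, run_tool):
        print(json.dumps(e))
```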

07 Limits worth knowing

08 Voices and languages

Nova Sonic ships expressive voices across five languages (with two English locales), each voice tied to a language. You pick one with voiceId in the audioOutputConfiguration on promptStart.

Language       Feminine-sounding   Masculine-sounding
English (US)   tiffany             matthew
English (GB)   amy                 (none)
French         ambre               florian
Italian        beatrice            lorenzo
German         greta               lennart
Spanish        lupe                carlos
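The table translates directly into a lookup. In this sketch the `VOICES` mapping mirrors the rows above, while `pick_voice`, `audio_output_configuration`, and the fallback to whichever voice exists for a language are hypothetical choices; only voiceId and sampleRateHertz are filled in, so a real promptStart would need the remaining audio output fields.

```python
VOICES = {
    "English (US)": {"feminine": "tiffany", "masculine": "matthew"},
    "English (GB)": {"feminine": "amy", "masculine": None},
    "French": {"feminine": "ambre", "masculine": "florian"},
    "Italian": {"feminine": "beatrice", "masculine": "lorenzo"},
    "German": {"feminine": "greta", "masculine": "lennart"},
    "Spanish": {"feminine": "lupe", "masculine": "carlos"},
}


def pick_voice(language, style="feminine"):
    """Return a voiceId for the language, falling back to any available voice."""
    voices = VOICES[language]  # raises KeyError for an unsupported language
    return voices.get(style) or next(v for v in voices.values() if v)


def audio_output_configuration(language, style="feminine", sample_rate_hz=24000):
    """Partial audioOutputConfiguration: just the voice and output sample rate."""
    return {"voiceId": pick_voice(language, style), "sampleRateHertz": sample_rate_hz}


if __name__ == "__main__":
    print(audio_output_configuration("German", "masculine"))
    print(audio_output_configuration("English (GB)", "masculine"))
```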

09 Try it in five minutes

Stand up the AWS console-Python sample and have a spoken conversation end to end. The key constants and model ID come straight from the docs:

# pip install pyaudio and the experimental Bedrock streaming SDK
import uuid

# Import path as used in AWS's sample; it may differ between SDK versions.
from aws_sdk_bedrock_runtime.client import InvokeModelWithBidirectionalStreamOperationInput

INPUT_SAMPLE_RATE  = 16000   # mic capture
OUTPUT_SAMPLE_RATE = 24000   # model voice
CHANNELS = 1                 # mono LPCM, 16-bit

class SimpleNovaSonic:
    def __init__(self, model_id="amazon.nova-sonic-v1:0", region="us-east-1"):
        self.model_id = model_id
        self.region   = region
        self.client   = None   # the full sample builds the Bedrock runtime client here
        self.stream   = None   # set by start_session()
        self.prompt_name        = str(uuid.uuid4())   # ties every event together
        self.audio_content_name = str(uuid.uuid4())   # one container for all frames

    async def start_session(self):
        # Open the persistent bidirectional stream; self.client must be set first.
        self.stream = await self.client.invoke_model_with_bidirectional_stream(
            InvokeModelWithBidirectionalStreamOperationInput(model_id=self.model_id))
        # then: sessionStart -> promptStart -> contentStart(SYSTEM) ...
        #       -> contentStart(AUDIO) -> stream 32ms audioInput frames
        #       -> contentEnd -> promptEnd -> sessionEnd

Send sessionStart and promptStart (set voiceId: "matthew", sampleRateHertz: 24000 for output), open an audio contentStart, then pump 32 ms mic frames as audioInput events. Decode audioOutput chunks back to your speakers, and watch for the documented stopReason: "INTERRUPTED" notification (AWS's sample also checks for an { "interrupted" : true } marker) so you yield the floor when the user talks over the model. The full runnable file is in the amazon-nova-samples GitHub repo under speech-to-speech/.

✓ Verified against the official AWS docs on 2026-06-14.
Sources: Using the Amazon Nova Sonic Speech-to-Speech model, Using the Bidirectional Streaming API, Handling input events with the bidirectional API, Speech-to-speech Example, Voices available for Amazon Nova Sonic.
If the docs change, this tip is a snapshot of that day, so check the sources for current behaviour.
Heads up: this tip is from 2026-06-14. AWS services move fast, so check the Nova Sonic speech-to-speech user guide before relying on specifics.

This page (research, writing, verification, and deployment) was built by Claude Cowork. No human touched the prose, the layout, or the upload pipeline. The tip was generated this morning, cross-checked against the official AWS docs by an independent verification pass, and published to Cloudflare R2 on a schedule.

A daily experiment by Monty van Emmerik · vanemmerik.ai