How We Cut Voice-Agent Latency to Sub-500ms: A Production Architecture

A case study in building real-time voice agents: batching, model tiering, flush timers, and keeping the hot path off disk.


Section 1 — Why Latency Is the Product

Voice agents fail differently from chatbots. In chat, the user waits for a reply. In voice, both parties speak continuously. Partial utterances arrive every few seconds. Running a large language model on every fragment feels responsive in demos but collapses in production: API latency stacks, costs explode, and the pipeline falls behind the conversation.

Humans expect conversational responses within 300–500ms. When latency exceeds 800ms, the interaction feels mechanical and frustrating. For real-time voice agents and call assistance systems, latency is not a backend metric—it's the product.

We had to define a latency budget across four layers:

Our backend optimizes layers two through four. Audio moves from the client to our server and straight through to a streaming speech-to-text provider—no transcoding on our side. LLM processing and response generation are ours. The constraint we designed for is simple: the user does not pause for your LLM.

That constraint drove every architectural decision. We didn't try to make one model faster. We redesigned the pipeline so expensive work runs less often, on smaller inputs, with faster models.


Section 2 — Architecture Snapshot

Dual-stream sessions

Each live call opens two independent WebSocket sessions:

| Stream | Source | Purpose |
|---|---|---|
| Mic | User's microphone | User-side speech |
| System | System audio capture | Agent/tool-side speech |

Each stream maintains its own session state, word buffer, streaming STT connection, and processing pipeline. Speaker separation costs a second, parallel pipeline whenever both streams are active, but it is essential for accurate turn detection and interrupt handling.

Hot path vs cold path

During the call, the hot path is entirely in memory and Redis. PostgreSQL wakes up only when the session ends.

What we intentionally keep off the hot path:

  • Relational database writes (transcripts, responses, duration)
  • Slow "pro" model verification over the full call
  • Heavy context generation (custom examples may finish in the background after audio starts)
  • Usage-limit sync to the database (Redis hash during the call; DB sync every 30 minutes)

End-to-end flow on each spoken turn

  1. The client sends raw PCM audio at 16 kHz over WebSocket
  2. The server forwards bytes unchanged to streaming STT
  3. STT returns a final transcript (end-of-turn)
  4. Words accumulate in an in-memory buffer
  5. When the buffer reaches its threshold or the flush timer (1000ms) fires, the window is processed by the LLM
  6. Generate response and push to client immediately
  7. Append transcript fragments to Redis
  8. On call end: merge Redis data, run slow verifier, write to PostgreSQL once

This is not a single LLM call. It is a multi-stage voice agent where each stage has its own latency profile.


Section 3 — The Six Decisions That Mattered Most

For each decision: the problem we faced, the approach we took, and the trade-off we accepted.

Decision 1: Word-window batching with overlap and debounced flush timer

Problem: Streaming STT emits short turns—often 5 to 15 words. Invoking an LLM on every turn means dozens of API calls per minute and unpredictable backlog when model latency spikes.

Approach: We batch transcript words into windows before processing. Two configurations work:

| Configuration | Window Size | Overlap | Stride | Use Case |
|---|---|---|---|---|
| Low latency | 20 words | 5 words | 15 words | Fast response, short utterances |
| High context | 50 words | 20 words | 30 words | Long conversations, more context |

Overlap preserves context at boundaries so sentences split across windows are not lost.

Batching alone creates blind spots. A trailing phrase might never fill a full window before the speaker stops. So we add a flush timer of 1000ms: if the buffer holds less than a full window (20 or 50 words, depending on the configuration) and no new word arrives for 1 second, we flush and process the remainder once.

We also skip duplicate windows: if overlap produces the same text as the last processed window, we do not call the model again.
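The windowing logic can be sketched roughly as follows. This is a minimal illustration under our own assumptions, not the production code: the class name, method names, and the `carried` counter are hypothetical, and the defaults are the low-latency values from the table.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class WordWindowBuffer:
    """Batches transcript words into fixed-size windows that overlap."""

    window_size: int = 20          # low-latency configuration
    overlap: int = 5
    words: List[str] = field(default_factory=list)
    carried: int = 0               # leading words already seen by the model
    last_window_text: Optional[str] = None

    def __post_init__(self) -> None:
        if not 0 <= self.overlap < self.window_size:
            raise ValueError("overlap must be >= 0 and smaller than window_size")

    @property
    def stride(self) -> int:
        # New words consumed per window: 20 - 5 = 15 in the low-latency setup.
        return self.window_size - self.overlap

    def add_final_transcript(self, text: str) -> List[str]:
        """Append the words of one final transcript; return the windows now ready."""
        self.words.extend(text.split())
        ready: List[str] = []
        while len(self.words) >= self.window_size:
            window = " ".join(self.words[: self.window_size])
            # Drop only `stride` words so the tail of this window
            # reappears at the head of the next one.
            self.words = self.words[self.stride:]
            self.carried = self.overlap
            if window != self.last_window_text:   # skip duplicate windows
                self.last_window_text = window
                ready.append(window)
        return ready

    def flush(self) -> Optional[str]:
        """Return the unprocessed remainder (for the flush timer), or None."""
        if len(self.words) <= self.carried:
            return None            # only already-processed overlap words remain
        window = " ".join(self.words)
        self.words = []
        self.carried = 0
        if window == self.last_window_text:
            return None
        self.last_window_text = window
        return window
```

With 25 words and the defaults, `add_final_transcript` yields one window (words 1 to 20) and keeps words 16 to 25 in the buffer; a later `flush()` returns those ten words. One design choice worth noting: in this sketch a flush empties the buffer, so the window after a flush starts without overlap.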

Trade-off: We traded a few seconds of semantic freshness for predictable throughput. Responses may lag roughly 15–30 words of speech behind the live conversation—acceptable for voice agents.

Decision 2: Final transcripts only—never partials

Problem: AssemblyAI and similar STT providers emit both partial and final transcripts. Partial transcripts are evolving, incomplete, and change constantly. Processing them pollutes your word buffer and triggers unnecessary LLM calls.

Approach: We discard partial transcripts entirely. Only final transcripts (immutable, complete utterances) populate the word buffer.

This is critical: processing partials creates noise. A word might appear in a partial, disappear in the next, then reappear modified. Every partial trigger would be a cascade of unnecessary LLM calls, exploding latency and cost.
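In code, the filter is tiny. The sketch below assumes each STT message has already been parsed into a dictionary with a `type` field (`"partial"` or `"final"`) and a `text` field; those field names are placeholders for illustration and are not the schema of AssemblyAI or any other provider.

```python
from typing import Iterable, Iterator, Mapping


def final_transcripts(messages: Iterable[Mapping[str, object]]) -> Iterator[str]:
    """Yield the text of final transcripts only; partials never reach the buffer."""
    for message in messages:
        if message.get("type") != "final":
            continue               # partial transcripts are dropped outright
        text = str(message.get("text", "")).strip()
        if text:                   # ignore empty finals
            yield text


stream = [
    {"type": "partial", "text": "I want to"},
    {"type": "partial", "text": "I want to cancel"},
    {"type": "final", "text": "I want to cancel my order."},
]
print(list(final_transcripts(stream)))  # ['I want to cancel my order.']
```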

Trade-off: We wait slightly longer for final transcripts (typically 100–300ms more than partials), but we avoid a roughly 10× increase in LLM calls. The net effect is lower end-to-end latency.

Decision 3: Flush timer overrides buffer threshold

Problem: Waiting for the buffer to fill (e.g., 50 words) introduces artificial delays. A user might say a complete sentence in 8 words, then stop. If we wait for 50 words, we add 2–3 seconds of pointless latency.

Approach: The flush timer (1000ms) triggers processing regardless of buffer state. After the last word:

  • If buffer has 50 or more words: process immediately
  • If buffer has fewer than 50 words: wait 1000ms, then process whatever is there

This ensures:

  • No idle waiting for buffer to fill
  • A bounded trigger delay (processing starts at most 1000ms after the last word)
  • Better handling of short utterances
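A debounced flush timer of this kind might look like the sketch below, written with Python's `asyncio`. The class and method names are hypothetical; the only behaviour taken from the text is that each new word restarts the countdown and that the flush fires once after the delay.

```python
import asyncio
from typing import Awaitable, Callable, Optional


class FlushTimer:
    """Debounced timer: every call to touch() restarts the countdown."""

    def __init__(self, delay_s: float, on_flush: Callable[[], Awaitable[None]]) -> None:
        self.delay_s = delay_s
        self.on_flush = on_flush
        self._task: Optional[asyncio.Task] = None

    def touch(self) -> None:
        """Call after each final transcript, from inside a running event loop."""
        self.cancel()
        self._task = asyncio.create_task(self._run())

    def cancel(self) -> None:
        """Stop a pending countdown, if there is one."""
        if self._task is not None and not self._task.done():
            self._task.cancel()
        self._task = None

    async def _run(self) -> None:
        await asyncio.sleep(self.delay_s)   # cancelled here if more words arrive
        await self.on_flush()               # silence held: process the remainder
```

In a session handler, `timer = FlushTimer(1.0, process_remainder)` would be created once per stream (with `process_remainder` standing in for whatever flushes the word buffer), `timer.touch()` called whenever a final transcript leaves the buffer below the window threshold, and `timer.cancel()` called when a full window is processed or the session is paused.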

Trade-off: We process smaller windows more often. But the LLM call count is still bounded by batching (vs. per-word), and the latency benefit is massive.

Decision 4: Two-speed model tiering

Problem: One model cannot be both fast enough for the live path and accurate enough for full-call reconciliation.

Approach: We run a two-speed LLM architecture:

| Stage | Model tier | When it runs | Latency target |
|---|---|---|---|
| Real-time processing | Fast flash-class model (e.g., Gemini-flash-lite) | Every processed window | 300–900 ms |
| Mid-call verifier | Large pro-class model | Once at ~50% of call | Fire-and-forget |
| End-of-call verifier | Large pro-class model | Session cleanup only | Fire-and-forget |

The real-time model (Gemini-flash-lite) is optimized for speed. The verifier models are more accurate but slower—they run asynchronously and never block the live pipeline.

The mid-call verifier loads the merged transcript from Redis and reconciles the conversation. It is launched fire-and-forget—never awaited—so a slow verification pass never blocks the next transcript turn.

Trade-off: Real-time processing may miss nuance that the verifier catches later. That is acceptable because the user is not waiting on that pass.

Decision 5: Redis write-behind

Problem: Writing transcripts and responses to PostgreSQL on every spoken turn would add disk latency and connection pool contention to the hot path.

Approach: During the call, all durable writes go to Redis:

  • Running per-stream transcript strings
  • Timestamped transcript fragments tagged mic or system
  • Response texts
  • Session usage time in a hash with a 1-hour TTL

PostgreSQL inserts happen once at cleanup, guarded by distributed locks so mic and system disconnects do not double-write the same call.
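The write-behind pattern with a cleanup lock can be illustrated as follows. To keep the sketch runnable without a server, it uses a small in-memory stand-in whose method names mirror redis-py's `rpush`, `lrange`, and `set(..., nx=True)`. The key names, the fragment layout, and the `write_to_db` callback are all assumptions made for illustration.

```python
import json
from typing import Callable, Dict, List, Optional


class InMemoryStore:
    """Stand-in for a Redis client so the sketch runs without a server."""

    def __init__(self) -> None:
        self.lists: Dict[str, List[str]] = {}
        self.keys: Dict[str, str] = {}

    def rpush(self, key: str, value: str) -> int:
        self.lists.setdefault(key, []).append(value)
        return len(self.lists[key])

    def lrange(self, key: str, start: int, end: int) -> List[str]:
        items = self.lists.get(key, [])
        return items[start:] if end == -1 else items[start:end + 1]

    def set(self, key: str, value: str, nx: bool = False) -> Optional[bool]:
        if nx and key in self.keys:
            return None                    # key already held, as with SET ... NX
        self.keys[key] = value
        return True


def append_fragment(store: InMemoryStore, call_id: str, stream: str, text: str, ts: float) -> None:
    """Hot path: one list append per turn, no relational write."""
    fragment = json.dumps({"stream": stream, "text": text, "ts": ts})
    store.rpush(f"call:{call_id}:fragments", fragment)


def finalize_call(store: InMemoryStore, call_id: str,
                  write_to_db: Callable[[List[dict]], None]) -> bool:
    """Cold path: runs on disconnect; only the first caller writes."""
    if not store.set(f"call:{call_id}:cleanup-lock", "1", nx=True):
        return False                       # the other stream already claimed cleanup
    raw = store.lrange(f"call:{call_id}:fragments", 0, -1)
    fragments = sorted((json.loads(item) for item in raw), key=lambda f: f["ts"])
    write_to_db(fragments)                 # single relational write for the whole call
    return True
```

When both the mic and system sockets disconnect, each calls `finalize_call`; the first returns `True` and writes the merged, time-ordered transcript, the second returns `False`. Against a real Redis, the lock would normally also carry an expiry so a crashed worker cannot hold it forever.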

Trade-off: Redis is the source of truth until the call ends. A crash before cleanup could lose in-flight data. We mitigate this by treating live responses as ephemeral and real-time, and post-call storage as best-effort durable. For voice agents, losing a response mid-call is worse than losing a stored record after hang-up.

Decision 6: Parallel session bootstrap and background context loading

Problem: The first spoken word should not pay for database reads and context assembly.

Approach: Before audio flows, context loads run in parallel:

  • User/pre-built context (preferences, rules)
  • Domain metadata
  • In-memory call state initialized

Custom examples load from the database or generate via LLM in the background—not awaited before processing starts. Early windows may use defaults until custom content is ready.
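A sketch of this bootstrap pattern with `asyncio` follows. The loader callables, the `SessionContext` class, and the default examples are hypothetical; the point is only the shape: required context is gathered in parallel and awaited, while custom examples are started in the background and swapped in when ready.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict, List, Optional

DEFAULT_EXAMPLES: List[str] = ["default example"]   # placeholder defaults


class SessionContext:
    def __init__(self, user_context: Dict[str, Any], domain_metadata: Dict[str, Any]) -> None:
        self.user_context = user_context
        self.domain_metadata = domain_metadata
        self.examples: List[str] = list(DEFAULT_EXAMPLES)   # used until custom ones arrive
        self.examples_task: Optional[asyncio.Task] = None


async def bootstrap_session(
    load_user_context: Callable[[], Awaitable[Dict[str, Any]]],
    load_domain_metadata: Callable[[], Awaitable[Dict[str, Any]]],
    load_custom_examples: Callable[[], Awaitable[List[str]]],
) -> SessionContext:
    # Required context: both loads run concurrently and are awaited together.
    user_context, domain_metadata = await asyncio.gather(
        load_user_context(), load_domain_metadata()
    )
    session = SessionContext(user_context, domain_metadata)

    async def _fill_examples() -> None:
        session.examples = await load_custom_examples()

    # Optional context: started now, not awaited before audio flows.
    session.examples_task = asyncio.create_task(_fill_examples())
    return session
```

Keeping the task on the session object gives cleanup code something to cancel if the call ends before the examples finish loading.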

On STT connect, we send up to 100 domain keyterms of up to 50 characters each. Better transcription upstream prevents wasted LLM cycles on garbled input. It is a latency investment that does not show up in backend timers.
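Preparing the keyterm list can be as simple as the sketch below. The function name is hypothetical, and dropping over-length terms (rather than truncating them) is our assumption for the example; the two limits are the ones quoted above.

```python
from typing import Iterable, List

MAX_KEYTERMS = 100
MAX_KEYTERM_CHARS = 50


def prepare_keyterms(terms: Iterable[str]) -> List[str]:
    """Normalize whitespace, drop duplicates and over-length terms, cap the count."""
    seen = set()
    prepared: List[str] = []
    for term in terms:
        cleaned = " ".join(term.split())
        key = cleaned.lower()
        if not cleaned or len(cleaned) > MAX_KEYTERM_CHARS or key in seen:
            continue
        seen.add(key)
        prepared.append(cleaned)
        if len(prepared) == MAX_KEYTERMS:
            break
    return prepared


print(prepare_keyterms(["  prior  authorization ", "Prior Authorization", "copay"]))
# ['prior authorization', 'copay']
```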

Trade-off: First-minute quality may be lower until background loading completes. We accept defaults early rather than block the hot path.

Supporting tactics (shorter but real)

  • Zero server-side audio processing. Binary audio buffers pass straight to STT. Every millisecond on audio middleware is stolen from understanding.
  • Pause as a circuit breaker. When the user pauses the session: stop forwarding audio, ignore STT messages, clear flush timers. Pause is load shedding, not polish.
  • JWT-only HTTP auth. HTTP routes verify tokens synchronously with no database lookup. WebSocket auth uses the same pattern plus a one-time usage check.
  • Per-stage instrumentation. Every session maintains a timing log: window readiness, processing duration, token counts, Redis operation times, and full window processing time. You cannot tune window sizes intelligently until you know where milliseconds go.
  • Parallel processing for dual streams. Mic and system audio are processed concurrently—neither stream blocks the other. This reduces end-to-end latency and improves interrupt handling.
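For the per-stage instrumentation tactic, a timing log can be built from a context manager and `time.perf_counter`. The class name and event layout below are illustrative assumptions, not our actual log format.

```python
import time
from contextlib import contextmanager
from typing import Dict, Iterator, List


class TimingLog:
    """Collects one event per timed stage for a single session."""

    def __init__(self) -> None:
        self.events: List[Dict[str, object]] = []

    @contextmanager
    def stage(self, name: str, **fields: object) -> Iterator[None]:
        start = time.perf_counter()
        try:
            yield
        finally:
            # Recorded even if the stage raises, so failures are still timed.
            elapsed_ms = (time.perf_counter() - start) * 1000.0
            self.events.append({"stage": name, "ms": elapsed_ms, **fields})
```

Usage is a `with` block around each stage, for example `with log.stage("llm_processing", window=(0, 20)): ...`, followed by another around the Redis append. The resulting events feed the aggregates listed in Appendix A.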

Section 4 — What Still Hurts (Lessons Learned)

Latency work is never finished. These are the bottlenecks we know about:

| Issue | Impact | Planned fix |
|---|---|---|
| Serial pipeline | Processing and response run sequentially; total latency is the sum of both | Parallelize response prep where possible |
| STT handler backpressure | A slow model window delays subsequent turns | Add backpressure queues |
| Dual-stream duplication | Mic and system maintain independent buffers; LLM calls can double | Shared window coordination across streams |
| No streaming LLM output | Responses arrive only after the full response completes | Token streaming to client for perceived latency |
| Per-turn Redis writes | Two awaited Redis ops per turn (not batched) | Batched Redis pipelines |
| Connect-time database query | Usage limit verification adds hundreds of milliseconds to connect | Optimize SQL or cache |

Section 5 — Latency Budget Breakdown

Here's where milliseconds are lost in typical voice agent stacks:

| Component | Typical Latency | Our Optimization |
|---|---|---|
| STT (AssemblyAI streaming) | 150–200 ms | Streaming API, final transcripts only |
| Buffer waiting | 200–500 ms | Flush timer (1000ms) overrides wait |
| LLM (Gemini-flash-lite) | 300–900 ms | Flash model, batched inputs |
| Response generation | 800–2,500 ms | Fast model, conditional triggering |
| TTS | 75–150 ms | Low-latency provider |
| Total | 1,525–4,250 ms | Target: 350–500 ms |

Achievable target: With this architecture, we reach ~380–450ms end-to-end latency (from user speech to agent response).
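As a quick arithmetic check, the per-component ranges for a typical stack can be summed directly. The figures below are copied from the table; the dictionary and function names are ours, for illustration only.

```python
from typing import Dict, Tuple

# (low, high) latency in milliseconds per component of a typical stack.
TYPICAL_STACK_MS: Dict[str, Tuple[int, int]] = {
    "STT": (150, 200),
    "Buffer waiting": (200, 500),
    "LLM": (300, 900),
    "Response generation": (800, 2500),
    "TTS": (75, 150),
}


def total_budget_ms(budget: Dict[str, Tuple[int, int]]) -> Tuple[int, int]:
    """Sum the lower and upper bounds of a serial pipeline."""
    low = sum(lo for lo, _ in budget.values())
    high = sum(hi for _, hi in budget.values())
    return low, high


print(total_budget_ms(TYPICAL_STACK_MS))  # (1525, 4250)
```

Summing only makes sense because these stages run serially; any stage that overlaps with another would be counted once, at the longer of the two durations.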


Section 6 — Closing

Reducing latency in voice agents is systems design, not model shopping. A faster flagship model on every partial transcript still loses to a pipeline that runs LLMs less often, on smaller inputs, with faster models.

Our approach in one sentence: batch speech into windows with overlap, use final transcripts only, trigger with a flush timer (1000ms), process with a fast model, buffer in Redis, verify with a slow model when nobody is waiting.

If you are building in this space, measure your windows per minute, your average window size, and your flush timer trigger rate. Those three numbers tell you more than any single end-to-end latency metric.

We would like to hear how others size word windows, configure flush timers, and split hot and cold paths—especially in multi-stream setups. The best patterns in voice agents are still being discovered in production, not in benchmarks.


Appendix A — Timing Instrumentation Reference

Each live session records a call timing log:

| Event | Unit | Use |
|---|---|---|
| Call start (first transcript) | ISO timestamp | Pipeline T0 |
| Window ready (word range) | word range | When batch threshold met or timer triggers |
| Full window processing | ms | End-to-end per window |
| LLM processing | ms + tokens | Fast model cost |
| Response generation | ms + tokens | Per window |
| Redis transcript store | ms | Hot-path write latency |
| Redis fragment append | ms | Hot-path write latency |

Illustrative timing ranges

| Stage | Illustrative range | Notes |
|---|---|---|
| Speech → 15 new words (low-lat stride) | ~8–20 s | ~120–150 wpm |
| Speech → 30 new words (standard stride) | ~15–45 s | ~120–150 wpm |
| LLM processing (flash-class, ~20 words) | 300–600 ms | Low-lat config |
| LLM processing (flash-class, ~50 words) | 300–900 ms | Standard config |
| Response generation (fast model) | 800–2,000 ms | Per window |
| Full window (LLM only, no response) | 400–1,200 ms | Typical steady-state |
| Full window (LLM + response) | 1,500–3,500 ms | When response generated |
| Redis append (per turn) | 1–10 ms | Two ops per turn |
| Tail flush debounce | 1,000 ms fixed | After last word |
| WebSocket client push | RTT-dependent | Usually negligible |

Suggested aggregates:

  • Median and p95 LLM processing ms per window
  • Median and p95 response ms
  • Flush timer trigger rate: % windows triggered by timer vs buffer full
  • Windows per minute per stream
  • Connect-to-first-response latency
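Several of these aggregates can be computed from the timing log with the standard library alone. The sketch below assumes each window record is a dictionary with an `llm_ms` value and a `trigger` field set to `"timer"` or `"buffer"`; that layout, the function names, and the nearest-rank percentile are illustrative choices.

```python
import statistics
from typing import Dict, Sequence


def percentile(values: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile for an integer pct between 1 and 100."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = (pct * len(ordered) + 99) // 100      # ceil(pct / 100 * n) in integer math
    return ordered[max(rank, 1) - 1]


def summarize_windows(windows: Sequence[dict], call_duration_s: float) -> Dict[str, float]:
    """Aggregate per-window records for one stream of one call."""
    llm_ms = [w["llm_ms"] for w in windows]
    timer_hits = sum(1 for w in windows if w["trigger"] == "timer")
    return {
        "llm_ms_median": statistics.median(llm_ms),
        "llm_ms_p95": percentile(llm_ms, 95),
        "flush_timer_rate_pct": 100.0 * timer_hits / len(windows),
        "windows_per_minute": len(windows) / (call_duration_s / 60.0),
    }
```

A flush-timer rate near 100% suggests the window is too large for how people actually speak on the call; a rate near 0% suggests the timer rarely matters and the stride dominates responsiveness.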

Appendix B — Key Configuration Constants

| Constant | Low-latency value | Standard value |
|---|---|---|
| Word window size | 20 words | 50 words |
| Window overlap | 5 words (stride 15) | 20 words (stride 30) |
| Flush timer | 1,000 ms | 1,000 ms |
| STT audio format | 16 kHz PCM signed 16-bit little-endian | Same |
| STT keyterms | Up to 100 terms, 50 characters each | Same |
| Time limit check interval | 1 minute (configurable) | Same |
| Redis-to-DB usage sync | 30 minutes (configurable) | Same |
| Redis session time TTL | 3,600 seconds | Same |
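These constants could be grouped into a single immutable configuration object, as in the sketch below. The class, field names, and the `"pcm_s16le"` label are hypothetical; the values are the ones listed in the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PipelineConfig:
    window_size_words: int
    window_overlap_words: int
    flush_timer_ms: int = 1000
    stt_sample_rate_hz: int = 16000
    stt_encoding: str = "pcm_s16le"          # signed 16-bit little-endian PCM
    max_keyterms: int = 100
    max_keyterm_chars: int = 50
    time_limit_check_interval_s: int = 60
    usage_sync_interval_s: int = 30 * 60
    redis_session_ttl_s: int = 3600

    @property
    def stride_words(self) -> int:
        # New words per window: window size minus overlap.
        return self.window_size_words - self.window_overlap_words


LOW_LATENCY = PipelineConfig(window_size_words=20, window_overlap_words=5)
STANDARD = PipelineConfig(window_size_words=50, window_overlap_words=20)

print(LOW_LATENCY.stride_words, STANDARD.stride_words)  # 15 30
```

Deriving the stride from size and overlap, rather than storing it, keeps the three numbers from drifting apart when one of them is tuned.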

Built with love by Sidhant Singh Rathore