How We Cut Voice-Agent Latency to Sub-500ms: A Production Architecture

A case study in building real-time voice agents: batching, model tiering, flush timers, and keeping the hot path off disk.

Section 1 — Why Latency Is the Product

Voice agents fail differently from chatbots. In chat, the user waits for a reply. In voice, both parties speak continuously. Partial utterances arrive every few seconds. Running a large language model on every fragment feels responsive in demos but collapses in production: API latency stacks, costs explode, and the pipeline falls behind the conversation.

Humans expect conversational responses within 300–500ms. When latency exceeds 800ms, the interaction feels mechanical and frustrating. For real-time voice agents and call assistance systems, latency is not a backend metric—it's the product.

We had to define a latency budget across four layers:

Our backend optimizes layers two through four. Audio moves from the client to our server and straight through to a streaming speech-to-text provider—no transcoding on our side. LLM processing and response generation are ours. The constraint we designed for is simple: the user does not pause for your LLM.

That constraint drove every architectural decision. We didn't try to make one model faster. We redesigned the pipeline so expensive work runs less often, on smaller inputs, with faster models.

Section 2 — Architecture Snapshot

Dual-stream sessions

Each live call opens two independent WebSocket sessions:

Stream	Source	Purpose
Mic	User's microphone	User-side speech
System	System audio capture	Agent/tool-side speech

Each stream maintains its own session state, word buffer, streaming STT connection, and processing pipeline. Speaker separation costs us a second parallel pipeline whenever both streams are active, but it is essential for accurate turn detection and interrupt handling.

Hot path vs cold path

During the call, the hot path is entirely in memory and Redis. PostgreSQL wakes up only when the session ends.

What we intentionally keep off the hot path:

–Relational database writes (transcripts, responses, duration)
–Slow "pro" model verification over the full call
–Heavy context generation (custom examples may finish in the background after audio starts)
–Usage-limit sync to the database (Redis hash during the call; DB sync every 30 minutes)

End-to-end flow on each spoken turn

–The client sends raw PCM audio at 16 kHz over WebSocket
–The server forwards bytes unchanged to streaming STT
–STT returns a final transcript (end-of-turn)
–Words accumulate in an in-memory buffer
–When buffer reaches threshold OR flush timer triggers (1000ms), process with LLM
–Generate response and push to client immediately
–Append transcript fragments to Redis
–On call end: merge Redis data, run slow verifier, write to PostgreSQL once

This is not a single LLM call. It is a multi-stage voice agent where each stage has its own latency profile.

Section 3 — The Six Decisions That Mattered Most

For each decision: the problem we faced, the approach we took, and the trade-off we accepted.

Decision 1: Word-window batching with overlap and debounced flush timer

Problem: Streaming STT emits short turns—often 5 to 15 words. Invoking an LLM on every turn means dozens of API calls per minute and unpredictable backlog when model latency spikes.

Approach: We batch transcript words into windows before processing. Two configurations work:

Configuration	Window Size	Overlap	Stride	Use Case
Low latency	20 words	5 words	15 words	Fast response, short utterances
High context (standard)	50 words	20 words	30 words	Long conversations, more context

Overlap preserves context at boundaries so sentences split across windows are not lost.

Batching alone creates blind spots. A trailing phrase might never fill a full window before the speaker stops. So we add a debounced flush timer of 1000ms: if the buffer holds less than a full window (20 or 50 words, depending on configuration) and no new word arrives for 1 second, we flush and process the remainder once.

We also skip duplicate windows: if overlap produces the same text as the last processed window, we do not call the model again.

Trade-off: We traded a few seconds of semantic freshness for predictable throughput. Responses may lag roughly 15–30 words of speech behind the live conversation—acceptable for voice agents.
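The windowing described above can be sketched in a few lines. This is a minimal illustration, not the production implementation: the class name `WordWindower`, its method names, and the `carried` counter (how many buffered words were already part of an emitted window) are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class WordWindower:
    """Batches words into overlapping windows (hypothetical sketch)."""

    window_size: int = 20  # low-latency configuration
    overlap: int = 5       # words repeated from the previous window
    buffer: List[str] = field(default_factory=list)
    carried: int = 0       # leading buffer words already sent in a window
    last_text: Optional[str] = None

    def __post_init__(self) -> None:
        if not 0 <= self.overlap < self.window_size:
            raise ValueError("overlap must be smaller than window_size")

    @property
    def stride(self) -> int:
        # New words needed before the next full window is ready.
        return self.window_size - self.overlap

    def add_final(self, text: str) -> List[str]:
        """Add one final transcript; return every window that became ready."""
        self.buffer.extend(text.split())
        ready: List[str] = []
        while len(self.buffer) >= self.window_size:
            window = " ".join(self.buffer[: self.window_size])
            # Slide by one stride; the last `overlap` words stay as context.
            self.buffer = self.buffer[self.stride:]
            self.carried = self.overlap
            self._emit(window, ready)
        return ready

    def flush(self) -> Optional[str]:
        """Called by the flush timer; returns the remainder, or None if nothing is new."""
        if len(self.buffer) <= self.carried:
            return None
        text = " ".join(self.buffer)
        self.buffer, self.carried = [], 0
        out: List[str] = []
        self._emit(text, out)
        return out[0] if out else None

    def _emit(self, text: str, out: List[str]) -> None:
        # Skip a window whose text matches the last processed window.
        if text != self.last_text:
            out.append(text)
            self.last_text = text
```

With the defaults, the first window is emitted after 20 words and each later one after 15 new words; `flush()` returns `None` when the buffer holds only carried-over overlap.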

Decision 2: Final transcripts only—never partials

Problem: AssemblyAI and similar STT providers emit both partial and final transcripts. Partial transcripts are evolving, incomplete, and change constantly. Processing them pollutes your word buffer and triggers unnecessary LLM calls.

Approach: We discard partial transcripts entirely. Only final transcripts (immutable, complete utterances) populate the word buffer.

This is critical: processing partials creates noise. A word might appear in one partial, disappear in the next, then reappear modified. Each partial would trigger another unnecessary LLM call, driving up both latency and cost.

Trade-off: We wait slightly longer for final transcripts (typically 100–300ms more than partials). In exchange, we avoid making roughly 10× as many LLM calls, so the net effect is lower end-to-end latency.
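As a rough illustration, the filter can sit in front of the word buffer as a small generator. The message shape used here (a dict with `is_final` and `text` keys) is an assumption made for the sketch, not the actual schema of AssemblyAI or any other provider.

```python
from typing import Dict, Iterable, Iterator


def final_transcripts(messages: Iterable[Dict]) -> Iterator[str]:
    """Yield only final, non-empty transcript text; partials are dropped.

    The `is_final` / `text` keys are hypothetical placeholders for whatever
    fields the STT provider uses to mark an end-of-turn transcript.
    """
    for message in messages:
        if not message.get("is_final"):
            continue  # partial transcript: still changing, never buffered
        text = (message.get("text") or "").strip()
        if text:
            yield text


stream = [
    {"is_final": False, "text": "I want to"},
    {"is_final": False, "text": "I want to cancel"},
    {"is_final": True, "text": "I want to cancel my order."},
    {"is_final": True, "text": "   "},
]
finals = list(final_transcripts(stream))
```

Only the third message reaches the buffer; the two partials and the empty final are ignored.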

Decision 3: Flush timer overrides buffer threshold

Problem: Waiting for the buffer to fill (e.g., 50 words) introduces artificial delays. A user might say a complete sentence in 8 words, then stop. If we wait for 50 words, we add 2–3 seconds of pointless latency.

Approach: The flush timer (1000ms) triggers processing regardless of buffer state. After the last word:

–If buffer has 50 or more words: process immediately
–If buffer has fewer than 50 words: wait 1000ms, then process whatever is there

This ensures:

–No idle waiting for buffer to fill
–Predictable trigger times (processing starts at most 1000ms after the last word)
–Better handling of short utterances

Trade-off: We process smaller windows more often. The LLM call count is still bounded by batching (per window, not per word), and a short utterance waits at most one second instead of waiting for a window that may never fill.
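One way to build such a debounced timer is a cancellable `asyncio` task that restarts whenever a new final transcript arrives. The sketch below is an assumption about the shape of the mechanism; `FlushTimer` and its method names are hypothetical.

```python
import asyncio
from typing import Awaitable, Callable, Optional


class FlushTimer:
    """Debounced timer: each restart() cancels the pending countdown."""

    def __init__(self, delay_s: float, on_flush: Callable[[], Awaitable[None]]):
        self.delay_s = delay_s
        self.on_flush = on_flush
        self._task: Optional[asyncio.Task] = None

    def restart(self) -> None:
        """Call after a final transcript that leaves the buffer below threshold."""
        self.cancel()
        self._task = asyncio.create_task(self._run())

    def cancel(self) -> None:
        """Call when a full window was processed or the call is paused."""
        if self._task is not None and not self._task.done():
            self._task.cancel()
        self._task = None

    async def _run(self) -> None:
        await asyncio.sleep(self.delay_s)
        # Detach first so on_flush may call restart()/cancel() safely.
        self._task = None
        await self.on_flush()
```

In this sketch the stream handler would call `restart()` after buffering words and `cancel()` when a full window fires or the user pauses, with `delay_s=1.0` matching the 1000ms setting.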

Decision 4: Two-speed model tiering

Problem: One model cannot be both fast enough for the live path and accurate enough for full-call reconciliation.

Approach: We run a two-speed LLM architecture:

Stage	Model tier	When it runs	Latency target
Real-time processing	Fast flash-class model (e.g., Gemini-flash-lite)	Every processed window	300–900 ms
Mid-call verifier	Large pro-class model	Once at ~50% of call	Fire-and-forget
End-of-call verifier	Large pro-class model	Session cleanup only	Fire-and-forget

The real-time model (Gemini-flash-lite) is optimized for speed. The verifier models are more accurate but slower—they run asynchronously and never block the live pipeline.

The mid-call verifier loads the merged transcript from Redis and reconciles the conversation. It is launched fire-and-forget—never awaited—so a slow verification pass never blocks the next transcript turn.

Trade-off: Real-time processing may miss nuance that the verifier catches later. That is acceptable because the user is not waiting on that pass.

Decision 5: Redis write-behind

Problem: Writing transcripts and responses to PostgreSQL on every spoken turn would add disk latency and connection pool contention to the hot path.

Approach: During the call, all durable writes go to Redis:

–Running per-stream transcript strings
–Timestamped transcript fragments tagged mic or system
–Response texts
–Session usage time in a hash with a 1-hour TTL

PostgreSQL inserts happen once at cleanup, guarded by distributed locks so mic and system disconnects do not double-write the same call.

Trade-off: Redis is the source of truth until the call ends. A crash before cleanup could lose in-flight data. We mitigate this by treating live responses as ephemeral and real-time, and post-call storage as best-effort durable. For voice agents, losing a response mid-call is worse than losing the stored record after hang-up.
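A minimal sketch of the write-behind pattern follows. To stay self-contained it uses an in-memory stand-in whose method names mirror redis-py (`rpush`, `lrange`, and `set` with `nx=True`); the key names, the `finalize_call` helper, and the `write_to_db` callback are hypothetical.

```python
import json
import time
from typing import Callable, Dict, List, Optional


class InMemoryStore:
    """Stand-in for a Redis client; `ex` (expiry) is accepted but ignored."""

    def __init__(self) -> None:
        self.lists: Dict[str, List[str]] = {}
        self.keys: Dict[str, str] = {}

    def rpush(self, name: str, *values: str) -> int:
        self.lists.setdefault(name, []).extend(values)
        return len(self.lists[name])

    def lrange(self, name: str, start: int, end: int) -> List[str]:
        items = self.lists.get(name, [])
        return items[start:] if end == -1 else items[start:end + 1]

    def set(self, name: str, value: str, nx: bool = False, ex: Optional[int] = None):
        if nx and name in self.keys:
            return None  # key already exists: lock not acquired
        self.keys[name] = value
        return True


def append_fragment(store, call_id: str, stream: str, text: str,
                    ts: Optional[float] = None) -> None:
    """Hot path: one list append per turn, tagged mic or system."""
    fragment = {"stream": stream, "text": text,
                "ts": time.time() if ts is None else ts}
    store.rpush(f"call:{call_id}:fragments", json.dumps(fragment))


def finalize_call(store, call_id: str,
                  write_to_db: Callable[[str, List[dict]], None]) -> bool:
    """Cold path: merge fragments and write once, whichever stream ends first."""
    # SET with NX works as a simple lock: only the first caller proceeds.
    if not store.set(f"call:{call_id}:finalize_lock", "1", nx=True, ex=60):
        return False
    raw = store.lrange(f"call:{call_id}:fragments", 0, -1)
    fragments = sorted((json.loads(item) for item in raw), key=lambda f: f["ts"])
    write_to_db(call_id, fragments)
    return True
```

If the mic and system sessions both call `finalize_call` on disconnect, only the first one reaches `write_to_db`.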

Decision 6: Parallel session bootstrap and background context loading

Problem: The first spoken word should not pay for database reads and context assembly.

Approach: Before audio flows, context loads run in parallel:

–User/pre-built context (preferences, rules)
–Domain metadata
–In-memory call state initialized

Custom examples load from the database or generate via LLM in the background—not awaited before processing starts. Early windows may use defaults until custom content is ready.

On STT connect, we send up to 100 domain keyterms (at most 50 characters each). Better transcription upstream prevents wasted LLM cycles on garbled input, a latency investment that does not show up in backend timers.

Trade-off: First-minute quality may be lower until background loading completes. We accept defaults early rather than block the hot path.
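The bootstrap can be sketched with `asyncio.gather` for the awaited loads and a background task for custom examples. Everything named here is hypothetical: the loader functions sleep instead of hitting a database or LLM, and dropping (rather than truncating) over-length keyterms is an assumption.

```python
import asyncio
from typing import Dict, List

DEFAULT_EXAMPLES = ["default example"]


async def load_user_context(user_id: str) -> Dict:
    await asyncio.sleep(0.05)  # stand-in for a preferences/rules read
    return {"user_id": user_id}


async def load_domain_metadata(domain: str) -> Dict:
    await asyncio.sleep(0.05)  # stand-in for a metadata read
    return {"domain": domain}


async def load_custom_examples(user_id: str) -> List[str]:
    await asyncio.sleep(0.2)  # stand-in for a slow DB read or LLM generation
    return [f"custom example for {user_id}"]


def prepare_keyterms(terms: List[str], max_terms: int = 100,
                     max_chars: int = 50) -> List[str]:
    """Deduplicate, drop empty or over-length terms, and cap the list."""
    seen, kept = set(), []
    for term in (t.strip() for t in terms):
        if term and len(term) <= max_chars and term not in seen:
            seen.add(term)
            kept.append(term)
    return kept[:max_terms]


async def bootstrap_session(user_id: str, domain: str) -> Dict:
    # Awaited loads run concurrently, so connect cost is the slowest one.
    user_ctx, domain_meta = await asyncio.gather(
        load_user_context(user_id), load_domain_metadata(domain)
    )
    session = {"user": user_ctx, "domain": domain_meta,
               "examples": DEFAULT_EXAMPLES, "call_state": {}}

    async def fill_examples() -> None:
        session["examples"] = await load_custom_examples(user_id)

    # Not awaited: early windows use defaults until this completes.
    session["examples_task"] = asyncio.create_task(fill_examples())
    return session
```

Keeping the task on the session both prevents it from being garbage-collected and lets cleanup cancel it if the call ends early.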

Supporting tactics (shorter but real)

–Zero server-side audio processing. Binary audio buffers pass straight to STT. Every millisecond on audio middleware is stolen from understanding.
–Pause as a circuit breaker. When the user pauses: stop forwarding audio, ignore STT messages, clear flush timers. Pause is load shedding, not polish.
–JWT-only HTTP auth. HTTP routes verify tokens synchronously with no database lookup. WebSocket auth uses the same pattern plus a one-time usage check.
–Per-stage instrumentation. Every session maintains a timing log: window readiness, processing duration, token counts, Redis operation times, and full window processing time. You cannot tune window sizes intelligently until you know where milliseconds go.
–Parallel processing for dual streams. Mic and system audio are processed concurrently—neither stream blocks the other. This reduces end-to-end latency and improves interrupt handling.
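For the per-stage instrumentation tactic, a context manager is one lightweight way to collect timings. This is a sketch under assumptions: `TimingLog`, the stage names, and the extra fields are illustrative, not the actual log format.

```python
import time
from contextlib import contextmanager
from typing import Dict, Iterator, List


class TimingLog:
    """Per-session list of timed stage events (hypothetical sketch)."""

    def __init__(self) -> None:
        self.events: List[Dict] = []

    @contextmanager
    def stage(self, name: str, **fields) -> Iterator[Dict]:
        """Time a block; the yielded dict accepts extra fields such as tokens."""
        start = time.perf_counter()
        try:
            yield fields
        finally:
            elapsed_ms = (time.perf_counter() - start) * 1000.0
            self.events.append({"stage": name, "ms": elapsed_ms, **fields})


log = TimingLog()
with log.stage("llm_processing", window=1) as extra:
    time.sleep(0.02)       # stand-in for the model call
    extra["tokens"] = 128  # recorded once the call returns
```

Because the event is appended in `finally`, a stage that raises still leaves a timing record behind.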

Section 4 — What Still Hurts (Lessons Learned)

Latency work is never finished. These are the bottlenecks we know about:

Issue	Impact	Planned fix
Serial pipeline	Processing and response generation run sequentially; total latency is the sum of both	Parallelize response prep where possible
STT handler backpressure	Slow model window delays subsequent turns	Add backpressure queues
Dual-stream duplication	Mic and system maintain independent buffers; LLM calls can double	Shared window coordination across streams
No streaming LLM output	Responses arrive only after full response completes	Token streaming to client for perceived latency
Per-turn Redis writes	Two awaited Redis ops per turn (not batched)	Batched Redis pipelines
Connect-time database query	Usage limit verification adds hundreds of milliseconds to connect	Optimize SQL or cache
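To make the planned backpressure fix concrete, here is one possible shape for it: a bounded `asyncio.Queue` between the STT handler and the model worker. The drop-oldest policy, the `offer` and `worker` names, and the `None` sentinel are all assumptions for this sketch; coalescing windows instead of dropping them would be an equally valid policy.

```python
import asyncio
from typing import Awaitable, Callable, Optional


def offer(queue: asyncio.Queue, window: str) -> Optional[str]:
    """Enqueue without blocking the STT handler.

    If the queue is full, the oldest pending window is discarded and
    returned so the caller can log it.
    """
    dropped = None
    if queue.full():
        dropped = queue.get_nowait()
    queue.put_nowait(window)
    return dropped


async def worker(queue: asyncio.Queue,
                 process: Callable[[str], Awaitable[None]]) -> None:
    """Drain windows one at a time; a None sentinel stops the worker."""
    while True:
        window = await queue.get()
        if window is None:
            break
        await process(window)
```

A small `maxsize` bounds how far the pipeline can fall behind the conversation when a model call is slow.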

Section 5 — Latency Budget Breakdown

Here's where milliseconds are lost in typical voice agent stacks:

Component	Typical Latency	Our Optimization
STT (AssemblyAI streaming)	150–200 ms	Streaming API, final transcripts only
Buffer waiting	200–500 ms	Flush timer (1000ms) overrides wait
LLM (Gemini-flash-lite)	300–900 ms	Flash model, batched inputs
Response generation	800–2,500 ms	Fast model, conditional triggering
TTS	75–150 ms	Low-latency provider
Total	1,525–4,250 ms	Target: 350–500 ms

Achievable target: With this architecture, we reach ~380–450ms end-to-end latency (from user speech to agent response).
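The "Total" row is a serial sum of the component ranges. A small sketch that recomputes it from the table, assuming every stage runs strictly one after another with no overlap (the dictionary and function names are illustrative):

```python
from typing import Dict, Tuple

# (low, high) latency in milliseconds, copied from the table above.
BUDGET_MS: Dict[str, Tuple[int, int]] = {
    "STT": (150, 200),
    "Buffer waiting": (200, 500),
    "LLM": (300, 900),
    "Response generation": (800, 2500),
    "TTS": (75, 150),
}


def total_range(budget: Dict[str, Tuple[int, int]]) -> Tuple[int, int]:
    """Best-case and worst-case totals for a fully serial pipeline."""
    low = sum(lo for lo, _ in budget.values())
    high = sum(hi for _, hi in budget.values())
    return low, high


def largest_component(budget: Dict[str, Tuple[int, int]]) -> str:
    """Stage with the largest worst-case latency: the first place to look."""
    return max(budget, key=lambda name: budget[name][1])


low_ms, high_ms = total_range(BUDGET_MS)
print(f"serial total: {low_ms}-{high_ms} ms, worst stage: {largest_component(BUDGET_MS)}")
```

Keeping the budget as data makes it easy to re-run the sum whenever one stage's measured range changes.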

Section 6 — Closing

Reducing latency in voice agents is systems design, not model shopping. A faster flagship model on every partial transcript still loses to a pipeline that runs LLMs less often, on smaller inputs, with faster models.

Our approach in one sentence: batch speech into windows with overlap, use final transcripts only, trigger with a flush timer (1000ms), process with a fast model, buffer in Redis, verify with a slow model when nobody is waiting.

If you are building in this space, measure your windows per minute, your average window size, and your flush timer trigger rate. Those three numbers tell you more than any single end-to-end latency metric.

We would like to hear how others size word windows, configure flush timers, and split hot and cold paths—especially in multi-stream setups. The best patterns in voice agents are still being discovered in production, not in benchmarks.

Appendix A — Timing Instrumentation Reference

Each live session records a call timing log:

Event	Unit	Use
Call start (first transcript)	ISO timestamp	Pipeline T0
Window ready (word range)	word range	When batch threshold met or timer triggers
Full window processing	ms	End-to-end per window
LLM processing	ms + tokens	Fast model cost
Response generation	ms + tokens	Per window
Redis transcript store	ms	Hot-path write latency
Redis fragment append	ms	Hot-path write latency

Illustrative timing ranges

Stage	Illustrative range	Notes
Speech → 15 new words (low-lat stride)	~8–20 s	~120–150 wpm while speaking; pauses lengthen it
Speech → 30 new words (standard stride)	~15–45 s	~120–150 wpm while speaking; pauses lengthen it
LLM processing (flash-class, ~20 words)	300–600 ms	Low-lat config
LLM processing (flash-class, ~50 words)	300–900 ms	Standard config
Response generation (fast model)	800–2,000 ms	Per window
Full window (LLM only, no response)	400–1,200 ms	Typical steady-state
Full window (LLM + response)	1,500–3,500 ms	When response generated
Redis append (per turn)	1–10 ms	Two ops per turn
Tail flush debounce	1,000 ms fixed	After last word
WebSocket client push	RTT-dependent	Usually negligible

Suggested aggregates:

–Median and p95 LLM processing ms per window
–Median and p95 response ms
–Flush timer trigger rate: % windows triggered by timer vs buffer full
–Windows per minute per stream
–Connect-to-first-response latency
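These aggregates are straightforward to compute from the per-window records. The sketch below assumes each record is a dict with `llm_ms` and `trigger` keys (hypothetical field names) and uses a nearest-rank percentile; other percentile definitions will give slightly different p95 values on small samples.

```python
import math
import statistics
from typing import Dict, List


def percentile(values: List[float], pct: float) -> float:
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(windows: List[Dict], call_minutes: float) -> Dict[str, float]:
    """Aggregate one stream's window records into the suggested metrics."""
    llm_ms = [w["llm_ms"] for w in windows]
    timer_triggered = sum(1 for w in windows if w["trigger"] == "timer")
    return {
        "llm_p50_ms": statistics.median(llm_ms),
        "llm_p95_ms": percentile(llm_ms, 95),
        "flush_timer_rate": timer_triggered / len(windows),
        "windows_per_minute": len(windows) / call_minutes,
    }


sample = [
    {"llm_ms": 300, "trigger": "timer"},
    {"llm_ms": 400, "trigger": "buffer"},
    {"llm_ms": 500, "trigger": "buffer"},
    {"llm_ms": 900, "trigger": "timer"},
]
summary = summarize(sample, call_minutes=2)
```

A high flush-timer rate suggests the window is too large for the way users actually speak.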

Appendix B — Key Configuration Constants

Constant	Low-latency value	Standard value
Word window size	20 words	50 words
Window overlap	5 words (stride 15)	20 words (stride 30)
Flush timer	1,000 ms	1,000 ms
STT audio format	16 kHz PCM signed 16-bit little-endian	Same
STT keyterms	Up to 100 terms, 50 characters each	Same
Time limit check interval	1 minute (configurable)	Same
Redis-to-DB usage sync	30 minutes (configurable)	Same
Redis session time TTL	3,600 seconds	Same
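The constants above fit naturally into a single immutable config object. This is an illustrative sketch: the class and field names are hypothetical, and the byte-rate helper assumes mono audio, which the table does not state.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PipelineConfig:
    window_size_words: int
    overlap_words: int
    flush_timer_ms: int = 1000
    sample_rate_hz: int = 16_000
    bytes_per_sample: int = 2           # signed 16-bit little-endian PCM
    max_keyterms: int = 100
    max_keyterm_chars: int = 50
    time_limit_check_s: int = 60        # 1 minute
    usage_sync_s: int = 30 * 60         # Redis-to-DB usage sync
    session_ttl_s: int = 3600           # Redis session time TTL

    def __post_init__(self) -> None:
        if not 0 <= self.overlap_words < self.window_size_words:
            raise ValueError("overlap must be smaller than the window")

    @property
    def stride_words(self) -> int:
        return self.window_size_words - self.overlap_words

    def audio_bytes(self, duration_ms: int, channels: int = 1) -> int:
        """Raw PCM size for a chunk; channels=1 (mono) is an assumption."""
        per_second = self.sample_rate_hz * self.bytes_per_sample * channels
        return per_second * duration_ms // 1000


LOW_LATENCY = PipelineConfig(window_size_words=20, overlap_words=5)
STANDARD = PipelineConfig(window_size_words=50, overlap_words=20)
```

Deriving the stride from window size and overlap, rather than storing it, keeps the three values from drifting apart.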