
How We Cut Voice-Agent Latency to Sub-500ms: A Production Architecture
A case study in building real-time voice agents: batching, model tiering, flush timers, and keeping the hot path off disk.
Section 1 — Why Latency Is the Product
Voice agents fail differently from chatbots. In chat, the user waits for a reply. In voice, both parties speak continuously. Partial utterances arrive every few seconds. Running a large language model on every fragment feels responsive in demos but collapses in production: API latency stacks, costs explode, and the pipeline falls behind the conversation.
Humans expect conversational responses within 300–500ms. When latency exceeds 800ms, the interaction feels mechanical and frustrating. For real-time voice agents and call assistance systems, latency is not a backend metric—it's the product.
We had to define a latency budget across four layers:
Our backend optimizes layers two through four. Audio moves from the client to our server and straight through to a streaming speech-to-text provider—no transcoding on our side. LLM processing and response generation are ours. The constraint we designed for is simple: the user does not pause for your LLM.
That constraint drove every architectural decision. We didn't try to make one model faster. We redesigned the pipeline so expensive work runs less often, on smaller inputs, with faster models.
Section 2 — Architecture Snapshot
Dual-stream sessions
Each live call opens two independent WebSocket sessions:
| Stream | Source | Purpose |
|---|---|---|
| Mic | User's microphone | User-side speech |
| System | System audio capture | Agent/tool-side speech |
Each stream maintains its own session state, word buffer, streaming STT connection, and processing pipeline. Speaker separation costs a second parallel pipeline whenever both streams are active, but it is essential for accurate turn detection and interrupt handling.
Hot path vs cold path
During the call, the hot path runs entirely in memory and Redis. PostgreSQL receives call data only when the session ends.
What we intentionally keep off the hot path:
- Relational database writes (transcripts, responses, duration)
- Slow "pro" model verification over the full call
- Heavy context generation (custom examples may finish in the background after audio starts)
- Usage-limit sync to the database (Redis hash during the call; DB sync every 30 minutes)
End-to-end flow on each spoken turn
- The client sends raw PCM audio at 16 kHz over WebSocket
- The server forwards bytes unchanged to streaming STT
- STT returns a final transcript (end-of-turn)
- Words accumulate in an in-memory buffer
- When the buffer reaches its threshold or the flush timer (1000 ms) fires, the window is processed by the LLM
- The response is generated and pushed to the client immediately
- Transcript fragments are appended to Redis
- On call end: merge Redis data, run the slow verifier, write to PostgreSQL once
This is not a single LLM call. It is a multi-stage voice agent where each stage has its own latency profile.
Section 3 — The Six Decisions That Mattered Most
For each decision: the problem we faced, the approach we took, and the trade-off we accepted.
Decision 1: Word-window batching with overlap and debounced flush timer
Problem: Streaming STT emits short turns—often 5 to 15 words. Invoking an LLM on every turn means dozens of API calls per minute and unpredictable backlog when model latency spikes.
Approach: We batch transcript words into windows before processing. Two configurations work:
| Configuration | Window Size | Overlap | Stride | Use Case |
|---|---|---|---|---|
| Low latency | 20 words | 5 words | 15 words | Fast response, short utterances |
| High context (standard) | 50 words | 20 words | 30 words | Long conversations, more context |
Overlap preserves context at boundaries so sentences split across windows are not lost.
Batching alone creates blind spots. A trailing phrase might never fill a full window before the speaker stops. So we add a flush timer of 1000 ms: if the buffer holds less than a full window and no new word arrives for 1 second, we flush and process the remainder once.
We also skip duplicate windows: if overlap produces the same text as the last processed window, we do not call the model again.
Trade-off: We traded a few seconds of semantic freshness for predictable throughput. Responses may lag roughly 15–30 words of speech behind the live conversation—acceptable for voice agents.
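The windowing rules above can be sketched in a few lines. This is a minimal illustration, not the production code: the class and method names (`WordWindower`, `add_words`, `flush`) are hypothetical, and clearing the buffer after a timer flush is an assumption, since the text does not say whether overlap carries across a flush.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class WordWindower:
    """Batches final-transcript words into overlapping windows (illustrative)."""
    window_size: int = 20   # low-latency configuration from the table
    overlap: int = 5
    buffer: List[str] = field(default_factory=list)
    carried: int = 0        # leading buffer words already seen by the model
    last_window: Optional[str] = None

    def __post_init__(self) -> None:
        if not 0 <= self.overlap < self.window_size:
            raise ValueError("overlap must be smaller than window_size")

    @property
    def stride(self) -> int:
        return self.window_size - self.overlap

    def add_words(self, words: List[str]) -> List[str]:
        """Append words and return every full window that became ready."""
        self.buffer.extend(words)
        ready: List[str] = []
        while len(self.buffer) >= self.window_size:
            window = self._emit(self.buffer[: self.window_size])
            # Advance by the stride so the last `overlap` words are reused.
            self.buffer = self.buffer[self.stride:]
            self.carried = self.overlap
            if window is not None:
                ready.append(window)
        return ready

    def flush(self) -> Optional[str]:
        """Timer path: process the remainder once, if it holds new words."""
        if len(self.buffer) <= self.carried:
            return None
        window = self._emit(self.buffer)
        self.buffer = []    # assumption: context does not carry across a flush
        self.carried = 0
        return window

    def _emit(self, words: List[str]) -> Optional[str]:
        text = " ".join(words)
        if text == self.last_window:
            return None     # duplicate window: skip the model call
        self.last_window = text
        return text


# Small sizes so the behaviour is easy to follow by eye.
windower = WordWindower(window_size=4, overlap=1)
full_windows = windower.add_words("a b c d e f g h".split())
tail = windower.flush()
```

With a window of 4 and an overlap of 1, the second window starts on the last word of the first, and the flush picks up the trailing words that never filled a window.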
Decision 2: Final transcripts only—never partials
Problem: AssemblyAI and similar STT providers emit both partial and final transcripts. Partial transcripts are evolving, incomplete, and change constantly. Processing them pollutes your word buffer and triggers unnecessary LLM calls.
Approach: We discard partial transcripts entirely. Only final transcripts (immutable, complete utterances) populate the word buffer.
This is critical: processing partials creates noise. A word might appear in a partial, disappear in the next, then reappear modified. Triggering on every partial would set off a cascade of unnecessary LLM calls, inflating both latency and cost.
Trade-off: We wait slightly longer for final transcripts (typically 100–300 ms more than partials), but we make roughly a tenth as many LLM calls. The net effect is lower end-to-end latency.
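The filter itself is small. The sketch below assumes a hypothetical message shape, a dict with a `type` of `"partial"` or `"final"` and a `text` field; real providers use their own field names, so treat the keys as placeholders.

```python
from typing import Dict, List


def words_from_message(message: Dict[str, object]) -> List[str]:
    """Return the words to buffer, or an empty list for anything not final."""
    # Hypothetical schema: {"type": "partial" | "final", "text": "..."}
    if message.get("type") != "final":
        return []
    return str(message.get("text", "")).split()


partial_words = words_from_message({"type": "partial", "text": "can you hel"})
final_words = words_from_message({"type": "final", "text": "can you help me"})
```

Only the second call contributes words to the buffer; the partial is dropped before it can trigger any downstream work.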
Decision 3: Flush timer overrides buffer threshold
Problem: Waiting for the buffer to fill (e.g., 50 words) introduces artificial delays. A user might say a complete sentence in 8 words, then stop. If we wait for 50 words, we add 2–3 seconds of pointless latency.
Approach: The flush timer (1000ms) triggers processing regardless of buffer state. After the last word:
- If the buffer has 50 or more words: process immediately
- If the buffer has fewer than 50 words: wait 1000 ms, then process whatever is there
This ensures:
- No idle waiting for the buffer to fill
- A bounded trigger delay (processing starts at most 1000 ms after the last word)
- Better handling of short utterances
Trade-off: We process smaller windows more often. But the LLM call count is still bounded by batching (vs. per-word), and the latency benefit is massive.
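One way to implement a debounced flush timer is with an `asyncio` task that is cancelled and recreated whenever a new word arrives. The following is a sketch under that assumption; `FlushTimer` and its methods are illustrative names, and the demo uses a 50 ms delay instead of 1000 ms only so it finishes quickly.

```python
import asyncio
from typing import Awaitable, Callable, List, Optional


class FlushTimer:
    """Debounced timer: every restart() pushes the flush further out."""

    def __init__(self, delay_s: float, on_flush: Callable[[], Awaitable[None]]):
        self.delay_s = delay_s
        self.on_flush = on_flush
        self._task: Optional[asyncio.Task] = None

    def restart(self) -> None:
        """Call after each final word; replaces any pending countdown."""
        self.cancel()
        self._task = asyncio.get_running_loop().create_task(self._run())

    def cancel(self) -> None:
        """Also used when the user pauses, to clear the timer."""
        if self._task is not None and not self._task.done():
            self._task.cancel()
        self._task = None

    async def _run(self) -> None:
        await asyncio.sleep(self.delay_s)
        await self.on_flush()


async def demo() -> List[str]:
    flushed: List[str] = []

    async def on_flush() -> None:
        flushed.append("flush")

    timer = FlushTimer(0.05, on_flush)
    for _ in range(3):          # three words arriving 10 ms apart
        timer.restart()
        await asyncio.sleep(0.01)
    await asyncio.sleep(0.15)   # silence: the timer fires once
    return flushed
```

Three restarts in quick succession produce a single flush after the silence. A production version would also need to decide what happens when a word arrives while a flush is already running; this sketch simply cancels it.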
Decision 4: Two-speed model tiering
Problem: One model cannot be both fast enough for the live path and accurate enough for full-call reconciliation.
Approach: We run a two-speed LLM architecture:
| Stage | Model tier | When it runs | Latency target |
|---|---|---|---|
| Real-time processing | Fast flash-class model (e.g., Gemini-flash-lite) | Every processed window | 300–900 ms |
| Mid-call verifier | Large pro-class model | Once at ~50% of call | Fire-and-forget |
| End-of-call verifier | Large pro-class model | Session cleanup only | Fire-and-forget |
The real-time model (Gemini-flash-lite) is optimized for speed. The verifier models are more accurate but slower—they run asynchronously and never block the live pipeline.
The mid-call verifier loads the merged transcript from Redis and reconciles the conversation. It is launched fire-and-forget—never awaited—so a slow verification pass never blocks the next transcript turn.
Trade-off: Real-time processing may miss nuance that the verifier catches later. That is acceptable because the user is not waiting on that pass.
Decision 5: Redis write-behind
Problem: Writing transcripts and responses to PostgreSQL on every spoken turn would add disk latency and connection pool contention to the hot path.
Approach: During the call, everything that must outlive the current turn is written to Redis:
- Running per-stream transcript strings
- Timestamped transcript fragments tagged mic or system
- Response texts
- Session usage time in a hash with a 1-hour TTL
PostgreSQL inserts happen once at cleanup, guarded by distributed locks so mic and system disconnects do not double-write the same call.
Trade-off: Redis is the source of truth until the call ends, so a crash before cleanup could lose in-flight data. We accept that risk by treating live responses as ephemeral, real-time output and post-call storage as best-effort durable. For voice agents, losing a response mid-call is worse than losing one after hang-up.
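The write-behind pattern with a once-only cleanup can be illustrated without a live Redis server. In this sketch, `FakeRedis` is an in-memory stand-in that mimics the three commands used (`RPUSH`, `LRANGE`, and `SET` with `NX`); the key names, the `stream|text` fragment encoding, and the `db_rows` list standing in for PostgreSQL are all hypothetical.

```python
from typing import Dict, List, Optional


class FakeRedis:
    """In-memory stand-in for the three Redis commands this sketch needs."""

    def __init__(self) -> None:
        self.lists: Dict[str, List[str]] = {}
        self.strings: Dict[str, str] = {}

    def rpush(self, key: str, value: str) -> int:
        self.lists.setdefault(key, []).append(value)
        return len(self.lists[key])

    def lrange(self, key: str, start: int, end: int) -> List[str]:
        items = self.lists.get(key, [])
        return items[start:] if end == -1 else items[start:end + 1]

    def set(self, key: str, value: str, nx: bool = False) -> Optional[bool]:
        if nx and key in self.strings:
            return None                     # lock already held
        self.strings[key] = value
        return True


def record_turn(r: FakeRedis, call_id: str, stream: str, text: str) -> None:
    """Hot path: append a tagged fragment; no database is touched."""
    r.rpush(f"call:{call_id}:fragments", f"{stream}|{text}")


def cleanup(r: FakeRedis, call_id: str, db_rows: List[dict]) -> bool:
    """Cold path: only the first disconnect to take the lock writes the row."""
    if not r.set(f"call:{call_id}:cleanup-lock", "1", nx=True):
        return False
    fragments = r.lrange(f"call:{call_id}:fragments", 0, -1)
    db_rows.append({"call_id": call_id, "fragments": fragments})
    return True


redis_stub = FakeRedis()
rows: List[dict] = []
record_turn(redis_stub, "c1", "mic", "hello")
record_turn(redis_stub, "c1", "system", "hi there")
first = cleanup(redis_stub, "c1", rows)     # mic stream disconnects
second = cleanup(redis_stub, "c1", rows)    # system stream disconnects
```

Both disconnects call `cleanup`, but only one row is written. Against a real Redis, the lock would also need an expiry so that a crashed worker cannot hold it forever.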
Decision 6: Parallel session bootstrap and background context loading
Problem: The first spoken word should not pay for database reads and context assembly.
Approach: Before audio flows, context loads run in parallel:
- User/pre-built context (preferences, rules)
- Domain metadata
- In-memory call state initialization
Custom examples load from the database or generate via LLM in the background—not awaited before processing starts. Early windows may use defaults until custom content is ready.
On STT connect, we send up to 100 domain keyterms (50 characters each). Better transcription upstream prevents wasted LLM cycles on garbled input—a latency investment that does not show up in backend timers.
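Preparing the keyterm list before connecting might look like the sketch below. The limits come from the text; the function name and the choice to drop over-long terms instead of truncating them are assumptions, made because a truncated term could bias transcription toward the wrong word.

```python
from typing import Iterable, List

MAX_TERMS = 100   # limits stated in the text
MAX_CHARS = 50


def prepare_keyterms(terms: Iterable[str]) -> List[str]:
    """Deduplicate, drop empty or over-long terms, and cap the list."""
    seen = set()
    result: List[str] = []
    for term in terms:
        cleaned = term.strip()
        key = cleaned.lower()
        if not cleaned or len(cleaned) > MAX_CHARS or key in seen:
            continue
        seen.add(key)
        result.append(cleaned)
        if len(result) == MAX_TERMS:
            break
    return result


sample = prepare_keyterms(["Redis", "redis ", "x" * 51, "", "PostgreSQL"])
```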
Trade-off: First-minute quality may be lower until background loading completes. We accept defaults early rather than block the hot path.
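The bootstrap pattern, awaiting the cheap loads together and leaving the expensive one in the background, can be sketched with `asyncio.gather` and `asyncio.create_task`. All loader functions, the `Session` fields, and the placeholder data here are hypothetical; the sleeps simulate I/O.

```python
import asyncio
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

DEFAULT_EXAMPLES = ["default example"]


async def load_user_context() -> Dict[str, str]:
    await asyncio.sleep(0.01)               # simulated read
    return {"rules": "be concise"}


async def load_domain_metadata() -> Dict[str, str]:
    await asyncio.sleep(0.01)
    return {"domain": "support"}


async def load_custom_examples() -> List[str]:
    await asyncio.sleep(0.05)               # slow: database or LLM generation
    return ["custom example"]


@dataclass
class Session:
    user_context: Dict[str, str]
    domain: Dict[str, str]
    examples: List[str] = field(default_factory=lambda: list(DEFAULT_EXAMPLES))
    examples_task: Optional[asyncio.Task] = None


async def _fill_examples(session: Session) -> None:
    session.examples = await load_custom_examples()


async def bootstrap() -> Session:
    # The two fast loads run concurrently and are awaited together.
    user_context, domain = await asyncio.gather(
        load_user_context(), load_domain_metadata()
    )
    session = Session(user_context, domain)
    # Custom examples are not awaited: processing starts with the defaults.
    session.examples_task = asyncio.create_task(_fill_examples(session))
    return session


async def demo() -> Tuple[List[str], List[str]]:
    session = await bootstrap()
    early = list(session.examples)           # what the first windows would see
    await session.examples_task              # later in the call
    return early, session.examples
```

The first windows see the defaults, and later windows pick up the custom examples once the background load has finished.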
Supporting tactics (shorter but real)
- Zero server-side audio processing. Binary audio buffers pass straight to STT. Every millisecond on audio middleware is stolen from understanding.
- Pause as a circuit breaker. When the user pauses: stop forwarding audio, ignore STT messages, clear flush timers. Pause is load shedding, not polish.
- JWT-only HTTP auth. HTTP routes verify tokens synchronously with no database lookup. WebSocket auth uses the same pattern plus a one-time usage check.
- Per-stage instrumentation. Every session maintains a timing log: window readiness, processing duration, token counts, Redis operation times, and full window processing time. You cannot tune window sizes intelligently until you know where milliseconds go.
- Parallel processing for dual streams. Mic and system audio are processed concurrently, so neither stream blocks the other. This reduces end-to-end latency and improves interrupt handling.
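Per-stage instrumentation can be as light as a context manager that appends to an in-memory log. The sketch below is one possible shape, with hypothetical names (`TimingLog`, `stage`) and a sleep standing in for a model call.

```python
import time
from contextlib import contextmanager
from typing import Dict, Iterator, List


class TimingLog:
    """Collects per-stage durations for one session (illustrative)."""

    def __init__(self) -> None:
        self.entries: List[Dict[str, object]] = []

    @contextmanager
    def stage(self, name: str, **fields: object) -> Iterator[None]:
        start = time.perf_counter()
        try:
            yield
        finally:
            # Recorded even if the stage raises, so failures are still timed.
            elapsed_ms = (time.perf_counter() - start) * 1000.0
            self.entries.append({"stage": name, "ms": elapsed_ms, **fields})


log = TimingLog()
with log.stage("llm_processing", tokens=42):
    time.sleep(0.01)                        # stands in for the model call
```

Each entry carries the stage name, the elapsed milliseconds, and any extra fields such as token counts, which is enough to compute per-window aggregates later.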
Section 4 — What Still Hurts (Lessons Learned)
Latency work is never finished. These are the bottlenecks we know about:
| Issue | Impact | Planned fix |
|---|---|---|
| Serial pipeline | Processing and response generation run sequentially, so total latency is their sum | Parallelize response prep where possible |
| STT handler backpressure | Slow model window delays subsequent turns | Add backpressure queues |
| Dual-stream duplication | Mic and system maintain independent buffers; LLM calls can double | Shared window coordination across streams |
| No streaming LLM output | Responses reach the client only after generation completes | Token streaming to client for perceived latency |
| Per-turn Redis writes | Two awaited Redis ops per turn (not batched) | Batched Redis pipelines |
| Connect-time database query | Usage limit verification adds hundreds of milliseconds to connect | Optimize SQL or cache |
Section 5 — Latency Budget Breakdown
Here's where milliseconds are lost in typical voice agent stacks:
| Component | Typical Latency | Our Optimization |
|---|---|---|
| STT (AssemblyAI streaming) | 150–200 ms | Streaming API, final transcripts only |
| Buffer waiting | 200–500 ms | Flush timer (1000ms) overrides wait |
| LLM (Gemini-flash-lite) | 300–900 ms | Flash model, batched inputs |
| Response generation | 800–2,500 ms | Fast model, conditional triggering |
| TTS | 75–150 ms | Low-latency provider |
| Total | 1,525–4,250 ms | Target: 350–500 ms |
Achievable target: With this architecture, we reach ~380–450ms end-to-end latency (from user speech to agent response).
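The typical-latency column can be cross-checked by summing its component rows. A minimal sketch, with the ranges copied from the table above; the dictionary and function names are illustrative.

```python
from typing import Dict, Tuple

# (low, high) in milliseconds, copied from the component rows of the table.
BUDGET_MS: Dict[str, Tuple[int, int]] = {
    "STT": (150, 200),
    "Buffer waiting": (200, 500),
    "LLM": (300, 900),
    "Response generation": (800, 2500),
    "TTS": (75, 150),
}


def total_range(budget: Dict[str, Tuple[int, int]]) -> Tuple[int, int]:
    """Sum the low ends and the high ends of a serial pipeline."""
    low = sum(lo for lo, _ in budget.values())
    high = sum(hi for _, hi in budget.values())
    return low, high


typical_total = total_range(BUDGET_MS)   # (1525, 4250)
```

Summing is only valid because the stages run serially; any stage that is overlapped with another would contribute less than its full range.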
Section 6 — Closing
Reducing latency in voice agents is systems design, not model shopping. A faster flagship model on every partial transcript still loses to a pipeline that runs LLMs less often, on smaller inputs, with faster models.
Our approach in one sentence: batch speech into windows with overlap, use final transcripts only, trigger with a flush timer (1000ms), process with a fast model, buffer in Redis, verify with a slow model when nobody is waiting.
If you are building in this space, measure your windows per minute, your average window size, and your flush timer trigger rate. Those three numbers tell you more than any single end-to-end latency metric.
We would like to hear how others size word windows, configure flush timers, and split hot and cold paths—especially in multi-stream setups. The best patterns in voice agents are still being discovered in production, not in benchmarks.
Appendix A — Timing Instrumentation Reference
Each live session records a call timing log:
| Event | Unit | Use |
|---|---|---|
| Call start (first transcript) | ISO timestamp | Pipeline T0 |
| Window ready (word range) | word range | When batch threshold met or timer triggers |
| Full window processing | ms | End-to-end per window |
| LLM processing | ms + tokens | Fast model cost |
| Response generation | ms + tokens | Per window |
| Redis transcript store | ms | Hot-path write latency |
| Redis fragment append | ms | Hot-path write latency |
Illustrative timing ranges
| Stage | Illustrative range | Notes |
|---|---|---|
| Speech → 15 new words (low-lat stride) | ~8–20 s | ~120–150 wpm |
| Speech → 30 new words (standard stride) | ~15–45 s | ~120–150 wpm |
| LLM processing (flash-class, ~20 words) | 300–600 ms | Low-lat config |
| LLM processing (flash-class, ~50 words) | 300–900 ms | Standard config |
| Response generation (fast model) | 800–2,000 ms | Per window |
| Full window (LLM only, no response) | 400–1,200 ms | Typical steady-state |
| Full window (LLM + response) | 1,500–3,500 ms | When response generated |
| Redis append (per turn) | 1–10 ms | Two ops per turn |
| Tail flush debounce | 1,000 ms fixed | After last word |
| WebSocket client push | RTT-dependent | Usually negligible |
Suggested aggregates:
- Median and p95 LLM processing ms per window
- Median and p95 response ms
- Flush timer trigger rate: % windows triggered by timer vs buffer full
- Windows per minute per stream
- Connect-to-first-response latency
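Several of these aggregates can be computed directly from the timing log. The sketch below assumes a hypothetical record shape, one dict per window with `llm_ms` and a `trigger` of `"timer"` or `"buffer"`, and uses the nearest-rank definition of a percentile; other definitions give slightly different p95 values on small samples.

```python
import math
from statistics import median
from typing import Dict, List


def percentile(values: List[float], pct: float) -> float:
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(windows: List[Dict[str, object]]) -> Dict[str, float]:
    """Median and p95 LLM time plus the share of timer-triggered windows."""
    llm = [float(w["llm_ms"]) for w in windows]
    timer_count = sum(1 for w in windows if w["trigger"] == "timer")
    return {
        "llm_median_ms": median(llm),
        "llm_p95_ms": percentile(llm, 95),
        "flush_timer_rate": timer_count / len(windows),
    }


summary = summarize([
    {"llm_ms": 300, "trigger": "buffer"},
    {"llm_ms": 400, "trigger": "timer"},
    {"llm_ms": 500, "trigger": "buffer"},
    {"llm_ms": 900, "trigger": "timer"},
])
```

A high flush timer rate suggests that most utterances are shorter than the window, which is a signal to try the smaller window configuration.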
Appendix B — Key Configuration Constants
| Constant | Low-latency value | Standard value |
|---|---|---|
| Word window size | 20 words | 50 words |
| Window overlap | 5 words (stride 15) | 20 words (stride 30) |
| Flush timer | 1,000 ms | 1,000 ms |
| STT audio format | 16 kHz PCM signed 16-bit little-endian | Same |
| STT keyterms | Up to 100 terms, 50 characters each | Same |
| Time limit check interval | 1 minute (configurable) | Same |
| Redis-to-DB usage sync | 30 minutes (configurable) | Same |
| Redis session time TTL | 3,600 seconds | Same |
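These constants could be grouped into a single immutable configuration object so that the stride is always derived instead of being set by hand. A sketch, with hypothetical field names and the values taken from the table above:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PipelineConfig:
    window_size_words: int
    overlap_words: int
    flush_timer_ms: int = 1000
    stt_sample_rate_hz: int = 16000      # PCM signed 16-bit little-endian
    max_keyterms: int = 100
    max_keyterm_chars: int = 50
    time_limit_check_s: int = 60
    usage_sync_s: int = 30 * 60
    session_ttl_s: int = 3600

    def __post_init__(self) -> None:
        if not 0 <= self.overlap_words < self.window_size_words:
            raise ValueError("overlap must be smaller than the window size")

    @property
    def stride_words(self) -> int:
        """New words needed before the next full window is ready."""
        return self.window_size_words - self.overlap_words


LOW_LATENCY = PipelineConfig(window_size_words=20, overlap_words=5)
STANDARD = PipelineConfig(window_size_words=50, overlap_words=20)
```

Validating the overlap at construction time prevents a zero or negative stride, which would otherwise stall the windowing loop.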