1. Latency & Turn‑Taking is the core UX killer
“Voice is an orchestration problem” – boznz
“The median delay between human speakers during a conversation is 0 ms” – jedberg
“The warm TTS websocket pool saving ~300 ms” – Carmack
The discussion repeatedly points out that time to first token (TTFT, the delay before the first useful output) dominates perceived speed, and that semantic end‑of‑turn detection (Deepgram Flux, OpenAI semantic VAD, Pipecat smart‑turn) is the key to avoiding premature interruptions.
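The idea behind semantic end‑of‑turn detection can be sketched as an adaptive silence timeout: wait longer when the utterance looks unfinished, commit quickly when it looks complete. The heuristic and thresholds below are illustrative assumptions; the products named above use learned models rather than word lists.

```python
# Minimal sketch of semantic end-of-turn gating. Real systems (Deepgram
# Flux, OpenAI semantic VAD) use trained models; this word-list heuristic
# and these timeout values are illustrative assumptions only.

INCOMPLETE_ENDINGS = {"and", "but", "so", "because", "um", "uh", "the", "to"}

def end_of_turn_timeout(partial_transcript: str,
                        short_ms: int = 200,
                        long_ms: int = 1200) -> int:
    """Return how long (ms) to wait in silence before committing the turn."""
    words = partial_transcript.strip().lower().rstrip(".?!").split()
    if not words:
        return long_ms               # nothing said yet: keep listening
    if words[-1] in INCOMPLETE_ENDINGS:
        return long_ms               # utterance looks unfinished: hold
    return short_ms                  # looks complete: respond fast
```

A plain energy‑based VAD would use a single fixed timeout; the point of the semantic variant is that "book a table for two and" holds the floor while "book a table for two" releases it almost immediately.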
2. End‑to‑End vs Cascading STT→LLM→TTS
“STT → LLM → TTS is a dead end. The future is end‑to‑end.” – modeless
“The cascading model approach is much more amenable to specialization and auditability.” – cootsnuck
While some argue for a single end‑to‑end model, many participants defend the modular pipeline for its flexibility, observability, and the ability to plug in best‑in‑class components.
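The modular pipeline the defenders describe can be sketched as three swappable stages behind narrow interfaces. The interfaces and class names here are illustrative assumptions, not any particular framework's API; the point is that each stage can be replaced with a best‑in‑class vendor and observed independently.

```python
# Hedged sketch of a cascading STT -> LLM -> TTS pipeline with pluggable
# stages. Interface and class names are assumptions for illustration.
from typing import Protocol

class STT(Protocol):
    def transcribe(self, audio: bytes) -> str: ...

class LLM(Protocol):
    def reply(self, text: str) -> str: ...

class TTS(Protocol):
    def synthesize(self, text: str) -> bytes: ...

class VoicePipeline:
    """Each stage's input and output is a plain value, so every hop can
    be logged, audited, and swapped without touching the others."""
    def __init__(self, stt: STT, llm: LLM, tts: TTS):
        self.stt, self.llm, self.tts = stt, llm, tts

    def handle_turn(self, audio: bytes) -> bytes:
        transcript = self.stt.transcribe(audio)
        response = self.llm.reply(transcript)
        return self.tts.synthesize(response)
```

This is the "amenable to specialization and auditability" argument in code: an end‑to‑end speech model collapses the transcript and response into opaque internals, while the cascade exposes them as inspectable seams.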
3. Provider & Benchmarking Wars
“Soniox Real‑time … always works better than VAD.” – lukax
“Deepgram’s Flux … is a higher‑level abstraction than VAD.” – nicktikhonov
“Soniox wins the independent benchmarks done by Daily.” – lukax
Users compare services (Soniox, Deepgram, OpenAI, Gemini, Cerebras, Groq) on latency, accuracy, and cost, often citing benchmark links and personal experience.
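When comparing providers on latency, the number that matters per the thread is TTFT rather than total generation time. A minimal measurement harness looks like the sketch below; the provider call is faked with a generator, and in practice you would wrap each vendor's streaming API.

```python
# Hedged sketch of a TTFT measurement harness. fake_provider is a stand-in
# for a vendor's streaming call; its delay is an arbitrary simulation.
import time

def measure_ttft(stream_fn) -> float:
    """Return seconds until the first chunk arrives from a streaming call."""
    start = time.monotonic()
    for _chunk in stream_fn():
        return time.monotonic() - start   # stop at the first token/chunk
    return float("inf")                   # stream produced nothing

def fake_provider():
    time.sleep(0.05)   # simulated network + model latency
    yield "hello"
    yield " world"
```

Running the same harness against several vendors from the same region gives an apples‑to‑apples comparison of perceived responsiveness, which is what the benchmark links in the thread are measuring.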
4. Practical Deployment & Cost Reality
“You need to handle the race condition where the system already committed to a response path that’s now invalid.” – evara‑ai
“The hard part is running a model that can detect a ‘Hey agent’ on‑device.” – nicktikhonov
“The cost of running a voice assistant for millions is huge.” – jedberg
Participants discuss edge‑colocation, local vs cloud inference, tool‑call orchestration, guardrails for safety, and the economics of scaling voice agents to production workloads.
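The race condition evara‑ai describes (a response path the system has already committed to being invalidated by a barge‑in) is commonly handled with a generation counter: late‑arriving work checks whether it is still current and drops itself if not. The sketch below is a generic version of that pattern, not taken from any specific framework.

```python
# Hedged sketch of barge-in invalidation via a generation counter. Any
# in-flight response tagged with an old generation is discarded instead
# of being played. A generic concurrency pattern, names are illustrative.
import threading

class TurnState:
    def __init__(self):
        self._lock = threading.Lock()
        self.generation = 0

    def barge_in(self) -> int:
        """User interrupted: invalidate any in-flight response path."""
        with self._lock:
            self.generation += 1
            return self.generation

    def is_current(self, gen: int) -> bool:
        """Has the turn this work was committed under been superseded?"""
        with self._lock:
            return gen == self.generation
```

The TTS callback captures the generation at commit time and re‑checks it before writing audio to the output stream, so a barge‑in silently discards the now‑invalid response rather than talking over the user.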