What This Example Shows
- How to treat voice as I/O — Vapi runs the realtime audio loop, Swarms runs the brain
- A FastAPI webhook that receives caller transcripts and fires a
MultiAgentRouterswarm - Four specialist voice agents (Booking, Billing, Emergency Triage, FAQ) tuned for spoken replies
- An end-of-call summary sub-agent that writes a structured record to Airtable or HubSpot
- A real per-turn latency budget and per-call cost breakdown you can quote to a customer
Vapi, Retell, and 11Labs handle the realtime audio loop — STT, TTS, barge-in detection, and the WebRTC plumbing. Swarms handles the brains: routing, specialist reasoning, structured output, and CRM logging. This pattern works with any voice infrastructure that supports a webhook on user turn.
Why This Matters
A single voice receptionist costs around $35K/yr fully loaded and works 40 hours a week. A voice agent stack runs 24/7 for roughly $0.30 per call. Most voice startups shipping today wire a single LLM call to each turn and call it done — that hits a ceiling fast because one prompt can’t be specialist, structured, and fast all at once. A swarm gives you proper intent routing, parallel reasoning when you need it, and CRM logging on the same turn budget. You ship a real product, not a chatbot with a phone number.The Architecture
Step 1: Setup
Step 2: Configure the Vapi Webhook
In your Vapi assistant config, point theserverUrl at your FastAPI endpoint. Vapi will POST a payload on every user turn and again at end-of-call. The exact schema is documented by Vapi — what matters for this pattern is that the payload carries the caller’s transcript, a stable call_id, and the caller’s phone number.
The Vapi assistant’s own model is intentionally minimal — it just keeps the audio loop alive. The actual reasoning happens in your FastAPI handler when Vapi calls back with the user transcript. This is the pattern voice-AI builders use when they want full control over routing and tools.
Step 3: Define the Specialist Agents
These agents are tuned for speech, not text. Short sentences. No bullet lists. No markdown. The caller is on a phone — they can’t see your formatting.Step 4: The MultiAgentRouter Endpoint
This is the per-turn handler. Vapi posts the running transcript on every user turn; we send it through the router and return the reply in the format Vapi expects.The
timeout=20 matters. If Swarms takes longer than the turn budget, Vapi will stall the caller — pick fast models (gpt-4.1-mini, claude-haiku-4.5) for routing and only escalate to a larger model on flagged calls.Step 5: End-of-Call Summary → CRM
When the caller hangs up, Vapi sends anend-of-call-report event with the full transcript, duration, and any tool calls made during the call. Route that to a single summary agent and POST the result to your CRM.
write_to_airtable for a POST to https://api.hubapi.com/crm/v3/objects/calls with an OAuth bearer — the structured-summary shape is the same.
Latency Budget
A real-feeling voice call needs round-trip turn latency under ~1.5 seconds. Here is where the time goes:| Stage | Typical latency | Owner |
|---|---|---|
| STT (Deepgram / Whisper) | ~150–250 ms | Vapi |
| Webhook network round trip | ~50–100 ms | Your infra |
| Swarms router decision | ~200–400 ms | Swarms (gpt-4.1-mini) |
| Specialist agent reply | ~300–600 ms | Swarms (gpt-4.1-mini) |
| TTS first audio chunk | ~150–250 ms | Vapi |
| Total to first audio out | ~850 ms – 1.6 s |
gpt-4.1-mini or claude-haiku-4.5. Only escalate to gpt-4.1 or claude-sonnet-4.5 on flagged calls (emergencies, high-value accounts). Cap max_tokens aggressively — voice replies are short anyway, and a 200-token cap shaves real milliseconds off the response.
Real Cost
A representative five-turn call (caller books an appointment):| Line item | Cost per call |
|---|---|
| Vapi infrastructure (STT + TTS + telephony) | ~$0.10 |
Swarms turns × 5 (router + specialist, gpt-4.1-mini) | ~$0.04 |
| End-of-call summary agent | ~$0.01 |
| Total | ~$0.15 |
| Scenario | Annual cost | Coverage | Calls handled |
|---|---|---|---|
| One voice receptionist | ~$35,000 | 40 hr/wk | ~8,000 |
| Voice agent stack at $0.15/call | ~$1,200 | 24/7 | ~8,000 |
| Same stack at high volume (50k calls) | ~$7,500 | 24/7 | ~50,000 |
Next Steps
- Multi-Agent Router Example — the routing pattern that powers the turn handler
- Agent Handoffs Example — pre-defined specialist routing inside a single agent call, for tighter latency budgets
- Structured Outputs — the JSON-output pattern the summary agent uses for CRM writes