What This Covers
- Why the dominant cost driver in a swarm is token volume, not raw per-token price
- A tiered architecture pattern: cheap triage and extraction workers feeding an expensive director
- A `tools_dictionary` recipe that compresses upstream context into structured handoffs
- Before/after numbers on a 5-agent business-research swarm using the live Swarms pricing model
- A checklist to apply on every swarm you ship
Why This Matters
Multi-agent systems blow up budgets in a way single-agent systems don’t: every agent re-reads the running context, every worker emits prose that becomes input tokens for the next worker, and a careless director that asks five reasoning agents to “give me everything you know” can 10× a bill without improving the answer. Swarms charges a uniform $6.50 per 1M input tokens and $18.50 per 1M output tokens on swarm completions, so the lever that actually moves the needle is how much text travels through the pipeline, not which logo is on the model card. This playbook is the architecture pattern we use internally and recommend to teams running production research, RAG, and analyst workflows.
The Pricing Reality
Swarm completions on Swarms are priced uniformly:

| Component | Rate |
|---|---|
| Input tokens | $6.50 / 1M |
| Output tokens | $18.50 / 1M |
| Per-agent fee | $0.01 per agent per run |
| Overnight discount | 50% off token costs, 8 PM – 6 AM Pacific |
Choosing `claude-haiku` over `claude-opus` does not lower the per-token rate: the Swarms platform abstracts the model and bills you a flat blended price. The win comes from two architectural moves:
- Cheap workers emit terser outputs. Configure `max_tokens` aggressively on triage and extraction agents (512–1,024) and reserve large budgets (4,096–8,192) for the synthesis agent that actually needs to reason.
- Structured handoffs compress context. A worker that returns a 200-token JSON summary instead of a 5,000-token essay shrinks every downstream agent’s input bill proportionally.
The per-agent fee is small ($0.01) but stacks: a 10-agent swarm is $0.10 of non-discountable overhead per run, on top of token costs. Prune agents that don’t contribute distinct value.
The Anti-Pattern: All-Frontier, Verbose Workers
Here is the swarm most teams ship first. Five reasoning-grade workers, all with high `max_tokens`, all producing long-form prose that the next agent has to re-read.
What this run actually costs
A realistic single execution of this swarm produces something like:
- Input tokens: ~30,000 (each downstream agent re-reads upstream prose)
- Output tokens: ~12,000 (each worker fills its 6k–8k ceiling)
- Agents: 5
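To put a dollar figure on those counts, here is a minimal sketch that applies the rates from the pricing table above. The helper name `estimate_run_cost` and the shape of its return value are our own illustration, not part of the Swarms API.

```python
# Rates copied from the pricing table in this guide (USD).
INPUT_RATE_PER_M = 6.50    # per 1M input tokens
OUTPUT_RATE_PER_M = 18.50  # per 1M output tokens
PER_AGENT_FEE = 0.01       # per agent per run, not discounted


def estimate_run_cost(input_tokens: int, output_tokens: int, agents: int) -> dict:
    """Return a per-run cost breakdown in USD (hypothetical helper)."""
    input_cost = input_tokens / 1_000_000 * INPUT_RATE_PER_M
    output_cost = output_tokens / 1_000_000 * OUTPUT_RATE_PER_M
    agent_cost = agents * PER_AGENT_FEE
    return {
        "input": round(input_cost, 4),
        "output": round(output_cost, 4),
        "agents": round(agent_cost, 4),
        "total": round(input_cost + output_cost + agent_cost, 4),
    }


# The verbose five-agent run described above.
baseline = estimate_run_cost(input_tokens=30_000, output_tokens=12_000, agents=5)
print(baseline)
```

With these inputs the run comes to about $0.47: roughly $0.195 of input tokens, $0.222 of output tokens, and $0.05 of per-agent fees.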
The Pattern: Tiered Workers + Compressed Handoffs
The fix is two changes:
- Force each worker to emit a structured summary, not a long memo, via a tight `max_tokens` and a system prompt that asks for JSON-shaped output.
- Reserve the high-token budget for the final synthesizer, which is the only agent the user actually reads.
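The two changes can be sketched as a tiered agent list. Treat this as an illustration under assumptions: the agent names, the prompt wording, and the dictionary keys (`agent_name`, `system_prompt`, `max_tokens`) are placeholders we chose for the example, so check them against the swarm completions reference before sending anything.

```python
WORKER_MAX_TOKENS = 600     # tight cap for triage/extraction workers
DIRECTOR_MAX_TOKENS = 4096  # large budget only for the synthesizer

JSON_ONLY = (
    "Reply with a single JSON object and nothing else. "
    "Keep every string value under 30 words."
)


def make_worker(name: str, focus: str) -> dict:
    """Build one worker spec (key names are illustrative assumptions)."""
    return {
        "agent_name": name,
        "system_prompt": f"You extract {focus} findings. {JSON_ONLY}",
        "max_tokens": WORKER_MAX_TOKENS,
    }


# Hypothetical worker roster for a business-research swarm.
workers = [
    make_worker("market-sizer", "market size"),
    make_worker("competitor-scout", "competitor"),
    make_worker("pricing-analyst", "pricing"),
    make_worker("risk-reviewer", "risk"),
]

director = {
    "agent_name": "director",
    "system_prompt": "Synthesize the workers' JSON briefings into one report.",
    "max_tokens": DIRECTOR_MAX_TOKENS,
}

agents = workers + [director]

# Worst-case output ceiling for the whole run, in tokens.
output_ceiling = sum(agent["max_tokens"] for agent in agents)
print(len(agents), output_ceiling)
```

Keeping the caps in two named constants makes the tiering explicit: one number governs every cheap worker, and only the director gets the large budget.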
What this run actually costs
- Input tokens: ~8,000 (workers see only the task; director sees four ~150-token JSON briefings instead of four essays)
- Output tokens: ~4,500 (workers cap at ~600 each, director gets the 4k it needs)
- Agents: 5
Run the same swarm overnight (8 PM – 6 AM Pacific) and the token costs drop another 50%: the total falls to roughly $0.118/run, or about $59/day at 500 runs per day. See the Night-Mode Pricing Strategy guide.
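The overnight arithmetic can be checked with a short sketch. The function names are ours, and two details are assumptions: the window is treated as half-open (8 PM inclusive, 6 AM exclusive), and the caller is expected to pass a time already converted to Pacific. The discount touches token costs only; the per-agent fee is added back at full price.

```python
from datetime import time

NIGHT_START = time(20, 0)  # 8 PM Pacific
NIGHT_END = time(6, 0)     # 6 AM Pacific
NIGHT_DISCOUNT = 0.5       # 50% off token costs only


def in_night_window(pacific_time: time) -> bool:
    """True if a Pacific wall-clock time falls in the discount window.

    Boundary handling (20:00 inclusive, 06:00 exclusive) is an assumption.
    """
    return pacific_time >= NIGHT_START or pacific_time < NIGHT_END


def run_total(token_cost: float, agent_fees: float, pacific_time: time) -> float:
    """Apply the night discount to tokens, then add the undiscounted fees."""
    factor = NIGHT_DISCOUNT if in_night_window(pacific_time) else 1.0
    return token_cost * factor + agent_fees


# Tiered run from this section: 8,000 input and 4,500 output tokens, 5 agents.
token_cost = 8_000 / 1_000_000 * 6.50 + 4_500 / 1_000_000 * 18.50
agent_fees = 5 * 0.01

night_total = round(run_total(token_cost, agent_fees, time(23, 30)), 3)
day_total = round(run_total(token_cost, agent_fees, time(14, 0)), 3)
print(night_total, day_total)  # 0.118 0.185
```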
Compressing Even Harder: The tools_dictionary Pattern
The biggest savings come from forcing workers to emit structured, machine-shaped output. A tools_dictionary is just an explicit schema you bake into the system prompt — the worker’s prose is replaced by a contract.
Use `triage_prompt` as the `system_prompt` for every worker. The downstream director’s input shrinks from ~4,000 tokens of essays to ~600 tokens of strict JSON, a 6×+ input-token reduction at the synthesis step alone.
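As a rough sketch of what such a contract could look like, the example below bakes a small schema into a worker prompt and rejects replies that drift from it. The schema keys, the prompt wording, and the `parse_handoff` helper are all hypothetical; only the idea of a schema embedded in the system prompt comes from this guide.

```python
import json

# Hypothetical handoff contract: key names and descriptions are illustrative.
HANDOFF_SCHEMA = {
    "headline": "one-sentence finding",
    "evidence": "list of up to 3 short supporting facts",
    "confidence": "number between 0 and 1",
}

triage_prompt = (
    "You are an extraction worker. Respond with ONLY a JSON object "
    "that has exactly these keys:\n"
    + json.dumps(HANDOFF_SCHEMA, indent=2)
    + "\nNo prose before or after the JSON."
)


def parse_handoff(raw: str) -> dict:
    """Parse a worker reply and reject anything that breaks the contract."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("handoff must be a JSON object")
    missing = HANDOFF_SCHEMA.keys() - data.keys()
    extra = data.keys() - HANDOFF_SCHEMA.keys()
    if missing or extra:
        raise ValueError(
            f"handoff contract violated: missing={sorted(missing)}, extra={sorted(extra)}"
        )
    return data


# A made-up worker reply that satisfies the contract.
sample = '{"headline": "Market is growing", "evidence": ["a", "b"], "confidence": 0.7}'
briefing = parse_handoff(sample)
print(briefing["headline"])
```

Validating at the handoff boundary means a worker that slips back into prose fails loudly instead of quietly inflating the director's input.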
The Checklist
Before you ship a swarm to production, walk this list:
- Cap worker `max_tokens` at the smallest budget that still answers the question. Most extraction/triage agents work fine at 512–1,024.
- Reserve the large `max_tokens` budget for the synthesizer only. That’s the one the user reads.
- Force structured output from workers. JSON schema or `tools_dictionary`. Prose-to-prose handoffs are where bills go to die.
- Prune redundant agents. Two analysts saying overlapping things is two `$0.01` fees plus their output tokens for no signal gain.
- Run batch and back-office workloads overnight. 50% off tokens for jobs that don’t need a sub-second response; see the night-mode guide.
- Measure, don’t guess. Read `response["usage"]["billing_info"]["cost_breakdown"]` after every run. The token counts tell you exactly which agent is bloated.
Reading Your Usage Block
Every swarm completion response includes a `usage` block you can grep for.
If `input_token_cost` is larger than `output_token_cost`, you have a context-compression problem: a worker is dumping too much prose into the next agent. Fix the schema, not the model.
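That rule of thumb is easy to automate. The sketch below assumes the response is already parsed into a dict and that `cost_breakdown` carries `input_token_cost` and `output_token_cost` as numbers; the helper name and the sample values are invented for illustration, not real billing output.

```python
def diagnose_usage(response: dict) -> str:
    """Compare input and output token cost from a swarm completion response."""
    breakdown = response["usage"]["billing_info"]["cost_breakdown"]
    input_cost = breakdown["input_token_cost"]
    output_cost = breakdown["output_token_cost"]
    if input_cost > output_cost:
        return (
            "context-compression problem: input cost "
            f"${input_cost:.4f} exceeds output cost ${output_cost:.4f}"
        )
    return f"ok: input ${input_cost:.4f}, output ${output_cost:.4f}"


# Made-up responses, trimmed to the fields this check reads.
bloated = {
    "usage": {
        "billing_info": {
            "cost_breakdown": {"input_token_cost": 0.26, "output_token_cost": 0.11}
        }
    }
}
lean = {
    "usage": {
        "billing_info": {
            "cost_breakdown": {"input_token_cost": 0.052, "output_token_cost": 0.083}
        }
    }
}

print(diagnose_usage(bloated))
print(diagnose_usage(lean))
```

Running a check like this after each execution turns the last checklist item into a habit rather than a one-off audit.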
Next Steps
- Estimate Costs from Pricing Details: pre-flight calculator that uses the live `/v1/usage/costs` rates
- Pricing Details (Basic): the unified pricing model reference
- Night-Mode Pricing Strategy: schedule overnight batches for 50% off tokens