What This Guide Covers
- Why one-size-fits-all model selection is the wrong default for any production swarm
- Four concrete cost patterns you can drop into existing pipelines without re-architecting
- The 50% night-mode discount window on batch endpoints and how to claim it
- A reproducible cost table comparing naive (all-flagship) to a tiered + batch + night-mode setup
- The two configuration levers (max_tokens and max_loops) that quietly drive most of your spend
The goal of this guide is not to make your agents cheaper at the expense of quality. It is to put your most expensive model only where it changes the answer — and to use cheaper models, batch endpoints, and the night-mode window everywhere else.
Why This Matters
Most production Swarms bills look the same when you trace them: one or two agents do work that genuinely requires a flagship model (synthesis, hard reasoning, final write-up) and three or four agents do work a cheaper model would handle identically (classification, extraction, formatting, routing). Running every agent on the flagship is the default, and the default is wrong. The job to be done is not “use the best model.” It is “produce a defensible artifact at the lowest cost-per-unit-of-quality.” The patterns below are the levers that move that ratio, in priority order.
The Cost-Capability Trade-Off
Anthropic and OpenAI both publish three rough tiers, and the ratios are roughly the same across providers:
| Tier | Anthropic | OpenAI | Use For |
|---|---|---|---|
| Cheap | Haiku family | gpt-4.1-mini | Triage, classification, extraction, routing, formatting |
| Mid | Sonnet family | gpt-4.1 | Most analysis, most worker agents, drafts |
| Flagship | anthropic/claude-opus-4-8 | (top OpenAI reasoning tier) | Synthesis, final judgments, multi-step reasoning, the agent that signs the memo |
Pattern 1: Tiered Models in a Single Swarm
In a HierarchicalSwarm, the director synthesizes, so that is the agent that benefits from a flagship model. The workers each own a narrow lane and rarely need the same horsepower. Mix tiers in one swarm config.
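As a rough illustration, a tiered config might look like the sketch below. It only builds and prints a JSON payload; nothing is sent. The field names (agent_name, system_prompt, model_name, max_loops, max_tokens, swarm_type), the swarm name, and the worker lanes are assumptions for illustration and should be checked against the Swarms API reference. The model identifiers come from the tier table above.

```python
import json

# Model identifiers taken from the tier table; swap in whatever your account exposes.
FLAGSHIP = "anthropic/claude-opus-4-8"
MID = "gpt-4.1"
CHEAP = "gpt-4.1-mini"


def make_agent(name, prompt, model, max_tokens, temperature=None):
    """Build one agent entry. Field names are assumed, not confirmed."""
    agent = {
        "agent_name": name,
        "system_prompt": prompt,
        "model_name": model,
        "max_loops": 1,
        "max_tokens": max_tokens,
    }
    # Only include temperature when the caller sets it explicitly.
    if temperature is not None:
        agent["temperature"] = temperature
    return agent


swarm_config = {
    "name": "tiered-research-swarm",  # hypothetical name
    "swarm_type": "HierarchicalSwarm",
    "task": "Assess the attached filing and produce a one-page memo.",
    "agents": [
        # Director synthesizes: flagship tier, temperature left unset.
        make_agent("director", "Synthesize worker findings into a memo.", FLAGSHIP, 4096),
        # Workers own narrow lanes: mid or cheap tier.
        make_agent("financials", "Analyze the financial statements.", MID, 2048, temperature=0.2),
        make_agent("extractor", "Extract named entities and dates.", CHEAP, 1024, temperature=0.0),
        make_agent("formatter", "Format findings as bullet points.", CHEAP, 512, temperature=0.0),
    ],
}

print(json.dumps(swarm_config, indent=2))
```

Only one of the four agents carries the flagship price; the other three run on tiers that cost a fraction as much per token.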
When the flagship in the config is anthropic/claude-opus-4-8, do not set temperature. See Claude Opus 4.8 for the full rationale: Anthropic’s API will reject the request if temperature is supplied.
Pattern 2: Two-Pass Filtering
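In outline, two-pass filtering means a cheap classifier labels every item and the expensive analyst runs only on the items it escalates. The sketch below shows that control flow with offline stand-ins. The classify and analyze callables, the "escalate" label, and the sample tickets are all hypothetical placeholders for your own cheap-tier and flagship agent calls, not part of any Swarms API.

```python
from typing import Callable, Iterable, List, Optional, Tuple


def two_pass(
    items: Iterable[str],
    classify: Callable[[str], str],
    analyze: Callable[[str], str],
    escalate_label: str = "escalate",
) -> List[Tuple[str, Optional[str]]]:
    """Run the cheap classifier on every item; run the analyst only on escalations."""
    results = []
    for item in items:
        label = classify(item).strip().lower()
        if label == escalate_label:
            results.append((item, analyze(item)))
        else:
            results.append((item, None))  # filtered out, no expensive call
    return results


# Stand-ins for real agent calls so the sketch runs offline.
expensive_calls = []


def cheap_classifier(ticket: str) -> str:
    return "escalate" if "refund" in ticket.lower() else "routine"


def expensive_analyst(ticket: str) -> str:
    expensive_calls.append(ticket)
    return f"analysis of: {ticket}"


tickets = ["Password reset", "Refund for double charge", "Update email", "Where is my invoice?"]
results = two_pass(tickets, cheap_classifier, expensive_analyst)
print(f"{len(expensive_calls)} of {len(tickets)} items reached the expensive agent")
```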
Most production workloads are heavily skewed: 70-90% of incoming items don’t need the expensive analyst. A cheap classifier agent decides whether the expensive one runs at all. This is the highest-ROI pattern in this guide for any high-volume queue (support tickets, claim triage, document review, lead scoring).
Pattern 3: Batch Endpoints + Night Mode
The Swarms platform applies a 50% night-time discount on input and output token costs for traffic processed between 8 PM and 6 AM Pacific (America/Los_Angeles). The discount is implemented in calculate_swarm_cost — see api/utils.py — and applies to billed token costs (the per-agent fixed component is unaffected). The platform decides the discount based on the server clock when the work is processed, so the way you capture it is to send the work during that window, typically via the batch endpoints.
Two endpoints matter here:
- /v1/agent/batch/completions: array of single-agent jobs in one request
- /v1/swarm/batch/completions: array of multi-agent swarm jobs in one request
A cron entry on a Pacific-time host is enough; for cloud schedulers, anchor on America/Los_Angeles and fire any time between 8 PM and 6 AM.
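One way to guard a scheduled job is to check the Pacific wall clock before submitting. The sketch below is a minimal version of that guard plus a generic POST helper. The x-api-key header, the job schema, the script name in the cron comment, and the base_url you pass in are assumptions to confirm against the API reference; only the two endpoint paths and the 8 PM to 6 AM window come from this guide.

```python
import json
import urllib.request
from datetime import datetime
from typing import Optional
from zoneinfo import ZoneInfo

PACIFIC = ZoneInfo("America/Los_Angeles")


def in_night_window(now: Optional[datetime] = None) -> bool:
    """True when the Pacific wall-clock time is in [8 PM, 6 AM)."""
    if now is None:
        now = datetime.now(PACIFIC)
    local = now.astimezone(PACIFIC)  # pass timezone-aware datetimes
    return local.hour >= 20 or local.hour < 6


def submit_batch(base_url: str, api_key: str, path: str, jobs: list) -> bytes:
    """POST a JSON array of jobs to a batch endpoint.

    The x-api-key header and the job schema are assumptions; confirm both
    against the API reference before relying on this.
    """
    request = urllib.request.Request(
        base_url.rstrip("/") + path,
        data=json.dumps(jobs).encode("utf-8"),
        headers={"Content-Type": "application/json", "x-api-key": api_key},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=60) as response:
        return response.read()


# Example cron line on a Pacific-time host (fires at 9 PM daily):
#   0 21 * * * /usr/bin/python3 night_batch.py
if __name__ == "__main__":
    if in_night_window():
        print("Inside the night window: safe to call submit_batch(...)")
    else:
        print("Outside the night window: defer the batch")
```

Because the discount depends on when the work is processed rather than when it is submitted, firing early in the window leaves headroom for long batches to finish before 6 AM.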
Night-mode is a 50% discount on token costs, not on the per-agent base charge. For token-heavy swarms (long inputs, long outputs) it cuts the bill roughly in half. For very short calls dominated by the per-agent fixed cost, the effective savings is smaller. Larger jobs benefit more.
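The effect of the fixed component can be made concrete with a few lines of arithmetic. This is an illustrative model only, not the platform's calculate_swarm_cost; the per-agent charge of $0.01 and the two token costs are hypothetical values chosen to show the shape of the curve.

```python
def effective_discount(token_cost: float, fixed_cost: float, night_discount: float = 0.5) -> float:
    """Fraction of the daytime bill saved when only token costs are discounted."""
    day_total = token_cost + fixed_cost
    if day_total <= 0:
        raise ValueError("total cost must be positive")
    night_total = token_cost * (1 - night_discount) + fixed_cost
    return 1 - night_total / day_total


FIXED = 0.01  # hypothetical per-agent charge in dollars

for label, tokens in [("token-heavy job", 1.00), ("very short call", 0.002)]:
    saved = effective_discount(tokens, FIXED)
    print(f"{label}: {saved:.1%} of the bill saved at night")
```

The token-heavy job lands close to the full 50%, while the short call saves far less because the undiscounted fixed charge dominates its bill.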
Pattern 4: Cap Tokens and Loops
max_tokens and max_loops are the most direct, least glamorous, most effective levers in your config. Most production swarms ship with both set carelessly high “just in case.” That’s where the silent spend hides.
Conservative defaults that work in production:
| Agent role | max_tokens | max_loops | Notes |
|---|---|---|---|
| Classifier / triage | 16 - 64 | 1 | One-word or short-label outputs |
| Extraction (fields from a doc) | 512 - 1024 | 1 | Structured output, bounded |
| Worker doing one analysis lane | 2048 - 4096 | 1 | Most swarm workers live here |
| Synthesizer / director | 4096 - 8192 | 1 - 2 | Only raise loops if the task is genuinely iterative |
| Long-form research memo | 8192 | 1 | Higher is rarely the right answer; chain agents instead |
- Default max_loops to 1. Raise it only when you have evidence a single pass underperforms. Each additional loop multiplies cost roughly linearly and helps less than chaining a fresh agent.
- Set max_tokens close to what the agent should actually produce. A classifier with max_tokens=4096 is paying for headroom it will never use, plus the long-tail risk of the model going long. Bound it.
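To see why the caps matter, it helps to compute the ceiling on output spend per call, which is max_tokens times max_loops times the output price. The sketch below uses the upper bounds from the table above and the illustrative cheap-tier output price of $1.20 per million tokens from the Real-World Numbers section. The careless setting of 4096 tokens and 3 loops is a hypothetical example, and the calculation ignores input tokens, which are also re-sent on every loop.

```python
# Upper bounds from the table above: role -> (max_tokens, max_loops).
ROLE_CAPS = {
    "classifier": (64, 1),
    "extraction": (1024, 1),
    "worker": (4096, 1),
    "director": (8192, 2),
    "research_memo": (8192, 1),
}


def worst_case_output_cost(max_tokens, max_loops, price_per_million_output):
    """Ceiling on output spend for one call: every loop fills max_tokens."""
    return max_tokens * max_loops * price_per_million_output / 1_000_000


CHEAP_OUTPUT_PRICE = 1.20  # dollars per million output tokens (illustrative)

# A classifier shipped with "just in case" settings versus the bounded defaults.
careless = worst_case_output_cost(4096, 3, CHEAP_OUTPUT_PRICE)
bounded = worst_case_output_cost(*ROLE_CAPS["classifier"], CHEAP_OUTPUT_PRICE)

print(f"careless classifier ceiling: ${careless:.6f} per call")
print(f"bounded classifier ceiling:  ${bounded:.6f} per call")
print(f"ratio: {careless / bounded:.0f}x")
```

The ceiling is a worst case, not the typical bill, but it is the number that bounds the long-tail risk of a model going long.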
Real-World Numbers
Take a realistic production workload: an investment-research team running 500 single-agent summaries plus 50 multi-agent deep-dive swarms per day. The naive setup runs everything on a flagship model, in the middle of the business day, with generous max_tokens and max_loops. The optimized setup applies all four patterns above.
Assume rough per-million-token costs of flagship ~$15 input / $75 output, mid-tier ~$3 input / $15 output, cheap ~$0.30 input / $1.20 output. (Use these for relative scale; check your provider’s published rates for current values.)
| Workload | Naive (all flagship, peak hours) | Optimized (tiered + batch + night) |
|---|---|---|
| 500 single-agent summaries × ~600 input / ~300 output tokens | ~$15.75/day | Cheap-tier + night mode: ~$0.36/day |
| 50 deep-dive swarms × 4 agents × ~2k input / ~1.5k output | ~$28.50/day | Director-only flagship, workers cheap, batched at night: ~$5.10/day |
| Total daily | ~$44.25 | ~$5.46 |
| Effective multiplier | — | ~8x cheaper |
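The optimized column stacks four changes: (1) the summaries moved to the cheap tier, (2) only the director in each deep-dive swarm stayed on the flagship, (3) max_tokens was set close to the actual output length, and (4) the whole pipeline ran during the night-mode window. Any one of them in isolation saves money. Stacked, they consistently produce a 5-10x reduction on real workloads.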
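The arithmetic behind the table can be sketched from the illustrative per-million prices and token counts stated above. The naive figures reproduce the table's naive column. The optimized figures cover token costs only and exclude the per-agent fixed charge, so they will not match the optimized column exactly. The split of one flagship director plus three cheap workers per swarm is an assumption read from the table's description.

```python
# Illustrative per-million-token prices from the text: (input, output) in dollars.
PRICES = {"flagship": (15.00, 75.00), "mid": (3.00, 15.00), "cheap": (0.30, 1.20)}
NIGHT_DISCOUNT = 0.5


def token_cost(calls, input_tokens, output_tokens, tier, night=False):
    """Daily token cost in dollars for `calls` identical calls on one tier."""
    price_in, price_out = PRICES[tier]
    cost = calls * (input_tokens * price_in + output_tokens * price_out) / 1_000_000
    return cost * (1 - NIGHT_DISCOUNT) if night else cost


# Naive: everything on the flagship during the day.
naive_summaries = token_cost(500, 600, 300, "flagship")
naive_swarms = token_cost(50 * 4, 2000, 1500, "flagship")

# Optimized, token costs only: cheap summaries; one flagship director plus
# three cheap workers per swarm (assumed split); all in the night window.
opt_summaries = token_cost(500, 600, 300, "cheap", night=True)
opt_swarms = (
    token_cost(50 * 1, 2000, 1500, "flagship", night=True)
    + token_cost(50 * 3, 2000, 1500, "cheap", night=True)
)

print(f"naive summaries: ${naive_summaries:.2f}/day")
print(f"naive swarms:    ${naive_swarms:.2f}/day")
print(f"naive total:     ${naive_summaries + naive_swarms:.2f}/day")
print(f"optimized token-only total: ${opt_summaries + opt_swarms:.2f}/day")
```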
A Checklist Before You Ship
Run this list against any swarm config heading to production:
- Does every agent need the model it’s currently using? Demote any worker whose job is classification, extraction, or formatting.
- Is max_tokens bounded close to the expected output length on every agent?
- Is max_loops set to 1 unless you have measured that more loops change the answer?
- Could a cheap classifier filter the queue before the expensive agent runs (Pattern 2)?
- Can the workload run overnight on /v1/agent/batch/completions or /v1/swarm/batch/completions for the 50% night discount (Pattern 3)?
- Have you confirmed the flagship agent is reserved for the role that genuinely benefits: synthesis, final judgment, the agent whose output is the artifact?
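Parts of this checklist are mechanical enough to automate. The sketch below is a hypothetical pre-ship audit that flags missing or oversized max_tokens, max_loops above 1, and more than one flagship agent. The config keys (agents, agent_name, model_name, max_tokens, max_loops) and the 8192 ceiling are assumptions for illustration, and the questions about filtering and overnight scheduling still need a human answer.

```python
FLAGSHIP_MODELS = {"anthropic/claude-opus-4-8"}  # extend with your own flagship IDs


def audit_swarm(config, max_tokens_ceiling=8192):
    """Return a list of warnings for a swarm config dict (assumed schema)."""
    warnings = []
    agents = config.get("agents", [])
    for agent in agents:
        name = agent.get("agent_name", "<unnamed>")
        max_tokens = agent.get("max_tokens")
        if max_tokens is None:
            warnings.append(f"{name}: max_tokens is not set")
        elif max_tokens > max_tokens_ceiling:
            warnings.append(f"{name}: max_tokens={max_tokens} exceeds {max_tokens_ceiling}")
        loops = agent.get("max_loops", 1)
        if loops > 1:
            warnings.append(f"{name}: max_loops={loops}; confirm extra loops change the answer")
    flagship = [a for a in agents if a.get("model_name") in FLAGSHIP_MODELS]
    if len(flagship) > 1:
        warnings.append(f"{len(flagship)} agents use a flagship model; consider demoting workers")
    return warnings


example = {
    "agents": [
        {"agent_name": "director", "model_name": "anthropic/claude-opus-4-8",
         "max_tokens": 4096, "max_loops": 1},
        {"agent_name": "classifier", "model_name": "anthropic/claude-opus-4-8",
         "max_loops": 3},
    ]
}

issues = audit_swarm(example)
for issue in issues:
    print(issue)
```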
Next Steps
- Scale single-agent workloads with Batch Agent Completions
- Run many swarms in one request with Batch Swarm Completions
- See the tiered hierarchical pattern end-to-end in the Hierarchical Workflow Example