TL;DR
We picked one realistic, multi-step task — “Produce an investment memo on NVDA covering fundamentals, technicals, macro, and a final BUY/SELL/HOLD call with rationale” — and built it four times: once on the Swarms API (HierarchicalSwarm), once on LangGraph (StateGraph), once on CrewAI (Crew + Process.sequential), once on AutoGen (GroupChat + GroupChatManager). Same model (gpt-4.1), same role prompts, three runs each, averaged.
| Framework | LOC | Tokens (avg) | Wall-clock | Cost / run | Quality (1–10) |
|---|---|---|---|---|---|
| Swarms | ~40 | ~18,000 | ~22s | ~$0.09 | 8.5 |
| LangGraph | ~120 | ~22,000 | ~31s | ~$0.11 | 8.5 |
| CrewAI | ~80 | ~26,000 | ~38s | ~$0.13 | 8.0 |
| AutoGen | ~90 | ~32,000 | ~45s | ~$0.16 | 8.0 |
The Task
The benchmark target is fixed:
Produce a one-page investment memo on NVDA that includes:
- A fundamentals section: revenue trajectory, margin profile, FCF, balance sheet, the single most material catalyst over the next two quarters.
- A technicals section: trend regime, key support / resistance levels, momentum (RSI, MACD), volume profile, a near-term price target with stop.
- A macro section: sector positioning vs. the rate regime, FX / commodity sensitivities, policy tailwinds and headwinds, behaviour in a risk-off rotation.
- A final call: BUY / SELL / HOLD, conviction (LOW / MEDIUM / HIGH), one-sentence key signal, one-sentence primary risk. Be decisive.
The rubric (out of 10)
Each output was scored blind on five dimensions, 2 points each:
- Fundamentals depth: specific numbers, named catalysts, no hand-waving
- Technical specificity: actual levels and indicators, not “the chart looks constructive”
- Macro framing: concrete linkage to rates / FX / policy, not boilerplate
- Decisive call: picks a side with a clean rationale; does not hedge across all three lenses
- Citation discipline: references the analyst sections it’s drawing from rather than re-deriving them
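The rubric arithmetic can be sketched in a few lines. This is an illustration only, not the benchmark's actual scoring script; the dimension keys and function names are hypothetical.

```python
from statistics import mean

# The five rubric dimensions; each is scored from 0 to 2 by a reviewer.
DIMENSIONS = (
    "fundamentals_depth",
    "technical_specificity",
    "macro_framing",
    "decisive_call",
    "citation_discipline",
)
MAX_PER_DIMENSION = 2.0


def memo_total(scores: dict[str, float]) -> float:
    """Sum one reviewer's scores for one memo, giving a total out of 10."""
    missing = set(DIMENSIONS) - set(scores)
    if missing:
        raise ValueError(f"missing dimensions: {sorted(missing)}")
    for name in DIMENSIONS:
        if not 0.0 <= scores[name] <= MAX_PER_DIMENSION:
            raise ValueError(f"{name} must be between 0 and {MAX_PER_DIMENSION}")
    return sum(scores[name] for name in DIMENSIONS)


def average_total(reviews: list[dict[str, float]]) -> float:
    """Average the per-reviewer totals, rounded to one decimal place."""
    return round(mean(memo_total(review) for review in reviews), 1)
```

Validating the 0 to 2 range per dimension up front keeps a mis-keyed score from silently inflating a total.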
Methodology
- Model: gpt-4.1 in every framework, same temperature settings (analyst roles at 0.4, synthesizer at 0.2).
- Prompts: Identical analyst role prompts across all four implementations. The only thing that varied was the orchestration code.
- Runs: Three measured runs per framework, averaged. Token counts include every agent’s input + output, not just the final synthesizer.
- Region: US-East. Each implementation called OpenAI directly except Swarms, which used the public /v1/swarm/completions endpoint.
- Timing: time.perf_counter() around the top-level orchestration call. Cold start excluded: the first run primed connections and was discarded, and we averaged runs 2 / 3 / 4.
- Cost basis: OpenAI list pricing for gpt-4.1 for LangGraph / CrewAI / AutoGen ($2.50 / 1M input, $10 / 1M output). Swarms used the published swarm completions rate ($6.50 / 1M input, $18.50 / 1M output) plus $0.01 per agent. The apples-to-apples comparison still favours Swarms because token volume is meaningfully lower (no Python-process re-serialization of state on every hop).
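The timing rule above (one discarded priming run, then the mean of three measured runs) might look like the sketch below. `time_runs` and `run_once` are hypothetical names; `run_once` stands in for whichever top-level orchestration call is being measured.

```python
import time
from statistics import mean
from typing import Callable


def time_runs(run_once: Callable[[], object], measured_runs: int = 3) -> float:
    """Return mean wall-clock seconds over `measured_runs`, after one untimed priming run."""
    run_once()  # priming run: warms connections and is not timed
    durations = []
    for _ in range(measured_runs):
        start = time.perf_counter()
        run_once()
        durations.append(time.perf_counter() - start)
    return mean(durations)
```

Wrapping only the top-level call means import time and client construction are excluded, which matches the cold-start exclusion described in the methodology.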
Swarms — Implementation
The Swarms version is one HTTP call to /v1/swarm/completions with a HierarchicalSwarm of four agents. No state machine, no graph compilation, no Python process holding the run.
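A minimal sketch of what such a call could look like, using only the standard library. Only the endpoint path, the HierarchicalSwarm type, the model, and the temperatures come from this article. The base URL, the `x-api-key` header, the payload field names, and the placeholder role prompts are assumptions to check against the Swarms API reference; this is not the benchmark's actual file.

```python
import json
import urllib.request

# Assumption: base URL and header name are illustrative, not confirmed by the article.
API_URL = "https://api.swarms.world/v1/swarm/completions"


def build_payload(task: str, model: str = "gpt-4.1") -> dict:
    """Describe the four-agent team; field names are assumed, prompts are placeholders."""

    def agent(name: str, prompt: str, temperature: float) -> dict:
        return {
            "agent_name": name,
            "system_prompt": prompt,
            "model_name": model,
            "temperature": temperature,
            "max_loops": 1,
        }

    return {
        "name": "nvda-investment-memo",
        "swarm_type": "HierarchicalSwarm",
        "task": task,
        "agents": [
            agent("Fundamentals Analyst", "You cover company fundamentals.", 0.4),
            agent("Technical Analyst", "You cover price action and indicators.", 0.4),
            agent("Macro Analyst", "You cover rates, FX, and policy.", 0.4),
            agent("Portfolio Manager", "You synthesize the briefs into a final call.", 0.2),
        ],
    }


def run_swarm(payload: dict, api_key: str) -> dict:
    """POST the payload and return the decoded JSON response."""
    request = urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"x-api-key": api_key, "Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=120) as response:
        return json.load(response)


# Usage (requires a real key and network access):
# result = run_swarm(build_payload("Produce an investment memo on NVDA ..."), api_key="...")
```

Keeping payload construction separate from the network call makes the team definition easy to inspect and test without spending tokens.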
LangGraph — Implementation
LangGraph wants you to model the workflow as a StateGraph with a TypedDict for state and explicit node functions that read and write it. It’s faithful to a graph-machine mental model and works fine; it’s just more code.
The overhead is in the TypedDict for state, the fan-out via a no-op start node, the explicit add_edge calls in both directions, and the fact that you marshal the three analyst briefs into the PM’s prompt by hand. None of this is wrong; it’s just code you have to write, debug, and own.
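To make the "typed state plus hand-written marshalling" point concrete, here is a library-free sketch of the pattern. It deliberately does not use LangGraph's API. Every name in it (`MemoState`, `analyst_node`, `build_pm_prompt`, `run_memo_graph`) is hypothetical, and the analyst nodes run one after another here, whereas a graph runtime would schedule the fan-out.

```python
from typing import Callable, Dict, TypedDict

Agent = Callable[[str], str]  # stand-in for a model call: prompt in, text out


class MemoState(TypedDict, total=False):
    task: str
    fundamentals: str
    technicals: str
    macro: str
    memo: str


def analyst_node(key: str, agent: Agent) -> Callable[[MemoState], dict]:
    """Build a node that reads the task and writes one brief under `key`."""
    def node(state: MemoState) -> dict:
        return {key: agent(state["task"])}
    return node


def build_pm_prompt(state: MemoState) -> str:
    """Marshal the three analyst briefs into the PM's prompt by hand."""
    return (
        f"Task: {state['task']}\n\n"
        f"[Fundamentals]\n{state['fundamentals']}\n\n"
        f"[Technicals]\n{state['technicals']}\n\n"
        f"[Macro]\n{state['macro']}\n\n"
        "Write the final memo with a BUY / SELL / HOLD call."
    )


def pm_node(agent: Agent) -> Callable[[MemoState], dict]:
    """Build the synthesizer node, which reads all briefs and writes the memo."""
    def node(state: MemoState) -> dict:
        return {"memo": agent(build_pm_prompt(state))}
    return node


def run_memo_graph(task: str, analysts: Dict[str, Agent], pm: Agent) -> MemoState:
    """Apply each node's partial update to the shared state, PM last."""
    state: MemoState = {"task": task}
    nodes = [analyst_node(key, agent) for key, agent in analysts.items()]
    nodes.append(pm_node(pm))
    for node in nodes:
        state.update(node(state))
    return state
```

Each node returns only the keys it owns, so the merge step is the single place where state changes. That is the property that makes this style easy to inspect, and also the code you end up maintaining.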
CrewAI — Implementation
CrewAI’s mental model is Agent + Task + Crew. Sequential is the easiest path; the code is cleaner than LangGraph’s because there’s no state class, but you still wire each task’s context array to its upstream tasks manually.
Agent / Task / Crew are the right nouns. The friction we hit in practice: the three analyst tasks run sequentially by default under Process.sequential (not in parallel), so wall-clock is the sum of the four agents, not the max of the fan-out plus the PM. Switching to Process.hierarchical introduces an implicit manager LLM that adds tokens and reshuffles the contract. We benchmarked the sequential form because it’s the one teams ship first.
AutoGen — Implementation
AutoGen’s GroupChat is the most general of the four and also the heaviest. You hand a chat manager a roster of agents and a system message, and it picks the next speaker each turn. That flexibility costs tokens: the manager re-reads the full chat history on every turn.
Two things stood out. First, RoundRobinGroupChat is the right shape for “each analyst speaks once then the PM closes”, but you still have to teach termination via a sentinel string in the PM’s prompt. Second, the chat history is the transport: every agent re-reads every previous turn, which is exactly why the token bill is the largest of the four. Switching to SelectorGroupChat with a custom selector function brings the token count down but adds another ~30 lines of selector code.
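The re-read cost can be illustrated with a toy token model. This is a back-of-the-envelope sketch with hypothetical function names and made-up token sizes. It ignores system prompts and any manager overhead, so it is not a reconstruction of the measured numbers.

```python
def full_history_input_tokens(task_tokens: int, turn_tokens: list[int]) -> int:
    """Input tokens when each speaker re-reads the task plus every earlier turn."""
    total = 0
    history = task_tokens
    for produced in turn_tokens:
        total += history     # this speaker reads everything said so far
        history += produced  # its output then joins the transcript
    return total


def compact_briefing_input_tokens(task_tokens: int, analyst_tokens: list[int]) -> int:
    """Input tokens when analysts read only the task and the PM reads task plus briefs."""
    analysts = task_tokens * len(analyst_tokens)
    pm = task_tokens + sum(analyst_tokens)
    return analysts + pm


# Hypothetical sizes: a 500-token task and 2,000 tokens per speaker.
chat = full_history_input_tokens(500, [2000, 2000, 2000, 2000])
brief = compact_briefing_input_tokens(500, [2000, 2000, 2000])
print(chat, brief)
```

In this toy model the PM reads the same amount either way; the extra input comes from the second and third analysts re-reading their predecessors' turns.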
Results — Lines of Code
| Framework | LOC (excl. imports, blank lines, prompts) | Notes |
|---|---|---|
| Swarms | ~40 | One HTTP call, no state class |
| CrewAI | ~80 | Agents, Tasks, Crew — clean but four objects per role |
| AutoGen | ~90 | GroupChat + termination + sentinel discipline |
| LangGraph | ~120 | TypedDict, fan-out node, six add_edge calls, .compile() |
Results — Cost and Latency
Three runs each, averaged. Cold start excluded.

| Framework | Input tokens | Output tokens | Wall-clock | Cost / run |
|---|---|---|---|---|
| Swarms | ~12,500 | ~5,500 | ~22s | ~$0.09 |
| LangGraph | ~15,200 | ~6,800 | ~31s | ~$0.11 |
| CrewAI | ~18,400 | ~7,600 | ~38s | ~$0.13 |
| AutoGen | ~23,000 | ~9,000 | ~45s | ~$0.16 |
The latency ordering follows from where the work runs. The three analysts in the HierarchicalSwarm run in parallel server-side, and the synthesizer reads their outputs directly without a Python-process round-trip per hop. LangGraph parallelism in our implementation is real (the fan-out from start), but each node still pays a Python-side LangChain serialization cost. CrewAI’s Process.sequential is genuinely sequential. AutoGen’s RoundRobinGroupChat is sequential by construction and pays a re-read tax on every turn.
Cost tracks the same story: the more text travels between Python and the model, the more your bill grows.
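For the three frameworks that called OpenAI directly, the per-run cost is plain arithmetic on the token table and the rates quoted in the methodology. The sketch below uses a hypothetical helper name and covers only those three; the Swarms rate has a different structure (per-token rate plus a per-agent fee) and is not modelled here.

```python
def run_cost(
    input_tokens: int,
    output_tokens: int,
    input_rate_per_million: float,
    output_rate_per_million: float,
) -> float:
    """Dollar cost of one run given token counts and per-million-token rates."""
    return (
        input_tokens / 1_000_000 * input_rate_per_million
        + output_tokens / 1_000_000 * output_rate_per_million
    )


# (input tokens, output tokens) from the table above; rates from the methodology.
OPENAI_DIRECT = {
    "LangGraph": (15_200, 6_800),
    "CrewAI": (18_400, 7_600),
    "AutoGen": (23_000, 9_000),
}

for name, (tokens_in, tokens_out) in OPENAI_DIRECT.items():
    print(f"{name}: ${run_cost(tokens_in, tokens_out, 2.50, 10.00):.3f}")
```

Because the token counts are themselves rounded averages, the computed figures land within about a cent and a half of the rounded cost column rather than matching it exactly.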
Results — Output Quality
Three reviewers blind-scored each memo against the rubric (5 dimensions × 2 points each). Averages:

| Framework | Fundamentals | Technicals | Macro | Decisive call | Citations | Total |
|---|---|---|---|---|---|---|
| Swarms | 1.8 | 1.7 | 1.7 | 1.7 | 1.6 | 8.5 |
| LangGraph | 1.8 | 1.7 | 1.7 | 1.7 | 1.6 | 8.5 |
| CrewAI | 1.7 | 1.6 | 1.6 | 1.5 | 1.6 | 8.0 |
| AutoGen | 1.7 | 1.5 | 1.6 | 1.5 | 1.7 | 8.0 |
LangGraph’s TypedDict made the upstream context unambiguous to the PM node. The output read like a memo from a more deliberate team. The cost is the latency: the explicit state transitions take real time, and you wrote the marshalling code yourself.
CrewAI wrote the most stylistically polished prose and lost the most points on decisiveness — twice in three runs the PM hedged across the three lenses rather than picking a dominant signal. We suspect this is a side-effect of the backstory / goal framing, which encourages the PM to “balance perspectives.” Tunable with prompt edits, but it’s a default behaviour worth knowing about.
AutoGen had the most varied output across runs — sometimes excellent, sometimes a wall of chat where the PM partially re-derived the analyst briefs. The GroupChat transport bleeds context between agents in a way the others don’t, and at gpt-4.1 temperatures that’s a mixed blessing.
Where Each Framework Wins
LangGraph wins when your workflow is a real graph with conditional edges you need to inspect, replay, or checkpoint. The interrupt_before / interrupt_after hooks for human-in-the-loop are genuinely useful, the MemorySaver / SqliteSaver checkpointers give you replayable state, and astream_events is the cleanest streaming model of the four. If you’re building a workflow that genuinely cannot be expressed as a swarm topology (a long-running loop that pauses for human approval, or a graph whose shape depends on a classifier node’s output), LangGraph is the right choice. It’s also the most respected by the kind of engineer who reviews your architecture diagram.
CrewAI wins on developer-experience polish. The docs are the best of the four, the examples are runnable, and Agent / Task / Crew is the most teachable mental model — junior engineers grasp it in 30 minutes. If your team is new to multi-agent and you need code that reads well in a code review, this is the gentlest on-ramp. The cost (literal cost) catches up at scale, but for prototyping and small workloads the ergonomics are excellent.
AutoGen wins on research flexibility. GroupChat is the most general primitive — if you want emergent conversation patterns, debate dynamics, or speakers who can address each other ad hoc rather than via a fixed graph, AutoGen will let you express that with less violence than the others. We rate it last on cost and structure for the same reason: that generality is what’s eating your token bill.
Swarms wins on shipping production multi-agent quickly. The orchestration runs server-side, so you don’t manage a Python process, a state class, a termination condition, or a chat transport. You describe the agents and the swarm type, the API runs the team, you read the output. That’s the entire posture. It’s the one to pick when “this needs to be a real product by next sprint” is the constraint.
Why Swarms Came Out On Top On This Task
Three concrete reasons, not vibes:
- Zero orchestration code. The HierarchicalSwarm payload is the architecture. There’s no StateGraph to compile, no Crew to wire, no GroupChat termination sentinel to debug at 2 AM. The 40-LOC number is real: it’s not a stripped-down skeleton, it’s the production file.
- Server-side parallel fan-out. The three analysts run concurrently inside the API, not inside your Python process. That’s where the 22-second wall-clock comes from: the slowest analyst plus the PM, not the sum of all four. Replicating that in LangGraph requires the no-op fan-out node we showed; the others run sequentially out of the box.
- Lower token volume per hop. Swarms’ hierarchical transport hands the PM a compact briefing assembled by the platform, not the entire LangChain message history and not the full GroupChat transcript. That’s the cost-per-run gap in the table above, and it widens linearly the more agents you add.
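The "slowest analyst plus the PM, not the sum of all four" point can be demonstrated with a small asyncio sketch. The agents here are stand-ins that sleep instead of calling a model, and the durations are made up and scaled down; it shows the shape of the difference, not the measured latencies.

```python
import asyncio
import time


async def fake_agent(seconds: float) -> None:
    """Stand-in for one model call: sleeps instead of hitting an API."""
    await asyncio.sleep(seconds)


async def run_sequential(analysts: list[float], pm: float) -> float:
    """Run every analyst one after another, then the PM; return elapsed seconds."""
    start = time.perf_counter()
    for seconds in analysts:
        await fake_agent(seconds)
    await fake_agent(pm)
    return time.perf_counter() - start


async def run_fan_out(analysts: list[float], pm: float) -> float:
    """Run the analysts concurrently, then the PM; return elapsed seconds."""
    start = time.perf_counter()
    await asyncio.gather(*(fake_agent(seconds) for seconds in analysts))
    await fake_agent(pm)
    return time.perf_counter() - start


if __name__ == "__main__":
    analysts, pm = [0.05, 0.08, 0.06], 0.04  # hypothetical durations in seconds
    print("sequential:", round(asyncio.run(run_sequential(analysts, pm)), 3))
    print("fan-out:   ", round(asyncio.run(run_fan_out(analysts, pm)), 3))
```

Sequential time approaches the sum of all four durations, while fan-out time approaches the longest analyst plus the PM, so the gap grows with the number of analysts.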
Reproduce This Benchmark
All four implementations, the prompts, the rubric, and the scoring script:
Next Steps
If you’ve decided to migrate, we have direct side-by-side guides for each framework:
- Migrate from LangGraph: StateGraph → GraphWorkflow, node-by-node translations
- Migrate from CrewAI: Crew / Agent / Task → SequentialWorkflow and HierarchicalSwarm
- Migrate from AutoGen: GroupChat → ConcurrentWorkflow / HierarchicalSwarm
- Migrate from LangChain: AgentExecutor and chains → Swarms agents and workflows
- Drop-In Migration from OpenAI SDK: keep your existing openai client; change two strings
- Cost Optimization Playbook: once you’re on Swarms, this is how you drive cost down another 2–5×