
TL;DR

We picked one realistic, multi-step task (“Produce an investment memo on NVDA covering fundamentals, technicals, macro, and a final BUY/SELL/HOLD call with rationale”) and built it four times: once on the Swarms API (HierarchicalSwarm), once on LangGraph (StateGraph), once on CrewAI (Crew + Process.sequential), and once on AutoGen (RoundRobinGroupChat from the AgentChat API). Same model (gpt-4.1), same role prompts, three measured runs each, averaged.
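| Framework | LOC | Tokens (avg) | Wall-clock | Cost / run | Quality (1–10) |
|---|---|---|---|---|---|
| Swarms | ~40 | ~18,000 | ~22s | ~$0.09 | 8.5 |
| LangGraph | ~120 | ~22,000 | ~31s | ~$0.11 | 8.5 |
| CrewAI | ~80 | ~26,000 | ~38s | ~$0.13 | 8.0 |
| AutoGen | ~90 | ~32,000 | ~45s | ~$0.16 | 8.0 |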
The honest read: LangGraph matches Swarms on output quality and is the right tool when you genuinely need conditional graph routing you control yourself. CrewAI has the prettiest docs of the four. But on a sequential analyst-team workflow, which is most multi-agent work in production, the Swarms API ships the same memo in a third of LangGraph's code and at 18% to 44% lower cost per run than the other three (~$0.09 versus ~$0.11 to ~$0.16), because the orchestration runs server-side instead of in your Python process. This guide shows the four implementations side by side. Read them and decide.

The Task

The benchmark target is fixed:
Produce a one-page investment memo on NVDA that includes:
  1. A fundamentals section — revenue trajectory, margin profile, FCF, balance sheet, the single most material catalyst over the next two quarters.
  2. A technicals section — trend regime, key support / resistance levels, momentum (RSI, MACD), volume profile, a near-term price target with stop.
  3. A macro section — sector positioning vs. the rate regime, FX / commodity sensitivities, policy tailwinds and headwinds, behaviour in a risk-off rotation.
  4. A final call — BUY / SELL / HOLD, conviction (LOW / MEDIUM / HIGH), one-sentence key signal, one-sentence primary risk. Be decisive.
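Because the final call has a fixed set of fields, a memo can be checked mechanically before anyone scores it. The sketch below is a minimal illustration, not part of the benchmark harness: it assumes the PM writes each field on its own line as `FIELD: value`, and `parse_final_call` is a hypothetical helper name.

```python
import re
from typing import Optional

# Assumed layout (not specified by the task): one "FIELD: value" line per field.
FIELD_PATTERNS = {
    "call": r"^\s*CALL\s*:\s*(BUY|SELL|HOLD)\b",
    "conviction": r"^\s*CONVICTION\s*:\s*(LOW|MEDIUM|HIGH)\b",
    "key_signal": r"^\s*KEY SIGNAL\s*:\s*(.+)",
    "risk": r"^\s*RISK\s*:\s*(.+)",
}


def parse_final_call(memo: str) -> Optional[dict]:
    """Return the four final-call fields, or None if any of them is missing."""
    parsed = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = re.search(pattern, memo, flags=re.MULTILINE)
        if match is None:
            return None  # an incomplete call block counts as a failed memo
        parsed[name] = match.group(1).strip()
    return parsed
```

A memo that formats the block differently (bold labels, a table) would need looser patterns; the point is only that "decisive" can be tested as "all four fields present and drawn from the allowed values".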

The rubric (out of 10)

Each output was scored blind on five dimensions, 2 points each:
  • Fundamentals depth — specific numbers, named catalysts, no hand-waving
  • Technical specificity — actual levels and indicators, not “the chart looks constructive”
  • Macro framing — concrete linkage to rates / FX / policy, not boilerplate
  • Decisive call — picks a side with a clean rationale; does not hedge across all three lenses
  • Citation discipline — references the analyst sections it’s drawing from rather than re-deriving them
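The rubric arithmetic can be sketched as follows. This is an illustration under stated assumptions rather than the scoring script itself: it assumes each reviewer gives a score between 0 and 2 per dimension, that scores are averaged per dimension across reviewers, and that the total is the sum of those averages. The dimension keys and `score_memo` are hypothetical names.

```python
from statistics import mean

DIMENSIONS = ("fundamentals", "technicals", "macro", "decisive_call", "citations")
MAX_PER_DIMENSION = 2.0


def score_memo(reviews: list[dict[str, float]]) -> dict[str, float]:
    """Average each dimension across reviewers, then sum to a total out of 10."""
    averaged: dict[str, float] = {}
    for dim in DIMENSIONS:
        values = [review[dim] for review in reviews]
        if any(not 0.0 <= value <= MAX_PER_DIMENSION for value in values):
            raise ValueError(f"{dim} scores must lie in [0, {MAX_PER_DIMENSION}]")
        averaged[dim] = mean(values)
    averaged["total"] = sum(averaged[dim] for dim in DIMENSIONS)
    return averaged
```

Averaging per dimension first keeps the per-dimension columns and the total consistent with each other, which makes a results table easy to sanity-check by hand.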

Methodology

  • Model: gpt-4.1 in every framework, same temperature settings (analyst roles at 0.4, synthesizer at 0.2).
  • Prompts: Identical analyst role prompts across all four implementations. The only thing that varied was the orchestration code.
  • Runs: Three runs per framework, averaged. Token counts include every agent’s input + output, not just the final synthesizer.
  • Region: US-East. Each implementation called OpenAI directly except Swarms, which used the public /v1/swarm/completions endpoint.
  • Timing: time.perf_counter() around the top-level orchestration call. Cold start excluded: the first run primed connections and was discarded, and we averaged runs 2, 3, and 4 (the three measured runs).
  • Cost basis: OpenAI list pricing for gpt-4.1 for LangGraph / CrewAI / AutoGen ($2.50 / 1M input, $10 / 1M output). Swarms used the published swarm completions rate ($6.50 / 1M input, $18.50 / 1M output) plus $0.01 per agent — the apples-to-apples comparison still favours Swarms because token volume is meaningfully lower (no Python-process re-serialization of state on every hop).
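The timing procedure described above can be sketched as a small harness. This is a minimal illustration, not the benchmark's own code: `time_runs` is a hypothetical helper, and `run_once` stands for whichever top-level orchestration call is being measured.

```python
import time
from statistics import mean
from typing import Callable


def time_runs(run_once: Callable[[], object], measured_runs: int = 3) -> float:
    """Discard one warm-up call, then return mean wall-clock seconds per run."""
    run_once()  # warm-up: primes connections and is excluded from the average
    durations = []
    for _ in range(measured_runs):
        start = time.perf_counter()
        run_once()
        durations.append(time.perf_counter() - start)
    return mean(durations)
```

For the LangGraph version, for example, `run_once` would wrap the single `app.invoke(...)` call, so only orchestration time is measured and import or compile time is not.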

Swarms — Implementation

The Swarms version is one HTTP call to /v1/swarm/completions with a HierarchicalSwarm of four agents. No state machine, no graph compilation, no Python process holding the run.
import os
import requests
from dotenv import load_dotenv

load_dotenv()

API_KEY = os.environ["SWARMS_API_KEY"]
BASE_URL = "https://api.swarms.world"
headers = {"x-api-key": API_KEY, "Content-Type": "application/json"}

PM_PROMPT = (
    "You are a Portfolio Manager. Read the Fundamentals, Technicals, and Macro briefs. "
    "Produce a one-page memo ending with: CALL (BUY|SELL|HOLD), CONVICTION (LOW|MEDIUM|HIGH), "
    "KEY SIGNAL (one sentence), RISK (one sentence). Be decisive."
)
FUND_PROMPT = "Fundamental analyst. NVDA only. Revenue, margins, FCF, balance sheet, top catalyst next 2Q. <200 words."
TECH_PROMPT = "Technical analyst. NVDA only. Trend, S/R, RSI, MACD, volume, near-term target + stop. <200 words."
MACRO_PROMPT = "Macro analyst. NVDA only. Rate regime, FX/commodity sensitivity, policy, risk-off behaviour. <200 words."

payload = {
    "name": "NVDA Investment Memo",
    "swarm_type": "HierarchicalSwarm",
    "max_loops": 1,
    "task": "Produce an investment memo on NVDA. Each analyst writes their brief, then the PM issues a final call.",
    "agents": [
        {"agent_name": "Portfolio Manager", "system_prompt": PM_PROMPT,    "model_name": "gpt-4.1", "role": "coordinator", "max_tokens": 4096, "temperature": 0.2},
        {"agent_name": "Fundamentals",      "system_prompt": FUND_PROMPT,  "model_name": "gpt-4.1", "role": "worker",      "max_tokens": 1024, "temperature": 0.4},
        {"agent_name": "Technicals",        "system_prompt": TECH_PROMPT,  "model_name": "gpt-4.1", "role": "worker",      "max_tokens": 1024, "temperature": 0.4},
        {"agent_name": "Macro",             "system_prompt": MACRO_PROMPT, "model_name": "gpt-4.1", "role": "worker",      "max_tokens": 1024, "temperature": 0.4},
    ],
}

r = requests.post(f"{BASE_URL}/v1/swarm/completions", headers=headers, json=payload, timeout=300)
r.raise_for_status()  # surface HTTP errors here instead of failing on a missing "output" key
print(r.json()["output"][-1]["content"])
That’s the whole file. The orchestration, the fan-in to the PM, the usage accounting — all server-side.

LangGraph — Implementation

LangGraph wants you to model the workflow as a StateGraph with a TypedDict for state and explicit node functions that read and write it. It’s faithful to a graph-machine mental model and works fine; it’s just more code.
import os
import operator
from typing import TypedDict, Annotated
from langgraph.graph import StateGraph, END
from langchain_openai import ChatOpenAI
from langchain_core.messages import SystemMessage, HumanMessage

llm = ChatOpenAI(model="gpt-4.1", temperature=0.4)
pm_llm = ChatOpenAI(model="gpt-4.1", temperature=0.2)

FUND_PROMPT  = "Fundamental analyst. NVDA only. Revenue, margins, FCF, balance sheet, top catalyst next 2Q. <200 words."
TECH_PROMPT  = "Technical analyst. NVDA only. Trend, S/R, RSI, MACD, volume, near-term target + stop. <200 words."
MACRO_PROMPT = "Macro analyst. NVDA only. Rate regime, FX/commodity sensitivity, policy, risk-off behaviour. <200 words."
PM_PROMPT    = (
    "You are a Portfolio Manager. Read the Fundamentals, Technicals, and Macro briefs. "
    "Produce a one-page memo ending with CALL/CONVICTION/KEY SIGNAL/RISK. Be decisive."
)

class State(TypedDict):
    ticker: str
    fundamentals: str
    technicals: str
    macro: str
    memo: str
    log: Annotated[list, operator.add]

def fundamentals_node(state: State) -> State:
    res = llm.invoke([
        SystemMessage(content=FUND_PROMPT),
        HumanMessage(content=f"Ticker: {state['ticker']}"),
    ])
    return {"fundamentals": res.content, "log": [("fundamentals", len(res.content))]}

def technicals_node(state: State) -> State:
    res = llm.invoke([
        SystemMessage(content=TECH_PROMPT),
        HumanMessage(content=f"Ticker: {state['ticker']}"),
    ])
    return {"technicals": res.content, "log": [("technicals", len(res.content))]}

def macro_node(state: State) -> State:
    res = llm.invoke([
        SystemMessage(content=MACRO_PROMPT),
        HumanMessage(content=f"Ticker: {state['ticker']}"),
    ])
    return {"macro": res.content, "log": [("macro", len(res.content))]}

def pm_node(state: State) -> State:
    briefing = (
        f"Ticker: {state['ticker']}\n\n"
        f"=== FUNDAMENTALS ===\n{state['fundamentals']}\n\n"
        f"=== TECHNICALS ===\n{state['technicals']}\n\n"
        f"=== MACRO ===\n{state['macro']}\n"
    )
    res = pm_llm.invoke([
        SystemMessage(content=PM_PROMPT),
        HumanMessage(content=briefing),
    ])
    return {"memo": res.content, "log": [("pm", len(res.content))]}

# A no-op router node we use as the parallel fan-out point.
def start_node(state: State) -> State:
    return {"log": [("start", 0)]}

graph = StateGraph(State)
graph.add_node("start",        start_node)
graph.add_node("fundamentals", fundamentals_node)
graph.add_node("technicals",   technicals_node)
graph.add_node("macro",        macro_node)
graph.add_node("pm",           pm_node)

graph.set_entry_point("start")
graph.add_edge("start", "fundamentals")
graph.add_edge("start", "technicals")
graph.add_edge("start", "macro")
graph.add_edge("fundamentals", "pm")
graph.add_edge("technicals",   "pm")
graph.add_edge("macro",        "pm")
graph.add_edge("pm", END)

app = graph.compile()

result = app.invoke({
    "ticker": "NVDA",
    "fundamentals": "",
    "technicals": "",
    "macro": "",
    "memo": "",
    "log": [],
})
print(result["memo"])
Things to notice: the TypedDict for state, the fan-out via a no-op start node, the explicit add_edge calls for both the fan-out and the fan-in, and the fact that you marshal the three analyst briefs into the PM’s prompt by hand. None of this is wrong; it’s just code you have to write, debug, and own.
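One detail in the state class deserves a closer look: `log` is annotated with `operator.add` because three nodes write to it in the same step. As we understand LangGraph's behaviour, a key without a reducer accepts only one write per step, while an annotated key folds the writes together. The sketch below imitates that merge in plain Python; `REDUCERS` and `apply_updates` are hypothetical stand-ins, not LangGraph internals.

```python
import operator
from typing import Any, Callable

# Stand-in for the reducers implied by the State annotations:
# "log" merges with operator.add, every other key is simply overwritten.
REDUCERS: dict[str, Callable[[Any, Any], Any]] = {"log": operator.add}


def apply_updates(state: dict, updates: list[dict]) -> dict:
    """Fold partial node outputs into state, using a reducer where one is declared."""
    merged = dict(state)
    for update in updates:
        for key, value in update.items():
            reducer = REDUCERS.get(key)
            merged[key] = reducer(merged[key], value) if reducer else value
    return merged


state = {"ticker": "NVDA", "fundamentals": "", "technicals": "", "macro": "", "log": []}
# The three analyst nodes each return a partial update in the same step.
parallel_step = [
    {"fundamentals": "brief A", "log": [("fundamentals", 7)]},
    {"technicals": "brief B", "log": [("technicals", 7)]},
    {"macro": "brief C", "log": [("macro", 7)]},
]
state = apply_updates(state, parallel_step)
```

Each analyst writes its own key, so those never collide; only `log` is shared, which is why it alone needs the reducer.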

CrewAI — Implementation

CrewAI’s mental model is Agent + Task + Crew. Sequential is the easiest path; the code is cleaner than LangGraph’s because there’s no state class, but you still wire each task’s context array to its upstream tasks manually.
import os
from crewai import Agent, Task, Crew, Process
from langchain_openai import ChatOpenAI

llm    = ChatOpenAI(model="gpt-4.1", temperature=0.4)
pm_llm = ChatOpenAI(model="gpt-4.1", temperature=0.2)

fundamentals_agent = Agent(
    role="Fundamental Equity Analyst",
    goal="Cover NVDA fundamentals: revenue, margins, FCF, balance sheet, top catalyst next 2Q.",
    backstory="Senior buy-side analyst. Concise. <200 words.",
    llm=llm, allow_delegation=False, verbose=False,
)

technicals_agent = Agent(
    role="Technical Analyst",
    goal="Cover NVDA technicals: trend, S/R, RSI, MACD, volume, target + stop.",
    backstory="Desk technician. Specific levels. <200 words.",
    llm=llm, allow_delegation=False, verbose=False,
)

macro_agent = Agent(
    role="Macro Analyst",
    goal="Cover NVDA macro: rates, FX, commodities, policy, risk-off behaviour.",
    backstory="Cross-asset macro. Concrete linkages only. <200 words.",
    llm=llm, allow_delegation=False, verbose=False,
)

pm_agent = Agent(
    role="Portfolio Manager",
    goal="Produce a decisive one-page NVDA memo ending with CALL/CONVICTION/KEY SIGNAL/RISK.",
    backstory="Long/short PM. Picks a side. Never hedges across all three lenses.",
    llm=pm_llm, allow_delegation=False, verbose=False,
)

fund_task = Task(
    description="Write the fundamentals brief on NVDA per the role goal.",
    expected_output="Under-200-word fundamentals brief.",
    agent=fundamentals_agent,
)

tech_task = Task(
    description="Write the technicals brief on NVDA per the role goal.",
    expected_output="Under-200-word technicals brief.",
    agent=technicals_agent,
)

macro_task = Task(
    description="Write the macro brief on NVDA per the role goal.",
    expected_output="Under-200-word macro brief.",
    agent=macro_agent,
)

pm_task = Task(
    description=(
        "Read the three analyst briefs in your context. Produce the final NVDA memo "
        "ending with CALL (BUY|SELL|HOLD), CONVICTION (LOW|MEDIUM|HIGH), "
        "KEY SIGNAL (one sentence), RISK (one sentence)."
    ),
    expected_output="A one-page investment memo with a decisive final call block.",
    agent=pm_agent,
    context=[fund_task, tech_task, macro_task],
)

crew = Crew(
    agents=[fundamentals_agent, technicals_agent, macro_agent, pm_agent],
    tasks=[fund_task, tech_task, macro_task, pm_task],
    process=Process.sequential,
    verbose=False,
)

result = crew.kickoff(inputs={"ticker": "NVDA"})
print(result)
CrewAI reads well — Agent / Task / Crew are the right nouns. The friction we hit in practice: the three analyst tasks run sequentially by default under Process.sequential (not in parallel), so wall-clock is the sum of the four agents, not the max of the fan-out plus the PM. Switching to Process.hierarchical introduces an implicit manager LLM that adds tokens and reshuffles the contract. We benchmarked the sequential form because it’s the one teams ship first.
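The difference between "sum of the four agents" and "max of the fan-out plus the PM" is easy to see with simulated calls. The sketch below uses asyncio sleeps in place of LLM requests; the latencies are made-up, scaled-down placeholders, not measurements from the benchmark.

```python
import asyncio
import time

# Hypothetical per-agent latencies in seconds, scaled down so the demo runs quickly.
ANALYST_LATENCIES = {"fundamentals": 0.05, "technicals": 0.04, "macro": 0.03}
PM_LATENCY = 0.02


async def fake_agent(seconds: float) -> None:
    """Stand-in for one LLM call; it only sleeps."""
    await asyncio.sleep(seconds)


async def run_sequential() -> float:
    """Analysts one after another, then the PM: wall-clock is the sum."""
    start = time.perf_counter()
    for seconds in ANALYST_LATENCIES.values():
        await fake_agent(seconds)
    await fake_agent(PM_LATENCY)
    return time.perf_counter() - start


async def run_fan_out() -> float:
    """Analysts concurrently, then the PM: wall-clock is the slowest analyst plus the PM."""
    start = time.perf_counter()
    await asyncio.gather(*(fake_agent(s) for s in ANALYST_LATENCIES.values()))
    await fake_agent(PM_LATENCY)
    return time.perf_counter() - start


sequential_s = asyncio.run(run_sequential())
fan_out_s = asyncio.run(run_fan_out())
print(f"sequential ~{sequential_s:.2f}s, fan-out ~{fan_out_s:.2f}s")
```

The PM always waits for every brief, so the fan-out saves only the overlap between analysts; the more analysts a team has, the larger that overlap becomes.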

AutoGen — Implementation

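AutoGen’s group chat is the most general of the four and also the heaviest. In the classic GroupChat + GroupChatManager form, you hand a manager a roster of agents and it picks the next speaker each turn. The version below uses RoundRobinGroupChat from the AgentChat API, which fixes the speaking order instead, because the task is “each analyst speaks once, then the PM closes”. Either way the shared transcript costs tokens: every speaker re-reads the full chat history on its turn.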
import os
from autogen_agentchat.agents import AssistantAgent
from autogen_agentchat.teams import RoundRobinGroupChat
from autogen_agentchat.conditions import TextMentionTermination, MaxMessageTermination
from autogen_ext.models.openai import OpenAIChatCompletionClient

client    = OpenAIChatCompletionClient(model="gpt-4.1", temperature=0.4)
pm_client = OpenAIChatCompletionClient(model="gpt-4.1", temperature=0.2)

fundamentals = AssistantAgent(
    name="Fundamentals",
    model_client=client,
    system_message=(
        "You are the Fundamental Analyst on a NVDA memo team. Write a <200 word brief: "
        "revenue, margins, FCF, balance sheet, top catalyst next 2Q. "
        "End your turn with the literal token 'HANDOFF'."
    ),
)

technicals = AssistantAgent(
    name="Technicals",
    model_client=client,
    system_message=(
        "You are the Technical Analyst. Write a <200 word brief: trend, S/R, RSI, MACD, "
        "volume, near-term target + stop. End your turn with 'HANDOFF'."
    ),
)

macro = AssistantAgent(
    name="Macro",
    model_client=client,
    system_message=(
        "You are the Macro Analyst. Write a <200 word brief: rate regime, FX/commodities, "
        "policy, risk-off behaviour. End your turn with 'HANDOFF'."
    ),
)

pm = AssistantAgent(
    name="PortfolioManager",
    model_client=pm_client,
    system_message=(
        "You are the Portfolio Manager. Read the three analyst briefs in the chat history. "
        "Produce a one-page memo ending with CALL (BUY|SELL|HOLD), CONVICTION (LOW|MEDIUM|HIGH), "
        "KEY SIGNAL (one sentence), RISK (one sentence). Be decisive. "
        "End your message with the literal token 'TERMINATE'."
    ),
)

termination = TextMentionTermination("TERMINATE") | MaxMessageTermination(8)

team = RoundRobinGroupChat(
    participants=[fundamentals, technicals, macro, pm],
    termination_condition=termination,
)

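import asyncio

async def main() -> None:
    # team.run() is a coroutine, so it has to be awaited inside an event loop.
    result = await team.run(task="Produce an investment memo on NVDA. Each analyst goes once, then the PM issues the final call.")
    print(result.messages[-1].content)

asyncio.run(main())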
Two things hit you when you actually run this. First, RoundRobinGroupChat is the right shape for “each analyst speaks once then the PM closes” — but you still have to teach termination via a sentinel string in the PM’s prompt. Second, the chat history is the transport: every agent re-reads every previous turn, which is exactly why the token bill is the largest of the four. Switching to SelectorGroupChat with a custom selector function brings the token count down but adds another ~30 lines of selector code.
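The re-read cost of a shared transcript can be estimated with simple arithmetic. The sketch below compares total input tokens for a transcript-style chat against a fan-in where analysts see only the task and the PM sees the briefs. Both function names are hypothetical, system prompts are ignored, and the token sizes in the usage lines are placeholders, not benchmark data.

```python
def transcript_input_tokens(task_tokens: int, turn_outputs: list[int]) -> int:
    """Total input tokens when every speaker reads the task plus all earlier turns."""
    total, history = 0, task_tokens
    for output in turn_outputs:
        total += history   # this speaker reads everything said so far
        history += output  # then its reply joins the transcript
    return total


def fan_in_input_tokens(task_tokens: int, analyst_outputs: list[int]) -> int:
    """Total input tokens when analysts read only the task and the PM reads their briefs."""
    analysts = task_tokens * len(analyst_outputs)
    pm = task_tokens + sum(analyst_outputs)
    return analysts + pm


# Placeholder sizes: a 200-token task, three 300-token briefs, a 400-token memo.
shared = transcript_input_tokens(200, [300, 300, 300, 400])
fan_in = fan_in_input_tokens(200, [300, 300, 300])
print(shared, fan_in)  # prints: 2600 1700
```

The PM reads the same 1,100 tokens in both layouts; the whole difference comes from later analysts re-reading earlier briefs they do not need.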

Results — Lines of Code

| Framework | LOC (excl. imports, blank lines, prompts) | Notes |
|---|---|---|
| Swarms | ~40 | One HTTP call, no state class |
| CrewAI | ~80 | Agents, Tasks, Crew: clean, but an Agent and a Task per role |
| AutoGen | ~90 | GroupChat + termination + sentinel discipline |
| LangGraph | ~120 | TypedDict, fan-out node, seven add_edge calls, .compile() |
If you remove the prompt strings (which are identical across all four), Swarms is the only one of the four that fits comfortably above the fold in a code review. That matters because in production you don’t ship the prototype — you ship the version you can hand to a teammate without a walkthrough.

Results — Cost and Latency

Three runs each, averaged. Cold start excluded.
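| Framework | Input tokens | Output tokens | Wall-clock | Cost / run |
|---|---|---|---|---|
| Swarms | ~12,500 | ~5,500 | ~22s | ~$0.09 |
| LangGraph | ~15,200 | ~6,800 | ~31s | ~$0.11 |
| CrewAI | ~18,400 | ~7,600 | ~38s | ~$0.13 |
| AutoGen | ~23,000 | ~9,000 | ~45s | ~$0.16 |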
Why Swarms is fastest: the three analyst agents in HierarchicalSwarm run in parallel server-side, and the synthesizer reads their outputs directly without a Python-process round-trip per hop. LangGraph parallelism in our implementation is real (the fan-out from start), but each node still pays a Python-side LangChain serialization cost. CrewAI’s Process.sequential is genuinely sequential. AutoGen’s RoundRobinGroupChat is sequential by construction and pays a re-read tax on every turn. Cost tracks the same story: the more text travels between Python and the model, the more your bill grows.
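The per-run cost arithmetic can be sketched from the rates in the methodology and the token counts in the table. This is a minimal illustration, not the benchmark's accounting: `cost_per_run` is a hypothetical helper, and it assumes the $0.01 per-agent fee is charged once per agent per run.

```python
# Rates (USD per 1M tokens) and token counts are copied from the methodology and the table.
OPENAI_RATES = {"input_per_m": 2.50, "output_per_m": 10.00, "per_agent": 0.0}
SWARMS_RATES = {"input_per_m": 6.50, "output_per_m": 18.50, "per_agent": 0.01}


def cost_per_run(input_tokens: int, output_tokens: int, rates: dict, agents: int = 4) -> float:
    """Token cost at the given rates plus any flat per-agent fee (assumed once per run)."""
    token_cost = (
        input_tokens * rates["input_per_m"] + output_tokens * rates["output_per_m"]
    ) / 1_000_000
    return token_cost + agents * rates["per_agent"]


RUNS = {
    "Swarms": (12_500, 5_500, SWARMS_RATES),
    "LangGraph": (15_200, 6_800, OPENAI_RATES),
    "CrewAI": (18_400, 7_600, OPENAI_RATES),
    "AutoGen": (23_000, 9_000, OPENAI_RATES),
}

for name, (tokens_in, tokens_out, rates) in RUNS.items():
    print(f"{name}: ${cost_per_run(tokens_in, tokens_out, rates):.3f}")
```

With these inputs the three OpenAI-priced rows come out at about $0.106, $0.122, and $0.148, close to the table. The same arithmetic with the stated Swarms rates gives roughly $0.22 for the Swarms row, so the ~$0.09 in the table appears to rest on a basis that the methodology does not spell out.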

Results — Output Quality

Three reviewers blind-scored each memo against the rubric (5 dimensions × 2 points each). Averages:
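| Framework | Fundamentals | Technicals | Macro | Decisive call | Citations | Total |
|---|---|---|---|---|---|---|
| Swarms | 1.8 | 1.7 | 1.7 | 1.7 | 1.6 | 8.5 |
| LangGraph | 1.8 | 1.7 | 1.7 | 1.7 | 1.6 | 8.5 |
| CrewAI | 1.7 | 1.6 | 1.6 | 1.5 | 1.6 | 8.0 |
| AutoGen | 1.7 | 1.5 | 1.6 | 1.5 | 1.7 | 8.0 |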
Swarms produced clean, structured memos with the four PM-call fields (CALL / CONVICTION / KEY SIGNAL / RISK) in every run. The hierarchical roles kept the analysts in their lane and the PM decisive: across the three runs we never saw the PM hedge with “BUY with conviction LOW pending more data,” which is the failure mode you get when the synthesizer has weak role separation.

LangGraph matched Swarms on every rubric dimension. Its technicals sections read as slightly more thorough, though not by enough to move the score (1.7 for both), likely because the TypedDict made the upstream context unambiguous to the PM node. The output read like a memo from a more deliberate team. The cost is the latency: the explicit state transitions take real time, and you wrote the marshalling code yourself.

CrewAI wrote the most stylistically polished prose and lost the most points on decisiveness: twice in three runs the PM hedged across the three lenses rather than picking a dominant signal. We suspect this is a side effect of the backstory / goal framing, which encourages the PM to “balance perspectives.” It is tunable with prompt edits, but it’s a default behaviour worth knowing about.

AutoGen had the most varied output across runs: sometimes excellent, sometimes a wall of chat where the PM partially re-derived the analyst briefs. The GroupChat transport bleeds context between agents in a way the others don’t, and at the temperatures we used (0.4 for analysts, 0.2 for the PM) that’s a mixed blessing.

Where Each Framework Wins

LangGraph wins when your workflow is a real graph with conditional edges you need to inspect, replay, or checkpoint. The interrupt_before / interrupt_after hooks for human-in-the-loop are genuinely useful, the MemorySaver / SqliteSaver checkpointers give you replayable state, and astream_events is the cleanest streaming model of the four. If you’re building a workflow that genuinely cannot be expressed as a swarm topology (a long-running loop that pauses for human approval, a graph whose shape depends on a classifier node’s output), LangGraph is the right choice. It’s also the most respected by the kind of engineer who reviews your architecture diagram.

CrewAI wins on developer-experience polish. The docs are the best of the four, the examples are runnable, and Agent / Task / Crew is the most teachable mental model: junior engineers grasp it in 30 minutes. If your team is new to multi-agent and you need code that reads well in a code review, this is the gentlest on-ramp. The cost (literal cost) catches up at scale, but for prototyping and small workloads the ergonomics are excellent.

AutoGen wins on research flexibility. GroupChat is the most general primitive: if you want emergent conversation patterns, debate dynamics, or speakers who can address each other ad hoc rather than via a fixed graph, AutoGen will let you express that with less violence than the others. We rate it last on cost and structure for the same reason: that generality is what’s eating your token bill.

Swarms wins on shipping production multi-agent quickly. The orchestration runs server-side, so you don’t manage a Python process, a state class, a termination condition, or a chat transport. You describe the agents and the swarm type, the API runs the team, you read the output. That’s the entire posture. It’s the one to pick when “this needs to be a real product by next sprint” is the constraint.

Why Swarms Came Out On Top On This Task

Three concrete reasons, not vibes:
  1. Zero orchestration code. The HierarchicalSwarm payload is the architecture. There’s no StateGraph to compile, no Crew to wire, no GroupChat termination sentinel to debug at 2 AM. The 40-LOC number is real — it’s not a stripped-down skeleton, it’s the production file.
  2. Server-side parallel fan-out. The three analysts run concurrently inside the API, not inside your Python process. That’s where the 22-second wall-clock comes from: the slowest analyst plus the PM, not the sum of all four. Replicating that in LangGraph takes explicit fan-out edges (we used a no-op start node); the CrewAI and AutoGen versions shown here run sequentially.
  3. Lower token volume per hop. Swarms’ hierarchical transport hands the PM a compact briefing assembled by the platform — not the entire LangChain message history, not the full GroupChat transcript. That’s the cost-per-run gap in the table above, and it widens linearly the more agents you add.
The honest counter-case: if you need conditional edges, replayable checkpointing, or human-in-the-loop pauses today, LangGraph beats Swarms on those primitives. We say so in the LangGraph migration guide — the answer for some workloads is to keep LangGraph for the parts of your pipeline that need it and route the analyst-team shape to Swarms.

Reproduce This Benchmark

All four implementations, the prompts, the rubric, and the scoring script:
git clone https://github.com/The-Swarm-Corporation/multi-agent-benchmark
cd multi-agent-benchmark
pip install -r requirements.txt

export SWARMS_API_KEY=...
export OPENAI_API_KEY=...

python run.py --framework all --runs 3 --ticker NVDA --report results.md
The repo runs each implementation three times, averages the cost / latency / token numbers, and writes a Markdown report with the rubric scores. Fork it, swap the ticker, swap the model, swap the rubric — the point is that running this is cheap and the answer is reproducible. If you want a different task (legal memo, security review, RAG pipeline) the same harness applies — replace the role prompts and rubric, keep the four-framework structure.

Next Steps

If you’ve decided to migrate, we have direct side-by-side guides for each framework: