Multi-agent systems are where the real engineering complexity in AI lives. A single agent calling a few tools is manageable. Multiple agents, each with their own tools, handing off work to each other — that's a different problem.
I've built several multi-agent systems over the past year. Here's what actually went wrong and what I did about it.
Failure Mode 1: Context Collapse
The most common failure: an agent receives a task, processes it, and passes results to the next agent. By the time the third agent in the chain sees the information, critical context has been lost or garbled.
Each agent summarizes and reformats. Information degrades at each step. The final output is confidently wrong in ways that trace back to a misinterpretation three steps earlier.
The fix: explicit context schemas between agents. Instead of agents passing natural language summaries, define structured objects that agents must populate:
from pydantic import BaseModel
from typing import Optional
class ResearchResult(BaseModel):
query: str # Original question — never gets dropped
sources: list[str] # URLs found
key_facts: list[str] # Extracted facts, verbatim when possible
uncertainty: str # What the agent wasn't sure about
next_agent: str # Who should receive this
# Agents return structured objects, not natural language
result: ResearchResult = research_agent.run(query)
synthesis_agent.run(result) # Gets the full structured context
The uncertainty field is important — it makes gaps explicit rather than having agents fill them with hallucinations.
Failure Mode 2: Infinite Loops
Agents that hand off to each other can loop. Agent A decides Agent B should handle something, Agent B decides it belongs to Agent A, they bounce the task indefinitely.
This is more common than it should be. Agents are bad at recognizing "I already handled this" without explicit state.
The fix: task IDs and a shared state store that tracks what's been attempted.
import uuid
class Task(BaseModel):
id: str = str(uuid.uuid4())
description: str
attempted_by: list[str] = []
max_attempts: int = 3
def dispatch(task: Task, agent_name: str):
if agent_name in task.attempted_by:
raise ValueError(f"Loop detected: {agent_name} already attempted task {task.id}")
if len(task.attempted_by) >= task.max_attempts:
raise ValueError(f"Max attempts reached for task {task.id}")
task.attempted_by.append(agent_name)
return agents[agent_name].run(task)
Hard limits on attempts are non-negotiable. Agents will loop without them.
Failure Mode 3: Tool Conflicts
Multiple agents with overlapping tool access modify the same resource simultaneously. One agent writes a file while another reads it. Two agents both try to create the same record. The state becomes inconsistent.
The fix: resource locking and clear ownership. Each resource should have one agent responsible for writing to it. Other agents that need that resource should request it through the owning agent, not access it directly.
This is just distributed systems discipline applied to agents.
Failure Mode 4: Silent Failures
An agent fails, returns an empty result, and the downstream agent continues without knowing there was a problem. The final output looks complete but is silently missing sections.
The fix: agents must signal their status explicitly. Never return empty — return a structured failure response:
class AgentResult(BaseModel):
success: bool
result: Optional[str]
error: Optional[str]
agent: str
duration_ms: int
# Downstream agents check success before proceeding
if not previous_result.success:
# Handle failure: retry, escalate, or fail the whole task
raise AgentFailure(f"{previous_result.agent} failed: {previous_result.error}")
Failure Mode 5: Cost Runaway
Multi-agent systems can make many more API calls than you expect. Each agent calls the LLM, possibly multiple times. A system with 5 agents can easily make 50 LLM calls for one user request.
The fix: token budgets per task, not just per agent. Track cumulative token usage across the entire agent chain and abort if it exceeds budget.
class TokenBudget:
def __init__(self, max_tokens: int):
self.max_tokens = max_tokens
self.used = 0
def consume(self, tokens: int):
self.used += tokens
if self.used > self.max_tokens:
raise BudgetExceeded(f"Used {self.used}/{self.max_tokens} tokens")
Pass the budget through every agent call. Agents that try to do more work than the budget allows fail loudly rather than running up an unexpected bill.
The Pattern That Works
The multi-agent systems that run reliably share common traits: structured data contracts between agents, explicit state tracking, hard limits on recursion and token usage, and explicit failure signaling. This is boring distributed systems engineering. It turns out agents are distributed systems, and the same lessons apply.