Why I Stopped Using One Agent and Built a Multi-Agent Pipeline Instead | Towards Data Science

we developed a text-to-SQL application. It was a simple one-agent architecture that takes a user query in natural language and after passing the validations checkpoint it generates a SQL query for it. After completing the development we started testing it and realised that one agent architecture was not enough for our application. It worked for simple queries where only 1 or 2 operations were required but it started failing when the user started asking complex queries, including multiple operations and schema scans.

After thorough testing of the application we realised that a single agent couldn’t perform every task. The agent was trying to parse the intent, map it to a schema, generate a valid SQL, and validate its own output all in one go. By the third retry, the context was so bloated with failed attempts and self-corrections that the agent started contradicting itself.

After that we decided to revisit our architecture and implemented multi-agent system with specialised agents for every task instead of relying on just one agent to do everything.

Why single agents struggle with complex tasks
What multi-agent architecture actually means in practice
How to design specialised agents and wire them together
Orchestration and state management with LangGraph
How agent nodes are actually implemented
The feedback loop and retry logic
What breaks in production
When you shouldn’t use this
Conclusion

Why a Single Agent Struggles

There’s an assumption when you first start building with LLMs that if the model is capable enough, a good prompt can do everything. This is achievable for simpler tasks but as the task grows in complexity you’re not just asking the model to think harder, you’re asking it to hold multiple competing mental models in a single context window simultaneously.

Take the text-to-SQL problem as an example. To turn a natural language question into a correct SQL query, you need to:

Parse what the user actually wants (intent decomposition)
Map those intents against the real schema (schema awareness)
Write syntactically correct SQL (query generation)
Verify the output SQL (validation)

Intent decomposition is an NLP task and it requires the agent to understand the user’s query and break down its hidden meaning. Schema mapping requires precise factual grounding of the database to map the intent against the correct table schemas. Query generation demands complete schema and syntactic knowledge. Validation needs a critical, adversarial lens to judge the generated SQL.

A single agent trying to do all four will do each of them with mediocrity but none of them well. The failures are very basic, like a join condition that’s almost right but wrong for the user’s context, a filter that ignores a time constraint, or an aggregate that looks correct until you check the edge cases and it fails.

Every failed attempt stays in memory. The model sees what it tried before and makes increasingly small adjustments, rather than stepping back and rethinking. After three retries you’re not getting a fresh attempt but a revision of a bad first draft.

What Multi-Agent Actually Means

A multi-agent system is a collection of agents, each with a specific responsibility, coordinated by an orchestrator that manages the flow between them. Instead of one agent doing everything with mediocre results, you have several agents that each specialised in one thing.

In practice, there are two ways to create a multi-agent system. The first is sequential, where agents run one after another, each receiving the output of the previous one as its input. The second is parallel, where agents run simultaneously on independent sub-tasks and their outputs are merged by an aggregator.

Most real systems use both. Our text-to-SQL pipeline is sequential by design: you can’t map schema until you’ve parsed intent, and you can’t validate until you have a query. But within that sequence, you might run parallel lookups against different schema sections simultaneously.

One thing to understand is that the orchestrator itself can be an LLM agent, or it can be deterministic routing logic. For a pipeline where the flow is well-defined and predictable, deterministic routing is almost always the better choice. You don’t need an LLM to decide “run schema mapping after intent parsing”, that decision never changes.

Designing the Agents

Why I Stopped Using One Agent and Built a Multi-Agent Pipeline Instead | Towards Data Science — **text-to-SQL multi-agent architecture diagram**

For a text-to-SQL system, the agent breakdown maps directly to the failure modes we saw with a single agent. We asked it a question for “extracting the top customers in each category and compare their purchase trend against the category average over the last 12 months,” and it helped us to scope every agent in our system.

The Intent Parser Agent takes the raw user question and decomposes it into discrete intents. The question about top customers contains at least three intents: rank customers by category, compute their purchase trend, and compare that trend against a category-level baseline. A single agent doing this inline tends to partially decompose and then generate SQL for an incomplete interpretation. This agent’s only job is decomposing the user’s query into different intents.

The Schema Agent receives the decomposed intents and maps them to actual table names, column names, join conditions, and data types from your database schema. This is where a single agents system fails because without explicit schema grounding, the agent invents column names that look plausible but don’t exist. For example customer_purchase_value sounds a real column in the customer transaction database but this column doesn’t exist in our database. Keeping this as a separate agent with the schema injected directly into its context solves the problem cleanly.

The Query Builder Agent takes the schema-mapped intents and generates SQL. By the time this agent runs, the ambiguity is already resolved and it’s doing a focused generation task, not interpretation. The output quality difference compared to a single agent doing all of this together is significant.

The Critic Agent is the one most teams skip and later regret. The query builder will produce a syntactically valid query but that’s different from being semantically correct. The critic agent receives the generated query and independently evaluates it against the original intents: does this actually answer what was asked? Are there edge cases it’s missing? Does the time window filter match what the user specified? You cannot credibly do this in the same context as generation because the model anchors to what it just wrote and will rationalise its own output rather than challenge it. A separate agent with a fresh context and an explicitly adversarial system prompt catches things the builder never would.

The Response Agent formats the final query for the user, adds a plain-language explanation of what it does, and surfaces any assumptions made during generation.

Each of these agents has a different system prompt, a different role, and most importantly, a fresh context window.

Wiring It Together with LangGraph

LangGraph is a good fit here because it gives you explicit control over state and edges. You define the graph, the nodes, and exactly how data flows between them. Nothing is abstracted away from you.

The state object travels through the entire pipeline. Every agent reads from it and writes back to it:

from typing import TypedDict, List, Optional

class PipelineState(TypedDict):
    user_query: str
    intents: Optional[List[dict]]
    schema_mapping: Optional[dict]
    generated_query: Optional[str]
    critique: Optional[dict]
    final_response: Optional[str]
    retry_count: int
    failure_source: Optional[str]  # tracks which agent caused a failure

The failure_source field is very important to include in the agent’s state. When something goes wrong in production you need to know whether the critic is rejecting because the query builder failed or because the intent parser sent garbage downstream.

The graph structure itself is very easy and straightforward to implement (Simplified logic for readability):

from langgraph.graph import StateGraph, END

def should_retry(state: PipelineState) -> str:
    critique = state.get("critique", {})
    if critique.get("passed"):
        return "respond"
    if state["retry_count"] >= 3:
        return "respond"  # surface best attempt, don't loop forever
    return "rebuild"

graph = StateGraph(PipelineState)

graph.add_node("parse_intent", intent_parser_node)
graph.add_node("map_schema", schema_agent_node)
graph.add_node("build_query", query_builder_node)
graph.add_node("critique", critic_agent_node)
graph.add_node("respond", response_agent_node)

graph.set_entry_point("parse_intent")
graph.add_edge("parse_intent", "map_schema")
graph.add_edge("map_schema", "build_query")
graph.add_edge("build_query", "critique")
graph.add_conditional_edges("critique", should_retry, {
    "rebuild": "build_query",
    "respond": "respond"
})
graph.add_edge("respond", END)

app = graph.compile()

The retry_count ceiling is non-negotiable here. Without it, a pipeline that keeps failing critique will loop indefinitely. Three retries means four total attempts, and if the system hasn’t produced an acceptable query by then, surface the best attempt and flag it for review rather than burning tokens in a spiral.

How Agent Nodes Are Actually Implemented

The graph definition above shows the structure but the real work happens inside each of the nodes.

Each node is a function that takes the current state, does something with it, and returns the fields it wants to update. Here’s what the intent parser and critic agent looks like in practice (Simplified logic for readability):

from langchain_core.messages import SystemMessage, HumanMessage
from langchain_google_vertexai import ChatVertexAI
import json

llm = ChatVertexAI(
    model="gemini-2.5-flash",
    temperature=0,
    max_tokens=None
)

def intent_parser_node(state: PipelineState) -> dict:
    system = SystemMessage(content="""You are an intent decomposition specialist.
Your only job is to break a natural language question into discrete analytical intents.
Return a JSON list. Each intent should have: 'description', 'metric', 'filters', 'time_range'.
Do not generate SQL. Do not reference any database schema.""")

    human = HumanMessage(content=state["user_query"])
    response = llm.invoke([system, human])

    intents = json.loads(response.content)
    return {"intents": intents}


def critic_agent_node(state: PipelineState) -> dict:
    system = SystemMessage(content="""You are a SQL query critic. Your job is to find problems.
Given the original intents and the generated SQL query, evaluate:
1. Does the query answer all intents, or only some of them?
2. Are there any missing filters, wrong aggregations, or incorrect joins?
3. Does the time range handling match what was asked?
Return JSON with: 'passed' (bool), 'issues' (list of strings), 'severity' (low/medium/high).""")

    human = HumanMessage(content=f"""
Original intents: {json.dumps(state['intents'], indent=2)}
Generated query: {state['generated_query']}
""")

    response = llm.invoke([system, human])
    critique = json.loads(response.content)

    return {
        "critique": critique,
        "retry_count": state["retry_count"] + 1,
        "failure_source": "query_builder" if not critique["passed"] else None
    }

A few things to notice: the intent parser’s system prompt explicitly says do not generate SQL and do not reference any database schema. That’s intentional because without those constraints the model will drift and it’ll start making schema assumptions even when you haven’t given it schema information.

The Feedback Loop and Retry Logic

The conditional edge in the graph routes the pipeline based on what the critic returns. But when the query builder receives a retry it needs to know why it’s being asked to rebuild, just routing back to it with the same state isn’t enough.

When the critic fails a query, the builder on the next pass needs access to the critique (one of the state keys). So the state carries it forward:

def query_builder_node(state: PipelineState) -> dict:
    previous_critique = state.get("critique")

    system_content = """You are a SQL query generation specialist.
Given schema-mapped intents, generate a single correct SQL query.
Return only the SQL. No explanation."""

    # On a retry, attach the critic's feedback
    if previous_critique and not previous_critique.get("passed"):
        issues = "\n".join(previous_critique.get("issues", []))
        system_content += f"\n\nPrevious attempt was rejected. Issues found:\n{issues}\nAddress these specifically."

    system = SystemMessage(content=system_content)
    human = HumanMessage(content=f"""
Intents: {json.dumps(state['intents'], indent=2)}
Schema mapping: {json.dumps(state['schema_mapping'], indent=2)}
""")

    response = llm.invoke([system, human])
    return {"generated_query": response.content.strip()}

This is the difference between a retry loop that actually improves the query and the one that just runs the same generation again and again.

What Breaks in Production

Context bleed between agents. LangGraph passes full state between nodes. By the time the critic runs, it has visibility into everything: original query, intents, schema mapping, and generated SQL. That’s intentional but it also means a subtly wrong intent decomposition early in the pipeline propagates downstream silently. The critic is evaluating the query against the intents, not re-evaluating whether the intents themselves were correct or not. Add a lightweight validation step after intent parsing to check that the decomposed intents actually cover the original question before moving forward.

Schema agents hallucinate on large schemas. If your database has 200 tables, injecting the entire schema into the context is not practical. The schema agent needs retrieved embeddings of the relevant tables based on the parsed intents. A vector search over table and column descriptions, returning the top 15–20 most relevant tables will keep the context manageable.

The retry loop hides bad intent parsing. When the critic rejects a query and routes back to the builder, the system looks like it’s doing the right thing but if the root cause is a bad intent decomposition the builder can’t fix it. It’s working from a wrong spec, and the critic will keep rejecting it for the same underlying reasons until you hit the retry ceiling.

Token costs compound fast. Five agents, each with a system prompt and context, running sequentially, add a retry loop and now you’ve made six LLM calls per request minimum, several of them with substantial context. Profile your token usage per node before deploying at any volume, especially the critic because it receives both the intents and the full query, which adds up and increases the context length. Using smaller models for validation tasks where the reasoning doesn’t need to be sophisticated is a reasonable optimisation.

Structured output parsing fails silently. Several nodes above return JSON that gets parsed directly. If the LLM returns malformed JSON the entire pipeline crashes at the parsing step. Wrap every json.loads() call in error handling and define a fallback.

When You Shouldn’t Do This

A multi-agent architecture adds real overhead, more failure points, harder debugging, higher token costs, and more complex state management. It should only be used when a single agent cannot do the job well.

If your SQL queries are simple and your schema is small, a well-prompted single agent with a retry loop will outperform this pipeline on every metric that matters, latency, cost, and maintainability. Don’t build a five-agent system to answer questions that map cleanly to two or three tables.

Build the single agent first and run it on real queries until it fails in ways a better prompt won’t fix. That failure tells you exactly where to introduce specialisation and you’ll build the multi-agent version with a much clearer head.

Conclusion

The text-to-SQL problem is a useful lens for a multi-agent architecture because the failure modes of a single agent are so concrete. You can point to the exact query it got wrong and explain exactly that it failed because the model was simultaneously trying to understand the question, know the schema, write SQL, and evaluate its own output.

Breaking that into specialised agents doesn’t make any individual agent smarter. It makes the system more reliable by narrowing what each agent is responsible for. A context window focused on one task produces better output than the same context window stretched across four. LangGraph gives you the ecosystem to build this without hiding how it works. When something breaks you can trace exactly which node failed and what state it received. That transparency matters more than it sounds when you’re debugging a production pipeline at midnight.

The architecture described here is a starting point, not a finished product. Real deployments will need schema retrieval for large databases, better structured output handling, observability tooling, and careful token budgeting. But the structure: specialised agents, explicit state, conditional routing, and a critic loop is the right foundation to build on.