Building Multi-Agent Systems: A Supervisor Architecture Deep Dive

How I built a research assistant that thinks like a team of specialists—and why single-agent approaches are hitting their limits

Jun 15, 2025

The Problem with Smart-Enough AI

I was staring at yet another GPT-4 response that was almost right. It had summarized the research paper decently and identified the key concepts, but it completely missed the nuanced relationships between the papers in my collection.

It reminded me of monolithic applications—trying to handle everything in one massive codebase. They worked... until they didn't.

Maybe it's time AI architecture caught up with what we learned in software engineering.

The frustration isn't just about accuracy—it's about the cognitive labor that remains. When an AI gets 80% of the way there, you're left with the most challenging 20%: the synthesis, the connections, the implications that require genuine understanding rather than pattern matching.

This "almost right" phenomenon is particularly insidious in research analysis. A completely wrong answer signals immediate caution. But a response that demonstrates clear competence while quietly missing crucial nuances? That's where the real danger lies—not just in missed insights, but in the false confidence it can instill.

What if AI worked more like humans do?

The breakthrough insight came from recognizing how we actually handle complex research. When tackling intricate problems, you don't do everything yourself—you assemble a team. A research analyst gathers background information, subject matter experts weigh in on technical details, data scientists identify patterns, and someone coordinates everything into coherent insights.

This is precisely what multi-agent systems can replicate: specialized agents working together rather than one generalist trying to handle every task.

From Code to Cognition: Why Architecture Matters

The parallels between software architecture evolution and AI system design are striking. We learned that monolithic applications become unmaintainable as complexity grows. The solution? Break them into focused microservices, each excellent at one thing.

Here's what that looks like in practice with my research system:

# Instead of this monolithic approach:
def analyze_research(query):
    # Extract entities AND find relationships AND identify themes 
    # AND synthesize insights AND... 
    return one_massive_response

# We build this:
class ResearchCoordinator:
    def __init__(self):
        self.relationship_analyst = RelationshipAnalyst()  # Neo4j specialist
        self.theme_analyst = ThemeAnalyst()               # MongoDB specialist
        
    def analyze(self, query):
        # Route to appropriate specialists
        # Synthesize their focused expertise
        # Return coordinated insights
        ...  # Body elided in this sketch; a comment-only body would not parse
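
To make that skeleton concrete, here's a minimal runnable sketch with stub specialists. The `Finding` dataclass, the stub return strings, and the keyword routing are all assumptions for illustration; the real system routes with an LLM classifier and queries real databases.

```python
from dataclasses import dataclass


@dataclass
class Finding:
    """One specialist's contribution, tagged with the agent that produced it."""
    agent: str
    summary: str


class RelationshipAnalyst:
    """Stub standing in for the Neo4j-backed specialist."""

    def analyze(self, query: str) -> Finding:
        return Finding("relationship_analyst", f"connections relevant to: {query}")


class ThemeAnalyst:
    """Stub standing in for the MongoDB-backed specialist."""

    def analyze(self, query: str) -> Finding:
        return Finding("theme_analyst", f"themes relevant to: {query}")


class ResearchCoordinator:
    def __init__(self) -> None:
        self.relationship_analyst = RelationshipAnalyst()
        self.theme_analyst = ThemeAnalyst()

    def analyze(self, query: str) -> list:
        # Hypothetical keyword routing, a stand-in for the LLM classifier.
        q = query.lower()
        findings = []
        if "relate" in q or "connect" in q:
            findings.append(self.relationship_analyst.analyze(query))
        if "theme" in q or "trend" in q:
            findings.append(self.theme_analyst.analyze(query))
        if not findings:
            # Nothing matched: ask both specialists rather than guess.
            findings = [
                self.relationship_analyst.analyze(query),
                self.theme_analyst.analyze(query),
            ]
        return findings
```

Calling `ResearchCoordinator().analyze("How do neural networks relate to computer vision?")` returns a single finding from the relationship stub, because only the "relate" keyword matches.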

Each specialist in my system has its own "cognitive architecture": not just different prompts, but entirely different data structures and reasoning approaches. A classifier and a planner decide which specialists a given query reaches:

User Query
    ↓
Query Classifier
    ↓
┌─────────────┬─────────────┬─────────────┐
│   Greeting  │   Simple    │  Research   │
│   Response  │  Question   │   Query     │
└─────────────┴─────────────┴─────────────┘
                                  ↓
                            Planner
                                  ↓
                    ┌─────────────┬─────────────┐
                    │Relationship │Theme        │
                    │Analyst      │Analyst      │
                    │(Neo4j)      │(MongoDB)    │
                    └─────────────┴─────────────┘
                                  ↓
                          Synthesizer
                                  ↓
                        Final Response

Designing the Team: Agent Specialization Strategy

The key insight is that different types of analysis require fundamentally different approaches to data and reasoning. You can't just throw everything into a vector database and hope for the best.

The Relationship Analyst: Connection Detective

When someone asks "How do neural networks relate to computer vision?", they're not looking for definitions—they want to understand the evolution, the key papers that bridged concepts, the researchers who made critical connections.

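# src/domain/agents/relationship_analyst.py
from langchain_core.tools import tool  # assumed source of @tool; the original snippet omits imports

@tool
def analyze_research_relationships(query: str) -> str:
    """Analyze relationships between research entities using Neo4j graph database."""
    
    # Query the knowledge graph (query_graphdb is the project's own helper, not shown here)
    graph_data = query_graphdb(query)
    
    concepts = graph_data.get("concepts", [])
    relationships = graph_data.get("relationships", [])
    papers = graph_data.get("papers", [])
    
    # This agent thinks in terms of nodes and edges
    # Author → Paper → Concept → Related_Concept

    # Placeholder return so the tool honors its `-> str` signature;
    # the repository's actual formatting is not shown in this excerpt.
    return f"{len(concepts)} concepts, {len(relationships)} relationships, {len(papers)} papers"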

The Relationship Analyst doesn't just search for keywords—it traverses a knowledge graph built specifically for understanding research lineages:

// Neo4j Schema - Purpose-built for relationship reasoning
(:Paper {id, title, year, research_field})-[:CONTAINS]->(:Concept)
(:Author)-[:AUTHORED]->(:Paper)
(:Concept)-[:RELATES_TO {type, description}]->(:Concept)
(:Paper)-[:CITES]->(:Paper)
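
To show what "thinking in nodes and edges" means without a running Neo4j instance, here's a small in-memory sketch. The edge list is made-up toy data that follows the schema's relationship types, and `build_adjacency` and `shortest_path` are hypothetical helpers, not functions from the repository.

```python
from collections import deque

# Toy edges in (source, relationship, target) form. All names are invented.
EDGES = [
    ("author:alice", "AUTHORED", "paper:p1"),
    ("paper:p1", "CONTAINS", "concept:neural networks"),
    ("concept:neural networks", "RELATES_TO", "concept:convolution"),
    ("concept:convolution", "RELATES_TO", "concept:computer vision"),
    ("paper:p2", "CONTAINS", "concept:computer vision"),
    ("paper:p2", "CITES", "paper:p1"),
]


def build_adjacency(edges):
    """Index edges in both directions so traversal ignores edge direction."""
    adjacency = {}
    for source, rel, target in edges:
        adjacency.setdefault(source, []).append((rel, target))
        adjacency.setdefault(target, []).append((rel, source))
    return adjacency


def shortest_path(adjacency, start, goal):
    """Breadth-first search; returns the node list of a shortest path, or None."""
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for _rel, neighbor in adjacency.get(node, []):
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append(path + [neighbor])
    return None


adjacency = build_adjacency(EDGES)
path = shortest_path(adjacency, "concept:neural networks", "concept:computer vision")
print(" -> ".join(path))
```

On this toy graph the two-hop route through `concept:convolution` wins over the three-hop route through the two papers. That intermediate node is exactly what a keyword search would never surface.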

The Theme Analyst: Pattern Recognition Specialist

While the Relationship Analyst maps explicit connections, the Theme Analyst identifies latent patterns across document collections:

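# src/domain/agents/theme_analyst.py
from langchain_core.tools import tool  # assumed source of @tool; the original snippet omits imports

@tool
def analyze_research_themes(query: str) -> str:
    """Analyze themes and topics using MongoDB document database."""
    
    # query_mongodb is the project's own helper, not shown here
    results = query_mongodb(query)
    
    topics = results.get("topics", {})  # Hierarchical topic structure
    papers = results.get("papers", [])  # Full-text searchable documents
    
    # This agent thinks in terms of themes, frequencies, evolution

    # Placeholder return so the tool honors its `-> str` signature;
    # the repository's actual formatting is not shown in this excerpt.
    return f"{len(topics)} topics across {len(papers)} papers"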

The MongoDB schema supports this different kind of reasoning:

// MongoDB Collections - Optimized for thematic analysis
{
    paper_id: String,
    metadata: {title, authors, year, keywords, research_field},
    content: [{page, text}],  // Full-text searchable
    entities: {concepts, relationships},
    topics: [{category, terms, weights}]  // Hierarchical themes
}
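
Here's a rough sketch of the "themes, frequencies, evolution" style of reasoning, run over plain Python dicts shaped like the document above. The three papers are invented toy data, and `theme_timeline` is a hypothetical helper; in the real system this kind of counting would happen in MongoDB rather than in application code.

```python
from collections import defaultdict

# Toy documents following the schema's shape. Titles and values are invented.
PAPERS = [
    {
        "paper_id": "p1",
        "metadata": {"title": "Toy Paper One", "year": 2022},
        "topics": [{"category": "self-correction", "terms": ["refine"], "weights": [0.9]}],
    },
    {
        "paper_id": "p2",
        "metadata": {"title": "Toy Paper Two", "year": 2023},
        "topics": [
            {"category": "self-correction", "terms": ["verify"], "weights": [0.7]},
            {"category": "evaluation", "terms": ["benchmark"], "weights": [0.4]},
        ],
    },
    {
        "paper_id": "p3",
        "metadata": {"title": "Toy Paper Three", "year": 2023},
        "topics": [{"category": "evaluation", "terms": ["judge"], "weights": [0.8]}],
    },
]


def theme_timeline(papers):
    """Count papers per topic category per year."""
    timeline = defaultdict(lambda: defaultdict(int))
    for paper in papers:
        year = paper["metadata"]["year"]
        # A set avoids double-counting a paper that lists a category twice.
        for category in {topic["category"] for topic in paper["topics"]}:
            timeline[category][year] += 1
    return {category: dict(sorted(years.items())) for category, years in timeline.items()}


print(theme_timeline(PAPERS))
```

The output maps each category to a year-by-year paper count, which is the raw material for statements like "this theme only shows up from 2023 onward."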

The Coordination Challenge: Building the "Manager"

The most sophisticated part isn't the individual agents—it's the coordination logic. This is where LangGraph shines compared to naive prompt chaining.

# src/domain/agents/research_coordinator.py
from langgraph.graph import MessagesState
from langgraph.types import Command

class ResearchState(MessagesState):
    """Modern state schema inheriting from MessagesState"""
    query_type: str = "unknown"
    analysis_plan: str = ""
    needs_relationship: bool = False
    needs_theme: bool = False
    transfer_context: str = ""

def query_classification_node(state: ResearchState) -> Command:
    """Classifies queries and determines routing strategy."""
    
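    # Intelligent routing based on query analysis
    # Assumes the latest message in the state is the user's query
    query = state["messages"][-1].content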
    classification_prompt = [
        {"role": "system", "content": """
You are a research query classifier. Analyze queries and respond with JSON:

{
  "classification": "GREETING|SIMPLE_QUESTION|RESEARCH_QUERY", 
  "needs_relationship": true/false,
  "needs_theme": true/false,
  "reasoning": "brief explanation"
}
"""},
        {"role": "user", "content": query}
    ]
    
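    # The LLM call and JSON parsing are elided here; `classification` is the
    # "classification" field of the parsed reply.
    # Route based on actual query requirements, not just keywords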
    if classification == "RESEARCH_QUERY":
        return Command(goto="planning", update={...})
    else:
        return Command(goto="direct_response", update={...})
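
The snippet above skips the step between the LLM's reply and the routing decision. Here's one way that step could look, as a standard-library sketch. The `Classification` dataclass, `parse_classification`, `next_node`, and especially the fallback policy (treat anything unparseable as a full research query) are my assumptions here, not code from the repository. The node names `planning` and `direct_response` come from the snippet above.

```python
import json
from dataclasses import dataclass

VALID_LABELS = {"GREETING", "SIMPLE_QUESTION", "RESEARCH_QUERY"}


@dataclass
class Classification:
    label: str
    needs_relationship: bool
    needs_theme: bool


def parse_classification(raw: str) -> Classification:
    """Parse the classifier's JSON reply, falling back to a full research run."""
    fallback = Classification("RESEARCH_QUERY", True, True)
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return fallback
    if not isinstance(data, dict):
        return fallback
    label = str(data.get("classification", "")).upper()
    if label not in VALID_LABELS:
        return fallback
    is_research = label == "RESEARCH_QUERY"
    return Classification(
        label=label,
        # Specialist flags only make sense for research queries.
        needs_relationship=is_research and data.get("needs_relationship") is True,
        needs_theme=is_research and data.get("needs_theme") is True,
    )


def next_node(classification: Classification) -> str:
    """Map a classification to the graph node that should run next."""
    return "planning" if classification.label == "RESEARCH_QUERY" else "direct_response"
```

Falling back to the expensive path is a deliberate choice in this sketch: a malformed classifier reply costs extra latency instead of silently downgrading a real research question to a one-line answer.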

This isn't just fancy routing—it's about matching cognitive load to capability. Simple questions get simple answers. Complex research queries get the full specialist treatment.

The planning node then creates an analysis strategy:

def planning_node(state: ResearchState) -> Command:
    """Creates analysis plan and routes to appropriate specialists."""
    
    # Dynamic planning based on query requirements
    if state.get("needs_relationship") and state.get("needs_theme"):
        # Both specialists needed - determine optimal order
        next_agent = "relationship_analyst"  # Start with structure
    elif state.get("needs_relationship"):
        next_agent = "relationship_analyst"  # Just connections
    elif state.get("needs_theme"):
        next_agent = "theme_analyst"  # Just patterns
    else:
        next_agent = "synthesis"  # Edge case handling
        
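    # Placeholder plan text; this excerpt does not show how the real plan is built
    plan = f"start with {next_agent}"
    return Command(goto=next_agent, update={"analysis_plan": plan})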
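
One thing the planning node doesn't show is how control reaches the second specialist when both are needed. A simple way to think about it is an ordered plan that each node consults when it finishes. This is a sketch of that idea only: `build_plan` and `advance` are hypothetical helpers, and the real graph may hand off differently.

```python
from typing import List, Optional


def build_plan(needs_relationship: bool, needs_theme: bool) -> List[str]:
    """Return the ordered list of nodes to visit for a research query."""
    steps = []
    if needs_relationship:
        steps.append("relationship_analyst")  # structure first, as in the planner above
    if needs_theme:
        steps.append("theme_analyst")
    steps.append("synthesis")  # every plan ends at the synthesizer
    return steps


def advance(plan: List[str], current: str) -> Optional[str]:
    """Return the node that follows `current`, or None when the plan is done."""
    index = plan.index(current)
    return plan[index + 1] if index + 1 < len(plan) else None


plan = build_plan(needs_relationship=True, needs_theme=True)
print(plan)
print(advance(plan, "relationship_analyst"))
```

With both flags set, the plan runs the relationship analyst, then the theme analyst, then synthesis. With neither set, it goes straight to synthesis, matching the edge case in the planner.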

Database Architecture as Cognitive Architecture

Here's where it gets interesting: the agents don't just have different prompts. Each one works against a data store whose structure suits its reasoning style.

Neo4j for the Relationship Analyst: Graph traversal, shortest paths, centrality measures

MATCH (concept1:Concept)-[:RELATES_TO*1..3]-(concept2:Concept)
WHERE concept1.name CONTAINS $query AND concept1 <> concept2
RETURN concept1, concept2, shortestPath((concept1)-[*]-(concept2))

MongoDB for the Theme Analyst: Full-text search, aggregation pipelines, topic hierarchies

db.papers.aggregate([
  {$match: {$text: {$search: query}}},
  {$unwind: "$topics"},
  {$group: {_id: "$topics.category", papers: {$addToSet: "$metadata.title"}}}
])

ChromaDB for Semantic Similarity: Vector search available to both agents when needed

collection.query(
    query_texts=[query],
    n_results=limit,
    include=["documents", "metadatas", "distances"]
)
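
If vector search is new to you, the core idea is ranking documents by how close their embedding vectors are to the query's vector. Here's a toy standard-library sketch of that ranking using cosine similarity. It illustrates the concept only: the two-dimensional vectors are invented, `top_k` is a hypothetical helper, and this is not a description of how ChromaDB computes distances internally.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction to compare
    return dot / (norm_a * norm_b)


def top_k(query_vec, documents, k=2):
    """Return the ids of the k documents most similar to the query vector."""
    scored = [(cosine_similarity(query_vec, vec), doc_id) for doc_id, vec in documents.items()]
    scored.sort(reverse=True)
    return [doc_id for _score, doc_id in scored[:k]]


# Invented 2-D "embeddings"; real ones have hundreds of dimensions.
DOCUMENTS = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
print(top_k([1.0, 0.1], DOCUMENTS))
```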

This multi-database approach isn't just about storage—it's about giving each agent the right tools for its type of reasoning.

Real-World Performance: What Actually Works

Let me show you what this looks like in practice. Here's a query I ran during development:

python cli.py demo --query "How do the concepts of 'critique ability' and 'self-correction' compare between CRITIC and CritiqueLLM research papers?"

Single-Agent Response (what GPT-4 alone would give):

"CRITIC and CritiqueLLM both focus on improving language model capabilities through critique mechanisms. CRITIC emphasizes iterative refinement while CritiqueLLM focuses on training procedures..."

Multi-Agent Response (from my system):

🔗 Relationship Analysis (Neo4j Database Results)
Found direct conceptual connections between CRITIC (Gou et al.) and CritiqueLLM papers:
- Both reference "self-correction" as a core capability
- CRITIC treats critique as an external verification step  
- CritiqueLLM embeds critique into the training process
- Cross-citation pattern shows methodological evolution

📊 Topic Analysis (MongoDB Database Results)  
Thematic analysis across 127 critique-related papers:
- "Critique ability" appears in 23 papers with 3 distinct definitions
- Self-correction methodologies cluster into test-time vs training-time
- CRITIC represents external critic paradigm (15 papers)
- CritiqueLLM represents integrated critic paradigm (8 papers)

💡 Synthesis
The fundamental difference isn't just implementation—it's philosophy...

Notice the difference? The multi-agent system provides:

1. Specific evidence from the knowledge graph

2. Quantified patterns from document analysis

3. Transparent reasoning about how conclusions were reached

More importantly, I can see exactly which agent contributed which insight, making it easy to verify or debug specific claims.
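
That traceability is cheap to get if every claim carries its origin from the moment a specialist produces it. Here's a minimal sketch of the idea. The `Claim` dataclass and `render_report` are hypothetical names for illustration, and the bracketed header format is mine, not the CLI's actual output.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Claim:
    agent: str   # which specialist produced the claim
    source: str  # which backing store it came from
    text: str


def render_report(claims):
    """Group claims by (agent, source) so each one stays traceable."""
    sections = {}
    for claim in claims:
        sections.setdefault((claim.agent, claim.source), []).append(claim.text)
    lines = []
    for (agent, source), texts in sections.items():
        lines.append(f"[{agent} via {source}]")
        lines.extend(f"- {text}" for text in texts)
    return "\n".join(lines)


claims = [
    Claim("relationship_analyst", "Neo4j", "Both papers reference self-correction"),
    Claim("theme_analyst", "MongoDB", "Methods cluster into test-time vs training-time"),
    Claim("relationship_analyst", "Neo4j", "CRITIC treats critique as an external step"),
]
print(render_report(claims))
```

When a claim looks wrong, the header tells you which agent and which database to inspect, instead of leaving you to re-audit the whole response.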

Want to see this multi-agent approach in action? I've open-sourced the complete system so you can experiment with it yourself.

The repository is available here: Multi-Agent Research System

Everything's included: LangGraph coordination, multi-database setup, and a professional CLI that lets you test sophisticated research queries in minutes.
