Mate Kopaliani
Giorgi Gogsadze

Multi-LLM Debate System

A Collaborative AI Problem-Solving Framework

OpenAI | Anthropic | Google | xAI

Overview

The system orchestrates multiple LLMs from different providers to solve complex problems through structured debate, peer review, and judging.

Problem Types

  • Mathematics
  • Logic Puzzles
  • Scientific Reasoning

AI Providers

  • GPT (OpenAI)
  • Claude (Anthropic)
  • Gemini (Google)
  • Grok (xAI)

Four-Stage Workflow

  1. Role Assignment: Agents choose Solver or Judge roles
  2. Solution Generation: Solvers work independently with detailed reasoning
  3. Peer Review: Solvers exchange feedback and refine solutions
  4. Judgment: Judge selects winner with justification
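
The role-assignment stage can be pictured with a small sketch. Everything here is hypothetical: the deck does not say how conflicting preferences are resolved, so the `Role` enum, the `assign_roles` function, and the tie-break rule (first volunteer judges, otherwise the last agent) are assumptions made only for illustration.

```python
from enum import Enum
from typing import Dict


class Role(Enum):
    SOLVER = "solver"
    JUDGE = "judge"


def assign_roles(preferences: Dict[str, Role]) -> Dict[str, Role]:
    """Pick exactly one judge; every other agent becomes a solver.

    Assumed rule: the first agent that asked to judge gets the role.
    If nobody volunteered, the last agent in the mapping is drafted.
    """
    names = list(preferences)
    volunteers = [n for n in names if preferences[n] is Role.JUDGE]
    judge = volunteers[0] if volunteers else names[-1]
    return {n: (Role.JUDGE if n == judge else Role.SOLVER) for n in names}


prefs = {
    "OpenAI": Role.SOLVER,
    "Anthropic": Role.JUDGE,
    "Google": Role.JUDGE,
    "xAI": Role.SOLVER,
}
roles = assign_roles(prefs)
```

With these preferences two agents volunteer to judge, and the sketch keeps only the first as judge so that the remaining three can act as solvers.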

Workflow Execution

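import asyncio

# Runner, arbiter, problems, and the four agent objects come from the project.
agents = [openai, anthropic, google, xai]
runner = Runner(agents=agents)

async def run_debates(problems):
    winners = []
    for problem in problems:
        # Stage 1: assign Solver/Judge roles
        roles = await arbiter.assign_roles(problem, agents)

        # Stage 2: solvers work independently
        solutions = await runner.solve(problem)

        # Stage 3: exchange reviews and refine
        refined = await runner.peer_review(solutions)

        # Stage 4: judge selects the winner
        winner = await runner.judge(refined)
        winners.append(winner)
    return winners

asyncio.run(run_debates(problems))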

System Architecture

graph TB
    subgraph Input["Input Layer"]
        Problems[Problem Dataset<br/>JSON]
        Config[Configuration<br/>API Keys]
    end

    subgraph Core["Core Orchestration"]
        Runner[Runner<br/>Workflow Manager]
        Bus[Message Bus<br/>Communication Hub]
        Arbiter[Arbiter<br/>Role Assignment]
    end

    subgraph Agents["LLM Agents"]
        OpenAI[OpenAI Agent<br/>GPT Models]
        Anthropic[Anthropic Agent<br/>Claude Models]
        Google[Google Agent<br/>Gemini Models]
        XAI[xAI Agent<br/>Grok Models]
    end

    subgraph Infra["Infrastructure"]
        ClientO[OpenAI Client]
        ClientA[Anthropic Client]
        ClientG[Google Client]
        ClientX[xAI Client]
    end

    subgraph Processing["Debate Stages"]
        Stage1[Stage 1:<br/>Role Selection]
        Stage2[Stage 2:<br/>Solution Generation]
        Stage3[Stage 3:<br/>Peer Review]
        Stage4[Stage 4:<br/>Judgment]
    end

    subgraph Output["Output & Analysis"]
        Results[Debate Results<br/>Winner Selection]
        Eval[Evaluator<br/>Performance Analysis]
        Reports[Reports & Plots<br/>Visualization]
    end

    Problems --> Runner
    Config --> Runner
    Runner --> Arbiter
    Arbiter --> Stage1
    Runner --> Bus
    Bus <--> OpenAI
    Bus <--> Anthropic
    Bus <--> Google
    Bus <--> XAI
    OpenAI --> ClientO
    Anthropic --> ClientA
    Google --> ClientG
    XAI --> ClientX
    Runner --> Stage1
    Stage1 --> Stage2
    Stage2 --> Stage3
    Stage3 --> Stage4
    Stage4 --> Results
    Results --> Eval
    Eval --> Reports

    style Input fill:#1a1a1a,stroke:#78a9ff,stroke-width:3px
    style Core fill:#1a1a1a,stroke:#78a9ff,stroke-width:3px
    style Agents fill:#1a1a1a,stroke:#42be65,stroke-width:3px
    style Infra fill:#1a1a1a,stroke:#ff832b,stroke-width:3px
    style Processing fill:#1a1a1a,stroke:#ee5396,stroke-width:3px
    style Output fill:#1a1a1a,stroke:#78a9ff,stroke-width:3px

Message Bus Architecture

graph LR
    subgraph Agents["LLM Agents"]
        A1[OpenAI]
        A2[Anthropic]
        A3[Google]
        A4[xAI]
    end

    subgraph Message["AgentMessage"]
        Sender[sender]
        Recipient[recipient]
        MsgType["message_type<br/>(enum)"]
        Payload["payload<br/>(data)"]
    end

    subgraph Bus["AgentBus"]
        Send[send]
        SendJudge[send_to_judge]
        Broadcast[broadcast]
    end

    A1 --> Message
    A2 --> Message
    A3 --> Message
    A4 --> Message
    Message --> Bus
    Bus --> A1
    Bus --> A2
    Bus --> A3
    Bus --> A4

    style Agents fill:#1a1a1a,stroke:#78a9ff,stroke-width:3px
    style Bus fill:#1a1a1a,stroke:#42be65,stroke-width:3px
    style Message fill:#1a1a1a,stroke:#ee5396,stroke-width:3px

MessageType (Enum)

from enum import Enum

class MessageType(Enum):
    SOLUTION = 1
    REVIEW = 2
    REFINEMENT = 3
    JUDGE = 4

Payload Types

SolutionPayload
ReviewPayload
RefinementPayload
JudgmentPayload

Asynchronous | Non-blocking | Type-safe | Scalable
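
To make the message structure concrete, here is a minimal sketch of typed payloads wrapped in an `AgentMessage`. The `AgentMessage` field names come from the architecture diagram and the `SolutionPayload` fields from the publishing example in this deck. The `ReviewPayload` fields and the convention that `recipient=None` means "broadcast" are assumptions, since the deck does not spell them out.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Union


class MessageType(Enum):
    SOLUTION = 1
    REVIEW = 2
    REFINEMENT = 3
    JUDGE = 4


@dataclass(frozen=True)
class SolutionPayload:
    # Fields as used in the deck's publishing example.
    agent_name: str
    solution: str
    confidence: float


@dataclass(frozen=True)
class ReviewPayload:
    # Hypothetical fields; the deck does not list them.
    reviewer: str
    target_agent: str
    feedback: str


Payload = Union[SolutionPayload, ReviewPayload]


@dataclass(frozen=True)
class AgentMessage:
    sender: str
    recipient: Optional[str]  # assumption: None means "send to everyone"
    message_type: MessageType
    payload: Payload


msg = AgentMessage(
    sender="OpenAI",
    recipient=None,
    message_type=MessageType.SOLUTION,
    payload=SolutionPayload(agent_name="OpenAI", solution="...", confidence=0.92),
)
```

Frozen dataclasses are used so that a message cannot be mutated after it is handed to the bus; that is a design choice of this sketch, not something the deck states.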

Message Bus Implementation

Publishing Messages

# Broadcast to all agents
await bus.publish(
    MessageType.SOLUTION,
    SolutionPayload(
        agent_name="OpenAI",
        solution="...",
        confidence=0.92
    ),
    broadcast=True
)

# Direct message
await bus.send_to(
    target="Anthropic",
    message_type=MessageType.REVIEW,
    payload=ReviewPayload(...)
)

Subscribing to Messages

# Subscribe to message types
bus.subscribe(
    MessageType.SOLUTION,
    solver_agent.handle_solution
)

bus.subscribe(
    MessageType.REVIEW,
    solver_agent.handle_review
)

# Message handler
async def handle_solution(
    self, payload: SolutionPayload
):
    # Process solution
    pass

Asynchronous | Non-blocking | Type-safe Payloads | Broadcast & Direct
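
The publish/subscribe pattern above can be sketched with a small in-memory bus. This is a hypothetical stand-in, not the project's bus: `InMemoryBus` and its internals are assumptions, and only the `subscribe` and `publish` method names are borrowed from the snippets in this deck. Handlers for a message type run concurrently through `asyncio.gather`.

```python
import asyncio
from collections import defaultdict
from enum import Enum
from typing import Any, Awaitable, Callable, DefaultDict, List


class MessageType(Enum):
    SOLUTION = 1
    REVIEW = 2


Handler = Callable[[Any], Awaitable[None]]


class InMemoryBus:
    """Hypothetical minimal bus; a sketch, not the project's implementation."""

    def __init__(self) -> None:
        self._handlers: DefaultDict[MessageType, List[Handler]] = defaultdict(list)

    def subscribe(self, message_type: MessageType, handler: Handler) -> None:
        self._handlers[message_type].append(handler)

    async def publish(self, message_type: MessageType, payload: Any) -> None:
        # Deliver to every subscriber of this type, running handlers concurrently.
        handlers = self._handlers[message_type]
        await asyncio.gather(*(handler(payload) for handler in handlers))


async def demo() -> List[str]:
    bus = InMemoryBus()
    received: List[str] = []

    async def on_solution(payload: Any) -> None:
        received.append(f"solution:{payload}")

    async def on_review(payload: Any) -> None:
        received.append(f"review:{payload}")

    bus.subscribe(MessageType.SOLUTION, on_solution)
    bus.subscribe(MessageType.REVIEW, on_review)
    await bus.publish(MessageType.SOLUTION, "x = 4")
    return received


received = asyncio.run(demo())
```

Only the solution handler fires here, because nothing was published under `MessageType.REVIEW`.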

Getting Started & Problem Format

Installation & Setup

# Install dependencies
pip install -r requirements.txt

# Configure .env
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...
GOOGLE_API_KEY=...
XAI_API_KEY=xai-...

# Run pipeline
python main.py

Problem JSON

{
  "problems": [{
    "id": 1,
    "category": "Mathematics",
    "problem": "Train travels...",
    "answer": "72 km/h",
    "solution": "Step by step..."
  }]
}
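
A loader for this format might look like the sketch below. The `Problem` dataclass, the `parse_problems` function, and the decision to reject entries with missing fields are assumptions for illustration; the field names are the ones shown in the JSON above.

```python
import json
from dataclasses import dataclass
from typing import List

REQUIRED_FIELDS = ("id", "category", "problem", "answer", "solution")


@dataclass
class Problem:
    id: int
    category: str
    problem: str
    answer: str
    solution: str


def parse_problems(text: str) -> List[Problem]:
    """Parse the problem file and fail early on incomplete entries."""
    data = json.loads(text)
    problems = []
    for index, entry in enumerate(data["problems"]):
        missing = [field for field in REQUIRED_FIELDS if field not in entry]
        if missing:
            raise ValueError(f"problem #{index} is missing fields: {missing}")
        problems.append(Problem(**{field: entry[field] for field in REQUIRED_FIELDS}))
    return problems


sample = """
{
  "problems": [{
    "id": 1,
    "category": "Mathematics",
    "problem": "Train travels...",
    "answer": "72 km/h",
    "solution": "Step by step..."
  }]
}
"""
problems = parse_problems(sample)
```

Validating up front keeps a malformed entry from surfacing later as a confusing failure in the middle of a debate run.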

Base Agent Interface

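# Defers annotation evaluation, so RoleDTO, Solution, Review, DebateData,
# and Judgment need not be imported just to define the interface.
from __future__ import annotations

from typing import List

class BaseAgent:
    async def express_role_preference(self, problem: str) -> RoleDTO:
        """Analyze the problem and choose a preferred role."""

    async def solve(self, problem: str) -> Solution:
        """Generate a solution with reasoning."""

    async def peer_review(self, solutions: List[Solution]) -> List[Review]:
        """Review the other agents' solutions."""

    async def judge(self, all_data: DebateData) -> Judgment:
        """Select the winning solution."""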

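As an illustration of how a new provider could plug into this interface, the sketch below defines an offline stub agent. The interface is reduced to `solve` to keep it short, and the `Solution` fields, `StubAgent`, and `solve_all` are hypothetical names that do not come from the deck. A real agent would call its provider's client where the stub returns a canned answer.

```python
import asyncio
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import List


@dataclass
class Solution:
    # Hypothetical shape; the deck does not define Solution's fields.
    agent_name: str
    answer: str
    reasoning: str


class BaseAgent(ABC):
    # Reduced to a single method for this sketch.
    @abstractmethod
    async def solve(self, problem: str) -> Solution: ...


class StubAgent(BaseAgent):
    """Offline stand-in for a provider-backed agent (makes no API calls)."""

    def __init__(self, name: str, canned_answer: str) -> None:
        self.name = name
        self.canned_answer = canned_answer

    async def solve(self, problem: str) -> Solution:
        # A real agent would send the problem to its provider here.
        return Solution(self.name, self.canned_answer, f"stubbed reasoning for: {problem}")


async def solve_all(agents: List[BaseAgent], problem: str) -> List[Solution]:
    # Solvers run concurrently and do not see each other's output.
    return await asyncio.gather(*(agent.solve(problem) for agent in agents))


solutions = asyncio.run(
    solve_all([StubAgent("A", "72 km/h"), StubAgent("B", "70 km/h")], "Train travels...")
)
```

Stub agents like this are also handy for exercising the debate pipeline in tests without spending API credits.
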
Evaluation & Key Features

Evaluation System

Performance Metrics

  • Accuracy vs ground truth
  • Confidence calibration
  • Cross-agent comparison
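
The first two metrics can be sketched as follows. The deck does not say which calibration measure the evaluator uses, so the Brier score here is an assumption, as are the function names, the whitespace and case normalization, and the two sample records, which are made up for illustration and are not results from the system.

```python
from typing import List, Tuple

# (predicted answer, ground-truth answer, stated confidence in [0, 1])
Record = Tuple[str, str, float]


def normalize(answer: str) -> str:
    """Lowercase and collapse whitespace before comparing answers."""
    return " ".join(answer.lower().split())


def is_correct(predicted: str, truth: str) -> bool:
    return normalize(predicted) == normalize(truth)


def accuracy(records: List[Record]) -> float:
    """Fraction of records whose prediction matches the ground truth."""
    if not records:
        raise ValueError("no records to evaluate")
    return sum(is_correct(p, t) for p, t, _ in records) / len(records)


def brier_score(records: List[Record]) -> float:
    """Mean squared gap between confidence and correctness (lower is better)."""
    if not records:
        raise ValueError("no records to evaluate")
    return sum((c - float(is_correct(p, t))) ** 2 for p, t, c in records) / len(records)


# Illustrative records only.
records = [
    ("72 km/h", "72 km/h", 0.9),
    ("70 km/h", "72 km/h", 0.6),
]
acc = accuracy(records)
brier = brier_score(records)
```

Exact string matching after normalization is the simplest possible correctness check; free-form answers would likely need a more tolerant comparison.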

Visualization

  • System performance plots
  • Baseline comparisons
  • Agent-specific metrics

Key Features

Modular Design

Easy to add new LLM providers through base agent interface

Robust Error Handling

Automatic retries, validation, graceful degradation

Flexible Prompts

Specialized prompts for each stage, easy to customize
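
The automatic retries listed under Robust Error Handling could take the form of exponential backoff with jitter, sketched below. The helper name `with_retries`, its parameters, and the choice of backoff policy are assumptions; the deck does not describe the retry strategy.

```python
import asyncio
import random
from typing import Awaitable, Callable, Tuple, Type, TypeVar

T = TypeVar("T")


async def with_retries(
    call: Callable[[], Awaitable[T]],
    *,
    attempts: int = 3,
    base_delay: float = 0.5,
    retry_on: Tuple[Type[BaseException], ...] = (Exception,),
) -> T:
    """Await call(), retrying on the given exceptions with exponential backoff."""
    for attempt in range(1, attempts + 1):
        try:
            return await call()
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            delay = base_delay * 2 ** (attempt - 1)
            # Jitter spreads out retries from agents that failed together.
            await asyncio.sleep(delay + random.uniform(0, delay / 2))
    raise ValueError("attempts must be at least 1")


calls = {"count": 0}


async def flaky() -> str:
    # Simulated provider call: fails twice, then succeeds.
    calls["count"] += 1
    if calls["count"] < 3:
        raise TimeoutError("simulated provider timeout")
    return "ok"


result = asyncio.run(
    with_retries(flaky, attempts=3, base_delay=0.01, retry_on=(TimeoutError,))
)
```

Restricting `retry_on` to transient errors such as timeouts avoids retrying failures that will never succeed, for example an invalid API key.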

Why Multi-LLM Debate?

Diverse Strengths

Different models excel at different problem types

Error Detection

Peer review can catch mistakes that an individual model misses

Improved Accuracy

Review and refinement aim to produce a better final solution than any single model's first attempt

Collective Intelligence

Combines reasoning approaches from multiple AI systems

Future Possibilities

More Providers & Problem Types

Integrate Mistral and Cohere; expand to code generation and creative writing

Dynamic Strategies & Web Interface

Adaptive debate structures; real-time visualization of debates

Thank You
