Mate Kopaliani

Giorgi Gogsadze

Multi-LLM Debate System

A Collaborative AI Problem-Solving Framework

OpenAI, Anthropic, Google, xAI

Overview

Orchestrates multiple LLMs from different providers to solve complex problems through structured debate, peer review, and judging mechanisms.

Problem Types

Mathematics
Logic Puzzles
Scientific Reasoning

AI Providers

GPT (OpenAI)
Claude (Anthropic)
Gemini (Google)
Grok (xAI)

Four-Stage Workflow

1. Role Assignment

Agents state a preferred role (Solver or Judge); the Arbiter assigns roles

2. Solution Generation

Solvers work independently with detailed reasoning

3. Peer Review

Solvers exchange feedback and refine their solutions

4. Judgment

The Judge selects a winner with justification

Workflow Execution

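# The four agent objects, `arbiter`, and `problems` are created elsewhere.
agents = [openai, anthropic, google, xai]
runner = Runner(agents=agents)

async def run_debates(problems):
    for problem in problems:
        # Stage 1: assign Solver and Judge roles
        roles = await arbiter.assign_roles(problem, agents)

        # Stage 2: solvers work independently
        solutions = await runner.solve(problem)

        # Stage 3: solvers exchange feedback and refine
        refined = await runner.peer_review(solutions)

        # Stage 4: the judge selects a winner
        winner = await runner.judge(refined)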
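Stage 1 can be pictured with a small sketch. Everything here is an assumption made for illustration: the `RolePreference` record, the standalone `assign_roles` function, and the policy of picking exactly one judge (the most confident volunteer, or the least confident solver if nobody volunteers). The project's own Arbiter may work differently.

```python
from dataclasses import dataclass


@dataclass
class RolePreference:
    agent: str         # agent name, e.g. "OpenAI"
    role: str          # preferred role: "solver" or "judge"
    confidence: float  # strength of the preference, 0..1


def assign_roles(preferences: list[RolePreference]) -> dict[str, str]:
    """Pick exactly one judge; every other agent becomes a solver (assumed policy)."""
    if len(preferences) < 2:
        raise ValueError("need at least two agents: one judge and one solver")
    volunteers = [p for p in preferences if p.role == "judge"]
    if volunteers:
        # The most confident volunteer becomes the judge.
        judge = max(volunteers, key=lambda p: p.confidence)
    else:
        # Nobody volunteered: the least confident solver judges instead.
        judge = min(preferences, key=lambda p: p.confidence)
    return {p.agent: ("judge" if p is judge else "solver") for p in preferences}


prefs = [
    RolePreference("OpenAI", "solver", 0.9),
    RolePreference("Anthropic", "judge", 0.7),
    RolePreference("Google", "solver", 0.8),
    RolePreference("xAI", "judge", 0.6),
]
roles = assign_roles(prefs)
```

Keeping a single judge means the three remaining agents always produce competing solutions for Stage 2.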

System Architecture

graph TB
    subgraph Input["Input Layer"]
        Problems["Problem Dataset<br/>JSON"]
        Config["Configuration<br/>API Keys"]
    end

    subgraph Core["Core Orchestration"]
        Runner["Runner<br/>Workflow Manager"]
        Bus["Message Bus<br/>Communication Hub"]
        Arbiter["Arbiter<br/>Role Assignment"]
    end

    subgraph Agents["LLM Agents"]
        OpenAI["OpenAI Agent<br/>GPT Models"]
        Anthropic["Anthropic Agent<br/>Claude Models"]
        Google["Google Agent<br/>Gemini Models"]
        XAI["xAI Agent<br/>Grok Models"]
    end

    subgraph Infra["Infrastructure"]
        ClientO[OpenAI Client]
        ClientA[Anthropic Client]
        ClientG[Google Client]
        ClientX[xAI Client]
    end

    subgraph Processing["Debate Stages"]
        Stage1["Stage 1:<br/>Role Selection"]
        Stage2["Stage 2:<br/>Solution Generation"]
        Stage3["Stage 3:<br/>Peer Review"]
        Stage4["Stage 4:<br/>Judgment"]
    end

    subgraph Output["Output & Analysis"]
        Results["Debate Results<br/>Winner Selection"]
        Eval["Evaluator<br/>Performance Analysis"]
        Reports["Reports & Plots<br/>Visualization"]
    end

    Problems --> Runner
    Config --> Runner
    
    Runner --> Arbiter
    Arbiter --> Stage1
    
    Runner --> Bus
    Bus <--> OpenAI
    Bus <--> Anthropic
    Bus <--> Google
    Bus <--> XAI
    
    OpenAI --> ClientO
    Anthropic --> ClientA
    Google --> ClientG
    XAI --> ClientX
    
    Runner --> Stage1
    Stage1 --> Stage2
    Stage2 --> Stage3
    Stage3 --> Stage4
    Stage4 --> Results
    
    Results --> Eval
    Eval --> Reports

    style Input fill:#1a1a1a,stroke:#78a9ff,stroke-width:3px
    style Core fill:#1a1a1a,stroke:#78a9ff,stroke-width:3px
    style Agents fill:#1a1a1a,stroke:#42be65,stroke-width:3px
    style Infra fill:#1a1a1a,stroke:#ff832b,stroke-width:3px
    style Processing fill:#1a1a1a,stroke:#ee5396,stroke-width:3px
    style Output fill:#1a1a1a,stroke:#78a9ff,stroke-width:3px

Message Bus Architecture

graph LR
    subgraph Agents["LLM Agents"]
        A1[OpenAI]
        A2[Anthropic]
        A3[Google]
        A4[xAI]
    end

    subgraph Message["AgentMessage"]
        Sender[sender]
        Recipient[recipient]
        MsgType["message_type<br/>(enum)"]
        Payload["payload<br/>(data)"]
    end

    subgraph Bus["AgentBus"]
        Send[send]
        SendJudge[send_to_judge]
        Broadcast[broadcast]
    end

    A1 --> Message
    A2 --> Message
    A3 --> Message
    A4 --> Message

    Message --> Bus
    
    Bus --> A1
    Bus --> A2
    Bus --> A3
    Bus --> A4

    style Agents fill:#1a1a1a,stroke:#78a9ff,stroke-width:3px
    style Bus fill:#1a1a1a,stroke:#42be65,stroke-width:3px
    style Message fill:#1a1a1a,stroke:#ee5396,stroke-width:3px
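The diagram names an `AgentMessage` with four fields and an `AgentBus` with `send`, `send_to_judge`, and `broadcast`. A minimal in-memory sketch of that shape follows. The method signatures, the per-agent `asyncio.Queue` inboxes, and the rule that a broadcast skips its sender are assumptions, not the project's confirmed implementation.

```python
import asyncio
from dataclasses import dataclass
from enum import Enum
from typing import Any


class MessageType(Enum):
    SOLUTION = 1
    REVIEW = 2
    REFINEMENT = 3
    JUDGE = 4


@dataclass
class AgentMessage:
    sender: str
    recipient: str
    message_type: MessageType
    payload: Any


class AgentBus:
    """In-memory bus with one asyncio.Queue inbox per agent (assumed design)."""

    def __init__(self, agents: list[str], judge: str):
        self.inboxes = {name: asyncio.Queue() for name in agents}
        self.judge = judge

    async def send(self, message: AgentMessage) -> None:
        # Direct delivery to the recipient's inbox.
        await self.inboxes[message.recipient].put(message)

    async def send_to_judge(self, sender: str, message_type: MessageType, payload: Any) -> None:
        await self.send(AgentMessage(sender, self.judge, message_type, payload))

    async def broadcast(self, sender: str, message_type: MessageType, payload: Any) -> None:
        # Deliver to every agent except the sender.
        for name in self.inboxes:
            if name != sender:
                await self.send(AgentMessage(sender, name, message_type, payload))


async def demo() -> list[int]:
    names = ["OpenAI", "Anthropic", "Google", "xAI"]
    bus = AgentBus(names, judge="xAI")
    await bus.broadcast("OpenAI", MessageType.SOLUTION, "x = 4")
    await bus.send_to_judge("Google", MessageType.REFINEMENT, "x = 4, verified")
    return [bus.inboxes[n].qsize() for n in names]


inbox_sizes = asyncio.run(demo())
```

Queues keep senders from blocking on slow receivers, which matches the non-blocking goal stated for the bus.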

MessageType (Enum)

from enum import Enum

class MessageType(Enum):
    SOLUTION = 1
    REVIEW = 2
    REFINEMENT = 3
    JUDGE = 4

Payload Types

SolutionPayload
ReviewPayload
RefinementPayload
JudgmentPayload
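The four payload types could be modelled as small immutable dataclasses, as sketched below. Only the `SolutionPayload` fields (`agent_name`, `solution`, `confidence`) appear in this deck; the fields of the other three classes and the range check on `confidence` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SolutionPayload:
    agent_name: str
    solution: str
    confidence: float

    def __post_init__(self):
        # Assumed rule: confidence is a probability.
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be in [0, 1]")


@dataclass(frozen=True)
class ReviewPayload:  # hypothetical fields
    reviewer: str
    target: str
    issues: tuple[str, ...] = ()


@dataclass(frozen=True)
class RefinementPayload:  # hypothetical fields
    agent_name: str
    refined_solution: str
    addressed_issues: tuple[str, ...] = ()


@dataclass(frozen=True)
class JudgmentPayload:  # hypothetical fields
    judge: str
    winner: str
    justification: str


payload = SolutionPayload(agent_name="OpenAI", solution="72 km/h", confidence=0.92)
```

Freezing the dataclasses means a payload cannot be mutated after it has been handed to the bus.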

Asynchronous, Non-blocking, Type-safe, Scalable

Message Bus Implementation

Publishing Messages

# Broadcast to all agents
await bus.publish(
    MessageType.SOLUTION,
    SolutionPayload(
        agent_name="OpenAI",
        solution="...",
        confidence=0.92
    ),
    broadcast=True
)

# Direct message
await bus.send_to(
    target="Anthropic",
    message_type=MessageType.REVIEW,
    payload=ReviewPayload(...)
)

Subscribing to Messages

# Subscribe to message types
bus.subscribe(
    MessageType.SOLUTION,
    solver_agent.handle_solution
)

bus.subscribe(
    MessageType.REVIEW,
    solver_agent.handle_review
)

# Message handler
async def handle_solution(
    self, payload: SolutionPayload
):
    # Process solution
    pass
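To make the subscribe and publish pattern concrete, here is a self-contained sketch of a type-keyed bus. `MiniBus` and `Recorder` are hypothetical stand-ins, the enum is trimmed to two members, and direct delivery (`send_to`) is left out to keep the example short.

```python
import asyncio
from collections import defaultdict
from enum import Enum


class MessageType(Enum):
    SOLUTION = 1
    REVIEW = 2


class MiniBus:
    """Hypothetical stand-in for the project's bus: async handlers keyed by type."""

    def __init__(self):
        self._handlers = defaultdict(list)

    def subscribe(self, message_type, handler):
        self._handlers[message_type].append(handler)

    async def publish(self, message_type, payload):
        # Await every handler registered for this type concurrently.
        await asyncio.gather(*(h(payload) for h in self._handlers[message_type]))


class Recorder:
    """Toy agent that records what it receives."""

    def __init__(self):
        self.seen = []

    async def handle_solution(self, payload):
        self.seen.append(("solution", payload))

    async def handle_review(self, payload):
        self.seen.append(("review", payload))


async def demo():
    bus, agent = MiniBus(), Recorder()
    bus.subscribe(MessageType.SOLUTION, agent.handle_solution)
    bus.subscribe(MessageType.REVIEW, agent.handle_review)
    await bus.publish(MessageType.SOLUTION, "72 km/h")
    await bus.publish(MessageType.REVIEW, "units check out")
    return agent.seen


seen = asyncio.run(demo())
```

Each handler only sees the message types it subscribed to, so solvers and judges can share one bus without filtering messages themselves.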

Asynchronous, Non-blocking, Type-safe Payloads, Broadcast & Direct

Getting Started & Problem Format

Installation & Setup

# Install dependencies
pip install -r requirements.txt

# Configure .env (file contents, not shell commands)
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...
GOOGLE_API_KEY=...
XAI_API_KEY=xai-...

# Run pipeline
python main.py

Problem JSON

{
  "problems": [{
    "id": 1,
    "category": "Mathematics",
    "problem": "Train travels...",
    "answer": "72 km/h",
    "solution": "Step by step..."
  }]
}
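A loader for this format might look like the following sketch. The `Problem` dataclass and `load_problems` function are hypothetical names; the field names come from the JSON above.

```python
import json
from dataclasses import dataclass


@dataclass
class Problem:
    id: int
    category: str
    problem: str
    answer: str
    solution: str


def load_problems(text: str) -> list[Problem]:
    """Parse the JSON document; a missing or unexpected field raises TypeError."""
    data = json.loads(text)
    return [Problem(**item) for item in data["problems"]]


raw = """
{
  "problems": [{
    "id": 1,
    "category": "Mathematics",
    "problem": "Train travels...",
    "answer": "72 km/h",
    "solution": "Step by step..."
  }]
}
"""
problems = load_problems(raw)
```

Unpacking each entry into a dataclass catches malformed entries at load time instead of in the middle of a debate.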

Base Agent Interface

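from __future__ import annotations  # defer evaluation of the project-defined type names

from typing import List

# RoleDTO, Solution, Review, DebateData and Judgment are the project's own types.
class BaseAgent:
    async def express_role_preference(self, problem: str) -> RoleDTO:
        """Analyze the problem and state a preferred role"""

    async def solve(self, problem: str) -> Solution:
        """Generate a solution with reasoning"""

    async def peer_review(self, solutions: List[Solution]) -> List[Review]:
        """Review other agents' solutions"""

    async def judge(self, all_data: DebateData) -> Judgment:
        """Select the winning solution"""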

Evaluation & Key Features

Evaluation System

Performance Metrics

Accuracy vs ground truth
Confidence calibration
Cross-agent comparison
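Two of these metrics can be sketched in a few lines. The matching rule (case-insensitive exact match) and the choice of the Brier score as the calibration measure are assumptions; the sample answers and confidences are invented for illustration and are not results from the system.

```python
def accuracy(predictions, truths):
    """Fraction of answers that match the ground truth, ignoring case and outer whitespace."""
    hits = [p.strip().lower() == t.strip().lower() for p, t in zip(predictions, truths)]
    return sum(hits) / len(hits)


def brier_score(confidences, correct):
    """Mean squared gap between stated confidence and the 0/1 outcome (lower is better)."""
    return sum((c - float(ok)) ** 2 for c, ok in zip(confidences, correct)) / len(confidences)


# Invented sample data, for illustration only.
preds = ["72 km/h", "42", "x = 3"]
truths = ["72 km/h", "41", "x = 3"]
confs = [0.9, 0.8, 0.6]

correct = [p == t for p, t in zip(preds, truths)]
acc = accuracy(preds, truths)
brier = brier_score(confs, correct)
```

Computing both per agent gives the cross-agent comparison: an agent can be accurate yet poorly calibrated if it is overconfident on the answers it gets wrong.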

Visualization

System performance plots
Baseline comparisons
Agent-specific metrics

Key Features

Modular Design

Easy to add new LLM providers through base agent interface
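As an illustration of that extension point, the sketch below adds a hypothetical `MistralAgent`. The interface is reduced to `solve`, `Solution` is a simplified stand-in for the project's type, and the provider call is replaced by an offline stub with a made-up reply, so none of this reflects a real client API.

```python
import asyncio
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class Solution:  # simplified stand-in for the project's Solution type
    agent_name: str
    answer: str
    reasoning: str


class BaseAgent(ABC):  # reduced to one method for brevity
    name: str

    @abstractmethod
    async def solve(self, problem: str) -> Solution: ...


class MistralAgent(BaseAgent):
    """Hypothetical new provider; `complete` is any async text-completion callable."""

    name = "Mistral"

    def __init__(self, complete):
        self._complete = complete

    async def solve(self, problem: str) -> Solution:
        text = await self._complete(f"Solve step by step:\n{problem}")
        # Assumed convention: the last line of the reply is the final answer.
        reasoning, _, answer = text.rpartition("\n")
        return Solution(self.name, answer.strip(), reasoning.strip())


async def fake_complete(prompt: str) -> str:
    # Offline stub with a made-up reply, standing in for a real API call.
    return "Speed = distance / time = 144 / 2\n72 km/h"


solution = asyncio.run(MistralAgent(fake_complete).solve("Train travels..."))
```

Injecting the completion callable keeps the agent testable without network access or API keys.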

Robust Error Handling

Automatic retries, validation, graceful degradation
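Automatic retries are commonly built as exponential backoff around the provider call. The helper below is a minimal sketch under that assumption: `with_retries` and its parameters are hypothetical names, and validation and graceful degradation are not covered.

```python
import asyncio
import random


async def with_retries(call, attempts=3, base_delay=0.5):
    """Await call(); on failure wait base_delay * 2**n plus jitter, then retry."""
    for attempt in range(attempts):
        try:
            return await call()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            delay = base_delay * 2 ** attempt + random.uniform(0, base_delay)
            await asyncio.sleep(delay)


calls = {"n": 0}


async def flaky():
    # Fails twice, then succeeds, to imitate a transient API error.
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient failure")
    return "ok"


result = asyncio.run(with_retries(flaky, base_delay=0.0))
```

The jitter term spreads out retries so that four agents hitting rate limits at once do not all retry in lockstep.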

Flexible Prompts

Specialized prompts for each stage, easy to customize
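One simple way to organize per-stage prompts is a dictionary of templates, sketched here with the standard library's `string.Template`. The stage keys, the prompt wording, and the `render` helper are all hypothetical.

```python
from string import Template

# One template per debate stage; the wording is illustrative only.
PROMPTS = {
    "role": Template("Problem: $problem\nWould you rather act as Solver or Judge? Explain briefly."),
    "solve": Template("Problem: $problem\nSolve it and show each reasoning step."),
    "review": Template("Problem: $problem\nCandidate solution: $solution\nList any errors you find."),
    "judge": Template("Problem: $problem\nRefined solutions:\n$solutions\nPick a winner and justify it."),
}


def render(stage: str, **fields: str) -> str:
    """Fill the template for one stage; a missing field raises KeyError."""
    return PROMPTS[stage].substitute(**fields)


prompt = render("review", problem="Train travels...", solution="72 km/h")
```

Customizing a stage then means editing one dictionary entry, and `substitute` fails loudly if a required field is missing.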

Why Multi-LLM Debate?

Diverse Strengths

Different models excel at different problem types

Error Detection

Peer review catches mistakes individual models miss

Improved Accuracy

Collaboration leads to better final solutions

Collective Intelligence

Combines reasoning approaches from multiple AI systems

Future Possibilities

More Providers & Problem Types

Integrate Mistral and Cohere; expand to code generation and creative writing

Dynamic Strategies & Web Interface

Adaptive debate structures; real-time visualization of debates

Thank You

github.com/Ka10ken1/llm-final