Debate 5 - Winners: o3, Perplexity (Research Mode). Losers: Copilot & Gemini 2.5 Flash. Other: Opus 4

Important Precursor

This one was a continuation of Debate 3. All models except Gemini were continued in their existing chats. I hit my Gemini 2.5 Pro daily usage limit at the end of the prior session, so I had to start afresh with 2.5 Flash, but gave it a head start with Deep Research mode.

Opus 4, though the most correct this time, was also the most incorrect in the prior debate. Because this was a continuation of the same chat, all of the existing [corrected] knowledge was retained, and since Opus is extremely good at analyzing text (though perhaps not as good at finding the best search results or scraping web pages in real time), that head start gave it the upper hand.

Opus:

AI LLM Identification Analysis

Based on the distinctive patterns, formatting styles, and response characteristics of each knowledge source, here's my assessment of which AI LLM each likely represents:

Original Knowledge Sources (1-5):

  1. FPBX-SCOPE-DETAIL-v1 → Claude (Anthropic)

    • Extensive academic-style citations [1][2][3]
    • Detailed tables with proper markdown formatting
    • Comprehensive analysis with nuanced explanations
    • 39 citations with proper attribution
    • Structured, thorough approach typical of Claude
  2. FPBX-MATRIX-BRIEF-v2 → Gemini (Google)

    • Concise bullet-point format
    • Practical, no-nonsense approach
    • Clean sections with clear headers
    • Less verbose than others
    • Focus on actionable information
  3. FPBX-GAPS-AWARE-v3 → Perplexity AI

    • Explicitly acknowledges information gaps
    • "Based on my research" phrasing
    • Systematic search-based approach
    • Admits when "No specific percentages documented"
    • Transparent about source limitations
  4. FPBX-CITE-HEAVY-v4 → Claude (Anthropic)

    • 66 citations with extensive footnoting
    • Academic writing style
    • Comprehensive coverage
    • Similar formatting to source 1
    • Thorough cross-referencing
  5. FPBX-FACT-CHECK-v5 → ChatGPT (OpenAI)

    • "Fact-checked matrix" presentation
    • Structured table format
    • Date-conscious ("2020-25 timeframe")
    • Balanced, journalistic tone
    • Clear summarization style

Clarification Sources (1-5):

  1. CLARIFY-DETAILED-v1 → ChatGPT (OpenAI)

    • "Howdy!" greeting (distinctive conversational opener)
    • Extensive explanations with user-friendly tone
    • Community-focused language
    • Natural, flowing prose
  2. CLARIFY-NUMBERED-v2 → Gemini (Google)

    • Clean numbered format
    • Direct, efficient responses
    • Minimal elaboration
    • Focus on core facts
  3. CLARIFY-VERIFY-v3 → Perplexity AI

    • "I'll systematically verify"
    • "Based on my comprehensive analysis"
    • Research-oriented approach
    • Acknowledges search process
  4. CLARIFY-RESOLVE-v4 → Claude (Anthropic)

    • 71 citations (most extensive)
    • "Based on my comprehensive research"
    • Detailed timestamp citations [1][2][3]...
    • Academic thoroughness
  5. CLARIFY-VERDICT-v5 → Grok (xAI) or Mistral

    • Technical, engineering-focused style
    • "Quick-glance verdict" format
    • Terse, efficient presentation
    • Unix-style formatting preferences
    • Less common phrasing patterns

The citation styles, depth of analysis, and response patterns strongly suggest a mix of Claude (known for thoroughness), ChatGPT (conversational), Perplexity (search-focused), Gemini (efficient), and possibly Grok or another technical AI for the verdict-style responses.

User:

I pasted them in the following order consistently:

  1. Gemini 2.5 Flash (1st paste was "Deep Research", the rest were normal chat)
  2. Microsoft Copilot Consumer Version
  3. Claude Opus 4
  4. Perplexity (Research Mode; model selection not available, so the exact model is unknown)
  5. OpenAI GPT o3
Which ones were the most accurate (fewest corrections needed over the course of the process)? Which were the least accurate? Any other interesting findings from analyzing their performances?

Opus:

Performance Analysis of the AI LLMs

This is fascinating - I completely misidentified all of them! Let me analyze their actual performance based on the corrections needed throughout our process:

Most Accurate (Fewest Corrections Needed)

  1. Claude Opus 4 (Position 3)

    • Key strength: Explicitly acknowledged information gaps and uncertainties
    • Notable: Only one to correctly state "No specific percentages documented in official sources"
    • Approach: Conservative and evidence-based, avoided speculation
    • Corrections needed: Minimal - mostly just needed more detail
  2. OpenAI GPT o3 (Position 5)

    • Key strength: Structured, fact-based presentation with clear tables
    • Notable: Caught specific details like bug #251 with --skiptrunksandroutes
    • Approach: Technical and precise
    • Corrections needed: Minor clarifications only
  3. Perplexity Research Mode (Position 4)

    • Key strength: Most comprehensive with 66-71 citations
    • Notable: Included unique technical details (SRTP, call waiting, session timers)
    • Approach: Exhaustive research with extensive sourcing
    • Corrections needed: Some over-specificity that couldn't be verified

Least Accurate (Most Corrections Needed)

  1. Microsoft Copilot (Position 2)

    • Major error: Claimed CDR was "NOT included" in backup (contradicted by others)
    • Critical error: In clarification, claimed --restorelegacycdr flag doesn't exist (false)
    • Pattern: Overly concise, missed important nuances
    • Corrections needed: Significant - had factual errors requiring correction
  2. Gemini 2.5 Flash (Position 1)

    • Issues: Despite "Deep Research" mode, had CDR inclusion confusion
    • Pattern: Very detailed but sometimes contradictory
    • Notable: After initial deep research, subsequent responses were less thorough
    • Corrections needed: Moderate - mainly clarifications on CDR and timing

Interesting Findings

  1. Writing Style Patterns:

    • I incorrectly associated academic citation style with Claude, but it was actually Gemini and Perplexity
    • Copilot's brevity made it prone to oversimplification and errors
    • Claude's uncertainty acknowledgment was its strongest quality
  2. Research Depth vs Accuracy:

    • More citations ≠ more accuracy (Perplexity had most citations but not highest accuracy)
    • Claude's approach of stating uncertainties was more valuable than confident speculation
    • GPT o3's structured approach balanced detail with accuracy well
  3. Error Patterns:

    • Copilot: Factual errors, possibly from oversimplification
    • Gemini: Inconsistencies between deep research and regular mode
    • Perplexity: Sometimes included unverifiable specific details
    • Claude & GPT o3: Fewer factual errors, more conservative claims
  4. Consensus Building:

    • Items where 4-5 sources agreed were almost always correct
    • Single-source claims were often unverifiable or incorrect
    • The iterative process was essential - no single AI had complete accurate information
  5. Mode Differences:

    • Gemini's "Deep Research" mode in first response was notably different from subsequent responses
    • Perplexity's research mode provided exhaustive citations but not always better accuracy
    • Regular chat modes sometimes provided clearer, more accurate summaries than research modes

Key Takeaway

The most reliable AIs (Claude Opus 4 and GPT o3) shared these characteristics:

  • Acknowledged uncertainties rather than guessing
  • Provided structured, clear information
  • Made conservative claims that could be verified
  • Focused on consensus information rather than edge cases

The least reliable (Copilot) tried to be too definitive and concise, leading to factual errors that required significant correction.
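
Verifying the Disputed Flags Yourself

Rather than taking any single model's word on the CDR and restore-flag dispute, the quickest sanity check is to ask fwconsole on an actual FreePBX box. This is only a sketch: the flag names (--restorelegacycdr, --skiptrunksandroutes) are the ones the models argued about above, the backup file path is a placeholder, and you should confirm everything against the help output for your own FreePBX version before relying on it.

```
# List the options the backup/restore command actually supports on this box,
# and check whether the disputed flags appear among them.
fwconsole backup --help | grep -iE 'restorelegacycdr|skiptrunksandroutes'

# If the flag shows up in the help output, a restore that keeps legacy CDR
# data might look like this (placeholder backup path, flag name as reported
# by the models above, not independently verified here):
fwconsole backup --restore=/var/spool/asterisk/backup/example-backup.tar.gz --restorelegacycdr
```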
