
Grounded in peer-reviewed work on multi-agent reasoning, Qplural runs five independent frontier models over the same question with live web retrieval and round-two cross-critique, improving factuality, strengthening reasoning, and reducing hallucinations.
The problem
Every frontier language model — GPT, Claude, Gemini, Grok, DeepSeek — was trained on overlapping data, tuned with similar techniques, and optimised for similar benchmarks. Ask the same hard question to any one of them and you get a fluent, assertive answer that often sounds more certain than it has any right to be. Hallucinations, stale facts, blind spots, subtle bias — all of it comes out wearing the same confident voice.
The standard fixes so far — better prompting, more retrieval, bigger models — reduce mistakes but don’t surface the ones that remain. If the model is wrong, you don’t usually find out until you act on the answer.
The research answer
The last three years of multi-agent debate research, at ICML, ICLR, NeurIPS, ACL and EMNLP, have substantially sharpened the picture. The foundational result (Du et al. [1]) showed the basic mechanism: multiple language models that answer independently and then read each other's reasoning catch errors any single model would defend. One model alone will assert a wrong answer confidently; several reading each other's working will often surface the flaw.
Since then the programme has tightened considerably. Heterogeneity (models from different labs, not copies of the same one [2]) matters more than sheer agent count. Handing each agent a different slice of the retrieved evidence beats letting all of them anchor on the same sources [8]. Hiding peer confidence from other agents prevents over-confidence cascades [6]. Auditing disagreement points in the transcript recovers correct minority answers that majority voting loses entirely [7].
Qplural implements these findings together, plus one more: a recent preprint [5] names this property "architectural heterogeneity" and argues it is what prevents consensus collapse, the failure mode where a panel of models from the same lab confidently converges on the same wrong answer because they inherited the same biases in training.
What we do
When you ask Qplural a hard question, an orchestrator model plans the research, five frontier models answer in parallel against partitioned web evidence, a separate reviewer stress-tests their briefs and commissions a second round of targeted research, and the five models revise. A final blinded synthesis pass reads the whole transcript and writes one answer with inline citations. Every stage is visible in the UI — you can audit any of it — but what you read is the synthesis, not five answers to reconcile yourself.
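The stages above can be sketched end to end. This is a minimal, hypothetical illustration, not Qplural's actual implementation: `call_model` stands in for a real frontier-model API request, and the panel names are placeholders.

```python
# Hypothetical sketch of the five-stage pipeline described above.
# `call_model` is a stand-in for a real model API call; all names are illustrative.

PANEL = ["openai", "anthropic", "deepmind", "xai", "deepseek"]

def call_model(role: str, prompt: str) -> str:
    # Placeholder for an actual frontier-model request.
    return f"[{role}] response to: {prompt[:40]}"

def run_pipeline(question: str) -> str:
    # 1. An orchestrator commissions one research brief per analyst.
    briefs = {m: call_model("orchestrator", f"brief {m}: {question}") for m in PANEL}
    # 2. Five analysts answer in parallel (sequential here for clarity).
    drafts = {m: call_model(m, briefs[m]) for m in PANEL}
    # 3. A separate reviewer reads all five briefs together and critiques them.
    critique = call_model("reviewer", " | ".join(drafts.values()))
    # 4. The analysts revise with sight of the critique.
    revised = {m: call_model(m, drafts[m] + " // " + critique) for m in PANEL}
    # 5. A blinded synthesiser adjudicates the full transcript.
    return call_model("synthesiser", " | ".join(revised.values()))
```

In production the five round-one calls would be issued concurrently; the sequential comprehension above only keeps the data flow readable.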
Interpret and retrieve
An orchestrator model reads your question, commissions five research briefs targeted at different facets of it, and pulls live web evidence in three parallel framings — neutral, supportive, and challenging. The evidence is partitioned across the five researchers so each reads from a different slice, not a shared pool.
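One way to realise the three framings and the per-analyst evidence partition is sketched below. The query templates and the round-robin split are assumptions for illustration, not Qplural's actual retrieval code.

```python
def framed_queries(question: str) -> list[str]:
    # Three parallel framings of the same question:
    # neutral, supportive, and challenging.
    return [
        question,
        f"evidence supporting: {question}",
        f"evidence against: {question}",
    ]

def partition_evidence(docs: list, n_agents: int = 5) -> list[list]:
    # Round-robin split so each researcher reads a disjoint slice
    # rather than all five anchoring on the same shared pool.
    slices = [[] for _ in range(n_agents)]
    for i, doc in enumerate(docs):
        slices[i % n_agents].append(doc)
    return slices
```

The disjointness is the point: agreement reached over different slices carries more evidential weight than agreement over one shared pool.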
Five frontier models answer in parallel
Each model (one each from OpenAI, Anthropic, Google DeepMind, xAI, and DeepSeek) writes to its brief using its own evidence subset. No two analysts lean on the same sources, so any later agreement is stronger evidence than five models reading the same article.
Verification — cross-critique and targeted re-retrieval
A separate reviewer model reads all five briefs together, flags gaps, contradictions, and unresolved claims, and commissions a second round of targeted research aimed precisely at those weak points. Fresh web evidence is pulled to verify what the first round left open. This is the verification step: the answer is not allowed to rest on the first attempt.
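A simple way to picture the reviewer's job, assuming briefs reduced to sets of claims (a toy representation; real briefs are prose and the reviewer is itself a model):

```python
def flag_weak_points(briefs: dict[str, set[str]]) -> set[str]:
    # A claim asserted by some analysts but not endorsed by all of them
    # is a weak point worth a second round of targeted retrieval.
    every_claim = set().union(*briefs.values())
    return {c for c in every_claim if any(c not in b for b in briefs.values())}

def followup_queries(weak_points: set[str]) -> list[str]:
    # One targeted verification query per unresolved claim.
    return [f"verify: {claim}" for claim in sorted(weak_points)]
```

Claims all five briefs agree on pass through; everything contested or one-sided gets fresh evidence aimed at it.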
Five models revise against the critique
The five researchers run again — this time with visibility of their peers’ first-round work and access to the new evidence. They tighten, concede, or sharpen where the reviewer found gaps. This is the debate literature’s core loop: independent proposal, peer review, revise.
Blinded synthesis
A separate synthesis pass reads the complete transcript and produces one concise final answer with inline citations back to every source used. The synthesiser does not participate in the debate — it adjudicates it.
Every claim in the final answer lands with a citation back into the transcript, so you can see exactly where it came from.
Why five?
We chose five in light of recent multi-agent debate research suggesting that debate quality is driven less by a single “correct” number of agents than by two underlying conditions: first, the presence of a sufficiently diverse initial pool of candidate answers, and second, a deliberation process that can meaningfully revise beliefs in response to disagreement.
Zhu et al. [4] study a five-agent, five-turn debate setting and show that performance improves when the initial debate pool is made more diverse. Chen et al. [2] further show that consensus quality improves when agents are drawn from different model families rather than from repeated instances of the same model. Du et al. [1] also report that debate performance can improve as the number of participating agents increases, while Liang et al. [3] motivate debate as a way to counter the Degeneration-of-Thought problem that emerges when a single model becomes locked into its initial reasoning path.
Taken together, these findings do not imply that five is a universal optimum; rather, they make five a principled operating point: large enough to increase the probability that a strong answer is present at initialisation and that genuine disagreement can surface, yet small enough to keep deliberation computationally tractable.
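The first condition, a strong answer being present at initialisation, can be made concrete with a toy calculation that (unrealistically) treats the analysts as independent:

```python
def p_strong_answer_present(q: float, n: int) -> float:
    # Probability that at least one of n independent analysts proposes
    # a strong answer at initialisation, if each does so with probability q.
    return 1 - (1 - q) ** n

# With q = 0.5, a single analyst has even odds, while a pool of five
# contains a strong answer with probability 1 - 0.5**5 = 0.96875.
```

Real models are correlated, so the true gain is smaller; the heterogeneous panel exists precisely to push the pool back toward independence.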
Meet the panel
Every run uses the latest flagship from each of five frontier labs. The panel is intentionally heterogeneous — different training data, different post-training techniques, different reasoning habits — because that’s where the debate literature’s factuality gains come from. The synthesis model does not participate in the debate; it reads the transcript blind.
Why it matters
When all five analysts converge on the same answer — from different priors, looking at different sources — that is much stronger evidence than a single model’s confident assertion. When they disagree, the reviewer’s second round of research is aimed precisely where the disagreement lives. Often the disagreement is the most valuable part of the answer: it shows which parts of a question are solid and which parts still require judgement.
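A toy model shows why independent convergence is strong evidence. Suppose each analyst errs with probability p and errors scatter uniformly over k plausible wrong answers; independence is an idealisation (shared training biases violate it, which is why the panel is heterogeneous), so treat this as illustration rather than calibration:

```python
def p_unanimous_wrong(p: float, k: int, n: int) -> float:
    # Probability that n independent analysts all converge on the SAME
    # wrong answer, with errors spread uniformly over k alternatives:
    # k choices of wrong answer, each hit by all n with prob (p/k)**n.
    return k * (p / k) ** n

# One analyst wrong with p = 0.3; five analysts unanimously wrong on
# the same answer only with probability 3 * 0.1**5 = 3e-05.
```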
Qplural is for the questions where you’d rather know the panel is uncertain than be told a confident wrong thing.
References
Peer-reviewed here means accepted at ICML / ICLR / ACL / EMNLP / NeurIPS — not “on arXiv.” arXiv preprints that haven’t cleared a conference are labelled emerging.
[1] ICML 2024 · peer-reviewed
Improving Factuality and Reasoning in Language Models through Multiagent Debate
Du, Li, Torralba, Tenenbaum & Mordatch
The foundational result: multiple models reading each other’s reasoning catch errors a single model defends. Shows debate performance can improve as the number of participating agents increases.
Read on arXiv
[2] ACL 2024 · peer-reviewed
ReConcile: Round-Table Conference Improves Reasoning via Consensus among Diverse LLMs
Chen, Saha & Bansal
Shows consensus quality improves when agents are drawn from different model families rather than from repeated instances of the same model, and that a transcript-level judge outperforms majority voting.
Read on arXiv
[3] EMNLP 2024 · peer-reviewed
Encouraging Divergent Thinking in Large Language Models through Multi-Agent Debate
Liang et al.
Motivates debate as a way to counter the Degeneration-of-Thought problem that emerges when a single model becomes locked into its initial reasoning path.
Read on arXiv
[4] arXiv 2026 · emerging
Demystifying Multi-Agent Debate
Zhu et al.
Studies a five-agent, five-turn debate setting and shows performance improves when the initial debate pool is made more diverse and when agents communicate calibrated confidence during revision.
Read on arXiv
[5] arXiv 2026 · emerging
Heterogeneous Debate Engine: Identity-Grounded Cognitive Architecture for Resilient LLM-Based Ethical Tutoring
HDE paper
Argues that architectural heterogeneity — models from different labs — prevents “consensus collapse”, where homogeneous panels share the same training biases and confidently converge on the same wrong answer.
Read on arXiv
[6] arXiv 2025 · emerging
Enhancing Multi-Agent Debate System Performance via Confidence Expression
Wu et al.
Finds that when debating agents see each other’s confidence scores the panel drifts toward over-confidence and loses signal. Informs the Qplural design choice that cross-critique turns on reasoning and disconfirmation conditions, not assertiveness.
Read on arXiv
[7] arXiv 2026 · emerging
Auditing Multi-Agent LLM Reasoning Trees Outperforms Majority Vote and LLM-as-Judge
AgentAuditor paper
Shows that adjudicating at divergence points — by comparing localised branch evidence — beats both majority vote and generic LLM-as-judge, recovering correct minority answers where voting loses them entirely. Supports the Qplural design of a blinded synthesis pass over the full transcript.
Read on arXiv
[8] arXiv 2025 · emerging
Retrieval-Augmented Generation with Conflicting Evidence (MADAM-RAG)
Wang, Prasad, Stengel-Eskin & Bansal
Assigns each agent a different subset of the retrieved evidence, then lets them debate. Reports factuality gains of 11–16 percentage points on benchmarks with ambiguous or conflicting documents. Basis for the Qplural per-analyst evidence partitioning: agreement reached by analysts reading different sources is much stronger evidence than agreement when all five read the same article.
Read on arXiv
Independent models: ChatGPT, Claude, Gemini, Grok, DeepSeek.
Your questions are never shared. Your answers are private to you.