Disagreement among frontier LLMs on real-world fact-checks

📝 Discussion Summary

5 prevailing themes in the HN discussion

Forced‑choice rubric inflates disagreement – the 4‑bucket (True / Mostly True / Misleading / False) prompt makes any label mismatch count as a “disagreement,” so the reported 67 % figure is really a floor on rubric inconsistency.

“the 67% figure is a floor on rubric inconsistency (at least one model is label‑inconsistent under the 4‑bucket rubric on 67% of claims), not ‘model X is factually wrong on claim Y.’” — kostaj

No “I don’t know” / abstain option forces guessing – models are compelled to output a label even when they lack confidence, which skews the disagreement numbers.

“Very good point to measure the inter‑model disagreement! Will add in the next version.” — kostaj (referring to plans to reintroduce an abstain bucket)

Disagreement ≠ correctness without a ground‑truth baseline – the study measures only inter‑model variance, not which model (if any) is right; without a vetted human reference the 67 % metric says little about accuracy.

“The title mention ‘fact‑checks’, but ‘fact checking’ is a process in which facts are checked against sources, not one where you are given a random fact and have to tell if it's true or false from your own memory. That's what is normally called a quiz game.” — throw310822

Even retrieval‑augmented models disagree; knowledge cut‑offs limit answers – adding search doesn’t eliminate conflict, and many claims are unanswerable by parametric models alone.

“Without a search tool the only correct answer to ‘On May 18, 2026, Ukraine carried out a drone attack on Moscow, Russia’ is ‘this claim is impossible for me to verify’. And that wasn’t an option.” — simonw

“Two of the models used have retrieval capabilities and have access to newer information through search. The disagreement between them is still quite significant – 42%.” — kostaj

Skepticism toward headline results and call for better methodology – many argue the headline “67 % disagreement” is sensationalized; a more meaningful figure is the 34 % of claims with substantive disagreement, meaning verdicts two or more buckets apart.

“In section 2, 34% of cases are found to have ‘substantive’ disagreements differing by 2 or more buckets – True + Misleading, Mostly True + False, or True + False.” — pjdesno

These themes capture the core critiques, methodological concerns, and suggestions for improvement that dominate the Hacker News commentary.
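The two figures debated above (any label mismatch versus a gap of two or more buckets) can be told apart with a few lines of code. The sketch below is illustrative only: it assumes the four buckets are ordered True, Mostly True, Misleading, False, and the function names and sample verdicts are made up, not taken from the study.

```python
# Ordinal positions of the four rubric buckets named in the discussion.
# The ordering is an assumption made for this sketch.
BUCKETS = {"True": 0, "Mostly True": 1, "Misleading": 2, "False": 3}


def max_bucket_gap(labels):
    """Largest ordinal distance between any two model labels for one claim."""
    positions = [BUCKETS[label] for label in labels]
    return max(positions) - min(positions)


def disagreement_rates(verdicts):
    """Return (any_disagreement_rate, substantive_rate) over a list of claims.

    `verdicts` is a list of per-claim label lists, one label per model.
    "Substantive" follows the thread's definition: two or more buckets apart.
    """
    if not verdicts:
        raise ValueError("need at least one claim")
    gaps = [max_bucket_gap(labels) for labels in verdicts]
    any_rate = sum(gap >= 1 for gap in gaps) / len(gaps)
    substantive_rate = sum(gap >= 2 for gap in gaps) / len(gaps)
    return any_rate, substantive_rate


# Made-up verdicts for four claims from three hypothetical models.
sample = [
    ["True", "True", "True"],               # gap 0: full agreement
    ["True", "Mostly True", "True"],        # gap 1: adjacent buckets only
    ["True", "Misleading", "Mostly True"],  # gap 2: substantive
    ["Mostly True", "False", "False"],      # gap 2: substantive
]

any_rate, substantive_rate = disagreement_rates(sample)
print(any_rate, substantive_rate)
```

On this toy input three of four claims show some mismatch, but only two show a substantive one, which is the distinction commenters drew between the 67 % and 34 % figures.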

🚀 Project Ideas

[ConsensusCheck]

Summary

Aggregates multiple frontier LLM verdicts on user‑submitted claims, adds web‑search, rationale, and an optional “Abstain” bucket.
Reduces arbitrary disagreement and surfaces confidence scores, giving fact‑checkers a reliable decision layer.

Details

Key	Value
Target Audience	Fact‑checkers, journalists, researchers, data‑journalists
Core Feature	Multi‑model aggregation with search, structured rationale, and uncertainty scoring
Tech Stack	React (frontend), FastAPI (backend), OpenAI / Claude / Gemini APIs, Perplexity/Google Search API, PostgreSQL
Difficulty	Medium
Monetization	Revenue-ready: Subscription (tiered)

Notes

HN commenters repeatedly ask for “explanations before the answer” and “Abstain” options; this platform satisfies both.
Provides a reusable consensus API that can be queried from any downstream service, turning disagreement into actionable insight.
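A minimal sketch of what the aggregation step might look like, assuming a simple majority vote in which the share of models backing the top label doubles as a rough confidence score. The `consensus` function and its `min_share` parameter are hypothetical; a real service would likely weight models and sources differently.

```python
from collections import Counter


def consensus(labels, min_share=0.5):
    """Return (verdict, share) for one claim.

    `labels` holds one label per model, possibly including "Abstain".
    The verdict falls back to "Abstain" unless the top label is backed by
    more than `min_share` of all models (abstaining models count against it).
    """
    if not labels:
        raise ValueError("need at least one model label")
    votes = Counter(label for label in labels if label != "Abstain")
    if not votes:
        return "Abstain", 0.0
    top_label, top_count = votes.most_common(1)[0]
    share = top_count / len(labels)
    if share <= min_share:
        return "Abstain", share
    return top_label, share


print(consensus(["True", "True", "Mostly True"]))
print(consensus(["True", "False", "Abstain"]))
```

The second call has no label with majority support, so the sketch returns "Abstain" instead of picking arbitrarily between True and False.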

[FactPrompt Studio]

Summary

A collaborative prompt‑engineering library and UI for standardized fact‑checking rubrics, including definitions for “True / Mostly True / Misleading / False / Abstain”.
Core value: eliminates prompt ambiguity across studies and enables reproducible experiments.

Details

Key	Value
Target Audience	Researchers, LLM engineers, academic teams
Core Feature	Editable rubric templates, auto‑generated test‑case pipelines, version‑controlled prompt sharing
Tech Stack	Next.js (UI), Supabase (storage), OpenAPI spec for prompt execution, Docker for sandboxed model calls
Difficulty	Low
Monetization	Hobby

Notes

Discussions in HN about forced‑choice prompts and “no explanations” highlight the need for a shared rubric; this tool makes that easy.
Enables reproducible benchmarking without manual prompt rewriting, accelerating progress on factual accuracy studies.
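One way a shared rubric could be made reproducible is to render every prompt from a single rubric object and fingerprint that object, so two studies can confirm they used the same wording. This is a sketch under assumptions: the rubric fields, the instruction text, and both function names are hypothetical.

```python
import hashlib
import json


def rubric_fingerprint(rubric):
    """Stable short hash of a rubric dict, independent of key order."""
    canonical = json.dumps(rubric, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


def render_prompt(rubric, claim):
    """Build the prompt text for one claim from the shared rubric."""
    labels = " / ".join(rubric["labels"])
    return f"{rubric['instruction']}\nAllowed labels: {labels}\nClaim: {claim}"


# Hypothetical rubric; the instruction wording is a placeholder.
rubric = {
    "instruction": "Explain your reasoning, then give exactly one label.",
    "labels": ["True", "Mostly True", "Misleading", "False", "Abstain"],
}

print(rubric_fingerprint(rubric))
print(render_prompt(rubric, "Example claim text."))
```

Publishing the fingerprint alongside results would let readers detect when two experiments quietly used different rubrics.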

[EvalGuard]

Summary

End‑to‑end benchmarking suite that evaluates frontier LLMs on factual claims, records model outputs, compares against human‑verified labels, and visualizes disagreement metrics (Krippendorff’s α, polar‑opposite rates).
Core value: supplies trustworthy, continuous accuracy monitoring for product teams.

Details

Key	Value
Target Audience	Product managers, QA engineers, AI safety teams
Core Feature	Batch claim ingestion, automated multi‑model evaluation, human‑label upload, statistical agreement dashboards
Tech Stack	Python (FastAPI), Celery workers, SQLite/Parquet storage, Plotly/Dash for dashboards, OpenAI/Anthropic/Gemini APIs
Difficulty	High
Monetization	Revenue-ready: Usage‑based pricing (per 1k claims)

Notes

HN threads stress the need for “inter‑model variance” and “human baseline” data; EvalGuard delivers both in a single pipeline.
Can be embedded in CI/CD pipelines, turning factual accuracy checks into a regular quality gate for AI services.
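The quality-gate idea could be sketched as a comparison of model labels against human-verified labels with a pass/fail threshold. This example uses plain exact-match accuracy, not Krippendorff's α, to stay short; the function names and the 0.8 default threshold are assumptions for illustration.

```python
def accuracy(model_labels, human_labels):
    """Fraction of claims where the model label matches the human label."""
    if not human_labels or len(model_labels) != len(human_labels):
        raise ValueError("label lists must be non-empty and equal in length")
    matches = sum(m == h for m, h in zip(model_labels, human_labels))
    return matches / len(human_labels)


def quality_gate(model_labels, human_labels, min_accuracy=0.8):
    """Return True when accuracy meets the threshold, False otherwise."""
    return accuracy(model_labels, human_labels) >= min_accuracy


# Made-up labels for four claims.
human = ["True", "False", "Misleading", "True"]
model = ["True", "False", "False", "True"]

print(accuracy(model, human))
print(quality_gate(model, human))
```

In a CI job, a False result from `quality_gate` would translate into a non-zero exit code so the pipeline fails.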

[UncertaintyAPI]

Summary

Managed API that wraps frontier LLMs with optional search, forces a “thinking‑out‑loud” step, and returns a structured JSON with answer, confidence, and fallback “I don’t know” option.
Core value: gives developers a reliable, standardized way to query LLMs for factual questions without forcing false certainty.

Details

Key	Value
Target Audience	Software engineers, API integrators, app developers
Core Feature	Structured output (`{answer, confidence, rationale, source_links}`), automatic abstain when confidence < threshold
Tech Stack	FastAPI, Pydantic models, OpenAI / Claude / Gemini + Perplexity search adapters, Redis cache
Difficulty	Medium
Monetization	Revenue-ready: Tiered API calls (free tier 10k/month, paid beyond)

Notes

Users in HN lament the lack of “I don’t know” and forced‑choice answers; this API directly addresses that need.
Enables downstream apps (e.g., FAQ bots, research assistants) to surface uncertainty, improving trust and reducing misinformation.
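The structured output and automatic abstain described above could look roughly like this. The sketch uses a standard-library dataclass in place of the Pydantic models named in the tech stack to stay dependency-free; the field names follow the `{answer, confidence, rationale, source_links}` shape, while the 0.6 threshold and the `apply_abstain` helper are assumptions.

```python
from dataclasses import asdict, dataclass, field, replace


@dataclass
class Verdict:
    answer: str
    confidence: float  # expected to lie in [0.0, 1.0]
    rationale: str
    source_links: list = field(default_factory=list)


def apply_abstain(verdict, threshold=0.6):
    """Swap the answer for "I don't know" when confidence is below threshold."""
    if not 0.0 <= verdict.confidence <= 1.0:
        raise ValueError("confidence must be between 0 and 1")
    if verdict.confidence < threshold:
        return replace(verdict, answer="I don't know")
    return verdict


low = Verdict("True", 0.4, "Only one weak source found.")
high = Verdict("False", 0.9, "Two sources contradict it.", ["https://example.org"])

print(asdict(apply_abstain(low)))
print(asdict(apply_abstain(high)))
```

Keeping the original confidence and rationale on an abstained verdict lets downstream apps show why the service declined to answer.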

[CollabFact Marketplace]

Summary

A marketplace where domain experts, fact‑checking organizations, and AI providers co‑verify user‑submitted claims, delivering joint verdicts with transparent contributor signatures.
Core value: merges crowd wisdom with AI speed, producing defensible fact‑checks that users can cite.

Details

Key	Value
Target Audience	Fact‑checking NGOs, academic publishers, media outlets
Core Feature	Crowd‑sourced expert review, AI pre‑screening, consensus voting, verifiable audit trail
Tech Stack	Node.js (backend), GraphQL, React UI, PostgreSQL, OpenAPI for AI connectors, SSI identity for reviewer attestation
Difficulty	High
Monetization	Revenue-ready: Transaction fee per verified claim

Notes

HN participants often ask “how many humans would disagree?” indicating demand for human‑centric verification alongside AI results.
Offers a transparent, auditable workflow that satisfies calls for “human‑labelled expected response” and mitigates the perception of AI‑only bias.
