5 prevailing themes in the HN discussion
-
Forced‑choice rubric inflates disagreement – the 4‑bucket (True / Mostly True / Misleading / False) prompt makes any label mismatch count as a “disagreement,” so the reported 67 % figure is really a floor on rubric inconsistency.
“the 67% figure is a floor on rubric inconsistency (at least one model is label‑inconsistent under the 4‑bucket rubric on 67% of claims), not ‘model X is factually wrong on claim Y.’” — kostaj
-
No “I don’t know” / abstain option forces guessing – models are compelled to output a label even when they lack confidence, which skews the disagreement numbers.
“Very good point to measure the inter‑model disagreement! Will add in the next version.” — kostaj (referring to plans to reintroduce an abstain bucket)
-
Disagreement ≠ correctness without a ground‑truth baseline – the study measures only inter‑model variance, not which model (if any) is right; without a vetted human reference the 67 % metric tells us little.
“The title mention ‘fact‑checks’, but ‘fact checking’ is a process in which facts are checked against sources, not one where you are given a random fact and have to tell if it's true or false from your own memory. That's what is normally called a quiz game.” — throw310822
-
Even retrieval‑augmented models disagree; knowledge cut‑offs limit answers – adding search doesn’t eliminate conflict, and many claims are unanswerable by parametric models alone.
“Without a search tool the only correct answer to ‘On May 18, 2026, Ukraine carried out a drone attack on Moscow, Russia’ is ‘this claim is impossible for me to verify’. And that wasn’t an option.” — simonw
“Two of the models used have retrieval capabilities and have access to newer information through search. The disagreement between them is still quite significant – 42%.” — kostaj
-
Skepticism toward headline results and call for better methodology – many argue the headline “67 % disagreement” is sensationalized; a more meaningful figure is the 34 % of claims with substantive disagreement, meaning verdicts that differ by two or more buckets.
“In section 2, 34% of cases are found to have ‘substantive’ disagreements differing by 2 or more buckets – True + Misleading, Mostly True + False, or True + False.” — pjdesno
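The gap between the two figures comes down to how a "disagreement" is counted. The following is a minimal sketch, not the study's actual code: it assumes the four labels are ordered True < Mostly True < Misleading < False (the ordering implied by the quoted examples), and the function names and sample verdicts are hypothetical, invented purely for illustration.

```python
# Ordinal positions of the four rubric labels (assumed ordering, see lead-in).
BUCKETS = {"True": 0, "Mostly True": 1, "Misleading": 2, "False": 3}


def max_bucket_gap(labels):
    """Largest ordinal distance between any two labels given to one claim."""
    ranks = [BUCKETS[label] for label in labels]
    return max(ranks) - min(ranks)


def disagreement_rates(verdicts):
    """Return (any_disagreement, substantive_disagreement) as fractions of claims.

    `verdicts` holds one list of labels per claim, one label per model.
    "Any" counts every label mismatch; "substantive" counts only claims whose
    labels differ by two or more buckets.
    """
    gaps = [max_bucket_gap(labels) for labels in verdicts]
    n = len(gaps)
    any_rate = sum(gap >= 1 for gap in gaps) / n
    substantive_rate = sum(gap >= 2 for gap in gaps) / n
    return any_rate, substantive_rate


# Hypothetical verdicts from three models on four claims (not real data).
example = [
    ["True", "True", "True"],               # gap 0: full agreement
    ["True", "Mostly True", "True"],        # gap 1: adjacent buckets only
    ["True", "Misleading", "Mostly True"],  # gap 2: substantive
    ["Mostly True", "False", "False"],      # gap 2: substantive
]
any_rate, substantive_rate = disagreement_rates(example)
```

On this made-up sample, three of the four claims show some mismatch but only two show a substantive one. An adjacent-bucket split such as True vs. Mostly True raises the first rate without touching the second.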
These themes capture the core critiques, methodological concerns, and suggestions for improvement that dominate the Hacker News commentary.