Project ideas from Hacker News discussions.

Disagreement among frontier LLMs on real-world fact-checks

📝 Discussion Summary

5 prevailing themes in the HN discussion

  1. Forced‑choice rubric inflates disagreement – the 4‑bucket (True / Mostly True / Misleading / False) prompt makes any label mismatch count as a “disagreement,” so the reported 67 % figure is really a floor on rubric inconsistency.

    “the 67% figure is a floor on rubric inconsistency (at least one model is label‑inconsistent under the 4‑bucket rubric on 67% of claims), not ‘model X is factually wrong on claim Y.’” — kostaj

  2. No “I don’t know” / abstain option forces guessing – models are compelled to output a label even when they lack confidence, which skews the disagreement numbers.

    “Very good point to measure the inter‑model disagreement! Will add in the next version.” — kostaj (referring to plans to reintroduce an abstain bucket)

  3. Disagreement ≠ correctness without a ground‑truth baseline – the study measures only inter‑model variance, not which model (if any) is right; without a vetted human reference the 67 % metric tells us little.

    “The title mention ‘fact‑checks’, but ‘fact checking’ is a process in which facts are checked against sources, not one where you are given a random fact and have to tell if it's true or false from your own memory. That's what is normally called a quiz game.” — throw310822

  4. Even retrieval‑augmented models disagree; knowledge cut‑offs limit answers – adding search doesn’t eliminate conflict, and many claims are unanswerable by parametric models alone.

    “Without a search tool the only correct answer to ‘On May 18, 2026, Ukraine carried out a drone attack on Moscow, Russia’ is ‘this claim is impossible for me to verify’. And that wasn’t an option.” — simonw
    “Two of the models used have retrieval capabilities and have access to newer information through search. The disagreement between them is still quite significant – 42%.” — kostaj

  5. Skepticism toward headline results and call for better methodology – many argue the headline “67 % disagreement” is sensationalized; a more meaningful figure is the 34 % of claims where verdicts differ substantively (by two or more buckets).

    “In section 2, 34% of cases are found to have ‘substantive’ disagreements differing by 2 or more buckets – True + Misleading, Mostly True + False, or True + False.” — pjdesno

These themes capture the core critiques, methodological concerns, and suggestions for improvement that dominate the Hacker News commentary.
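
The difference between the 67 % headline and the 34 % "substantive" figure comes down to how bucket distance is counted. A minimal sketch of both rates, assuming the four labels are treated as an ordered scale and that "substantive" means a gap of two or more buckets, as pjdesno describes. The function names and the toy data are invented for illustration and are not taken from the study.

```python
# Ordered 4-bucket rubric from the discussion; index distance = bucket distance.
BUCKETS = ["True", "Mostly True", "Misleading", "False"]
RANK = {label: i for i, label in enumerate(BUCKETS)}


def max_bucket_gap(verdicts):
    """Largest bucket distance between any two model verdicts on one claim."""
    ranks = [RANK[v] for v in verdicts]
    return max(ranks) - min(ranks)


def disagreement_rates(claims):
    """Return (any_mismatch_rate, substantive_rate) over a list of verdict lists."""
    gaps = [max_bucket_gap(v) for v in claims]
    any_rate = sum(g >= 1 for g in gaps) / len(gaps)
    substantive_rate = sum(g >= 2 for g in gaps) / len(gaps)
    return any_rate, substantive_rate


# Toy data (invented for illustration, not from the study).
toy_claims = [
    ["True", "True", "True"],           # full agreement
    ["True", "Mostly True", "True"],    # adjacent buckets only
    ["True", "Misleading", "True"],     # 2 buckets apart
    ["Mostly True", "False", "False"],  # 2 buckets apart
]
any_rate, substantive_rate = disagreement_rates(toy_claims)
print(any_rate, substantive_rate)
```

On the toy data the any-mismatch rate is higher than the substantive rate, which is the same pattern the commenters point to: adjacent-bucket mismatches inflate the headline number.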


🚀 Project Ideas

[ConsensusCheck]

Summary

  • Aggregates multiple frontier LLM verdicts on user‑submitted claims, adds web‑search, rationale, and an optional “Abstain” bucket.
  • Reduces arbitrary disagreement and surfaces confidence scores, giving fact‑checkers a reliable decision layer.

Details

Key Value
Target Audience Fact‑checkers, journalists, researchers, data‑journalists
Core Feature Multi‑model aggregation with search, structured rationale, and uncertainty scoring
Tech Stack React (frontend), FastAPI (backend), OpenAI / Claude / Gemini APIs, Perplexity/Google Search API, PostgreSQL
Difficulty Medium
Monetization Revenue-ready: Subscription (tiered)

Notes

  • HN commenters repeatedly ask for “explanations before the answer” and “Abstain” options; this platform satisfies both.
  • Provides a reusable consensus API that can be queried from any downstream service, turning disagreement into actionable insight.
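
One possible shape for the aggregation step, sketched as a plain majority vote that falls back to "Abstain" when no label is dominant. The `aggregate` function, the `min_share` threshold, and the model names in the usage example are all hypothetical; a real service might weight models or use their confidence scores instead.

```python
from collections import Counter

ABSTAIN = "Abstain"


def aggregate(verdicts, min_share=0.5):
    """Majority-vote aggregation that abstains when no label is dominant.

    verdicts: mapping of model name -> label (a model may itself return "Abstain").
    min_share: hypothetical threshold; the winning label must exceed this share
               of all votes, so abstentions count against it.
    """
    votes = Counter(verdicts.values())
    substantive = {label: n for label, n in votes.items() if label != ABSTAIN}
    if not substantive:
        # No model committed to a label (or no models were queried).
        return {"verdict": ABSTAIN, "agreement": 0.0, "votes": dict(votes)}
    label, count = max(substantive.items(), key=lambda kv: kv[1])
    share = count / len(verdicts)
    verdict = label if share > min_share else ABSTAIN
    return {"verdict": verdict, "agreement": share, "votes": dict(votes)}


# Usage with made-up model names and verdicts.
print(aggregate({"model_a": "True", "model_b": "True", "model_c": "False"}))
print(aggregate({"model_a": "True", "model_b": "False", "model_c": "Abstain"}))
```

Returning the raw vote counts alongside the verdict keeps the disagreement visible to the fact-checker instead of hiding it behind a single label.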

[FactPrompt Studio]

Summary

  • A collaborative prompt‑engineering library and UI for standardized fact‑checking rubrics, including definitions for “True / Mostly True / Misleading / False / Abstain”.
  • Core value: eliminates prompt ambiguity across studies and enables reproducible experiments.

Details

Key Value
Target Audience Researchers, LLM engineers, academic teams
Core Feature Editable rubric templates, auto‑generated test‑case pipelines, version‑controlled prompt sharing
Tech Stack Next.js (UI), Supabase (storage), OpenAPI spec for prompt execution, Docker for sandboxed model calls
Difficulty Low
Monetization Hobby

Notes

  • HN discussions about forced‑choice prompts and “no explanations” highlight the need for a shared rubric; this tool makes that easy.

  • Enables reproducible benchmarking without manual prompt rewriting, accelerating progress on factual accuracy studies.
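
A rough sketch of what a shareable rubric template could look like, assuming a rubric is just a name, an ordered label list, and one definition per label. The `Rubric` class, the prompt wording, and the content-hash versioning scheme are illustrative assumptions; the label definitions are deliberately left as placeholders because the source does not specify them.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Rubric:
    name: str
    labels: tuple       # ordered label names
    definitions: tuple  # one definition string per label, same order

    def render(self, claim: str) -> str:
        """Build the full prompt for one claim from the shared rubric."""
        lines = [f"Rubric: {self.name}", "Labels:"]
        lines += [f"- {l}: {d}" for l, d in zip(self.labels, self.definitions)]
        lines += ["", f"Claim: {claim}",
                  "Explain your reasoning, then give exactly one label."]
        return "\n".join(lines)

    def version(self) -> str:
        """Short content hash, so two studies can confirm they used the same rubric."""
        payload = "\n".join((self.name, *self.labels, *self.definitions))
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


LABELS = ("True", "Mostly True", "Misleading", "False", "Abstain")
# Placeholder definitions: each team would fill these in and share the result.
rubric = Rubric("five-bucket", LABELS, tuple("<definition to be agreed>" for _ in LABELS))

print(rubric.version())
print(rubric.render("Example claim text goes here."))
```

Any edit to a label or definition changes the hash, so a study can report the rubric version alongside its results.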

[EvalGuard]

Summary

  • End‑to‑end benchmarking suite that evaluates frontier LLMs on factual claims, records model outputs, compares against human‑verified labels, and visualizes disagreement metrics (Krippendorff’s α, polar‑opposite rates).
  • Core value: supplies trustworthy, continuous accuracy monitoring for product teams.

Details

Key Value
Target Audience Product managers, QA engineers, AI safety teams
Core Feature Batch claim ingestion, automated multi‑model evaluation, human‑label upload, statistical agreement dashboards
Tech Stack Python (FastAPI), Celery workers, SQLite/Parquet storage, Plotly/Dash for dashboards, OpenAI/Anthropic/Gemini APIs
Difficulty High
Monetization Revenue-ready: Usage‑based pricing (per 1k claims)

Notes

  • HN threads stress the need for “inter‑model variance” and “human baseline” data; EvalGuard delivers both in a single pipeline.
  • Can be embedded in CI/CD pipelines, turning factual accuracy checks into a regular quality gate for AI services.
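
For the agreement dashboard, a minimal sketch of Krippendorff's α using the nominal (exact-match) distance, built from the standard coincidence-matrix formulation. The function name and input layout are assumptions. Treating the buckets as nominal ignores their ordering; an ordinal distance would penalize True vs False more than True vs Mostly True and may suit this rubric better.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal labels.

    units: list of label lists, one per claim (one label per model).
    Units with fewer than two labels are skipped, since they contain no pairs.
    """
    coincidences = Counter()  # (label_a, label_b) -> weighted ordered-pair count
    for labels in units:
        m = len(labels)
        if m < 2:
            continue
        for a, b in permutations(labels, 2):
            coincidences[(a, b)] += 1 / (m - 1)

    totals = Counter()  # label -> marginal total n_c
    for (a, _b), w in coincidences.items():
        totals[a] += w
    n = sum(totals.values())
    if n <= 1:
        raise ValueError("need at least one unit with two or more labels")

    # Observed and expected disagreement under the nominal metric.
    observed = sum(w for (a, b), w in coincidences.items() if a != b) / n
    expected = sum(
        totals[a] * totals[b] for a in totals for b in totals if a != b
    ) / (n * (n - 1))
    if expected == 0:
        raise ValueError("alpha is undefined when only one label is used")
    return 1 - observed / expected


# Toy verdicts (invented): two models, three claims.
print(krippendorff_alpha_nominal([["True", "True"], ["False", "False"], ["True", "False"]]))
```

An α of 1 means perfect agreement, 0 means agreement no better than chance, and negative values indicate systematic disagreement.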

[UncertaintyAPI]

Summary

  • Managed API that wraps frontier LLMs with optional search, forces a “thinking‑out‑loud” step, and returns structured JSON with an answer, a confidence score, and a fallback “I don’t know” option.
  • Core value: gives developers a reliable, standardized way to query LLMs for factual questions without forcing false certainty.

Details

Key Value
Target Audience Software engineers, API integrators, app developers
Core Feature Structured output ({answer, confidence, rationale, source_links}), automatic abstain when confidence < threshold
Tech Stack FastAPI, Pydantic models, OpenAI / Claude / Gemini + Perplexity search adapters, Redis cache
Difficulty Medium
Monetization Revenue-ready: Tiered API calls (free tier 10k/month, paid beyond)

Notes

  • HN commenters lament forced‑choice answers and the lack of an “I don’t know” option; this API directly addresses both.
  • Enables downstream apps (e.g., FAQ bots, research assistants) to surface uncertainty, improving trust and reducing misinformation.
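
A minimal sketch of the structured response and the automatic abstain rule described under Core Feature. It uses stdlib dataclasses as a dependency-free stand-in for the Pydantic models named in the tech stack; the `FactAnswer` class, the `apply_abstain` helper, and the 0.6 threshold are hypothetical.

```python
import json
from dataclasses import asdict, dataclass, field

ABSTAIN_ANSWER = "I don't know"


@dataclass
class FactAnswer:
    answer: str
    confidence: float  # expected in [0, 1]
    rationale: str
    source_links: list = field(default_factory=list)


def apply_abstain(raw: FactAnswer, threshold: float = 0.6) -> FactAnswer:
    """Replace the answer with the abstain fallback when confidence < threshold."""
    if not 0.0 <= raw.confidence <= 1.0:
        raise ValueError("confidence must be between 0 and 1")
    if raw.confidence < threshold:
        # Keep the rationale and sources so callers can see why the model was unsure.
        return FactAnswer(ABSTAIN_ANSWER, raw.confidence, raw.rationale, raw.source_links)
    return raw


# Usage with an invented low-confidence model response.
raw = FactAnswer("True", 0.35, "No source found that confirms the claim.")
print(json.dumps(asdict(apply_abstain(raw)), indent=2))
```

Keeping the confidence and rationale in the abstain response lets downstream apps show why no verdict was given instead of returning a bare refusal.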

[CollabFact Marketplace]

Summary

  • A marketplace where domain experts, fact‑checking organizations, and AI providers co‑verify user‑submitted claims, delivering joint verdicts with transparent contributor signatures.

  • Core value: merges crowd wisdom with AI speed, producing defensible fact‑checks that users can cite.

Details

Key Value
Target Audience Fact‑checking NGOs, academic publishers, media outlets
Core Feature Crowd‑sourced expert review, AI pre‑screening, consensus voting, verifiable audit trail
Tech Stack Node.js (backend), GraphQL, React UI, PostgreSQL, OpenAPI for AI connectors, SSI identity for reviewer attestation
Difficulty High
Monetization Revenue-ready: Transaction fee per verified claim

Notes

  • HN participants often ask “how many humans would disagree?” indicating demand for human‑centric verification alongside AI results.
  • Offers a transparent, auditable workflow that satisfies calls for “human‑labelled expected response” and mitigates the perception of AI‑only bias.
