Project ideas from Hacker News discussions.

Are LLM merge rates not getting better?

📝 Discussion Summary

1. “Did the models actually get better?”
Users are split between anecdotal evidence of a jump (especially around Opus 4.5/4.6) and the METR data that shows a flat or even declining trend when all models are lumped together.

“I’m really seeing it as well.” – postflopclarity
“The study shows a flat line… if you exclude GPT‑5 it matches a logistic curve.” – wongarsu

2. The harness is the real game‑changer
Many respondents argue that the leap in productivity comes from better tooling, agentic loops, and context management, not from the core model's reasoning; a minimal sketch of such a harness follows the quotes below.

“We’ve gotten better in harnessing not the models’ actual reasoning.” – jwpapi
“The practical capability jump is huge when you combine models with tool use, planning loops, and persistent context.” – idorozin
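
To make the harness point concrete, here is a minimal sketch of such a loop; `call_llm` and the tool registry are hypothetical stand-ins, not any particular vendor's API. The point is that planning, tool dispatch, and persistent context all live outside the model:

```python
# Minimal harness sketch: the planning loop, tool dispatch, and
# persistent context wrap the model rather than living inside it.
# `call_llm` is a hypothetical stand-in for any chat-completion client.
import json

def call_llm(messages: list[dict]) -> dict:
    """Hypothetical model call: returns {'tool': ..., 'args': ...} or {'content': ...}."""
    raise NotImplementedError  # wire up a real provider SDK here

TOOLS = {
    "read_file": lambda path: open(path).read(),
    "run_tests": lambda _: "placeholder test output",
}

def run_agent(task: str, max_steps: int = 10) -> str:
    context = [{"role": "user", "content": task}]  # persistent context
    for _ in range(max_steps):
        reply = call_llm(context)
        if reply.get("tool"):                      # model asked the harness to act
            result = TOOLS[reply["tool"]](reply["args"])
            context.append({"role": "tool", "content": json.dumps(result)})
        else:
            return reply["content"]                # final answer
    return "step budget exhausted"
```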

3. Trust, reliability, and accountability remain hard problems
Even with newer models, users still need to review, fix, and answer for the output. The lack of a clear owner for errors fuels skepticism.

“The issue with LLM’s is trust… who is accountable?” – jygg4
“We need to convince customers that we have the right technology… accountability is not easy.” – marcuschong

These three themes—debate over true model progress, the primacy of tooling, and ongoing trust/accountability concerns—dominate the discussion.


🚀 Project Ideas

[Deterministic Code Review Bot]

Summary

  • A service that ingests a repository, runs reproducible LLM reviews with traceable rationales, and outputs deterministic feedback, eliminating run‑to‑run randomness in review output.
  • Guarantees consistent, actionable comments for every pull request.

Details

| Key | Value |
|-----|-------|
| Target Audience | Engineering teams & codebase maintainers |
| Core Feature | Deterministic review generation with audit trail |
| Tech Stack | Backend: Python, FastAPI; LLM hosting via Anthropic Claude; Persistent DB (Postgres); Front‑end: React |
| Difficulty | Medium |
| Monetization | Revenue-ready: $20/mo per user |

Notes

  • HN users repeatedly cite “random failures” and “need for reliable review” as pain points – this directly addresses that.
  • Offers practical utility by reducing manual review load and enabling automated CI checks; a minimal sketch of the determinism layer follows.
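
The sketch below assumes a hypothetical `llm_review` wrapper (with temperature and seed pinned) and uses SQLite as a stand-in for the Postgres store named in the stack above:

```python
# Determinism layer sketch: pin sampling parameters and cache reviews
# by a content hash, so an identical diff always returns the identical,
# audited review. `llm_review` is a hypothetical model-API wrapper.
import hashlib
import json
import sqlite3

def llm_review(diff: str, model: str) -> str:
    """Hypothetical model call; the real wrapper pins temperature and seed."""
    raise NotImplementedError

def review_key(diff: str, model: str, prompt_version: str) -> str:
    """Stable cache key: same inputs always map to the same stored review."""
    payload = json.dumps({"diff": diff, "model": model, "prompt": prompt_version})
    return hashlib.sha256(payload.encode()).hexdigest()

def deterministic_review(db: sqlite3.Connection, diff: str,
                         model: str = "claude", prompt_version: str = "v1") -> str:
    db.execute("CREATE TABLE IF NOT EXISTS reviews (key TEXT PRIMARY KEY, body TEXT)")
    key = review_key(diff, model, prompt_version)
    row = db.execute("SELECT body FROM reviews WHERE key = ?", (key,)).fetchone()
    if row:
        return row[0]  # replay the audited review verbatim
    body = llm_review(diff, model)
    db.execute("INSERT INTO reviews VALUES (?, ?)", (key, body))
    db.commit()
    return body
```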

[Persistent Multi‑File Agent Orchestrator]

Summary

  • An orchestration platform that maintains state across many files, automatically resolves merge conflicts, and tracks iterative changes without losing context.
  • Eliminates the “multi‑file editing nightmare” reported by many developers.

Details

| Key | Value |
|-----|-------|
| Target Audience | Full‑stack developers & DevOps teams |
| Core Feature | Persistent repo‑wide context and conflict‑aware editing |
| Tech Stack | Backend: Go + gRPC; Persistence: SQLite with versioning; Front‑end: VS Code extension; Cloud: AWS Lambda |
| Difficulty | High |
| Monetization | Revenue-ready: $30/mo per team |

Notes

  • Commenters like “mountainriver” and “davecoffin” stress the need for seamless multi‑file changes – this solves that.
  • Provides clear practical benefit by cutting down on back‑and‑forth edits and reducing production bugs; a minimal sketch of the conflict check follows.
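
The stack above names Go, but the core staleness check is language-agnostic; this Python sketch records the content hash each edit was based on and rejects edits to files that changed underneath the agent:

```python
# Conflict-aware editing sketch (the idea's stack specifies Go; Python
# is used here only for brevity): track the version each edit was based
# on and refuse edits to files that changed since the agent read them.
import hashlib
from pathlib import Path

def digest(text: str) -> str:
    return hashlib.sha256(text.encode()).hexdigest()

class RepoState:
    """Persistent record of the file versions the agent has read."""
    def __init__(self) -> None:
        self.seen: dict[str, str] = {}  # path -> content hash at read time

    def read(self, path: str) -> str:
        text = Path(path).read_text()
        self.seen[path] = digest(text)
        return text

    def apply_edit(self, path: str, new_text: str) -> bool:
        current = Path(path).read_text()
        if digest(current) != self.seen.get(path):
            return False  # file changed since the agent read it: conflict
        Path(path).write_text(new_text)
        self.seen[path] = digest(new_text)
        return True
```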

[Benchmark‑Driven LLM Reliability Platform]

Summary

  • SaaS that runs standardized stress tests (e.g., multi‑file rename, edge‑case logic bugs) on any LLM and returns pass/fail scores with detailed failure diagnostics.
  • Gives users objective evidence of model improvement beyond anecdotal claims.

Details

| Key | Value |
|-----|-------|
| Target Audience | Engineering managers & quality assurance teams |
| Core Feature | Automated regression benchmark suite with actionable failure reports |
| Tech Stack | Backend: Node.js + Docker; Test harness: custom Python scripts; UI: Vue.js; Analytics: ElasticSearch |
| Difficulty | Medium |
| Monetization | Revenue-ready: $15/mo per seat |

Notes

  • Directly responds to concerns from “hrmtst93837” and “wongarsu” about the lack of rigorous evals – this provides the missing data.
  • Enables discussion by offering quantifiable metrics that HN users can cite; a minimal harness sketch follows.
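
The table names custom Python scripts for the test harness; the sketch below shows the shape of one, where each case pairs a prompt with a programmatic grader and failures carry a diagnostic rather than a bare fail. `run_model` is a hypothetical adapter for whichever LLM is under test:

```python
# Benchmark harness sketch: each case is a prompt plus a programmatic
# check; results include a truncated diagnostic on failure so reports
# are actionable, not just pass/fail counts.
from dataclasses import dataclass
from typing import Callable

@dataclass
class BenchCase:
    name: str
    prompt: str
    check: Callable[[str], bool]  # programmatic grader for the model output

def run_model(prompt: str) -> str:
    """Hypothetical adapter around whichever LLM is under test."""
    raise NotImplementedError

def run_suite(cases: list[BenchCase]) -> list[dict]:
    results = []
    for case in cases:
        output = run_model(case.prompt)
        ok = case.check(output)
        results.append({"case": case.name, "pass": ok,
                        "diagnostic": None if ok else output[:500]})
    return results

# Example standardized stress test: a multi-file rename must update every call site.
CASES = [BenchCase(
    name="multi_file_rename",
    prompt="Rename function foo to bar across a.py and b.py; print both files.",
    check=lambda out: "def bar" in out and "foo(" not in out,
)]
```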

[Constrained Refactoring Assistant]

Summary

  • An IDE plugin that guides LLMs through constrained refactorings, enforcing unit‑test coverage and static‑analysis rules before accepting changes.
  • Tackles code‑quality decay and “copy‑paste” anti‑patterns highlighted by several commenters.

Details

| Key | Value |
|-----|-------|
| Target Audience | Individual developers & small teams using VS Code or JetBrains |
| Core Feature | Guided refactoring with test‑validation loop |
| Tech Stack | Extension: TypeScript; Backend checker: ESLint + PyTest; Integration: VS Code API |
| Difficulty | Low |
| Monetization | Hobby |

Notes

  • Appeals to “jygg4” and “sho_hn”, who emphasize the need for reliable refactoring and code‑quality checks.
  • Offers immediate utility by helping users avoid tech debt without requiring deep LLM expertise; a minimal sketch of the validation gate follows.
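
A minimal sketch of that gate, assuming pytest and ESLint are available on PATH and that the propose/apply/rollback hooks are supplied by the plugin:

```python
# Test-validation gate sketch: an LLM-proposed refactor is accepted
# only if the test suite and static analysis both pass afterwards;
# otherwise the working tree is rolled back.
import subprocess

def checks_pass(repo_dir: str) -> bool:
    """Run the project's gates; any nonzero exit rejects the refactor."""
    for cmd in (["pytest", "-q"], ["npx", "eslint", "."]):
        if subprocess.run(cmd, cwd=repo_dir).returncode != 0:
            return False
    return True

def attempt_refactor(repo_dir: str, propose, apply_patch, rollback) -> bool:
    """propose/apply_patch/rollback are hooks supplied by the IDE plugin."""
    patch = propose()        # e.g., ask the model for a constrained diff
    apply_patch(patch)
    if checks_pass(repo_dir):
        return True          # tests and lint green: accept the refactor
    rollback()               # any gate failed: revert the working tree
    return False
```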
