1. “Did the models actually get better?”
Commenters are split between anecdotal reports of a real jump (especially around Opus 4.5/4.6) and METR data showing a flat, or even declining, trend when all models are lumped together; a toy illustration of the curve-fit claim follows the quotes below.
“I’m really seeing it as well.” – postflopclarity
“The study shows a flat line… if you exclude GPT‑5 it matches a logistic curve.” – wongarsu
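To make the logistic-curve claim concrete, here is a toy sketch of the comparison wongarsu is gesturing at: fit both a straight line and a logistic to a capability-over-time series and see which leaves smaller residuals. The data points are invented placeholders, not METR's numbers.

```python
# Toy comparison: does a capability-over-time series look linear or logistic?
# The (time, score) pairs below are invented for illustration only.
import numpy as np
from scipy.optimize import curve_fit

def logistic(t, L, k, t0):
    # Standard logistic curve: L / (1 + exp(-k * (t - t0)))
    return L / (1.0 + np.exp(-k * (t - t0)))

t = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0])       # hypothetical release times
score = np.array([0.05, 0.09, 0.18, 0.33, 0.52, 0.66, 0.73])  # hypothetical scores

# Fit the logistic and a straight line, then compare sum-of-squares residuals.
params, _ = curve_fit(logistic, t, score, p0=[1.0, 2.0, 1.5])
sse_logistic = float(np.sum((score - logistic(t, *params)) ** 2))

slope, intercept = np.polyfit(t, score, 1)
sse_linear = float(np.sum((score - (slope * t + intercept)) ** 2))

print(f"logistic SSE: {sse_logistic:.4f}   linear SSE: {sse_linear:.4f}")
```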
2. The harness is the real game‑changer
Many respondents argue that the leap in productivity comes from better tooling, agentic loops, and context management rather than from gains in the core model's reasoning; a minimal sketch of such a loop follows the quotes below.
“We’ve gotten better in harnessing not the models’ actual reasoning.” – jwpapi
“The practical capability jump is huge when you combine models with tool use, planning loops, and persistent context.” – idorozin
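In practice, the "harness" these commenters describe is roughly this loop: keep a persistent message history, let the model request tools, execute them, and feed the results back. The sketch below is hypothetical; `call_model` and the `search` tool are made-up stand-ins, stubbed so the example runs on its own, and do not represent any particular vendor's API.

```python
# Minimal agentic loop: persistent context + tool use + planning steps.
# `call_model` is a hypothetical stand-in for a chat API, stubbed here.
import json

def call_model(messages: list[dict]) -> dict:
    """Hypothetical model call. A real harness would hit an LLM API here;
    this stub requests one tool call, then answers."""
    if not any(m["role"] == "tool" for m in messages):
        return {"tool": "search", "args": {"query": "Opus 4.5 release notes"}}
    return {"answer": "Done: summarized the search results."}

TOOLS = {
    # Hypothetical tool: a real harness might wrap a search API or a shell.
    "search": lambda query: f"[stub results for {query!r}]",
}

def run_agent(task: str, max_steps: int = 5) -> str:
    messages = [{"role": "user", "content": task}]  # persistent context
    for _ in range(max_steps):
        reply = call_model(messages)
        if "answer" in reply:                       # model says it is finished
            return reply["answer"]
        result = TOOLS[reply["tool"]](**reply["args"])  # execute the tool call
        messages.append({"role": "assistant", "content": json.dumps(reply)})
        messages.append({"role": "tool", "content": result})
    return "Step limit reached without an answer."

print(run_agent("Summarize what changed in Opus 4.5"))
```

The point of the sketch is that every capability gain here (tool access, looping, memory of prior steps) lives in the harness code, not in the model weights.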
3. Trust, reliability, and accountability remain hard problems
Even with newer models, users still have to review and fix the output, and when something goes wrong it is unclear who takes the blame. The absence of a clear owner for errors fuels skepticism.
“The issue with LLM’s is trust… who is accountable?” – jygg4
“We need to convince customers that we have the right technology… accountability is not easy.” – marcuschong
These three themes—debate over true model progress, the primacy of tooling, and ongoing trust/accountability concerns—dominate the discussion.