SWE-bench Verified no longer measures frontier coding capabilities

📝 Discussion Summary

1. Localization / Translation UX Friction

"I don't understand these websites which force translation to my native language... where is the button for disabling it?" – w4yai
"‘codage de pointe’ sounds so weird and cringe in French." – w4yai

2. Benchmark Contamination & Moving Goalposts

"This feels very much like 'we are now moving the goal posts'." – 1a527dd5
"We also found evidence that models that have seen the problems during training are more likely to succeed, because they have additional information needed to pass the underspecified tests." – embedding‑shape

3. Benchmark Saturation & Goodhart’s Law

"Benchmarks essentially aren't, for practical concerns anyways." – embedding‑shape
"Is this saying a quarter of the questions and answers were wrong, this whole time?!" – vintagedave

4. Push for Private / Custom Evaluations

"Spend a hour or an afternoon creating your own eval harness with problems or workloads from your private repos or personal projects." – dannyw
"I made Zork bench… it’s deterministic. Therefore it should be easy for an LLM to play and complete. Yet they don’t." – mnky9800n

🚀 Project Ideas

LinguaSwitch Browser Extension

Summary

Users frequently encounter automatic translations they don’t want and lack a clear way to disable them.
Existing solutions require manual toggles per site; a universal one‑click toggle would be far more convenient.

Details

Key	Value
Target Audience	Web browser users, multilingual readers, developers who need the original-language view
Core Feature	One‑click toggle to disable automatic translation and force the page’s original language; optional per‑site whitelisting
Tech Stack	Chrome/Firefox extension (Manifest V3), JavaScript, Localized UI, optional cloud sync for preferences
Difficulty	Low
Monetization	Revenue-ready: Freemium (basic free, premium adds per‑site whitelisting & custom hotkeys)

Notes

Directly addresses w4yai’s complaint about “no button for disabling” translation and the desire for a native override.
Simple UI aligns with HN users’ frustration and would be a quick win for browser vendors.

OpenBenchmark Hub

Summary

A community‑driven repository of private, auditable coding challenge suites that are updated regularly to avoid contamination.
Enables contributors to submit tasks, vote on quality, and automatically run them against any LLM.

Details

Key	Value
Target Audience	LLM developers, researchers, hobbyists seeking trustworthy evaluation
Core Feature	Host, version, and run community‑submitted benchmark tasks; CI detects data leakage; multi‑suite scoring
Tech Stack	Django backend, Docker containers for task execution, GitHub API integration, Markdown task descriptions
Difficulty	Medium
Monetization	Revenue-ready: Subscription for teams (private repo access + priority support) + sponsorship tiers

Notes

Provides the “manually override” style solution HN participants called for, by letting users choose or create their own benchmarks instead of relying on saturated public ones.
Encourages collaborative standards and reduces the “goalpost moving” critique.
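
The "CI detects data leakage" feature is not specified further in the idea above. One common heuristic is to measure how many of a task's word n‑grams also appear in a reference corpus. The sketch below is only an illustration under that assumption: the function names, the whitespace tokenization, the n‑gram size of 8, and the 0.5 threshold are all hypothetical choices, not part of the proposal.

```python
def ngrams(text, n=8):
    """Return the set of lowercase word n-grams in text (whitespace tokenization)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(task_text, corpus_text, n=8):
    """Fraction of the task's n-grams that also occur in the corpus."""
    task = ngrams(task_text, n)
    if not task:
        # Task is shorter than n tokens: nothing to compare.
        return 0.0
    return len(task & ngrams(corpus_text, n)) / len(task)


def is_possibly_leaked(task_text, corpus_text, n=8, threshold=0.5):
    """Flag a task when its overlap ratio meets the (assumed) threshold."""
    return overlap_ratio(task_text, corpus_text, n) >= threshold
```

A high ratio only suggests that the task text has been published somewhere in the corpus; it does not prove that any particular model was trained on it.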

AuditLens Benchmark Auditing SaaS

Summary

Transparent audit tool that scans public benchmarks (e.g., SWE‑bench) for flawed test cases, contamination, and underspecified tasks.
Generates human‑readable reports showing confidence intervals and uncertainty.

Details

Key	Value
Target Audience	Model providers, investors, regulators, journalists evaluating AI capabilities
Core Feature	Upload benchmark dataset; run statistical checks; flag ambiguous or leaking tests; output audit report
Tech Stack	Python (pandas, NumPy), Streamlit UI, PostgreSQL backend
Difficulty	High
Monetization	Revenue-ready: SaaS tiered pricing per audit volume and custom reporting

Notes

Tackles the “without bringing proof” criticism by offering independently verified analysis of benchmark validity.
Meets HN demand for more rigorous scrutiny of benchmark claims and contamination evidence.
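
The idea does not say how its confidence intervals would be computed. For a benchmark pass rate, one standard option is the Wilson score interval for a binomial proportion. The sketch below assumes that choice; `wilson_interval` is a hypothetical helper name, and the default `z=1.96` corresponds to an approximate 95% interval.

```python
import math


def wilson_interval(passed, total, z=1.96):
    """Wilson score interval for a pass rate of passed/total tasks."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= passed <= total:
        raise ValueError("passed must be between 0 and total")
    p = passed / total
    z2 = z * z
    denom = 1 + z2 / total
    center = (p + z2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z2 / (4 * total * total)) / denom
    # Clamp to [0, 1] to absorb floating-point error at the extremes.
    return max(0.0, center - half), min(1.0, center + half)
```

With 50 passes out of 100 tasks, the interval is roughly 0.40 to 0.60, which shows how wide the uncertainty on a reported score can be when a suite is small.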

ReBench Private Coding Challenges

Summary

Generates deterministic, out‑of‑distribution coding challenges directly from a developer’s own repositories.
Includes robustness tests such as injecting 200k distracting tokens to evaluate context handling.

Details

Key	Value
Target Audience	Engineering teams, AI product managers, LLM API providers
Core Feature	Scans codebase, extracts representative issues, builds automated test harnesses with verification, runs them on any model
Tech Stack	Node.js CLI, Python verifier scripts, GitHub Actions for CI, Docker sandbox, SQLite metadata store
Difficulty	Medium
Monetization	Revenue-ready: Pay‑per‑run API + enterprise subscription for custom harness generation

Notes

Aligns with calls for private, non‑contaminated benchmarks and the need to measure real‑world productivity beyond public suites.
Offers a practical workflow that mitigates the “over‑fitting to benchmarks” problem highlighted in the discussion.
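
The distractor-injection robustness test mentioned in the summary could be approximated as follows. This is a sketch under stated assumptions: the filler vocabulary, the function names, and the use of whitespace-separated words as a stand-in for model tokens are all hypothetical, and a real harness would count tokens with the target model's tokenizer. A seeded generator keeps the output deterministic, in line with the idea's emphasis on deterministic challenges.

```python
import random

# Hypothetical filler vocabulary; any irrelevant text source would do.
FILLER_WORDS = ("lorem", "ipsum", "dolor", "sit", "amet", "consectetur")


def make_distractor(n_tokens, seed=0):
    """Build n_tokens filler words, reproducibly for a given seed."""
    rng = random.Random(seed)
    return [rng.choice(FILLER_WORDS) for _ in range(n_tokens)]


def inject_distractors(task, n_tokens, seed=0, position=0.5):
    """Surround the task with filler; position sets the share placed before it."""
    if not 0.0 <= position <= 1.0:
        raise ValueError("position must be between 0 and 1")
    words = make_distractor(n_tokens, seed)
    cut = int(len(words) * position)
    before = " ".join(words[:cut])
    after = " ".join(words[cut:])
    # Skip empty parts so a zero-length distractor returns the task unchanged.
    return "\n\n".join(part for part in (before, task, after) if part)
```

Calling `inject_distractors(task, 200_000)` would bury the task in the middle of about 200k filler words; varying `position` lets the harness check whether a model's accuracy depends on where the task sits in the context.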
