Project ideas from Hacker News discussions.

SWE-bench Verified no longer measures frontier coding capabilities

📝 Discussion Summary

1. Localization/Translation UX Friction

"I don't understand these websites which force translation to my native language... where is the button for disabling it?" – w4yai
"‘codage de pointe’ sounds so weird and cringe in French." – w4yai

2. Benchmark Contamination & Moving Goalposts

"This feels very much like 'we are now moving the goal posts'." – 1a527dd5
"We also found evidence that models that have seen the problems during training are more likely to succeed, because they have additional information needed to pass the underspecified tests." – embedding‑shape

3. Benchmark Saturation & Goodhart’s Law

"Benchmarks essentially aren't, for practical concerns anyways." – embedding‑shape
"Is this saying a quarter of the questions and answers were wrong, this whole time?!" – vintagedave

4. Push for Private / Custom Evaluations

"Spend a hour or an afternoon creating your own eval harness with problems or workloads from your private repos or personal projects." – dannyw
"I made Zork bench… it’s deterministic. Therefore it should be easy for an LLM to play and complete. Yet they don’t." – mnky9800n


🚀 Project Ideas

LinguaSwitch Browser Extension

Summary

  • Users frequently encounter automatic translations they don’t want and lack a clear way to disable them.
  • Existing solutions require manual toggles per site; a universal one‑click toggle would be far more convenient.

Details

| Key | Value |
|-----|-------|
| Target Audience | Web browser users, multilingual readers, developers who need the original-language view |
| Core Feature | One‑click toggle to disable automatic translation and force the page’s original language; optional per‑site whitelisting |
| Tech Stack | Chrome/Firefox extension (Manifest V3), JavaScript, localized UI, optional cloud sync for preferences |
| Difficulty | Low |
| Monetization | Revenue-ready: Freemium (basic free, premium adds per‑site whitelisting & custom hotkeys) |

Notes

  • Directly addresses w4yai’s complaint about “no button for disabling” translation and the desire for a native override.
  • A simple UI answers the frustration HN users voiced and would be a quick win for browser vendors.

OpenBenchmark Hub

Summary

  • A community‑driven repository of private, auditable coding challenge suites that are updated regularly to avoid contamination.
  • Enables contributors to submit tasks, vote on quality, and automatically run them against any LLM.

Details

| Key | Value |
|-----|-------|
| Target Audience | LLM developers, researchers, hobbyists seeking trustworthy evaluation |
| Core Feature | Host, version, and run community‑submitted benchmark tasks; CI detects data leakage; multi‑suite scoring |
| Tech Stack | Django backend, Docker containers for task execution, GitHub API integration, Markdown task descriptions |
| Difficulty | Medium |
| Monetization | Revenue-ready: Subscription for teams (private repo access + priority support) + sponsorship tiers |

Notes

  • Provides the “manually override” style of solution HN participants called for, letting users choose or create their own benchmarks instead of relying on saturated public ones.
  • Encourages collaborative standards and reduces the “goalpost moving” critique.
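
The "CI detects data leakage" feature could start from something as simple as an n-gram overlap check between a submitted task and a reference corpus. The sketch below is only one possible approach, not a described part of the project: the function names, the 5-word window, and the 0.5 threshold are all assumptions chosen for illustration.

```python
import re


def word_ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """Return the set of lowercase word n-grams found in text."""
    words = re.findall(r"\w+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def leakage_score(task_text: str, corpus_text: str, n: int = 5) -> float:
    """Fraction of the task's n-grams that also occur in the corpus."""
    task_grams = word_ngrams(task_text, n)
    if not task_grams:
        return 0.0  # task shorter than n words: nothing to compare
    corpus_grams = word_ngrams(corpus_text, n)
    return len(task_grams & corpus_grams) / len(task_grams)


def flag_leaked_tasks(
    tasks: dict[str, str],
    corpus_text: str,
    threshold: float = 0.5,  # assumed cut-off, would need tuning
    n: int = 5,              # assumed window size
) -> list[str]:
    """Return the names of tasks whose overlap meets the threshold."""
    return sorted(
        name
        for name, text in tasks.items()
        if leakage_score(text, corpus_text, n) >= threshold
    )
```

A CI job could call `flag_leaked_tasks` on every submission and fail the build when the returned list is non-empty. Exact n-gram matching misses paraphrased leaks, so a real check would likely need fuzzier matching on top of this.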

AuditLens Benchmark Auditing SaaS

Summary

  • Transparent audit tool that scans public benchmarks (e.g., SWE‑bench) for flawed test cases, contamination, and underspecified tasks.
  • Generates human‑readable reports showing confidence intervals and uncertainty.

Details

| Key | Value |
|-----|-------|
| Target Audience | Model providers, investors, regulators, journalists evaluating AI capabilities |
| Core Feature | Upload benchmark dataset; run statistical checks; flag ambiguous or leaking tests; output audit report |
| Tech Stack | Python (pandas, NumPy), Streamlit UI, PostgreSQL backend |
| Difficulty | High |
| Monetization | Revenue-ready: SaaS tiered pricing per audit volume and custom reporting |

Notes

  • Tackles the “without bringing proof” criticism by offering independently verified analysis of benchmark validity.
  • Meets HN demand for more rigorous scrutiny of benchmark claims and contamination evidence.
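
For the confidence intervals mentioned in the summary, one plausible starting point is a Wilson score interval on a benchmark's pass rate. This is a sketch under assumptions: the choice of the Wilson interval, the 95% level (z = 1.96), and the names `wilson_interval` and `report_line` are illustrative and are not specified anywhere in the idea itself.

```python
import math


def wilson_interval(passed: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a pass rate (z = 1.96 is roughly 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= passed <= total:
        raise ValueError("passed must be between 0 and total")
    p = passed / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    # Clamp to [0, 1] to guard against floating-point overshoot.
    return max(0.0, center - half), min(1.0, center + half)


def report_line(name: str, passed: int, total: int) -> str:
    """Format one human-readable line of an audit report."""
    low, high = wilson_interval(passed, total)
    return (
        f"{name}: {passed}/{total} passed "
        f"({passed / total:.1%}, 95% CI {low:.1%} to {high:.1%})"
    )
```

Calling `report_line("demo-suite", 50, 100)` with these made-up counts yields a single line that states the pass rate together with its interval, which makes it visible when two models' scores are too close to distinguish on a small test set.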

ReBench Private Coding Challenges

Summary

  • Generates deterministic, out‑of‑distribution coding challenges directly from a developer’s own repositories.
  • Includes robustness tests such as injecting 200k distracting tokens to evaluate context handling.

Details

| Key | Value |
|-----|-------|
| Target Audience | Engineering teams, AI product managers, LLM API providers |
| Core Feature | Scans codebase, extracts representative issues, builds automated test harnesses with verification, runs them on any model |
| Tech Stack | Node.js CLI, Python verifier scripts, GitHub Actions for CI, Docker sandbox, SQLite metadata store |
| Difficulty | Medium |
| Monetization | Revenue-ready: Pay‑per‑run API + enterprise subscription for custom harness generation |

Notes

  • Aligns with calls for private, non‑contaminated benchmarks and the need to measure real‑world productivity beyond public suites.
  • Offers a practical workflow that mitigates the “over‑fitting to benchmarks” problem highlighted in the discussion.
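
The distractor-injection robustness test could be kept deterministic by seeding the random generator, so the same seed always produces the same padded prompt. The sketch below is a rough illustration only: `inject_distractors` is a hypothetical helper, and it approximates tokens as whitespace-separated words, which will not match any real model tokenizer.

```python
import random


def inject_distractors(
    task_prompt: str,
    distractor_pool: list[str],
    n_tokens: int,
    seed: int = 0,
) -> str:
    """Surround task_prompt with n_tokens of seeded, reproducible filler.

    Tokens are approximated as whitespace-separated words (an assumption).
    """
    snippets = [s.split() for s in distractor_pool if s.split()]
    if n_tokens > 0 and not snippets:
        raise ValueError("distractor_pool has no non-empty snippets")

    rng = random.Random(seed)  # local generator keeps runs reproducible
    filler: list[str] = []
    while len(filler) < n_tokens:
        words = rng.choice(snippets)
        filler.extend(words[: n_tokens - len(filler)])

    # Place the real task at a seeded position inside the filler.
    split_at = rng.randint(0, len(filler))
    before = " ".join(filler[:split_at])
    after = " ".join(filler[split_at:])
    return "\n\n".join(part for part in (before, task_prompt, after) if part)
```

Running the same task at several sizes (for example 0, 10,000, and 200,000 filler words) and comparing pass rates would show how much a model's result degrades as irrelevant context grows.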
