Senior SWE-Bench: open-source benchmark that assesses agents as senior engineers

📝 Discussion Summary

Three dominant themes from the discussion

Questioning the validity & fairness of “senior‑level” benchmarks
“Why didn't they just make it ‘Staff SWE‑Bench’, would be much better smh.” – danpalmer
Commenters argue that current harnesses are too ad hoc and miss much of what senior work involves (planning, gathering context, stakeholder input). Without clearer objectives, comparisons feel “subjective” and do not reflect real engineering seniority.
Human‑in‑the‑loop & asking for clarification are essential for higher performance
“I wonder if a model could score higher if it had a human at its disposal?” – lacunary
“A model that can ask questions or ask for help when in doubt is indeed a major feat.” – sinuhe69
Many users note that letting an LLM request or accept human‑provided input (queries, context, feedback) could improve outcomes substantially, yet current benchmarks rarely allow it.
The “taste” of code is treated as a vague, subjective metric rather than an objective quality indicator
“Taste is just quality by instinct.” – FeepingCreature
Opinions about “tasteful” or “elegant” code often boil down to gut feelings, leading to cargo‑culting and endless bikeshedding. Critics stress that real benchmarking should focus on measurable results (functionality, maintainability, bug count) instead of aesthetic judgments.

🚀 Project Ideas

Provides an interactive layer that automatically detects ambiguity in LLM prompts and generates targeted clarification questions for the user.
Solves the recurring frustration of LLMs guessing incorrectly when faced with underspecified requirements.

Key	Value
Target Audience	AI developers, research labs, LLM product teams
Core Feature	Real‑time clarification engine that suggests “Ask me about X?” prompts and lets users answer inline
Tech Stack	Python (FastAPI), React, PostgreSQL, OpenAI‑compatible API
Difficulty	Medium
Monetization	Revenue-ready: Subscription $12/mo per user

HN commenters said “a model that can ask questions or ask for help when in doubt is indeed a major feat” and “models should ask for human‑in‑the‑loop input”.
Potential for integration into existing pipelines to reduce re‑work and improve benchmark reliability.
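
As a rough illustration of the clarification idea above, here is a minimal sketch, assuming ambiguity can be approximated by spotting vague words in a prompt. The word list, the `Clarification` type, and `find_clarifications` are all hypothetical names invented for this example; a real engine would more likely ask a model to find the gaps.

```python
import re
from dataclasses import dataclass

# Hypothetical heuristic: words that often signal an underspecified requirement.
VAGUE_TERMS = {"fast", "better", "simple", "some", "appropriate", "scalable"}


@dataclass
class Clarification:
    term: str      # the vague word found in the prompt
    question: str  # the question to show the user inline


def find_clarifications(prompt: str) -> list[Clarification]:
    """Return one clarification question per distinct vague term, in order of appearance."""
    seen: set[str] = set()
    results: list[Clarification] = []
    for word in re.findall(r"[a-z]+", prompt.lower()):
        if word in VAGUE_TERMS and word not in seen:
            seen.add(word)
            results.append(Clarification(word, f'Ask me about "{word}"?'))
    return results


if __name__ == "__main__":
    for item in find_clarifications("Make the API fast and add some caching."):
        print(item.question)
```

A prompt with concrete requirements (for example, a specific status code and condition) yields an empty list under this heuristic, so the user is only interrupted when something looks underspecified.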

An automated static‑analysis service that maps codebases to ISO 5055 quality criteria and outputs an objective maintainability score.
Provides a standardized, reproducible benchmark that replaces subjective “taste” discussions with measurable metrics.

Key	Value
Target Audience	Engineering managers, open‑source maintainers, code‑review bots
Core Feature	Full repo scan → ISO 5055 compliance report with defect density, cyclomatic complexity, and readability ratings
Tech Stack	Rust backend, PostgreSQL, Docker, React dashboard
Difficulty	High
Monetization	Revenue-ready: Tiered SaaS $29/mo (basic) / $99/mo (enterprise)

Referenced by HN users seeking “an objective benchmark” and citing ISO 5055 as a concrete standard.
Offers immediate utility for CI pipelines and quality gates, sparking discussion on replacing vague “taste” assessments.

AI‑driven code‑review API that evaluates “taste” and elegance using embedding similarity, design‑pattern detection, and heuristic rules.
Turns subjective design judgments into quantifiable feedback that helps teams refactor toward better maintainability.

Key	Value
Target Audience	Open‑source contributors, freelance developers, dev‑tool platforms
Core Feature	Upload snippet → receive a “taste score”, actionable refactor suggestions, and risk warnings
Tech Stack	Go microservice, TensorFlow embeddings, ElasticSearch, GraphQL gateway
Difficulty	Medium
Monetization	Revenue-ready: Pay‑per‑review $0.05/query or $15/mo for unlimited access

Echoes HN sentiments like “taste is just quality by instinct” and “subjective but useful feedback”.
Creates a discussion platform around measurable code aesthetics while delivering practical improvement suggestions.
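
One way the "embedding similarity plus heuristic rules" combination could fit together is sketched below. Everything here is an assumption for illustration: `toy_embedding` is a bag-of-identifiers stand-in for a learned embedding model, and the long-line penalty is just one example of a heuristic rule, not a proposed definition of taste.

```python
import math
import re
from collections import Counter


def toy_embedding(code: str) -> Counter:
    """Stand-in for a learned embedding: counts of identifier-like tokens."""
    return Counter(re.findall(r"[A-Za-z_]\w*", code))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors (0.0 if either is empty)."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def taste_score(snippet: str, reference: str, max_line_length: int = 100) -> float:
    """Similarity to a reference snippet, minus a capped penalty for over-long lines."""
    similarity = cosine_similarity(toy_embedding(snippet), toy_embedding(reference))
    long_lines = sum(len(line) > max_line_length for line in snippet.splitlines())
    penalty = min(0.5, 0.1 * long_lines)  # heuristic rule: 0.1 per long line, capped
    return max(0.0, similarity - penalty)
```

Because the score depends entirely on which reference code is chosen, it inherits whatever "taste" that reference encodes, which is the same subjectivity concern raised in the discussion.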