Project ideas from Hacker News discussions.

Senior SWE-Bench: open-source benchmark that assesses agents as senior engineers

📝 Discussion Summary

Three dominant themes from the discussion

  1. Questioning the validity & fairness of “senior‑level” benchmarks
    “Why didn't they just make it ‘Staff SWE‑Bench’, would be much better smh.” (danpalmer)
    The community argues that current harnesses are too ad‑hoc and don’t capture the breadth of senior work (planning, context, stakeholder input). Without clearer objectives, comparisons feel “subjective” and don’t reflect real engineering seniority.

  2. Human‑in‑the‑loop & asking for clarification are essential for higher performance
    “I wonder if a model could score higher if it had a human at its disposal?” (lacunary)
    “A model that can ask questions or ask for help when in doubt is indeed a major feat.” (sinuhe69)
    Many users note that allowing an LLM to request or accept human‑provided input (queries, context, feedback) can dramatically improve outcomes, yet current benchmarks rarely enable this.

  3. The “taste” of code is treated as a vague, subjective metric rather than an objective quality indicator
    “Taste is just quality by instinct.” (FeepingCreature)
    Opinions about “tasteful” or “elegant” code often boil down to gut feelings, leading to cargo‑culting and endless bikeshedding. Critics stress that real benchmarking should focus on measurable results (functionality, maintainability, bug count) instead of aesthetic judgments.


🚀 Project Ideas

ClarifyAI

Summary

  • Provides an interactive layer that automatically detects ambiguity in LLM prompts and generates targeted clarification questions for the user.
  • Solves the recurring frustration of LLMs guessing incorrectly when faced with underspecified requirements.

Details

Key              Value
Target Audience  AI developers, research labs, LLM product teams
Core Feature     Real‑time clarification engine that suggests “Ask me about X?” prompts and lets users answer inline
Tech Stack       Python (FastAPI), React, PostgreSQL, OpenAI‑compatible API
Difficulty       Medium
Monetization     Revenue-ready: Subscription $12/mo per user

Notes

  • HN commenters said “a model that can ask questions or ask for help when in doubt is indeed a major feat” and “models should ask for human‑in‑the‑loop input”.
  • Potential for integration into existing pipelines to reduce re‑work and improve benchmark reliability.
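
The ambiguity-detection idea can be illustrated with a deliberately simple sketch. It assumes a hand-written table of vague terms (`VAGUE_TERMS`) and a function name (`clarification_questions`) that are both hypothetical; a real clarification engine would more likely use an LLM or a trained classifier than keyword matching.

```python
import re
from typing import List

# Hypothetical lookup of vague terms and the question each one should trigger.
VAGUE_TERMS = {
    "fast": "What latency or throughput target counts as fast?",
    "scalable": "What load (users, requests per second) should it scale to?",
    "simple": "Which constraints define simple here (dependencies, API surface)?",
    "soon": "What is the deadline?",
}


def clarification_questions(prompt: str) -> List[str]:
    """Return one question per vague term found, in order of first appearance."""
    words = re.findall(r"[a-z]+", prompt.lower())
    seen: List[str] = []
    for word in words:
        if word in VAGUE_TERMS and word not in seen:
            seen.append(word)
    return [VAGUE_TERMS[word] for word in seen]
```

Calling `clarification_questions("Make the API fast and scalable")` returns the two questions for "fast" and "scalable"; a fully specified prompt returns an empty list, which a caller could treat as "no clarification needed".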

ISOScore

Summary

  • An automated static‑analysis service that maps codebases to ISO 5055 quality criteria and outputs an objective maintainability score.
  • Provides a standardized, reproducible benchmark that replaces subjective “taste” discussions with measurable metrics.

Details

Key              Value
Target Audience  Engineering managers, open‑source maintainers, code‑review bots
Core Feature     Full repo scan → ISO 5055 compliance report with defect density, cyclomatic complexity, and readability ratings
Tech Stack       Rust backend, PostgreSQL, Docker, React dashboard
Difficulty       High
Monetization     Revenue-ready: Tiered SaaS $29/mo (basic) / $99/mo (enterprise)

Notes

  • Referenced by HN users seeking “an objective benchmark” and citing ISO 5055 as a concrete standard.
  • Offers immediate utility for CI pipelines and quality gates, sparking discussion on replacing vague “taste” assessments.
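
One of the metrics named in the Core Feature row, cyclomatic complexity, can be approximated in a few lines. The sketch below is only an illustration in Python using the standard `ast` module; it is not a mapping to ISO 5055, the choice of which node types count as decision points is a simplifying assumption, and the function name is hypothetical.

```python
import ast

# Node types treated as decision points in this simplified count (an assumption).
_BRANCH_NODES = (ast.If, ast.For, ast.While, ast.ExceptHandler, ast.IfExp)


def cyclomatic_complexity(source: str) -> int:
    """Approximate McCabe complexity: 1 + number of decision points."""
    tree = ast.parse(source)
    complexity = 1
    for node in ast.walk(tree):
        if isinstance(node, _BRANCH_NODES):
            complexity += 1
        elif isinstance(node, ast.BoolOp):
            # `a and b and c` has three operands and adds two extra paths.
            complexity += len(node.values) - 1
    return complexity
```

A CI quality gate could run this over each changed file and fail the build above a team-chosen threshold; the threshold itself is a policy decision that the metric does not supply.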

TasteMeter

Summary

  • AI‑driven code‑review API that evaluates “taste” and elegance using embedding similarity, design‑pattern detection, and heuristic rules.
  • Turns subjective design judgments into quantifiable feedback that helps teams refactor toward better maintainability.

Details

Key              Value
Target Audience  Open‑source contributors, freelance developers, dev‑tool platforms
Core Feature     Upload snippet → receive a “taste score”, actionable refactor suggestions, and risk warnings
Tech Stack       Go microservice, TensorFlow embeddings, ElasticSearch, GraphQL gateway
Difficulty       Medium
Monetization     Revenue-ready: Pay‑per‑review $0.05/query or $15/mo for unlimited access

Notes

  • Echoes HN sentiments like “taste is just quality by instinct” and “subjective but useful feedback”.
  • Creates a discussion platform around measurable code aesthetics while delivering practical improvement suggestions.
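
The embedding-similarity part of the idea might look roughly like the sketch below. It assumes a toy bag-of-tokens vector in place of a learned embedding, and a set of reference snippets ("exemplars") that a team considers well written; `embed`, `cosine`, and `taste_score` are hypothetical names, not part of any existing API.

```python
import math
import re
from collections import Counter
from typing import List


def embed(code: str) -> Counter:
    """Toy stand-in for a learned embedding: counts of identifier-like tokens."""
    return Counter(re.findall(r"[A-Za-z_]\w*", code))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def taste_score(snippet: str, exemplars: List[str]) -> float:
    """Highest similarity between the snippet and any exemplar, in [0, 1]."""
    vector = embed(snippet)
    return max((cosine(vector, embed(x)) for x in exemplars), default=0.0)
```

Taking the maximum over exemplars rewards resemblance to any one accepted style rather than an average of all of them. Token overlap says nothing about design quality on its own, so a score like this would at most be one input alongside the pattern detection and heuristic rules mentioned in the summary.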
