Project ideas from Hacker News discussions.

Show HN: Agent-skills-eval – Test whether Agent Skills improve outputs

📝 Discussion Summary

1. Requests for published results

"Are there any published results gathered using this?" – egeozcan

2. Documentation inaccuracies & outdated model references

"The example model in the documentation is 4o‑mini, you might want to update that to a more recent model." – ssgodderidge
"It’s an artifact of the documentation being AI generated, they usually pick gpt‑4 era models, without giving it further thought." – stingraycharles

3. Clarification on skill loading mechanics

"The skill is deterministically added to the prompt by the harness before the target model is invoked. There is no “choosing” to load a skill." – block_dagger These three themes capture the main points of the discussion: curiosity about published outcomes, criticism of stale AI‑generated documentation, and clarification that skill inclusion is automatic rather than user‑selected.


🚀 Project Ideas

Published Results Tracker & Verifier

Summary

  • A dashboard that aggregates and validates published model performance results across papers, blogs, and open‑source repos.
  • Core value: eliminates misinformation by flagging outdated or inconsistent claims in real time.

Details

| Key | Value |
| --- | --- |
| Target Audience | Researchers, engineers, and product teams that rely on up‑to‑date model benchmarks. |
| Core Feature | Automatic crawling of arXiv, Hugging Face, blog posts; cross‑checking reported metrics; generating a verifiable “truth badge” for each model (sketched below). |
| Tech Stack | Python, Scrapy, FastAPI, PostgreSQL, Elasticsearch, Docker. |
| Difficulty | Medium |
| Monetization | Revenue‑ready: subscription tier, $19/mo per team. |
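To make the cross‑checking step concrete, here is a minimal sketch assuming the crawler yields simple (model, benchmark, score) claims; `assign_badges` and the data shapes are hypothetical, not part of any existing tool:

```python
# Minimal sketch: claims that disagree with the consensus by more than a
# tolerance get flagged instead of a "truth badge". All names and data
# shapes here are hypothetical illustrations.
from collections import defaultdict
from statistics import median

def assign_badges(claims, tolerance=0.02):
    """claims: iterable of (model, benchmark, score) tuples gathered from
    papers, blogs, and repos. Returns {(model, benchmark): "verified"|"flagged"}."""
    by_key = defaultdict(list)
    for model, benchmark, score in claims:
        by_key[(model, benchmark)].append(score)

    badges = {}
    for key, scores in by_key.items():
        center = median(scores)
        consistent = all(abs(s - center) <= tolerance for s in scores)
        badges[key] = "verified" if consistent else "flagged"
    return badges

badges = assign_badges([
    ("model-x", "MMLU", 0.82),  # hypothetical reported results
    ("model-x", "MMLU", 0.81),
    ("model-x", "MMLU", 0.71),  # outlier -> this claim set gets flagged
])
```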

Notes

  • HN commenters often complain about stale docs (e.g., “4o‑mini” still used) – this tool would surface such gaps instantly.
  • Could spark discussion on transparency standards for model reporting.

Live Documentation Generator for Model Families

Summary

  • An auto‑updating doc site that pulls live model catalogs from AI labs and replaces placeholder models with the latest released versions.
  • Core value: users never see outdated model names like “4o‑mini” or “2.5” again.

Details

| Key | Value |
| --- | --- |
| Target Audience | Documentation teams, open‑source project maintainers, AI educators. |
| Core Feature | Periodic API scraping of Gemini, Claude, GPT, etc.; generates markdown/API references with current model identifiers and usage tips (sketched below). |
| Tech Stack | Node.js, Puppeteer, GraphQL, MDX, CI/CD (GitHub Actions), Netlify. |
| Difficulty | Low |
| Monetization | Hobby |
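A minimal sketch of the refresh step, written in Python to keep all examples in one language even though the stack above suggests Node.js; the stale‑to‑current mapping is hypothetical and would in practice be built from each lab's public model‑catalog API on a schedule:

```python
# Minimal sketch: rewrite stale model identifiers in docs. The mapping
# below is hypothetical; a real service would populate it from live
# model catalogs rather than hard-coding it.
import pathlib
import re

REPLACEMENTS = {
    "old-model-mini": "new-model-mini",  # hypothetical identifiers
    "old-model-2.5": "new-model-3",
}

def refresh_doc(path: pathlib.Path) -> None:
    """Rewrite stale model names in a markdown file in place."""
    text = path.read_text(encoding="utf-8")
    for stale, current in REPLACEMENTS.items():
        text = re.sub(re.escape(stale), current, text)
    path.write_text(text, encoding="utf-8")

for md in pathlib.Path("docs").rglob("*.md"):
    refresh_doc(md)
```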

Notes

  • Directly addresses ssgodderidge’s concern about “4o‑mini” lingering in docs.
  • Could become a community‑driven plugin for popular tutorials.

Prompt Judge Auto‑Rater & Optimizer

Summary

  • A SaaS that auto‑scores judge prompts, suggests improvements, and runs regression tests to keep evaluation criteria stable across model upgrades.
  • Core value: saves time on manual prompt tuning and reduces noisy rankings.

Details

| Key | Value |
| --- | --- |
| Target Audience | ML engineers, AI benchmark designers, startup teams launching LLM‑based services. |
| Core Feature | Upload a judge prompt; the service runs it against a suite of models, outputs a consistency score, and offers automated rewrite suggestions (sketched below). |
| Tech Stack | Rust (for fast inference), Docker, PostgreSQL, FastAPI, React frontend. |
| Difficulty | High |
| Monetization | Revenue‑ready: usage‑based pricing, $0.01 per evaluation. |
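A minimal sketch of the core consistency metric, again in Python rather than the Rust suggested above; `call_model` is a hypothetical stand‑in for whatever inference API the service would use:

```python
# Minimal sketch: run one judge prompt across several models and repeats,
# then score how stable its verdicts are. call_model() is a hypothetical
# placeholder, not a real library call.
from statistics import pstdev

def call_model(model: str, prompt: str) -> float:
    """Hypothetical: returns the judge's numeric score for a fixed test case."""
    raise NotImplementedError

def consistency_score(judge_prompt: str, models: list[str], runs: int = 5) -> float:
    """Lower spread across models and repeats means a more stable judge
    prompt. Returns 1 / (1 + population std dev), so 1.0 is perfectly stable."""
    scores = [call_model(m, judge_prompt) for m in models for _ in range(runs)]
    return 1.0 / (1.0 + pstdev(scores))

# Usage: consistency_score(judge_prompt, ["model-a", "model-b"])
```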

Notes

  • Directly answers ianhxu’s question about “How do you iterate on the judge prompt?” – provides a closed loop.
  • Likely to generate buzz on HN as a productivity booster for LLM evaluation pipelines.
