Project ideas from Hacker News discussions.

Show HN: Agent-skills-eval – Test whether Agent Skills improve outputs

📝 Discussion Summary

1. Requests for published results

"Are there any published results gathered using this?" – egeozcan

2. Documentation inaccuracies & outdated model references

"The example model in the documentation is 4o‑mini, you might want to update that to a more recent model." – ssgodderidge
"It’s an artifact of the documentation being AI generated, they usually pick gpt‑4 era models, without giving it further thought." – stingraycharles

3. Clarification on skill loading mechanics

"The skill is deterministically added to the prompt by the harness before the target model is invoked. There is no “choosing” to load a skill." – block_dagger These three themes capture the main points of the discussion: curiosity about published outcomes, criticism of stale AI‑generated documentation, and clarification that skill inclusion is automatic rather than user‑selected.


🚀 Project Ideas

Published Results Tracker & Verifier

Summary

  • A dashboard that aggregates and validates published model performance results across papers, blogs, and open‑source repos.
  • Core value: eliminates misinformation by flagging outdated or inconsistent claims in real time.

Details

| Key | Value |
| --- | --- |
| Target Audience | Researchers, engineers, and product teams that rely on up‑to‑date model benchmarks. |
| Core Feature | Automatic crawling of arXiv, Hugging Face, blog posts; cross‑checking reported metrics; generating a verifiable “truth badge” for each model (sketched below). |
| Tech Stack | Python, Scrapy, FastAPI, PostgreSQL, Elasticsearch, Docker. |
| Difficulty | Medium |
| Monetization | Revenue‑ready: subscription tier, $19/mo per team. |
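To make the cross‑checking step concrete, here is a minimal sketch assuming the crawler yields simple (model, benchmark, score) claims; `assign_badges` and the data shapes are hypothetical, not part of any existing tool:

```python
# Minimal sketch: claims that disagree with the consensus by more than a
# tolerance get flagged instead of a "truth badge". All names and data
# shapes here are hypothetical illustrations.
from collections import defaultdict
from statistics import median

def assign_badges(claims, tolerance=0.02):
    """claims: iterable of (model, benchmark, score) tuples gathered from
    papers, blogs, and repos. Returns {(model, benchmark): "verified"|"flagged"}."""
    by_key = defaultdict(list)
    for model, benchmark, score in claims:
        by_key[(model, benchmark)].append(score)

    badges = {}
    for key, scores in by_key.items():
        center = median(scores)
        consistent = all(abs(s - center) <= tolerance for s in scores)
        badges[key] = "verified" if consistent else "flagged"
    return badges

badges = assign_badges([
    ("model-x", "MMLU", 0.82),  # hypothetical reported results
    ("model-x", "MMLU", 0.81),
    ("model-x", "MMLU", 0.71),  # outlier -> this claim set gets flagged
])
```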

Notes

  • HN commenters often complain about stale docs (e.g., “4o‑mini” still used) – this tool would surface such gaps instantly.
  • Could spark discussion on transparency standards for model reporting.

Live Documentation Generator for Model Families

Summary

  • An auto‑updating doc site that pulls live model catalogs from AI labs and replaces placeholder models with the latest released versions.
  • Core value: users never see outdated model names like “4o‑mini” or “2.5” again.

Details

| Key | Value |
| --- | --- |
| Target Audience | Documentation teams, open‑source project maintainers, AI educators. |
| Core Feature | Periodic API scraping of Gemini, Claude, GPT, etc.; generates markdown/API references with current model identifiers and usage tips (sketched below). |
| Tech Stack | Node.js, Puppeteer, GraphQL, MDX, CI/CD (GitHub Actions), Netlify. |
| Difficulty | Low |
| Monetization | Hobby |
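A minimal sketch of the refresh step, written in Python to keep all examples in one language even though the stack above suggests Node.js; the stale‑to‑current mapping is hypothetical and would in practice be built from each lab's public model‑catalog API on a schedule:

```python
# Minimal sketch: rewrite stale model identifiers in docs. The mapping
# below is hypothetical; a real service would populate it from live
# model catalogs rather than hard-coding it.
import pathlib
import re

REPLACEMENTS = {
    "old-model-mini": "new-model-mini",  # hypothetical identifiers
    "old-model-2.5": "new-model-3",
}

def refresh_doc(path: pathlib.Path) -> None:
    """Rewrite stale model names in a markdown file in place."""
    text = path.read_text(encoding="utf-8")
    for stale, current in REPLACEMENTS.items():
        text = re.sub(re.escape(stale), current, text)
    path.write_text(text, encoding="utf-8")

for md in pathlib.Path("docs").rglob("*.md"):
    refresh_doc(md)
```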

Notes

  • Directly addresses ssgodderidge’s concern about “4o‑mini” lingering in docs.
  • Could become a community‑driven plugin for popular tutorials.

Prompt Judge Auto‑Rater & Optimizer

Summary

  • A SaaS that auto‑scores judge prompts, suggests improvements, and runs regression tests to keep evaluation criteria stable across model upgrades.
  • Core value: saves time on manual prompt tuning and reduces noisy rankings.

Details

| Key | Value |
| --- | --- |
| Target Audience | ML engineers, AI benchmark designers, startup teams launching LLM‑based services. |
| Core Feature | Upload a judge prompt; the service runs it against a suite of models, outputs a consistency score, and offers automated rewrite suggestions (sketched below). |
| Tech Stack | Rust (for fast inference), Docker, PostgreSQL, FastAPI, React frontend. |
| Difficulty | High |
| Monetization | Revenue‑ready: usage‑based pricing, $0.01 per evaluation. |
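A minimal sketch of the core consistency metric, again in Python rather than the Rust suggested above; `call_model` is a hypothetical stand‑in for whatever inference API the service would use:

```python
# Minimal sketch: run one judge prompt across several models and repeats,
# then score how stable its verdicts are. call_model() is a hypothetical
# placeholder, not a real library call.
from statistics import pstdev

def call_model(model: str, prompt: str) -> float:
    """Hypothetical: returns the judge's numeric score for a fixed test case."""
    raise NotImplementedError

def consistency_score(judge_prompt: str, models: list[str], runs: int = 5) -> float:
    """Lower spread across models and repeats means a more stable judge
    prompt. Returns 1 / (1 + population std dev), so 1.0 is perfectly stable."""
    scores = [call_model(m, judge_prompt) for m in models for _ in range(runs)]
    return 1.0 / (1.0 + pstdev(scores))

# Usage: consistency_score(judge_prompt, ["model-a", "model-b"])
```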

Notes

  • Directly answers ianhxu’s question about “How do you iterate on the judge prompt?” – provides a closed loop.
  • Likely to generate buzz on HN as a productivity booster for LLM evaluation pipelines.
