Project ideas from Hacker News discussions.

Claude Code daily benchmarks for degradation tracking

📝 Discussion Summary

1. Claude/Opus seems to be getting worse
Many users report a noticeable drop in accuracy or “dumb” responses, especially after a new release or during peak hours.

“I’ve noticed a degradation in Opus 4.5… it feels regressed by a generation.” – epolanski
“I’ve noticed a degradation… the model just gives up.” – jampa

2. The cause is unclear – load, bugs, or user skill?
Participants argue whether the drop is due to server load/quantization, software bugs, or simply users learning to prompt better.

“It could be a software bug affecting inference.” – gpm
“I think the degradation is because of subtle changes to CC prompts/tools.” – turnsout
“I’m learning more about what the model is and is not useful for, my subjective experience improves, not degrades.” – emp17344

3. Benchmark design and statistical rigor are hotly debated
The community questions the validity of the claimed 4% drop, the confidence‑interval approach, and the sample size.

“The daily scale is not statistically significant and is meaningless.” – goldenarm
“They’re reporting statistically significant differences but the methodology is flawed.” – crazygringo
“You need to run the test 5–10 times per day to get a reliable signal.” – ofirpress

4. Transparency and trust in Anthropic’s statements
Users express skepticism about Anthropic’s assurances that they never downgrade models, and demand clearer disclosure of “thinking power” and potential shadow‑downgrades.

“We never reduce model quality due to demand, time of day, or server load.” – anthropic (quoted by many)
“They’re probably resource‑constrained and are silently serving cheaper models.” – arcanemachiner
“Transparency is a big gripe… I would prefer a straight‑no than a silently downgraded answer.” – dmos62

These four themes capture the core of the discussion: a perceived decline in performance, uncertainty over its origin, contention over how to measure it, and a call for greater openness from the provider.


🚀 Project Ideas

Real‑Time AI Performance Dashboard

Summary

  • Continuously fetches benchmark results, latency, token usage, and error rates from multiple LLM providers (Claude, Gemini, Codex, etc.).
  • Applies statistical significance tests to detect degradation or sudden changes.
  • Provides alerts, trend charts, and a public API for developers to embed in their own dashboards.
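The significance testing such a dashboard would need can be sketched as a two‑proportion z‑test comparing a day's pass rate against a rolling baseline. This is a minimal stdlib‑only sketch; the sample sizes are illustrative, not real benchmark data:

```python
import math

def degradation_z_test(baseline_pass, baseline_n, today_pass, today_n):
    """Two-proportion z-test: is today's pass rate significantly below baseline?

    Returns (z, p) where p is the one-sided p-value for the
    "worse than baseline" direction.
    """
    p1 = baseline_pass / baseline_n
    p2 = today_pass / today_n
    pooled = (baseline_pass + today_pass) / (baseline_n + today_n)
    se = math.sqrt(pooled * (1 - pooled) * (1 / baseline_n + 1 / today_n))
    z = (p2 - p1) / se
    p = 0.5 * (1 + math.erf(z / math.sqrt(2)))  # standard normal CDF at z
    return z, p

# Illustrative numbers: 70% baseline over 1000 tasks vs. 66% over 300 today.
z, p = degradation_z_test(700, 1000, 198, 300)
```

On these illustrative numbers a 4‑point drop over 300 tasks gives p ≈ 0.09, which echoes the thread's point that single daily runs are too small to declare degradation.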

Details

  • Target Audience: AI developers, ops teams, product managers
  • Core Feature: Live performance monitoring with anomaly detection
  • Tech Stack: Python (FastAPI), InfluxDB, Grafana, WebSocket, Docker
  • Difficulty: High
  • Monetization: Revenue‑ready; tiered subscription ($49/mo basic, $199/mo enterprise)

Notes

  • HN comments like “I can’t be certain” (turnsout) and “I need to know if the model is degrading” (mrandish) suggest a dashboard offering instant confidence would be welcomed.
  • Enables quick triage of “model degradation” vs. “prompt drift” discussions.

Benchmark‑as‑a‑Service (BaaS)

Summary

  • A cloud service that runs a curated, statistically robust benchmark suite (SWE‑Bench‑Pro, Code‑Eval, etc.) on demand.
  • Supports multiple runs per day, configurable sample sizes, and automatic confidence interval calculation.
  • Exposes results via REST API and web UI for comparison across providers.
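The automatic confidence‑interval calculation could use a Wilson score interval, which behaves better than the naive normal interval at benchmark‑scale run counts. A sketch, with an illustrative 300‑task run:

```python
import math

def wilson_ci(successes, n, z=1.96):
    """95% Wilson score interval for a benchmark pass rate.

    More reliable than the naive normal interval when n is small
    or the pass rate is near 0 or 1.
    """
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return centre - half, centre + half

lo, hi = wilson_ci(210, 300)  # illustrative: 70% pass rate over 300 tasks
```

Here the 95% interval spans roughly ±5 points, which is why a single run per day struggles to resolve a 4% change and why configurable sample sizes matter.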

Details

  • Target Audience: Researchers, open‑source projects, SaaS vendors
  • Core Feature: Repeatable, high‑confidence benchmark execution
  • Tech Stack: Go, Kubernetes, PostgreSQL, React, OpenTelemetry
  • Difficulty: Medium
  • Monetization: Revenue‑ready; pay‑per‑run ($0.01/run) or monthly plan ($29/mo)

Notes

  • Addresses frustrations such as “I need to run 300 tasks” (ofirpress) and “I want to compare across providers” (persedes).
  • Provides the “objective” data that many HN commenters crave to validate anecdotal degradation claims.

Context Window Management CLI

Summary

  • A lightweight command‑line tool that automatically splits large prompts into deterministic batches, preserves ordering, and ensures consistent token limits.
  • Includes a “compact‑once‑per‑session” mode to avoid context loss in long sessions.
  • Offers a plug‑in for Claude Code to replace its current React‑based TUI.
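The deterministic batching at the core of the tool can be sketched as a greedy, order‑preserving split under a token budget. The 4‑chars‑per‑token counter below is a placeholder assumption; a real tokenizer would replace it:

```python
def batch_prompts(items, max_tokens, count_tokens=lambda s: len(s) // 4):
    """Greedy, order-preserving split of prompt items under a token budget.

    count_tokens defaults to a rough 4-chars-per-token heuristic; a real
    tokenizer should replace it. The same input always yields the same
    batches. An item larger than max_tokens still gets its own batch.
    """
    batches, current, used = [], [], 0
    for item in items:
        cost = count_tokens(item)
        if current and used + cost > max_tokens:
            batches.append(current)
            current, used = [], 0
        current.append(item)
        used += cost
    if current:
        batches.append(current)
    return batches

# Three ~100-token items under a 150-token budget -> one item per batch.
chunks = batch_prompts(["a" * 400, "b" * 400, "c" * 400], max_tokens=150)
```

Determinism is the point: identical inputs always produce identical batches, so runs are reproducible and context behaviour can be compared across sessions.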

Details

  • Target Audience: CLI users, developers using Claude Code
  • Core Feature: Deterministic context batching & compaction
  • Tech Stack: Rust, Tokio, serde, clap
  • Difficulty: Medium
  • Monetization: Hobby

Notes

  • Solves “compact‑once‑per‑session” pain (turnsout) and “Claude Code is too slow” (bushbaba).
  • HN users who struggle with “context compaction” (turnsout, fernvenue) will appreciate a reproducible solution.

Transparent Model Version & Quality Indicator

Summary

  • Browser extension/API that intercepts LLM requests, identifies the exact model version, quantization level, and current performance score.
  • Displays a “quality badge” and a warning if the model is serving a downgraded variant.
  • Allows users to opt‑in to a “no‑downgrade” mode that routes traffic to a higher‑quality endpoint if available.
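A minimal version of the quality check might simply compare the requested model ID against the one echoed back in the response body. The `model` field and the response shape below are assumptions (many LLM APIs echo the served model, but formats vary by provider), and the model IDs are illustrative:

```python
def quality_badge(requested_model, response_meta):
    """Compare the requested model ID to the one the API reports serving.

    response_meta is assumed to be the parsed JSON body of a completion
    response; many LLM APIs echo the served model in a "model" field,
    but the exact shape varies by provider.
    """
    served = response_meta.get("model", "unknown")
    if served == requested_model:
        return f"OK: served {served} as requested"
    return f"WARNING: requested {requested_model}, got {served}"

# Hypothetical response body; model IDs are illustrative.
badge = quality_badge("claude-opus-4-5", {"model": "claude-opus-4-5"})
```

Note this only catches overt model swaps; detecting silent quantization would require the behavioural benchmarking described in the first two ideas.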

Details

  • Target Audience: End‑users, developers, compliance teams
  • Core Feature: Real‑time model version & quality disclosure
  • Tech Stack: TypeScript, Chrome Extension APIs, Node.js, WebSocket
  • Difficulty: Medium
  • Monetization: Revenue‑ready; $5/mo for premium features (e.g., historical logs)

Notes

  • Directly addresses “I want to know if I’m getting the same model” (turnsout) and “Anthropic might be downgrading” (gpm).
  • Provides the transparency that many HN commenters demand (“I need to see what’s happening”).

Agent Harness Health Monitor

Summary

  • A monitoring service for LLM‑powered harnesses (Claude Code, Agentic, etc.) that tracks tool‑calling errors, latency spikes, and unexpected shutdowns.
  • Offers auto‑restart, fallback to a backup harness, and detailed diagnostics via a web UI.
  • Integrates with Slack/Discord for real‑time alerts.
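The failover logic could start as a simple loop: probe the active endpoint, count consecutive failures, and move on to the next. Everything here (endpoint names, the probe function) is illustrative:

```python
import time

def monitor(check, endpoints, max_failures=3, interval=0.0):
    """Probe endpoints in order; fail over after max_failures misses.

    check(endpoint) should return True when healthy (e.g., a cheap
    tool-calling smoke test against the harness). Returns the first
    endpoint that passes, or None when every endpoint is exhausted.
    """
    for endpoint in endpoints:
        failures = 0
        while failures < max_failures:
            if check(endpoint):
                return endpoint
            failures += 1
            time.sleep(interval)  # back off between probes
    return None  # all endpoints down: time to fire the Slack/Discord alert

# Simulated probe: the primary harness is down, the backup is healthy.
healthy = monitor(lambda e: e == "backup", ["primary", "backup"])
```

A production version would run this continuously and add jittered backoff, but the core triage loop (probe, count, fail over, alert) is this small.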

Details

  • Target Audience: Ops teams, AI product owners
  • Core Feature: Continuous health checks & automated failover
  • Tech Stack: Python (Celery), Prometheus, Grafana, Docker Compose
  • Difficulty: High
  • Monetization: Revenue‑ready; $99/mo for full monitoring suite

Notes

  • Responds to “Claude Code harness issue” (trq_) and “I get stuck in a loop” (turnsout).
  • Gives teams the confidence that “when the model degrades, traffic shifts automatically” (devonkelley).
