Project ideas from Hacker News discussions.

Agents that run while I sleep

📝 Discussion Summary

1. “Freeze the tests” – preventing AI from touching test files
Developers want a hard guardrail so Claude can only modify implementation, not the tests.

“I want to get Claude to change the code to get them to pass – but with confidence that it doesn't edit any of the test files!” – BeetleB
“You can remove edit permissions on the test directory.” – SatvikBeri
“You can use a Claude PreToolUse command hook to prevent write (or even read) access to specific files.” – mgrassotti
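The hook approach mgrassotti describes can be sketched as a small script. A minimal sketch, assuming the PreToolUse payload arrives as JSON on stdin with `tool_name` and `tool_input.file_path` fields and that a non-zero exit status of 2 denies the tool call (check the Claude Code hooks documentation for the exact schema):

```python
#!/usr/bin/env python3
"""PreToolUse hook sketch: block LLM writes to test files."""
import json
import sys
from pathlib import PurePosixPath

WRITE_TOOLS = {"Edit", "Write", "MultiEdit"}   # tools that can modify files
PROTECTED_DIRS = {"tests", "test"}             # directories to freeze

def should_block(tool_name: str, file_path: str) -> bool:
    """Return True if this tool call would modify a protected test path."""
    if tool_name not in WRITE_TOOLS:
        return False
    parts = PurePosixPath(file_path).parts
    return any(p in PROTECTED_DIRS for p in parts)

if __name__ == "__main__":
    payload = json.load(sys.stdin)
    path = payload.get("tool_input", {}).get("file_path", "")
    if should_block(payload.get("tool_name", ""), path):
        print(f"Blocked: {path} is frozen", file=sys.stderr)
        sys.exit(2)  # exit code 2 signals the hook is denying the call
```

Read access stays open here; extending `WRITE_TOOLS` with `Read` would hide the tests entirely, as some commenters suggest.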

2. TDD vs. “write‑tests‑after‑the‑fact” – the quality debate
Commenters debate test-first versus test-after: teams often skip writing tests first because thinking the behavior through takes time they don't have, yet tests written after the fact risk merely verifying what the code already does.

“tests written after the fact are just verifying tautologies.” – SoftTalker
“Most teams don’t write tests first because thinking through what the code should do before writing it takes time they don’t have.” – SoftTalker

3. Agent orchestration patterns (red/green/refactor, clean‑room, sub‑agents)
A recurring theme is the need for separate agents with restricted views to avoid “reward‑gaming” and to keep the workflow honest.

“Red Team writes tests without seeing implementation. Green Team writes code without seeing tests.” – egeozcan

4. Human oversight, review fatigue, and the cost of AI‑assisted coding
Users wrestle with the token‑cost of running agents, the need for constant review, and whether the productivity gains justify the expense.

“I’m not up to speed on Claude’s features. Can I, from the prompt, quickly remove those permissions and then re‑add them?” – BeetleB
“I’m still not really understanding this ‘run agents overnight’ thing.” – mjrbrennan

5. Spec clarity and “spec‑gaming” – ensuring the AI knows what to build
Before any code is written, the spec must be unambiguous; otherwise the AI will happily satisfy a broken or incomplete requirement.

“Spec refinement upstream, holdout validation downstream.” – foundatron (via OctopusGarden)

6. Industry impact & skepticism – who benefits, who loses
The conversation oscillates between hype (AI as a productivity “exoskeleton”) and caution (risk of buggy, unreviewed code, loss of engineering skill).

“I think the ROI with LLMs is very high.” – godelski

These six themes capture the core concerns and proposals that dominate the discussion.


🚀 Project Ideas

Test Freeze Manager

Summary

  • Provides a command‑line and Git‑hook tool that temporarily sets test directories to read‑only while an LLM agent writes code, preventing accidental test edits.
  • Gives developers instant feedback if the agent tries to modify tests, and automatically restores permissions after the session.

Details

  • Target Audience: Teams using Claude Code, Copilot, or any LLM‑powered code generator.
  • Core Feature: Permission toggling, Git pre‑commit hook, real‑time monitoring of file changes.
  • Tech Stack: Python, Git hooks, OS file‑permission APIs, optional Docker integration.
  • Difficulty: Medium
  • Monetization: Revenue‑ready: subscription ($5/month per repo) or per‑use API.

Notes

  • “I want to freeze the tests so the agent can’t touch them” – a common pain point.
  • Developers can run freeze-tests before launching an agent and unfreeze-tests afterward, eliminating the mental burden of auditing changes.
  • The tool can be integrated into CI pipelines to enforce the rule automatically.
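The permission toggle at the heart of the tool can be sketched in a few lines. A minimal sketch, assuming the test directories are frozen by stripping write bits via `chmod` (the CLI names `freeze-tests`/`unfreeze-tests` would wrap this function):

```python
"""Core of a freeze-tests CLI: toggle write permission on test dirs."""
import os
import stat
from pathlib import Path

def set_writable(root: str, writable: bool) -> list:
    """Recursively add or strip write bits under *root*; return changed paths."""
    changed = []
    for path in Path(root).rglob("*"):
        mode = path.stat().st_mode
        if writable:
            new = mode | stat.S_IWUSR
        else:
            new = mode & ~(stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH)
        if new != mode:
            os.chmod(path, new)
            changed.append(str(path))
    return changed
```

A session would call `set_writable("tests", False)` before launching the agent and `set_writable("tests", True)` after; a Git pre-commit hook could additionally reject any staged change under `tests/` while the freeze is active.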

Agentic TDD Orchestrator

Summary

  • A lightweight framework that orchestrates three sub‑agents (Red, Green, Refactor) to enforce a strict red‑green‑refactor cycle for LLM‑generated code.
  • Each agent has a clear visibility boundary: Red writes failing tests, Green implements code without seeing tests, Refactor cleans up code while keeping tests green.

Details

  • Target Audience: Developers who want to enforce disciplined TDD with LLMs.
  • Core Feature: Multi‑agent orchestration, isolation via Docker containers, automated test execution.
  • Tech Stack: Node.js, Docker, OpenAI/Anthropic APIs, custom agent orchestration library.
  • Difficulty: High
  • Monetization: Revenue‑ready: SaaS ($20/month per team) with free tier for hobbyists.

Notes

  • “Red team writes tests, green team writes code, refactor team cleans up” – a pattern many commenters advocate.
  • The framework logs each agent’s output, making it easy to audit and review.
  • It solves the “agent writes tests that pass immediately” problem by preventing the Green agent from seeing the tests.
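The visibility boundary can be sketched as a plain control loop with each agent reduced to a callable; in the real tool each callable would wrap an LLM call inside its own container, and the names below are illustrative. Note that Green only ever sees the spec and the failure log, never the test source:

```python
"""Red/green loop sketch with hard visibility boundaries between agents."""
from typing import Callable

def tdd_cycle(spec: str,
              red: Callable[[str], str],              # spec -> test code
              green: Callable[[str, str], str],       # spec + failure log -> impl
              run_tests: Callable[[str, str], tuple], # (tests, impl) -> (passed, log)
              max_rounds: int = 5):
    tests = red(spec)                # Red agent: writes tests from the spec alone
    impl, log = "", "no implementation yet"
    for _ in range(max_rounds):
        impl = green(spec, log)      # Green agent: sees failures, never the tests
        passed, log = run_tests(tests, impl)
        if passed:
            return tests, impl
    raise RuntimeError("Green agent could not satisfy the tests")
```

Because Green receives only the failure log, it cannot hard-code expected values out of the test file, which is exactly the reward-gaming the commenters want to prevent.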

Spec Validation & Test Generation Service

Summary

  • A web service that ingests natural‑language specifications, runs ambiguity‑scoring, and automatically generates a suite of acceptance tests before any code is written.
  • Provides a “spec‑to‑tests” API that can be called from CI or a local CLI.

Details

  • Target Audience: Product owners, technical writers, and developers who want to lock down specs before LLM coding.
  • Core Feature: NLP ambiguity detection, test skeleton generation, spec‑to‑test mapping.
  • Tech Stack: Python, FastAPI, spaCy, OpenAI GPT‑4, PostgreSQL.
  • Difficulty: Medium
  • Monetization: Revenue‑ready: pay‑per‑spec ($0.02 per spec) or subscription ($15/month).

Notes

  • “I want to make sure the spec is unambiguous before the agent writes code” – a recurring frustration.
  • The service returns a JSON test plan that can be fed directly into an LLM prompt, ensuring the agent has a concrete target.
  • It also stores historical specs, enabling traceability of test coverage over time.
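The ambiguity scoring could start as a simple lexical heuristic before graduating to an NLP model. A minimal sketch; the vague-term list, scoring rule, and JSON shape are all illustrative assumptions:

```python
"""Heuristic spec-ambiguity scorer: flag vague wording, emit a test plan."""
import re

VAGUE_TERMS = {"should", "could", "might", "fast", "slow",
               "etc", "appropriate", "user-friendly"}

def score_spec(spec: str) -> dict:
    """Score a spec for vagueness and skeleton out one test per line."""
    words = re.findall(r"[a-z'-]+", spec.lower())
    flags = sorted({w for w in words if w in VAGUE_TERMS})
    score = min(1.0, len(flags) / 5)   # 0 = crisp, 1 = very ambiguous
    plan = [{"requirement": line.strip(), "test": f"test_{i}"}
            for i, line in enumerate(spec.splitlines()) if line.strip()]
    return {"ambiguity_score": score, "flags": flags, "test_plan": plan}
```

A spec like "The API should be fast" scores high and gets bounced back for refinement, while "Return HTTP 404 when the id is missing" passes through and maps cleanly to a concrete test.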

Test Theatre Analyzer

Summary

  • Static analysis tool that scans a codebase’s test suite for brittleness, duplication, and coverage gaps, then suggests refactorings and mutation‑testing improvements.
  • Integrates with popular test frameworks (pytest, Jest, JUnit).

Details

  • Target Audience: QA engineers, test maintainers, and developers dealing with “useless tests”.
  • Core Feature: Duplicate test detection, mutation‑testing integration, coverage heat‑maps.
  • Tech Stack: Go, static analysis libraries, GraphQL API for dashboards.
  • Difficulty: Medium
  • Monetization: Hobby (open source) or revenue‑ready: $10/month per repo.

Notes

  • “Useless tests start growing in count” – a pain point for many.
  • The tool can be run locally or as a GitHub Action, providing actionable insights before a PR is merged.
  • It encourages a culture of test hygiene, reducing the “test theatre” problem.

Continuous Test Quality Dashboard

Summary

  • A CI‑integrated dashboard that tracks test pass rates, coverage trends, mutation scores, and flaky‑test frequency over time.
  • Alerts teams when test quality drifts below a configurable threshold.

Details

  • Target Audience: DevOps teams, QA leads, and developers who rely on automated tests.
  • Core Feature: Real‑time metrics, trend analysis, flaky‑test detection, mutation‑testing integration.
  • Tech Stack: JavaScript (React), Node.js, Grafana, Prometheus, CI webhook integration.
  • Difficulty: Medium
  • Monetization: Revenue‑ready: $25/month per project, with a free tier for open‑source repos.

Notes

  • “Test coverage is high but tests are fragile” – a common concern.
  • The dashboard visualizes the health of the test suite, making it easier to spot when new code introduces regressions.
  • It can be embedded into existing CI dashboards or run as a standalone service.
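The flaky-test metric reduces to spotting tests whose outcome flips across reruns of the same commit. A minimal sketch, assuming the CI webhook data can be flattened into `(test, commit, passed)` records:

```python
"""Flaky-test detection sketch from flattened CI run history."""
from collections import defaultdict

def flaky_tests(records, threshold: int = 1) -> list:
    """Return tests that both passed and failed on at least *threshold* commits."""
    outcomes = defaultdict(set)
    for test, commit, passed in records:
        outcomes[(test, commit)].add(passed)
    flips = defaultdict(int)
    for (test, _), seen in outcomes.items():
        if len(seen) > 1:            # both True and False observed on one commit
            flips[test] += 1
    return sorted(t for t, n in flips.items() if n >= threshold)
```

Raising `threshold` filters out one-off infrastructure hiccups; the dashboard would plot the per-test flip count over time alongside coverage and mutation trends.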

LLM Code Review Bot

Summary

  • An automated code‑review bot that focuses on intent, style, security, and architectural consistency for LLM‑generated code.
  • Produces concise diff reviews and highlights potential issues before a PR is merged.

Details

  • Target Audience: Teams that use LLMs for code generation and need a lightweight review step.
  • Core Feature: Diff‑only analysis, style‑guide enforcement, security linting, intent‑matching.
  • Tech Stack: Python, GitHub API, OpenAI Codex, Bandit, Flake8.
  • Difficulty: Low
  • Monetization: Hobby (open source) or revenue‑ready: $5/month per repo.

Notes

  • “I need to review 20k lines of generated code” – a bottleneck many mention.
  • By limiting the bot to the diff, it avoids the need to read the entire codebase, speeding up reviews.
  • The bot can be configured to ignore certain files (e.g., generated assets) and to enforce project‑specific style rules.
