Project ideas from Hacker News discussions.

AI coding assistants are getting worse?

📝 Discussion Summary

1. Flawed Article/Test Methodology

Many dismiss the article's test as unrealistic or contrived, since the prompt demanded "completed code only" while withholding data needed to complete it.
"This is a wildly out of touch thing to say" - tacoooooooo.
"It's silly because the author asked the models to do something they themselves acknowledged isn't possible" - vidarh.

2. AI Coding Tools Are Improving

Users report personal successes and cite benchmarks showing progress, contradicting the claim that assistants are "getting worse."
"The agents available in January 2025 were much much worse than the agents available in November 2025" - minimaxir.
"They are objectively better on every measure we can come up with. I used 2b input and 10m output tokens on codex last week alone" - ripped_britches.

3. Need Better Prompting/Scaffolding ("Holding It Wrong")

Success requires skill in prompts, tests, and workflows; simplistic use fails.
"You just haven't figured out the scaffolding required to elicit good performance from this generation. Unit tests would be a good place to start" - theptip.
"I got codex 5.1 max with the codex extension on vs code - to generate over 10k lines of code... This is also with just the regular 20$ subscription" - chiengineer.

4. Training Data Poisoning/GIGO/Model Collapse

Inexperienced users and AI slop degrade training data, causing subtle failures like reward hacking.
"as inexperienced coders started turning up in greater numbers, it also started to poison the training data" - toss1 (quoting article).
"AI coding assistants that found ways to get their code accepted... even if 'that' meant turning off safety checks" - toss1.

5. Model Updates Break Compatibility

Forced model updates disrupt applications; pinning snapshots or versioning is needed but insufficient.
"We should be able to pin to a version of training data history like we can pin to software package versions" - StarlaAtNight.
"Every model update would be a breaking change, an honest application of SemVer has no place in AI model versions" - swid.

6. Productivity Gains Anecdotal, Proof Demanded

Debate rages over claimed 10x boosts; enthusiasts offer anecdotes while skeptics demand hard data amid the hype.
"One thing I find really funny is when AI enthusiasts make claims... always entirely anecdotally based... but when others make claims to the contrary suddenly there is some overwhelming burden of proof" - llmslave2.
"I'd just like to see a live coding session from one of these 10x AI devs" - AstroBen.


🚀 Project Ideas

LLM-Governance (SemVer for AI)

Summary

  • A version control and snapshotting service for AI model dependencies (system prompts, agent harnesses, and model versions).
  • Solves the problem of "silent regressions" and "force-fed updates" where a model update breaks existing integration logic without notice.
  • Provides a unified "model lockfile" to pin specific dated snapshots and prompt configurations.

Details

  • Target Audience: DevOps Engineers & AI Product Managers
  • Core Feature: Registry for versioned system prompts and model snapshots
  • Tech Stack: Python, PostgreSQL, OpenRouter/Direct LLM APIs
  • Difficulty: Medium
  • Monetization: Revenue-ready (per-seat or per-request proxy fees)

Notes

  • HN users specifically requested this: "We should be able to pin to a version of training data history like we can pin to software package versions... Release new updates w/ SemVer" (StarlaAtNight).
  • Addresses the frustration that "every model update would be a breaking change" (swid).
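To make the "model lockfile" concrete, here is a minimal sketch in Python. The lockfile schema, the `summarizer` alias, and the `resolve` helper are all hypothetical; the dated snapshot name follows the style of provider-issued snapshots, but a real service would also need to verify prompt integrity and proxy requests.

```python
import json

# Hypothetical lockfile pinning a dated model snapshot and a SemVer'd
# system prompt, analogous to a package manager's lockfile.
LOCKFILE = {
    "models": {
        "summarizer": {
            "provider": "openai",
            "model": "gpt-4o-2024-08-06",  # dated snapshot, not a floating alias
            "prompt_version": "2.3.1",     # SemVer for the system prompt
        }
    }
}

def resolve(alias: str) -> dict:
    """Return the pinned model/prompt pair for an application alias,
    refusing to fall back to a floating 'latest' tag."""
    entry = LOCKFILE["models"].get(alias)
    if entry is None:
        raise KeyError(f"no pinned entry for {alias!r}; refusing to use 'latest'")
    return entry
```

An application would call `resolve("summarizer")` instead of hard-coding a model name, so a model upgrade becomes an explicit lockfile change that can be reviewed and rolled back.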

Pre-Slop Search Index (Before-2023)

Summary

  • A dedicated search engine or browser extension that filters the web and YouTube for content created strictly before the "AI slop" era (pre-2023).
  • Restores the utility of search by ensuring results are human-generated, solving the problem of "AI slop/astroturfing of YT is near complete" (noir_lord).

Details

  • Target Audience: Researchers, developers, and hobbyists
  • Core Feature: Time-gated web/video indexing
  • Tech Stack: Elasticsearch/Typesense, Common Crawl, YouTube API
  • Difficulty: Medium
  • Monetization: Hobby (freemium search API or browser extension)

Notes

  • Heavily supported by the community: "A dataset with only data from before 2024 will soon be worth billions" (amelius) and "the AI slop/astroturfing of YT is near complete" (noir_lord).
  • High practical utility for finding technical documentation that isn't hallucinated.
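The core of the time gate is a date filter applied at index time. A minimal sketch, assuming each crawled document carries an ISO-8601 `published` timestamp (e.g. from a Common Crawl index record or the YouTube API's snippet.publishedAt); the cutoff date and document shape are illustrative assumptions:

```python
from datetime import datetime, timezone

# Hypothetical cutoff: keep only content first published before 2023,
# i.e. before the "AI slop" era the thread describes.
CUTOFF = datetime(2023, 1, 1, tzinfo=timezone.utc)

def is_pre_slop(doc: dict) -> bool:
    """Accept a document only if its publication timestamp predates the cutoff."""
    published = datetime.fromisoformat(doc["published"].replace("Z", "+00:00"))
    return published < CUTOFF

docs = [
    {"url": "https://example.com/a", "published": "2021-06-15T12:00:00Z"},
    {"url": "https://example.com/b", "published": "2024-02-01T09:30:00Z"},
]
kept = [d["url"] for d in docs if is_pre_slop(d)]  # only the 2021 document survives
```

The hard part in practice is trusting the timestamp: pages get republished with fresh dates, so a real index would cross-check first-seen dates from the crawl archive rather than the page's own claim.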

Ralph Wiggum "Repeat-Until-Done" Agent

Summary

  • A specialized agent harness that identifies "lazy" LLM behaviors (like TODO comments or missing implementation) and automatically triggers recursive re-prompts.
  • Solves the frustration of models adding comments saying "this still needs to be implemented" (empath75).
  • Forces "completionism" by checking code against a requirement checklist before returning success.

Details

  • Target Audience: Individual developers using LLMs for coding
  • Core Feature: Auto-recursive implementation logic
  • Tech Stack: Go or Rust (CLI), LangChain/LangGraph
  • Difficulty: Low
  • Monetization: Hobby (open-source CLI tool)

Notes

  • Direct solution to the "Ralph Wiggum plugin" mentioned in the thread (thefreeman).
  • Solves the "lazy model" issue where agents "tell me they did what I asked them to do" but left TODOs (empath75).
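The repeat-until-done loop can be sketched in a few lines. This is a minimal illustration, not the plugin from the thread: the `generate` callable stands in for any coding-model API wrapper, and the laziness patterns are assumptions about what "stub" output looks like.

```python
import re

# Markers that signal "lazy" output: TODOs and placeholder stubs.
LAZY_PATTERNS = [
    re.compile(r"\bTODO\b", re.IGNORECASE),
    re.compile(r"\bnot\s+implemented\b", re.IGNORECASE),
]

def find_lazy_markers(code: str) -> list:
    """Return the patterns that matched, so the re-prompt can cite them."""
    return [p.pattern for p in LAZY_PATTERNS if p.search(code)]

def repeat_until_done(generate, prompt: str, max_rounds: int = 5) -> str:
    """Re-prompt until the generated code contains no lazy markers.
    `generate` is any callable mapping a prompt string to code."""
    code = generate(prompt)
    for _ in range(max_rounds):
        markers = find_lazy_markers(code)
        if not markers:
            return code
        code = generate(
            f"{prompt}\n\nYour previous answer left placeholders matching "
            f"{markers}. Replace every stub with a full implementation."
        )
    raise RuntimeError("model still returning incomplete code after retries")
```

A fuller version would replace the regex check with the requirement checklist described above, and run the project's test suite as the final "done" condition.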

Sandboxed "AskHuman" Agent Permissions

Summary

  • A secure execution environment for agents that specifically limits their ability to run git commit or rm, or to access the browser, while providing a dedicated "AskHuman" channel for blockers.
  • Solves the problem of agents "making git commits on their own" (hdra) or "dropping databases" (manwds).
  • Implements unique OS-level user permissions specifically for AI binaries.

Details

  • Target Audience: "Agentic" developers & security-conscious firms
  • Core Feature: Secure VM/container with restricted command set and interactive human tool
  • Tech Stack: Docker/Firecracker, Linux permission controls, IPC
  • Difficulty: High
  • Monetization: Revenue-ready (enterprise security tool)

Notes

  • Users are begging for better security for agents: "I've been trying to stop the coding assistants from making git commits on their own and nothing has been working" (hdra).
  • Adds a necessary "Human in the loop" (chiengineer).
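The gating logic itself is simple; here is a minimal sketch of the policy layer. The blocklist is illustrative, and real enforcement would happen at the OS/container boundary (a dedicated unprivileged user inside Docker or Firecracker), not in application code an agent could bypass:

```python
import shlex

# Command prefixes the agent may never run on its own; anything matching
# is escalated to the human instead of executed. Names are illustrative.
BLOCKED = {("git", "commit"), ("git", "push"), ("rm",), ("sudo",)}

def gate(command: str) -> str:
    """Return 'allow' or 'ask_human' for a shell command the agent proposes."""
    argv = tuple(shlex.split(command))
    for pattern in BLOCKED:
        if argv[: len(pattern)] == pattern:
            return "ask_human"
    return "allow"
```

So `gate("git status")` passes through while `gate("git commit -m wip")` is routed to the AskHuman channel, directly addressing the unwanted-commit complaint above.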

High-Quality Expert Dataset Bounty (Babbage Market)

Summary

  • A marketplace where senior developers are paid to label, review, and "fix" AI-generated code to create the next generation of high-quality training data.
  • Solves the GIGO (Garbage In, Garbage Out) problem caused by "inexperienced coders... poisoning the training data" (toss1).
  • Provides "low-background" clean data for model fine-tuning.

Details

  • Target Audience: AI training labs and senior software engineers
  • Core Feature: Verified expert code review/labeling workflow
  • Tech Stack: Web-based review UI (React), auth, payment rails
  • Difficulty: Medium
  • Monetization: Revenue-ready (commissions on data fulfillment)

Notes

  • Responds to the need for "High-quality data reviewed by experts" (oblio) to prevent the "inevitable GIGO syndrome" (toss1).
  • Addresses concerns that models are "eating their own garbage" (toss1).
