Project ideas from Hacker News discussions.

We can't have nice things because of AI scrapers

📝 Discussion Summary

Prevalent Themes in the Hacker News Discussion

1. AI Scrapers Overwhelming Sites with Inefficient Traffic

Participants express frustration that AI crawlers bypass efficient data access methods (like bulk downloads) in favor of aggressive, page-by-page scraping, which drains resources. As one user noted, "MetaBrainz is exactly the kind of project AI companies should be supporting—open data, community-maintained, freely available for download in bulk. Instead they’re… Scraping page-by-page (inefficient for everyone)". Another described the impact: "I had to remove it from my site after too many complaints" about the degraded user experience.

2. Futility of Standard Web Protocols

Many commenters doubt established tools like robots.txt can effectively curb bot behavior. One stated, "the problem is that they don't listen to the already-established standards. What makes one think they would suddenly start if we add another one or two?". Another dismissed the idea of a new standard for bulk data access, arguing, "AI scrapers already fake user agent headers, ignore robots.txt, and go through botnets to bypass firewall rules. They're not going to put out such a signal if they can help it."

3. Onerous Impact on Small Sites and Open Projects

The discussion highlights how resource-intensive scraping harms volunteer-run and low-budget websites. A user shared, "I deleted my web site early 2025… because of AI scraper traffic. It had been up for 22 years." Another lamented the cost: "You've wasted 500Mb of bandwidth… My monthly b/w cost is now around 20-30Gb a month given to scrapers where I was only be using 1-2Gb a month, years prior." These pressures are forcing sites to lock down, "reducing openness" and hurting legitimate users.

4. Technical Solutions as a Limited Defense

While tools like Cloudflare's AI Labyrinth, Anubis, and iocaine are discussed, skepticism remains about their long-term efficacy and side effects. One user observed, "Modern scrapers are using headless chromium which will not see the invisible links, so I'm not sure how long this will be effective." Another pointed out collateral damage: "Cloudflare often destroys the experience for users with shared connections, VPNs, exotic browsers… I had to remove it from my site." The consensus is that as scrapers evolve (e.g., using real browsers), defensive measures face an ongoing arms race.


🚀 Project Ideas

AI-Resistant Data Gateway

Summary

  • Provides a single, verifiable entry point for AI scrapers to download entire datasets in bulk, bypassing inefficient page-by-page crawling.
  • Gives scrapers a standardized, cheap-to-serve path for bulk data access, reducing server load for open data projects.
  • Core value proposition: Empowers data providers to offer a "download everything here" endpoint that is machine-discoverable and cheap to serve.

Details

  • Target Audience: Open data projects, public interest websites, and small-to-medium site owners overwhelmed by bot traffic.
  • Core Feature: A standardized protocol (e.g., /.well-known/llm-dump or /.well-known/dataset.json) that links to compressed data dumps (tarballs, torrents) instead of dynamic HTML pages. Includes a simple reference implementation (CLI tool) for webmasters to generate and host these dumps; see the manifest sketch below.
  • Tech Stack: Static file hosting (S3, GitHub Pages, CDN), standard HTTP headers, BitTorrent (for decentralization); optionally, a Python/Go script to generate dumps from databases.
  • Difficulty: Low
  • Monetization: Hobby (Open Source). Could become "Revenue-ready" via a managed service that hosts and distributes these dumps over a CDN, tiered by bandwidth usage.
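
To make the "machine-discoverable, cheap to serve" claim concrete, here is a minimal Python sketch of a manifest generator for a hypothetical /.well-known/dataset.json endpoint. The field names (updated, license, dumps, sha256), the paths, and the URL are illustrative assumptions, not an existing standard.

```python
# Sketch: build a static /.well-known/dataset.json manifest pointing at
# pre-built bulk dumps. Field names and paths are illustrative assumptions.
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path

DUMP_DIR = Path("public/dumps")                 # pre-built .tar.gz dumps
MANIFEST = Path("public/.well-known/dataset.json")
BASE_URL = "https://example.org/dumps/"         # where the static files are served


def sha256_of(path: Path) -> str:
    """Checksum so scrapers can verify a dump instead of re-crawling pages."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def build_manifest() -> dict:
    entries = [
        {
            "url": BASE_URL + dump.name,
            "bytes": dump.stat().st_size,
            "sha256": sha256_of(dump),
        }
        for dump in sorted(DUMP_DIR.glob("*.tar.gz"))
    ]
    return {
        "updated": datetime.now(timezone.utc).isoformat(),
        "license": "CC0-1.0",                   # whatever the project actually uses
        "dumps": entries,
    }


if __name__ == "__main__":
    MANIFEST.parent.mkdir(parents=True, exist_ok=True)
    MANIFEST.write_text(json.dumps(build_manifest(), indent=2))
```

A crawler that finds this file can fetch and verify the archives in a handful of requests instead of walking every page.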

Notes

  • Why HN commenters would love it: As user crazygringo noted, "something like /llms.txt which contains a list of .txt or .txt.gz files" is a logical solution. User arjie explicitly wished for a "machine-communicable way that suggests an alternative" to page-by-page scraping. This project addresses that exact gap.
  • Potential for discussion or practical utility: High. It solves the "tragedy of the commons" where AI companies externalize scraping costs. By offering a cheaper alternative (bulk download), it aligns incentives. HN users frequently discuss protocols; proposing a concrete standard would generate significant feedback.

Open Web Shield (Self-Hosted)

Summary

  • An open-source, self-hostable alternative to Cloudflare's bot management and "AI Labyrinth" services.
  • Allows site owners to detect and "tarpit" aggressive scrapers without relying on a third-party corporation that might sell data or block legitimate users (e.g., those on VPNs).
  • Core value proposition: Privacy-preserving bot mitigation that gives control back to the site owner, avoiding the "Cloudflare tax" and privacy concerns.

Details

  • Target Audience: Individuals and organizations who self-host websites or services and value privacy/independence over convenience.
  • Core Feature: Reverse proxy middleware that identifies bad actors (based on request patterns, robots.txt violations, and known botnet IPs) and serves them infinite nonsense HTML pages (tarpitting) or slows them down significantly, while passing legitimate traffic through; see the middleware sketch below.
  • Tech Stack: Rust (for performance), Nginx/OpenResty (as middleware), or a standalone Go/Python proxy; threat intelligence feeds (IP blocklists).
  • Difficulty: Medium
  • Monetization: Hobby (Open Source).
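
As a rough shape for the detect-then-tarpit flow, here is a minimal WSGI middleware sketch in Python. The request-rate heuristic, the nonsense-HTML generator, and the in-process counters are all placeholder assumptions; a real deployment would sit behind Nginx/OpenResty and use shared state plus threat-intelligence feeds.

```python
# Sketch: WSGI middleware that tarpits clients exceeding a crude request-rate
# threshold and passes everyone else through. Thresholds and the "nonsense"
# generator are placeholders, not tuned values.
import random
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 60
MAX_REQUESTS_PER_WINDOW = 120
WORDS = ["lorem", "ipsum", "dolor", "sit", "amet", "consectetur", "adipiscing"]


class TarpitMiddleware:
    def __init__(self, app):
        self.app = app
        self.hits = defaultdict(deque)  # ip -> timestamps of recent requests

    def _is_suspicious(self, ip):
        now = time.monotonic()
        recent = self.hits[ip]
        recent.append(now)
        while recent and now - recent[0] > WINDOW_SECONDS:
            recent.popleft()
        return len(recent) > MAX_REQUESTS_PER_WINDOW

    def _tarpit(self, start_response):
        start_response("200 OK", [("Content-Type", "text/html")])

        def drip():
            # Slowly emit junk paragraphs with fake links to waste crawler time.
            for _ in range(50):
                text = " ".join(random.choices(WORDS, k=40))
                link = random.randint(0, 1_000_000_000)
                yield f"<p>{text} <a href='/{link}'>more</a></p>".encode()
                time.sleep(1)

        return drip()

    def __call__(self, environ, start_response):
        ip = environ.get("REMOTE_ADDR", "unknown")
        if self._is_suspicious(ip):
            return self._tarpit(start_response)
        return self.app(environ, start_response)
```

Wrapping an existing WSGI app is one line (app = TarpitMiddleware(app)); legitimate traffic passes through untouched, while anything over the threshold gets a slow drip of junk.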

Notes

  • Why HN commenters would love it: Users like zzzeek and timpera expressed frustration with Cloudflare blocking legitimate users (Firefox, Safari, VPNs) and creating a "bad experience." User rester324 asked, "You can implement this yourself, who is stopping you?" This project answers that by providing the tooling to do so without Cloudflare.
  • Potential for discussion or practical utility: High. It taps into the anti-corporate, pro-self-hosting sentiment on HN. Discussions often center on the trade-offs of using third-party services; this offers a way to avoid those trade-offs entirely.

LLM Protocol Negotiator

Summary

  • A browser extension and server-side script that implement a handshake protocol to distinguish between human browsers, legitimate API clients, and rogue AI scrapers.
  • It uses a challenge-response mechanism (such as proof of work or a specific header requirement) that legitimate AI tools (or those trying to be good citizens) can implement, while scrapers unwilling to pay the cost are tarpitted or blocked.
  • Core value proposition: A lightweight mechanism to prove "humanness" or "good intent" without CAPTCHAs, reducing friction for real users while blocking heavy scrapers.

Details

  • Target Audience: Site owners with high-value, dynamic content (forums, wikis, calculators) where CAPTCHAs are too intrusive but bot traffic is high.
  • Core Feature: The server sends a cryptographic challenge; the client (browser extension or compliant bot) solves it and sends back a token. If the token is missing or invalid, the server serves a tarpit or blocks the request; see the proof-of-work sketch below.
  • Tech Stack: WebAssembly (for in-browser PoW), Node.js/Python (server-side), Redis (for caching challenges).
  • Difficulty: Medium/High
  • Monetization: Hobby (Open Source).
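
One plausible shape for the challenge/response step is hashcash-style proof of work. The Python sketch below shows both halves; the difficulty, token format, and transport (e.g. custom headers) are assumptions, and in practice the client half would run as WebAssembly in the browser or inside a compliant bot.

```python
# Sketch: hashcash-style proof of work for the handshake. Difficulty and
# token format are assumptions; challenges would be cached server-side
# (e.g. in Redis) with a short TTL.
import hashlib
import os
import time

DIFFICULTY_BITS = 20  # ~1M hash attempts on average; tune per threat level


def issue_challenge() -> str:
    """Server side: random challenge string, remembered until it expires."""
    return os.urandom(16).hex()


def _meets_target(challenge: str, nonce: int) -> bool:
    digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
    # Require DIFFICULTY_BITS leading zero bits in the digest.
    return int.from_bytes(digest, "big") >> (256 - DIFFICULTY_BITS) == 0


def solve(challenge: str) -> int:
    """Client side: brute-force a nonce; costs real CPU time."""
    nonce = 0
    while not _meets_target(challenge, nonce):
        nonce += 1
    return nonce


def verify(challenge: str, nonce: int) -> bool:
    """Server side: a single hash, so verification is essentially free."""
    return _meets_target(challenge, nonce)


if __name__ == "__main__":
    challenge = issue_challenge()
    start = time.time()
    nonce = solve(challenge)
    print(f"solved in {time.time() - start:.1f}s, verified={verify(challenge, nonce)}")
```

The asymmetry is the point: verifying a token costs the server one hash, while producing it costs the client enough CPU that page-by-page scraping at scale stops being free.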

Notes

  • Why HN commenters would love it: User godelski mentioned Anubis using "proof of work" as a popular solution. Users are frustrated that standard blocking (IP ranges, ASNs) is ineffective against residential proxy botnets (jeroenhd noted these are "just a word for a botnet"). A protocol that forces computational cost on the scraper filters out low-effort spam without blocking humans.
  • Potential for discussion or practical utility: High. It touches on the technical arms race between scrapers and defenders. HN loves cryptographic puzzles and protocol design, and this addresses the "infinite resources" complaint by forcing scrapers to pay a computational price.

Bulk Data Importer/Archive Generator

Summary

  • A tool that sits between an API database (like MetaBrainz) and the public, automatically generating and hosting "snapshot" archives of the data in a crawl-friendly format (RSS, sitemaps of archives, or torrent files).
  • Instead of serving dynamic API requests, it serves static files representing the state of the data at a specific time, allowing scrapers to "catch up" without hitting the live database.
  • Core value proposition: Turns a high-cost, dynamic API into a low-cost, static file serving operation, effectively decoupling data availability from server load.

Details

  • Target Audience: API providers, open data projects, and data scientists needing reliable access to changing datasets.
  • Core Feature: A daemon that watches a database for changes and generates hourly/daily .tar.gz archives or .torrent files. It serves these via a dedicated endpoint (e.g., /archive/daily/) and advertises them via a standardized header or robots.txt directive; see the snapshot sketch below.
  • Tech Stack: Go (for concurrency), Docker, object storage (S3), BitTorrent tracker (optional).
  • Difficulty: Medium
  • Monetization: Hobby (Open Source). Could be "Revenue-ready" as a hosted "Data Relay" service for APIs that can't handle the load.
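
A minimal version of the snapshot daemon might look like the Python sketch below (Python rather than the Go listed above, purely to keep it short). The SQLite source, the example table names, and the /archive/daily/ layout are placeholders; a real deployment would export whatever the live API serves and likely push the files to object storage.

```python
# Sketch: periodically export database tables to a dated .tar.gz under
# public/archive/daily/. Database path, table names, and schedule are
# placeholders for illustration.
import csv
import io
import sqlite3
import tarfile
import time
from datetime import date
from pathlib import Path

DB_PATH = "data.sqlite3"                        # placeholder source database
TABLES = ("artists", "releases")                # placeholder table names
ARCHIVE_DIR = Path("public/archive/daily")
INTERVAL_SECONDS = 24 * 60 * 60                 # daily snapshots


def export_table(conn: sqlite3.Connection, table: str) -> bytes:
    """Dump one table to CSV bytes (header row first)."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    cur = conn.execute(f"SELECT * FROM {table}")
    writer.writerow([col[0] for col in cur.description])
    writer.writerows(cur)
    return buf.getvalue().encode()


def write_snapshot() -> Path:
    ARCHIVE_DIR.mkdir(parents=True, exist_ok=True)
    out_path = ARCHIVE_DIR / f"{date.today().isoformat()}.tar.gz"
    conn = sqlite3.connect(DB_PATH)
    try:
        with tarfile.open(out_path, "w:gz") as tar:
            for table in TABLES:
                data = export_table(conn, table)
                info = tarfile.TarInfo(name=f"{table}.csv")
                info.size = len(data)
                tar.addfile(info, io.BytesIO(data))
    finally:
        conn.close()
    return out_path


if __name__ == "__main__":
    while True:
        print("wrote", write_snapshot())
        time.sleep(INTERVAL_SECONDS)
```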

Notes

  • Why HN commenters would love it: The discussion highlights nostrademons praising MetaBrainz for offering DB dumps, and squigz pointing out that "inefficient path A" (scraping) vs. "efficient path B" (dumps) is a coordination problem. This tool automates the creation of "Path B."
  • Potential for discussion or practical utility: Practical utility is immense for the open data community. It directly addresses the "tragedy of the commons" where AI companies kill the golden goose. HN users like dannyobrien emphasize the importance of public goods like MetaBrainz; this tool helps protect them.
