1. Publishers are actively blocking the Internet Archive and other crawlers
“Publishers like The Guardian and NYT are blocking the IA/Wayback Machine… 20% of news sites are blocking both IA and Common Crawl.” – ninjagoo
“The Financial Times, for example, blocks any bot that tries to scrape its pay‑walled content, including bots from OpenAI, Anthropic, Perplexity, and the Internet Archive.” – shevy‑java
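These blocks are typically expressed in robots.txt (or enforced at the network layer for bots that ignore it). As a minimal sketch, one can check a site's robots.txt against the crawlers named above using the commonly published user-agent tokens; the exact strings are an assumption here, since they can vary (Anthropic, for instance, has used more than one), and the target site is a placeholder:

```python
# Sketch: check which of the crawlers from the thread a site's robots.txt
# disallows. The tokens below are the commonly published ones for each
# crawler, but may not be exhaustive; the site is a placeholder.
from urllib.robotparser import RobotFileParser

CRAWLER_TOKENS = {
    "Internet Archive": "ia_archiver",
    "Common Crawl": "CCBot",
    "OpenAI": "GPTBot",
    "Anthropic": "anthropic-ai",
    "Perplexity": "PerplexityBot",
}

def blocked_crawlers(site: str, path: str = "/") -> list[str]:
    """Return the crawlers that `site`'s robots.txt disallows for `path`."""
    rp = RobotFileParser()
    rp.set_url(f"https://{site}/robots.txt")
    rp.read()  # fetch and parse the live robots.txt
    return [
        name
        for name, token in CRAWLER_TOKENS.items()
        if not rp.can_fetch(token, f"https://{site}{path}")
    ]

if __name__ == "__main__":
    # Example: list which of these crawlers a given news site disallows.
    print(blocked_crawlers("www.example.com"))
```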
2. AI-driven scraping, not just generic bot traffic, is the main driver of the blocking
“AI training will be hard to police… the problem is that these sites want to make money with ads and paywalls that an archived copy tends to omit by design.” – bmiekre
“AI companies keep coming back even if everything is the same.” – CqtGLRGcukpy
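The complaint about repeat visits is ultimately about wasted transfer: HTTP already supports conditional requests (ETag / If-None-Match) that let a crawler skip pages that have not changed. A minimal sketch of that etiquette, with a placeholder URL, assuming the server sends ETag headers:

```python
# Sketch: re-fetch a page only if the server says it changed since the
# last visit. Uses only the standard library; the URL is a placeholder.
import urllib.error
import urllib.request

def fetch_if_changed(url: str, etag: str | None = None):
    """Return (new_etag, body) on change, or (etag, None) on 304 Not Modified."""
    req = urllib.request.Request(url)
    if etag:
        req.add_header("If-None-Match", etag)  # conditional GET
    try:
        with urllib.request.urlopen(req) as resp:
            return resp.headers.get("ETag"), resp.read()
    except urllib.error.HTTPError as err:
        if err.code == 304:  # unchanged: nothing new to download
            return etag, None
        raise
```

A crawler that cached the ETag from its first visit would download each unchanged page once, which is precisely what the comment says the AI scrapers fail to do.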
3. The debate over what should be preserved – “AI slop” vs. historically valuable content
“If most of the Internet is AI‑generated slop… is there really any value in expensing so much bandwidth and storage to preserve it?” – OGEnthusiast
“The unarchivability of news and other useful content has implications for future public discourse, historians, legal matters…” – ninjagoo
4. Legal and compliance pressure is turning the issue into a business‑risk problem
“Regulatory frameworks like SOC 2 and HIPAA require audit trails and evidence retention… a third‑party vendor’s published security policy that they referenced in their own controls no longer exists at the URL they cited.” – kevincloudsec
“If a website is open to the public, shouldn’t it be archivable?” – ninjagoo (echoed by many commenters)
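A dead citation like the one kevincloudsec describes becomes an evidence gap at audit time. One rough way to detect it is the Wayback Machine's public availability API; the endpoint below is real, while the vendor URL and the framing as a compliance check are illustrative:

```python
# Sketch: ask the Wayback Machine whether a cited URL has an archived
# snapshot. The availability endpoint is a real public API; the cited
# vendor URL is hypothetical, and a real evidence workflow would need
# more than this single check.
import json
import urllib.parse
import urllib.request

def wayback_snapshot(url: str) -> str | None:
    """Return the closest archived snapshot URL, or None if none exists."""
    api = "https://archive.org/wayback/available?url=" + urllib.parse.quote(url, safe="")
    with urllib.request.urlopen(api) as resp:
        data = json.load(resp)
    snap = data.get("archived_snapshots", {}).get("closest")
    return snap["url"] if snap and snap.get("available") else None

if __name__ == "__main__":
    cited = "https://vendor.example.com/security-policy"  # hypothetical citation
    print(wayback_snapshot(cited) or "no snapshot: evidence gap")
```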
These four threads (blocking, AI scraping, the value of preservation, and compliance risk) capture the core concerns of the discussion.