Summary of Hacker News Discussion on Miasma Tool
1. Ethical Concerns About AI Training Data Collection
Many commenters expressed frustration that AI companies scrape website content without permission or compensation, and without honoring robots.txt.
"Multiple AI scrapers are downloading every page of my 6M page website as we speak. They don't care about the fact that I have dedicated 20 years to building it, nor that I have to maintain multiple VPSes just to serve it to them." - spiderfarmer
"I wish if there was some regulation which could force companies who scrape for (profit) to reveal who they are to the end websites, many new AI company don't seem to respect any decision made by the person who owns the website and shares their knowledge for other humans, only for it to get distilled for a few cents." - Imustaskforhelp
2. Effectiveness Debate on the Miasma Tool
Opinions varied widely on whether the tool would actually work against sophisticated scrapers.
"Even it did work, I just can't bring myself to care enough. It doesn't feel like anything I could do on my site would make any material difference. I'm tired." - sd9
"It does work, on two levels: 1. Simple, cheap, easy-to-detect bots will scrape the poison, and feed links to expensive-to-run browser-based bots that you can't detect in any other way. 2. Once you see a browser visit a bullshit link, you insta-ban it, as you can now see that it is a bot because it has been poisoned with the bullshit data." - phoronixrly
"It might work against people just use their Mini Mac with OpenClaw to summarize news every morning, but it certainly won't work against Google. More centralized web ftw." - raincole
3. The Ongoing Arms Race
Many commenters viewed Miasma as just another move in an endless technological battle between scrapers and anti-scraping measures.
"This is ultimately just going to give them training material for how to avoid this crap. They'll have to up their game to get good code. The arms race just took another step, and if you're spending money creating or hosting this kind of content, it's not going to make up for the money you're losing by your other content getting scraped." - aldousd666
"This is essentially machine-generated spam. The irony of machine-generated slop to fight machine-generated slop would be funny, if it weren't for the implications. How long before people start sharing ai-spam lists, both pro-ai and anti-ai?" - ninjagoo
"I feel like trying to prevent AI bots, or any bots, from crawling a public web service, is a similar game of whack-a-mole, but one where you may also end up damaging your service." - CrzyLngPwd
4. Philosophical Questions About Information Ownership
The discussion touched on deeper questions about whether information on the web is "free" to be used for AI training.
"If you want people to read and learn from each other, you should incentivize people to make content worth reading and learning from. Making LLM training a viable loophole for copyright law means there won't be incentives to produce such work." - FromTheFirstIn
"Reading something, learning from it, then writing something similar, is legal; and more importantly, it is moral. There is no violation here. Copyright holders already have plenty of power; they must not be given the power to restrict the output of your brain forever more for merely having read and learnt. Reading and learning is sacred." - feepingcreature
"the very human need to be recognized for something created, made, or thought by a person. People are ok with writing blog posts, they're ok with writing software, and they're ok with give it all for free, but they want their name attached and their contribution recognized." - GeoAtreides