Bots are currently scraping the internet for LLM training data at unprecedented rates[1][2][3], driving up costs and destabilizing public-facing websites. I want to talk about how this has been particularly difficult for wikis, and how it has gotten much worse in the last few months.
As you’ve said elsewhere, you’ve created a crawler trap, not a way to poison a model. You’re wasting… some resources I guess? Both theirs and your own. Fascinating to think that you’ve served a billion HTTP requests to no benefit to anyone and you believe this is you winning somehow.
Yes, it does have a cost. It has a far smaller cost than serving the real thing. It also allows me to firewall them off and stop serving them, even if they come at me with real browsers. That’s a very definitive win: I saved CPU time, I saved RAM, I saved network bandwidth, and I stopped them from accessing my stuff. How is that not a win?
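
To make the "trap, then firewall" idea concrete, here is a minimal sketch of the decision logic, not the actual setup discussed above. It assumes a hypothetical `/trap/` path prefix that only crawlers ignoring the site's rules would follow, and a hypothetical threshold of trap hits before a client is blocked. A real deployment would push blocked addresses into a firewall rather than keep them in memory.

```python
from collections import Counter

# Assumptions for illustration only; neither value comes from the discussion.
TRAP_PREFIX = "/trap/"   # hypothetical path that no legitimate visitor should request
BLOCK_THRESHOLD = 3      # hypothetical number of trap hits before blocking


class TrapBlocklist:
    """Counts trap-URL hits per client and blocks repeat offenders."""

    def __init__(self, threshold=BLOCK_THRESHOLD):
        self.threshold = threshold
        self.trap_hits = Counter()  # client IP -> number of trap requests seen
        self.blocked = set()        # client IPs that are no longer served

    def handle(self, client_ip, path):
        """Return 'block', 'trap', or 'serve' for a single request."""
        if client_ip in self.blocked:
            # Already identified as a crawler: refuse without doing any work.
            return "block"
        if path.startswith(TRAP_PREFIX):
            # Cheap generated filler would be served here instead of real pages.
            self.trap_hits[client_ip] += 1
            if self.trap_hits[client_ip] >= self.threshold:
                self.blocked.add(client_ip)
            return "trap"
        # Ordinary request from a client that has not tripped the trap.
        return "serve"
```

For example, with `TrapBlocklist(threshold=2)`, a client that requests two `/trap/` URLs gets `"trap"` both times and `"block"` on every later request, including requests for real pages, while an unrelated client asking for a normal page still gets `"serve"`. The saving claimed above comes from the `"block"` branch: once a client is identified, it costs almost nothing to turn away.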