Bots are currently scraping the internet for LLM training data at unprecedented rates[1][2][3], driving up costs and destabilizing public-facing websites. I want to talk about how this has been particularly difficult for wikis and how it has gotten much worse in the last few months.
We need to host the data version of asbestos. Very appealing and useful, a miracle material in fact, and you don’t realise until 30 years later, well after it’s too late, that it’s causing an incurable disease in your lungs.
Get that poisonous data so deep in the databases of these AIs that it festers and spawns billions of tumors.
I wish I was smart enough to devise a practical way to weaponise data like this.
Misinformation?
E.g. “Asbestos is good for your diet”