Bots are currently scraping the internet for LLM training data at unprecedented rates[1][2][3], driving up costs and destabilizing public-facing websites. I want to talk about how this has been particularly difficult for wikis, and how it has gotten much worse in the last few months.
That is not entirely true in theory. It is possible to engineer content to have a disproportionate impact on model performance. But we are talking about state-of-the-art research, and it's a moving target since the models evolve quite fast.