cross-posted from: https://poptalk.scrubbles.tech/post/3263324
Sorry for the alarming title, but admins, for real: go set up Anubis.
For context, Anubis is essentially a gatekeeper/rate limiter for small services. From them:
(Anubis) is designed to help protect the small internet from the endless storm of requests that flood in from AI companies. Anubis is as lightweight as possible to ensure that everyone can afford to protect the communities closest to them.
It puts forward a challenge that must be solved in order to gain access, and it judges how trustworthy a connection is. The vast majority of real users will never notice it, or will only notice a small delay the first time they access your site. Even smaller scrapers may get by relatively easily.
Big scrapers though, the AI crawlers and trainers, get hit with computational problems that waste their compute before they’re let in. (Trust me, I worked for a company that did “scrape the internet”, and compute is expensive and a constant worry for them, so it’s a win-win for us!)
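To make the “waste their compute” part concrete, here’s a minimal hashcash-style proof-of-work sketch of the general idea; it’s my own illustration, not Anubis’s actual code, and the challenge format, difficulty rule, and function names are assumptions. The asymmetry is the whole point: issuing and verifying a challenge costs the server one hash, while solving it costs the client tens of thousands of hashes on average (far more at higher difficulties).

```python
import hashlib
import os
import time

# Minimal hashcash-style proof-of-work sketch. Anubis's real challenge
# format, difficulty rules, and token handling differ in the details;
# the "leading zero hex digits" rule and the names here are assumptions.

DIFFICULTY = 4  # leading zero hex digits; ~16**4 = 65,536 hashes on average


def make_challenge() -> str:
    """Server side: issuing a random challenge is one cheap call."""
    return os.urandom(16).hex()


def solve(challenge: str, difficulty: int = DIFFICULTY) -> int:
    """Client side: brute-force a nonce until the hash has the required prefix (expensive)."""
    prefix = "0" * difficulty
    nonce = 0
    while True:
        digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).hexdigest()
        if digest.startswith(prefix):
            return nonce
        nonce += 1


def verify(challenge: str, nonce: int, difficulty: int = DIFFICULTY) -> bool:
    """Server side: checking an answer is a single hash (cheap again)."""
    digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).hexdigest()
    return digest.startswith("0" * difficulty)


if __name__ == "__main__":
    challenge = make_challenge()
    start = time.time()
    nonce = solve(challenge)
    print(f"solved in {time.time() - start:.2f}s, valid={verify(challenge, nonce)}")
```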
Anubis ended up taking maybe 10 minutes to set up. For Lemmy hosts, you literally just point your UI proxy at Anubis and point Anubis at the Lemmy UI. Very easy; it slots right in with minimal setup.
These graphs cover the time since I turned it on, less than an hour ago. I have a small instance with only a few people, and immediately my CPU usage and my requests per minute have gone down. I’ve already had thousands of requests challenged; I had no idea I was being scraped this much! You can see them backing off in the charts.
(FYI, this only gates the web UI requests, so it does nothing to the API or federation. Those are proxied elsewhere, so it really does only target web scrapers.)



Yeah, I’m seeing that too, which is what I thought, but it didn’t talk too much about the blocklists, which is interesting. Overall it’s a very simple concept to me: if you want to access the site, great, prove that you’re willing to work for it. I’ve worked for large scraping farms before, and the vast majority of them would rather give up than keep doing that over and over. Compute for them is expensive. What takes a few seconds on our machine is tons of wasted compute at their scale, which I think is why I get so giddy over it - I love making them waste their money.
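To put rough numbers on the “tons of wasted compute” point, here’s some napkin math. Every figure in it (difficulty, hash rate, crawl volume) is an assumption for illustration, not a measurement of Anubis or of any particular scraper.

```python
# Napkin math: per-request proof-of-work barely touches a real visitor
# but adds up fast for a bulk scraper. Every number below is an assumed
# illustration, not a measurement of Anubis or any particular crawler.

AVG_HASHES = 16 ** 5          # assumed expected attempts at 5 leading zero hex digits
HASHES_PER_SEC = 500_000      # assumed SHA-256 rate for a browser-based client
REQUESTS_PER_DAY = 1_000_000  # assumed uncached fetches across a scraping fleet

seconds_per_challenge = AVG_HASHES / HASHES_PER_SEC

# A real visitor solves one challenge and typically isn't re-challenged for a while.
print(f"one visitor: ~{seconds_per_challenge:.1f}s of compute, once")

# A scraper rotating IPs and dropping cookies pays it on every fetch.
cpu_hours = REQUESTS_PER_DAY * seconds_per_challenge / 3600
print(f"scraper fleet: ~{cpu_hours:,.0f} CPU-hours per day, every day")
```

Even with generous assumptions, that’s the difference between a couple of seconds for a person and hundreds of CPU-hours a day for a crawler that refuses to back off.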
Sure. It’s a clever idea. I was mainly concerned with the side effects, like all that information disappearing from Google as well - the modern dynamic Cory Doctorow calls the “enshittification” of the internet. And I’m having a hard time with it. I do development work and occasionally archive content, download videos, or turn some stupid website with local events into an RSS feed. I’m getting rate-limited, excluded and blocked left and right. Doing automatic things with websites has turned from 10 lines of Python into an entire ordeal: loading a headless Chromium that eats a gigabyte or more of RAM, having some auto-clicker dismiss the cookie banners and overlays, doing the proof-of-work… I think it turns the internet from an open market of information into something where information isn’t indexed, is walled off and generally unavailable. It’s also short-lived and can’t be archived or innovated upon any more, or at least only with lots of limitations. I think that’s the flipside. But it has two sides. It’s a clever approach, and it works well for what it’s intended to do. And it’s a welcome alternative to everyone doing the same thing with Cloudflare as the central service provider. And I guess it depends on what exactly people do. Limiting a Fediverse instance isn’t the same as doing it with other platforms and websites; they have another channel to spread information, at least amongst themselves. So I guess lots of my criticism doesn’t apply as harshly as I’ve worded it. But I’m generally a bit sad about the overall trend.
Yeah, I’m in that boat with you. I’ve also done light scraping, and it kills me that I have to purposely make my site harder to scrape - but at the same time, big tech is not playing fairly, scraping my site over and over and over again for any and all changes, overloading my database and driving my request rate way too high. It’s not fair at all. I’d like to think that if someone reached out and asked to scrape, I’d let them - but honestly, I’d rather just give them read-only API access.
Yes, they’re really nasty. Back in the day, Googlebot, Bingbot and the like would show up in my logs and fetch a bit of content every now and then. What I’ve seen with the AI scrapers is mad. Alibaba and Tencent hit me with tens of requests per second, ignored robots.txt entirely, and did it from several large IP ranges, probably to circumvent rate-limiting and simpler countermeasures. It’s completely unsustainable. My entire server became unresponsive due to the massive database load. I don’t think any single server with dynamic content can keep up with that; it’s just a massive DDoS attack. I wonder if this pays off for them. I’d bet in a year they’re blocked from most of the web, and that doesn’t seem like it’s in their interest. They’re probably aware of this? We certainly have no other option.