cross-posted from: https://poptalk.scrubbles.tech/post/3263324

Sorry for the alarming title, but for real, admins: go set up Anubis.

For context, Anubis is essentially a gatekeeper/rate limiter for small services. From them:

(Anubis) is designed to help protect the small internet from the endless storm of requests that flood in from AI companies. Anubis is as lightweight as possible to ensure that everyone can afford to protect the communities closest to them.

It puts forward a challenge that must be solved before access is granted, and it judges how trustworthy a connection is. The vast majority of real users will never notice it, or will notice a small delay the first time they access your site. Even smaller scrapers may get by relatively easily.

Big scrapers though, the AI crawlers and model trainers, get hit with computational problems that waste their compute before they're let in. (Trust me, I worked for a company that did “scrape the internet”, and compute is expensive and a constant worry for them, so it's a win-win for us!)
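
Under the hood it's a proof-of-work scheme, the same basic idea as Hashcash: the browser has to find a nonce whose hash meets a difficulty target before it's allowed through. Here's a minimal sketch of that general idea in Go; the challenge format, difficulty, and function names are invented for illustration and are not Anubis's actual implementation.

```go
// Minimal proof-of-work sketch (illustrative only, not Anubis's real code):
// the client must find a nonce so that SHA-256(challenge || nonce) starts
// with a number of zero bits. A real browser pays this cost once per
// session; a scraper spinning up fresh sessions pays it over and over.
package main

import (
	"crypto/sha256"
	"encoding/binary"
	"fmt"
	"math/bits"
)

// leadingZeroBits counts how many leading zero bits a hash has.
func leadingZeroBits(sum [32]byte) int {
	n := 0
	for _, b := range sum {
		if b == 0 {
			n += 8
			continue
		}
		n += bits.LeadingZeros8(b)
		break
	}
	return n
}

// solve is the expensive client-side part: brute-force a nonce until the
// hash of challenge||nonce has at least `difficulty` leading zero bits.
func solve(challenge string, difficulty int) uint64 {
	buf := make([]byte, 8)
	for nonce := uint64(0); ; nonce++ {
		binary.BigEndian.PutUint64(buf, nonce)
		sum := sha256.Sum256(append([]byte(challenge), buf...))
		if leadingZeroBits(sum) >= difficulty {
			return nonce
		}
	}
}

// verify is the cheap server-side part: a single hash per submitted answer.
func verify(challenge string, nonce uint64, difficulty int) bool {
	buf := make([]byte, 8)
	binary.BigEndian.PutUint64(buf, nonce)
	sum := sha256.Sum256(append([]byte(challenge), buf...))
	return leadingZeroBits(sum) >= difficulty
}

func main() {
	const difficulty = 20 // ~a million hashes on average; purely illustrative
	challenge := "example-challenge-token"

	nonce := solve(challenge, difficulty)
	fmt.Printf("nonce %d passes verification: %v\n", nonce, verify(challenge, nonce, difficulty))
}
```

The asymmetry is the whole point: verification is one hash for the server, while solving costs the client a large (tunable) number of hashes, which is negligible for one human but adds up fast for someone hammering you with thousands of requests.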

Anubis ended up taking maybe 10 minutes to set up. For Lemmy hosts, you literally just point your UI proxy at Anubis and point Anubis at the Lemmy UI. Very easy, slots right in, minimal setup.
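
For anyone curious what that looks like in practice, here's a rough sketch of the proxy side. The container names and ports are placeholders and this isn't a full Lemmy nginx config; the shape is simply that your existing proxy sends UI traffic to Anubis instead of straight to lemmy-ui, and Anubis forwards whatever passes the check.

```nginx
# Illustrative placeholders only; names, ports, and paths differ per setup.
# Before: browser -> nginx -> lemmy-ui
# After:  browser -> nginx -> anubis -> lemmy-ui
server {
    listen 443 ssl;
    server_name lemmy.example.com;

    # UI traffic now goes through Anubis instead of straight to lemmy-ui.
    location / {
        # was: proxy_pass http://lemmy-ui:1234;
        proxy_pass http://anubis:8080;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    }

    # API/federation location blocks stay exactly as they were, pointing at
    # the Lemmy backend directly, so Anubis never touches that traffic.
}
# Anubis is then pointed at lemmy-ui as its upstream target (see the Anubis
# docs for the exact setting), so passed requests flow through to the UI.
```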

These graphs cover the time since I turned it on less than an hour ago. I have a small instance, only a few people, and my CPU usage and requests per minute immediately went down. Thousands of requests have already been challenged; I had no idea I was being scraped this much! You can see them backing off in the charts.

(FYI, this only sits in front of the web UI, so it does nothing to the API or federation; those are proxied separately, so it really does only target web scrapers.)

  • hendrik@palaver.p3x.de · 16 days ago

    Yes, they’re really nasty. Back in the day, Googlebot, BingBot and all the rest would show up in my logs and fetch a bit of content every now and then. What I’ve seen from AI scrapers is mad: Alibaba and Tencent hit me with tens of requests per second, ignoring robots.txt entirely, and they did it from several large IP ranges, probably to circumvent rate limiting and simpler countermeasures. It’s completely unsustainable. My entire server became unresponsive due to the massive database load. I don’t think any single server with dynamic content can withstand this; it’s just a massive DDoS attack. I wonder if this pays off for them. I’d bet that in a year they’re blocked from most of the web, and that doesn’t seem like it’s in their interest. They’re probably aware of this? We certainly have no other option.