cross-posted from: https://poptalk.scrubbles.tech/post/3263324
Sorry for the alarming title, but admins, for real: go set up Anubis.
For context, Anubis is essentially a gatekeeper/rate limiter for small services. From them:
(Anubis) is designed to help protect the small internet from the endless storm of requests that flood in from AI companies. Anubis is as lightweight as possible to ensure that everyone can afford to protect the communities closest to them.
It puts forward a challenge that must be solved in order to gain access, and it judges how trustworthy a connection is. The vast majority of real users will never notice it, or will only notice a small delay the first time they access your site. Even smaller scrapers may get by relatively easily.
Big scrapers though, the AI crawlers and trainers, get hit with computational problems that waste their compute before they’re let in. (Trust me, I worked for a company that did “scrape the internet”, and compute is expensive and a constant worry for them, so it’s a win-win for us!)
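To make the “waste their compute” part concrete, here’s a minimal hashcash-style proof-of-work sketch of the general idea; it’s my own illustration, not Anubis’s actual code, and the challenge format, difficulty rule, and function names are assumptions. The asymmetry is the whole point: issuing and verifying a challenge costs the server one hash, while solving it costs the client tens of thousands of hashes on average (far more at higher difficulties).

```python
import hashlib
import os
import time

# Minimal hashcash-style proof-of-work sketch. Anubis's real challenge
# format, difficulty rules, and token handling differ in the details;
# the "leading zero hex digits" rule and the names here are assumptions.

DIFFICULTY = 4  # leading zero hex digits; ~16**4 = 65,536 hashes on average


def make_challenge() -> str:
    """Server side: issuing a random challenge is one cheap call."""
    return os.urandom(16).hex()


def solve(challenge: str, difficulty: int = DIFFICULTY) -> int:
    """Client side: brute-force a nonce until the hash has the required prefix (expensive)."""
    prefix = "0" * difficulty
    nonce = 0
    while True:
        digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).hexdigest()
        if digest.startswith(prefix):
            return nonce
        nonce += 1


def verify(challenge: str, nonce: int, difficulty: int = DIFFICULTY) -> bool:
    """Server side: checking an answer is a single hash (cheap again)."""
    digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).hexdigest()
    return digest.startswith("0" * difficulty)


if __name__ == "__main__":
    challenge = make_challenge()
    start = time.time()
    nonce = solve(challenge)
    print(f"solved in {time.time() - start:.2f}s, valid={verify(challenge, nonce)}")
```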
Anubis ended up taking maybe 10 minutes to set up. For Lemmy hosts, you literally just point your UI proxy at Anubis and point Anubis at the Lemmy UI. Very easy; it slots right in with minimal setup.
These graphs cover the time since I turned it on, less than an hour ago. I have a small instance with only a few people, and immediately my CPU usage and my requests per minute have gone down. I’ve already had thousands of requests challenged; I had no idea I was being scraped this much! You can see them backing off in the charts.
(FYI, this only gates the web UI requests, so it does nothing to the API or federation. Those are proxied elsewhere, so it really does only target web scrapers.)



Yeah, I’m seeing that too, which is what I thought, but it didn’t talk too much about the blocklists, which is interesting. Overall it’s a very simple concept to me: if you want to access the site, great, prove that you’re willing to work for it. I’ve worked for large scraping farms before, and the vast majority of them would rather give up than keep doing that over and over. Compute for them is expensive. What takes a few seconds on our machine is tons of wasted compute at their scale, which I think is why I get so giddy over it - I love making them waste their money.
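To put rough numbers on the “tons of wasted compute” point, here’s some napkin math. Every figure in it (difficulty, hash rate, crawl volume) is an assumption for illustration, not a measurement of Anubis or of any particular scraper.

```python
# Napkin math: per-request proof-of-work barely touches a real visitor
# but adds up fast for a bulk scraper. Every number below is an assumed
# illustration, not a measurement of Anubis or any particular crawler.

AVG_HASHES = 16 ** 5          # assumed expected attempts at 5 leading zero hex digits
HASHES_PER_SEC = 500_000      # assumed SHA-256 rate for a browser-based client
REQUESTS_PER_DAY = 1_000_000  # assumed uncached fetches across a scraping fleet

seconds_per_challenge = AVG_HASHES / HASHES_PER_SEC

# A real visitor solves one challenge and typically isn't re-challenged for a while.
print(f"one visitor: ~{seconds_per_challenge:.1f}s of compute, once")

# A scraper rotating IPs and dropping cookies pays it on every fetch.
cpu_hours = REQUESTS_PER_DAY * seconds_per_challenge / 3600
print(f"scraper fleet: ~{cpu_hours:,.0f} CPU-hours per day, every day")
```

Even with generous assumptions, that’s the difference between a couple of seconds for a person and hundreds of CPU-hours a day for a crawler that refuses to back off.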
Sure. It’s a clever idea. I was mainly concerned with the side effects, like all that information disappearing from Google as well - the modern dynamic Cory Doctorow calls the “enshittification” of the internet. And I’m having a hard time with it. I do development work and occasionally archive content, download videos, or turn some stupid website with local events into an RSS feed. I’m getting rate-limited, excluded and blocked left and right. Doing automatic things with websites has turned from 10 lines of Python into an entire ordeal: loading a headless Chromium that eats a gigabyte or more of RAM, having some auto-clicker dismiss the cookie banners and overlays, doing the proof-of-work… I think it turns the internet from an open market of information into something where information isn’t indexed, is walled off and generally unavailable. It’s also short-lived and can’t be archived or innovated upon any more, or at least only with lots of limitations. I think that’s the flipside. But it has two sides. It’s a clever approach, and it works well for what it’s intended to do. And it’s a welcome alternative to everyone doing the same thing with Cloudflare as the central service provider. And I guess it depends on what exactly people do. Limiting a Fediverse instance isn’t the same as doing it with other platforms and websites; they have another channel to spread information, at least amongst themselves. So I guess lots of my criticism doesn’t apply as harshly as I’ve worded it. But I’m generally a bit sad about the overall trend.
Yeah, I’m in that boat with you. I’ve also done light scraping, and it kills me that I have to purposely make my site harder to scrape - but at the same time, big tech is not playing fairly, scraping my site over and over and over again for any and all changes, overloading my database and driving my request rate way too high. It’s not fair at all. I’d like to think that if someone reached out and asked to scrape, I’d let them - but honestly, I’d rather just give them read-only API access.
Yes, they’re really nasty. Back in the day, Googlebot, Bingbot and the like would show up in my logs and fetch a bit of content every now and then. What I’ve seen with the AI scrapers is mad. Alibaba and Tencent hit me with tens of requests per second, ignored robots.txt entirely, and did it from several large IP ranges, probably to circumvent rate-limiting and simpler countermeasures. It’s completely unsustainable. My entire server became unresponsive due to the massive database load. I don’t think any single server with dynamic content can keep up with that; it’s just a massive DDoS attack. I wonder if this pays off for them. I’d bet in a year they’re blocked from most of the web, and that doesn’t seem like it’s in their interest. They’re probably aware of this? We certainly have no other option.