AI scrapers everywhere
There's a nice article: [The Day I Logged 1 In Every 2000 Public IPv4: Visualizing The AI Scraper DDoS](https://vulpinecitrus.info/blog/one-in-every-2000-ipv4-visualizing-ddos-ai-web-scrapers/) which someone kindly dropped [in the fuck_ai community](https://piefed.social/c/[email protected]/p/2025374/the-day-i-logged-1-in-every-2000-public-ipv4-visualizing-the-ai-scraper-ddos). It's worth reading.
After reading it, I decided to have another look at my logfiles and I'm pretty sure I see the same thing. A good while back, my small instance (1-3 people) nearly melted because of Tencent and Alibaba. I managed to address that by blocking some of their IP ranges. After that, rimu [implemented a Honeypot](https://codeberg.org/rimu/pyfedi/issues/1421), which currently has 14,272 IPs blocked on my instance. And a few database requests got more efficient. But I think the overall situation has deteriorated. These days the AI scrapers seem to come from everywhere on the internet.
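For anyone wondering what "blocking some IP ranges" boils down to, here's a minimal Python sketch of the check itself. It's only an illustration: the ranges below are placeholders from the reserved documentation address space (not the actual Tencent or Alibaba networks), `is_blocked` is a made-up helper name, and in practice you'd more likely do this in the firewall or the web server than in application code.

```python
import ipaddress

# Placeholder CIDR ranges from the documentation address space (RFC 5737).
# These are NOT real scraper networks; substitute the ranges you want to block.
BLOCKED_RANGES = [
    ipaddress.ip_network(cidr)
    for cidr in ("192.0.2.0/24", "198.51.100.0/25")
]


def is_blocked(ip, ranges=BLOCKED_RANGES):
    """Return True if the address string falls inside any blocked range."""
    try:
        addr = ipaddress.ip_address(ip)
    except ValueError:
        # Not a parseable IP address, so there is nothing to match.
        return False
    return any(addr in net for net in ranges)
```

The catch, as the rest of this post argues, is that a list like this only helps while the scrapers stay inside a handful of known networks.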
My logfiles just scroll by again. It's not as bad as when I was hit by the really nasty crawlers. The server keeps up with it, I'm just constantly sending out loads of data. There's a crawler getting caught in the honeypot almost every minute. But there's also still a near-infinite number of addresses left, querying all kinds of posts, communities and user profiles. And I notice it because it's a disproportionate amount of traffic for a small instance.
I think the honeypot is great. But the AI scrapers have way too many addresses. And in the medium term we probably need to come up with some more mitigations. At least to cater to smaller servers.
Just wanted to draw some attention to how things change. I've switched my instance to "private" for now, but I'll continue to investigate.
I've made two images myself (for palaver.p3x.de). The first shows all IPv4 addresses in my nginx log. That includes all Fedi instances and users. But I'm just a small instance, so I guess most of the reddish areas are caused by crawlers.
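If you want to get a rough feel for your own logs without drawing a map, something like the following Python sketch could work. It assumes the client IP is the first space-separated field of each line (as in nginx's default "combined" format), and `count_by_prefix` is just a name I picked for illustration. It counts distinct IPv4 addresses per network block, which is roughly the information such a picture encodes.

```python
import ipaddress
from collections import Counter


def count_by_prefix(log_lines, prefix_len=16):
    """Count distinct IPv4 client addresses per network of the given prefix length.

    Assumes the client address is the first space-separated field of each line.
    """
    seen = set()
    for line in log_lines:
        first_field = line.split(" ", 1)[0]
        try:
            addr = ipaddress.ip_address(first_field)
        except ValueError:
            continue  # skip lines that do not start with an IP address
        if addr.version == 4:
            seen.add(addr)

    counts = Counter()
    for addr in seen:
        # strict=False masks off the host bits, e.g. 203.0.113.5/24 -> 203.0.113.0/24
        net = ipaddress.ip_network(f"{addr}/{prefix_len}", strict=False)
        counts[str(net)] += 1
    return counts


if __name__ == "__main__":
    # The log path is an assumption; adjust it to your setup.
    with open("/var/log/nginx/access.log", encoding="utf-8", errors="replace") as f:
        for network, n in count_by_prefix(f).most_common(20):
            print(f"{network}\t{n}")
```

Blocks with lots of distinct addresses but only a request or two each are the ones I'd look at first.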

The second image shows what's in my honeypot (dots visible after zooming in):
