I started a page [0] to track the problem and some of the emerging solutions. I'm still transferring information over from a private wiki, but it would be great to get others to document what they've been using. I'll start expanding on the tools I know about to give more information about the tradeoffs of using them.
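To give a sense of the kind of tradeoff notes I want to capture on that page: Anubis and go-away (mentioned below) both act as a small reverse proxy that sits between the existing front proxy and the tool, challenging suspicious clients and forwarding everything else. If I understand the deployments correctly, wiring one in is mostly a matter of pointing the front proxy at the filter instead of at the tool. A minimal nginx sketch, where the hostname and both ports are placeholders rather than anything we actually run:

    # Public vhost hands all traffic to the filtering proxy first. The
    # proxy is assumed to listen on 127.0.0.1:8923 and to forward the
    # requests it accepts to the real tool backend on 127.0.0.1:8080.
    server {
        listen 443 ssl;
        server_name tool.example.org;  # placeholder hostname

        location / {
            proxy_pass http://127.0.0.1:8923;
            # Preserve the original host and client address so both the
            # filtering proxy and the tool can still see who is calling.
            proxy_set_header Host $host;
            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
            proxy_set_header X-Forwarded-Proto $scheme;
        }
    }

The interesting differences are mostly in each proxy's own policy layer (what gets challenged, what passes through untouched), which is exactly what I'd like the wiki page to compare.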
Thanks for all the great info so far! I know this has been consuming a lot of web admins' time over the last few months.

[0]: https://www.mediawiki.org/wiki/Handling_web_crawlers

On Thu, Apr 24, 2025 at 2:39 PM Bryan Davis <bd...@wikimedia.org> wrote:
> On Thu, Apr 24, 2025 at 3:16 PM MusikAnimal <musikani...@gmail.com> wrote:
> >
> > Note that this exercise of IP range whack-a-mole is nothing new to VPS
> > tools. I maintain two VPS projects (XTools, WS Export) that constantly
> > suffer from aggressive web crawlers and disruptive automation. We've been
> > doing the manual IP block thing for years :(
>
> An interesting aspect of both of those Cloud VPS projects is that they
> are directly linked to from a number of content wikis. I think this
> greatly extends their exposure to crawler traffic in general.
>
> > I suggest the IP denylist be applied to all of WMCS
> > <https://phabricator.wikimedia.org/T226688>. We're able to get by for
> > XTools and WS Export because XFF headers were specially enabled for this
> > counter-abuse purpose. However, most VPS tools and all of Toolforge don't
> > have such luxury. If there are bots pounding away, there's no means to stop
> > them currently (unless they are good bots with an identifiable UA). Even if
> > we could detect them, it seems better to reduce the repetitive effort and
> > give all of WMCS the same treatment.
>
> You are talking about three completely separate HTTP edges at this
> point. They all live on the same core Cloud VPS infrastructure, but
> there is no common HTTPS connection between the *.toolforge.org proxy,
> the *.wmcloud.org proxy, and the Beta Cluster CDN. The first two share
> some nginx stack configuration, but in practice are very different
> deployments with independent public IP addresses. The third is
> fundamentally a partial clone of the production wiki's CDN edge,
> although scaled down and missing some newer components that nobody has
> yet done the work to introduce.
>
> > I'll also note that some farms of web crawlers can't feasibly be blocked
> > whack-a-mole style. This is the situation we're currently dealing with over
> > at <https://phabricator.wikimedia.org/T384711#10759017>.
>
> Truly distributed attack patterns (botnet traffic) are really hard to
> defend against with just an Apache2 instance. This is actually a place
> where someone could try experimenting with a filtering proxy like
> Anubis [0], go-away [1], or openappsec [2]. Having some experience
> with these tools could then lead us into better discussions about
> deploying them more widely or making them easier to use in targeted
> projects.
>
> [0]: https://anubis.techaro.lol/
> [1]: https://git.gammaspectra.live/git/go-away
> [2]: https://github.com/openappsec/openappsec
>
> Bryan
> --
> Bryan Davis                                 Wikimedia Foundation
> Principal Software Engineer                 Boise, ID USA
> [[m:User:BDavis_(WMF)]]                     irc: bd808
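To expand on the XFF point above for anyone who wants to replicate it: once the shared front proxy is trusted to set X-Forwarded-For, the tool's own web server can restore the real client address from that header and apply ordinary IP denies against it. A rough sketch with nginx's realip module (Apache's mod_remoteip does the same job); the trusted proxy range, blocked ranges, and backend port here are all placeholders, not our actual lists:

    # Trust X-Forwarded-For only when the request comes from the front
    # proxy, then evaluate allow/deny against the restored client address.
    set_real_ip_from 172.16.0.0/12;   # placeholder: trusted proxy range
    real_ip_header X-Forwarded-For;
    real_ip_recursive on;

    location / {
        deny 192.0.2.0/24;            # placeholder abusive ranges (TEST-NET)
        deny 198.51.100.0/24;
        allow all;
        proxy_pass http://127.0.0.1:8080;  # placeholder backend
    }

Without that trusted-proxy step, every request appears to come from the proxy itself, which is why tools that never got the XFF treatment currently have no way to block anyone.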
_______________________________________________
Wikitech-l mailing list -- wikitech-l@lists.wikimedia.org
To unsubscribe send an email to wikitech-l-le...@lists.wikimedia.org
https://lists.wikimedia.org/postorius/lists/wikitech-l.lists.wikimedia.org/