I started a page [0] to track the problem and some of the emerging
solutions.  I'm still transferring information over from a private wiki,
but it would be great to get others to document what they've been using.
I'll start expanding on the tools I know about to give more information
about the tradeoffs when using them.  (I've also added a couple of rough
config sketches inline below to make two of the suggestions concrete.)

Thanks for all the great info so far!  I know this has been consuming a
lot of web admins' time over the last few months.

[0]: https://www.mediawiki.org/wiki/Handling_web_crawlers


On Thu, Apr 24, 2025 at 2:39 PM Bryan Davis <bd...@wikimedia.org> wrote:

> On Thu, Apr 24, 2025 at 3:16 PM MusikAnimal <musikani...@gmail.com> wrote:
> >
> > Note that this exercise of IP range whack-a-mole is nothing new to VPS
> > tools. I maintain two VPS projects (XTools, WS Export) that constantly
> > suffer from aggressive web crawlers and disruptive automation. We've been
> > doing the manual IP block thing for years :(
>
> An interesting aspect of both of those Cloud VPS projects is that they
> are directly linked to from a number of content wikis. I think this
> greatly extends their exposure to crawler traffic in general.
>
> > I suggest the IP denylist be applied to all of WMCS
> > <https://phabricator.wikimedia.org/T226688>. We're able to get by for
> > XTools and WS Export because XFF headers were specially enabled for this
> > counter-abuse purpose. However, most VPS tools and all of Toolforge don't
> > have that luxury. If there are bots pounding away, there's currently no
> > way to stop them (unless they are good bots with an identifiable UA).
> > Even if we could detect them, it seems better to reduce the repetitive
> > effort and give all of WMCS the same treatment.
>
> You are talking about three completely separate HTTP edges at this
> point. They all live on the same core Cloud VPS infrastructure, but
> there is no shared HTTPS entry point among the *.toolforge.org proxy,
> the *.wmcloud.org proxy, and the Beta Cluster CDN. The first two share
> some nginx stack configuration, but in practice they are very different
> deployments with independent public IP addresses. The third is
> fundamentally a partial clone of the production wikis' CDN edge,
> albeit scaled down and missing some newer components that nobody has
> yet done the work to introduce.
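
To make the XFF point above concrete: once a tool can trust
X-Forwarded-For from the shared front proxy, per-tool denylisting in
Apache 2.4 can look roughly like the sketch below.  This is only a
sketch; the proxy range and the denied addresses are placeholders, not
the real values from T226688.

    # Enable with "a2enmod remoteip" on Debian-style Apache installs.
    # Trust X-Forwarded-For only when it arrives from the front proxy
    # (172.16.0.0/21 is a placeholder, not the real proxy range).
    RemoteIPHeader X-Forwarded-For
    RemoteIPInternalProxy 172.16.0.0/21

    <Location "/">
        <RequireAll>
            Require all granted
            # Placeholder denylist entries; a real deployment would
            # pull these from the shared WMCS list.
            Require not ip 192.0.2.0/24
            Require not ip 198.51.100.17
        </RequireAll>
    </Location>
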
>
> > I'll also note that some farms of web crawlers can't feasibly be blocked
> > whack-a-mole style. This is the situation we're currently dealing with
> > over at <https://phabricator.wikimedia.org/T384711#10759017>.
>
> Truly distributed attack patterns (botnet traffic) are really hard to
> defend against with just an Apache2 instance. This is actually a place
> where someone could try experimenting with a filtering proxy like
> Anubis [0], go-away [1], or openappsec [2]. Having some experience
> with these tools could then lead us into better discussions about
> deploying them more widely or making them easier to use in targeted
> projects.
>
> [0]: https://anubis.techaro.lol/
> [1]: https://git.gammaspectra.live/git/go-away
> [2]: https://github.com/openappsec/openappsec
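
And to make the Anubis option concrete for anyone who wants to run the
experiment: my understanding from the docs is that Anubis sits between
the TLS-terminating web server and the application and is configured
through environment variables, so the wiring looks roughly like the
sketch below.  Treat the variable names and ports as assumptions to be
checked against https://anubis.techaro.lol/ rather than as a tested
deployment.

    # Environment for the Anubis process (e.g. in a systemd unit's
    # EnvironmentFile).  Names follow Anubis's env-var configuration
    # as I understand it; verify against the upstream docs.
    # Where Anubis listens for proxied traffic:
    BIND=:8923
    # The application Anubis protects:
    TARGET=http://127.0.0.1:8080
    # Proof-of-work cost for challenged clients:
    DIFFICULTY=4

    # Apache side of the same vhost: route everything through Anubis
    # instead of straight to the application.
    ProxyPreserveHost On
    ProxyPass        "/" "http://127.0.0.1:8923/"
    ProxyPassReverse "/" "http://127.0.0.1:8923/"
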
>
> Bryan
> --
> Bryan Davis                                        Wikimedia Foundation
> Principal Software Engineer                               Boise, ID USA
> [[m:User:BDavis_(WMF)]]                                      irc: bd808
_______________________________________________
Wikitech-l mailing list -- wikitech-l@lists.wikimedia.org
To unsubscribe send an email to wikitech-l-le...@lists.wikimedia.org
https://lists.wikimedia.org/postorius/lists/wikitech-l.lists.wikimedia.org/
