The Brazil IP addresses are suspected to be a bunch of compromised TV set-top boxes being used by an AI scraper. They are difficult to block, and because they sit on residential ISPs, blocking them makes collateral damage all but certain.

https://anubis.techaro.lol/ is currently being deployed by a number of other sites, small and large, from the Arch Wiki to UNESCO. It is MIT licensed, sits between a front proxy and the appserver, and uses a proof-of-work CAPTCHA to deter bots. It is a blunt hammer, but it's probably better than IP blocking. There is some ability to allow acceptable bots: https://anubis.techaro.lol/docs/admin/policies/

https://git.gammaspectra.live/git/go-away is a similar project with more configuration available, but I haven't heard of as many folks deploying it.

I don't like advocating for these measures, but I'm not sure there are any other reasonable options for resource-limited projects.

On Wed, Apr 23, 2025, at 7:36 PM, Bryan Davis wrote:
> On Tue, Apr 15, 2025 at 2:27 PM Bryan Davis <bd...@wikimedia.org> wrote:
>>
>> I just wanted to give folks a heads up that in response to a few
>> traffic storms in the Beta Cluster (deployment-prep Cloud VPS project)
>> we have started using the very coarse protection of blocking IP
>> ranges. These blocks are being applied at the Beta Cluster CDN edge
>> where we have Varnish configuration that can discard traffic based on
>> a list of CIDR ranges.
>>
>> The ranges blocked at any point in time should be visible in the
>> deployment-prep project's Hiera configuration that is logged in the
>> cloud/instance-puppet.git repo. [0]
>>
>> The hardly scientific process of choosing what to block so far has
>> been done with processes like the one documented at
>> https://phabricator.wikimedia.org/T392003. Hashar came up with a shell
>> one-liner to count requests by IP address or IP address prefix
>> depending on the regex provided. We then take the top addresses
>> produced by that log filtering and perform a `whois` lookup to find
>> the associated IP address allocation. The CIDR blocks associated with
>> the allocation are then put into hiera config, a Puppet run is forced,
>> and Varnish is restarted.
>> Repeat as necessary to get to a reasonable
>> rate of requests passing through Varnish to the backing MediaWiki
>> instances where we are examining the logs.
>
> A week goes by and we find ourselves back in the same "beta crushed by
> bot traffic" place again. [2] I tried blocking selectively at first
> [3], but I was not making much progress in lowering the load. After
> noticing that a lot of the traffic was coming from ranges assigned to
> orgs in Brazil I tried blocking a lot of Class B networks (X.Y.0.0/16)
> that were on https://ipnetinfo.com/country/BR and showing traffic in
> the logs. [4] This helped a bit, but things were still looking pretty
> bad.
>
> I got frustrated and decided to see if blocking Class A networks
> (X.0.0.0/8) would do anything. I wrote a delightfully horrible script
> that buckets the last 50,000 requests by Class A network and outputs a
> cut-and-paste ready list of all of them with more than 500 requests.
> [5] I blocked these IP ranges, waited to see what happened for a bit,
> and repeated a few times.
>
> This seems to have worked so far, but does not make me very happy. The
> blocks are really wide and almost certain to sweep up legitimate
> traffic sooner or later if we keep doing things this way. We have some
> newer tools in use with the production networks that might make it
> easier for us to rate limit aggressively at the edge rather than
> applying outright blocks to large ranges.
>
>> If you feel that you have legitimate traffic for the Beta Cluster to
>> handle that has gotten swept up in one of these blocks, please reach
>> out by filing a task on the #beta-cluster-infrastructure Phabricator
>> board. [1]
>>
>> If you think working to make this process of blocking easier or
>> unnecessary sounds like a fun project I would love to chat more. Hit
>> me up via email, libera.chat irc, or on-wiki with your ideas.
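
For anyone who wants to try the /8 bucketing described above on their own logs, here is a rough Python sketch of the idea. To be clear, this is not the script from [5]. The function name `busy_slash8_networks` is made up for illustration, and I am assuming the client IPv4 address is the first whitespace-separated field of each log line, which may not match the actual Beta Cluster log format. The defaults of 50,000 lines and 500 requests are just the numbers quoted above.

```python
from collections import Counter, deque
from ipaddress import IPv4Address, IPv4Network


def busy_slash8_networks(lines, window=50_000, threshold=500):
    """Return [(cidr, count)] for /8 networks that have more than
    `threshold` requests among the last `window` log lines.

    Assumption (hypothetical log format): the client IPv4 address is the
    first whitespace-separated field on each line.
    """
    recent = deque(lines, maxlen=window)  # keeps only the newest lines
    counts = Counter()
    for line in recent:
        fields = line.split()
        if not fields:
            continue  # blank line
        try:
            addr = IPv4Address(fields[0])
        except ValueError:
            continue  # IPv6 or malformed field; ignored in this sketch
        counts[int(addr) >> 24] += 1  # first octet identifies the /8
    return [
        (str(IPv4Network(f"{octet}.0.0.0/8")), n)
        for octet, n in counts.most_common()  # busiest networks first
        if n > threshold
    ]
```

Calling it with an open log file (any iterable of lines works) gives a list of CIDR strings and request counts, busiest first, that can be reviewed by hand before anything goes into a block list. It only counts; deciding whether a /8 is worth the collateral damage is still a human call.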
>>
>> [0]: https://gerrit.wikimedia.org/r/plugins/gitiles/cloud/instance-puppet/+/refs/heads/master/deployment-prep/_.yaml
>> [1]: https://phabricator.wikimedia.org/tag/beta-cluster-infrastructure/
>
> [2]: https://phabricator.wikimedia.org/T392534
> [3]: https://phabricator.wikimedia.org/T392534#10763059
> [4]: https://phabricator.wikimedia.org/T392534#10763134
> [5]: https://phabricator.wikimedia.org/T392534#10763235
>
> Bryan
> --
> Bryan Davis                    Wikimedia Foundation
> Principal Software Engineer    Boise, ID USA
> [[m:User:BDavis_(WMF)]]        irc: bd808
> _______________________________________________
> Wikitech-l mailing list -- wikitech-l@lists.wikimedia.org
> To unsubscribe send an email to wikitech-l-le...@lists.wikimedia.org
> https://lists.wikimedia.org/postorius/lists/wikitech-l.lists.wikimedia.org/

--
AntiCompositeNumber (they/them)