The Brazil IP addresses are suspected to be a large number of compromised TV
set-top boxes being used by an AI scraper. They are difficult to block, and
because they sit on residential ISPs, blocking them makes collateral damage
all but certain.

https://anubis.techaro.lol/ is currently being deployed by a number of other 
sites, small and large, from the Arch Wiki to UNESCO. It is MIT licensed, sits 
between a front proxy and the appserver, and uses a proof-of-work CAPTCHA to 
prevent bots. It is a blunt hammer, but it's probably better than IP blocking. 
There is some ability to allow acceptable bots: 
https://anubis.techaro.lol/docs/admin/policies/
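
To illustrate the general proof-of-work idea (this is a hashcash-style
sketch, not Anubis's actual code or protocol; the function names and the
"leading zero hex digits" rule are assumptions made for the example): the
server hands out a random challenge, the client burns CPU searching for a
nonce, and the server verifies the answer with a single hash.

```python
import hashlib
import secrets


def make_challenge() -> str:
    """Server side: issue a random, unpredictable challenge string."""
    return secrets.token_hex(16)


def _digest(challenge: str, nonce: int) -> str:
    # Hash the challenge together with the candidate nonce.
    return hashlib.sha256(f"{challenge}{nonce}".encode()).hexdigest()


def solve(challenge: str, difficulty: int) -> int:
    """Client side: find the smallest nonce whose digest starts with
    `difficulty` zero hex digits (expected work grows as 16**difficulty)."""
    prefix = "0" * difficulty
    nonce = 0
    while not _digest(challenge, nonce).startswith(prefix):
        nonce += 1
    return nonce


def verify(challenge: str, nonce: int, difficulty: int) -> bool:
    """Server side: checking an answer costs one hash."""
    return _digest(challenge, nonce).startswith("0" * difficulty)
```

The asymmetry is the point: one visitor barely notices the cost, while a
scraper fetching many pages pays it on every challenge it has to solve.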

https://git.gammaspectra.live/git/go-away is a similar project with more 
configuration available, but I haven't heard of as many folks deploying it.

I don't like advocating for these measures.
I'm not sure there are any other reasonable options for resource-limited 
projects.

On Wed, Apr 23, 2025, at 7:36 PM, Bryan Davis wrote:
> On Tue, Apr 15, 2025 at 2:27 PM Bryan Davis <bd...@wikimedia.org> wrote:
>>
>> I just wanted to give folks a heads up that in response to a few
>> traffic storms in the Beta Cluster (deployment-prep Cloud VPS project)
>> we have started using the very coarse protection of blocking IP
>> ranges. These blocks are being applied at the Beta Cluster CDN edge
>> where we have Varnish configuration that can discard traffic based on
>> a list of CIDR ranges.
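
For readers less familiar with CIDR matching, the check described above
amounts to something like the sketch below. It is plain Python for
illustration only, not the Varnish configuration in use; the BLOCKED list
holds reserved documentation ranges as placeholders, and is_blocked is a
hypothetical helper name.

```python
import ipaddress

# Placeholder block list (reserved documentation ranges, not real blocks).
BLOCKED = [
    ipaddress.ip_network(cidr)
    for cidr in ("192.0.2.0/24", "198.51.100.0/24", "2001:db8::/32")
]


def is_blocked(client_ip: str) -> bool:
    """Return True if client_ip falls inside any blocked CIDR range."""
    addr = ipaddress.ip_address(client_ip)
    # Only compare addresses and networks of the same IP version.
    return any(
        addr.version == net.version and addr in net for net in BLOCKED
    )
```
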
>>
>> The ranges blocked at any point in time should be visible in the
>> deployment-prep project's Hiera configuration that is logged in the
>> cloud/instance-puppet.git repo. [0]
>>
>> The hardly scientific process of choosing what to block so far has
>> been done with processes like the one documented at
>> https://phabricator.wikimedia.org/T392003. Hashar came up with a shell
>> one-liner to count requests by IP address or IP address prefix
>> depending on the regex provided. We then take the top addresses
>> produced by that log filtering and perform a `whois` lookup to find
>> the associated IP address allocation. The CIDR blocks associated with
>> the allocation are then put into hiera config, a Puppet run is forced,
>> and Varnish is restarted. Repeat as necessary to get to a reasonable
>> rate of requests passing through Varnish to the backing MediaWiki
>> instances where we are examining the logs.
>
> A week goes by and we find ourselves back in the same "beta crushed by
> bot traffic" place again. [2] I tried blocking selectively at first
> [3], but I was not making much progress in lowering the load. After
> noticing that a lot of the traffic was coming from ranges assigned to
> orgs in Brazil I tried blocking a lot of Class B networks (X.Y.0.0/16)
> that were on https://ipnetinfo.com/country/BR and showing traffic in
> the logs. [4] This helped a bit, but things were still looking pretty
> bad.
>
> I got frustrated and decided to see if blocking Class A networks
> (X.0.0.0/8) would do anything. I wrote a delightfully horrible script
> that buckets the last 50,000 requests by Class A network and outputs a
> cut-and-paste ready list of all of them with more than 500 requests.
> [5] I blocked these IP ranges, waited to see what happened for a bit,
> and repeated a few times.
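
The bucketing step described above could look roughly like this in Python.
This is a sketch, not the script linked at [5]: it assumes the client IP is
the first whitespace-separated field of each log line and handles IPv4
only, and busy_slash8s is a hypothetical name.

```python
from collections import Counter, deque


def busy_slash8s(log_lines, window=50_000, threshold=500):
    """Count requests per /8 over the last `window` log lines and return
    (cidr, count) pairs for networks with more than `threshold` requests,
    busiest first."""
    # A bounded deque keeps only the most recent `window` lines.
    recent = deque(log_lines, maxlen=window)
    counts = Counter()
    for line in recent:
        fields = line.split()
        if not fields:
            continue
        # Assumption: the client IP is the first field of the line.
        first_octet, dot, _ = fields[0].partition(".")
        if dot and first_octet.isdigit() and int(first_octet) <= 255:
            counts[f"{int(first_octet)}.0.0.0/8"] += 1
    return [(net, n) for net, n in counts.most_common() if n > threshold]
```

Printing the first element of each pair gives a cut-and-paste ready list of
ranges, with the same caveat as the original: a /8 is a very wide net.
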
>
> This seems to have worked so far, but does not make me very happy. The
> blocks are really wide and almost certain to sweep up legitimate
> traffic sooner or later if we keep doing things this way. We have some
> newer tools in use with the production networks that might make it
> easier for us to rate limit aggressively at the edge rather than
> applying outright blocks to large ranges.
>
>> If you feel that you have legitimate traffic for the Beta Cluster to
>> handle that has gotten swept up in one of these blocks, please reach
>> out by filing a task on the #beta-cluster-infrastructure Phabricator
>> board. [1]
>>
>> If you think working to make this process of blocking easier or
>> unnecessary sounds like a fun project I would love to chat more. Hit
>> me up via email, libera.chat irc, or on-wiki with your ideas.
>>
>> [0]: 
>> https://gerrit.wikimedia.org/r/plugins/gitiles/cloud/instance-puppet/+/refs/heads/master/deployment-prep/_.yaml
>> [1]: https://phabricator.wikimedia.org/tag/beta-cluster-infrastructure/
>
> [2]: https://phabricator.wikimedia.org/T392534
> [3]: https://phabricator.wikimedia.org/T392534#10763059
> [4]: https://phabricator.wikimedia.org/T392534#10763134
> [5]: https://phabricator.wikimedia.org/T392534#10763235
>
> Bryan
> -- 
> Bryan Davis                                        Wikimedia Foundation
> Principal Software Engineer                               Boise, ID USA
> [[m:User:BDavis_(WMF)]]                                      irc: bd808
> _______________________________________________
> Wikitech-l mailing list -- wikitech-l@lists.wikimedia.org
> To unsubscribe send an email to wikitech-l-le...@lists.wikimedia.org
> https://lists.wikimedia.org/postorius/lists/wikitech-l.lists.wikimedia.org/

-- 
AntiCompositeNumber
(they/them)