[re-posting here as I inadvertently just replied to Federico privately; ah
the joys of MUAs...]
On Thu, Apr 10, 2025 at 8:56 PM Federico Leva (Nemo) <nemow...@gmail.com>
wrote:

>
> >
> > The new policy isn’t more restrictive than the older one for general
> > crawling of the site or the API; on the contrary we allow higher limits
> > than previously stated.
>
> I find this hard to believe, considering this new sentence for
> upload.wikimedia.org: «Always keep a total concurrency of at most 2, and
> limit your total download speed to 25 Mbps (as measured over 10 second
> intervals).»
>
> This is a ridiculously low limit. It's a speed which is easy to breach
> in casual browsing of Wikimedia Commons categories, let alone with any
> kind of media-related bots.
>
>
First of all, each of the limits explicitly excludes web browsers and human
activity in general.
This limit (which we can discuss, see below) is intended to ensure that a
single unidentified agent cannot use a significant slice of our available
resources.
Second, IIRC there was no stated limit on downloads of media files in the old
policy, because it was written in 2009, when media downloads weren't as big of
an issue. That's why the quote you report explicitly says "the site or the
API" - any limit imposed on media downloads is indeed, by default, more
restrictive.
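
To make the quoted numbers concrete, here is a minimal sketch, in Python, of a
client-side throttle that stays within those limits (at most 2 concurrent
downloads, at most 25 Mbps averaged over 10-second windows). The helper names,
chunk size and user-agent handling are purely my own illustration, nothing the
policy prescribes:

# Illustrative only: a bot-side throttle for the limits quoted above
# (<= 2 concurrent requests, <= 25 Mbps averaged over 10-second windows).
import time
import threading
import urllib.request

MAX_CONCURRENCY = 2
LIMIT_BITS_PER_SEC = 25_000_000   # 25 Mbps
WINDOW_SECONDS = 10

_semaphore = threading.Semaphore(MAX_CONCURRENCY)
_lock = threading.Lock()
_window_start = time.monotonic()
_bits_in_window = 0


def _throttle(num_bytes: int) -> None:
    """Track downloaded bytes and sleep once the 10-second budget is spent."""
    global _window_start, _bits_in_window
    with _lock:
        now = time.monotonic()
        if now - _window_start >= WINDOW_SECONDS:
            _window_start, _bits_in_window = now, 0
        _bits_in_window += num_bytes * 8
        if _bits_in_window >= LIMIT_BITS_PER_SEC * WINDOW_SECONDS:
            # Budget for this window exhausted: wait until it rolls over.
            time.sleep(WINDOW_SECONDS - (now - _window_start))
            _window_start, _bits_in_window = time.monotonic(), 0


def fetch(url: str, user_agent: str) -> bytes:
    """Download one file, respecting both the concurrency and bandwidth caps."""
    with _semaphore:  # never more than MAX_CONCURRENCY downloads in flight
        req = urllib.request.Request(url, headers={"User-Agent": user_agent})
        chunks = []
        with urllib.request.urlopen(req) as resp:
            while chunk := resp.read(256 * 1024):  # 256 KiB chunks
                _throttle(len(chunk))
                chunks.append(chunk)
        return b"".join(chunks)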

It was never my goal, in updating the policy, to limit what can be done, but
rather to eventually get to a point where we can safely identify whether some
traffic is coming from a user, from a high-volume bot we've identified, or
from random traffic on the internet.
That will help reduce the constant stream of incidents related to predatory
downloading of our images, while also reducing the impact on legitimate
users[1].

Simply put, I want to be able to know who's doing what, and to be able to put
general limits on unidentified actors that we can clearly determine aren't a
user-run browser.
As you can imagine, I have a personal interest in this: moving from the game
of whack-a-mole that SRE plays nowadays to systematic enforcement of limits on
unidentified clients will improve my own quality of life.

I have no interest in, nor any intention of, preventing people from archiving
Wikipedia; nor, I guess, would the community, which I hope could eventually
grant tiers of usage to individual bots, leaving me/us only the role of
defining said tiers.

It was never my intention, in writing the limits, to impede any activity,
but rather to put ourselves in a position where we're more aware of who is
doing what.

I appreciate that some exceptions for Wikimedia Cloud bots were added
> after the discussion at
> https://phabricator.wikimedia.org/T391020#10716478 , but the fact
> remains that this comes off as a big change.
>
>
Actually, the exception for WMCS, which has been around for years, has been
a pillar of the policy since I wrote the first draft. Protecting
community use while also protecting the infrastructure (and, honestly, my
weekends :) ) has always been my main goal.

Having said all of the above, I see how the 25 Mbps limit seems stringent;
to help evaluate it, let me explain how I got to that number:
* Because of the nature of media downloads, it will be extremely hard for
us to enforce limits that are not per-IP. I don't want to get into more
details on that, but let me just say that fairly rate-limiting usage of our
media-serving infrastructure isn't simple, especially if you're trying very
hard not to interfere with human consumption.
* I calculated what sustained bandwidth we can support in a single
datacenter without saturating more than 80% of our outgoing links, if a
crawler uses a number of different IP addresses as large as the largest
we've seen from one of these abuse actors (the shape of that calculation is
sketched below).
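
For illustration only, the calculation looks roughly like this; the capacity,
margin and IP-count figures below are placeholders I've made up for the
sketch, not our real numbers:

# Illustrative arithmetic only: every figure here is a placeholder.
# The point is the shape of the calculation:
#   per-IP cap ~= (datacenter egress capacity * saturation target) / worst-case IP count
EGRESS_CAPACITY_GBPS = 200   # hypothetical single-datacenter egress capacity
SATURATION_TARGET = 0.8      # "not saturating more than 80% of our outgoing links"
WORST_CASE_IPS = 6_400       # hypothetical IP count of the largest abuse actor seen

per_ip_mbps = EGRESS_CAPACITY_GBPS * 1000 * SATURATION_TARGET / WORST_CASE_IPS
print(f"per-IP cap ~= {per_ip_mbps:.0f} Mbps")  # prints 25 with these placeholders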

So yes, the number is probably a bit defensive, and we can discuss whether
it's enough for non-archival bot usage.
To be clear, I'd be happy for an archival tool to use more resources if it
needs them; I would just also like to be able to not worry about it, and/or
to block it in an emergency.

Again, the reason I've asked for feedback is that I'm open to changing things,
in particular the numbers I've settled on, which of course come from the
perspective of someone trying to preserve resources for consumption.

If you have a suggestion for what you think would be a more reasonable
default limit, considering the above, please make it on the talk page. If you
have suggestions for making the intention of the policy clearer, those are of
course also welcome.


Cheers,

Giuseppe
[1] To give an example with a screwup of mine: two weekends ago, a
predatory scraper masquerading as Google Chrome and coming from all over the
internet brought our media-serving backend down twice. I and others
intervened and saved the situation, but the ban I created cast a slightly too
wide net, and I forgot to eventually remove it, which ended up causing issues
for users; see https://w.wiki/Dmfn
-- 
Giuseppe Lavagetto
Principal Site Reliability Engineer, Wikimedia Foundation
