On Wed, Apr 9, 2025 at 1:12 PM Sohom Datta <dattasoh...@gmail.com> wrote:
>> What the WMF infrastructure does, OTOH, is different. So maybe it's a
>> good idea to add a {{Note}} at the top of the etiquette page
>> clarifying that these are general MW-related rules, and that for the
>> Foundation infrastructure people should refer to the Robot policy.
>
> In my opinion, that is not how that page is perceived within the
> community. API:Etiquette is what a majority of community pages link to
> and has been used to build bot policies on individual wikis and to make
> assumptions about how requests to the API should be made (adding user
> agents and checking against the maxlag parameter). I would like to see
> wider community discussion before this change is made, since the Robots
> Policy explicitly shoves hardware limits at folks (which was different
> from the wording of the previous policy) and proposes punitive actions
> against bot operators if the limits are not followed, per the statement
> "Failure to follow these guidelines may result in your bot being
> blocked or heavily rate-limited."

My proposal was meant to avoid the confusion expressed here, but your
point is valid; I will add an entry to the talk page about it. In fact, I
was also unsure whether it was a good idea.

In terms of "punitive actions": we have always blocked abusive bots when
needed. This was an attempt on my part to clarify what can happen if the
guidelines are not followed. Given that we plan to gradually move from
case-by-case enforcement of limits to a more systematic one, it seems
important to underline that blocking is a real risk if you do not respect
the guidelines and do not reach out to us about rate limits.

>> The new policy isn’t more restrictive than the older one for general
>> crawling of the site or the API; on the contrary, we allow higher
>> limits than previously stated. But the new policy clarifies a few
>> points and adds quite a few systems not covered in the old policy…
>> because they didn’t exist at the time.
>
> This is not related to my point above, but y'all explicitly call out
> "Do not emulate a browser - do not store cookies or execute JavaScript"
> in the new version of the policy. I don't like the blanket implications
> of that statement, since it would imply that using popular libraries
> like the Python requests library's sessions feature is explicitly
> prohibited by this policy (the sessions feature implements cookie
> storage). Similarly, there are often legitimate reasons to emulate
> browsers and execute JavaScript (for example, in web measurement
> research using tools like OpenWPM <https://github.com/openwpm/OpenWPM>)
> or in cases where a particular URL is checked for liveness, spam or
> phishing by a variety of backend systems (think of the systems that
> would kick in if a Wikipedia article were linked in Discord's or
> Instagram's chat features), which would typically not cause a
> significant amount of traffic. I would urge y'all to consider
> qualifying the statement in the policy with metrics ("Do not emulate a
> browser - do not store cookies or execute JavaScript if crawling more
> than X pages over Y period") instead of a blanket ban.

This is a valid point too: that requirement is aimed at high-volume
crawlers, not at all bots. I will make an amendment, also clarifying the
scope of that requirement (read requests made while crawling the site). I
encourage you to propose any improvements and/or suggestions on the talk
page; the reason I sent the announcement to this list is exactly that I
was hoping for this kind of feedback.
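For concreteness, here is a minimal sketch of the kind of polite read
request the etiquette guidance above describes (a descriptive User-Agent
plus the maxlag parameter); the bot name and contact address are
placeholders, not anything prescribed by the policy:

    import requests

    # Identify the client and give a contact address (placeholder values).
    HEADERS = {
        "User-Agent": "ExampleBot/0.1 (https://example.org/bot; bot-operator@example.org)"
    }

    # maxlag=5 asks MediaWiki to refuse the request when replication lag
    # exceeds 5 seconds, so the bot backs off when the site is under load.
    params = {
        "action": "query",
        "titles": "Main Page",
        "prop": "info",
        "maxlag": 5,
        "format": "json",
    }

    resp = requests.get(
        "https://en.wikipedia.org/w/api.php",
        params=params,
        headers=HEADERS,
        timeout=30,
    )
    data = resp.json()

    # When maxlag is exceeded the API returns an error payload with code
    # "maxlag"; a well-behaved client waits (e.g. per Retry-After) and retries.
    if "error" in data and data["error"].get("code") == "maxlag":
        print("Servers are lagged, backing off:", resp.headers.get("Retry-After"))
    else:
        print(data)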
Cheers,

Giuseppe

--
Giuseppe Lavagetto
Principal Site Reliability Engineer, Wikimedia Foundation