> What the WMF infrastructure does, OTOH, is different. So maybe it's a good
> idea to add a {{Note}} at the top of the etiquette page clarifying that
> these are general MW-related rules, and that for the Foundation
> infrastructure people should refer to the Robot policy.

In my opinion, that is not how that page is perceived within the community.
API:Etiquette is the page that a majority of community pages link to, and it
has been used to build bot policies on individual wikis and to shape
assumptions about how requests to the API should be made (adding descriptive
user agents and setting the maxlag parameter; see the sketch below).
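To make that concrete, here is a minimal sketch of the request pattern
API:Etiquette describes. The bot name, contact address, and retry budget are
hypothetical; the maxlag handling (an error with code "maxlag" plus a
Retry-After header when replication lag exceeds the threshold) follows the
documented behaviour of the Action API:

    import time
    import requests

    # Hypothetical identification: API:Etiquette asks for a descriptive
    # User-Agent with a way to reach the operator.
    HEADERS = {"User-Agent": "ExampleBot/0.1 (https://example.org/bot; ops@example.org)"}
    API = "https://en.wikipedia.org/w/api.php"

    def query(params, retries=3):
        """Call the Action API, backing off when the servers report lag."""
        params = {**params, "format": "json", "maxlag": 5}
        for _ in range(retries):
            resp = requests.get(API, params=params, headers=HEADERS)
            data = resp.json()
            if data.get("error", {}).get("code") == "maxlag":
                # Replication lag exceeded our threshold: wait as long as
                # the server asks, then retry.
                time.sleep(int(resp.headers.get("Retry-After", 5)))
                continue
            return data
        raise RuntimeError("servers too lagged, giving up")

    print(query({"action": "query", "meta": "siteinfo"}))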
I would like to see wider community discussion before this change is made,
since the Robot Policy explicitly shoves hardware limits at folks (a
departure from the wording of the previous policy) and proposes punitive
action against bot operators who do not follow the limits, per the statement
"Failure to follow these guidelines may result in your bot being blocked or
heavily rate-limited."

> The new policy isn’t more restrictive than the older one for general
> crawling of the site or the API; on the contrary we allow higher limits
> than previously stated. But the new policy clarifies a few points and adds
> quite a few systems not covered in the old policy… because they didn’t
> exist at the time.

This is not related to my point above, but y'all explicitly call out "Do not
emulate a browser - do not store cookies or execute JavaScript" in the new
version of the policy. I don't like the blanket implications of that
statement, since it would imply that using popular libraries like the Python
requests library's sessions feature is explicitly prohibited by this policy
(the sessions feature stores cookies by default, as the sketch below shows).
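A minimal sketch of what I mean: an entirely ordinary requests.Session
persists whatever cookies the server sets, without any intent to emulate a
browser (the user agent string is again hypothetical):

    import requests

    session = requests.Session()
    # Hypothetical bot identification, per API:Etiquette.
    session.headers["User-Agent"] = "ExampleBot/0.1 (https://example.org/bot)"

    # One ordinary API read: no JavaScript, no browser emulation.
    session.get("https://en.wikipedia.org/w/api.php",
                params={"action": "query", "meta": "siteinfo", "format": "json"})

    # The session's jar now holds whatever cookies the server chose to set;
    # a literal reading of "do not store cookies" forbids this default.
    print(session.cookies.get_dict())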
Similarly, there are often legitimate reasons to emulate browsers and execute
JavaScript (for example, in web measurement research using tools like OpenWPM
<https://github.com/openwpm/OpenWPM>), or in cases where a particular URL is
checked for liveness/spam/phishing by a variety of backend systems (think of
the systems that kick in when a Wikipedia article is linked in Discord's or
Instagram's chat features); these would typically not cause a significant
amount of traffic.

I would urge y'all to consider qualifying the statement in the policy with
metrics ("Do not emulate a browser - do not store cookies or execute
JavaScript if crawling more than X pages over Y period") instead of a blanket
ban.

Regards,
Sohom Datta

On Wed, Apr 9, 2025 at 6:22 AM Giuseppe Lavagetto <glavage...@wikimedia.org>
wrote:

> Hi,
>
> Thanks for pointing the confusion out. I didn't remember the picturesque
> wording in the API:Etiquette page :D
>
> But I think that page is more about what MediaWiki can do in terms of
> rate-limiting, and indeed MediaWiki doesn't do rate-limiting on reads.
>
> What the WMF infrastructure does, OTOH, is different. So maybe it's a good
> idea to add a {{Note}} at the top of the etiquette page clarifying that
> these are general MW-related rules, and that for the Foundation
> infrastructure people should refer to the Robot policy.
>
> Do you think that would help?
>
> Cheers,
>
> Giuseppe
>
> On Tue, Apr 8, 2025 at 8:38 PM Novem Linguae <novemling...@gmail.com>
> wrote:
>
>> Hi Giuseppe,
>>
>> Thanks for updating the robots policy. I do see some overlap between
>> https://wikitech.wikimedia.org/wiki/Robot_policy#Action_API_rules_(i.e._https://en.wikipedia.org/w/api.php?%E2%80%A6_)
>> and https://www.mediawiki.org/wiki/API:Etiquette, so it may be worth
>> thinking about whether one or both of those pages needs an update to keep
>> everything in sync. For example, API:Etiquette doesn’t link to the Robot
>> Policy.
>>
>> Speaking anecdotally, I didn’t know the Robot Policy existed and I
>> assumed API:Etiquette was the canonical page for this kind of thing.
>>
>> Hope this helps.
>>
>> *Novem Linguae*
>> novemling...@gmail.com
>>
>> *From:* Giuseppe Lavagetto <glavage...@wikimedia.org>
>> *Sent:* Tuesday, April 8, 2025 8:08 AM
>> *To:* Wikimedia developers <wikitech-l@lists.wikimedia.org>
>> *Subject:* [Wikitech-l] Updates to the Robot policy
>>
>> Hi all,
>>
>> I’ve updated our Robot Policy[0], which was vastly outdated, the main
>> revision being from 2009.
>>
>> The new policy isn’t more restrictive than the older one for general
>> crawling of the site or the API; on the contrary we allow higher limits
>> than previously stated. But the new policy clarifies a few points and adds
>> quite a few systems not covered in the old policy… because they didn’t
>> exist at the time.
>>
>> My intention is to keep this page relevant, one that we update as our
>> infrastructure evolves, trying to direct more and more web spiders and
>> high-volume scrapers to use these patterns and reduce their impact on the
>> infrastructure.
>>
>> This update is part of a coordinated effort[1] to try to guarantee
>> fairer use of our very limited hardware resources to our technical
>> community and users, so we will progressively start enforcing these rules
>> for non-community users[2] that currently violate these guidelines
>> copiously.
>>
>> If you have suggestions on how to improve the policy, please use the talk
>> page to provide feedback.
>>
>> Cheers,
>>
>> Giuseppe
>>
>> [0] https://wikitech.wikimedia.org/wiki/Robot_policy
>> [1] See the draft of the annual plan objective here: https://w.wiki/DkD4
>> [2] While the general guidelines of the policy apply to any user, the
>> goal is not to place restrictions on our community, or any other
>> research/community crawler whose behaviour is in line with the
>> aforementioned guidelines. In fact, any bot running in Toolforge or Cloud
>> VPS is already part of an allow-list that excludes this traffic from the
>> rate limiting we apply at the CDN.
>>
>> --
>> Giuseppe Lavagetto
>> Principal Site Reliability Engineer, Wikimedia Foundation
>
> --
> Giuseppe Lavagetto
> Principal Site Reliability Engineer, Wikimedia Foundation
_______________________________________________
Wikitech-l mailing list -- wikitech-l@lists.wikimedia.org
To unsubscribe send an email to wikitech-l-le...@lists.wikimedia.org
https://lists.wikimedia.org/postorius/lists/wikitech-l.lists.wikimedia.org/