
TASK DESCRIPTION
  - Affected components: MediaWiki Core, Wikibase.
  - Engineer(s) or team for initial implementation: WMDE (Wikidata team)
  - Code steward: TBD.
  
  Motivation
  ----------
  
  Wikidata is a unique installation of MediaWiki. The edit rate on this wiki 
has gone up to 1,000 edits per minute and has been testing our 
infrastructure's scalability since the day it went live. The edits are mostly 
made by bots, and bots have the `noratelimit` right, meaning no rate limit 
can be applied to them.
  
  Forcing a rate limit on bots in Wikidata has been tried and caused several 
issues, so it had to be rolled back: see T184948: limit page creation and 
edit rate on Wikidata <https://phabricator.wikimedia.org/T184948> and 
T192690: Mass message broken on Wikidata after ratelimit workaround 
<https://phabricator.wikimedia.org/T192690>. One main reason is that bot 
operators want to edit at full speed when the infrastructure is quiet; 
forcing an arbitrary number like 100 edits per minute would not solve the 
issue and limits bots at times when the infrastructure can actually take 
more. This approach also broke MassMessage.
  
  With the current flow of edits, the WDQS updater can't keep up and has 
sometimes lagged for days, so Wikidata now factors the median lag of the WDQS 
updater (divided by 60) into maxlag (see T221774: Add Wikidata query service 
lag to Wikidata maxlag <https://phabricator.wikimedia.org/T221774>). As a 
matter of policy, bots stop if maxlag is more than 5 (e.g. the maximum 
replication lag from the master database to a replica is more than five 
seconds, or the size of the job queue divided by `$jobQueueLagFactor` is 
bigger than five). This means that once the median WDQS lag reaches five 
minutes, most bots stop until the WDQS updater catches up; then maxlag goes 
below five, the bots start editing again, WDQS starts to lag behind, and so 
on. The system has been oscillating like this for months:
  (This is an example of the last six hours 
<https://grafana.wikimedia.org/d/000000170/wikidata-edits?panelId=1&fullscreen&orgId=1&from=1588793844896&to=1588810984929>)
  F31805674: image.png <https://phabricator.wikimedia.org/F31805674>
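
  To make the feedback loop concrete, here is a rough Python sketch of how 
the effective maxlag value is derived and how a policy-respecting bot reacts 
to it. The function names and the exact combination rule are simplifying 
assumptions for illustration, not MediaWiki's actual implementation:

```
MAXLAG_THRESHOLD = 5  # policy: bots stop when maxlag exceeds this

def effective_maxlag(db_replication_lag_s, job_queue_size,
                     job_queue_lag_factor, wdqs_median_lag_s):
    """maxlag is the worst of several lag signals."""
    return max(
        db_replication_lag_s,                   # seconds behind master
        job_queue_size / job_queue_lag_factor,  # scaled job queue backlog
        wdqs_median_lag_s / 60,                 # T221774: WDQS lag / 60
    )

def bot_should_edit(maxlag):
    # Hard cut-off: full speed below the threshold, full stop above it.
    # This all-or-nothing behavior is what produces the oscillation.
    return maxlag <= MAXLAG_THRESHOLD
```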
  
  Changing the factor, for example multiplying it by five (from 60 to 300), 
only changes the period of the oscillation: see T244722: increase factor for 
query service that is taken into account for maxlag 
<https://phabricator.wikimedia.org/T244722>.
  
  It's important to note that the maxlag approach has been causing 
disruptions for Pywikibot and other clients that respect maxlag, even for 
read queries. You can see more in T243701: Wikidata maxlag repeatedly over 5s 
since Jan20, 2020 (primarily caused by the query service) 
<https://phabricator.wikimedia.org/T243701>. Even Pywikibot's CI has issues 
because maxlag is high all the time: T242081: Pywikibot fails to access 
Wikidata due to high maxlag lately 
<https://phabricator.wikimedia.org/T242081>.
  
  The underlying problem, of course, is that the WDQS updater cannot handle 
the sheer flow of edits; it is currently a scalability bottleneck. This is 
being addressed in T244590: EPIC: Rework the WDQS updater as an event driven 
application <https://phabricator.wikimedia.org/T244590>, but we need to keep 
in mind that there will always be a bottleneck somewhere. We can't just 
dismiss the problem as "WDQS needs to be fixed". Properly communicating the 
stress on our infrastructure to our users, so they know when to slow down or 
stop, is important here, and the maxlag approach has proven to fail at this 
scale.
  
  Requirements
  ------------
  
  - There has to be a way to cap the edit rate site-wide without imposing a 
cap on individual bots or accounts.
    - This can have multiple buckets (a possible layout is sketched after 
this list); for example, bots in total should not make too many edits, so 
that admins can run large batches without getting stuck in the same boat as 
bots.
    - Also, page creation on Wikidata is several times more complex than an 
ordinary edit, so page creations should have a separate, smaller cap.
  - Starvation must not happen, i.e. an enthusiastic bot must not eat all the 
quota all the time and prevent other bots from editing.
  - No more oscillating behavior.
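
  For illustration only, such buckets might be laid out like this (the bucket 
names and numbers are invented for the example, not a proposed 
configuration):

```
# Hypothetical bucket layout; names and numbers are invented.
SITE_WIDE_CAPS = {
    "bot-edit":          {"max_concurrent": 20, "max_queue": 40},
    "bot-page-creation": {"max_concurrent": 5,  "max_queue": 10},  # smaller cap
    "admin-batch":       {"max_concurrent": 10, "max_queue": 20},  # own bucket
}
```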
  
  -------
  
  Exploration
  -----------
  
  Proposal One: Semaphores
  ------------------------
  
  This type of problem is already well studied in computer science, and 
semaphores <https://en.wikipedia.org/wiki/Semaphore_(programming)> [1] are 
the standard solution in such cases. We would have a dedicated semaphore 
initialized with a value of N for bots editing Wikidata. While an edit by a 
bot is being saved, that edit decrements the semaphore; once the value 
reaches zero, further requests have to hold off until an edit finishes, at 
which point one of the waiting connections is woken up and starts saving its 
edit. If the queue gets too long (say, longer than N), we can simply stop and 
return a "maxlag reached" error to the bots. First come, first served would 
avoid starvation.
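
  As a minimal sketch of the scheme, assuming a single process and Python's 
standard threading primitives (a real deployment would need a cross-server 
semaphore service such as PoolCounter, discussed below):

```
import threading

class BoundedEditGate:
    """Allow at most `workers` concurrent edits; reject once `workers`
    callers are already waiting (the "queue too long" case above)."""

    def __init__(self, workers):
        self.sem = threading.Semaphore(workers)  # N concurrent slots
        self.workers = workers
        self.waiting = 0
        self.lock = threading.Lock()             # guards `waiting`

    def try_acquire(self, timeout):
        with self.lock:
            if self.waiting >= self.workers:     # queue longer than N
                return False                     # report "maxlag reached"
            self.waiting += 1
        try:
            # Blocks until a slot frees up and a waiter is woken.
            return self.sem.acquire(timeout=timeout)
        finally:
            with self.lock:
                self.waiting -= 1

    def release(self):
        self.sem.release()

gate = BoundedEditGate(workers=8)

def save_bot_edit(do_save):
    if not gate.try_acquire(timeout=10):
        raise RuntimeError("maxlag reached, try again later")
    try:
        do_save()   # the actual edit save
    finally:
        gate.release()
```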
  
  In order to implement this, we can use PoolCounter 
<https://www.mediawiki.org/wiki/PoolCounter> (which is basically SaaS: 
Semaphore as a Service), which has been working reliably for the past couple 
of years. PoolCounter is mostly used when an article is being reparsed, so 
that not too many MediaWiki nodes parse the same article at the same time 
(the Michael Jackson effect 
<https://blog.wikimedia.org/2016/04/22/prince-death-wikipedia/>). PoolCounter 
is also already used to cap the total number of concurrent connections per IP 
to the ORES services; see T160692: Use poolcounter to limit number of 
connections to ores uwsgi <https://phabricator.wikimedia.org/T160692>.
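
  The shape of the integration could look roughly like this. `pc` and its 
`acquire`/`release` methods are hypothetical stand-ins for a PoolCounter 
client, not MediaWiki's actual classes; the workers/maxqueue/timeout knobs 
mirror PoolCounter's configuration settings:

```
class MaxlagReached(Exception):
    """Raised to tell the client to back off, as with maxlag today."""

def save_edit_with_poolcounter(pc, key, do_save):
    # `pc` is a hypothetical PoolCounter client; `key` selects the bucket,
    # e.g. "wikidata-bot-edit" vs. "wikidata-bot-page-creation".
    status = pc.acquire(key, workers=8, maxqueue=16, timeout=10)
    if status != "locked":          # queue full or wait timed out
        raise MaxlagReached(f"could not acquire {key}: {status}")
    try:
        do_save()                   # the actual edit save
    finally:
        pc.release(key)             # always free the slot
```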
  
  Implications:
  
  - Using PoolCounter reduces the work needed to implement this as it's already 
well supported by MediaWiki.
  - This would artificially increase the edit-saving time when too many edits 
are happening at the same time.
  - If done incorrectly, processes waiting for the semaphore might hold 
database (or other) locks for too long, or deadlock: one process holds a 
database lock and waits for the semaphore, while the process holding the 
semaphore waits for that database lock. Databases have good systems in place 
to avoid or surface deadlocks, but we don't have a system to handle deadlocks 
spanning the several locking systems a process might use (database, Redis 
lock manager, PoolCounter, etc.).
  - If an edit is going to decrement several semaphores (e.g. a page creation 
is also an edit), there is a chance of deadlocks due to random network 
latency while different processes wait for each other.
  
  Proposal Two: Continuous throttling
  -----------------------------------
  
  This has been explored in T240442: Design a continuous throttling policy 
for Wikidata bots <https://phabricator.wikimedia.org/T240442>. The problem 
with the current system is that "maxlag" is a hard limit: we can't tell bots 
to slow down as they approach the limit, so they continue at full speed until 
everything has to stop.
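
  As an illustrative sketch of what "continuous" could mean here (the curve 
and the numbers are invented for the example, not the policy proposed in 
T240442):

```
import time

MAXLAG_LIMIT = 5.0  # the current policy threshold

def pause_before_edit(maxlag):
    """Back off smoothly as maxlag approaches the limit, instead of
    editing at full speed and then stopping entirely."""
    if maxlag >= MAXLAG_LIMIT:
        time.sleep(60)  # hard stop; retry after a long pause
        return
    # Quadratic ramp: no delay when idle, up to ~30s just under the limit.
    time.sleep(30 * (maxlag / MAXLAG_LIMIT) ** 2)
```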
  
  Implications:
  
  - There's no easy way to enforce this on our users.
  - There's always a chance of starvation caused by bots not respecting the 
policy.
  
  It's worth mentioning that proposals one and two are not mutually exclusive.
  
  [1]: A good and free book for people who are not very familiar with 
semaphores and their applications: The Little Book of Semaphores 
<https://greenteapress.com/wp/semaphores/>

TASK DETAIL
  https://phabricator.wikimedia.org/T252091
