Joe added a comment.
I did some more number crunching on the instances of runJob.php I'm running on terbium, and here is what I found:
Wikibase refreshLinks jobs might benefit from smaller batches, as many of them take a long time to execute. Out of 33.4k Wikibase jobs, the distribution of timings was as follows:
oblivian@terbium:~$ fgrep Wikibase refreshlinks.log.* | awk '{ if ($NF == "good") split($(NF-1),res,"="); if (res[2] > 50000) print res[2] }' | wc -l
3418
oblivian@terbium:~$ fgrep Wikibase refreshlinks.log.* | awk '{ if ($NF == "good") split($(NF-1),res,"="); if (res[2] > 30000) print res[2] }' | wc -l
10814
oblivian@terbium:~$ fgrep Wikibase refreshlinks.log.* | awk '{ if ($NF == "good") split($(NF-1),res,"="); if (res[2] > 20000) print res[2] }' | wc -l
13430
oblivian@terbium:~$ fgrep Wikibase refreshlinks.log.* | awk '{ if ($NF == "good") split($(NF-1),res,"="); if (res[2] > 10000) print res[2] }' | wc -l
16949
oblivian@terbium:~$ fgrep Wikibase refreshlinks.log.* | awk '{ if ($NF == "good") split($(NF-1),res,"="); if (res[2] > 5000) print res[2] }' | wc -l
21394

As you can see, about 10% of the jobs take 50 seconds or more to execute, and about 64% take more than 5 seconds, while I would expect 99% of jobs to complete within 5 seconds. Also, I can see these jobs easily exceeding the maxtime of JobRunner::run, which is set to 30 seconds in production.
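For what it's worth, the per-threshold greps above re-read the whole log once per cutoff; a single awk pass can bucket all the timings (and compute the percentages) at once. This is just a sketch assuming the same log format as in the commands above (a t=<ms> field just before a trailing "good"); the sample lines in the here-doc are made up for illustration, and in practice you would feed it `fgrep Wikibase refreshlinks.log.*` instead:

```shell
# Bucket "good" job timings in one pass over the log instead of
# re-running fgrep/awk once per threshold. Fields assumed as above:
# $(NF-1) is "t=<milliseconds>", $NF is the job status.
awk '
  $NF == "good" {
    split($(NF-1), res, "=")
    t = res[2] + 0          # timing in ms, forced numeric
    total++
    if (t > 50000) over50k++
    if (t > 5000)  over5k++
  }
  END {
    printf "total=%d >50s=%d (%.0f%%) >5s=%d (%.0f%%)\n",
           total, over50k, 100 * over50k / total,
           over5k,  100 * over5k  / total
  }
' <<'EOF'
refreshLinks Wikibase t=60000 good
refreshLinks Wikibase t=12000 good
refreshLinks Wikibase t=3000 good
refreshLinks Wikibase t=9999 bad
EOF
```

On the sample data this prints `total=3 >50s=1 (33%) >5s=2 (67%)`; only "good" lines are counted, and each is tested against every cutoff in the same pass.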
Also, the current jobqueue has no way to prioritize jobs from wikis with larger backlogs than others. For example, yesterday the jobrunner submitted only about 250 refreshLinks batches per server for commonswiki, way fewer than for itwiki (around 350-400 per server), even though itwiki's queue is only 4k elements long right now versus 680k elements in the commonswiki queue. As a result, the jobrunner infrastructure executed a total of only 37.1k refreshLinks jobs for commonswiki in a full day.
For comparison, my three threads on terbium completed a total of 35.7k jobs in the same interval.
It is pretty clear that unless we get a better scheduler, or manual ways to control jobqueue processing priority, there is no way we can recover a 700k-item backlog anytime soon.
I'll let my threads keep working to fight this specific fire, but either we fix things in the jobqueue, or we should expect this to keep happening until we have fully migrated to the new change-propagation-backed transport, which should make some of these controls easier to implement.
_______________________________________________
Wikidata-bugs mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs
