[Wikidata-bugs] [Maniphest] [Updated] T198049: Investigate possible outage on wikidata on 25th June - 04:13AM UTC - 05:27AM UTC

2018-08-07 Thread tstarling
tstarling added a comment.
db1071, the master, had no writes

It actually had a factor of 10 fewer writes, not zero writes.

I'm pretty sure there was no outage.

I had a closer look at the exceptions. Most of them came from jobs, but a sizeable minority came from appservers. Sampling a few shows that these are from DeferredUpdates. For example, LinksUpdate and GlobalUsage write batches of rows in a loop with commitAndWaitForReplication(). When that wait throws, the rest of the loop never runs, which means updates are being skipped. Unlike a job, a deferred update has no possibility of a retry. This was a change made in 2016 as part of T95501. I'll file a bug for this. Otherwise, I think I'm done with this incident.

TASK DETAIL
https://phabricator.wikimedia.org/T198049
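
Editorial illustration of the skipped-update behaviour described in the comment above. This is a minimal Python model, not MediaWiki code: `write_in_batches`, `run_deferred_update`, `ReplicationWaitError` and all parameters are hypothetical names chosen for the sketch, and it assumes the wait is called once after each batch and that the caller only logs the failure.

```python
class ReplicationWaitError(Exception):
    """Hypothetical stand-in for DBReplicationWaitError."""


def write_in_batches(rows, batch_size, write_batch, commit_and_wait):
    """Write rows in batches, waiting for replication after each batch.

    If commit_and_wait() raises, the exception propagates out of the loop
    and every later batch is left unwritten.
    """
    written = 0
    for start in range(0, len(rows), batch_size):
        batch = rows[start:start + batch_size]
        write_batch(batch)
        written += len(batch)
        commit_and_wait()  # may raise ReplicationWaitError
    return written


def run_deferred_update(rows, batch_size, write_batch, commit_and_wait, log):
    """Model of a deferred update: a failure is logged, never retried."""
    try:
        return write_in_batches(rows, batch_size, write_batch, commit_and_wait)
    except ReplicationWaitError as err:
        log(str(err))
        return None
```

In this model, a failure in the wait after the second batch leaves the first two batches stored and silently drops the rest, which is the difference from a job that would be queued again.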
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Updated] T198049: Investigate possible outage on wikidata on 25th June - 04:13AM UTC - 05:27AM UTC

2018-08-07 Thread tstarling
tstarling added a comment.

In T198049#4310346, @jcrespo wrote:
51,715 exceptions with:

[{exception_id}] {exception_url} Wikimedia\Rdbms\DBReplicationWaitError from line 426 of /srv/mediawiki/php-1.32.0-wmf.8/includes/libs/rdbms/lbfactory/LBFactory.php: Could not wait for replica DBs to catch up to db1071


These exceptions are from maintenance scripts. Prior to December 2015 the policy was for maintenance scripts to just wait forever in the case of replication lag, but since commit fedfee628c377eeea0453ed82af02b6878bd525b the wait throws an exception after a hard-coded 60 seconds. According to the commit message this "makes failure more explicit". I suppose it also allows a job runner to reconfigure itself if a slave is dropped.
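
Editorial illustration of the two policies contrasted above. This is a rough Python model rather than the actual LBFactory code: `wait_for_replication`, `get_lag`, `poll` and the injectable `sleep`/`clock` parameters are assumptions made for the sketch. Passing `timeout=None` models the older wait-forever behaviour, while the default models the hard-coded 60-second limit.

```python
import time


class ReplicationWaitError(Exception):
    """Hypothetical stand-in for DBReplicationWaitError."""


def wait_for_replication(get_lag, timeout=60.0, poll=1.0,
                         sleep=time.sleep, clock=time.monotonic):
    """Block until get_lag() reports no lag.

    With a numeric timeout, raise ReplicationWaitError once that many
    seconds have elapsed. With timeout=None, keep waiting indefinitely.
    """
    start = clock()
    while get_lag() > 0:
        if timeout is not None and clock() - start >= timeout:
            raise ReplicationWaitError(
                "Could not wait for replica DBs to catch up")
        sleep(poll)
```

Under this model, any lag that lasts longer than the timeout produces one exception per waiting caller, regardless of what caused the lag.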

The reason for the prior policy was that, at the time, lag was very common and scripts were not always restartable.

Unfortunately, it does not help to isolate the problem, since exception spam is now the inevitable and harmless consequence of >60s lag.


[Wikidata-bugs] [Maniphest] [Updated] T198049: Investigate possible outage on wikidata on 25th June - 04:13AM UTC - 05:27AM UTC

2018-06-25 Thread jcrespo
jcrespo added a parent task: Restricted Task.