BBlack added a comment.

Copying this in from etherpad (this is less awful than 6 hours of raw IRC+SAL logs, but still pretty verbose):

# cache server work is ongoing here: ethtool changes that require short depooled downtimes around brief ethernet port outages (rough example commands follow the log lines below):
17:49 bblack: ulsfo cp servers: rolling quick depool -> repool around ethtool parameter changes for -lro,-pause
17:57 bblack@neodymium: conftool action : set/pooled=no; selector: name=cp4024.ulsfo.wmnet
17:59 bblack: codfw cp servers: rolling quick depool -> repool around ethtool parameter changes for -lro,-pause
18:00 <+jouncebot> Amir1: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will 
                   be rewarded with a sticker.
18:27 bblack: esams cp servers: rolling quick depool -> repool around ethtool parameter changes for -lro,-pause
18:41 bblack: eqiad cp servers: rolling quick depool -> repool around ethtool parameter changes for -lro,-pause
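
# (For reference, a rough sketch of one depool -> ethtool -> repool cycle on a single cache
# host.  The interface name is a placeholder and the exact ethtool flags are inferred from
# the "-lro,-pause" shorthand above, not copied from the actual runbook:)

    confctl select 'name=cp4024.ulsfo.wmnet' set/pooled=no    # depool (the conftool action logged above)
    ethtool -K eth0 lro off                                   # turn off large receive offload
    ethtool -A eth0 autoneg off rx off tx off                 # turn off ethernet pause frames
    confctl select 'name=cp4024.ulsfo.wmnet' set/pooled=yes   # repool once the port is back up and healthy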

# 5xx alerts start appearing.  The initial assumption is that they're related to the ethtool work above
18:44 <+icinga-wm> PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0]
18:46 <+icinga-wm> PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0]
18:47 <+icinga-wm> PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [1000.0]
18:48 <+icinga-wm> PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 90.00% of data above the critical threshold [50.0]
# ...but once the MW exceptions hit, it seems less likely to be related to the ethtool work

# notices hit IRC for these wikidata sorting changes:
18:51 ladsgroup@tin: Synchronized php-1.31.0-wmf.5/extensions/Wikidata/extensions/Constraints/includes/ConstraintCheck/DelegatingConstraintChecker.php: Fix sorting of NullResults (T179038) (duration: 01m 04s)
18:52 ladsgroup@tin: Synchronized php-1.31.0-wmf.5/extensions/Wikidata/extensions/Constraints/tests/phpunit/DelegatingConstraintCheckerTest.php: Fix sorting of NullResults (T179038) (duration: 00m 49s)
19:12 ladsgroup@tin: Synchronized php-1.31.0-wmf.5/extensions/WikibaseQualityConstraints/tests/phpunit/DelegatingConstraintCheckerTest.php: Fix sorting of NullResults (T179038) (duration: 00m 50s)
19:14 ladsgroup@tin: Synchronized php-1.31.0-wmf.5/extensions/WikibaseQualityConstraints/includes/ConstraintCheck/DelegatingConstraintChecker.php: Fix sorting of NullResults (T179038) (duration: 00m 49s)

# Lots of discussion and digging ensues on all sides...
# bblack figures out that while the logs implicate a single eqiad text backend cache, depooling said cache moves the problem to a different cache host (repeatedly), so it doesn't seem to be a faulty cache node.
# One cache just happens to be the unlucky chash destination for more of the problematic traffic than the others at any given time.
# The problematic traffic pattern consumes all of the 1K connection slots varnish allows to api.svc+appservers.svc, which then causes many unrelated 503s because no backend connection slots are left to service other requests.
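
# (A rough way to see this from a cache host -- the commands are illustrative, not a transcript
# of what was actually run:)

    # count established sockets toward the MW service VIPs; on the unlucky chash-funnel host
    # this sits pinned near the 1K per-backend max_connections, elsewhere it stays much lower
    for svc in api.svc.eqiad.wmnet appservers.svc.eqiad.wmnet; do
        ip=$(dig +short "$svc" | head -1)
        printf '%s: ' "$svc"
        ss -tn state established dst "$ip" | tail -n +2 | wc -l
    done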

# The Fatals logs seem to be related to ORES fetching from Wikidata
# So, a timeout is increased there to cope with slow wikidata responses:

19:33 awight@tin: Started deploy [ores/deploy@0adae70]: Increase extractor wikidata API timeout to 15s, T179107
19:33 awight@tin: Finished deploy [ores/deploy@0adae70]: Increase extractor wikidata API timeout to 15s, T179107 (duration: 00m 10s)
19:34 awight@tin: Started deploy [ores/deploy@0adae70]: Increase extractor wikidata API timeout to 15s, T179107
19:36 aaron@tin: Started restart [jobrunner/jobrunner@a20d043]: (no justification provided)
19:41 awight@tin: Finished deploy [ores/deploy@0adae70]: Increase extractor wikidata API timeout to 15s, T179107 (duration: 07m 25s)
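
# (Roughly what that change amounts to on the deploy host; the config key named in the comment
# is illustrative -- only the 15s value and the scap deploy step come from the log above:)

    cd /srv/deployment/ores/deploy          # scap3 deploy checkout for ores/deploy (path per convention)
    # bump the timeout the feature extractor uses when fetching from the wikidata API,
    # e.g. something along the lines of  wikidata_api_timeout: 15  in the service config
    scap deploy 'Increase extractor wikidata API timeout to 15s, T179107'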

# That still doesn't fix the problem, so the next attempted fix is to disable ORES on wikidata entirely:

20:00 ladsgroup@tin: Synchronized wmf-config/InitialiseSettings.php: UBN! disbale ores for wikidata (T179107) (duration: 00m 50s)
20:02 ladsgroup@tin: Synchronized wmf-config/InitialiseSettings.php: UBN! disbale ores for wikidata (T179107) (duration: 00m 50s)

# Things are still borked, try reverting some other recent Wikidata-related changes:

20:59 hoo@tin: Synchronized wmf-config/Wikibase.php: Revert "Add property for RDF mapping of external identifiers for Wikidata" (T178180) (duration: 00m 50s)
21:00 hoo: Fully revert all changes related to T178180

# Still borked.  Next we try reverting something else that looks dangerous in the logstash errors, but it also wasn't the cause:

21:30 hoo@tin: Synchronized wmf-config/InitialiseSettings.php: Temporary disable remex html (T178632) (duration: 00m 50s)
21:32 hoo@tin: Synchronized wmf-config/InitialiseSettings.php: Temporary disable remex html (T178632) (duration: 00m 50s)

# For lack of anything better to try, raise varnish->MW connection limits by an order of magnitude to see what happens:

21:40 bblack: raising backend max_connections for api.svc.eqiad.wmnet + appservers.svc.eqiad.wmnet from 1K to 10K on cp1053.eqiad.wmnet (current funnel for the bulk of the 503s)
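
# (A quick way to confirm from the cache host that these 503s are connection-slot starvation
# rather than something else; the counters are standard Varnish 4 names, and the limit itself
# lives in the VCL backend definitions as .max_connections:)

    # MAIN.backend_busy counts fetches refused because a backend was already at its
    # .max_connections limit -- if it's climbing, the 503s are slot starvation
    varnishstat -1 -f MAIN.backend_busy -f MAIN.backend_conn -f MAIN.backend_fail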

# At this point the large rate of unrelated 503s subsides.  Socket counts eventually stabilize without reaching the new (10K) limits on this host.  There's still clearly a problem, but this is effectively mitigating the worst of the damage for now.
# Eventually we puppetize this change out to the other cache_text hosts in eqiad and restart the cp1053 backend to clear the mailbox lag from the heavy connection-cycling.  This moves the traffic pattern over to cp1052, where the same bad connection pattern appears yet again (but now without unrelated 503s, of course):

22:34 <+wikibugs> (CR) BBlack: [C: 2] cache_text: raise MW connection limits to 10K [puppet] - https://gerrit.wikimedia.org/r/386756 (owner: BBlack)
22:39 bblack: restarting varnish-backend on cp1053 (mailbox lag from ongoing issues elsewhere?)
22:41 < bblack> so the target cache is now cp1052 it seems
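
# ("Mailbox lag" above is the gap between varnish's expiry-mailbox counters.  Roughly how it's
# checked and cleared -- the restart is shown generically; in practice it goes through the
# cluster's own depool/restart tooling:)

    # mailbox lag = objects mailed to the expiry thread minus objects it has processed;
    # a large, growing gap means this varnishd is struggling to keep up
    varnishstat -1 -f MAIN.exp_mailed -f MAIN.exp_received

    systemctl restart varnish.service       # restart the backend varnishd to clear it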

# ori chimes in out of nowhere that we should try reverting the earlier "Fix sorting of NullResults" stuff, which then leads to a decision to just roll Wikidatawiki back to wmf.4 (which undoes those changes and more):

22:47 demon@tin: rebuilt wikiversions.php and synchronized wikiversions files: wikidata to wmf.4
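
# (Roughly what that rollback involves on the deploy host, assuming the standard scap workflow
# of the time:)

    # point wikidatawiki back at the old branch in wikiversions.json, i.e. something like
    #   "wikidatawiki": "php-1.31.0-wmf.4"
    # then rebuild wikiversions.php and push it out:
    scap sync-wikiversions 'wikidata to wmf.4'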

# In reference to how quickly this might work even if it's the right revert (finally?):

22:49 < ori> it'll probably take a few minutes

# Keep debating and staring at things.  There have been no unrelated 503s since ~21:40, but the underlying badness (now on cp1052) is still present (large counts of sockets cycling to MW).  Eventually:

23:22 < bblack> I'm going to restart cp1052 varnish backend again just to clear out other noise that makes it harder to see what's going on
23:24 < bblack> cp1067 is the new target
[...]
23:29 < bblack> well, cp1067 isn't building up the connection stats I'd expect
23:29 < bblack> none of them seem to clearly be the target anymore...
23:30 < bblack> problem mysteriously poofed? probably not right when I moved the traffic above.  probably earlier, but varnish was still suffering until I moved it.
23:33 < bblack> yeah, still none of the eqiad text caches are showing signs of connection problems.
23:33 < bblack> the last action I took was restarting cp1052, which would've (at least, the several previous times I did this) shifted the load to a random new cache
23:34 < bblack> it was ~22:40 (nearly an hour ago) when we switched from cp1053 to cp1052
23:35 < bblack> and it was during the earlier part of that nearly-an-hour window that the wmf.4 revert for wikidatawiki happened
23:35 < bblack> I think somewhere in there the underlying problem went away, and cp1052 was just still suffering lingering effects.
23:35 < bblack> that's my best hypothesis right now
[...]
23:36 < bblack> but yeah, I think the revert did it, eventually
23:37 < bblack> [plus kicking the tires of the varnish cluster to shake it out of its bad pattern afterwards]
