BBlack lowered the priority of this task from "High" to "Normal".
BBlack changed the task status from "Open" to "Stalled".
BBlack added a comment.

The timeout changes above will offer some insulation, and as time passes we're not seeing evidence of this problem recurring with the do_stream=false patch reverted.

Some related investigations into slow requests have turned up pointers to 120-240s timeouts on requests to the REST API at /api/rest_v1/transform/wikitext/to/html, which are eerily similar to the kinds of problems we saw a while back in T150247. In that case, RB was dropping the connection from Varnish, and doing so in a way that Varnish would retry it indefinitely internally. We patched Varnish to mitigate that particular problem in the past, but something related may be surfacing here...

We have a few steps to go here, but there are going to be considerable delays before we get to the end of all of this:

  • We have a preliminary patch to Varnish to limit the total response transaction time on backend requests (if the backend is dribbling response bytes often enough to evade hitting the between_bytes_timeout) at https://gerrit.wikimedia.org/r/#/c/387236/ . However, the patch is built on Varnish v5, and cache_text currently runs Varnish v4. We weren't planning to do any more Varnish v4 releases before moving all the clusters to v5 unless an emergency arose, as it complicates our process and timelines considerably, and this isn't enough of an emergency to justify it. Therefore, this part is blocked on https://phabricator.wikimedia.org/T168529 .
  • We want to log slow backend queries so that we have a better handle on these cases in general. There's ongoing work for this in https://gerrit.wikimedia.org/r/#/c/389515/ , https://gerrit.wikimedia.org/r/#/c/389516 , and more to come. One of those patches also has the v4/v5 issues above and blocks on upgrading cache_text to v5.
  • With those measures in place, we should be able to definitively identify (and/or work around) the problematic transactions and figure out what needs fixing at the application layer, at which point we can un-revert the do_stream=false patch and move forward with our other VCL plans around exp(-size/c) admission policies on cache_text frontends as part of T144187 (but none of this ties up doing the same on cache_upload).
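
To illustrate the gap the preliminary Varnish patch in the first item is meant to close, here is a minimal sketch in Python. It is not Varnish code: the function name first_timeout, the total_timeout parameter, and all the numbers are assumptions made up for illustration. It only models why a backend that dribbles bytes often enough never trips a between-bytes limit, while a cap on total transaction time would still catch it.

```python
from typing import Iterable, Optional


def first_timeout(
    gaps: Iterable[float],
    between_bytes_timeout: float,
    total_timeout: Optional[float] = None,
) -> Optional[str]:
    """Return which limit fires first, or None if the response completes.

    gaps: seconds of silence before each chunk of the backend response.
    total_timeout: hypothetical cap on the whole backend transaction.
    """
    elapsed = 0.0
    for gap in gaps:
        arrival = elapsed + gap
        # The per-read limit fires only if this single silence is too long.
        bb_fire = elapsed + between_bytes_timeout if gap > between_bytes_timeout else None
        # The total limit fires if the chunk would arrive after the overall cap.
        total_fire = total_timeout if (total_timeout is not None and arrival > total_timeout) else None
        if bb_fire is not None and (total_fire is None or bb_fire <= total_fire):
            return "between_bytes"
        if total_fire is not None:
            return "total"
        elapsed = arrival
    return None


# Illustrative numbers only: a backend sending one chunk every 50s for 300s.
dribble = [50.0] * 6
without_cap = first_timeout(dribble, between_bytes_timeout=60.0)
with_cap = first_timeout(dribble, between_bytes_timeout=60.0, total_timeout=180.0)
# A backend that goes silent for 90s is caught by the per-read limit either way.
stalled = first_timeout([5.0, 90.0], between_bytes_timeout=60.0, total_timeout=180.0)
```

In this toy model the dribbling response runs to completion when only the between-bytes limit exists, and is cut off once a total cap is added; a backend that stalls outright is caught by the between-bytes limit in both cases.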

TASK DETAIL
https://phabricator.wikimedia.org/T179156

To: BBlack
Cc: daniel, Peachey88, ema, Gehel, Smalyshev, TerraCodes, Jay8g, Liuxinyu970226, Paladox, Zppix, Stashbot, gerritbot, thiemowmde, aude, Marostegui, Lucas_Werkmeister_WMDE, Legoktm, tstarling, awight, Ladsgroup, Lydia_Pintscher, ori, BBlack, demon, greg, Aklapper, hoo, Lahi, Lordiis, GoranSMilovanovic, Adik2382, Th3d3v1ls, Hfbn0, Ramalepe, Liugev6, QZanden, Lewizho99, Maathavan, Mkdw, Liudvikas, srodlund, Luke081515, Wikidata-bugs, ArielGlenn, faidon, zeljkofilipin, Alchimista, He7d3r, Mbch331, Rxy, fgiunchedi, mmodell