BBlack added a comment.

A while after the above, @hoo started focusing on a different aspect of this that we'd been treating as more of a side-symptom: there tend to be a lot of sockets in a strange state on the "target" varnish, to various MW nodes. They look strange on both ends, in that they spend significant time in the CLOSE_WAIT state on the Varnish side and FIN_WAIT_2 on the MW side. The two states are consistent with each other, but they're not ones that non-buggy application code normally spends much time in. In this state, the MW side has sent its FIN, Varnish has received and ACKed it, but Varnish has not yet sent its own FIN to complete the close, so MW is left waiting on it.
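
For reference, that state pair is easy to reproduce on purpose. A minimal sketch with plain Python sockets (not Varnish or MW code; the port is whatever the OS assigns):

```
# Reproduce the CLOSE_WAIT / FIN_WAIT_2 pair: one side closes (sends FIN),
# the other side never calls close(), so its own FIN is never sent.
import socket
import time

# "Varnish" side: listener that accepts a connection and then never closes it.
srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
srv.bind(("127.0.0.1", 0))
srv.listen(1)
port = srv.getsockname()[1]

# "MW" side: connect, then actively close, which sends our FIN.
cli = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
cli.connect(("127.0.0.1", port))
conn, _ = srv.accept()
cli.close()       # FIN goes out; the kernel on the accepting side ACKs it

print(f"run in another terminal: ss -tan | grep {port}")
# The accepted socket ("Varnish") now shows CLOSE_WAIT: FIN received, no
# close() issued yet. The client ("MW") shows FIN_WAIT_2: its FIN was ACKed
# and it is still waiting on the peer's FIN.
time.sleep(60)
```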

While staring at the relevant Varnish code to figure out why or how it would delay closing in this case, it seemed like such a delay was possible in certain cases involving connections in the VBC_STOLEN state. In some of those cases, instead of closing the socket immediately, Varnish defers killing it until some future eventloop event fires, which could explain the closing delays under heavy load (and we know Varnish is backlogged in some sense while the problem is going on, because mailbox lag rises indefinitely). All of that aside, at some point while staring at related code I realized that do_stream behavior can influence some related things as well, and we had a recent related VCL patch....
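
To illustrate the general failure mode (this is not Varnish internals, just a toy model of a close deferred behind a backlogged work queue; the 50-task backlog and 0.1s per task are made-up numbers):

```
# Toy model: if closing a half-closed connection is deferred onto a work
# queue that is already backlogged, our FIN goes out late; for that entire
# delay the peer sits in FIN_WAIT_2 and we sit in CLOSE_WAIT.
import queue
import threading
import time

work = queue.Queue()              # stand-in for the backlogged event loop

def worker():
    while True:
        task = work.get()
        if task is None:
            return
        task()

threading.Thread(target=worker, daemon=True).start()

# Callbacks already queued ahead of us ("mailbox lag").
for _ in range(50):
    work.put(lambda: time.sleep(0.1))

# The peer has already sent its FIN; our own close only gets *queued* now.
queued = time.monotonic()
def finish_close():
    # In the real case this is where the socket is finally closed and the
    # kernel emits our FIN, letting the peer leave FIN_WAIT_2.
    print(f"close ran {time.monotonic() - queued:.1f}s after being queued")

work.put(finish_close)
work.put(None)
time.sleep(7)                     # let the worker drain the backlog
```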

The patch in question was https://gerrit.wikimedia.org/r/#/c/386616/ , which was merged around 14:13 Oct 26, about 4.5 hours before the problems were first noticed (!).

I manually reverted the patch on cp1067 (the current target problem node) as a test, and all of the CLOSE_WAIT sockets disappeared within a short time, never to return. I then reverted the whole patch through Gerrit, so that's now permanent across the board.

I think there's a strong chance this patch was the catalyst for the start of the problems. At the very least, it was exacerbating their impact. If it turns out to be the root cause, I think we still have more post-mortem investigation to do here, because the issues it raises are tricky. If it's merely an exacerbating factor, it's still worth thinking about why it would be one, because that may help pin down the real problem.

Operating on the assumption that it's the catalyst and diving a little deeper on that angle:

The patch simply turned off do_stream behavior when the backend-most Varnish was talking to application-layer services and the applayer response did not contain a Content-Length header. Turning off do_stream makes Varnish act in a store-and-forward mode for the whole response, rather than forwarding bytes onward to clients as they arrive from the application. The benefit we were aiming for was to have Varnish calculate the value of the missing Content-Length so that we can make more-informed cache-tuning decisions at higher layers. Minor performance tradeoffs aside, turning off do_stream shouldn't be harmful to any of our HTTP transactions under "reasonable" assumptions (more later on what "reasonable" means here). In fact, that was the default/only mode our caches operated in back when we were running Varnish 3, but streaming became the default for the text cluster when it switched to Varnish 4 just under a year ago. So this was OK a year ago, but clearly isn't OK for some requests today.
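
As a rough illustration of that tradeoff (plain Python, not VCL or anything resembling Varnish's actual fetch code):

```
# Streaming vs. store-and-forward for a chunked body: streaming relays each
# chunk onward as it arrives (no Content-Length possible up front), while
# store-and-forward buffers the entire body first so its length is known,
# which only terminates if the chunk stream itself ends.
from typing import Iterable, Iterator, Tuple

def stream_mode(chunks: Iterable[bytes]) -> Iterator[bytes]:
    """do_stream=true: relay each chunk as it arrives."""
    for chunk in chunks:
        yield chunk

def store_and_forward(chunks: Iterable[bytes]) -> Tuple[int, bytes]:
    """do_stream=false: buffer everything, then forward with Content-Length."""
    body = b"".join(chunks)   # blocks (and buffers) until the source ends
    return len(body), body

# A well-behaved, finite chunked response is fine either way:
finite = [b"hello ", b"chunked ", b"world"]
length, body = store_and_forward(finite)
print(f"Content-Length: {length}")   # -> Content-Length: 19

# Streaming just relays; the total length is only known after the fact, if ever:
print(sum(len(c) for c in stream_mode(finite)), "bytes relayed chunk-by-chunk")
```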

That there was always a single chash target within the text cluster for the problems also resonates here: there's probably only one special URI out there which breaks the "reasonable" assumption. Another oddity we didn't delve into much before: when we restarted the problematic varnish, it only depooled for a short period (<1 min), yet the problem would move *permanently* to its next chash target node and stay there even after the previous target node was repooled. This might indicate that the clients making these requests are doing so over very-long-lived connections, and even that the request->response cycle itself must be very-long-lived... It moves via re-chashing when a backend is restarted, but doesn't move back on repool because everything's still connected and transacting...
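
A minimal consistent-hashing sketch of that behavior (a hypothetical hash ring in Python, not Varnish's actual chash/shard director; cp1067 is the node from above, the other backend names and the URI are just placeholders):

```
# A URI always chashes to the same backend, and only remaps while that
# backend is out of the ring (depooled). On repool the mapping returns, but
# long-lived connections/transactions that already moved stay where they are.
import hashlib
from bisect import bisect

def _h(s: str) -> int:
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, backends, replicas=100):
        self.points = sorted((_h(f"{b}-{i}"), b)
                             for b in backends for i in range(replicas))

    def backend_for(self, uri: str) -> str:
        keys = [p for p, _ in self.points]
        return self.points[bisect(keys, _h(uri)) % len(self.points)][1]

backends = ["cp1065", "cp1066", "cp1067", "cp1068"]
uri = "/some/special/streaming/uri"

target = Ring(backends).backend_for(uri)
print("chash target:", target)
# Depool the target (restart window): the URI re-chashes to a different node.
print("while depooled:", Ring([b for b in backends if b != target]).backend_for(uri))
# Repool it: the mapping comes back, but connections that already migrated
# keep transacting on the new node, so the *problem* doesn't move back.
print("after repool:", Ring(backends).backend_for(uri))
```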

My best hypothesis for the "unreasonable" behavior that would break under do_stream=false is that we have some URI which is abusing HTTP chunked responses to stream an indefinite response. Sort of like websockets, but using the normal HTTP protocol primitives. Client sends a request for "give me a live stream of some events or whatever", and the server periodically sends new HTTP response chunks to the client containing new bits of the event feed. Varnish has no way to distinguish this behavior from normal chunked HTTP (where the response chunks will eventually reach a natural end in a reasonable timeframe), and in the do_stream=false store-and-forward mode, Varnish would consume this chunk stream into its own memory buffers indefinitely, waiting for the stream to end before it can forward the whole thing to the client. This behavior would line up with a lot of the strange stats indicators we've seen in Varnish recently (both during this problem, and at other earlier points in time).
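
For concreteness, an endpoint with that shape looks roughly like the toy below (entirely hypothetical; not any known production URI, and the port/timing are made up). A streaming proxy relays each chunk as it arrives; a store-and-forward proxy would buffer this body forever, because the terminating zero-length chunk never comes:

```
# Toy "endless" chunked HTTP response: the connection stays open and new
# chunks keep arriving indefinitely, abusing normal HTTP chunked encoding
# as a live event stream.
import socket
import time

srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
srv.bind(("127.0.0.1", 8080))
srv.listen(1)
print("try: curl -N http://127.0.0.1:8080/events")

while True:
    conn, _ = srv.accept()
    conn.recv(4096)  # read (and ignore) the request
    conn.sendall(b"HTTP/1.1 200 OK\r\n"
                 b"Content-Type: text/plain\r\n"
                 b"Transfer-Encoding: chunked\r\n\r\n")
    try:
        n = 0
        while True:  # never sends the terminating "0\r\n\r\n" chunk
            payload = f"event {n}\n".encode()
            conn.sendall(f"{len(payload):x}\r\n".encode() + payload + b"\r\n")
            n += 1
            time.sleep(1)
    except (BrokenPipeError, ConnectionResetError):
        conn.close()
```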

Does anyone know of something we've added in the past ~1y that behaves this way through the text caches? I don't recall having heard about it! Later (not now, as I have a weekend to get back to!) I'll do some digging to find such traffic patterns and empirically nail it down, but it might save a lot of trouble if someone already knows what this is.


TASK DETAIL
https://phabricator.wikimedia.org/T179156
