https://bugzilla.wikimedia.org/show_bug.cgi?id=66989

--- Comment #4 from Ori Livneh <[email protected]> ---
Merlijn van Deen offered to look into this with me and we were able to identify
the problem: the WebSocket handshake requires two round-trips to the server,
and the load balancers were configured to distribute incoming requests across
backends in a round-robin fashion. Because the requests that make up the
initial handshake follow each other in quick succession, the most common case
was for one request to be routed to one server, and the follow-up request to be
routed to another server, which had not started negotiating a session with the
client and was therefore not expecting the request.

This also explains why it sometimes worked: if a request from another client
happened to arrive between the two handshake requests, both of your requests
would land on the same server and the handshake would succeed.
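
To illustrate the failure mode described above, here is a minimal simulation,
not the actual server or load balancer code. It assumes exactly two backends,
a two-step handshake, and strict round-robin dispatch; the `Backend` class,
the `run` helper and the client names are hypothetical.

```python
from itertools import cycle


class Backend:
    """Toy backend that remembers which clients began a handshake with it."""

    def __init__(self, name):
        self.name = name
        self.pending = set()

    def handle(self, client, step):
        if step == 1:
            # First round-trip: start negotiating a session with this client.
            self.pending.add(client)
            return True
        # Second round-trip: only valid on the backend that saw step 1.
        return client in self.pending


def run(requests, backend_count=2):
    """Dispatch (client, step) requests round-robin; return success flags."""
    backends = [Backend(f"backend{i}") for i in range(backend_count)]
    next_backend = cycle(backends)
    return [next(next_backend).handle(client, step) for client, step in requests]


# Both handshake requests arrive back to back: step 2 hits the other backend.
back_to_back = run([("a", 1), ("a", 2)])

# A request from another client intervenes: step 2 returns to the first backend.
interleaved = run([("a", 1), ("b", 1), ("a", 2)])

print(back_to_back)  # [True, False]
print(interleaved)   # [True, True, True]
```

With only one backend in the rotation (`backend_count=1`), every request
reaches the same server, which matches the effect of shutting one server down.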

Giuseppe and I decided to temporarily "fix" this by simply shutting down one of
the servers, causing all requests to get routed to the single remaining server.
This made the errors go away, validating the diagnosis. The more permanent fix
is to use a different scheduling algorithm that makes sessions sticky. This is
implemented in <https://gerrit.wikimedia.org/r/#/c/152960/>, which will most
likely be deployed in the next few days.
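
As a rough sketch of what "sticky" scheduling means, the snippet below picks a
backend by hashing the client address, so repeated requests from one client
always map to the same server. This is only one possible approach and is an
assumption for illustration; it is not necessarily the algorithm used in the
Gerrit change. The backend names and `pick_backend` are hypothetical.

```python
import hashlib

# Hypothetical backend names, for illustration only.
BACKENDS = ["backend0", "backend1"]


def pick_backend(client_ip, backends=BACKENDS):
    """Map a client address to a backend deterministically (source hashing)."""
    digest = hashlib.sha256(client_ip.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(backends)
    return backends[index]


# Both handshake requests from the same client resolve to the same backend.
first = pick_backend("192.0.2.10")
second = pick_backend("192.0.2.10")
print(first == second)  # True
```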
