The latency issue we've all been observing over the last week is
ongoing this morning. The issue is being handled as a high-priority
incident with company-wide visibility.

Some advice based on observations made over the last three peak time periods:

* Ensure that all of your requests have some type of timeout set (a
rough example appears at the end of this note). Some requests never
complete, and hung requests can tie up your resources or block other
work.

* Retry requests that time out.

* Short timeouts with aggressive retry are probably
counterproductive: you'll just keep going to the back of the line.
Bias towards longer timeouts, but still guard against the possibility
that the request will never return by timing out at a reasonable point
and retrying (the second example at the end of this note shows this
pattern).

* Latency varies considerably and continuously. I've observed long
periods of 2-3 second response times punctuated by periods of 10 to 30
second latency.

* Most requests connect immediately (within, say, 250 ms), but a very
small proportion take up to 20 seconds to form a TCP connection. Once
such a request finally connects, the response arrives after roughly
the same additional latency seen on requests that connected
immediately. This situation is very rare.

* Until this issue is resolved, I'd suggest a timeout of 20 seconds
for most request types during the daily peak hours of 13:30 to 18:30
UTC. My externally observed latency histogram from yesterday morning
follows. The first column is the latency bucket in ten-second groups:
00 = 0 to 9 seconds, 10 = 10 to 19 seconds, 20 = 20 to 29 seconds, and
so on. The second column is the proportion of requests in that bucket.

  00    0.8102
  10    0.1079
  20    0.0488
  30    0.0289
  40    0.0022
  50    0.0018

In this sample, 91.81% of requests were returned within 19 seconds,
and 99.58% were returned within 39 seconds (the final example below
reproduces these cumulative figures from the histogram).
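
To make the timeout advice concrete, here is a rough sketch. It
assumes the Python "requests" library and a made-up endpoint URL,
neither of which is a recommendation; any HTTP client that lets you
set separate connect and read timeouts works the same way. Splitting
the timeout matters because the rare slow-to-connect requests
described above stall before any bytes are read:

  import requests

  URL = "https://api.example.com/statuses"  # placeholder endpoint, not a real path

  try:
      # Allow 5 seconds to form the TCP connection and 20 seconds to
      # read the response, in line with the peak-hours suggestion above.
      resp = requests.get(URL, timeout=(5, 20))
      resp.raise_for_status()
      data = resp.json()
  except requests.exceptions.Timeout:
      # The request never completed in time; fail rather than hang.
      data = None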
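
Here is a similarly rough sketch of "longer timeout, then retry," as
opposed to a short timeout with aggressive retry. Again it assumes
"requests"; the attempt count, backoff, and timeout values are only
illustrative:

  import time

  import requests

  def fetch_with_retry(url, attempts=3, timeout=(5, 20), backoff=2.0):
      """GET with a generous per-attempt timeout, retrying only on timeout."""
      for attempt in range(1, attempts + 1):
          try:
              resp = requests.get(url, timeout=timeout)
              resp.raise_for_status()
              return resp
          except requests.exceptions.Timeout:
              if attempt == attempts:
                  raise  # give up after the final attempt
              # Pause briefly before retrying so the service isn't
              # hammered with immediate re-requests.
              time.sleep(backoff * attempt)

  # Example use: resp = fetch_with_retry("https://api.example.com/statuses")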
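
Finally, the cumulative figures quoted above fall straight out of the
proportion column; a few lines of Python reproduce them:

  # Bucket start (seconds) -> proportion of requests, from the table above.
  buckets = {0: 0.8102, 10: 0.1079, 20: 0.0488, 30: 0.0289, 40: 0.0022, 50: 0.0018}

  running = 0.0
  for start, proportion in sorted(buckets.items()):
      running += proportion
      print(f"within {start + 9} seconds: {running:.4f}")
  # Prints 0.9181 at "within 19 seconds" and 0.9958 at "within 39 seconds".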

-John Kalucki
http://twitter.com/jkalucki
Infrastructure, Twitter Inc.
