The latency issue we've all been observing over the last week is ongoing this morning. It is being handled as a high-priority incident with company-wide visibility.
Some advice based on observations made over the last three peak periods:

* Ensure that all of your requests have some type of timeout set. Some requests never complete, and hung requests can tie up your resources or block other work.
* Retry requests that time out.
* Short timeouts with aggressive retries are probably counterproductive: you'll just keep going to the back of the line. Bias towards longer timeouts, but still guard against requests that never return by timing out at a reasonable point and retrying.
* Latency varies considerably and continuously. I've observed long periods of 2-3 second response times punctuated by periods of 10 to 30 second latency.
* Most requests connect immediately (within, say, 250ms), but a very small proportion take up to 20 seconds to form a TCP connection. Once connected, these requests return results with latency consistent with that observed at the same time on immediately connecting requests, added on top of the connection delay. This situation is very rare.
* Until this issue is resolved, I'd suggest a timeout of 20 seconds for most request types during the daily peak hours of 13:30 to 18:30 UTC.

My externally observed latency histogram from yesterday morning follows. The first column is seconds, grouped by tens: 00 = 0 to 9 seconds, 10 = 10 to 19 seconds, 20 = 20 to 29 seconds, and so on. The second column is the proportion of requests.

00  0.8102
10  0.1079
20  0.0488
30  0.0289
40  0.0022
50  0.0018

In this sample, 91.81% of requests were returned within 19 seconds, and 99.58% of requests were returned within 39 seconds.

-John Kalucki
http://twitter.com/jkalucki
Infrastructure, Twitter Inc.
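The timeout-and-retry advice above can be sketched as a small helper. This is a minimal illustration, not Twitter's recommended client code: `call_with_retry` and its `request_fn` parameter are hypothetical names, and the 20-second default simply mirrors the suggested peak-hours timeout. The helper wraps any callable that accepts a `timeout` keyword (e.g. a lambda around `urllib.request.urlopen(url, timeout=...)`), retrying a few times on timeout rather than hammering the service with short, aggressive retries.

```python
import time


def call_with_retry(request_fn, timeout=20.0, max_attempts=3, backoff=1.0):
    """Call request_fn(timeout=...) and retry if it times out.

    A long timeout (20 s, per the advice above) with a small number of
    retries guards against requests that never return, without constantly
    sending you "to the back of the line" the way short timeouts with
    aggressive retries would.
    """
    last_exc = None
    for attempt in range(max_attempts):
        try:
            return request_fn(timeout=timeout)
        except TimeoutError as exc:  # socket.timeout is an alias in Python 3.10+
            last_exc = exc
            time.sleep(backoff * attempt)  # brief pause before the next attempt
    raise last_exc  # every attempt timed out; surface the failure
```

One design note: the pause grows with the attempt number (`backoff * attempt`), so the first retry is immediate and later ones back off slightly, which fits the advice of biasing toward patience rather than rapid re-requests.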