Network operations believes it has found and fixed the proximate cause of the connection abandonment issue -- a periodically overwhelmed load balancer (LB) CPU. The LB should close connections in this case, but for some reason it hasn't been closing them over the last two weeks. Load will be more carefully managed on this particular LB pair to avoid this problem in the future. If you see abandoned connections, let's dig into the issue, but, for now, I think the system is in a good state.

-John Kalucki
http://twitter.com/jkalucki
Infrastructure, Twitter Inc.

On Tue, Feb 23, 2010 at 9:37 AM, John Kalucki <j...@twitter.com> wrote:

> Judging by some additional data found yesterday, this drop apparently
> happens most often at around say 14:30 and 16:00 UTC, a time period in
> which we also happen to climb steeply into our daily peak traffic. By
> our monitoring, we have not experienced a connection drop so far this
> morning. But, I have no confidence that all streams are dropped during
> all events, and it's possible that our monitoring streams were just
> lucky -- and any true client drops were lost in the organic connection
> churn noise.
>
> If you have data to the contrary between say 14:00 UTC and 18:00 UTC
> today, please let us know. Otherwise, we're going to keep watching and
> waiting for this to happen again. Once we have a drop, we have a team of
> networking engineers at the ready to run through a pre-planned sequence
> of investigatory steps. With any luck, we'll identify the issue.
>
> -John Kalucki
> http://twitter.com/jkalucki
> Infrastructure, Twitter Inc.
>
> On Mon, Feb 22, 2010 at 4:30 PM, Sergi <sdepab...@gmail.com> wrote:
>
>> I experienced the problem for the last time today - in fact now
>> yesterday - at 15:55 and after 5 minutes Phirehose reconnected.
>>
>> [22-Feb-2010 15:55:25] Phirehose: Consume rate: 0 status/sec (1 total), avg enqueueStatus(): 0.05ms, avg checkFilterPredicates(): 0.01ms (3 total) over 60 seconds.
>> [22-Feb-2010 16:01:22] Phirehose: Idle timeout: No statuses received for > 300 seconds. Reconnecting.
>>
>> Sergi
>>
>> On Feb 22, 7:51 pm, John Kalucki <j...@twitter.com> wrote:
>> > A number of developers have reported abandoned connection issues on
>> > the Streaming API starting, perhaps, about two weeks ago. The
>> > symptoms include a long-established TCP connection to
>> > stream.twitter.com going quiet, with the connection mysteriously
>> > held open for perhaps hours afterward. After sorting through a lot
>> > of conflicting data and chasing a few wild geese, I finally
>> > reproduced this problem at Feb 22 15:55 UTC (7:55am PST). I'd
>> > imagine that a number of streams were abandoned at this time. If you
>> > had a correlative experience within a minute or so of 15:55 UTC,
>> > please respond to this message.
>> >
>> > We currently suspect an infrequent hardware load balancer issue,
>> > perhaps related to a recent configuration change. The appearance is
>> > that the load balancer is, for whatever reason, dropping valid
>> > connections, closing the connection to the Streaming API servers,
>> > but not sending a TCP FIN or TCP RST to the client. This is bad.
>> > We're treating this as a critical production issue and working
>> > through the details with network operations. I'll follow up as we
>> > learn more.
>> >
>> > -John Kalucki
>> > http://twitter.com/jkalucki
>> > Infrastructure, Twitter Inc.
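
For client authors following this thread: when a connection is abandoned without a TCP FIN or RST, as described above, the client sees nothing at all, so the only signal it has is silence. Below is a minimal Python sketch of an idle-timeout read loop that tells that case apart from a clean close. The names `read_until_idle` and `IdleTimeout` are made up for illustration, and this is not how Phirehose or any Twitter client library is actually implemented; the 300-second window in Sergi's log would simply be passed as `idle_seconds`.

```python
import socket


class IdleTimeout(Exception):
    """Raised when no bytes arrive within the idle window (hypothetical name)."""


def read_until_idle(sock, idle_seconds, on_data, bufsize=4096):
    """Read from sock until the peer closes or the stream goes quiet.

    Returns "closed" when the peer closes cleanly (recv returns b"").
    Raises IdleTimeout when nothing arrives for idle_seconds, which is
    what an abandoned connection with no FIN or RST looks like from the
    client side.
    """
    sock.settimeout(idle_seconds)
    while True:
        try:
            chunk = sock.recv(bufsize)
        except socket.timeout:
            # No FIN, no RST, no data: treat the connection as abandoned.
            raise IdleTimeout(f"no data for {idle_seconds} seconds")
        if not chunk:
            # An empty read means the peer closed the connection normally.
            return "closed"
        on_data(chunk)
```

A caller would typically catch `IdleTimeout`, close the socket, and reconnect, ideally with some backoff. The timeout should be comfortably longer than the longest quiet period expected on a healthy stream, otherwise low-volume streams will reconnect needlessly.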