We're seeing an issue on our Storm worker servers where server after server fails in the same way: the servers either stop establishing new outgoing connections entirely, or the outgoing connections become very, very slow. The servers can't recover from this state without a reboot.
The servers start exhibiting this behavior in a pretty predictable way: they go "out" roughly in order of cumulative event volume, i.e. the servers that have been up the longest processing the topologies with the highest event volume fail one after the other. In our hostname naming scheme we've gone through servers a through f pretty much in alphabetical order, with slight variance depending on which topologies were running on each server. We're currently on Rackspace cloud servers, which obviously aren't optimal for Storm use, but that's what we have right now. The servers run CentOS release 6.5 (Final) (2.6.32-431.3.1.el6.x86_64).

Has anyone seen this sort of behavior on their infrastructure? Also, since I can't just move the servers to different infrastructure in an instant, I'll probably have to live with this issue for some time. Does anyone have good suggestions for dealing with it? Storm doesn't natively seem to have any way to handle a situation like this: the workers appear live to Storm, they're just processing events VERY slowly, or not at all. Is there a good way to (semi-)automatically detect slow bolts/workers and proactively kill them off, without having to rely on passive monitoring (like Storm UI or some other monitoring solution)? My optimal solution would be to "route" events around the slow components. The more automatic the process, the better.

-TPP
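P.S. To make the "semi-automatic detection" idea a bit more concrete, here's a rough sketch of the kind of watchdog I have in mind, assuming a Storm version whose UI daemon exposes the REST API. The UI host, the capacity threshold, and the endpoint paths / field names (`/api/v1/topology/summary`, `boltId`, `capacity`) are my assumptions and may need adjusting for your version.

```python
#!/usr/bin/env python
# Rough sketch: poll the Storm UI REST API and flag bolts whose "capacity"
# metric is at or above a threshold. Endpoint paths and JSON field names
# are assumptions and may differ between Storm versions.
import requests

UI = "http://storm-ui.example.com:8080"   # hypothetical Storm UI host
CAPACITY_THRESHOLD = 0.9                  # bolts running near saturation

def topology_ids():
    # List all running topologies known to the UI.
    summary = requests.get(UI + "/api/v1/topology/summary").json()
    return [t["id"] for t in summary.get("topologies", [])]

def slow_bolts(topology_id):
    # Yield (bolt id, capacity) for bolts that look saturated.
    topo = requests.get(UI + "/api/v1/topology/" + topology_id).json()
    for bolt in topo.get("bolts", []):
        capacity = float(bolt.get("capacity", 0) or 0)
        if capacity >= CAPACITY_THRESHOLD:
            yield bolt["boltId"], capacity

if __name__ == "__main__":
    for tid in topology_ids():
        for bolt_id, capacity in slow_bolts(tid):
            # Report only for now; once the signal looks trustworthy, a cron
            # job could escalate to `storm rebalance` or `storm kill`.
            print("%s: bolt %s at capacity %.2f" % (tid, bolt_id, capacity))
```

This only reports slow bolts; the kill/rebalance step would still be manual until we trust the signal.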
