We're seeing an issue on our Storm worker servers where one server
after another fails in a particular way: the server either stops
establishing new outgoing connections entirely, or its outgoing
connections become very, very slow. The servers aren't able to
recover from this state without a reboot.

The servers start exhibiting this behavior in a fairly predictable
way: they go "out" roughly in order of cumulative event volume, i.e.
the servers that have been up the longest processing events for the
highest-volume topologies fail one after the other. In our hostname
naming scheme we've gone through servers a through f in pretty much
alphabetical order, with slight variance depending on which
topologies were running on each server.

We're currently using Rackspace cloud servers, which obviously aren't
optimal for Storm, but that's what we have right now. The servers run
CentOS release 6.5 (Final) with kernel 2.6.32-431.3.1.el6.x86_64.

Has anyone seen this sort of behavior on their infrastructure?

Also, since I can't just move the workers to a different server
infrastructure in an instant, I'll probably have to live with this
issue for some time. Does anyone have good suggestions on how to deal
with it?

Storm doesn't natively seem to have any way to deal with a situation
like this: the workers appear alive to Storm, they're just processing
events VERY slowly, or not at all.
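
The only angle I can think of that doesn't depend on Storm at all is
an OS-level check. As a rough sketch (not something we actually run;
the target host and thresholds are placeholders), a tiny probe that
times a fresh outbound TCP connect could at least flag a host that
has slipped into this state so that cron or our monitoring could
restart or reboot it:

import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

// Rough sketch of an external probe: time a fresh outbound TCP
// connect and exit nonzero if it fails or is slow, so cron or a
// monitoring agent can act on it. The default target and the
// thresholds below are placeholders, not real values.
public class OutboundConnectProbe {
    public static void main(String[] args) {
        String host = args.length > 0 ? args[0] : "some-upstream-host"; // placeholder
        int port = args.length > 1 ? Integer.parseInt(args[1]) : 80;
        int connectTimeoutMs = 5000;  // connects slower than this count as stuck
        long slowThresholdMs = 1000;  // connects slower than this count as degraded

        int exitCode;
        long start = System.currentTimeMillis();
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), connectTimeoutMs);
            long elapsed = System.currentTimeMillis() - start;
            System.out.println("outbound connect took " + elapsed + " ms");
            exitCode = elapsed > slowThresholdMs ? 1 : 0;
        } catch (IOException e) {
            System.err.println("outbound connect failed: " + e.getMessage());
            exitCode = 2;
        }
        System.exit(exitCode);
    }
}

Run every minute or so against an endpoint the workers normally talk
to, a nonzero exit code could trigger a worker restart or a reboot of
the box. It's crude, but it catches the "looks alive, can't connect
out" state that Storm itself doesn't notice.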

Is there a good way to (semi-)automatically detect slow bolts/workers
and proactively kill them off, without having to rely on passive
monitoring (like Storm UI or some other monitoring solution)? My
ideal solution would be to "route" events around the slow components.
The more automatic the process, the better.
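
On the Storm side, the closest hook I'm aware of is the metrics
consumer interface (backtype.storm.metric.api.IMetricsConsumer in
0.9.x). Here's a rough sketch of using it to flag slow tasks off the
built-in execute-latency metric; the metric name/shape and the
threshold are my assumptions, so corrections welcome:

import java.util.Collection;
import java.util.Map;

import backtype.storm.metric.api.IMetricsConsumer;
import backtype.storm.task.IErrorReporter;
import backtype.storm.task.TopologyContext;

// Sketch of a metrics consumer that logs tasks whose execute latency
// crosses a threshold, so an external script can act on them.
public class SlowTaskLogger implements IMetricsConsumer {

    // Threshold is a made-up number; tune per topology.
    private static final double MAX_EXECUTE_LATENCY_MS = 500.0;

    @Override
    public void prepare(Map stormConf, Object registrationArgument,
                        TopologyContext context, IErrorReporter errorReporter) {
        // No state needed for this sketch.
    }

    @Override
    public void handleDataPoints(TaskInfo taskInfo, Collection<DataPoint> dataPoints) {
        for (DataPoint dp : dataPoints) {
            // "__execute-latency" seems to arrive as a map of
            // stream -> average latency in ms; verify on your version.
            if ("__execute-latency".equals(dp.name) && dp.value instanceof Map) {
                for (Object latency : ((Map<?, ?>) dp.value).values()) {
                    if (latency instanceof Number
                            && ((Number) latency).doubleValue() > MAX_EXECUTE_LATENCY_MS) {
                        System.err.printf("SLOW TASK %s (task %d) on %s:%d, latency %s ms%n",
                                taskInfo.srcComponentId, taskInfo.srcTaskId,
                                taskInfo.srcWorkerHost, taskInfo.srcWorkerPort, latency);
                    }
                }
            }
        }
    }

    @Override
    public void cleanup() {}
}

It would get registered with something like
conf.registerMetricsConsumer(SlowTaskLogger.class, 1), and a script
could tail the log and restart or rebalance the offending workers.
That still feels pretty manual, though, and it doesn't give me the
"route around the slow components" behavior I'd really want.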

-TPP
