I am trying to understand why, for a topology running on Storm
0.9.1-incubating, the supervisor on the machine periodically kills *all* of
the topology's workers.

Whether I set topology.workers to 1, 2, 4, or 8, I always get logs like this:

https://gist.github.com/amontalenti/cd7f380f716f1fd17e1b

The logs basically indicate that the supervisor thinks all the workers timed
out at exactly the same time, and then it kills them all.
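
For context, I'm setting the worker count when I submit, roughly like this
(simplified; "my-topology" and buildTopology() are placeholders for my
actual code):

    import backtype.storm.Config;
    import backtype.storm.StormSubmitter;

    Config conf = new Config();
    conf.setNumWorkers(4);  // I've tried 1, 2, 4, and 8 here
    StormSubmitter.submitTopology("my-topology", conf, buildTopology());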

I've tried tweaking the worker timeout, bumping it up to 120 seconds, but
this hasn't helped at all. No matter what, the workers periodically get
whacked by the supervisor and the whole topology has to restart.
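
Concretely, what I changed was this line in storm.yaml on the supervisor
box (assuming supervisor.worker.timeout.secs is the right knob; I believe
the default is 30):

    supervisor.worker.timeout.secs: 120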

I notice that this does happen less frequently when the machine is under
less load: if I drop topology.max.spout.pending *way* down, to 100 or 200,
the topology runs for a while without crashing. But I've even seen it crash
in that state.
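
For reference, that change was just this one line in the topology config
(same conf object as in the snippet above):

    conf.setMaxSpoutPending(200);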

I saw on some other threads people indicating that the supervisor will kill
all workers if "the nimbus fails to see a heartbeat from zookeeper". Could
someone walk me through how to figure out whether this is the case? Nothing
in the logs points me in that direction.
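
In case it helps, here is the rough sketch I've been using to eyeball the
heartbeat znodes directly. It assumes the default storm.zookeeper.root of
/storm and that per-executor heartbeats live under /storm/workerbeats; I
pieced that layout together from the source, so it may be off, and
"zkhost:2181" is a placeholder for my actual ensemble:

    import java.util.Date;
    import org.apache.zookeeper.ZooKeeper;
    import org.apache.zookeeper.data.Stat;

    public class BeatCheck {
        public static void main(String[] args) throws Exception {
            // Connect to the same ZooKeeper ensemble the Storm cluster uses
            ZooKeeper zk = new ZooKeeper("zkhost:2181", 15000, null);
            // Each topology gets a child of /storm/workerbeats, and each
            // heartbeat znode's mtime should advance every few seconds
            for (String topo : zk.getChildren("/storm/workerbeats", false)) {
                String topoPath = "/storm/workerbeats/" + topo;
                for (String beat : zk.getChildren(topoPath, false)) {
                    Stat stat = zk.exists(topoPath + "/" + beat, false);
                    System.out.println(topoPath + "/" + beat
                        + " last heartbeat: " + new Date(stat.getMtime()));
                }
            }
            zk.close();
        }
    }

My thinking was: if the mtimes stall while the workers are clearly still
running, that would point at the heartbeat path rather than the workers
themselves. Is that a reasonable way to check?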

Thanks!

Andrew
