I am trying to understand why, for a topology running on 0.9.1-incubating, the supervisor on the machine is periodically killing *all* of the topology's Storm workers.
Whether I set topology.workers to 1, 2, 4, or 8, I always get logs like this: https://gist.github.com/amontalenti/cd7f380f716f1fd17e1b These basically indicate that the supervisor thinks all the workers timed out at exactly the same time, and then it kills them all.

I've tried tweaking the worker timeout seconds, bumping it up to e.g. 120 secs, but this hasn't helped at all. No matter what, the workers periodically get whacked by the supervisor and the whole topology has to restart. I notice that this happens less frequently when the machine is under less load: if I drop topology.max.spout.pending *way* down, to e.g. 100 or 200, then it runs for a while without crashing. But I've even seen it crash in this state.

I saw on some other threads that the supervisor will kill all workers if "the nimbus fails to see a heartbeat from zookeeper". Could someone walk me through how I could figure out if this is the case? Nothing in the logs seems to point me in this direction.

Thanks!

Andrew
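P.S. For reference, here's roughly what I've been adjusting (values are the ones mentioned above; the worker count is one of the several I tried, and everything else is left at the cluster defaults):

```yaml
# storm.yaml (supervisor side)
supervisor.worker.timeout.secs: 120   # bumped up from the default; didn't help

# per-topology config
topology.workers: 4                   # also tried 1, 2, and 8
topology.max.spout.pending: 200       # dropping this makes crashes less frequent
```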
