Hi,

We've been running Storm for a while and recently decided to upgrade from 0.8 to 0.9.0.1. We managed to get our staging infrastructure working (one 'master' node running nimbus, drpc, ui and zookeeper; one 'worker' node running the supervisor), and moved on to upgrading production (a cluster of three nodes: one 'master' node and two 'worker' nodes). However, our production cluster refused to run a topology correctly - Trident batches would be formed from incoming events, then failed after timing out, without ever being processed. After some experimentation, we managed to replicate this on our staging cluster by introducing a second 'worker' node.
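In case the cluster configuration is relevant, the worker nodes' storm.yaml looks roughly like the sketch below - hostnames, paths and port numbers here are placeholders rather than our real values:

    # storm.yaml on the worker nodes (values are placeholders)
    storm.zookeeper.servers:
        - "master.internal.example"
    nimbus.host: "master.internal.example"
    drpc.servers:
        - "master.internal.example"
    storm.local.dir: "/var/lib/storm"
    supervisor.slots.ports:
        - 6700
        - 6701
        - 6702
        - 6703

The master node runs nimbus, drpc, ui and zookeeper, with the equivalent settings pointing at itself.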
We're completely stumped on this, as there are no errors in the logs and no differences that we can detect between the existing (working) 'worker' node and our freshly built 'worker' node. We've tried various arrangements of 'worker' nodes in the cluster and observed the following behaviour:

1. New cluster with the master node and one worker node, worker1: correct behaviour. Batches processed.
2. Cluster with the master node, worker1 disabled, and a new worker node, worker2, added: all batches time out.
3. Cluster with the master node and both worker1 and worker2 running: all batches time out on both worker1 and worker2.
4. Cluster with the master node, worker2 disabled, and worker1 enabled: correct behaviour. Batches processed.

In addition to these four isolated scenarios, we've tried stopping and starting the supervisor process on each of the workers to transition between every one of the above scenarios. It is always the case that as soon as worker1 is not the only node running the supervisor process, the topology starts failing; even with just worker2 running the supervisor, the topology fails to work. As soon as worker1 becomes the only active worker node in the cluster, regardless of the previous arrangement, everything runs fine again.

During the failure state there is no indication in the logs of anything amiss - no stack traces, no apparent errors. The only output is from our spout, logging when batches are created and again when they time out.

It's worth noting at this point that we use Puppet to manage our node configuration, and that our nodes are AWS instances. So when I say that worker2 is identical to worker1, I mean that they were bootstrapped from the same config and should therefore be identical. I can also rule out a single 'dodgy' host being responsible for the new worker node's behaviour - I've been fighting this for over a week and have bootstrapped and terminated several new nodes in that time. All have behaved identically.
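For completeness, the topology is built and submitted roughly along these lines. This is a simplified sketch with a stand-in test spout (FixedBatchSpout) in place of our real event spout; the class, stream and setting values are illustrative rather than our actual code:

    import backtype.storm.Config;
    import backtype.storm.StormSubmitter;
    import backtype.storm.tuple.Fields;
    import backtype.storm.tuple.Values;
    import storm.trident.TridentTopology;
    import storm.trident.operation.builtin.Count;
    import storm.trident.testing.FixedBatchSpout;
    import storm.trident.testing.MemoryMapState;

    public class EventCountTopology {
        public static void main(String[] args) throws Exception {
            // Stand-in for our real spout, which reads incoming events and logs
            // when batches are created and when they time out.
            FixedBatchSpout spout = new FixedBatchSpout(new Fields("event"), 100,
                    new Values("event-a"), new Values("event-b"));
            spout.setCycle(true);

            TridentTopology topology = new TridentTopology();
            topology.newStream("events", spout)
                    .groupBy(new Fields("event"))
                    .persistentAggregate(new MemoryMapState.Factory(),
                            new Count(), new Fields("count"));

            Config conf = new Config();
            // One worker slot per supervisor node in the two-worker arrangement.
            conf.setNumWorkers(2);
            // Batches not fully processed within this window are failed and replayed.
            conf.setMessageTimeoutSecs(30);
            StormSubmitter.submitTopology("event-count", conf, topology.build());
        }
    }

Happy to share the real topology code and full configs if that would help.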
