I have seen this behaviour too with 0.9.2-incubating. Failover works better when there is a redundant node available, so maybe 1 slot per node is the best approach. I am eager to know if there are any steps to diagnose this further.
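For reference, the number of slots on a node is controlled by supervisor.slots.ports in that node's storm.yaml; one listed port means one worker slot. A minimal sketch of a 1-slot-per-node setup (the port is just the usual default, adjust for your environment):

    # storm.yaml on each supervisor node (sketch, not a complete config)
    supervisor.slots.ports:
        - 6700    # a single port = a single worker slot on this node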
On Wed, Aug 6, 2014 at 5:43 AM, Vinay Pothnis <[email protected]> wrote:
> [Storm Version: 0.9.2-incubating]
>
> Hello,
>
> I am trying to test failover scenarios with my Storm cluster. The
> following are the details of the cluster:
>
> * 4 nodes
> * Each node with 2 slots
> * Topology with around 600 spouts and bolts
> * Num. Workers for the topology = 4
>
> I am running a test that generates a constant load. The cluster is able
> to handle this load fairly well, and the CPU utilization at this point is
> below 50% on all the nodes. 1 slot is occupied on each of the nodes.
>
> I then bring down one of the nodes (kill the supervisor and the worker
> processes on that node). After this, another worker is created on one of the
> remaining nodes, but the CPU utilization jumps up to 100%. At this point,
> Nimbus cannot communicate with the supervisor on that node and keeps killing
> and restarting workers.
>
> The CPU utilization remains pegged at 100% as long as the load is on. If I
> stop the test and restart it after a while, the same setup with
> just 3 nodes works perfectly fine with lower CPU utilization.
>
> Any pointers on how to figure out the reason for the high CPU utilization
> during the failover?
>
> Thanks
> Vinay
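For anyone trying to reproduce this, a minimal sketch of the topology-side configuration described above (4 workers, one per node) could look like the following. The class name is a placeholder; only the worker count comes from the setup in this thread, and the max-spout-pending line is an optional, hedged suggestion rather than part of the original setup.

    import backtype.storm.Config;

    public class FailoverTestConfig {
        // Builds the topology configuration described in this thread:
        // 4 workers, so each of the 4 nodes runs one worker in steady state.
        public static Config buildConfig() {
            Config conf = new Config();
            conf.setNumWorkers(4);
            // Optional: capping in-flight tuples can soften the replay burst
            // that hits the surviving nodes right after a worker dies.
            // conf.setMaxSpoutPending(1000);  // value is a placeholder
            return conf;
        }
    }

The topology itself would still be assembled with TopologyBuilder and submitted via StormSubmitter.submitTopology as usual; nothing else here is specific to the failover scenario.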
