I have seen this behaviour too with 0.9.2-incubating.
Failover seems to work better when there is a redundant node available; maybe
one slot per node is the best approach (a rough config sketch is below).
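
For what it's worth, the slot count per node is controlled by
supervisor.slots.ports in storm.yaml (one port per entry, one worker slot per
port). A minimal sketch of a single-slot supervisor, assuming the default
worker port 6700 and a purely illustrative heap size:

    # storm.yaml on each supervisor node
    supervisor.slots.ports:
        - 6700                  # one port -> one worker slot on this node

    # optional: give the single worker a larger heap, since it will host
    # all executors assigned to this node (the 2g value is just an example)
    worker.childopts: "-Xmx2048m"

The supervisor needs to be restarted after the change for the new slot count
to take effect.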
Eager to know if there are any steps to diagnose this further.

On Wed, Aug 6, 2014 at 5:43 AM, Vinay Pothnis <[email protected]>
wrote:

> [Storm Version: 0.9.2-incubating]
>
> Hello,
>
> I am trying to test failover scenarios with my storm cluster. The
> following are the details of the cluster:
>
> * 4 nodes
> * Each node with 2 slots
> * Topology with around 600 spouts and bolts
> * Num. Workers for the topology = 4
>
> I am running a test that generates a constant load. The cluster handles
> this load fairly well, and the CPU utilization at this point is below 50%
> on all the nodes. One slot is occupied on each node.
>
> I then bring down one of the nodes (killing the supervisor and the worker
> processes on that node). After this, another worker is created on one of the
> remaining nodes, but the CPU utilization jumps to 100%. At this point, Nimbus
> cannot communicate with the supervisor on that node and keeps killing
> and restarting workers.
>
> The CPU utilization remains pegged at 100% for as long as the load is on. If
> I stop the test and restart it after a while, the same setup with just
> 3 nodes works perfectly fine, with lower CPU utilization.
>
> Any pointers on how to figure out the reason for the high CPU utilization
> during the failover?
>
> Thanks
> Vinay
>
