I have set up Storm Nimbus HA on a small cluster and am seeing severe 
performance problems when simulating a significant server outage.  In the 
test, I shut down half of the Storm nodes (including the Nimbus leader), 
which is a realistic failure scenario in our environment.  The Nimbus 
leadership fails over to a new node correctly; however, ALL Storm operations 
stop responding in a timely manner.  Storm HTTP API calls either time out or 
take 1-2 minutes to complete, the Storm UI takes forever to load, and 
topologies cannot spin up workers promptly, which leads to topology downtime.

In the logs I see many messages about failed connection attempts to the nodes 
I manually shut down, and these errors persist for hours.

I can "fix" the issue by manually removing the downed nodes from the 
nimbus.seeds setting in storm.yaml and restarting the Storm services on the 
surviving machines.  However, this manual intervention feels like it should 
be unnecessary, and it is not practical in production, where a failure may 
occur when no one is around to babysit a manual recovery process.
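For reference, here is a minimal sketch of the relevant storm.yaml; the 
hostnames are placeholders, not our real node names:

```yaml
# storm.yaml (illustrative hostnames)
# All four nodes are listed as Nimbus seeds.  After shutting down
# storm-node3 and storm-node4, removing them from this list and
# restarting the Storm services on the surviving nodes restores
# responsiveness.
nimbus.seeds: ["storm-node1", "storm-node2", "storm-node3", "storm-node4"]
storm.zookeeper.servers:
  - "zk-node1"
  - "zk-node2"
  - "zk-node3"
```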

I have a small environment set up to mimic our production system, and this 
behavior is easy to reproduce with four Storm nodes when two are shut down.

Any help would be appreciated.
