I have set up Storm Nimbus HA on a small cluster and have been seeing severe performance problems when simulating a significant server outage. In the test I shut down half of the Storm nodes (including the Nimbus leader), which is a realistic failure scenario in our environment. Nimbus leadership does fail over to a new node correctly, but all Storm operations become unresponsive: Storm HTTP API calls either time out or take 1-2 minutes to complete, the Storm UI takes forever to load, and topologies cannot spin up workers promptly, which causes topology downtime.
The logs are full of messages about failed connection attempts to the nodes I shut down, and these errors persist for hours. I can "fix" the issue by manually removing the downed nodes from the nimbus.seeds setting in storm.yaml and restarting the Storm services on the surviving machines, but this manual intervention feels unnecessary and is impractical in production, where a failure may occur when no one is around to babysit a recovery. I have a small environment set up to mimic our production system, and the behavior is easy to reproduce with four Storm nodes when two are shut down. Any help would be appreciated.
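For reference, this is roughly what the manual workaround looks like in storm.yaml (the hostnames here are placeholders for my actual nodes):

```yaml
# storm.yaml on each surviving node, as originally configured
# with all four nodes listed as Nimbus seeds:
nimbus.seeds: ["storm1", "storm2", "storm3", "storm4"]

# After editing the list down to only the live nodes (storm3 and
# storm4 are the ones I shut down) and restarting the Storm
# services on the surviving machines, everything becomes
# responsive again:
nimbus.seeds: ["storm1", "storm2"]
```

The fact that trimming the seed list and restarting restores responsiveness suggests the surviving services keep retrying the dead seeds, but I have not found a setting that makes them give up or back off on their own.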