We are using the delayed rebalancer to manage a Master-Slave cluster.
When a large portion of the cluster disconnects from ZK (for example, a
network partition or a service crash caused by a bug), the Helix controller
tries hard to move shards to the rest of the cluster.
This can make things worse if rebuilding a replica is very expensive, or if
no live replica is left in the rest of the cluster.
What is the suggested way to handle this case? Is there a way to make the
Helix controller pause rebalancing when the number of live instances changes
by more than a threshold?
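
To make the question concrete, here is a rough sketch of the kind of check I
have in mind. It is plain Python and does not use any Helix API; the function
name and the 30% default threshold are made up for illustration only.

```python
def should_pause_rebalance(previous_live, current_live, max_lost_fraction=0.3):
    """Return True if too many instances dropped out between two snapshots.

    previous_live / current_live: iterables of instance names (hypothetical
    snapshots of the live-instance list at two points in time).
    max_lost_fraction: hypothetical threshold, as a fraction of previous_live.
    """
    previous_live = set(previous_live)
    if not previous_live:
        # No earlier snapshot to compare against, so never pause.
        return False
    # Instances that were live before and are missing now.
    lost = previous_live - set(current_live)
    return len(lost) / len(previous_live) > max_lost_fraction


if __name__ == "__main__":
    before = {f"node{i}" for i in range(1, 11)}   # 10 live instances
    after = {"node1", "node2", "node3"}           # 7 of them disappeared
    print(should_pause_rebalance(before, after))
```

With this kind of check, losing a single node out of ten would still trigger
normal rebalancing, while losing most of the cluster at once would hold shard
movement until someone has looked at it.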