protect a cluster during broad range outage

Bo Liu Tue, 06 Mar 2018 16:30:46 -0800

Hi,

We are using the delayed rebalancer to manage a Master-Slave cluster.
When a large portion of the cluster disconnects from ZK (a network
partition, or a service crash caused by a bug), the Helix controller tries
hard to move shards to the rest of the cluster.
This can make things worse if rebuilding a replica is very expensive, or if
there is no live replica left in the rest of the cluster.
What is the suggested way to handle this case? Is there a way to make the
Helix controller pause when the change in live instances exceeds a
threshold?
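
To make the question concrete, here is a rough sketch of the kind of check I
have in mind. It is plain Java and does not use any Helix API: the class
name LiveInstanceGuard, the method shouldPause, and the maxLostFraction
parameter are all hypothetical names made up for illustration.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper, NOT part of Helix: decides whether rebalancing
// should be paused based on how many live instances just disappeared.
public class LiveInstanceGuard {

    // Returns true when the fraction of previously live instances that are
    // now missing is strictly greater than maxLostFraction (e.g. 0.5).
    static boolean shouldPause(Set<String> previousLive,
                               Set<String> currentLive,
                               double maxLostFraction) {
        if (previousLive.isEmpty()) {
            return false; // nothing to compare against
        }
        // Instances that were live before but are not live now.
        Set<String> lost = new HashSet<>(previousLive);
        lost.removeAll(currentLive);
        double lostFraction = (double) lost.size() / previousLive.size();
        return lostFraction > maxLostFraction;
    }
}
```

With a threshold of 0.5, losing 3 of 4 instances would pause rebalancing,
while losing 1 of 4 would let it proceed as usual.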


-- 
Best regards,
Bo
