I just noticed that we have a cluster config "MAX_OFFLINE_INSTANCES_ALLOWED", which is used in
https://github.com/apache/helix/blob/master/helix-core/src/main/java/org/apache/helix/controller/stages/BestPossibleStateCalcStage.java#L70-L71

"If the offline/disabled instance number is above this threshold, the rebalancer will be paused."

I am wondering whether FULL_AUTO mode goes through BestPossibleStateCalcStage. If so, will this config help us in the case where a large portion of the cluster, or even the whole cluster, disconnects from ZK?

On Tue, Mar 6, 2018 at 10:51 PM, Bo Liu <[email protected]> wrote:

> I agree semi-auto is a safer mode for stateful services. But we will have
> to compute the ideal state by ourselves (either manually triggered or triggered
> by live instance change events). That means we need to implement logic for
> delayed shard moves and a shard placement algorithm. Not sure if there are
> any building blocks exposed by Helix that we could leverage for semi-auto
> mode.
>
> On Tue, Mar 6, 2018 at 7:12 PM, kishore g <[email protected]> wrote:
>
>> This was one of the reasons we came up with the semi-auto mode. It's
>> non-trivial to handle edge cases in full auto mode, especially for stateful
>> services. Having said that, let's see what we can do in
>> catastrophic scenarios. Having a check on the live instance changes is a
>> good check, but it's hard to compute this reliably in some scenarios,
>> e.g. if the controllers also went down at the same time and came back up,
>> they would have missed all the changes from ZK.
>>
>> I think it's better to limit the number of changes a controller would
>> trigger in the cluster. This is where throttling and constraints can be
>> used. Helix already has the ability to limit the number of transitions in the
>> cluster at once. But this limits the number of concurrent transitions, not
>> the number of transitions triggered in a time period.
>>
>> We can probably enhance this functionality to keep track of the number of
>> transitions in the last X minutes and limit that number.
>>
>> Any thoughts on that?
>>
>> On Tue, Mar 6, 2018 at 4:30 PM, Bo Liu <[email protected]> wrote:
>>
>>> Hi,
>>>
>>> We are using the delayed rebalancer to manage a Master-Slave cluster.
>>> In the event that a large portion of a cluster disconnects from ZK
>>> (network partition, or service crash due to a bug), the Helix controller will
>>> try hard to move shards to the rest of the cluster.
>>> This could make things worse if it's very expensive to rebuild a
>>> replica or there is no live replica left in the rest of the cluster.
>>> I am wondering what's the suggested way to handle this case? Is there a
>>> way to let the Helix controller pause when the change in live instances is more
>>> than a threshold?
>>>
>>> --
>>> Best regards,
>>> Bo
>>>
>
> --
> Best regards,
> Bo

--
Best regards,
Bo
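
For reference, here is a rough sketch of the sliding-window idea from the quoted reply: keep track of how many transitions were started in the last X minutes and refuse new ones once a limit is reached. This is not Helix code. The class name SlidingWindowTransitionLimiter, its constructor parameters, and tryAcquire are all hypothetical, and how it would be wired into the controller pipeline is left open.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/**
 * Hypothetical helper (not part of Helix): caps how many state transitions
 * may be started within a sliding time window.
 */
final class SlidingWindowTransitionLimiter {
    private final int maxTransitions;   // limit per window (assumed parameter)
    private final long windowMillis;    // window length, e.g. X minutes in ms
    private final Deque<Long> startTimes = new ArrayDeque<>();

    SlidingWindowTransitionLimiter(int maxTransitions, long windowMillis) {
        if (maxTransitions <= 0 || windowMillis <= 0) {
            throw new IllegalArgumentException("limit and window must be positive");
        }
        this.maxTransitions = maxTransitions;
        this.windowMillis = windowMillis;
    }

    /**
     * Returns true and records the transition if fewer than maxTransitions
     * were started during the last windowMillis; otherwise returns false.
     * Callers are expected to pass non-decreasing timestamps.
     */
    synchronized boolean tryAcquire(long nowMillis) {
        // Forget transitions that have aged out of the window.
        while (!startTimes.isEmpty()
                && nowMillis - startTimes.peekFirst() >= windowMillis) {
            startTimes.pollFirst();
        }
        if (startTimes.size() >= maxTransitions) {
            return false; // budget for this window is used up
        }
        startTimes.addLast(nowMillis);
        return true;
    }
}
```

A caller would check tryAcquire(System.currentTimeMillis()) before issuing each transition and defer the transition when it returns false. Unlike a cap on concurrent transitions, this bounds the total number triggered per time period, which is the distinction made in the reply above.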
