Great, thanks a lot! Will try it out tomorrow.

On Mon, Mar 19, 2018 at 7:30 PM, Lei Xia <[email protected]> wrote:

Here is the commit that adds maintenance mode support, if you are interested in the implementation:
https://github.com/apache/helix/commit/a7477c3bbc85059b2e522f5caa214c33eb4c3e15

On Mon, Mar 19, 2018 at 7:08 PM, Bo Liu <[email protected]> wrote:

Hi Lei,

I only saw the code that sets the maintenance flag. However, I don't see any code that reads this flag. Would you point me to the code that implements the maintenance mode logic?

Thanks,
Bo

On Mon, Mar 19, 2018, 18:48 kishore g <[email protected]> wrote:

Hi Lei,

qq. What if the cluster is being started for the first time? Will it get enabled only after the minimum number of nodes has started?

thanks

On Mon, Mar 19, 2018 at 6:42 PM, Lei Xia <[email protected]> wrote:

Actually, we already support maintenance mode in 0.8.0. My bad.

You can give it a try now: with "MAX_OFFLINE_INSTANCES_ALLOWED" set in ClusterConfig, once the number of offline instances reaches the limit, Helix will put the cluster into maintenance mode.

For now, you have to call HelixAdmin.enableMaintenanceMode() manually to exit maintenance mode. Support for automatically exiting maintenance mode is on our roadmap.
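For example, something along these lines (a rough sketch against the 0.8 Java API; the ZK address, cluster name, and threshold are placeholders, so please double-check the exact signatures in your version):

  import java.util.Collections;

  import org.apache.helix.HelixAdmin;
  import org.apache.helix.manager.zk.ZKHelixAdmin;
  import org.apache.helix.model.HelixConfigScope;
  import org.apache.helix.model.HelixConfigScope.ConfigScopeProperty;
  import org.apache.helix.model.builder.HelixConfigScopeBuilder;

  HelixAdmin admin = new ZKHelixAdmin("localhost:2181");

  // Ask Helix to enter maintenance mode automatically once 5 instances
  // are offline.
  HelixConfigScope scope = new HelixConfigScopeBuilder(ConfigScopeProperty.CLUSTER)
      .forCluster("MyCluster")
      .build();
  admin.setConfig(scope, Collections.singletonMap("MAX_OFFLINE_INSTANCES_ALLOWED", "5"));

  // Once the outage is over, exit maintenance mode manually.
  admin.enableMaintenanceMode("MyCluster", false);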
Lei

Lei Xia
Data Infra/Helix
[email protected]
www.linkedin.com/in/lxia1
------------------------------
From: Bo Liu <[email protected]>
Sent: Monday, March 19, 2018 6:33:10 PM
To: [email protected]
Subject: Re: protect a cluster during broad range outage

Hi Lei,

Thank you so much for the detailed information. We were about to implement our own logic to generate the ideal state to handle this disaster case (that's why we were looking at the code in BestPossibleStateCalcStage.java). Are you saying the pausing mode is already implemented in 0.8.0? I looked at the code in validateOfflineInstancesLimit(), which only sets the maintenance flag, not the pause flag, when MAX_OFFLINE_INSTANCES_ALLOWED is hit. Did I miss anything? If pausing mode is supported in 0.8.0, we'd like to try it out.

Thanks,
Bo

On Mon, Mar 19, 2018 at 6:18 PM, Lei Xia <[email protected]> wrote:

Sorry, I totally missed this email thread.

Yes, we do have such a feature in 0.8 to protect the cluster in case of disaster. A new config option "MAX_OFFLINE_INSTANCES_ALLOWED" can be set in ClusterConfig. If it is set and the number of offline instances in the cluster reaches the configured limit, Helix will automatically pause (disable) the cluster, i.e., Helix will not react to any cluster changes anymore. You have to manually re-enable the cluster (via HelixAdmin.enableCluster()) though.
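For example (again just a sketch; the ZK address and cluster name are placeholders):

  import org.apache.helix.HelixAdmin;
  import org.apache.helix.manager.zk.ZKHelixAdmin;

  HelixAdmin admin = new ZKHelixAdmin("localhost:2181");
  // true re-enables a cluster that Helix auto-paused; false pauses it.
  admin.enableCluster("MyCluster", true);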
Keep in mind that once a cluster is disabled, no cluster event will be handled by the controller at all. For example, if an instance goes offline and comes back while the cluster is disabled, Helix will not bring up any partitions on that instance.

This is somewhat coarse-grained. For that reason, we are going to introduce a new cluster mode, called "maintenance mode". Once a cluster is in maintenance mode, it will not actively move partitions across instances, i.e., it will not bootstrap new partitions onto any live instances. However, it will still maintain existing partitions in their desired states as best it can. For instance, if an instance comes back, Helix will still bring all existing partitions on that instance to their desired states. Another example: under maintenance mode, if only one replica of a given partition is left active in the cluster, Helix will not try to bring up additional replicas, but it will still transition the remaining replica to its desired state (for example, MASTER).

Once we have this maintenance mode support, we will put the cluster into maintenance mode during a disaster instead of totally disabling it, which leaves Helix more room to recover the cluster automatically afterwards.

This feature will be included in our next release (0.8.1), which should be out in a couple of weeks.

Lei

On Mon, Mar 19, 2018 at 4:07 PM, Bo Liu <[email protected]> wrote:

Just noticed that we have a cluster config "MAX_OFFLINE_INSTANCES_ALLOWED", which is used in
https://github.com/apache/helix/blob/master/helix-core/src/main/java/org/apache/helix/controller/stages/BestPossibleStateCalcStage.java#L70-L71

"If the offline/disabled instance number is above this threshold, the rebalancer will be paused."

I am wondering: does FULL_AUTO mode go through BestPossibleStateCalcStage? Will it help us in the case where a large portion of the cluster, or even the whole cluster, disconnects from ZK?

On Tue, Mar 6, 2018 at 10:51 PM, Bo Liu <[email protected]> wrote:

I agree semi-auto is a safer mode for stateful services, but we would have to compute the ideal state ourselves (either manually triggered or triggered by live-instance change events). That means we would need to implement logic for delayed shard moves and a shard placement algorithm. Not sure if there are any building blocks exposed by Helix that we could leverage in semi-auto mode.

On Tue, Mar 6, 2018 at 7:12 PM, kishore g <[email protected]> wrote:

This was one of the reasons we came up with the semi-auto mode. It's non-trivial to handle edge cases in full-auto mode, especially for stateful services. Having said that, let's see what we can do in catastrophic scenarios. Checking for changes in live instances is a good safeguard, but it's hard to compute reliably in some scenarios; for example, if the controllers also went down at the same time and came back up, they would have missed all the changes from ZK.

I think it's better to limit the number of changes a controller can trigger in the cluster. This is where throttling and constraints can be used. Helix already has the ability to limit the number of transitions in the cluster at once, but this caps the number of concurrent transitions, not the number of transitions triggered within a time period.
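For reference, that per-cluster cap is set through the constraints API, roughly like this (a sketch; the ZK address, cluster name, constraint id, and limit are placeholders, so please verify the attribute names against the constraints tutorial):

  import org.apache.helix.HelixAdmin;
  import org.apache.helix.manager.zk.ZKHelixAdmin;
  import org.apache.helix.model.ClusterConstraints.ConstraintType;
  import org.apache.helix.model.ConstraintItem;
  import org.apache.helix.model.builder.ConstraintItemBuilder;

  // Allow at most 10 state transitions in flight at once across the cluster.
  HelixAdmin admin = new ZKHelixAdmin("localhost:2181");
  ConstraintItem item = new ConstraintItemBuilder()
      .addConstraintAttribute("MESSAGE_TYPE", "STATE_TRANSITION")
      .addConstraintAttribute("CONSTRAINT_VALUE", "10")
      .build();
  admin.setConstraint("MyCluster", ConstraintType.MESSAGE_CONSTRAINT,
      "MaxTransitionsPerCluster", item);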
We can probably enhance this functionality to keep track of the number of transitions in the last X minutes and limit that number.

Any thoughts on that?

On Tue, Mar 6, 2018 at 4:30 PM, Bo Liu <[email protected]> wrote:

Hi,

We are using the delayed rebalancer to manage a Master-Slave cluster.
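Roughly, each resource is set up like this (a sketch from memory; the names, delay, and replica count are made up, and the exact IdealState setters may differ by Helix version):

  import org.apache.helix.HelixAdmin;
  import org.apache.helix.manager.zk.ZKHelixAdmin;
  import org.apache.helix.model.IdealState;

  // FULL_AUTO resource on the delayed rebalancer: tolerate up to 5 minutes
  // of instance downtime before moving shards, keeping at least 2 replicas
  // active per partition.
  HelixAdmin admin = new ZKHelixAdmin("localhost:2181");
  IdealState idealState = admin.getResourceIdealState("MyCluster", "myResource");
  idealState.setRebalanceMode(IdealState.RebalanceMode.FULL_AUTO);
  idealState.setRebalancerClassName(
      "org.apache.helix.controller.rebalancer.DelayedAutoRebalancer");
  idealState.setRebalanceDelay(300000);  // milliseconds
  idealState.setMinActiveReplicas(2);
  admin.setResourceIdealState("MyCluster", "myResource", idealState);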
In the event that a large portion of the cluster disconnects from ZK (a network partition, or a service crash due to a bug), the Helix controller will try hard to move shards to the rest of the cluster. This could make things worse if it is very expensive to rebuild a replica, or if no live replica is left in the rest of the cluster. What is the suggested way to handle this case? Is there a way to make the Helix controller pause when the change in live instances exceeds a threshold?

--
Best regards,
Bo