Great, thanks a lot! Will try it out tomorrow.

On Mon, Mar 19, 2018 at 7:30 PM, Lei Xia <[email protected]> wrote:

Here is the commit that adds maintenance mode support, if you are interested in the implementation:
https://github.com/apache/helix/commit/a7477c3bbc85059b2e522f5caa214c33eb4c3e15

On Mon, Mar 19, 2018 at 7:08 PM, Bo Liu <[email protected]> wrote:

Hi Lei,

I only saw the code that sets the maintenance flag. However, I don't see any code that reads this flag. Would you point me to the code that implements the maintenance mode logic?

Thanks,
Bo

On Mon, Mar 19, 2018, 18:48 kishore g <[email protected]> wrote:

Hi Lei,

qq. What if the cluster is being started for the first time? Will it get enabled only after the minimum number of nodes has started?

thanks

On Mon, Mar 19, 2018 at 6:42 PM, Lei Xia <[email protected]> wrote:

Actually, we already support maintenance mode in 0.8.0. My bad.

You can give it a try now: with "MAX_OFFLINE_INSTANCES_ALLOWED" set in ClusterConfig, once the number of offline instances reaches the limit, Helix will put the cluster into maintenance mode.

For now, you have to call HelixAdmin.enableMaintenanceMode() manually to exit maintenance mode. Support for automatically exiting maintenance mode is on our roadmap.
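For example, something along these lines (a rough sketch against the 0.8 Java API; the ZK address, cluster name, and threshold are placeholders, so please double-check the exact signatures in your version):

  import java.util.Collections;

  import org.apache.helix.HelixAdmin;
  import org.apache.helix.manager.zk.ZKHelixAdmin;
  import org.apache.helix.model.HelixConfigScope;
  import org.apache.helix.model.HelixConfigScope.ConfigScopeProperty;
  import org.apache.helix.model.builder.HelixConfigScopeBuilder;

  HelixAdmin admin = new ZKHelixAdmin("localhost:2181");

  // Ask Helix to enter maintenance mode automatically once 5 instances
  // are offline.
  HelixConfigScope scope = new HelixConfigScopeBuilder(ConfigScopeProperty.CLUSTER)
      .forCluster("MyCluster")
      .build();
  admin.setConfig(scope, Collections.singletonMap("MAX_OFFLINE_INSTANCES_ALLOWED", "5"));

  // Once the outage is over, exit maintenance mode manually.
  admin.enableMaintenanceMode("MyCluster", false);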
Lei

Lei Xia
Data Infra/Helix
[email protected]
www.linkedin.com/in/lxia1
------------------------------
From: Bo Liu <[email protected]>
Sent: Monday, March 19, 2018 6:33:10 PM
To: [email protected]
Subject: Re: protect a cluster during broad range outage

Hi Lei,

Thank you so much for the detailed information. We were about to implement our own logic to generate the ideal state to handle this disaster case (that's why we were looking at the code in BestPossibleStateCalcStage.java). Are you saying the pausing mode is already implemented in 0.8.0? I looked at the code in validateOfflineInstancesLimit(), which only sets the maintenance flag, not the pause flag, when MAX_OFFLINE_INSTANCES_ALLOWED is hit. Did I miss anything? If pausing mode is supported in 0.8.0, we'd like to try it out.

Thanks,
Bo

On Mon, Mar 19, 2018 at 6:18 PM, Lei Xia <[email protected]> wrote:

Sorry, I totally missed this email thread.

Yes, we do have such a feature in 0.8 to protect the cluster in case of disaster. A new config option "MAX_OFFLINE_INSTANCES_ALLOWED" can be set in ClusterConfig. If it is set and the number of offline instances in the cluster reaches the configured limit, Helix will automatically pause (disable) the cluster, i.e., Helix will not react to any cluster changes anymore. You have to manually re-enable the cluster (via HelixAdmin.enableCluster()) though.
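For example (again just a sketch; the ZK address and cluster name are placeholders):

  import org.apache.helix.HelixAdmin;
  import org.apache.helix.manager.zk.ZKHelixAdmin;

  HelixAdmin admin = new ZKHelixAdmin("localhost:2181");
  // true re-enables a cluster that Helix auto-paused; false pauses it.
  admin.enableCluster("MyCluster", true);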
Keep in mind that once a cluster is disabled, no cluster event will be handled by the controller at all. For example, if an instance goes offline and comes back while the cluster is disabled, Helix will not bring up any partitions on that instance.

This is somewhat coarse-grained. For that reason, we are going to introduce a new cluster mode, called "maintenance mode". Once a cluster is in maintenance mode, it will not actively move partitions across instances, i.e., it will not bootstrap new partitions onto any live instances. However, it will still maintain existing partitions in their desired states as best it can. For instance, if an instance comes back, Helix will still bring all existing partitions on that instance to their desired states. Another example: under maintenance mode, if only one replica of a given partition is left active in the cluster, Helix will not try to bring up additional replicas, but it will still transition the remaining replica to its desired state (for example, MASTER).

Once we have this maintenance mode support, we will put the cluster into maintenance mode during a disaster instead of totally disabling it, which leaves Helix more room to recover the cluster automatically afterwards.

This feature will be included in our next release (0.8.1), which should be out in a couple of weeks.

Lei

On Mon, Mar 19, 2018 at 4:07 PM, Bo Liu <[email protected]> wrote:

Just noticed that we have a cluster config "MAX_OFFLINE_INSTANCES_ALLOWED", which is used in
https://github.com/apache/helix/blob/master/helix-core/src/main/java/org/apache/helix/controller/stages/BestPossibleStateCalcStage.java#L70-L71

"If the offline/disabled instance number is above this threshold, the rebalancer will be paused."

I am wondering: does FULL_AUTO mode go through BestPossibleStateCalcStage? Will it help us in the case where a large portion of the cluster, or even the whole cluster, disconnects from ZK?

On Tue, Mar 6, 2018 at 10:51 PM, Bo Liu <[email protected]> wrote:

I agree semi-auto is a safer mode for stateful services, but we would have to compute the ideal state ourselves (either manually triggered or triggered by live-instance change events). That means we would need to implement logic for delayed shard moves and a shard placement algorithm. Not sure if there are any building blocks exposed by Helix that we could leverage in semi-auto mode.

On Tue, Mar 6, 2018 at 7:12 PM, kishore g <[email protected]> wrote:

This was one of the reasons we came up with the semi-auto mode. It's non-trivial to handle edge cases in full-auto mode, especially for stateful services. Having said that, let's see what we can do in catastrophic scenarios. Checking for changes in live instances is a good safeguard, but it's hard to compute reliably in some scenarios; for example, if the controllers also went down at the same time and came back up, they would have missed all the changes from ZK.

I think it's better to limit the number of changes a controller can trigger in the cluster. This is where throttling and constraints can be used. Helix already has the ability to limit the number of transitions in the cluster at once, but this caps the number of concurrent transitions, not the number of transitions triggered within a time period.
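For reference, that per-cluster cap is set through the constraints API, roughly like this (a sketch; the ZK address, cluster name, constraint id, and limit are placeholders, so please verify the attribute names against the constraints tutorial):

  import org.apache.helix.HelixAdmin;
  import org.apache.helix.manager.zk.ZKHelixAdmin;
  import org.apache.helix.model.ClusterConstraints.ConstraintType;
  import org.apache.helix.model.ConstraintItem;
  import org.apache.helix.model.builder.ConstraintItemBuilder;

  // Allow at most 10 state transitions in flight at once across the cluster.
  HelixAdmin admin = new ZKHelixAdmin("localhost:2181");
  ConstraintItem item = new ConstraintItemBuilder()
      .addConstraintAttribute("MESSAGE_TYPE", "STATE_TRANSITION")
      .addConstraintAttribute("CONSTRAINT_VALUE", "10")
      .build();
  admin.setConstraint("MyCluster", ConstraintType.MESSAGE_CONSTRAINT,
      "MaxTransitionsPerCluster", item);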
We can probably enhance this functionality to keep track of the number of transitions in the last X minutes and limit that number.

Any thoughts on that?

On Tue, Mar 6, 2018 at 4:30 PM, Bo Liu <[email protected]> wrote:

Hi,

We are using the delayed rebalancer to manage a Master-Slave cluster.
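Roughly, each resource is set up like this (a sketch from memory; the names, delay, and replica count are made up, and the exact IdealState setters may differ by Helix version):

  import org.apache.helix.HelixAdmin;
  import org.apache.helix.manager.zk.ZKHelixAdmin;
  import org.apache.helix.model.IdealState;

  // FULL_AUTO resource on the delayed rebalancer: tolerate up to 5 minutes
  // of instance downtime before moving shards, keeping at least 2 replicas
  // active per partition.
  HelixAdmin admin = new ZKHelixAdmin("localhost:2181");
  IdealState idealState = admin.getResourceIdealState("MyCluster", "myResource");
  idealState.setRebalanceMode(IdealState.RebalanceMode.FULL_AUTO);
  idealState.setRebalancerClassName(
      "org.apache.helix.controller.rebalancer.DelayedAutoRebalancer");
  idealState.setRebalanceDelay(300000);  // milliseconds
  idealState.setMinActiveReplicas(2);
  admin.setResourceIdealState("MyCluster", "myResource", idealState);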
In the event that a large portion of the cluster disconnects from ZK (a network partition, or a service crash due to a bug), the Helix controller will try hard to move shards to the rest of the cluster. This could make things worse if it is very expensive to rebuild a replica, or if no live replica is left in the rest of the cluster. What is the suggested way to handle this case? Is there a way to make the Helix controller pause when the change in live instances exceeds a threshold?

--
Best regards,
Bo