Re: protect a cluster during broad range outage

2018-03-19 Thread Bo Liu
Great, thanks a lot!
Will try it out tomorrow.


Re: protect a cluster during broad range outage

2018-03-19 Thread Lei Xia
Here is the commit that adds maintenance mode support if you are interested
in the implementation:
https://github.com/apache/helix/commit/a7477c3bbc85059b2e522f5caa214c33eb4c3e15

On Mon, Mar 19, 2018 at 7:08 PM, Bo Liu <newpoo@gmail.com> wrote:

> Hi Lei,
>
> I only saw the code that set the maintenance flag. However, I don't see
> any code that reads this flag. Would you point me to the code that
> implements the maintenance mode logic?
>
> Thanks,
> Bo
>
> On Mon, Mar 19, 2018, 18:48 kishore g <g.kish...@gmail.com> wrote:
>
>> Hi Lei,
>>
>> qq. What if the cluster was getting started for the first time. Will it
>> get enabled only after min nodes are started?
>>
>> thanks
>>
>> On Mon, Mar 19, 2018 at 6:42 PM, Lei Xia <l...@linkedin.com> wrote:
>>
>>> Actually we already supported maintenance mode in 0.8.0.  My bad.
>>>
>>>
>>> You can give it a try now,  with "MAX_OFFLINE_INSTANCES_ALLOWED" set in
>>> ClusterConfig, once the # of offline instance reaches to the limit, Helix
>>> will put the cluster into maintenance mode.
>>>
>>>
>>> For now, you have to call HelixAdmin.enableMaintenanceMode() manually
>>> to exit the maintenance mode.  Support of auto existing maintenance mode is
>>> on our road-map.
>>>
>>>
>>>
>>>
>>>
>>> Lei
>>>
>>>
>>>
>>>
>>> *Lei Xia*
>>>
>>>
>>> Data Infra/Helix
>>>
>>> l...@linkedin.com
>>> www.linkedin.com/in/lxia1
>>> --
>>> *From:* Bo Liu <newpoo@gmail.com>
>>> *Sent:* Monday, March 19, 2018 6:33:10 PM
>>> *To:* user@helix.apache.org
>>> *Subject:* Re: protect a cluster during broad range outage
>>>
>>> Hi Lei,
>>>
>>> Thank you so much for the detailed information.
>>> We were almost about to implement our own logic to generate ideal state
>>> to handle this disaster case (That's why we were looking at the code in
>>> BestPossibleStateCalcStage.java).
>>> Are you saying the pausing mode is already implemented in 0.8.0?
>>> I looked at the code in validateOfflineInstancesLimit(), which only set
>>> the maintenance flag not the pause flag when MAX_OFFLINE_INSTANCES_ALLOWED
>>> is hit. Did I miss anything?
>>> If pausing mode is supported in 0.8.0, we'd like to try it out.
>>>
>>> Thanks,
>>> Bo
>>>
>>> On Mon, Mar 19, 2018 at 6:18 PM, Lei Xia <l...@apache.org> wrote:
>>>
>>> Sorry, I totally missed this email thread.
>>>
>>> Yes, we do have such feature in 0.8 to protect the cluster in case of
>>> disasters happening.  A new config option "MAX_OFFLINE_INSTANCES_ALLOWED"
>>> can be set in ClusterConfig.  If it is set, and the number of offline
>>> instances reach to the set limit in the cluster, Helix will automatically
>>> pause (disable) the cluster, i.e, Helix will not react to any cluster
>>> changes anymore.  You have to manually re-enable cluster (via
>>> HelixAdmin.enableCluster()) though.
>>>
>>> Keep in mind, once a cluster is disabled, no cluster event will be
>>> handled at all by the controller. For example, if an instance went offline
>>> and came back, Helix will not bring up any partitions on that instance if
>>> the cluster is disabled.
>>>
>>> This is somewhat a little coarse-grained.  For that reason, we are going
>>> to introduce a new cluster mode, called "Maintenance mode".  Once a cluster
>>> is in maintenance mode, it will not actively move partitions across
>>> instances, i.e, it will not bootstrap new partitions to any live instances.
>>> However, it will still maintain existing partitions to its desired states
>>> as it can. For instance, if an instance comes back, Helix will still bring
>>> all existing partitions on this instance to its desired states.  Another
>>> example is, under maintenance mode, if there are only 1 replica for a given
>>> partition left active in the cluster, Helix will not try to bring
>>> additional new replicas, but Helix will still transition the only replica
>>> to its desired state (for example, master).
>>>
>>> Once we have this "Maintenance mode" support, we will put the cluster
>>> into maintenance mode during disaster, instead of totally disabling it,
>>> which leaves more automation here for Helix to r

Re: protect a cluster during broad range outage

2018-03-19 Thread kishore g
Hi Lei,

qq: What if the cluster is being started for the first time? Will it get
enabled only after the minimum number of nodes has started?

thanks


Re: protect a cluster during broad range outage

2018-03-19 Thread Lei Xia
Actually, we already support maintenance mode in 0.8.0. My bad.


You can give it a try now: with "MAX_OFFLINE_INSTANCES_ALLOWED" set in
ClusterConfig, once the number of offline instances reaches the limit,
Helix will put the cluster into maintenance mode.


For now, you have to call HelixAdmin.enableMaintenanceMode() manually to
exit maintenance mode. Support for automatically exiting maintenance mode
is on our roadmap.
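
For reference, here is a minimal sketch of wiring this up through the Java
API (the ZK address, cluster name, and threshold are placeholders, and the
method names are assumed from the 0.8.x API described above):

    import java.util.Collections;
    import org.apache.helix.HelixAdmin;
    import org.apache.helix.manager.zk.ZKHelixAdmin;
    import org.apache.helix.model.HelixConfigScope;
    import org.apache.helix.model.HelixConfigScope.ConfigScopeProperty;
    import org.apache.helix.model.builder.HelixConfigScopeBuilder;

    public class MaintenanceModeSetup {
      public static void main(String[] args) {
        HelixAdmin admin = new ZKHelixAdmin("zk-host:2181");

        // Enter maintenance mode automatically once 3+ instances are offline.
        HelixConfigScope scope =
            new HelixConfigScopeBuilder(ConfigScopeProperty.CLUSTER)
                .forCluster("MyCluster").build();
        admin.setConfig(scope,
            Collections.singletonMap("MAX_OFFLINE_INSTANCES_ALLOWED", "3"));

        // Exiting maintenance mode is manual for now, as noted above.
        admin.enableMaintenanceMode("MyCluster", false);
      }
    }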





Lei

Lei Xia
Data Infra/Helix
l...@linkedin.com
www.linkedin.com/in/lxia1

Re: protect a cluster during broad range outage

2018-03-19 Thread Bo Liu
Hi Lei,

Thank you so much for the detailed information.
We were about to implement our own logic to generate the ideal state to
handle this disaster case (that's why we were looking at the code in
BestPossibleStateCalcStage.java).
Are you saying the pausing mode is already implemented in 0.8.0?
I looked at the code in validateOfflineInstancesLimit(), which only sets
the maintenance flag, not the pause flag, when
MAX_OFFLINE_INSTANCES_ALLOWED is hit. Did I miss anything?
If pausing mode is supported in 0.8.0, we'd like to try it out.

Thanks,
Bo


Re: protect a cluster during broad range outage

2018-03-19 Thread Lei Xia
Sorry, I totally missed this email thread.

Yes, we do have such a feature in 0.8 to protect the cluster in case of
disaster. A new config option, "MAX_OFFLINE_INSTANCES_ALLOWED", can be set
in ClusterConfig. If it is set and the number of offline instances reaches
the set limit, Helix will automatically pause (disable) the cluster, i.e.,
Helix will not react to any cluster changes anymore. You have to manually
re-enable the cluster (via HelixAdmin.enableCluster()) though.
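
To make the manual step concrete, here is a sketch of re-enabling a paused
cluster (the ZK address and cluster name are placeholders):

    import org.apache.helix.HelixAdmin;
    import org.apache.helix.manager.zk.ZKHelixAdmin;

    public class ReEnableCluster {
      public static void main(String[] args) {
        HelixAdmin admin = new ZKHelixAdmin("zk-host:2181");
        // Re-enable the cluster once enough instances are back online.
        admin.enableCluster("MyCluster", true);
      }
    }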

Keep in mind that once a cluster is disabled, no cluster event will be
handled at all by the controller. For example, if an instance went offline
and came back, Helix will not bring up any partitions on that instance
while the cluster is disabled.

This is somewhat coarse-grained. For that reason, we are going to introduce
a new cluster mode, called "maintenance mode". Once a cluster is in
maintenance mode, it will not actively move partitions across instances,
i.e., it will not bootstrap new partitions onto any live instances.
However, it will still maintain existing partitions in their desired states
as best it can. For instance, if an instance comes back, Helix will still
bring all existing partitions on that instance to their desired states.
Another example: under maintenance mode, if there is only one replica of a
given partition left active in the cluster, Helix will not try to bring up
additional replicas, but it will still transition the remaining replica to
its desired state (for example, master).

Once we have this maintenance-mode support, we will put the cluster into
maintenance mode during a disaster instead of totally disabling it, which
leaves Helix more room to recover the cluster automatically afterwards.

This feature will be included in our next release (0.8.1), which should be
out in a couple of weeks.



Lei



Re: protect a cluster during broad range outage

2018-03-19 Thread Bo Liu
Just noticed that we have a cluster config "MAX_OFFLINE_INSTANCES_ALLOWED",
which is used in
https://github.com/apache/helix/blob/master/helix-core/src/main/java/org/apache/helix/controller/stages/BestPossibleStateCalcStage.java#L70-L71

"If the offline/disabled instance number is above this threshold, the
rebalancer will be paused."

I am wondering whether the FULL_AUTO mode goes through
BestPossibleStateCalcStage. Will it help us in the case where a large
portion of the cluster, or even the whole cluster, disconnects from ZK?




-- 
Best regards,
Bo


Re: protect a cluster during broad range outage

2018-03-06 Thread Bo Liu
I agree semi-auto is a safer mode for stateful services. But we will have
to compute the ideal state ourselves (either manually triggered or
triggered by live-instance change events). That means we need to implement
logic for delayed shard moves and a shard placement algorithm. Not sure if
there are any building blocks exposed by Helix that we could leverage for
semi-auto mode; a rough sketch of what this involves is below.
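
To illustrate what owning the placement means, here is a rough sketch of
pushing a self-computed preference list into a SEMI_AUTO ideal state (the
cluster, resource, partition, and instance names are made up; in semi-auto
mode the head of each preference list typically becomes the master):

    import java.util.Arrays;
    import org.apache.helix.HelixAdmin;
    import org.apache.helix.manager.zk.ZKHelixAdmin;
    import org.apache.helix.model.IdealState;

    public class SemiAutoPlacement {
      public static void main(String[] args) {
        HelixAdmin admin = new ZKHelixAdmin("zk-host:2181");
        IdealState idealState =
            admin.getResourceIdealState("MyCluster", "myResource");
        idealState.setRebalanceMode(IdealState.RebalanceMode.SEMI_AUTO);
        // Our own placement logic decides this ordering; Helix only runs
        // the state transitions implied by it.
        idealState.setPreferenceList("myResource_0",
            Arrays.asList("host1_12000", "host2_12000"));
        admin.setResourceIdealState("MyCluster", "myResource", idealState);
      }
    }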

-- 
Best regards,
Bo


Re: protect a cluster during broad range outage

2018-03-06 Thread kishore g
This was one of the reasons we came up with the semi-auto mode. It's
non-trivial to handle edge cases in full-auto mode, especially for stateful
services. Having said that, let's see what we can do in catastrophic
scenarios. Having a check on live-instance changes is a good safeguard,
but it's hard to compute this reliably in some scenarios; for example, if
the controllers also went down at the same time and came back up, they
would have missed all the changes from ZK.

I think it's better to limit the number of changes a controller would
trigger in the cluster. This is where throttling and constraints can be
used. Helix already has the ability to limit the number of transitions in
the cluster at once, but this limits the number of concurrent transitions,
not the number of transitions triggered in a time period.

We can probably enhance this functionality to keep track of the number of
transitions in the last X minutes and limit that number; a sketch of the
existing knob follows.
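
For the existing knob, here is a hedged sketch of capping concurrent state
transitions with a message constraint (the cluster name, constraint id, and
limit are placeholders; as noted, this caps concurrency, not transitions
per time window):

    import org.apache.helix.HelixAdmin;
    import org.apache.helix.manager.zk.ZKHelixAdmin;
    import org.apache.helix.model.ClusterConstraints.ConstraintType;
    import org.apache.helix.model.ConstraintItem;
    import org.apache.helix.model.builder.ConstraintItemBuilder;

    public class TransitionThrottle {
      public static void main(String[] args) {
        HelixAdmin admin = new ZKHelixAdmin("zk-host:2181");
        // At most 10 state-transition messages in flight per instance.
        ConstraintItem item = new ConstraintItemBuilder()
            .addConstraintAttribute("MESSAGE_TYPE", "STATE_TRANSITION")
            .addConstraintAttribute("INSTANCE", ".*")
            .addConstraintAttribute("CONSTRAINT_VALUE", "10")
            .build();
        admin.setConstraint("MyCluster", ConstraintType.MESSAGE_CONSTRAINT,
            "maxTransitionsPerInstance", item);
      }
    }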

Any thoughts on that?

On Tue, Mar 6, 2018 at 4:30 PM, Bo Liu  wrote:

> Hi,
>
> We are using the delayed rebalancer to manage a Master-Slave cluster.
> In the event that a large portion of the cluster disconnects from ZK
> (network partition, or a service crash due to a bug), the Helix
> controller will try hard to move shards to the rest of the cluster.
> This could make things worse if it is very expensive to rebuild a
> replica or there is no live replica left in the rest of the cluster.
> I am wondering what the suggested way to handle this case is. Is there
> a way to make the Helix controller pause when the change in live
> instances exceeds a threshold?
>
> --
> Best regards,
> Bo