Re: protect a cluster during broad range outage

Bo Liu Mon, 19 Mar 2018 16:09:04 -0700

Just noticed that we have a cluster config "MAX_OFFLINE_INSTANCES_ALLOWED",
which is used in
https://github.com/apache/helix/blob/master/helix-core/src/main/java/org/apache/helix/controller/stages/BestPossibleStateCalcStage.java#L70-L71


"If the offline/disabled instance number is above this threshold, the
rebalancer will be paused."

I am wondering whether FULL_AUTO mode runs BestPossibleStateCalcStage?
Will it help us with the case where a large portion of the cluster, or even
the whole cluster, disconnects from ZK?
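
To make the quoted behaviour concrete, the check can be sketched roughly as
below. This is only an illustration of "pause when the offline/disabled count
is above a threshold"; it is not the Helix implementation. The class name
OfflineThresholdGuard, its method, and the rule that a negative threshold
disables the check are all assumptions made for the example.

```java
import java.util.Set;

// Hypothetical helper, NOT part of Helix: illustrates the documented idea
// behind MAX_OFFLINE_INSTANCES_ALLOWED with plain Java collections.
final class OfflineThresholdGuard {
    // Assumption for this sketch: a negative value means the check is off.
    private final int maxOfflineInstancesAllowed;

    OfflineThresholdGuard(int maxOfflineInstancesAllowed) {
        this.maxOfflineInstancesAllowed = maxOfflineInstancesAllowed;
    }

    // Returns true when rebalancing should be paused, i.e. when the number
    // of configured instances that are not live-and-enabled is strictly
    // above the threshold.
    boolean shouldPauseRebalance(Set<String> allInstances,
                                 Set<String> liveEnabledInstances) {
        if (maxOfflineInstancesAllowed < 0) {
            return false;
        }
        int offlineOrDisabled = 0;
        for (String instance : allInstances) {
            if (!liveEnabledInstances.contains(instance)) {
                offlineOrDisabled++;
            }
        }
        return offlineOrDisabled > maxOfflineInstancesAllowed;
    }
}
```

With a threshold of 1 and four configured instances, losing one instance
would still allow rebalancing, while losing two or more would pause it.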




On Tue, Mar 6, 2018 at 10:51 PM, Bo Liu <[email protected]> wrote:

> I agree semi-auto is a safer mode for stateful services. But we will have
> to compute the ideal state ourselves (either manually triggered or triggered
> by live instance change events). That means we need to implement logic for
> delayed shard moves and a shard placement algorithm. Not sure if there are
> any building blocks exposed by Helix that we could leverage for semi-auto
> mode.
>
> On Tue, Mar 6, 2018 at 7:12 PM, kishore g <[email protected]> wrote:
>
>> This was one of the reasons we came up with the semi-auto mode. It's
>> non-trivial to handle edge cases in full-auto mode, especially for stateful
>> services. Having said that, let's see what we can do in
>> catastrophic scenarios. Checking the live instance changes is a
>> good check, but it's hard to compute this reliably in some scenarios. For
>> example, if the controllers also went down at the same time and came back
>> up, they would have missed all the changes from ZK.
>>
>> I think it's better to limit the number of changes a controller would
>> trigger in the cluster. This is where throttling and constraints can be
>> used. Helix already has the ability to limit the number of transitions in
>> the cluster at once. But this limits the number of concurrent transitions,
>> not the number of transitions triggered in a time period.
>>
>> We can probably enhance this functionality to keep track of the number of
>> transitions in the last X minutes and limit that number.
>>
>> Any thoughts on that?
>>
>>
>> On Tue, Mar 6, 2018 at 4:30 PM, Bo Liu <[email protected]> wrote:
>>
>>> Hi,
>>>
>>> We are using the delayed rebalancer to manage a Master-Slave cluster.
>>> When a large portion of the cluster disconnects from ZK
>>> (network partition, or service crash due to a bug), the Helix controller
>>> will try hard to move shards to the rest of the cluster.
>>> This could make things worse if it's very expensive to rebuild a
>>> replica or there is no live replica left in the rest of the cluster.
>>> I am wondering what the suggested way to handle this case is. Is there a
>>> way to let the Helix controller pause when the change in live instances is
>>> more than a threshold?
>>>
>>> --
>>> Best regards,
>>> Bo
>>>
>>>
>>
>
>
> --
> Best regards,
> Bo
>
>


-- 
Best regards,
Bo
