Re: protect a cluster during broad range outage

Bo Liu Tue, 06 Mar 2018 22:51:51 -0800

I agree semi-auto is a safer mode for stateful services. But we will have to
compute the ideal state ourselves (either manually triggered or triggered by
live instance change events). That means we need to implement logic for
delayed shard moves and a shard placement algorithm. Not sure if there are
any building blocks exposed by Helix that we could leverage for semi-auto
mode.
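
To make the "pause when too many live instances change" idea concrete, here
is a minimal sketch of the kind of guard we would have to write ourselves
before recomputing the ideal state. It is not a Helix API: the function name,
its parameters, and the 30% default threshold are all hypothetical. It also
assumes the previous live-instance snapshot is stored somewhere that survives
a controller restart; otherwise the baseline is lost in exactly the scenario
described in the quoted reply below.

```python
def should_pause_rebalance(previous_live, current_live, max_loss_fraction=0.3):
    """Return True when the share of instances lost since the last
    snapshot exceeds max_loss_fraction (hypothetical threshold).

    previous_live / current_live are iterables of instance names.
    """
    previous = set(previous_live)
    if not previous:
        # No baseline to compare against, so there is nothing to protect.
        return False
    lost = previous - set(current_live)
    return len(lost) / len(previous) > max_loss_fraction


if __name__ == "__main__":
    before = ["host1", "host2", "host3", "host4"]
    # Three of four hosts vanished at once: hold off on moving shards.
    print(should_pause_rebalance(before, ["host1"]))
    # A single host vanished: normal delayed rebalance can proceed.
    print(should_pause_rebalance(before, ["host1", "host2", "host3"]))
```

Newly joined instances are ignored on purpose; only departures count toward
the threshold.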


On Tue, Mar 6, 2018 at 7:12 PM, kishore g <[email protected]> wrote:

> This was one of the reasons we came up with the semi-auto mode. It's
> non-trivial to handle edge cases in full auto mode, especially for stateful
> services. Having said that, let's see what we can do in
> catastrophic scenarios. Checking for changes in the live instances is a
> good check, but it's hard to compute this reliably in some scenarios. For
> example, say the controllers also went down at the same time and came back
> up; they would have missed all the changes from ZK.
>
> I think it's better to limit the number of changes a controller would
> trigger in the cluster. This is where throttling and constraints can be
> used. Helix already has the ability to limit the number of transitions in
> the cluster at once. But this limits the number of concurrent transitions,
> not the number of transitions triggered in a time period.
>
> We can probably enhance this functionality to keep track of the number of
> transitions in the last X minutes and limit that number.
>
> Any thoughts on that?
>
> On Tue, Mar 6, 2018 at 4:30 PM, Bo Liu <[email protected]> wrote:
>
>> Hi,
>>
>> We are using the delayed rebalancer to manage a Master-Slave cluster.
>> In the event that a large portion of a cluster disconnects from ZK
>> (network partition, or service crash due to a bug), the Helix controller
>> will try hard to move shards to the rest of the cluster.
>> This could make things worse if it's very expensive to rebuild a
>> replica or there is no live replica left in the rest of the cluster.
>> I am wondering what the suggested way to handle this case is. Is there a
>> way to let the Helix controller pause when the change in live instances is
>> more than a threshold?
>>
>> --
>> Best regards,
>> Bo
>>
>>
>
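
The suggestion quoted above, limiting the number of transitions triggered in
the last X minutes and not only the number running concurrently, could look
roughly like the sliding-window counter below. This is only a sketch of the
idea, not existing Helix functionality: the class name, its parameters, and
the choice to pass the current time in explicitly are all assumptions made
for illustration.

```python
from collections import deque


class TransitionWindowLimiter:
    """Allow at most max_transitions within any window_seconds span.

    Hypothetical helper; timestamps are plain numbers in seconds.
    """

    def __init__(self, max_transitions, window_seconds):
        self.max_transitions = max_transitions
        self.window_seconds = window_seconds
        self._timestamps = deque()  # times of recently allowed transitions

    def try_acquire(self, now):
        """Record a transition at time `now` if the budget allows it."""
        cutoff = now - self.window_seconds
        # Forget transitions that have aged out of the window.
        while self._timestamps and self._timestamps[0] <= cutoff:
            self._timestamps.popleft()
        if len(self._timestamps) >= self.max_transitions:
            return False
        self._timestamps.append(now)
        return True


if __name__ == "__main__":
    limiter = TransitionWindowLimiter(max_transitions=2, window_seconds=60)
    for t in (0, 10, 20, 61):
        print(t, limiter.try_acquire(t))
```

Passing `now` in explicitly keeps the sketch deterministic and easy to test;
a real controller would more likely read a monotonic clock.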


-- 
Best regards,
Bo
