Thanks for the quick response, Kishore. This issue is definitely tied to the condition that partitions * replicas < NODE_COUNT. If all running nodes have a "piece" of the resource, then they behave well when the LEADER node goes away.
Is it possible to use Helix to manage a set of resources where that condition is true? I.e. where the *total* number of partitions/replicas in the cluster is greater than the node count, but each individual resource has a small number of partitions/replicas. (Calling rebalance on every liveInstance change does not seem like a good solution, because you would have to iterate through all resources in the cluster and rebalance each individually; a rough sketch of that loop is at the bottom of this mail.)

On Wed, Oct 19, 2016 at 12:52 PM, kishore g <[email protected]> wrote:

> I think this might be a corner case when partitions * replicas <
> TOTAL_NUMBER_OF_NODES. Can you try with many partitions and replicas and
> check if the issue still exists.
>
> On Wed, Oct 19, 2016 at 11:53 AM, Michael Craig <[email protected]> wrote:
>
>> I've noticed that partitions/replicas assigned to disconnected instances
>> are not automatically redistributed to live instances. What's the correct
>> way to do this?
>>
>> For example, given this setup with Helix 0.6.5:
>> - 1 resource
>> - 2 replicas
>> - LeaderStandby state model
>> - FULL_AUTO rebalance mode
>> - 3 nodes (N1 is Leader, N2 is Standby, N3 is just sitting)
>>
>> Then drop N1:
>> - N2 becomes LEADER
>> - Nothing happens to N3
>>
>> Naively, I would have expected N3 to transition from Offline to Standby,
>> but that doesn't happen.
>>
>> I can force redistribution from GenericHelixController#onLiveInstanceChange by
>> - dropping non-live instances from the cluster
>> - calling rebalance
>>
>> The instance dropping seems pretty unsafe! Is there a better way?
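To make that concrete, here is roughly the loop I'd rather avoid, written against the 0.6.x HelixAdmin / LiveInstanceChangeListener APIs. The ZooKeeper address, cluster name, and spectator instance name below are placeholders, and this sketch only re-runs the rebalancer for each resource; it does not drop instances:

import java.util.List;

import org.apache.helix.HelixAdmin;
import org.apache.helix.HelixManager;
import org.apache.helix.HelixManagerFactory;
import org.apache.helix.InstanceType;
import org.apache.helix.LiveInstanceChangeListener;
import org.apache.helix.NotificationContext;
import org.apache.helix.manager.zk.ZKHelixAdmin;
import org.apache.helix.model.IdealState;
import org.apache.helix.model.LiveInstance;

public class RebalanceAllOnLiveInstanceChange implements LiveInstanceChangeListener {
  // Placeholder coordinates -- substitute your own ZK address and cluster name.
  private static final String ZK_ADDR = "localhost:2181";
  private static final String CLUSTER = "MY_CLUSTER";

  private final HelixAdmin admin = new ZKHelixAdmin(ZK_ADDR);

  @Override
  public void onLiveInstanceChange(List<LiveInstance> liveInstances,
                                   NotificationContext changeContext) {
    // Re-run the rebalancer for every resource with its current replica
    // count. This is the per-resource iteration I'd like to avoid.
    for (String resource : admin.getResourcesInCluster(CLUSTER)) {
      IdealState idealState = admin.getResourceIdealState(CLUSTER, resource);
      String replicas = idealState.getReplicas();
      if (replicas == null) {
        continue; // skip resources without a replica count
      }
      try {
        admin.rebalance(CLUSTER, resource, Integer.parseInt(replicas));
      } catch (NumberFormatException e) {
        // e.g. replicas set to ANY_LIVEINSTANCE; leave those alone
      }
    }
  }

  public static void main(String[] args) throws Exception {
    // Watch the cluster as a spectator so the callback fires on every
    // liveInstance change.
    HelixManager manager = HelixManagerFactory.getZKHelixManager(
        CLUSTER, "rebalance-watcher", InstanceType.SPECTATOR, ZK_ADDR);
    manager.connect();
    manager.addLiveInstanceChangeListener(new RebalanceAllOnLiveInstanceChange());
    Thread.currentThread().join(); // keep the watcher alive
  }
}

Even if that behaves, it still touches every resource in the cluster on every liveInstance change, which is what I'm hoping Helix can handle for me.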
