Hi, Michael
To answer your questions:
- Should you have to `rebalance` a resource when adding a new node to
the cluster?
*No if you are using FULL_AUTO rebalance mode; yes if you are in
SEMI_AUTO rebalance mode.*
- Should you have to `rebalance` when a node is dropped?
*Same answer: no, you do not need to in FULL_AUTO mode. In FULL_AUTO mode,
Helix is supposed to detect node add/delete/online/offline events and
rebalance the resource automatically.*
The problem you saw happened because your resource was created in SEMI_AUTO
mode instead of FULL_AUTO mode. HelixAdmin.addResource() creates the
resource in SEMI_AUTO mode by default if you do not specify a rebalance
mode explicitly. Please see my comments below on how to fix it.
static void addResource() throws Exception {
  echo("Adding resource " + RESOURCE_NAME);
  // ADMIN.addResource(CLUSTER_NAME, RESOURCE_NAME, NUM_PARTITIONS, STATE_MODEL_NAME);
  // ==> specify the rebalance mode explicitly
  //     (RebalanceMode is org.apache.helix.model.IdealState.RebalanceMode):
  ADMIN.addResource(CLUSTER_NAME, RESOURCE_NAME, NUM_PARTITIONS,
      STATE_MODEL_NAME, RebalanceMode.FULL_AUTO.toString());

  echo("Rebalancing resource " + RESOURCE_NAME);
  // This only needs to be called once, right after the resource is created;
  // there is no need to call it again when nodes join or leave.
  ADMIN.rebalance(CLUSTER_NAME, RESOURCE_NAME, NUM_REPLICAS);
}
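
If you want to double check after making the change, you can read the
IdealState back. A quick sketch against the 0.6.x API, reusing the same
ADMIN/CLUSTER_NAME/RESOURCE_NAME constants as above:

  // IdealState is org.apache.helix.model.IdealState
  IdealState idealState = ADMIN.getResourceIdealState(CLUSTER_NAME, RESOURCE_NAME);
  // should print FULL_AUTO once the resource is created with the explicit mode
  System.out.println(idealState.getRebalanceMode());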
Please give it a try and let me know whether it works. Thanks!
Lei
On Wed, Oct 19, 2016 at 11:52 PM, Michael Craig <[email protected]> wrote:
> Here is some repro code for the "drop a node, resource is not redistributed"
> case I described:
> https://gist.github.com/mkscrg/bcb2ab1dd1b3e84ac93e7ca16e2824f8
>
> Can we answer these 2 questions? That would help clarify things:
>
> - Should you have to `rebalance` a resource when adding a new node to
>   the cluster?
>   - If no, this is an easy bug to reproduce. The example code
>     <https://github.com/apache/helix/blob/helix-0.6.x/helix-core/src/main/java/org/apache/helix/examples/Quickstart.java#L198>
>     calls rebalance after adding a node, and it breaks if you comment out
>     that line.
>   - If yes, what is the correct way to manage many resources on a
>     cluster? Iterate through all resources and rebalance them for every
>     new node?
> - Should you have to `rebalance` when a node is dropped?
>   - If no, there is a bug. See the repro code posted above.
>   - If yes, we are in the same rebalance-every-resource situation as
>     above.
>
> My use case is to manage a set of ad-hoc tasks across a cluster of
> machines. Each task would be a separate resource with a unique name, with 1
> partition and 1 replica. Each resource would reside on exactly 1 node, and
> there is no limit on the number of resources per node.
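>
> Roughly, per task, the registration I have in mind looks like this (just a
> sketch with placeholder names, not my actual code):
>
>   HelixAdmin admin = new ZKHelixAdmin(ZK_ADDR);  // placeholder ZK address
>   // one resource per task: 1 partition, 1 replica, LeaderStandby
>   admin.addResource(CLUSTER_NAME, "task-" + taskId, 1, "LeaderStandby");
>   admin.rebalance(CLUSTER_NAME, "task-" + taskId, 1);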
>
> On Wed, Oct 19, 2016 at 9:23 PM, Lei Xia <[email protected]> wrote:
>
>> Hi, Michael
>>
>> Could you be more specific about the issue you are seeing? In particular:
>> 1) For 1 resource and 2 replicas, you mean the resource has only 1
>> partition, with the replica count equal to 2, right?
>> 2) You see *REBALANCE_MODE="FULL_AUTO"*, not *IDEALSTATE_MODE="AUTO"*, in
>> your idealState, right?
>> 3) By dropping N1, you mean disconnecting N1 from Helix/ZooKeeper, so N1
>> is no longer in liveInstances, right?
>>
>> If your answers to all of the above questions are yes, then there may be
>> a bug here. If possible, please paste your idealState and your test code
>> (if there is any) here, and I will try to reproduce and debug it. Thanks
>>
>>
>> Lei
>>
>> On Wed, Oct 19, 2016 at 9:02 PM, kishore g <[email protected]> wrote:
>>
>>> Can you describe your scenario in detail and the expected behavior? I
>>> agree that calling rebalance on every live-instance change is ugly and
>>> definitely not as per the design. It was an oversight (we focused a lot
>>> on clusters with large numbers of partitions and failed to handle this
>>> simple case).
>>>
>>> Please file a JIRA and we will work on it. Lei, do you think the recent
>>> bug fix in the AutoRebalancer will handle this case?
>>>
>>> thanks,
>>> Kishore G
>>>
>>> On Wed, Oct 19, 2016 at 8:55 PM, Michael Craig <[email protected]> wrote:
>>>
>>>> Thanks for the quick response, Kishore. This issue is definitely tied to
>>>> the condition that partitions * replicas < NODE_COUNT.
>>>> If all running nodes have a "piece" of the resource, then they behave
>>>> well when the LEADER node goes away.
>>>>
>>>> Is it possible to use Helix to manage a set of resources where that
>>>> condition is true? That is, where the *total* number of
>>>> partitions/replicas in the cluster is greater than the node count, but
>>>> each individual resource has a small number of partitions/replicas.
>>>>
>>>> (Calling rebalance on every liveInstance change does not seem like a
>>>> good solution, because you would have to iterate through all resources in
>>>> the cluster and rebalance each individually.)
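>>>>
>>>> Just to spell out why that feels heavy: on every live-instance change it
>>>> would be something like this (a sketch with placeholder constants):
>>>>
>>>>   HelixAdmin admin = new ZKHelixAdmin(ZK_ADDR);  // placeholder ZK address
>>>>   for (String resource : admin.getResourcesInCluster(CLUSTER_NAME)) {
>>>>     admin.rebalance(CLUSTER_NAME, resource, NUM_REPLICAS);  // replicas per resource
>>>>   }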
>>>>
>>>>
>>>> On Wed, Oct 19, 2016 at 12:52 PM, kishore g <[email protected]>
>>>> wrote:
>>>>
>>>>> I think this might be a corner case when partitions * replicas <
>>>>> TOTAL_NUMBER_OF_NODES. Can you try with more partitions and replicas and
>>>>> check whether the issue still exists?
>>>>>
>>>>>
>>>>>
>>>>> On Wed, Oct 19, 2016 at 11:53 AM, Michael Craig <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> I've noticed that partitions/replicas assigned to disconnected
>>>>>> instances are not automatically redistributed to live instances. What's
>>>>>> the
>>>>>> correct way to do this?
>>>>>>
>>>>>> For example, given this setup with Helix 0.6.5:
>>>>>> - 1 resource
>>>>>> - 2 replicas
>>>>>> - LeaderStandby state model
>>>>>> - FULL_AUTO rebalance mode
>>>>>> - 3 nodes (N1 is Leader, N2 is Standby, N3 is just sitting)
>>>>>>
>>>>>> Then drop N1:
>>>>>> - N2 becomes LEADER
>>>>>> - Nothing happens to N3
>>>>>>
>>>>>> Naively, I would have expected N3 to transition from Offline to
>>>>>> Standby, but that doesn't happen.
>>>>>>
>>>>>> I can force redistribution from
>>>>>> GenericHelixController#onLiveInstanceChange by
>>>>>> - dropping non-live instances from the cluster
>>>>>> - calling rebalance
>>>>>>
>>>>>> The instance dropping seems pretty unsafe! Is there a better way?
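>>>>>>
>>>>>> For reference, the workaround currently looks roughly like this, called
>>>>>> from the live-instance hook (a sketch with placeholder constants;
>>>>>> HelixAdmin, ZKHelixAdmin, and LiveInstance are the stock Helix classes):
>>>>>>
>>>>>>   void dropDeadInstancesAndRebalance(List<LiveInstance> liveInstances) {
>>>>>>     HelixAdmin admin = new ZKHelixAdmin(ZK_ADDR);
>>>>>>     Set<String> live = new HashSet<String>();
>>>>>>     for (LiveInstance li : liveInstances) {
>>>>>>       live.add(li.getInstanceName());
>>>>>>     }
>>>>>>     for (String instance : admin.getInstancesInCluster(CLUSTER_NAME)) {
>>>>>>       if (!live.contains(instance)) {
>>>>>>         // this is the part that feels unsafe
>>>>>>         admin.dropInstance(CLUSTER_NAME,
>>>>>>             admin.getInstanceConfig(CLUSTER_NAME, instance));
>>>>>>       }
>>>>>>     }
>>>>>>     admin.rebalance(CLUSTER_NAME, RESOURCE_NAME, NUM_REPLICAS);
>>>>>>   }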
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>>
>> --
>> Lei Xia
>>
>
>
--
*Lei Xia*
Senior Software Engineer
Data Infra/Nuage & Helix
LinkedIn
[email protected]
www.linkedin.com/in/lxia1