What I mean is the ZNode in ZooKeeper under the path /[your cluster
name]/CONFIGS/CLUSTER/[your cluster name].
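
If you want to dump that ZNode directly, here is a minimal sketch using
the plain ZooKeeper client ("localhost:2181" and "myCluster" are
placeholders for your ZK address and cluster name):

import org.apache.zookeeper.ZooKeeper;

public class ReadClusterConfigZnode {
  public static void main(String[] args) throws Exception {
    // Connect to ZK; a real program should wait for the connected event.
    ZooKeeper zk = new ZooKeeper("localhost:2181", 30000, null);
    // Helix stores the cluster config as JSON bytes at this path.
    byte[] data =
        zk.getData("/myCluster/CONFIGS/CLUSTER/myCluster", false, null);
    System.out.println(new String(data));
    zk.close();
  }
}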

Best,

Junkai

On Tue, Jun 7, 2022 at 7:24 PM Phong X. Nguyen <[email protected]>
wrote:

> Yes, it involves enable/disable operations as the server goes down and
> comes back up. In the logs we would sometimes not see the host in the
> "Current quota capacity" log message, either.
>
> When you refer to Cluster Config, did you mean what's accessible via
> "-listClusterInfo helix-ctrl"?
>
> Thanks,
> Phong X. Nguyen
>
> On Tue, Jun 7, 2022 at 7:19 PM Junkai Xue <[email protected]> wrote:
>
>> Hi Phong,
>>
>> Thanks for trying out Helix 1.0.3. I have a question about your testing:
>> does it involve enable/disable operations? If yes, it could be a bug
>> introduced in 1.0.3 that leaves an instance disabled through the batch
>> enable/disable path. One thing you can verify: check the Cluster Config
>> and see whether the map field of disabled instances still contains the
>> instance that came back.
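>>
>> A minimal sketch of that check with ConfigAccessor (the ZK address and
>> cluster name are placeholders; getDisabledInstances() reads the
>> disabled-instances map field of the ClusterConfig):
>>
>> import java.util.Map;
>> import org.apache.helix.ConfigAccessor;
>> import org.apache.helix.model.ClusterConfig;
>>
>> public class CheckDisabledInstances {
>>   public static void main(String[] args) {
>>     ConfigAccessor accessor = new ConfigAccessor("localhost:2181");
>>     ClusterConfig config = accessor.getClusterConfig("myCluster");
>>     // If the bug is hit, a host that is back up still shows here.
>>     Map<String, String> disabled = config.getDisabledInstances();
>>     System.out.println("Disabled instances: " + disabled);
>>   }
>> }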
>>
>> We are working on version 1.0.4 to fix that.
>>
>> Best,
>>
>> Junkai
>>
>>
>>
>> On Tue, Jun 7, 2022 at 6:50 PM Phong X. Nguyen <[email protected]>
>> wrote:
>>
>>> Helix Team,
>>>
>>> We're testing an upgrade from Helix 1.0.1 to Helix 1.0.3, primarily for
>>> the log4j2 fixes. In testing, we've discovered that WAGED seems to be
>>> rebalancing in a slightly different way than before:
>>>
>>> Our configuration has 32 instances and 32 partitions. The simpleFields
>>> configuration is as follows:
>>>
>>> "simpleFields" : {
>>>     "HELIX_ENABLED" : "true",
>>>     "NUM_PARTITIONS" : "32",
>>>     "MAX_PARTITIONS_PER_INSTANCE" : "4",
>>>     "DELAY_REBALANCE_ENABLE" : "true",
>>>     "DELAY_REBALANCE_TIME" : "30000",
>>>     "REBALANCE_MODE" : "FULL_AUTO",
>>>     "REBALANCER_CLASS_NAME" :
>>> "org.apache.helix.controller.rebalancer.waged.WagedRebalancer",
>>>     "REPLICAS" : "1",
>>>     "STATE_MODEL_DEF_REF" : "OnlineOffline",
>>>     "STATE_MODEL_FACTORY_NAME" : "DEFAULT"
>>>   }
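>>>
>>> For context, these look like simpleFields on the resource IdealState; a
>>> sketch of setting the rebalance-related ones programmatically (the ZK
>>> address, cluster name, and resource name below are placeholders):
>>>
>>> import org.apache.helix.HelixAdmin;
>>> import org.apache.helix.manager.zk.ZKHelixAdmin;
>>> import org.apache.helix.model.IdealState;
>>>
>>> public class ConfigureWagedResource {
>>>   public static void main(String[] args) {
>>>     HelixAdmin admin = new ZKHelixAdmin("localhost:2181");
>>>     IdealState is = admin.getResourceIdealState("myCluster", "myResource");
>>>     is.setRebalanceMode(IdealState.RebalanceMode.FULL_AUTO);
>>>     is.setRebalancerClassName(
>>>         "org.apache.helix.controller.rebalancer.waged.WagedRebalancer");
>>>     is.setMaxPartitionsPerInstance(4);
>>>     is.setReplicas("1");
>>>     // Same keys as in the simpleFields above.
>>>     is.getRecord().setSimpleField("DELAY_REBALANCE_ENABLE", "true");
>>>     is.getRecord().setSimpleField("DELAY_REBALANCE_TIME", "30000");
>>>     admin.setResourceIdealState("myCluster", "myResource", is);
>>>   }
>>> }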
>>>
>>> Out of the 32 instances, two are production test servers; call them
>>> 'server01' and 'server02'.
>>>
>>> Previously, if we restarted the application on 'server01' in order to
>>> deploy some test code, Helix would move one of its partitions to another
>>> host, and when 'server01' came back online the partition would be
>>> rebalanced back. Currently we are not seeing this behavior: the partition
>>> stays on the other host and does not move back. While this is within the
>>> max-partitions constraint, we're confused as to why this happens now.
>>>
>>> Have there been any changes to WAGED that might account for this? The
>>> release notes mention a number of changes in both 1.0.2 and 1.0.3.
>>>
>>> Thanks,
>>> - Phong X. Nguyen
>>>
>>
>>
>> --
>> Junkai Xue
>>
>

-- 
Junkai Xue