I don't normally have direct access to the ZooKeeper cluster itself; I'll see if we can get our production engineers to dump that ZNode the next time we run the test.
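
In the meantime, here's roughly what I'd ask them to run: a minimal sketch using the plain ZooKeeper Java client (the connect string and cluster name below are placeholders, and I'm assuming the ZNode payload is UTF-8 JSON, as Helix records are):

    import java.nio.charset.StandardCharsets;
    import org.apache.zookeeper.ZooKeeper;

    public class DumpClusterConfig {
      public static void main(String[] args) throws Exception {
        // Placeholder connect string and cluster name -- substitute your own.
        String connectString = "zk-host:2181";
        String cluster = "helix-ctrl";

        // One-shot read: 30s session timeout, no-op watcher.
        ZooKeeper zk = new ZooKeeper(connectString, 30000, event -> { });
        try {
          String path = "/" + cluster + "/CONFIGS/CLUSTER/" + cluster;
          byte[] data = zk.getData(path, false, null);
          // Helix stores the record as JSON, so printing as UTF-8 is fine.
          System.out.println(new String(data, StandardCharsets.UTF_8));
        } finally {
          zk.close();
        }
      }
    }

That should print the full ClusterConfig record, including the map fields.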
On Tue, Jun 7, 2022 at 7:29 PM Junkai Xue <[email protected]> wrote:

> What I mean is the ZNode inside ZooKeeper under the path /[your cluster
> name]/CONFIGS/CLUSTER/[your cluster name].
>
> Best,
>
> Junkai
>
> On Tue, Jun 7, 2022 at 7:24 PM Phong X. Nguyen <[email protected]> wrote:
>
>> Yes, it involves enable/disable operations as the server comes up and
>> down. In the logs we would sometimes not see the host in the "Current
>> quota capacity" log message, either.
>>
>> When you refer to the Cluster Config, do you mean what's accessible via
>> "-listClusterInfo helix-ctrl"?
>>
>> Thanks,
>> Phong X. Nguyen
>>
>> On Tue, Jun 7, 2022 at 7:19 PM Junkai Xue <[email protected]> wrote:
>>
>>> Hi Phong,
>>>
>>> Thanks for trying out Helix 1.0.3. I have a question about your
>>> testing: does it involve enable/disable operations? If yes, this could
>>> be a bug introduced in 1.0.3 that leaves an instance disabled after a
>>> batch enable/disable. One thing you can verify: check the Cluster
>>> Config and see whether the map field of disabled instances still
>>> contains the instance that came back.
>>>
>>> We are working on the 1.0.4 release to fix that.
>>>
>>> Best,
>>>
>>> Junkai
>>>
>>> On Tue, Jun 7, 2022 at 6:50 PM Phong X. Nguyen <[email protected]> wrote:
>>>
>>>> Helix Team,
>>>>
>>>> We're testing an upgrade from Helix 1.0.1 to Helix 1.0.3, primarily
>>>> for the log4j2 fixes. As we test it, we're discovering that WAGED
>>>> seems to be rebalancing in a slightly different way than before.
>>>>
>>>> Our configuration has 32 instances and 32 partitions. The simpleFields
>>>> configuration is as follows:
>>>>
>>>> "simpleFields" : {
>>>>   "HELIX_ENABLED" : "true",
>>>>   "NUM_PARTITIONS" : "32",
>>>>   "MAX_PARTITIONS_PER_INSTANCE" : "4",
>>>>   "DELAY_REBALANCE_ENABLE" : "true",
>>>>   "DELAY_REBALANCE_TIME" : "30000",
>>>>   "REBALANCE_MODE" : "FULL_AUTO",
>>>>   "REBALANCER_CLASS_NAME" : "org.apache.helix.controller.rebalancer.waged.WagedRebalancer",
>>>>   "REPLICAS" : "1",
>>>>   "STATE_MODEL_DEF_REF" : "OnlineOffline",
>>>>   "STATE_MODEL_FACTORY_NAME" : "DEFAULT"
>>>> }
>>>>
>>>> Out of the 32 instances, two are production test servers, e.g.
>>>> 'server01' and 'server02'.
>>>>
>>>> Previously, if we restarted the application on 'server01' to deploy
>>>> some test code, Helix would move one of its partitions to another
>>>> host, and when 'server01' came back online the partition would be
>>>> rebalanced back. Currently we are not seeing this behavior; the
>>>> partition stays with the other host and does not move back. While
>>>> this is within the MAX_PARTITIONS_PER_INSTANCE constraint, we're
>>>> confused as to why this happens now.
>>>>
>>>> Have there been any changes to WAGED that might account for this? The
>>>> release notes mention that both 1.0.2 and 1.0.3 made changes to Helix.
>>>>
>>>> Thanks,
>>>> - Phong X. Nguyen
>>>
>>> --
>>> Junkai Xue
>
> --
> Junkai Xue
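
P.S. For the check Junkai described, something along these lines should work against the Helix Java API. It's just a sketch, assuming ClusterConfig#getDisabledInstances is the accessor for that map field in 1.0.3 (the ZK address and cluster name are placeholders again):

    import java.util.Map;
    import org.apache.helix.ConfigAccessor;
    import org.apache.helix.model.ClusterConfig;

    public class CheckDisabledInstances {
      public static void main(String[] args) {
        // Placeholder ZK address and cluster name -- substitute your own.
        ConfigAccessor accessor = new ConfigAccessor("zk-host:2181");
        ClusterConfig clusterConfig = accessor.getClusterConfig("helix-ctrl");

        // Batch enable/disable records instances in a ClusterConfig map field;
        // an instance stuck in this map stays disabled even after it rejoins.
        Map<String, String> disabled = clusterConfig.getDisabledInstances();
        if (disabled == null || disabled.isEmpty()) {
          System.out.println("No instances recorded as disabled in ClusterConfig.");
        } else {
          disabled.forEach((instance, info) ->
              System.out.println("disabled: " + instance + " -> " + info));
        }
      }
    }

If 'server01' still shows up in that map after it has rejoined, that would line up with the batch enable/disable bug.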
