I don't normally have direct access to the ZooKeeper cluster itself; I'll see if we can get our production engineers to dump that ZNode the next time we run the test.
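
In the meantime, here's roughly what I'd ask them to run: a minimal sketch using the plain ZooKeeper Java client (the connect string and cluster name below are placeholders, and I'm assuming the ZNode payload is UTF-8 JSON, as Helix records are):

    import java.nio.charset.StandardCharsets;
    import org.apache.zookeeper.ZooKeeper;

    public class DumpClusterConfig {
      public static void main(String[] args) throws Exception {
        // Placeholder connect string and cluster name -- substitute your own.
        String connectString = "zk-host:2181";
        String cluster = "helix-ctrl";

        // One-shot read: 30s session timeout, no-op watcher.
        ZooKeeper zk = new ZooKeeper(connectString, 30000, event -> { });
        try {
          String path = "/" + cluster + "/CONFIGS/CLUSTER/" + cluster;
          byte[] data = zk.getData(path, false, null);
          // Helix stores the record as JSON, so printing as UTF-8 is fine.
          System.out.println(new String(data, StandardCharsets.UTF_8));
        } finally {
          zk.close();
        }
      }
    }

That should print the full ClusterConfig record, including the map fields.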
On Tue, Jun 7, 2022 at 7:29 PM Junkai Xue <[email protected]> wrote:

> What I mean is the ZNode inside ZooKeeper under the path /[your cluster
> name]/CONFIGS/CLUSTER/[your cluster name].
>
> Best,
>
> Junkai
>
> On Tue, Jun 7, 2022 at 7:24 PM Phong X. Nguyen <[email protected]> wrote:
>
>> Yes, it involves enable/disable operations as the server comes up and
>> down. In the logs we would sometimes not see the host in the "Current
>> quota capacity" log message, either.
>>
>> When you refer to the Cluster Config, do you mean what's accessible via
>> "-listClusterInfo helix-ctrl"?
>>
>> Thanks,
>> Phong X. Nguyen
>>
>> On Tue, Jun 7, 2022 at 7:19 PM Junkai Xue <[email protected]> wrote:
>>
>>> Hi Phong,
>>>
>>> Thanks for trying out Helix 1.0.3. I have a question about your
>>> testing: does it involve enable/disable operations? If yes, this could
>>> be a bug introduced in 1.0.3 that leaves an instance disabled after a
>>> batch enable/disable. One thing you can verify: check the Cluster
>>> Config and see whether the map field of disabled instances still
>>> contains the instance that came back.
>>>
>>> We are working on the 1.0.4 release to fix that.
>>>
>>> Best,
>>>
>>> Junkai
>>>
>>> On Tue, Jun 7, 2022 at 6:50 PM Phong X. Nguyen <[email protected]> wrote:
>>>
>>>> Helix Team,
>>>>
>>>> We're testing an upgrade from Helix 1.0.1 to Helix 1.0.3, primarily
>>>> for the log4j2 fixes. As we test it, we're discovering that WAGED
>>>> seems to be rebalancing in a slightly different way than before.
>>>>
>>>> Our configuration has 32 instances and 32 partitions. The simpleFields
>>>> configuration is as follows:
>>>>
>>>> "simpleFields" : {
>>>>   "HELIX_ENABLED" : "true",
>>>>   "NUM_PARTITIONS" : "32",
>>>>   "MAX_PARTITIONS_PER_INSTANCE" : "4",
>>>>   "DELAY_REBALANCE_ENABLE" : "true",
>>>>   "DELAY_REBALANCE_TIME" : "30000",
>>>>   "REBALANCE_MODE" : "FULL_AUTO",
>>>>   "REBALANCER_CLASS_NAME" : "org.apache.helix.controller.rebalancer.waged.WagedRebalancer",
>>>>   "REPLICAS" : "1",
>>>>   "STATE_MODEL_DEF_REF" : "OnlineOffline",
>>>>   "STATE_MODEL_FACTORY_NAME" : "DEFAULT"
>>>> }
>>>>
>>>> Out of the 32 instances, two are production test servers, e.g.
>>>> 'server01' and 'server02'.
>>>>
>>>> Previously, if we restarted the application on 'server01' to deploy
>>>> some test code, Helix would move one of its partitions to another
>>>> host, and when 'server01' came back online the partition would be
>>>> rebalanced back. Currently we are not seeing this behavior; the
>>>> partition stays with the other host and does not move back. While
>>>> this is within the MAX_PARTITIONS_PER_INSTANCE constraint, we're
>>>> confused as to why this happens now.
>>>>
>>>> Have there been any changes to WAGED that might account for this? The
>>>> release notes mention that both 1.0.2 and 1.0.3 made changes to Helix.
>>>>
>>>> Thanks,
>>>> - Phong X. Nguyen
>>>
>>> --
>>> Junkai Xue
>
> --
> Junkai Xue
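
P.S. For the check Junkai described, something along these lines should work against the Helix Java API. It's just a sketch, assuming ClusterConfig#getDisabledInstances is the accessor for that map field in 1.0.3 (the ZK address and cluster name are placeholders again):

    import java.util.Map;
    import org.apache.helix.ConfigAccessor;
    import org.apache.helix.model.ClusterConfig;

    public class CheckDisabledInstances {
      public static void main(String[] args) {
        // Placeholder ZK address and cluster name -- substitute your own.
        ConfigAccessor accessor = new ConfigAccessor("zk-host:2181");
        ClusterConfig clusterConfig = accessor.getClusterConfig("helix-ctrl");

        // Batch enable/disable records instances in a ClusterConfig map field;
        // an instance stuck in this map stays disabled even after it rejoins.
        Map<String, String> disabled = clusterConfig.getDisabledInstances();
        if (disabled == null || disabled.isEmpty()) {
          System.out.println("No instances recorded as disabled in ClusterConfig.");
        } else {
          disabled.forEach((instance, info) ->
              System.out.println("disabled: " + instance + " -> " + info));
        }
      }
    }

If 'server01' still shows up in that map after it has rejoined, that would line up with the batch enable/disable bug.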
