Hi Phong,

Thanks for trying out Helix 1.0.3. One question about your testing: does it involve any enable/disable operations? If so, this could be a bug introduced in 1.0.3 in which an instance ends up disabled through the batch enable/disable path. One thing you can verify: check the ClusterConfig and see whether the map field that records disabled instances still contains the instance that came back.
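In case it helps, here is a rough sketch of that check via the Java API (the ZK address, cluster name, and instance name are placeholders, and the exact accessor names may vary slightly between Helix versions):

import java.util.Map;

import org.apache.helix.ConfigAccessor;
import org.apache.helix.model.ClusterConfig;

public class CheckDisabledInstances {
  public static void main(String[] args) {
    // Placeholder ZK address and cluster name; replace with your own.
    ConfigAccessor configAccessor = new ConfigAccessor("localhost:2181");
    ClusterConfig clusterConfig = configAccessor.getClusterConfig("MY_CLUSTER");

    // Instances disabled through the batch enable/disable API are recorded in a
    // map field of the ClusterConfig; a stuck entry there would match this symptom.
    Map<String, String> disabled = clusterConfig.getDisabledInstances();
    if (disabled != null && disabled.containsKey("server01")) {
      System.out.println("server01 is still recorded as disabled: " + disabled.get("server01"));
    } else {
      System.out.println("server01 is not in the disabled-instances map field.");
    }
  }
}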
We are working on the 1.0.4 release to fix that.

Best,
Junkai

On Tue, Jun 7, 2022 at 6:50 PM Phong X. Nguyen <[email protected]> wrote:

> Helix Team,
>
> We're testing an upgrade to Helix 1.0.3 from Helix 1.0.1, primarily for
> the log4j2 fixes. As we test it, we're discovering that WAGED seems to be
> rebalancing in a slightly different way than before:
>
> Our configuration has 32 instances and 32 partitions. The simpleFields
> configuration is as follows:
>
> "simpleFields" : {
>   "HELIX_ENABLED" : "true",
>   "NUM_PARTITIONS" : "32",
>   "MAX_PARTITIONS_PER_INSTANCE" : "4",
>   "DELAY_REBALANCE_ENABLE" : "true",
>   "DELAY_REBALANCE_TIME" : "30000",
>   "REBALANCE_MODE" : "FULL_AUTO",
>   "REBALANCER_CLASS_NAME" : "org.apache.helix.controller.rebalancer.waged.WagedRebalancer",
>   "REPLICAS" : "1",
>   "STATE_MODEL_DEF_REF" : "OnlineOffline",
>   "STATE_MODEL_FACTORY_NAME" : "DEFAULT"
> }
>
> Out of the 32 instances, we have 2 production test servers, e.g.
> 'server01' and 'server02'.
>
> Previously, if we restarted the application on 'server01' in order to
> deploy some test code, Helix would move one of the partitions over to
> another host, and when 'server01' came back online the partition would be
> rebalanced back. Currently we are not seeing this behavior; the partition
> stays with the other host and does not go back. While this is within the
> constraints of the max partitions, we're confused as to why this might
> happen now.
>
> Have there been any changes to WAGED that might account for this? The
> release notes mentioned that both 1.0.2 and 1.0.3 made some changes to
> Helix.
>
> Thanks,
> - Phong X. Nguyen

--
Junkai Xue
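For reference, a minimal sketch of how a resource with the quoted simpleFields might be created through the Helix Java API; the cluster and resource names are placeholders, and setter names may differ slightly between Helix versions:

import org.apache.helix.HelixAdmin;
import org.apache.helix.manager.zk.ZKHelixAdmin;
import org.apache.helix.model.IdealState;

public class SetupWagedResource {
  public static void main(String[] args) {
    // Placeholder ZK address, cluster, and resource names.
    HelixAdmin admin = new ZKHelixAdmin("localhost:2181");
    String cluster = "MY_CLUSTER";
    String resource = "MY_RESOURCE";

    // 32 partitions, OnlineOffline state model, FULL_AUTO rebalance mode.
    admin.addResource(cluster, resource, 32, "OnlineOffline",
        IdealState.RebalanceMode.FULL_AUTO.name());

    IdealState idealState = admin.getResourceIdealState(cluster, resource);
    idealState.setRebalancerClassName(
        "org.apache.helix.controller.rebalancer.waged.WagedRebalancer");
    idealState.setReplicas("1");
    idealState.setMaxPartitionsPerInstance(4);
    // Delayed rebalance fields, written as they appear in the quoted configuration.
    idealState.getRecord().setSimpleField("DELAY_REBALANCE_ENABLE", "true");
    idealState.getRecord().setSimpleField("DELAY_REBALANCE_TIME", "30000");
    admin.setResourceIdealState(cluster, resource, idealState);

    // Compute the initial assignment with 1 replica per partition.
    admin.rebalance(cluster, resource, 1);
  }
}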
