Helix Team,

We're testing an upgrade to Helix 1.0.3 from Helix 1.0.1 primarily for the
log4j2 fixes. As we test it, we're discovering that WAGED seems to be
rebalancing in a slightly different way than before:

Our configuration has 32 instances and 32 partitions. The simpleFields
configuration is as follows:

"simpleFields" : {
    "HELIX_ENABLED" : "true",
    "NUM_PARTITIONS" : "32",
    "MAX_PARTITIONS_PER_INSTANCE" : "4",
    "DELAY_REBALANCE_ENABLE" : "true",
    "DELAY_REBALANCE_TIME" : "30000",
    "REBALANCE_MODE" : "FULL_AUTO",
    "REBALANCER_CLASS_NAME" : "org.apache.helix.controller.rebalancer.waged.WagedRebalancer",
    "REPLICAS" : "1",
    "STATE_MODEL_DEF_REF" : "OnlineOffline",
    "STATE_MODEL_FACTORY_NAME" : "DEFAULT"
}

Out of the 32 instances, we have 2 production test servers, e.g. 'server01'
and 'server02'.

Previously, if we restarted the application on 'server01' in order to
deploy some test code, Helix would move one of the partitions over to
another host, and when 'server01' came back online the partition would be
rebalanced back. Currently we are not seeing this behavior; the partition
stays with the other host and does not go back. While this is within the
constraints of the max partitions, we're confused as to why this might
happen now.

Have there been any changes to WAGED that might account for this? The
release notes mentioned that both 1.0.2 and 1.0.3 made some changes to
Helix.

Thanks,
- Phong X. Nguyen
