Hi Junkai, This is reproducible. Please find the sample code [1]. With this sample;
- Initially, I'm creating a cluster with 3 instances (Using OOTB `OnlineOfflineStateModelFactory` and WAGED rebalancer...) - Step 1: Adds 6 different resources to the cluster with 1 partition and 1 replica each. - Step 2: Adds an additional instance to the cluster. - Step 3: Removes an existing instance from the cluster. - Step 4: Remove all resources. However, after Step 1, you can see resource1 and resource2 is not getting assigned to any Instance. c8cep_on_localhost_12000 c8cep_on_localhost_12001 c8cep_on_localhost_12002 resource1 - - - resource2 ONLINE - - resource3 - ONLINE resource4 - - ONLINE resource5 - - - resource6 ONLINE - - After other steps also, not every resource is getting rebalanced properly. [1] https://gist.github.com/grainier/055511179d8b4a4f0c678f17889ed853 Thanks, Grainier Perera. On Sun, 19 Jun 2022 at 08:32, Junkai Xue <[email protected]> wrote: > BTW, have you setup proper capacity in InstanceConfig of the only instance? > > Best, > > Junkai > > On Sat, Jun 18, 2022 at 7:10 PM Junkai Xue <[email protected]> wrote: > >> Interesting. Is this reproducible? We can have a try on your data. >> >> Best, >> >> Junkai >> >> On Sat, Jun 18, 2022 at 4:31 AM Grainier Perera <[email protected]> >> wrote: >> >>> Hi Junkai, >>> >>> I tried removing `MAX_PARTITIONS_PER_INSTANCE`. But it's still the >>> same. What's weird is, when I add a few resources, I see some of them still >>> not getting into the `ONLINE` state. In the below sample, you can see only >>> the 2nd and 4th resources have proper `mapFields`, whereas the 1st and 3rd >>> resources don't seem to have any mapping (all of them have the >>> same IdealState). However, after a restart, this can change to 1 & 3 >>> becomes `ONLINE` and 2 & 3 may lose their mapping. But the pattern >>> remains... cannot understand why. >>> >>> >>> *ExternalView for _mm:root:_system:cron1:*{ >>> "id" : "_mm:root:_system:cron1", >>> "simpleFields" : { >>> "BUCKET_SIZE" : "0", >>> "DELAY_REBALANCE_ENABLED" : "true", >>> "IDEAL_STATE_MODE" : "AUTO_REBALANCE", >>> "NUM_PARTITIONS" : "1", >>> "REBALANCER_CLASS_NAME" : >>> "org.apache.helix.controller.rebalancer.waged.WagedRebalancer", >>> "REBALANCE_DELAY" : "10000", >>> "REBALANCE_MODE" : "FULL_AUTO", >>> "REPLICAS" : "1", >>> "STATE_MODEL_DEF_REF" : "NewC8CEPStateModel" >>> }, >>> *"mapFields" : { },* >>> "listFields" : { } >>> } >>> >>> >>> *ExternalView for _mm:root:_system:cron2:*{ >>> "id" : "_mm:root:_system:cron2", >>> "simpleFields" : { >>> "BUCKET_SIZE" : "0", >>> "DELAY_REBALANCE_ENABLED" : "true", >>> "IDEAL_STATE_MODE" : "AUTO_REBALANCE", >>> "NUM_PARTITIONS" : "1", >>> "REBALANCER_CLASS_NAME" : >>> "org.apache.helix.controller.rebalancer.waged.WagedRebalancer", >>> "REBALANCE_DELAY" : "10000", >>> "REBALANCE_MODE" : "FULL_AUTO", >>> "REPLICAS" : "1", >>> "STATE_MODEL_DEF_REF" : "NewC8CEPStateModel" >>> }, >>> >>> >>> >>> >>> * "mapFields" : { "_mm:root:_system:cron2_0" : { >>> "c8cep-0.c8cep.c8.svc.cluster.local_12000" : "ONLINE" } },* >>> "listFields" : { } >>> } >>> >>> >>> *ExternalView for _mm:root:_system:cron3:*{ >>> "id" : "_mm:root:_system:cron3", >>> "simpleFields" : { >>> "BUCKET_SIZE" : "0", >>> "DELAY_REBALANCE_ENABLED" : "true", >>> "IDEAL_STATE_MODE" : "AUTO_REBALANCE", >>> "NUM_PARTITIONS" : "1", >>> "REBALANCER_CLASS_NAME" : >>> "org.apache.helix.controller.rebalancer.waged.WagedRebalancer", >>> "REBALANCE_DELAY" : "10000", >>> "REBALANCE_MODE" : "FULL_AUTO", >>> "REPLICAS" : "1", >>> "STATE_MODEL_DEF_REF" : "NewC8CEPStateModel" >>> }, >>> *"mapFields" : { },* >>> "listFields" : { } >>> } >>> >>> >>> *ExternalView for _mm:root:_system:cron4:*{ >>> "id" : "_mm:root:_system:cron4", >>> "simpleFields" : { >>> "BUCKET_SIZE" : "0", >>> "DELAY_REBALANCE_ENABLED" : "true", >>> "IDEAL_STATE_MODE" : "AUTO_REBALANCE", >>> "NUM_PARTITIONS" : "1", >>> "REBALANCER_CLASS_NAME" : >>> "org.apache.helix.controller.rebalancer.waged.WagedRebalancer", >>> "REBALANCE_DELAY" : "10000", >>> "REBALANCE_MODE" : "FULL_AUTO", >>> "REPLICAS" : "1", >>> "STATE_MODEL_DEF_REF" : "NewC8CEPStateModel" >>> }, >>> >>> >>> >>> >>> * "mapFields" : { "_mm:root:_system:cron4_0" : { >>> "c8cep-0.c8cep.c8.svc.cluster.local_12000" : "ONLINE" } },* >>> "listFields" : { } >>> } >>> >>> Thanks, >>> Grainier Perera. >>> >>> >>> On Sat, 18 Jun 2022 at 13:21, Junkai Xue <[email protected]> wrote: >>> >>>> Then most likely, it caused by this entry of config: >>>> "MAX_PARTITIONS_PER_INSTANCE" : "1", >>>> Usually, we never set this config up. It restricts the assignment for >>>> instance. So now you already have one partition from 3_0 assigned. No other >>>> partition can be assigned. >>>> >>>> So either you remove this entry of config setup or add more instances >>>> may help. >>>> >>>> Please let us know if you have further questions. >>>> >>>> best, >>>> >>>> Junkai >>>> >>>> On Fri, Jun 17, 2022 at 11:38 PM Grainier Perera <[email protected]> >>>> wrote: >>>> >>>>> Hi Junkai, >>>>> >>>>> - Correct. I haven't added any rack-aware information. >>>>> - I'm connecting 1 instance at the startup and then expanding >>>>> on-demand (I've set ALLOW_PARTICIPANT_AUTO_JOIN to true). >>>>> - I've checked the live instances and other znodes in Zookeeper. >>>>> Everything looks ok, except >>>>> /C8CEPCluster/EXTERNALVIEW/_mm:root:_system:cron2 has empty >>>>> `mapFields` while /C8CEPCluster/EXTERNALVIEW/_mm:root:_system:cron3 >>>>> has `mapFields` with a ONLINE record. I still cannot understand why? and >>>>> what I'm doing wrong :( >>>>> >>>>> >>>>> *[zk: localhost:2181(CONNECTED) 18] get >>>>> /C8CEPCluster/CONFIGS/CLUSTER/C8CEPCluster*{ >>>>> "id" : "C8CEPCluster", >>>>> "simpleFields" : { >>>>> "allowParticipantAutoJoin" : "true" >>>>> }, >>>>> "mapFields" : { >>>>> "DEFAULT_INSTANCE_CAPACITY_MAP" : { >>>>> "MEMORY" : "100", >>>>> "CPU" : "100" >>>>> }, >>>>> "DEFAULT_PARTITION_WEIGHT_MAP" : { >>>>> "MEMORY" : "5", >>>>> "CPU" : "5" >>>>> } >>>>> }, >>>>> "listFields" : { >>>>> "INSTANCE_CAPACITY_KEYS" : [ "CPU", "MEMORY" ] >>>>> } >>>>> } >>>>> >>>>> *[zk: localhost:2181(CONNECTED) 8] get >>>>> /C8CEPCluster/LIVEINSTANCES/c8cep-0.c8cep.c8.svc.cluster.local_12000*{ >>>>> "id" : "c8cep-0.c8cep.c8.svc.cluster.local_12000", >>>>> "simpleFields" : { >>>>> "CURRENT_TASK_THREAD_POOL_SIZE" : "40", >>>>> "HELIX_VERSION" : "1.0.4", >>>>> "LIVE_INSTANCE" : "[email protected]", >>>>> "SESSION_ID" : "106a30539a8003e" >>>>> }, >>>>> "mapFields" : { }, >>>>> "listFields" : { } >>>>> } >>>>> [zk: localhost:2181(CONNECTED) 26] get >>>>> /C8CEPCluster/CONFIGS/RESOURCE/_mm:root:_system:cron2 >>>>> { >>>>> "id" : "_mm:root:_system:cron2", >>>>> "simpleFields" : { }, >>>>> "mapFields" : { >>>>> "PARTITION_CAPACITY_MAP" : { >>>>> "DEFAULT" : "{\"CPU\":\"10\",\"MEMORY\":\"10\"}" >>>>> } >>>>> }, >>>>> "listFields" : { } >>>>> } >>>>> >>>>> *[zk: localhost:2181(CONNECTED) 27] get >>>>> /C8CEPCluster/CONFIGS/RESOURCE/_mm:root:_system:cron3*{ >>>>> "id" : "_mm:root:_system:cron3", >>>>> "simpleFields" : { }, >>>>> "mapFields" : { >>>>> "PARTITION_CAPACITY_MAP" : { >>>>> "DEFAULT" : "{\"CPU\":\"10\",\"MEMORY\":\"10\"}" >>>>> } >>>>> }, >>>>> "listFields" : { } >>>>> } >>>>> >>>>> *[zk: localhost:2181(CONNECTED) 38] get >>>>> /C8CEPCluster/IDEALSTATES/_mm:root:_system:cron2*{ >>>>> "id" : "_mm:root:_system:cron2", >>>>> "simpleFields" : { >>>>> "DELAY_REBALANCE_ENABLED" : "true", >>>>> "IDEAL_STATE_MODE" : "AUTO_REBALANCE", >>>>> "MAX_PARTITIONS_PER_INSTANCE" : "1", >>>>> "NUM_PARTITIONS" : "1", >>>>> "REBALANCER_CLASS_NAME" : >>>>> "org.apache.helix.controller.rebalancer.waged.WagedRebalancer", >>>>> "REBALANCE_DELAY" : "10000", >>>>> "REBALANCE_MODE" : "FULL_AUTO", >>>>> "REPLICAS" : "1", >>>>> "STATE_MODEL_DEF_REF" : "C8CEPStateModel" >>>>> }, >>>>> "mapFields" : { >>>>> "_mm:root:_system:cron2_0" : { } >>>>> }, >>>>> "listFields" : { >>>>> "_mm:root:_system:cron2_0" : [ ] >>>>> } >>>>> } >>>>> >>>>> *[zk: localhost:2181(CONNECTED) 39] get >>>>> /C8CEPCluster/IDEALSTATES/_mm:root:_system:cron3*{ >>>>> "id" : "_mm:root:_system:cron3", >>>>> "simpleFields" : { >>>>> "DELAY_REBALANCE_ENABLED" : "true", >>>>> "IDEAL_STATE_MODE" : "AUTO_REBALANCE", >>>>> "MAX_PARTITIONS_PER_INSTANCE" : "1", >>>>> "NUM_PARTITIONS" : "1", >>>>> "REBALANCER_CLASS_NAME" : >>>>> "org.apache.helix.controller.rebalancer.waged.WagedRebalancer", >>>>> "REBALANCE_DELAY" : "10000", >>>>> "REBALANCE_MODE" : "FULL_AUTO", >>>>> "REPLICAS" : "1", >>>>> "STATE_MODEL_DEF_REF" : "C8CEPStateModel" >>>>> }, >>>>> "mapFields" : { >>>>> "_mm:root:_system:cron3_0" : { } >>>>> }, >>>>> "listFields" : { >>>>> "_mm:root:_system:cron3_0" : [ ] >>>>> } >>>>> } >>>>> >>>>> *[zk: localhost:2181(CONNECTED) 42] get >>>>> /C8CEPCluster/EXTERNALVIEW/_mm:root:_system:cron2*{ >>>>> "id" : "_mm:root:_system:cron2", >>>>> "simpleFields" : { >>>>> "BUCKET_SIZE" : "0", >>>>> "DELAY_REBALANCE_ENABLED" : "true", >>>>> "IDEAL_STATE_MODE" : "AUTO_REBALANCE", >>>>> "MAX_PARTITIONS_PER_INSTANCE" : "1", >>>>> "NUM_PARTITIONS" : "1", >>>>> "REBALANCER_CLASS_NAME" : >>>>> "org.apache.helix.controller.rebalancer.waged.WagedRebalancer", >>>>> "REBALANCE_DELAY" : "10000", >>>>> "REBALANCE_MODE" : "FULL_AUTO", >>>>> "REPLICAS" : "1", >>>>> "STATE_MODEL_DEF_REF" : "C8CEPStateModel" >>>>> }, >>>>> *"mapFields" : { },* >>>>> "listFields" : { } >>>>> } >>>>> >>>>> *[zk: localhost:2181(CONNECTED) 43] get >>>>> /C8CEPCluster/EXTERNALVIEW/_mm:root:_system:cron3*{ >>>>> "id" : "_mm:root:_system:cron3", >>>>> "simpleFields" : { >>>>> "BUCKET_SIZE" : "0", >>>>> "DELAY_REBALANCE_ENABLED" : "true", >>>>> "IDEAL_STATE_MODE" : "AUTO_REBALANCE", >>>>> "MAX_PARTITIONS_PER_INSTANCE" : "1", >>>>> "NUM_PARTITIONS" : "1", >>>>> "REBALANCER_CLASS_NAME" : >>>>> "org.apache.helix.controller.rebalancer.waged.WagedRebalancer", >>>>> "REBALANCE_DELAY" : "10000", >>>>> "REBALANCE_MODE" : "FULL_AUTO", >>>>> "REPLICAS" : "1", >>>>> "STATE_MODEL_DEF_REF" : "C8CEPStateModel" >>>>> }, >>>>> >>>>> >>>>> >>>>> >>>>> *"mapFields" : { "_mm:root:_system:cron3_0" : { >>>>> "c8cep-0.c8cep.c8.svc.cluster.local_12000" : "ONLINE" } }*, >>>>> "listFields" : { } >>>>> } >>>>> >>>>> Thank you. >>>>> Grainier Perera. >>>>> >>>>> >>>>> On Sat, 18 Jun 2022 at 10:45, Junkai Xue <[email protected]> wrote: >>>>> >>>>>> OK. So you dont put any rackaware information. Then how many >>>>>> instances do you have connecting to that cluster? Please double check the >>>>>> live instances in Zookeeper as well. >>>>>> >>>>>> Best, >>>>>> >>>>>> Junkai >>>>>> >>>>>> On Fri, Jun 17, 2022 at 10:01 PM Grainier Perera <[email protected]> >>>>>> wrote: >>>>>> >>>>>>> Hi Junkai, >>>>>>> >>>>>>> I've added cluster init code to the gist [1]. Apart from that, >>>>>>> ClusterConfig is configured like this; >>>>>>> >>>>>>> ClusterConfig clusterConfig = >>>>>>> configAccessor.getClusterConfig(CLUSTER_NAME); >>>>>>> // Configuring the capacity keys in the Cluster Config. >>>>>>> For example, MEMORY. >>>>>>> >>>>>>> clusterConfig.setInstanceCapacityKeys(INSTANCE_CAPACITY_KEYS); >>>>>>> // Configuring the instance capacity in the Instance >>>>>>> Config. For example, MEMORY = 100. >>>>>>> >>>>>>> clusterConfig.setDefaultInstanceCapacityMap(INSTANCE_CAPACITY); >>>>>>> // Configuring the partition weight in the Resource >>>>>>> Config. For example, MEMORY = 5. >>>>>>> >>>>>>> clusterConfig.setDefaultPartitionWeightMap(DEFAULT_RESOURCE_USAGE); >>>>>>> configAccessor.setClusterConfig(CLUSTER_NAME, >>>>>>> clusterConfig); >>>>>>> >>>>>>> [1] >>>>>>> https://gist.github.com/grainier/aa1c0b279ea99f88d74c1e94d79f5cdb#file-clustersetup-java >>>>>>> >>>>>>> Thanks, >>>>>>> Grainier Perera. >>>>>>> >>>>>>> >>>>>>> On Sat, 18 Jun 2022 at 10:00, Junkai Xue <[email protected]> wrote: >>>>>>> >>>>>>>> Could you please share your cluster config as well? >>>>>>>> >>>>>>>> Best, >>>>>>>> >>>>>>>> Junkai >>>>>>>> >>>>>>>> On Fri, Jun 17, 2022 at 8:24 PM Grainier Perera < >>>>>>>> [email protected]> wrote: >>>>>>>> >>>>>>>>> Hi Devs, >>>>>>>>> >>>>>>>>> I'm trying to add several resources to the cluster using the >>>>>>>>> following configurations[1]. However, only some will become `ONLINE`. >>>>>>>>> What >>>>>>>>> could be the reason? Is there a way to guarantee every resource will >>>>>>>>> become >>>>>>>>> `ONLINE` if WAGED capacity constraints are met? >>>>>>>>> >>>>>>>>> You can see with the same IdealState, "_mm:root:_system:cron3" has >>>>>>>>> mapFields and it is ONLINE, and "_mm:root:_system:cron2" is not. >>>>>>>>> Furthermore, I see this behavior more often when the replicas count >>>>>>>>> is set >>>>>>>>> to 1. >>>>>>>>> >>>>>>>>> ResourceInfo: >>>>>>>>> 1. "_mm:root:_system:cron2" >>>>>>>>> >>>>>>>>> IdealState for _mm:root:_system:cron2: >>>>>>>>> { >>>>>>>>> "id" : "_mm:root:_system:cron2", >>>>>>>>> "simpleFields" : { >>>>>>>>> "DELAY_REBALANCE_ENABLED" : "true", >>>>>>>>> "IDEAL_STATE_MODE" : "AUTO_REBALANCE", >>>>>>>>> "MAX_PARTITIONS_PER_INSTANCE" : "1", >>>>>>>>> "NUM_PARTITIONS" : "1", >>>>>>>>> "REBALANCER_CLASS_NAME" : >>>>>>>>> "org.apache.helix.controller.rebalancer.waged.WagedRebalancer", >>>>>>>>> "REBALANCE_DELAY" : "10000", >>>>>>>>> "REBALANCE_MODE" : "FULL_AUTO", >>>>>>>>> "REPLICAS" : "1", >>>>>>>>> "STATE_MODEL_DEF_REF" : "C8CEPStateModel" >>>>>>>>> }, >>>>>>>>> "mapFields" : { >>>>>>>>> "_mm:root:_system:cron2_0" : { } >>>>>>>>> }, >>>>>>>>> "listFields" : { >>>>>>>>> "_mm:root:_system:cron2_0" : [ ] >>>>>>>>> } >>>>>>>>> } >>>>>>>>> >>>>>>>>> >>>>>>>>> ExternalView for _mm:root:_system:cron2: >>>>>>>>> { >>>>>>>>> "id" : "_mm:root:_system:cron2", >>>>>>>>> "simpleFields" : { >>>>>>>>> "BUCKET_SIZE" : "0", >>>>>>>>> "DELAY_REBALANCE_ENABLED" : "true", >>>>>>>>> "IDEAL_STATE_MODE" : "AUTO_REBALANCE", >>>>>>>>> "MAX_PARTITIONS_PER_INSTANCE" : "1", >>>>>>>>> "NUM_PARTITIONS" : "1", >>>>>>>>> "REBALANCER_CLASS_NAME" : >>>>>>>>> "org.apache.helix.controller.rebalancer.waged.WagedRebalancer", >>>>>>>>> "REBALANCE_DELAY" : "10000", >>>>>>>>> "REBALANCE_MODE" : "FULL_AUTO", >>>>>>>>> "REPLICAS" : "1", >>>>>>>>> "STATE_MODEL_DEF_REF" : "C8CEPStateModel" >>>>>>>>> }, >>>>>>>>> *"mapFields" : { },* >>>>>>>>> "listFields" : { } >>>>>>>>> } >>>>>>>>> >>>>>>>>> >>>>>>>>> 2. "_mm:root:_system:cron3" >>>>>>>>> >>>>>>>>> IdealState for _mm:root:_system:cron3: >>>>>>>>> { >>>>>>>>> "id" : "_mm:root:_system:cron3", >>>>>>>>> "simpleFields" : { >>>>>>>>> "DELAY_REBALANCE_ENABLED" : "true", >>>>>>>>> "IDEAL_STATE_MODE" : "AUTO_REBALANCE", >>>>>>>>> "MAX_PARTITIONS_PER_INSTANCE" : "1", >>>>>>>>> "NUM_PARTITIONS" : "1", >>>>>>>>> "REBALANCER_CLASS_NAME" : >>>>>>>>> "org.apache.helix.controller.rebalancer.waged.WagedRebalancer", >>>>>>>>> "REBALANCE_DELAY" : "10000", >>>>>>>>> "REBALANCE_MODE" : "FULL_AUTO", >>>>>>>>> "REPLICAS" : "1", >>>>>>>>> "STATE_MODEL_DEF_REF" : "C8CEPStateModel" >>>>>>>>> }, >>>>>>>>> "mapFields" : { >>>>>>>>> "_mm:root:_system:cron3_0" : { } >>>>>>>>> }, >>>>>>>>> "listFields" : { >>>>>>>>> "_mm:root:_system:cron3_0" : [ ] >>>>>>>>> } >>>>>>>>> } >>>>>>>>> >>>>>>>>> >>>>>>>>> ExternalView for _mm:root:_system:cron3: >>>>>>>>> { >>>>>>>>> "id" : "_mm:root:_system:cron3", >>>>>>>>> "simpleFields" : { >>>>>>>>> "BUCKET_SIZE" : "0", >>>>>>>>> "DELAY_REBALANCE_ENABLED" : "true", >>>>>>>>> "IDEAL_STATE_MODE" : "AUTO_REBALANCE", >>>>>>>>> "MAX_PARTITIONS_PER_INSTANCE" : "1", >>>>>>>>> "NUM_PARTITIONS" : "1", >>>>>>>>> "REBALANCER_CLASS_NAME" : >>>>>>>>> "org.apache.helix.controller.rebalancer.waged.WagedRebalancer", >>>>>>>>> "REBALANCE_DELAY" : "10000", >>>>>>>>> "REBALANCE_MODE" : "FULL_AUTO", >>>>>>>>> "REPLICAS" : "1", >>>>>>>>> "STATE_MODEL_DEF_REF" : "C8CEPStateModel" >>>>>>>>> }, >>>>>>>>> *"mapFields" : {* >>>>>>>>> * "_mm:root:_system:cron3_0" : {* >>>>>>>>> * "c8cep-0.c8cep.c8.svc.cluster.local_12000" : "ONLINE"* >>>>>>>>> * }* >>>>>>>>> * },* >>>>>>>>> "listFields" : { } >>>>>>>>> } >>>>>>>>> >>>>>>>>> >>>>>>>>> [1]: >>>>>>>>> https://gist.github.com/grainier/aa1c0b279ea99f88d74c1e94d79f5cdb >>>>>>>>> >>>>>>>>> Thank you. >>>>>>>>> Grainier Perera. >>>>>>>>> >>>>>>>> >>>> >>>> -- >>>> Junkai Xue >>>> >>>
