Interesting. Is this reproducible? We can try it on your data.

Best,
Junkai

On Sat, Jun 18, 2022 at 4:31 AM Grainier Perera <[email protected]> wrote:

> Hi Junkai,
>
> I tried removing `MAX_PARTITIONS_PER_INSTANCE`. But it's still the same.
> What's weird is, when I add a few resources, I see some of them still not
> getting into the `ONLINE` state. In the sample below, you can see only the
> 2nd and 4th resources have proper `mapFields`, whereas the 1st and 3rd
> resources don't seem to have any mapping (all of them have the
> same IdealState). However, after a restart, this can change so that 1 & 3
> become `ONLINE` and 2 & 4 lose their mapping. But the pattern
> remains... I cannot understand why.
>
>
> *ExternalView for _mm:root:_system:cron1:*
> {
>   "id" : "_mm:root:_system:cron1",
>   "simpleFields" : {
>     "BUCKET_SIZE" : "0",
>     "DELAY_REBALANCE_ENABLED" : "true",
>     "IDEAL_STATE_MODE" : "AUTO_REBALANCE",
>     "NUM_PARTITIONS" : "1",
>     "REBALANCER_CLASS_NAME" : "org.apache.helix.controller.rebalancer.waged.WagedRebalancer",
>     "REBALANCE_DELAY" : "10000",
>     "REBALANCE_MODE" : "FULL_AUTO",
>     "REPLICAS" : "1",
>     "STATE_MODEL_DEF_REF" : "NewC8CEPStateModel"
>   },
>   *"mapFields" : { },*
>   "listFields" : { }
> }
>
>
> *ExternalView for _mm:root:_system:cron2:*
> {
>   "id" : "_mm:root:_system:cron2",
>   "simpleFields" : {
>     "BUCKET_SIZE" : "0",
>     "DELAY_REBALANCE_ENABLED" : "true",
>     "IDEAL_STATE_MODE" : "AUTO_REBALANCE",
>     "NUM_PARTITIONS" : "1",
>     "REBALANCER_CLASS_NAME" : "org.apache.helix.controller.rebalancer.waged.WagedRebalancer",
>     "REBALANCE_DELAY" : "10000",
>     "REBALANCE_MODE" : "FULL_AUTO",
>     "REPLICAS" : "1",
>     "STATE_MODEL_DEF_REF" : "NewC8CEPStateModel"
>   },
>   *"mapFields" : {*
>   *  "_mm:root:_system:cron2_0" : {*
>   *    "c8cep-0.c8cep.c8.svc.cluster.local_12000" : "ONLINE"*
>   *  }*
>   *},*
>   "listFields" : { }
> }
>
>
> *ExternalView for _mm:root:_system:cron3:*
> {
>   "id" : "_mm:root:_system:cron3",
>   "simpleFields" : {
>     "BUCKET_SIZE" : "0",
>     "DELAY_REBALANCE_ENABLED" : "true",
>     "IDEAL_STATE_MODE" : "AUTO_REBALANCE",
>     "NUM_PARTITIONS" : "1",
>     "REBALANCER_CLASS_NAME" : "org.apache.helix.controller.rebalancer.waged.WagedRebalancer",
>     "REBALANCE_DELAY" : "10000",
>     "REBALANCE_MODE" : "FULL_AUTO",
>     "REPLICAS" : "1",
>     "STATE_MODEL_DEF_REF" : "NewC8CEPStateModel"
>   },
>   *"mapFields" : { },*
>   "listFields" : { }
> }
>
>
> *ExternalView for _mm:root:_system:cron4:*
> {
>   "id" : "_mm:root:_system:cron4",
>   "simpleFields" : {
>     "BUCKET_SIZE" : "0",
>     "DELAY_REBALANCE_ENABLED" : "true",
>     "IDEAL_STATE_MODE" : "AUTO_REBALANCE",
>     "NUM_PARTITIONS" : "1",
>     "REBALANCER_CLASS_NAME" : "org.apache.helix.controller.rebalancer.waged.WagedRebalancer",
>     "REBALANCE_DELAY" : "10000",
>     "REBALANCE_MODE" : "FULL_AUTO",
>     "REPLICAS" : "1",
>     "STATE_MODEL_DEF_REF" : "NewC8CEPStateModel"
>   },
>   *"mapFields" : {*
>   *  "_mm:root:_system:cron4_0" : {*
>   *    "c8cep-0.c8cep.c8.svc.cluster.local_12000" : "ONLINE"*
>   *  }*
>   *},*
>   "listFields" : { }
> }
>
> Thanks,
> Grainier Perera.
>
>
> On Sat, 18 Jun 2022 at 13:21, Junkai Xue <[email protected]> wrote:
>
>> Then most likely, it is caused by this config entry:
>> "MAX_PARTITIONS_PER_INSTANCE" : "1",
>> Usually, we never set this config up. It restricts the assignment per
>> instance. So now you already have one partition from 3_0 assigned. No other
>> partition can be assigned.
>>
>> So either removing this config entry or adding more instances may
>> help.
>>
>> Please let us know if you have further questions.
>>
>> best,
>>
>> Junkai
>>
>> On Fri, Jun 17, 2022 at 11:38 PM Grainier Perera <[email protected]>
>> wrote:
>>
>>> Hi Junkai,
>>>
>>> - Correct. I haven't added any rack-aware information.
>>> - I'm connecting 1 instance at the startup and then expanding on-demand
>>>   (I've set ALLOW_PARTICIPANT_AUTO_JOIN to true).
>>> - I've checked the live instances and other znodes in Zookeeper.
>>>   Everything looks ok, except
>>>   /C8CEPCluster/EXTERNALVIEW/_mm:root:_system:cron2 has empty `mapFields`
>>>   while /C8CEPCluster/EXTERNALVIEW/_mm:root:_system:cron3 has `mapFields`
>>>   with an ONLINE record. I still cannot understand why, or what I'm doing
>>>   wrong :(
>>>
>>>
>>> *[zk: localhost:2181(CONNECTED) 18] get /C8CEPCluster/CONFIGS/CLUSTER/C8CEPCluster*
>>> {
>>>   "id" : "C8CEPCluster",
>>>   "simpleFields" : {
>>>     "allowParticipantAutoJoin" : "true"
>>>   },
>>>   "mapFields" : {
>>>     "DEFAULT_INSTANCE_CAPACITY_MAP" : {
>>>       "MEMORY" : "100",
>>>       "CPU" : "100"
>>>     },
>>>     "DEFAULT_PARTITION_WEIGHT_MAP" : {
>>>       "MEMORY" : "5",
>>>       "CPU" : "5"
>>>     }
>>>   },
>>>   "listFields" : {
>>>     "INSTANCE_CAPACITY_KEYS" : [ "CPU", "MEMORY" ]
>>>   }
>>> }
>>>
>>> *[zk: localhost:2181(CONNECTED) 8] get /C8CEPCluster/LIVEINSTANCES/c8cep-0.c8cep.c8.svc.cluster.local_12000*
>>> {
>>>   "id" : "c8cep-0.c8cep.c8.svc.cluster.local_12000",
>>>   "simpleFields" : {
>>>     "CURRENT_TASK_THREAD_POOL_SIZE" : "40",
>>>     "HELIX_VERSION" : "1.0.4",
>>>     "LIVE_INSTANCE" : "[email protected]",
>>>     "SESSION_ID" : "106a30539a8003e"
>>>   },
>>>   "mapFields" : { },
>>>   "listFields" : { }
>>> }
>>>
>>> [zk: localhost:2181(CONNECTED) 26] get /C8CEPCluster/CONFIGS/RESOURCE/_mm:root:_system:cron2
>>> {
>>>   "id" : "_mm:root:_system:cron2",
>>>   "simpleFields" : { },
>>>   "mapFields" : {
>>>     "PARTITION_CAPACITY_MAP" : {
>>>       "DEFAULT" : "{\"CPU\":\"10\",\"MEMORY\":\"10\"}"
>>>     }
>>>   },
>>>   "listFields" : { }
>>> }
>>>
>>> *[zk: localhost:2181(CONNECTED) 27] get /C8CEPCluster/CONFIGS/RESOURCE/_mm:root:_system:cron3*
>>> {
>>>   "id" : "_mm:root:_system:cron3",
>>>   "simpleFields" : { },
>>>   "mapFields" : {
>>>     "PARTITION_CAPACITY_MAP" : {
>>>       "DEFAULT" : "{\"CPU\":\"10\",\"MEMORY\":\"10\"}"
>>>     }
>>>   },
>>>   "listFields" : { }
>>> }
>>>
>>> *[zk: localhost:2181(CONNECTED) 38] get /C8CEPCluster/IDEALSTATES/_mm:root:_system:cron2*
>>> {
>>>   "id" : "_mm:root:_system:cron2",
>>>   "simpleFields" : {
>>>     "DELAY_REBALANCE_ENABLED" : "true",
>>>     "IDEAL_STATE_MODE" : "AUTO_REBALANCE",
>>>     "MAX_PARTITIONS_PER_INSTANCE" : "1",
>>>     "NUM_PARTITIONS" : "1",
>>>     "REBALANCER_CLASS_NAME" : "org.apache.helix.controller.rebalancer.waged.WagedRebalancer",
>>>     "REBALANCE_DELAY" : "10000",
>>>     "REBALANCE_MODE" : "FULL_AUTO",
>>>     "REPLICAS" : "1",
>>>     "STATE_MODEL_DEF_REF" : "C8CEPStateModel"
>>>   },
>>>   "mapFields" : {
>>>     "_mm:root:_system:cron2_0" : { }
>>>   },
>>>   "listFields" : {
>>>     "_mm:root:_system:cron2_0" : [ ]
>>>   }
>>> }
>>>
>>> *[zk: localhost:2181(CONNECTED) 39] get /C8CEPCluster/IDEALSTATES/_mm:root:_system:cron3*
>>> {
>>>   "id" : "_mm:root:_system:cron3",
>>>   "simpleFields" : {
>>>     "DELAY_REBALANCE_ENABLED" : "true",
>>>     "IDEAL_STATE_MODE" : "AUTO_REBALANCE",
>>>     "MAX_PARTITIONS_PER_INSTANCE" : "1",
>>>     "NUM_PARTITIONS" : "1",
>>>     "REBALANCER_CLASS_NAME" : "org.apache.helix.controller.rebalancer.waged.WagedRebalancer",
>>>     "REBALANCE_DELAY" : "10000",
>>>     "REBALANCE_MODE" : "FULL_AUTO",
>>>     "REPLICAS" : "1",
>>>     "STATE_MODEL_DEF_REF" : "C8CEPStateModel"
>>>   },
>>>   "mapFields" : {
>>>     "_mm:root:_system:cron3_0" : { }
>>>   },
>>>   "listFields" : {
>>>     "_mm:root:_system:cron3_0" : [ ]
>>>   }
>>> }
>>>
>>> *[zk: localhost:2181(CONNECTED) 42] get /C8CEPCluster/EXTERNALVIEW/_mm:root:_system:cron2*
>>> {
>>>   "id" : "_mm:root:_system:cron2",
>>>   "simpleFields" : {
>>>     "BUCKET_SIZE" : "0",
>>>     "DELAY_REBALANCE_ENABLED" : "true",
>>>     "IDEAL_STATE_MODE" : "AUTO_REBALANCE",
>>>     "MAX_PARTITIONS_PER_INSTANCE" : "1",
>>>     "NUM_PARTITIONS" : "1",
>>>     "REBALANCER_CLASS_NAME" : "org.apache.helix.controller.rebalancer.waged.WagedRebalancer",
>>>     "REBALANCE_DELAY" : "10000",
>>>     "REBALANCE_MODE" : "FULL_AUTO",
>>>     "REPLICAS" : "1",
>>>     "STATE_MODEL_DEF_REF" : "C8CEPStateModel"
>>>   },
>>>   *"mapFields" : { },*
>>>   "listFields" : { }
>>> }
>>>
>>> *[zk: localhost:2181(CONNECTED) 43] get /C8CEPCluster/EXTERNALVIEW/_mm:root:_system:cron3*
>>> {
>>>   "id" : "_mm:root:_system:cron3",
>>>   "simpleFields" : {
>>>     "BUCKET_SIZE" : "0",
>>>     "DELAY_REBALANCE_ENABLED" : "true",
>>>     "IDEAL_STATE_MODE" : "AUTO_REBALANCE",
>>>     "MAX_PARTITIONS_PER_INSTANCE" : "1",
>>>     "NUM_PARTITIONS" : "1",
>>>     "REBALANCER_CLASS_NAME" : "org.apache.helix.controller.rebalancer.waged.WagedRebalancer",
>>>     "REBALANCE_DELAY" : "10000",
>>>     "REBALANCE_MODE" : "FULL_AUTO",
>>>     "REPLICAS" : "1",
>>>     "STATE_MODEL_DEF_REF" : "C8CEPStateModel"
>>>   },
>>>   *"mapFields" : {*
>>>   *  "_mm:root:_system:cron3_0" : {*
>>>   *    "c8cep-0.c8cep.c8.svc.cluster.local_12000" : "ONLINE"*
>>>   *  }*
>>>   *},*
>>>   "listFields" : { }
>>> }
>>>
>>> Thank you.
>>> Grainier Perera.
>>>
>>>
>>> On Sat, 18 Jun 2022 at 10:45, Junkai Xue <[email protected]> wrote:
>>>
>>>> OK. So you don't put any rack-aware information. Then how many instances
>>>> do you have connecting to that cluster? Please double check the live
>>>> instances in Zookeeper as well.
>>>>
>>>> Best,
>>>>
>>>> Junkai
>>>>
>>>> On Fri, Jun 17, 2022 at 10:01 PM Grainier Perera <[email protected]>
>>>> wrote:
>>>>
>>>>> Hi Junkai,
>>>>>
>>>>> I've added cluster init code to the gist [1]. Apart from that,
>>>>> ClusterConfig is configured like this:
>>>>>
>>>>>     ClusterConfig clusterConfig = configAccessor.getClusterConfig(CLUSTER_NAME);
>>>>>     // Configuring the capacity keys in the Cluster Config. For example, MEMORY.
>>>>>     clusterConfig.setInstanceCapacityKeys(INSTANCE_CAPACITY_KEYS);
>>>>>     // Configuring the instance capacity in the Instance Config. For example, MEMORY = 100.
>>>>>     clusterConfig.setDefaultInstanceCapacityMap(INSTANCE_CAPACITY);
>>>>>     // Configuring the partition weight in the Resource Config. For example, MEMORY = 5.
>>>>>     clusterConfig.setDefaultPartitionWeightMap(DEFAULT_RESOURCE_USAGE);
>>>>>     configAccessor.setClusterConfig(CLUSTER_NAME, clusterConfig);
>>>>>
>>>>> [1]
>>>>> https://gist.github.com/grainier/aa1c0b279ea99f88d74c1e94d79f5cdb#file-clustersetup-java
>>>>>
>>>>> Thanks,
>>>>> Grainier Perera.
>>>>>
>>>>>
>>>>> On Sat, 18 Jun 2022 at 10:00, Junkai Xue <[email protected]> wrote:
>>>>>
>>>>>> Could you please share your cluster config as well?
>>>>>>
>>>>>> Best,
>>>>>>
>>>>>> Junkai
>>>>>>
>>>>>> On Fri, Jun 17, 2022 at 8:24 PM Grainier Perera <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi Devs,
>>>>>>>
>>>>>>> I'm trying to add several resources to the cluster using the
>>>>>>> following configurations[1]. However, only some will become `ONLINE`.
>>>>>>> What could be the reason? Is there a way to guarantee every resource
>>>>>>> will become `ONLINE` if WAGED capacity constraints are met?
>>>>>>>
>>>>>>> You can see with the same IdealState, "_mm:root:_system:cron3" has
>>>>>>> mapFields and it is ONLINE, and "_mm:root:_system:cron2" is not.
>>>>>>> Furthermore, I see this behavior more often when the replicas count is
>>>>>>> set to 1.
>>>>>>>
>>>>>>> ResourceInfo:
>>>>>>> 1. "_mm:root:_system:cron2"
>>>>>>>
>>>>>>> IdealState for _mm:root:_system:cron2:
>>>>>>> {
>>>>>>>   "id" : "_mm:root:_system:cron2",
>>>>>>>   "simpleFields" : {
>>>>>>>     "DELAY_REBALANCE_ENABLED" : "true",
>>>>>>>     "IDEAL_STATE_MODE" : "AUTO_REBALANCE",
>>>>>>>     "MAX_PARTITIONS_PER_INSTANCE" : "1",
>>>>>>>     "NUM_PARTITIONS" : "1",
>>>>>>>     "REBALANCER_CLASS_NAME" : "org.apache.helix.controller.rebalancer.waged.WagedRebalancer",
>>>>>>>     "REBALANCE_DELAY" : "10000",
>>>>>>>     "REBALANCE_MODE" : "FULL_AUTO",
>>>>>>>     "REPLICAS" : "1",
>>>>>>>     "STATE_MODEL_DEF_REF" : "C8CEPStateModel"
>>>>>>>   },
>>>>>>>   "mapFields" : {
>>>>>>>     "_mm:root:_system:cron2_0" : { }
>>>>>>>   },
>>>>>>>   "listFields" : {
>>>>>>>     "_mm:root:_system:cron2_0" : [ ]
>>>>>>>   }
>>>>>>> }
>>>>>>>
>>>>>>>
>>>>>>> ExternalView for _mm:root:_system:cron2:
>>>>>>> {
>>>>>>>   "id" : "_mm:root:_system:cron2",
>>>>>>>   "simpleFields" : {
>>>>>>>     "BUCKET_SIZE" : "0",
>>>>>>>     "DELAY_REBALANCE_ENABLED" : "true",
>>>>>>>     "IDEAL_STATE_MODE" : "AUTO_REBALANCE",
>>>>>>>     "MAX_PARTITIONS_PER_INSTANCE" : "1",
>>>>>>>     "NUM_PARTITIONS" : "1",
>>>>>>>     "REBALANCER_CLASS_NAME" : "org.apache.helix.controller.rebalancer.waged.WagedRebalancer",
>>>>>>>     "REBALANCE_DELAY" : "10000",
>>>>>>>     "REBALANCE_MODE" : "FULL_AUTO",
>>>>>>>     "REPLICAS" : "1",
>>>>>>>     "STATE_MODEL_DEF_REF" : "C8CEPStateModel"
>>>>>>>   },
>>>>>>>   *"mapFields" : { },*
>>>>>>>   "listFields" : { }
>>>>>>> }
>>>>>>>
>>>>>>>
>>>>>>> 2. "_mm:root:_system:cron3"
>>>>>>>
>>>>>>> IdealState for _mm:root:_system:cron3:
>>>>>>> {
>>>>>>>   "id" : "_mm:root:_system:cron3",
>>>>>>>   "simpleFields" : {
>>>>>>>     "DELAY_REBALANCE_ENABLED" : "true",
>>>>>>>     "IDEAL_STATE_MODE" : "AUTO_REBALANCE",
>>>>>>>     "MAX_PARTITIONS_PER_INSTANCE" : "1",
>>>>>>>     "NUM_PARTITIONS" : "1",
>>>>>>>     "REBALANCER_CLASS_NAME" : "org.apache.helix.controller.rebalancer.waged.WagedRebalancer",
>>>>>>>     "REBALANCE_DELAY" : "10000",
>>>>>>>     "REBALANCE_MODE" : "FULL_AUTO",
>>>>>>>     "REPLICAS" : "1",
>>>>>>>     "STATE_MODEL_DEF_REF" : "C8CEPStateModel"
>>>>>>>   },
>>>>>>>   "mapFields" : {
>>>>>>>     "_mm:root:_system:cron3_0" : { }
>>>>>>>   },
>>>>>>>   "listFields" : {
>>>>>>>     "_mm:root:_system:cron3_0" : [ ]
>>>>>>>   }
>>>>>>> }
>>>>>>>
>>>>>>>
>>>>>>> ExternalView for _mm:root:_system:cron3:
>>>>>>> {
>>>>>>>   "id" : "_mm:root:_system:cron3",
>>>>>>>   "simpleFields" : {
>>>>>>>     "BUCKET_SIZE" : "0",
>>>>>>>     "DELAY_REBALANCE_ENABLED" : "true",
>>>>>>>     "IDEAL_STATE_MODE" : "AUTO_REBALANCE",
>>>>>>>     "MAX_PARTITIONS_PER_INSTANCE" : "1",
>>>>>>>     "NUM_PARTITIONS" : "1",
>>>>>>>     "REBALANCER_CLASS_NAME" : "org.apache.helix.controller.rebalancer.waged.WagedRebalancer",
>>>>>>>     "REBALANCE_DELAY" : "10000",
>>>>>>>     "REBALANCE_MODE" : "FULL_AUTO",
>>>>>>>     "REPLICAS" : "1",
>>>>>>>     "STATE_MODEL_DEF_REF" : "C8CEPStateModel"
>>>>>>>   },
>>>>>>>   *"mapFields" : {*
>>>>>>>   *  "_mm:root:_system:cron3_0" : {*
>>>>>>>   *    "c8cep-0.c8cep.c8.svc.cluster.local_12000" : "ONLINE"*
>>>>>>>   *  }*
>>>>>>>   *},*
>>>>>>>   "listFields" : { }
>>>>>>> }
>>>>>>>
>>>>>>>
>>>>>>> [1]:
>>>>>>> https://gist.github.com/grainier/aa1c0b279ea99f88d74c1e94d79f5cdb
>>>>>>>
>>>>>>> Thank you.
>>>>>>> Grainier Perera.
>>>>>>>
>>>>>>
>>
>> --
>> Junkai Xue
>>
>
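
For anyone following the capacity numbers quoted in this thread, here is a minimal sketch of the arithmetic only. It does not call Helix. It assumes that the capacity check amounts to "the summed partition weights on an instance must not exceed its capacity for any key"; the class name `CapacityCheck`, the method `maxPartitions`, and the rule for a missing capacity key are hypothetical and exist purely for illustration. The values are the ones shown above: instance capacity CPU=100 and MEMORY=100, per-resource partition weight CPU=10 and MEMORY=10.

```java
import java.util.Map;

// Back-of-the-envelope capacity check with the numbers quoted in the thread.
// Hypothetical helper, not part of Helix; it only reproduces the arithmetic.
class CapacityCheck {

    // Returns how many partitions of the given weight fit into the given
    // capacity, limited by the most constrained capacity key.
    static int maxPartitions(Map<String, Integer> capacity, Map<String, Integer> weight) {
        int fit = Integer.MAX_VALUE;
        for (Map.Entry<String, Integer> e : weight.entrySet()) {
            if (e.getValue() <= 0) {
                continue; // a zero-weight key never constrains placement
            }
            Integer cap = capacity.get(e.getKey());
            if (cap == null) {
                return 0; // assumption: a key with no declared capacity leaves no room
            }
            fit = Math.min(fit, cap / e.getValue());
        }
        return fit;
    }

    public static void main(String[] args) {
        // DEFAULT_INSTANCE_CAPACITY_MAP from the cluster config above.
        Map<String, Integer> instanceCapacity = Map.of("CPU", 100, "MEMORY", 100);
        // PARTITION_CAPACITY_MAP "DEFAULT" entry from the resource configs above.
        Map<String, Integer> partitionWeight = Map.of("CPU", 10, "MEMORY", 10);

        int fit = maxPartitions(instanceCapacity, partitionWeight);
        System.out.println("Partitions that fit on one instance: " + fit);
        System.out.println("Four single-partition resources fit: " + (fit >= 4));
    }
}
```

Under these assumptions a single instance has room for 10 such partitions, so capacity alone would not seem to explain why only two of the four single-partition resources get a mapping.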
