[ 
https://issues.apache.org/jira/browse/YARN-11924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18055509#comment-18055509
 ] 

ASF GitHub Bot commented on YARN-11924:
---------------------------------------

Hean-Chhinling commented on code in PR #8222:
URL: https://github.com/apache/hadoop/pull/8222#discussion_r2747564944


##########
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/conf/ZKConfigurationStore.java:
##########
@@ -263,10 +264,46 @@ private boolean createNewZkPath(String path) throws 
Exception {
     }
   }
 
+  /*
   @VisibleForTesting
   protected byte[] getZkData(String path) throws Exception {
     return zkManager.getData(path);
   }
+*/
+
+  @VisibleForTesting
+  protected byte[] getZkData(String path) throws Exception {
+    // Should a 'yarn resourcemanager -format-state-store' command is issued
+    // while one of the RM is in a starting state, there is a time period
+    // when the /confstore/CONF_STORE path does not exist, hence the
+    // getZkData method returns a null value causing the RM to fail.
+    // To prevent this, added a re-try mechanism before giving up.
+    int maxRetries = 6;

Review Comment:
   What do you think of making the Max retries configurable like the 
sleepBetweenRetries?



##########
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/conf/ZKConfigurationStore.java:
##########
@@ -263,10 +264,46 @@ private boolean createNewZkPath(String path) throws 
Exception {
     }
   }
 
+  /*

Review Comment:
   I think we could delete the old getZkData method here instead of commented 
it out



##########
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/conf/ZKConfigurationStore.java:
##########
@@ -263,10 +264,46 @@ private boolean createNewZkPath(String path) throws 
Exception {
     }
   }
 
+  /*
   @VisibleForTesting
   protected byte[] getZkData(String path) throws Exception {
     return zkManager.getData(path);
   }
+*/
+
+  @VisibleForTesting
+  protected byte[] getZkData(String path) throws Exception {
+    // Should a 'yarn resourcemanager -format-state-store' command is issued
+    // while one of the RM is in a starting state, there is a time period
+    // when the /confstore/CONF_STORE path does not exist, hence the
+    // getZkData method returns a null value causing the RM to fail.
+    // To prevent this, added a re-try mechanism before giving up.
+    int maxRetries = 6;
+    int attempt = 1;
+    int sleepBetweenRetries = conf.getInt(
+        YarnConfiguration.RM_SCHEDCONF_STORE_ZK_READ_RETRY_SECS,
+        YarnConfiguration.DEFAULT_RM_SCHEDCONF_STORE_ZK_READ_RETRY_SECS);
+
+    while (attempt < maxRetries) {
+
+        if(zkManager.exists(path)) {
+          LOG.debug("zkManager.exists(path) {} exists.", path);
+          byte[] zkData = zkManager.getData(path);
+          // If we found data, return immediately

Review Comment:
   Maybe this comment is not needed and a little misleading.
   Because we also check if the data is not null and not empty?



##########
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java:
##########
@@ -1007,6 +1007,16 @@ public static boolean isAclEnabled(Configuration conf) {
   public static final String DEFAULT_RM_SCHEDCONF_STORE_ZK_PARENT_PATH =
       "/confstore";
 
+  /** Retry period for  ZKConfigurationStore will create znodes. */
+  @Private
+  @Unstable

Review Comment:
   I am just curious here.
   What does the "unstable" annotation means here and why do we it?



##########
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/conf/TestZKConfigurationStore.java:
##########
@@ -198,7 +198,27 @@ public void testDisableAuditLogs() throws Exception {
     prepareLogMutation("key1", "val1");
 
     data = ((ZKConfigurationStore) confStore).getZkData(logsPath);
-    assertNull(data, "Failed to Disable Audit Logs");
+    assertEquals("Failed to Disable Audit Logs", 0, data.length);
+  }
+
+  @Test
+  public void testZkData() throws Exception {
+    conf.setInt(YarnConfiguration.RM_SCHEDCONF_STORE_ZK_READ_RETRY_SECS, 0);
+    confStore.initialize(conf, schedConf, rmContext);
+    String confStorePath = getZkPath("CONF_STORE");
+    byte[] data = ((ZKConfigurationStore) confStore).getZkData(confStorePath);
+    assertTrue("ConfStore data expected to be not null.", data.length > 0);
+  }
+
+  @Test
+  public void testEmptyZkData() throws Exception {
+    conf.setInt(YarnConfiguration.RM_SCHEDCONF_STORE_ZK_READ_RETRY_SECS, 0);
+    confStore.initialize(conf, schedConf, rmContext);

Review Comment:
   What do you think of adding some data here before calling format? 
   So that we make sure the data is empty?



##########
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/conf/TestZKConfigurationStore.java:
##########
@@ -198,7 +198,27 @@ public void testDisableAuditLogs() throws Exception {
     prepareLogMutation("key1", "val1");
 
     data = ((ZKConfigurationStore) confStore).getZkData(logsPath);
-    assertNull(data, "Failed to Disable Audit Logs");
+    assertEquals("Failed to Disable Audit Logs", 0, data.length);
+  }
+
+  @Test
+  public void testZkData() throws Exception {
+    conf.setInt(YarnConfiguration.RM_SCHEDCONF_STORE_ZK_READ_RETRY_SECS, 0);
+    confStore.initialize(conf, schedConf, rmContext);
+    String confStorePath = getZkPath("CONF_STORE");
+    byte[] data = ((ZKConfigurationStore) confStore).getZkData(confStorePath);
+    assertTrue("ConfStore data expected to be not null.", data.length > 0);

Review Comment:
   I think it would be nice to add the configuration data here.
   Because it seems like we are checking against empty configuration 
[here](https://github.com/apache/hadoop/blob/90150aff3f84183cebaf21e977834b8a77e8ede7/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/conf/ConfigurationStoreBaseTest.java#L50)
 
   
   E.g. setting data 
([code](https://github.com/apache/hadoop/blob/90150aff3f84183cebaf21e977834b8a77e8ede7/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/conf/TestZKConfigurationStore.java#L138))



##########
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/conf/ZKConfigurationStore.java:
##########
@@ -263,10 +264,46 @@ private boolean createNewZkPath(String path) throws 
Exception {
     }
   }
 
+  /*
   @VisibleForTesting
   protected byte[] getZkData(String path) throws Exception {
     return zkManager.getData(path);
   }
+*/
+
+  @VisibleForTesting
+  protected byte[] getZkData(String path) throws Exception {
+    // Should a 'yarn resourcemanager -format-state-store' command is issued
+    // while one of the RM is in a starting state, there is a time period
+    // when the /confstore/CONF_STORE path does not exist, hence the

Review Comment:
   The issue also occur when the /confstore/CONF_STORE path exists but the data 
is not fully written this also cause another RM to fail without the 
root.default queue. Thus retry and make sure the data is not empty.
   I think we could add the followings:
   
   "when the /confstore/CONF_STORE path does exists or when the data is not 
fully written"





> Add zkManager.exists(path) check to ZKConfigurationStore:getZkData() and 
> retry mechanism
> ----------------------------------------------------------------------------------------
>
>                 Key: YARN-11924
>                 URL: https://issues.apache.org/jira/browse/YARN-11924
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Ferenc Erdelyi
>            Assignee: Ferenc Erdelyi
>            Priority: Major
>              Labels: pull-request-available
>
> Should a "yarn resourcemanager -format-state-store" command be issued while 
> one of the RM is starting and in the INIT state (because of YARN-11551), 
> there is a time period when the /confstore/CONF_STORE path does not exist, 
> hence the getZkData method returns a null value, causing the RM to fail. To 
> prevent this, add a check and re-try mechanism before giving up.
>  
> {code:java}
> FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error 
> starting ResourceManagerorg.apache.hadoop.service.ServiceStateException: 
> org.apache.hadoop.yarn.exceptions.YarnException: Failed to initialize queues  
>     at 
> org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:105)
>       at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:173)     
> at 
> org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:108)
>     at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceInit(ResourceManager.java:875)
>  at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)  
>    at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.createAndInitActiveServices(ResourceManager.java:1293)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceInit(ResourceManager.java:334)
>   at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164) 
>     at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1580)Caused
>  by: org.apache.hadoop.yarn.exceptions.YarnException: Failed to initialize 
> queues at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.initializeQueues(CapacityScheduler.java:738)
>       at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.initScheduler(CapacityScheduler.java:312)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.serviceInit(CapacityScheduler.java:403)
>    at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)     
> ... 7 moreCaused by: java.lang.IllegalStateException: Queue configuration 
> missing child queue names for root    at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacitySchedulerQueueManager.validateParent(CapacitySchedulerQueueManager.java:741)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacitySchedulerQueueManager.parseQueue(CapacitySchedulerQueueManager.java:255)
>     at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacitySchedulerQueueManager.initializeQueues(CapacitySchedulerQueueManager.java:177)
>       at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.initializeQueues(CapacityScheduler.java:729)
>       ... 10 more {code}
>  
> As a troubleshooting step, I've added locks, and this is the point when 
> noticed that the underlying issue is an NPE:
>  
> {code:java}
> 2026-01-26 16:19:54,961 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.conf.ZKConfigurationStore:
>  ZK confstore is locked for reading. Readers can access it.2026-01-26 
> 16:19:54,962 ERROR 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.conf.ZKConfigurationStore:
>  Failed to retrieve configuration from zookeeper 
> storeorg.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = 
> NoNode for /confstore/CONF_STORE    at 
> org.apache.zookeeper.KeeperException.create(KeeperException.java:118)        
> at org.apache.zookeeper.KeeperException.create(KeeperException.java:54) at 
> org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1972)  at 
> org.apache.curator.framework.imps.GetDataBuilderImpl$4.call(GetDataBuilderImpl.java:327)
>      at 
> org.apache.curator.framework.imps.GetDataBuilderImpl$4.call(GetDataBuilderImpl.java:316)
>      at org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:93)        
> at 
> org.apache.curator.framework.imps.GetDataBuilderImpl.pathInForeground(GetDataBuilderImpl.java:313)
>    at 
> org.apache.curator.framework.imps.GetDataBuilderImpl.forPath(GetDataBuilderImpl.java:304)
>     at 
> org.apache.curator.framework.imps.GetDataBuilderImpl.forPath(GetDataBuilderImpl.java:35)
>      at 
> org.apache.hadoop.util.curator.ZKCuratorManager.getData(ZKCuratorManager.java:240)
>    at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.conf.ZKConfigurationStore.getZkData(ZKConfigurationStore.java:334)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.conf.ZKConfigurationStore.retrieve(ZKConfigurationStore.java:272)
>    at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.conf.MutableCSConfigurationProvider.init(MutableCSConfigurationProvider.java:83)
>     at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.initScheduler(CapacityScheduler.java:302)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.serviceInit(CapacityScheduler.java:413)
>    at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:165)     
> at 
> org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:110)
>     at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceInit(ResourceManager.java:995)
>  at org.apache.hadoop.service.AbstractService.init(AbstractService.java:165)  
>    at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.createAndInitActiveServices(ResourceManager.java:1508)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceInit(ResourceManager.java:351)
>   at org.apache.hadoop.service.AbstractService.init(AbstractService.java:165) 
>     at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1797)2026-01-26
>  16:19:54,966 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.ConfigurationMutationACLPolicyFactory:
>  Using ConfigurationMutationACLPolicy implementation - class 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.DefaultConfigurationMutationACLPolicy2026-01-26
>  16:19:54,966 INFO org.apache.hadoop.service.AbstractService: Service 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler
>  failed in state INITEDjava.lang.NullPointerException: Cannot enter 
> synchronized block because "other" is null   at 
> org.apache.hadoop.conf.Configuration.<init>(Configuration.java:845)  at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.conf.MutableCSConfigurationProvider.loadConfiguration(MutableCSConfigurationProvider.java:102)
>       at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.initScheduler(CapacityScheduler.java:303)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.serviceInit(CapacityScheduler.java:413)
>    at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:165)     
> at 
> org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:110)
>     at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceInit(ResourceManager.java:995)
>  at org.apache.hadoop.service.AbstractService.init(AbstractService.java:165)  
>    at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.createAndInitActiveServices(ResourceManager.java:1508)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceInit(ResourceManager.java:351)
>   at org.apache.hadoop.service.AbstractService.init(AbstractService.java:165) 
>     at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1797)
>  {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to