[ 
https://issues.apache.org/jira/browse/YARN-11924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18055559#comment-18055559
 ] 

ASF GitHub Bot commented on YARN-11924:
---------------------------------------

ferdelyi commented on code in PR #8222:
URL: https://github.com/apache/hadoop/pull/8222#discussion_r2748271721


##########
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/conf/ZKConfigurationStore.java:
##########
@@ -263,10 +264,46 @@ private boolean createNewZkPath(String path) throws 
Exception {
     }
   }
 
+  /*
   @VisibleForTesting
   protected byte[] getZkData(String path) throws Exception {
     return zkManager.getData(path);
   }
+*/
+
+  @VisibleForTesting
+  protected byte[] getZkData(String path) throws Exception {
+    // Should a 'yarn resourcemanager -format-state-store' command is issued
+    // while one of the RM is in a starting state, there is a time period
+    // when the /confstore/CONF_STORE path does not exist, hence the
+    // getZkData method returns a null value causing the RM to fail.
+    // To prevent this, added a re-try mechanism before giving up.
+    int maxRetries = 6;
+    int attempt = 1;
+    int sleepBetweenRetries = conf.getInt(
+        YarnConfiguration.RM_SCHEDCONF_STORE_ZK_READ_RETRY_SECS,
+        YarnConfiguration.DEFAULT_RM_SCHEDCONF_STORE_ZK_READ_RETRY_SECS);
+
+    while (attempt < maxRetries) {
+
+        if(zkManager.exists(path)) {
+          LOG.debug("zkManager.exists(path) {} exists.", path);
+          byte[] zkData = zkManager.getData(path);
+          // If we found data, return immediately

Review Comment:
   You are right, thank you!





> Add zkManager.exists(path) check to ZKConfigurationStore:getZkData() and 
> retry mechanism
> ----------------------------------------------------------------------------------------
>
>                 Key: YARN-11924
>                 URL: https://issues.apache.org/jira/browse/YARN-11924
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Ferenc Erdelyi
>            Assignee: Ferenc Erdelyi
>            Priority: Major
>              Labels: pull-request-available
>
> Should a "yarn resourcemanager -format-state-store" command be issued while 
> one of the RM is starting and in the INIT state (because of YARN-11551), 
> there is a time period when the /confstore/CONF_STORE path does not exist, 
> hence the getZkData method returns a null value, causing the RM to fail. To 
> prevent this, add a check and re-try mechanism before giving up.
>  
> {code:java}
> FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error 
> starting ResourceManagerorg.apache.hadoop.service.ServiceStateException: 
> org.apache.hadoop.yarn.exceptions.YarnException: Failed to initialize queues  
>     at 
> org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:105)
>       at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:173)     
> at 
> org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:108)
>     at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceInit(ResourceManager.java:875)
>  at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)  
>    at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.createAndInitActiveServices(ResourceManager.java:1293)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceInit(ResourceManager.java:334)
>   at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164) 
>     at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1580)Caused
>  by: org.apache.hadoop.yarn.exceptions.YarnException: Failed to initialize 
> queues at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.initializeQueues(CapacityScheduler.java:738)
>       at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.initScheduler(CapacityScheduler.java:312)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.serviceInit(CapacityScheduler.java:403)
>    at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)     
> ... 7 moreCaused by: java.lang.IllegalStateException: Queue configuration 
> missing child queue names for root    at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacitySchedulerQueueManager.validateParent(CapacitySchedulerQueueManager.java:741)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacitySchedulerQueueManager.parseQueue(CapacitySchedulerQueueManager.java:255)
>     at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacitySchedulerQueueManager.initializeQueues(CapacitySchedulerQueueManager.java:177)
>       at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.initializeQueues(CapacityScheduler.java:729)
>       ... 10 more {code}
>  
> As a troubleshooting step, I've added locks, and this is the point when 
> noticed that the underlying issue is an NPE:
>  
> {code:java}
> 2026-01-26 16:19:54,961 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.conf.ZKConfigurationStore:
>  ZK confstore is locked for reading. Readers can access it.2026-01-26 
> 16:19:54,962 ERROR 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.conf.ZKConfigurationStore:
>  Failed to retrieve configuration from zookeeper 
> storeorg.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = 
> NoNode for /confstore/CONF_STORE    at 
> org.apache.zookeeper.KeeperException.create(KeeperException.java:118)        
> at org.apache.zookeeper.KeeperException.create(KeeperException.java:54) at 
> org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1972)  at 
> org.apache.curator.framework.imps.GetDataBuilderImpl$4.call(GetDataBuilderImpl.java:327)
>      at 
> org.apache.curator.framework.imps.GetDataBuilderImpl$4.call(GetDataBuilderImpl.java:316)
>      at org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:93)        
> at 
> org.apache.curator.framework.imps.GetDataBuilderImpl.pathInForeground(GetDataBuilderImpl.java:313)
>    at 
> org.apache.curator.framework.imps.GetDataBuilderImpl.forPath(GetDataBuilderImpl.java:304)
>     at 
> org.apache.curator.framework.imps.GetDataBuilderImpl.forPath(GetDataBuilderImpl.java:35)
>      at 
> org.apache.hadoop.util.curator.ZKCuratorManager.getData(ZKCuratorManager.java:240)
>    at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.conf.ZKConfigurationStore.getZkData(ZKConfigurationStore.java:334)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.conf.ZKConfigurationStore.retrieve(ZKConfigurationStore.java:272)
>    at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.conf.MutableCSConfigurationProvider.init(MutableCSConfigurationProvider.java:83)
>     at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.initScheduler(CapacityScheduler.java:302)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.serviceInit(CapacityScheduler.java:413)
>    at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:165)     
> at 
> org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:110)
>     at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceInit(ResourceManager.java:995)
>  at org.apache.hadoop.service.AbstractService.init(AbstractService.java:165)  
>    at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.createAndInitActiveServices(ResourceManager.java:1508)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceInit(ResourceManager.java:351)
>   at org.apache.hadoop.service.AbstractService.init(AbstractService.java:165) 
>     at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1797)2026-01-26
>  16:19:54,966 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.ConfigurationMutationACLPolicyFactory:
>  Using ConfigurationMutationACLPolicy implementation - class 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.DefaultConfigurationMutationACLPolicy2026-01-26
>  16:19:54,966 INFO org.apache.hadoop.service.AbstractService: Service 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler
>  failed in state INITEDjava.lang.NullPointerException: Cannot enter 
> synchronized block because "other" is null   at 
> org.apache.hadoop.conf.Configuration.<init>(Configuration.java:845)  at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.conf.MutableCSConfigurationProvider.loadConfiguration(MutableCSConfigurationProvider.java:102)
>       at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.initScheduler(CapacityScheduler.java:303)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.serviceInit(CapacityScheduler.java:413)
>    at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:165)     
> at 
> org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:110)
>     at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceInit(ResourceManager.java:995)
>  at org.apache.hadoop.service.AbstractService.init(AbstractService.java:165)  
>    at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.createAndInitActiveServices(ResourceManager.java:1508)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceInit(ResourceManager.java:351)
>   at org.apache.hadoop.service.AbstractService.init(AbstractService.java:165) 
>     at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1797)
>  {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to