[ 
https://issues.apache.org/jira/browse/YARN-11924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18055562#comment-18055562
 ] 

ASF GitHub Bot commented on YARN-11924:
---------------------------------------

ferdelyi commented on code in PR #8222:
URL: https://github.com/apache/hadoop/pull/8222#discussion_r2748289875


##########
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/conf/TestZKConfigurationStore.java:
##########
@@ -198,7 +198,27 @@ public void testDisableAuditLogs() throws Exception {
     prepareLogMutation("key1", "val1");
 
     data = ((ZKConfigurationStore) confStore).getZkData(logsPath);
-    assertNull(data, "Failed to Disable Audit Logs");
+    assertEquals("Failed to Disable Audit Logs", 0, data.length);
+  }
+
+  @Test
+  public void testZkData() throws Exception {
+    conf.setInt(YarnConfiguration.RM_SCHEDCONF_STORE_ZK_READ_RETRY_SECS, 0);
+    confStore.initialize(conf, schedConf, rmContext);
+    String confStorePath = getZkPath("CONF_STORE");
+    byte[] data = ((ZKConfigurationStore) confStore).getZkData(confStorePath);
+    assertTrue("ConfStore data expected to be not null.", data.length > 0);
+  }
+
+  @Test
+  public void testEmptyZkData() throws Exception {
+    conf.setInt(YarnConfiguration.RM_SCHEDCONF_STORE_ZK_READ_RETRY_SECS, 0);
+    confStore.initialize(conf, schedConf, rmContext);

Review Comment:
   Sure, added. Thank you for the suggestion!





> Add zkManager.exists(path) check to ZKConfigurationStore:getZkData() and 
> retry mechanism
> ----------------------------------------------------------------------------------------
>
>                 Key: YARN-11924
>                 URL: https://issues.apache.org/jira/browse/YARN-11924
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Ferenc Erdelyi
>            Assignee: Ferenc Erdelyi
>            Priority: Major
>              Labels: pull-request-available
>
> Should a "yarn resourcemanager -format-state-store" command be issued while 
> one of the RM is starting and in the INIT state (because of YARN-11551), 
> there is a time period when the /confstore/CONF_STORE path does not exist, 
> hence the getZkData method returns a null value, causing the RM to fail. To 
> prevent this, add a check and re-try mechanism before giving up.
>  
> {code:java}
> FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error 
> starting ResourceManagerorg.apache.hadoop.service.ServiceStateException: 
> org.apache.hadoop.yarn.exceptions.YarnException: Failed to initialize queues  
>     at 
> org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:105)
>       at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:173)     
> at 
> org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:108)
>     at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceInit(ResourceManager.java:875)
>  at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)  
>    at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.createAndInitActiveServices(ResourceManager.java:1293)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceInit(ResourceManager.java:334)
>   at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164) 
>     at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1580)Caused
>  by: org.apache.hadoop.yarn.exceptions.YarnException: Failed to initialize 
> queues at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.initializeQueues(CapacityScheduler.java:738)
>       at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.initScheduler(CapacityScheduler.java:312)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.serviceInit(CapacityScheduler.java:403)
>    at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)     
> ... 7 moreCaused by: java.lang.IllegalStateException: Queue configuration 
> missing child queue names for root    at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacitySchedulerQueueManager.validateParent(CapacitySchedulerQueueManager.java:741)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacitySchedulerQueueManager.parseQueue(CapacitySchedulerQueueManager.java:255)
>     at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacitySchedulerQueueManager.initializeQueues(CapacitySchedulerQueueManager.java:177)
>       at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.initializeQueues(CapacityScheduler.java:729)
>       ... 10 more {code}
>  
> As a troubleshooting step, I've added locks, and this is the point when 
> noticed that the underlying issue is an NPE:
>  
> {code:java}
> 2026-01-26 16:19:54,961 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.conf.ZKConfigurationStore:
>  ZK confstore is locked for reading. Readers can access it.2026-01-26 
> 16:19:54,962 ERROR 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.conf.ZKConfigurationStore:
>  Failed to retrieve configuration from zookeeper 
> storeorg.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = 
> NoNode for /confstore/CONF_STORE    at 
> org.apache.zookeeper.KeeperException.create(KeeperException.java:118)        
> at org.apache.zookeeper.KeeperException.create(KeeperException.java:54) at 
> org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1972)  at 
> org.apache.curator.framework.imps.GetDataBuilderImpl$4.call(GetDataBuilderImpl.java:327)
>      at 
> org.apache.curator.framework.imps.GetDataBuilderImpl$4.call(GetDataBuilderImpl.java:316)
>      at org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:93)        
> at 
> org.apache.curator.framework.imps.GetDataBuilderImpl.pathInForeground(GetDataBuilderImpl.java:313)
>    at 
> org.apache.curator.framework.imps.GetDataBuilderImpl.forPath(GetDataBuilderImpl.java:304)
>     at 
> org.apache.curator.framework.imps.GetDataBuilderImpl.forPath(GetDataBuilderImpl.java:35)
>      at 
> org.apache.hadoop.util.curator.ZKCuratorManager.getData(ZKCuratorManager.java:240)
>    at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.conf.ZKConfigurationStore.getZkData(ZKConfigurationStore.java:334)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.conf.ZKConfigurationStore.retrieve(ZKConfigurationStore.java:272)
>    at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.conf.MutableCSConfigurationProvider.init(MutableCSConfigurationProvider.java:83)
>     at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.initScheduler(CapacityScheduler.java:302)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.serviceInit(CapacityScheduler.java:413)
>    at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:165)     
> at 
> org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:110)
>     at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceInit(ResourceManager.java:995)
>  at org.apache.hadoop.service.AbstractService.init(AbstractService.java:165)  
>    at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.createAndInitActiveServices(ResourceManager.java:1508)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceInit(ResourceManager.java:351)
>   at org.apache.hadoop.service.AbstractService.init(AbstractService.java:165) 
>     at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1797)2026-01-26
>  16:19:54,966 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.ConfigurationMutationACLPolicyFactory:
>  Using ConfigurationMutationACLPolicy implementation - class 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.DefaultConfigurationMutationACLPolicy2026-01-26
>  16:19:54,966 INFO org.apache.hadoop.service.AbstractService: Service 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler
>  failed in state INITEDjava.lang.NullPointerException: Cannot enter 
> synchronized block because "other" is null   at 
> org.apache.hadoop.conf.Configuration.<init>(Configuration.java:845)  at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.conf.MutableCSConfigurationProvider.loadConfiguration(MutableCSConfigurationProvider.java:102)
>       at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.initScheduler(CapacityScheduler.java:303)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.serviceInit(CapacityScheduler.java:413)
>    at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:165)     
> at 
> org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:110)
>     at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceInit(ResourceManager.java:995)
>  at org.apache.hadoop.service.AbstractService.init(AbstractService.java:165)  
>    at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.createAndInitActiveServices(ResourceManager.java:1508)
>  at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceInit(ResourceManager.java:351)
>   at org.apache.hadoop.service.AbstractService.init(AbstractService.java:165) 
>     at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1797)
>  {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to