[ https://issues.apache.org/jira/browse/YARN-11924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ferenc Erdelyi updated YARN-11924:
----------------------------------
    Description: 
If a "yarn resourcemanager -format-state-store" command is issued while one of 
the RMs is starting and in the INIT state (because of YARN-11551), there is a 
time window during which the /confstore/CONF_STORE path does not exist. The 
getZkData method therefore returns null, causing the RM to fail. To prevent 
this, add an existence check and a retry mechanism before giving up.
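
The check-and-retry idea can be sketched roughly as follows. This is only an illustration under stated assumptions, not the actual patch: the Store interface, the getDataWithRetry helper, and the attempt count and sleep interval are all hypothetical stand-ins, and the real change would call into the RM's ZooKeeper manager instead.

```java
/**
 * Hypothetical sketch, not the actual YARN-11924 patch: wait for a path to
 * exist before reading it, and give up after a bounded number of attempts.
 */
final class ExistsRetrySketch {

    /** Stand-in for the store; only these two calls matter for the sketch. */
    interface Store {
        boolean exists(String path);
        byte[] getData(String path);
    }

    /** Polls exists(path) up to maxAttempts times, sleeping between attempts. */
    static byte[] getDataWithRetry(Store store, String path,
                                   int maxAttempts, long sleepMillis) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (store.exists(path)) {
                return store.getData(path);
            }
            if (attempt < maxAttempts) {
                try {
                    Thread.sleep(sleepMillis);
                } catch (InterruptedException e) {
                    // Restore the interrupt flag and stop waiting.
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException(
                        "Interrupted while waiting for " + path, e);
                }
            }
        }
        // Fail with a clear message instead of handing null to the caller.
        throw new IllegalStateException(
            path + " still missing after " + maxAttempts + " attempts");
    }

    public static void main(String[] args) {
        final int[] checks = {0};
        // Fake store whose path only "appears" on the third existence check.
        Store flaky = new Store() {
            @Override public boolean exists(String path) { return ++checks[0] >= 3; }
            @Override public byte[] getData(String path) { return new byte[] {1, 2, 3}; }
        };
        byte[] data = getDataWithRetry(flaky, "/confstore/CONF_STORE", 5, 10L);
        System.out.println("read " + data.length + " bytes after " + checks[0] + " checks");
    }
}
```

Note that an existence check followed by a read can still race with a concurrent format, so a real implementation would probably also need to handle a missing node on the read itself.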

 
{code:java}
FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error starting ResourceManager
org.apache.hadoop.service.ServiceStateException: org.apache.hadoop.yarn.exceptions.YarnException: Failed to initialize queues
    at org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:105)
    at org.apache.hadoop.service.AbstractService.init(AbstractService.java:173)
    at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:108)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceInit(ResourceManager.java:875)
    at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.createAndInitActiveServices(ResourceManager.java:1293)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceInit(ResourceManager.java:334)
    at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1580)
Caused by: org.apache.hadoop.yarn.exceptions.YarnException: Failed to initialize queues
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.initializeQueues(CapacityScheduler.java:738)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.initScheduler(CapacityScheduler.java:312)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.serviceInit(CapacityScheduler.java:403)
    at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
    ... 7 more
Caused by: java.lang.IllegalStateException: Queue configuration missing child queue names for root
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacitySchedulerQueueManager.validateParent(CapacitySchedulerQueueManager.java:741)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacitySchedulerQueueManager.parseQueue(CapacitySchedulerQueueManager.java:255)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacitySchedulerQueueManager.initializeQueues(CapacitySchedulerQueueManager.java:177)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.initializeQueues(CapacityScheduler.java:729)
    ... 10 more
{code}
 

As a troubleshooting step, I added locks, at which point I noticed that the 
underlying issue is an NPE:

 
{code:java}
2026-01-26 16:19:54,961 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.conf.ZKConfigurationStore: ZK confstore is locked for reading. Readers can access it.
2026-01-26 16:19:54,962 ERROR org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.conf.ZKConfigurationStore: Failed to retrieve configuration from zookeeper store
org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /confstore/CONF_STORE
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:118)
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:54)
    at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1972)
    at org.apache.curator.framework.imps.GetDataBuilderImpl$4.call(GetDataBuilderImpl.java:327)
    at org.apache.curator.framework.imps.GetDataBuilderImpl$4.call(GetDataBuilderImpl.java:316)
    at org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:93)
    at org.apache.curator.framework.imps.GetDataBuilderImpl.pathInForeground(GetDataBuilderImpl.java:313)
    at org.apache.curator.framework.imps.GetDataBuilderImpl.forPath(GetDataBuilderImpl.java:304)
    at org.apache.curator.framework.imps.GetDataBuilderImpl.forPath(GetDataBuilderImpl.java:35)
    at org.apache.hadoop.util.curator.ZKCuratorManager.getData(ZKCuratorManager.java:240)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.conf.ZKConfigurationStore.getZkData(ZKConfigurationStore.java:334)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.conf.ZKConfigurationStore.retrieve(ZKConfigurationStore.java:272)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.conf.MutableCSConfigurationProvider.init(MutableCSConfigurationProvider.java:83)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.initScheduler(CapacityScheduler.java:302)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.serviceInit(CapacityScheduler.java:413)
    at org.apache.hadoop.service.AbstractService.init(AbstractService.java:165)
    at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:110)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceInit(ResourceManager.java:995)
    at org.apache.hadoop.service.AbstractService.init(AbstractService.java:165)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.createAndInitActiveServices(ResourceManager.java:1508)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceInit(ResourceManager.java:351)
    at org.apache.hadoop.service.AbstractService.init(AbstractService.java:165)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1797)
2026-01-26 16:19:54,966 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.ConfigurationMutationACLPolicyFactory: Using ConfigurationMutationACLPolicy implementation - class org.apache.hadoop.yarn.server.resourcemanager.scheduler.DefaultConfigurationMutationACLPolicy
2026-01-26 16:19:54,966 INFO org.apache.hadoop.service.AbstractService: Service org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler failed in state INITED
java.lang.NullPointerException: Cannot enter synchronized block because "other" is null
    at org.apache.hadoop.conf.Configuration.<init>(Configuration.java:845)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.conf.MutableCSConfigurationProvider.loadConfiguration(MutableCSConfigurationProvider.java:102)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.initScheduler(CapacityScheduler.java:303)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.serviceInit(CapacityScheduler.java:413)
    at org.apache.hadoop.service.AbstractService.init(AbstractService.java:165)
    at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:110)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceInit(ResourceManager.java:995)
    at org.apache.hadoop.service.AbstractService.init(AbstractService.java:165)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.createAndInitActiveServices(ResourceManager.java:1508)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceInit(ResourceManager.java:351)
    at org.apache.hadoop.service.AbstractService.init(AbstractService.java:165)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1797)
{code}
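
To illustrate why a null coming back from the store surfaces as this particular NPE, here is a small self-contained sketch. It assumes, based only on the trace above, that the failing constructor is a copy constructor that synchronizes on its "other" argument. The Conf class and the retrieve and loadGuarded helpers are hypothetical stand-ins, not Hadoop code.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in showing how a null source breaks a locking copy constructor. */
final class CopyCtorNpeSketch {

    /** Toy configuration class; only the copy constructor matters here. */
    static final class Conf {
        private final Map<String, String> props = new HashMap<>();

        Conf() {}

        Conf(Conf other) {
            // synchronized (null) throws NullPointerException before any copying.
            synchronized (other) {
                props.putAll(other.props);
            }
        }

        void set(String key, String value) { props.put(key, value); }
        String get(String key) { return props.get(key); }
    }

    /** Stand-in for a retrieve() that yields null while the stored node is missing. */
    static Conf retrieve(boolean nodeExists) {
        if (!nodeExists) {
            return null;
        }
        Conf conf = new Conf();
        conf.set("k", "v");
        return conf;
    }

    /** Guarded load: reject a null result with a clear error instead of an NPE. */
    static Conf loadGuarded(boolean nodeExists) {
        Conf stored = retrieve(nodeExists);
        if (stored == null) {
            throw new IllegalStateException("Stored configuration is not available yet");
        }
        return new Conf(stored);
    }

    public static void main(String[] args) {
        try {
            new Conf(retrieve(false)); // unguarded path: null reaches the constructor
        } catch (NullPointerException e) {
            System.out.println("unguarded copy failed with NullPointerException");
        }
        System.out.println("guarded copy read k=" + loadGuarded(true).get("k"));
    }
}
```

The guard only turns the failure into a clearer error. Avoiding the failure altogether still requires waiting for the path to reappear, as the description proposes.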


> Add zkManager.exists(path) check to ZKConfigurationStore:getZkData() and 
> retry mechanism
> ----------------------------------------------------------------------------------------
>
>                 Key: YARN-11924
>                 URL: https://issues.apache.org/jira/browse/YARN-11924
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Ferenc Erdelyi
>            Assignee: Ferenc Erdelyi
>            Priority: Major
>              Labels: pull-request-available


