[
https://issues.apache.org/jira/browse/YARN-11924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18060986#comment-18060986
]
ASF GitHub Bot commented on YARN-11924:
---------------------------------------
ferdelyi opened a new pull request, #8222:
URL: https://github.com/apache/hadoop/pull/8222
…getZkData() and retry mechanism
If a "yarn resourcemanager -format-state-store" command is issued while one of
the RMs is starting and in the INIT state, there is a window (because of
YARN-11551) in which the /confstore/CONF_STORE path does not exist. The
getZkData method then returns a null value, causing the RM to fail. To prevent
this, add an existence check and a retry mechanism before giving up.
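
A minimal sketch of the check-then-retry idea, not the code in this patch. The `ZkReader` interface, the `readWithRetry` name, and the attempt and sleep parameters are assumptions made for illustration; the real change works against the ZK manager used by ZKConfigurationStore.

```java
import java.io.IOException;

public class ConfStoreReadRetry {

    /** Hypothetical stand-in for the two ZooKeeper calls involved; not the Hadoop API. */
    public interface ZkReader {
        boolean exists(String path) throws Exception;
        byte[] getData(String path) throws Exception;
    }

    /**
     * Reads the node at path, retrying while it is missing or returns null.
     * Gives up with an IOException after maxAttempts attempts.
     */
    public static byte[] readWithRetry(ZkReader zk, String path,
                                       int maxAttempts, long sleepMs) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            // Only read when the node is present, e.g. once a concurrent format has finished.
            if (zk.exists(path)) {
                byte[] data = zk.getData(path);
                if (data != null) {
                    return data;
                }
            }
            // Back off before the next attempt, but not after the last one.
            if (attempt < maxAttempts) {
                Thread.sleep(sleepMs);
            }
        }
        throw new IOException(path + " was not readable after " + maxAttempts + " attempts");
    }
}
```

The exists call and the read are not atomic, so the node can still disappear between them. A real implementation would also need to treat a NoNodeException from the read as a retryable condition.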
### Description of PR
Addresses a rare race condition that occurs when "yarn resourcemanager
-format-state-store" is issued while an RM is in the INITING state, after it
has initialized the confstore but right before it reads it. This change avoids
a null pointer exception.
### How was this patch tested?
Manually, with locks and a sleep introduced in the RM at the confstore format
step, so that while one RM is formatting the statestore, the other RM is in
the INIT state at the getZkData method, trying to read the confstore.
Also with added unit tests.
### For code changes:
- [x] Does the title of this PR start with the corresponding JIRA issue id
(e.g. 'HADOOP-17799. Your PR title ...')?
- [ ] Object storage: have the integration tests been executed and the
endpoint declared according to the connector-specific documentation?
- [ ] If adding new dependencies to the code, are these dependencies
licensed in a way that is compatible for inclusion under [ASF
2.0](http://www.apache.org/legal/resolved.html#category-a)?
- [ ] If applicable, have you updated the `LICENSE`, `LICENSE-binary`,
`NOTICE-binary` files?
### AI Tooling
If an AI tool was used:
- [ ] The PR includes the phrase "Contains content generated by <tool>"
where <tool> is the name of the AI tool used.
- [ ] My use of AI contributions follows the ASF legal policy
https://www.apache.org/legal/generative-tooling.html
> Add zkManager.exists(path) check to ZKConfigurationStore:getZkData() and
> retry mechanism
> ----------------------------------------------------------------------------------------
>
> Key: YARN-11924
> URL: https://issues.apache.org/jira/browse/YARN-11924
> Project: Hadoop YARN
> Issue Type: Bug
> Reporter: Ferenc Erdelyi
> Assignee: Ferenc Erdelyi
> Priority: Major
> Labels: pull-request-available
>
> If the 'yarn resourcemanager -format-conf-store' command is issued while one
> of the RMs is in a starting state, the RM may fail. This occurs because the
> /confstore/CONF_STORE path may not yet exist (because of YARN-11551).
> Alternatively, if the confstore is in the process of being written, the
> getZkData method returns a null value, causing the crash.
> To prevent this, a retry mechanism was added before giving up.
>
> {code:java}
> FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error starting ResourceManager
> org.apache.hadoop.service.ServiceStateException: org.apache.hadoop.yarn.exceptions.YarnException: Failed to initialize queues
>     at org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:105)
>     at org.apache.hadoop.service.AbstractService.init(AbstractService.java:173)
>     at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:108)
>     at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceInit(ResourceManager.java:875)
>     at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
>     at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.createAndInitActiveServices(ResourceManager.java:1293)
>     at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceInit(ResourceManager.java:334)
>     at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
>     at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1580)
> Caused by: org.apache.hadoop.yarn.exceptions.YarnException: Failed to initialize queues
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.initializeQueues(CapacityScheduler.java:738)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.initScheduler(CapacityScheduler.java:312)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.serviceInit(CapacityScheduler.java:403)
>     at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
>     ... 7 more
> Caused by: java.lang.IllegalStateException: Queue configuration missing child queue names for root
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacitySchedulerQueueManager.validateParent(CapacitySchedulerQueueManager.java:741)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacitySchedulerQueueManager.parseQueue(CapacitySchedulerQueueManager.java:255)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacitySchedulerQueueManager.initializeQueues(CapacitySchedulerQueueManager.java:177)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.initializeQueues(CapacityScheduler.java:729)
>     ... 10 more
> {code}
>
> As a troubleshooting step, I've added locks, and this is the point when I
> noticed that the underlying issue is an NPE:
>
> {code:java}
> 2026-01-26 16:19:54,961 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.conf.ZKConfigurationStore: ZK confstore is locked for reading. Readers can access it.
> 2026-01-26 16:19:54,962 ERROR org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.conf.ZKConfigurationStore: Failed to retrieve configuration from zookeeper store
> org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /confstore/CONF_STORE
>     at org.apache.zookeeper.KeeperException.create(KeeperException.java:118)
>     at org.apache.zookeeper.KeeperException.create(KeeperException.java:54)
>     at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1972)
>     at org.apache.curator.framework.imps.GetDataBuilderImpl$4.call(GetDataBuilderImpl.java:327)
>     at org.apache.curator.framework.imps.GetDataBuilderImpl$4.call(GetDataBuilderImpl.java:316)
>     at org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:93)
>     at org.apache.curator.framework.imps.GetDataBuilderImpl.pathInForeground(GetDataBuilderImpl.java:313)
>     at org.apache.curator.framework.imps.GetDataBuilderImpl.forPath(GetDataBuilderImpl.java:304)
>     at org.apache.curator.framework.imps.GetDataBuilderImpl.forPath(GetDataBuilderImpl.java:35)
>     at org.apache.hadoop.util.curator.ZKCuratorManager.getData(ZKCuratorManager.java:240)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.conf.ZKConfigurationStore.getZkData(ZKConfigurationStore.java:334)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.conf.ZKConfigurationStore.retrieve(ZKConfigurationStore.java:272)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.conf.MutableCSConfigurationProvider.init(MutableCSConfigurationProvider.java:83)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.initScheduler(CapacityScheduler.java:302)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.serviceInit(CapacityScheduler.java:413)
>     at org.apache.hadoop.service.AbstractService.init(AbstractService.java:165)
>     at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:110)
>     at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceInit(ResourceManager.java:995)
>     at org.apache.hadoop.service.AbstractService.init(AbstractService.java:165)
>     at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.createAndInitActiveServices(ResourceManager.java:1508)
>     at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceInit(ResourceManager.java:351)
>     at org.apache.hadoop.service.AbstractService.init(AbstractService.java:165)
>     at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1797)
> 2026-01-26 16:19:54,966 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.ConfigurationMutationACLPolicyFactory: Using ConfigurationMutationACLPolicy implementation - class org.apache.hadoop.yarn.server.resourcemanager.scheduler.DefaultConfigurationMutationACLPolicy
> 2026-01-26 16:19:54,966 INFO org.apache.hadoop.service.AbstractService: Service org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler failed in state INITED
> java.lang.NullPointerException: Cannot enter synchronized block because "other" is null
>     at org.apache.hadoop.conf.Configuration.<init>(Configuration.java:845)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.conf.MutableCSConfigurationProvider.loadConfiguration(MutableCSConfigurationProvider.java:102)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.initScheduler(CapacityScheduler.java:303)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.serviceInit(CapacityScheduler.java:413)
>     at org.apache.hadoop.service.AbstractService.init(AbstractService.java:165)
>     at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:110)
>     at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceInit(ResourceManager.java:995)
>     at org.apache.hadoop.service.AbstractService.init(AbstractService.java:165)
>     at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.createAndInitActiveServices(ResourceManager.java:1508)
>     at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceInit(ResourceManager.java:351)
>     at org.apache.hadoop.service.AbstractService.init(AbstractService.java:165)
>     at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1797)
> {code}
--
This message was sent by Atlassian Jira
(v8.20.10#820010)