[
https://issues.apache.org/jira/browse/YARN-11924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18055470#comment-18055470
]
ASF GitHub Bot commented on YARN-11924:
---------------------------------------
ferdelyi opened a new pull request, #8222:
URL: https://github.com/apache/hadoop/pull/8222
…getZkData() and retry mechanism
If a "yarn resourcemanager -format-state-store" command is issued while
one of the RMs is starting and is in the INIT state (because of YARN-11551),
there is a time window in which the /confstore/CONF_STORE path does not exist,
so the getZkData method returns null and the RM fails. To prevent this,
add an existence check and a retry mechanism before giving up.
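
The check-and-retry idea described above can be sketched roughly as follows. This is a minimal illustration, not the code in the pull request: the `ZkReader` interface, the `getZkDataWithRetry` helper, the attempt count, the sleep interval, and the exception type are all assumptions made for the example, and the fake client in `main` stands in for a real ZooKeeper connection.

```java
import java.nio.charset.StandardCharsets;
import java.util.concurrent.atomic.AtomicInteger;

public class ConfStoreReadSketch {

  /** Hypothetical stand-in for the ZooKeeper client used by the store. */
  interface ZkReader {
    boolean exists(String path) throws Exception;
    byte[] getData(String path) throws Exception;
  }

  /**
   * Reads a znode, retrying while the path is missing or returns no data.
   * Throws once maxAttempts is exhausted instead of returning null.
   */
  static byte[] getZkDataWithRetry(ZkReader zk, String path,
      int maxAttempts, long sleepMillis) throws Exception {
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      if (zk.exists(path)) {
        byte[] data = zk.getData(path);
        if (data != null) {
          return data;
        }
      }
      // Path is missing (or vanished between the check and the read).
      if (attempt < maxAttempts) {
        Thread.sleep(sleepMillis);
      }
    }
    throw new IllegalStateException(
        "Path " + path + " still missing after " + maxAttempts + " attempts");
  }

  public static void main(String[] args) throws Exception {
    // Fake client: the path is "missing" for the first two exists() calls.
    AtomicInteger calls = new AtomicInteger();
    ZkReader fake = new ZkReader() {
      @Override
      public boolean exists(String path) {
        return calls.incrementAndGet() > 2;
      }

      @Override
      public byte[] getData(String path) {
        return "conf".getBytes(StandardCharsets.UTF_8);
      }
    };
    byte[] data = getZkDataWithRetry(fake, "/confstore/CONF_STORE", 5, 10L);
    System.out.println(new String(data, StandardCharsets.UTF_8)
        + " read after " + calls.get() + " existence checks");
  }
}
```

Failing with an explicit exception after the last attempt, rather than handing a null back to the caller, keeps the failure close to its cause. Whether the actual patch does the same is not stated here.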
### Description of PR
This change addresses a rare race condition in which "yarn resourcemanager
-format-state-store" is issued while an RM is in the INITING state (having
already initialized the confstore), right before the RM reads the confstore.
The change avoids a null pointer exception.
### How was this patch tested?
Manually, with locks and a sleep introduced in the RM at the confstore format
step, so that while one RM is formatting the state store, the other RM is in
the INIT state at the getZkData method, trying to read the confstore.
Unit tests were also added.
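
The manual reproduction described above can be approximated without a cluster. The sketch below is an assumption-laden simulation, not the unit test added in the PR: an in-memory map stands in for ZooKeeper, one thread plays the formatting RM, and the calling thread plays the RM that reads the confstore. The class name, the timings, the stored values, and the decision to have the formatter thread recreate the path itself are all made up for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;

public class FormatRaceSketch {
  static final String PATH = "/confstore/CONF_STORE";

  /** Runs the simulated race once and returns what the reader finally saw. */
  static String runRace() throws Exception {
    // In-memory map standing in for ZooKeeper; not a real client.
    Map<String, byte[]> znodes = new ConcurrentHashMap<>();
    znodes.put(PATH, "conf-v1".getBytes(StandardCharsets.UTF_8));
    CountDownLatch deleted = new CountDownLatch(1);

    // "Formatting RM": removes the path, pauses, then the path reappears.
    Thread formatter = new Thread(() -> {
      znodes.remove(PATH);
      deleted.countDown();
      try {
        Thread.sleep(50);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
      znodes.put(PATH, "conf-v2".getBytes(StandardCharsets.UTF_8));
    });
    formatter.start();

    // "Starting RM": begins reading only after the path has been removed,
    // then retries until the path is back or the attempts run out.
    deleted.await();
    byte[] data = null;
    for (int attempt = 0; attempt < 100 && data == null; attempt++) {
      data = znodes.get(PATH);
      if (data == null) {
        Thread.sleep(10);
      }
    }
    formatter.join();
    return data == null ? null : new String(data, StandardCharsets.UTF_8);
  }

  public static void main(String[] args) throws Exception {
    System.out.println("reader saw: " + runRace());
  }
}
```

The latch forces the reader to start inside the window where the path is gone, which is the role the added locks and sleep play in the manual test.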
### For code changes:
- [x] Does the title of this PR start with the corresponding JIRA issue id
(e.g. 'HADOOP-17799. Your PR title ...')?
- [ ] Object storage: have the integration tests been executed and the
endpoint declared according to the connector-specific documentation?
- [ ] If adding new dependencies to the code, are these dependencies
licensed in a way that is compatible for inclusion under [ASF
2.0](http://www.apache.org/legal/resolved.html#category-a)?
- [ ] If applicable, have you updated the `LICENSE`, `LICENSE-binary`,
`NOTICE-binary` files?
### AI Tooling
If an AI tool was used:
- [ ] The PR includes the phrase "Contains content generated by <tool>"
where <tool> is the name of the AI tool used.
- [ ] My use of AI contributions follows the ASF legal policy
https://www.apache.org/legal/generative-tooling.html
> Add zkManager.exists(path) check to ZKConfigurationStore:getZkData() and
> retry mechanism
> ----------------------------------------------------------------------------------------
>
> Key: YARN-11924
> URL: https://issues.apache.org/jira/browse/YARN-11924
> Project: Hadoop YARN
> Issue Type: Bug
> Reporter: Ferenc Erdelyi
> Assignee: Ferenc Erdelyi
> Priority: Major
>
> If a "yarn resourcemanager -format-state-store" command is issued while
> one of the RMs is starting and is in the INIT state (because of YARN-11551),
> there is a time window in which the /confstore/CONF_STORE path does not
> exist, so the getZkData method returns null and the RM fails. To prevent
> this, add an existence check and a retry mechanism before giving up.