[ 
https://issues.apache.org/jira/browse/YARN-11924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18055470#comment-18055470
 ] 

ASF GitHub Bot commented on YARN-11924:
---------------------------------------

ferdelyi opened a new pull request, #8222:
URL: https://github.com/apache/hadoop/pull/8222

   …getZkData() and retry mechanism
   
   If a "yarn resourcemanager -format-state-store" command is issued while 
one of the RMs is starting and in the INIT state (because of YARN-11551), there 
is a time period when the /confstore/CONF_STORE path does not exist, so the 
getZkData method returns a null value, causing the RM to fail. To prevent this, 
add an existence check and a retry mechanism before giving up.
   
   
   ### Description of PR
   Addresses a rare race condition that occurs when "yarn resourcemanager 
-format-state-store" is issued while an RM is in the INITING state, after it 
has initialized the confstore and right before it reads it. This change avoids 
a null pointer exception.
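
The check-and-retry idea described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code from the pull request: `ZkReader`, `readWithRetry`, the attempt count, and the sleep interval are hypothetical names and values chosen for the example. Only the `/confstore/CONF_STORE` path and the exists-before-read idea come from the description.

```java
import java.nio.charset.StandardCharsets;
import java.util.concurrent.atomic.AtomicInteger;

/** Sketch of an exists-then-read loop with bounded retries. All names are hypothetical. */
class ConfStoreRetrySketch {

    /** Hypothetical stand-in for the ZooKeeper client used by the configuration store. */
    interface ZkReader {
        boolean exists(String path);
        byte[] getData(String path);
    }

    /**
     * Reads the node at {@code path}, checking for its existence up to
     * {@code maxAttempts} times and sleeping {@code sleepMillis} between checks.
     * Throws IllegalStateException instead of returning null if the node never appears.
     */
    static byte[] readWithRetry(ZkReader zk, String path, int maxAttempts, long sleepMillis) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (zk.exists(path)) {
                byte[] data = zk.getData(path);
                if (data != null) {
                    return data;
                }
                // The node vanished between exists() and getData(); treat it as missing.
            }
            if (attempt < maxAttempts) {
                try {
                    Thread.sleep(sleepMillis);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("Interrupted while waiting for " + path, e);
                }
            }
        }
        throw new IllegalStateException(
            "Path " + path + " still missing after " + maxAttempts + " attempts");
    }

    public static void main(String[] args) {
        // Fake client: the path "appears" on the third existence check,
        // imitating a store that is re-created shortly after being formatted.
        AtomicInteger checks = new AtomicInteger();
        ZkReader fake = new ZkReader() {
            @Override
            public boolean exists(String path) {
                return checks.incrementAndGet() >= 3;
            }

            @Override
            public byte[] getData(String path) {
                return "conf".getBytes(StandardCharsets.UTF_8);
            }
        };
        byte[] data = readWithRetry(fake, "/confstore/CONF_STORE", 5, 10L);
        // Prints "conf after 3 checks"
        System.out.println(
            new String(data, StandardCharsets.UTF_8) + " after " + checks.get() + " checks");
    }
}
```

Failing with an explicit exception after the last attempt, rather than returning null, keeps the failure at the read site instead of letting it surface later as a null pointer exception.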
   
   ### How was this patch tested?
   Manually, by introducing locks with a sleep in the RM at the confstore 
format step, so that while one RM is formatting the state store, the other RM 
is in the INIT state at the getZkData method, trying to read the confstore.
   Also with added unit tests.
   
   ### For code changes:
   
   - [x] Does the title of this PR start with the corresponding JIRA issue id 
(e.g. 'HADOOP-17799. Your PR title ...')?
   - [ ] Object storage: have the integration tests been executed and the 
endpoint declared according to the connector-specific documentation?
   - [ ] If adding new dependencies to the code, are these dependencies 
licensed in a way that is compatible for inclusion under [ASF 
2.0](http://www.apache.org/legal/resolved.html#category-a)?
   - [ ] If applicable, have you updated the `LICENSE`, `LICENSE-binary`, 
`NOTICE-binary` files?
   
   ### AI Tooling
   
   If an AI tool was used:
   
   - [ ] The PR includes the phrase "Contains content generated by <tool>"
         where <tool> is the name of the AI tool used.
   - [ ] My use of AI contributions follows the ASF legal policy
         https://www.apache.org/legal/generative-tooling.html




> Add zkManager.exists(path) check to ZKConfigurationStore:getZkData() and 
> retry mechanism
> ----------------------------------------------------------------------------------------
>
>                 Key: YARN-11924
>                 URL: https://issues.apache.org/jira/browse/YARN-11924
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Ferenc Erdelyi
>            Assignee: Ferenc Erdelyi
>            Priority: Major
>
> If a "yarn resourcemanager -format-state-store" command is issued while 
> one of the RMs is starting and in the INIT state (because of YARN-11551), 
> there is a time period when the /confstore/CONF_STORE path does not exist, 
> so the getZkData method returns a null value, causing the RM to fail. To 
> prevent this, add an existence check and a retry mechanism before giving up.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
