[
https://issues.apache.org/jira/browse/YARN-10896?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17693787#comment-17693787
]
ASF GitHub Bot commented on YARN-10896:
---------------------------------------
anoopsjohn opened a new pull request, #5436:
URL: https://github.com/apache/hadoop/pull/5436
### Description of PR
When the RM is not in HA mode and restarts, it reads the exclude file and
restores the state of the NMs for which decommission was requested, so the
in-memory state of those NMs correctly ends up as DECOMMISSIONED.
In HA mode, however, this state transition does not happen properly when the
standby becomes active. The decommissioning NMs do get a kill request from the
RM when they heartbeat to the new active RM, but their in-memory state never
becomes DECOMMISSIONED, so the node state API does not report them as
DECOMMISSIONED. This PR fixes that issue.
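
A minimal sketch of the intended behaviour. The class and method names below
(`DecommissionTracker`, `onTransitionToActive`, the plain maps) are hypothetical
stand-ins for the RM's actual NodesListManager/RMContext bookkeeping, not the
patch itself:

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/**
 * Hypothetical stand-in for the RM-side bookkeeping: when the standby RM
 * transitions to active, nodes listed in the exclude file (i.e. being
 * gracefully decommissioned) should end up in the inactive map with state
 * DECOMMISSIONED, mirroring what the non-HA restart path already does.
 */
public class DecommissionTracker {

  enum NodeState { RUNNING, DECOMMISSIONING, DECOMMISSIONED }

  // Active (heartbeating) nodes and inactive nodes, keyed by host name.
  private final Map<String, NodeState> activeNodes = new ConcurrentHashMap<>();
  private final Map<String, NodeState> inactiveNodes = new ConcurrentHashMap<>();

  /** Called when this RM becomes active; excludeList comes from the exclude file. */
  public void onTransitionToActive(Set<String> excludeList) {
    for (String host : excludeList) {
      // A host in the exclude file must not stay RUNNING/DECOMMISSIONING in
      // the active map; record it as DECOMMISSIONED so the node state API
      // and cluster metrics report it correctly after failover.
      activeNodes.remove(host);
      inactiveNodes.put(host, NodeState.DECOMMISSIONED);
    }
  }

  /** Called when a node heartbeats; excluded nodes are told to shut down. */
  public boolean shouldShutDown(String host) {
    return inactiveNodes.get(host) == NodeState.DECOMMISSIONED;
  }

  public Map<String, NodeState> getInactiveNodes() {
    return inactiveNodes;
  }
}
```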
### How was this patch tested?
Manually tested by failing over the RM while some of the NMs were in the
graceful decommissioning state.
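
If this check were automated later, a rough sketch of the assertion (using the
hypothetical `DecommissionTracker` above rather than a real MockRM/HA test)
could look like this:

```java
import java.util.Set;

// Rough, self-contained check of the expected failover behaviour, built on the
// hypothetical DecommissionTracker sketch above (not an actual YARN test).
public class DecommissionFailoverCheck {
  public static void main(String[] args) {
    DecommissionTracker newActiveRm = new DecommissionTracker();

    // nm1 was gracefully decommissioning (listed in the exclude file)
    // when the previously active RM went down.
    Set<String> excludeFileHosts = Set.of("nm1.example.com");

    // The standby becomes active and re-reads the exclude file.
    newActiveRm.onTransitionToActive(excludeFileHosts);

    // The node must be told to shut down on its next heartbeat...
    if (!newActiveRm.shouldShutDown("nm1.example.com")) {
      throw new AssertionError("node not asked to shut down after failover");
    }
    // ...and must also show up as DECOMMISSIONED in the node-state view.
    if (!newActiveRm.getInactiveNodes().containsKey("nm1.example.com")) {
      throw new AssertionError("node missing from DECOMMISSIONED view");
    }
    System.out.println("failover decommission state check passed");
  }
}
```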
### For code changes:
- [x] Does the title of this PR start with the corresponding JIRA issue id
(e.g. 'HADOOP-17799. Your PR title ...')?
- [ ] Object storage: have the integration tests been executed and the
endpoint declared according to the connector-specific documentation?
- [ ] If adding new dependencies to the code, are these dependencies
licensed in a way that is compatible for inclusion under [ASF
2.0](http://www.apache.org/legal/resolved.html#category-a)?
- [ ] If applicable, have you updated the `LICENSE`, `LICENSE-binary`,
`NOTICE-binary` files?
> RM fail over is not reporting the nodes DECOMMISSIONED
> -------------------------------------------------------
>
> Key: YARN-10896
> URL: https://issues.apache.org/jira/browse/YARN-10896
> Project: Hadoop YARN
> Issue Type: Bug
> Components: resourcemanager
> Reporter: Sushil Ks
> Assignee: Sushil Ks
> Priority: Major
> Attachments: YARN-10896.001.patch
>
>
> Whenever we add host entries to the exclude file in order to DECOMMISSION
> NodeManagers, we issue the *yarn rmadmin -refreshNodes* command to transition
> the nodes from RUNNING to DECOMMISSIONED. However, if a failover to the
> standby resource manager happens while the exclude file lists the disallowed
> hosts, these nodes are never reflected in the Cluster Metrics on the new
> active resource manager. Host entries present in the exclude file are listed
> in the Cluster Metrics whenever the resource manager is restarted, i.e. as
> part of the service init of *NodeListManager*, but during failover this
> information is lost. Hence this patch sets the *DECOMMISSIONED* nodes in the
> RM Context so that they are available through Cluster Metrics whenever we
> issue the *yarn rmadmin -refreshNodes* command.
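
The gist of the proposed fix is to rebuild this state from the exclude file
whenever it is re-read. A minimal hypothetical sketch (class and method names
here are illustrative, not the real NodesListManager/RMContext code):

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of the refreshNodes path described above: RmContextSketch
// and refreshNodes() only illustrate keeping DECOMMISSIONED nodes in context
// state that a newly active RM can rebuild.
public class RmContextSketch {

  private final Map<String, String> inactiveNodes = new ConcurrentHashMap<>();

  /** Invoked for "yarn rmadmin -refreshNodes" and on transition to active. */
  public void refreshNodes(Set<String> excludeFileHosts) {
    for (String host : excludeFileHosts) {
      inactiveNodes.put(host, "DECOMMISSIONED");
    }
  }

  /** What cluster metrics would report for decommissioned nodes. */
  public long decommissionedNodeCount() {
    return inactiveNodes.values().stream()
        .filter("DECOMMISSIONED"::equals)
        .count();
  }
}
```

Because the map is repopulated from the exclude file on every refresh (and on
transition to active), the DECOMMISSIONED count no longer depends on state that
only the previous active RM had observed.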