[
https://issues.apache.org/jira/browse/YARN-10896?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sushil Ks updated YARN-10896:
-----------------------------
Description:
Whenever we add host entries to the exclude file in order to DECOMMISSION a
NodeManager, we issue the *yarn rmadmin -refreshNodes* command to transition
those nodes from RUNNING to DECOMMISSIONED. However, if a failover to the
standby ResourceManager happens while the exclude file contains the list of
disallowed hosts, those disallowed nodes are never reported through the
Cluster Metrics on the new active ResourceManager.
The host entries present in the exclude file are listed in the Cluster Metrics
whenever the ResourceManager is restarted, i.e. as part of the service init of
*NodeListManager*; however, during failover this information is lost. Hence
this patch sets the *DECOMMISSIONED* nodes inside the RM Context so that they
are available through the Cluster Metrics whenever we issue the *yarn rmadmin
-refreshNodes* command.
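To illustrate the idea behind the patch, here is a minimal, hypothetical sketch (these are not the actual YARN classes; `ClusterMetricsSketch` and `NodeListManagerSketch` are invented names): on every refreshNodes call, the hosts from the exclude list are pushed into the decommissioned-node gauge, so the count is rebuilt from the exclude file rather than depending on state that a failover would lose.

```java
import java.util.*;

// Hypothetical stand-in for the ClusterMetrics gauge (not the real YARN class).
class ClusterMetricsSketch {
    private int numDecommissionedNMs = 0;
    int getNumDecommissionedNMs() { return numDecommissionedNMs; }
    void setDecommissionedNMs(int n) { numDecommissionedNMs = n; }
}

// Hypothetical stand-in for NodeListManager, sketching the patch's idea:
// derive the DECOMMISSIONED count from the exclude list on each refresh,
// so it survives a ResourceManager failover/re-init.
class NodeListManagerSketch {
    private final Set<String> excludeList = new HashSet<>();
    private final ClusterMetricsSketch metrics;

    NodeListManagerSketch(ClusterMetricsSketch metrics) { this.metrics = metrics; }

    // Called when the admin issues "yarn rmadmin -refreshNodes":
    // re-read the excluded hosts and publish them to the metrics.
    void refreshNodes(Collection<String> excludedHosts) {
        excludeList.clear();
        excludeList.addAll(excludedHosts);
        metrics.setDecommissionedNMs(excludeList.size());
    }
}

public class Demo {
    public static void main(String[] args) {
        ClusterMetricsSketch metrics = new ClusterMetricsSketch();
        NodeListManagerSketch mgr = new NodeListManagerSketch(metrics);
        // Hosts taken from the exclude file (hypothetical names).
        mgr.refreshNodes(Arrays.asList("worker-01.example.com", "worker-02.example.com"));
        System.out.println(metrics.getNumDecommissionedNMs()); // prints 2
    }
}
```

Because the gauge is recomputed from the exclude list itself, a newly active ResourceManager reports the same DECOMMISSIONED count after the next refreshNodes, instead of only after a full restart.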
> RM fail over is not reporting the nodes DECOMMISSIONED
> -------------------------------------------------------
>
> Key: YARN-10896
> URL: https://issues.apache.org/jira/browse/YARN-10896
> Project: Hadoop YARN
> Issue Type: Bug
> Components: resourcemanager
> Reporter: Sushil Ks
> Assignee: Sushil Ks
> Priority: Major
> Attachments: YARN-10896.001.patch
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]