[ https://issues.apache.org/jira/browse/YARN-3212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14514182#comment-14514182 ]
Junping Du commented on YARN-3212:
----------------------------------

bq. we also need to verify the scheduler hasn't allocated or handed out a container for that node that hasn't reached the node yet other than only check application status.

I thought about this problem again. The other option is to go ahead and mark this node as decommissioned anyway, but keep the AM and RM in sync about it. It depends on how we understand the word "graceful" here:
- If it means making decommissioning less expensive, then this case falls into that category: releasing an unlaunched container is pretty cheap, which could be better than waiting for the container to execute from the beginning.
- If it means a clean scheduling flow and clean log messages (at least within the timeout), we should probably wait for the container to launch.

Thoughts?

> RMNode State Transition Update with DECOMMISSIONING state
> ---------------------------------------------------------
>
>                 Key: YARN-3212
>                 URL: https://issues.apache.org/jira/browse/YARN-3212
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: resourcemanager
>            Reporter: Junping Du
>            Assignee: Junping Du
>         Attachments: RMNodeImpl - new.png, YARN-3212-v1.patch,
> YARN-3212-v2.patch, YARN-3212-v3.patch
>
>
> As proposed in YARN-914, a new state, “DECOMMISSIONING”, will be added. A node
> can enter it from the “running” state, triggered by a new event,
> “decommissioning”.
> The new state can transition to “decommissioned” on Resource_Update if there are
> no running apps on this NM, or when the NM reconnects after a restart, or when it
> receives a DECOMMISSIONED event (after timeout from the CLI).
> In addition, it can go back to “running” if the user decides to cancel the previous
> decommission by calling recommission on the same node. The reaction to other
> events is similar to that of the RUNNING state.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
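
For reference, the transitions described in the issue text can be sketched roughly as below. This is only an illustration, not the RMNodeImpl code or any of the attached patches. The class, enum, and method names (DecommissioningSketch, NodeState, NodeEvent, next) and the runningApps parameter are hypothetical. It also assumes one reading of the description: Resource_Update completes the decommission only when no apps are running, while an NM reconnect or an explicit decommission event completes it unconditionally. Transitions that already exist for the RUNNING state are left out.

```java
final class DecommissioningSketch {

    // Only the three states named in the issue description are modelled.
    enum NodeState { RUNNING, DECOMMISSIONING, DECOMMISSIONED }

    // Hypothetical event names, modelled on the wording of the description.
    enum NodeEvent { GRACEFUL_DECOMMISSION, RESOURCE_UPDATE, RECONNECTED, DECOMMISSION, RECOMMISSION }

    // Returns the state after handling one event; runningApps is the number
    // of applications still running on the node (an assumed input).
    static NodeState next(NodeState state, NodeEvent event, int runningApps) {
        if (state == NodeState.RUNNING) {
            // The new "decommissioning" event moves a running node into the new state.
            return event == NodeEvent.GRACEFUL_DECOMMISSION ? NodeState.DECOMMISSIONING : state;
        }
        if (state == NodeState.DECOMMISSIONING) {
            switch (event) {
                case RESOURCE_UPDATE:
                    // Finish only once no apps are left on the node.
                    return runningApps == 0 ? NodeState.DECOMMISSIONED : state;
                case RECONNECTED:   // NM reconnects after a restart
                case DECOMMISSION:  // e.g. sent after the CLI timeout
                    return NodeState.DECOMMISSIONED;
                case RECOMMISSION:  // user cancels the decommission
                    return NodeState.RUNNING;
                default:
                    return state;   // other events do not change the state in this sketch
            }
        }
        // DECOMMISSIONED is treated as final here.
        return state;
    }
}
```

For example, under these assumptions `next(NodeState.DECOMMISSIONING, NodeEvent.RESOURCE_UPDATE, 2)` keeps the node in DECOMMISSIONING, while the same call with `0` running apps moves it to DECOMMISSIONED. The question raised in the comment (containers allocated but not yet launched) would need an extra input beyond runningApps, which this sketch does not model.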