Junping Du commented on YARN-3212:

bq. we also need to verify the scheduler hasn't allocated or handed out a 
container for that node that hasn't reached the node yet other than only check 
application status.
Thinking about this problem again: another option is to go ahead and mark the 
node as decommissioned anyway, but keep the AM and RM in sync on it. 
It depends on how we interpret the word "graceful" here: if it means making 
node decommission less expensive, then this case should fall into that 
category, as releasing an unlaunched container is pretty cheap and could be 
better than waiting for the container to execute from the beginning; if we 
take it to mean a clean scheduling flow and clean log messages (at least 
within the timeout), then we should probably wait for the container to launch. 

> RMNode State Transition Update with DECOMMISSIONING state
> ---------------------------------------------------------
>                 Key: YARN-3212
>                 URL: https://issues.apache.org/jira/browse/YARN-3212
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: resourcemanager
>            Reporter: Junping Du
>            Assignee: Junping Du
>         Attachments: RMNodeImpl - new.png, YARN-3212-v1.patch, 
> YARN-3212-v2.patch, YARN-3212-v3.patch
> As proposed in YARN-914, a new state, “DECOMMISSIONING”, will be added; a node 
> transitions into it from the “RUNNING” state on a new “decommissioning” event. 
> This new state can transition to “DECOMMISSIONED” on Resource_Update if there 
> are no running apps on this NM, when the NM reconnects after a restart, or 
> when a DECOMMISSIONED event is received (after the timeout set from the CLI). 
> In addition, the node can go back to “RUNNING” if the user decides to cancel 
> the previous decommission by recommissioning the same node. The reaction to 
> other events is the same as in the RUNNING state.
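The transitions described above can be sketched as a small state machine. This is an illustrative sketch only; the enum values, event names, and class name here are assumptions for clarity and are not the actual code in the YARN-3212 patch:

```java
import java.util.Objects;

// Illustrative sketch of the RMNode transitions described in the issue.
// Names (NodeState, NodeEvent, RMNodeSketch) are hypothetical.
enum NodeState { RUNNING, DECOMMISSIONING, DECOMMISSIONED }

enum NodeEvent { DECOMMISSIONING, RECOMMISSION, DECOMMISSIONED, RESOURCE_UPDATE }

class RMNodeSketch {
    NodeState state = NodeState.RUNNING;
    int runningApps = 0; // apps still running on this NM

    void handle(NodeEvent event) {
        Objects.requireNonNull(event);
        switch (state) {
            case RUNNING:
                // New "decommissioning" event moves RUNNING -> DECOMMISSIONING.
                if (event == NodeEvent.DECOMMISSIONING) {
                    state = NodeState.DECOMMISSIONING;
                }
                break;
            case DECOMMISSIONING:
                if (event == NodeEvent.RECOMMISSION) {
                    // User cancelled the decommission: back to RUNNING.
                    state = NodeState.RUNNING;
                } else if (event == NodeEvent.DECOMMISSIONED
                        || (event == NodeEvent.RESOURCE_UPDATE && runningApps == 0)) {
                    // Timeout from the CLI, or no running apps left on this NM.
                    state = NodeState.DECOMMISSIONED;
                }
                break;
            default:
                // DECOMMISSIONED is terminal in this sketch.
                break;
        }
    }
}
```

The RESOURCE_UPDATE branch is where the trade-off discussed above would live: whether an allocated-but-unlaunched container counts toward "no running apps" before the node is allowed to reach DECOMMISSIONED.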

This message was sent by Atlassian JIRA
