Junping Du commented on YARN-914:

Hi [~mingma], thanks for the comments here.
bq. So YARN will reduce the capacity of the nodes as part of the decommission 
process until all its map outputs are fetched or until all the applications the 
node touches have completed?
Yes. I am not sure it is necessary for YARN to additionally mark the node as 
decommissioned, since the node's resource is already updated to 0 and no 
container will get a chance to be allocated on it. The auxiliary service should 
still be running, which shouldn't consume many resources when there are no 
service requests.

bq. In addition, it will be interesting to understand how you handle long 
running jobs.
Do you mean long-running services? 
First, I think we should support a timeout when draining the node's resources 
(ResourceOption already has a timeout in its design), so running containers 
should be preempted once the timeout expires. 
Second, we should support a special container tag for long-running services 
(some discussion in YARN-1039) so we don't have to waste time waiting until the 
timeout for such containers to finish. 
Third, from an operations perspective, we could add a long-running label to 
specific nodes and try not to decommission nodes with that label.
Let me know if this makes sense to you.
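
To make the three points above concrete, here is a rough sketch in Java. It is 
illustrative only: DrainPolicy, Action, decide, eligibleForDecommission and the 
"long-running" label string are hypothetical names, not existing YARN APIs. It 
also assumes that a container tagged as long-running is preempted right away 
instead of being waited on, which is only one possible reading of the second 
point.

```java
import java.time.Duration;
import java.util.Set;

/** Hypothetical sketch of the drain decisions; not an actual YARN class. */
final class DrainPolicy {
    enum Action { WAIT, PREEMPT }

    /** Decide what to do with one running container on a draining node. */
    static Action decide(boolean longRunning, Duration elapsed, Duration timeout) {
        // Assumption: a tagged long-running service will not finish on its
        // own, so waiting for the timeout would only waste time.
        if (longRunning) {
            return Action.PREEMPT;
        }
        // Ordinary containers may keep running until the drain timeout expires.
        return elapsed.compareTo(timeout) >= 0 ? Action.PREEMPT : Action.WAIT;
    }

    /** Operational rule: avoid decommissioning nodes labeled long-running. */
    static boolean eligibleForDecommission(Set<String> nodeLabels) {
        // "long-running" is a made-up label name used only for this sketch.
        return !nodeLabels.contains("long-running");
    }
}
```

For example, with a 30 minute timeout, an untagged container that has run for 
5 minutes since draining began would be left alone, while the same container 
at 30 minutes would be preempted.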

> Support graceful decommission of nodemanager
> --------------------------------------------
>                 Key: YARN-914
>                 URL: https://issues.apache.org/jira/browse/YARN-914
>             Project: Hadoop YARN
>          Issue Type: Improvement
>    Affects Versions: 2.0.4-alpha
>            Reporter: Luke Lu
>            Assignee: Junping Du
> When NMs are decommissioned for non-fault reasons (capacity change etc.), 
> it's desirable to minimize the impact on running applications.
> Currently if an NM is decommissioned, all running containers on the NM need to 
> be rescheduled on other NMs. Furthermore, for finished map tasks, if their 
> map outputs are not fetched by the reducers of the job, these map tasks will 
> need to be rerun as well.
> We propose to introduce a mechanism to optionally gracefully decommission a 
> node manager.
