[ https://issues.apache.org/jira/browse/YARN-914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14316980#comment-14316980 ]

Jason Lowe commented on YARN-914:
---------------------------------

Thanks for updating the doc, Junping.  Additional comments:

Nit: How about DECOMMISSIONING instead of DECOMMISSION_IN_PROGRESS?

The design says when a node starts decommissioning we will remove its resources 
from the cluster, but that's not really the case, correct?  We should remove 
its available (not total) resources from the cluster, then continue to remove 
resources as containers complete on that node.  Failing to do so will result in 
odd metrics, e.g. the cluster reporting more resources in use than it says it 
has.

Are we only going to support graceful decommission via updates to the 
include/exclude files and refresh?  Not needed for the initial cut, but 
thinking of a couple of use-cases and curious what others thought:
* Would be convenient to have an rmadmin command that does this in one step, 
especially for a single node.  Arguably, if we are persisting cluster nodes in 
the state store we can migrate the list there, and the include/exclude lists 
simply become convenient ways to batch-update the cluster state.
* Will NMs be able to request a graceful decommission via their health check 
script?  There have been some cases in the past where it would have been nice 
for the NM to request a ramp-down on containers but not instantly kill all of 
them with an UNHEALTHY report.

As for the UI changes, my initial thought is that decommissioning nodes should 
still show up in the active nodes list since they are still running containers. 
A separate decommissioning tab to filter for those nodes would be nice, although 
I suppose users can also just use the jquery table to sort/search for nodes in 
that state from the active nodes list if it's too crowded to add yet another 
node state tab (or maybe get rid of some effectively dead tabs like the reboot 
state tab).

For the NM restart open question, this should no longer be an issue now that 
the NM is unaware of graceful decommission.  All the RM needs to do is ensure 
that a node rejoining the cluster, when the RM thought it was already part of 
it, retains its previous running/decommissioning state.  That way, if an NM was 
decommissioning before the restart it will continue decommissioning after it 
restarts.
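Roughly something like the sketch below on (re)registration.  Again, purely 
illustrative: the class, the local enum, and the map are hypothetical, since 
the DECOMMISSIONING state doesn't exist in NodeState yet:

{code:java}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import org.apache.hadoop.yarn.api.records.NodeId;

public class NodeReconnectHandler {

  // Proposed states for this design; a local enum is used because
  // DECOMMISSIONING is not part of the NodeState enum today.
  enum TrackedState { RUNNING, DECOMMISSIONING }

  // Hypothetical record of the state the RM last knew for each node.
  private final Map<NodeId, TrackedState> lastKnownState =
      new ConcurrentHashMap<NodeId, TrackedState>();

  // On NM (re)registration: a node the RM already marked DECOMMISSIONING
  // stays DECOMMISSIONING; anything else (including a brand-new node)
  // comes up RUNNING.
  public TrackedState stateAfterRegistration(NodeId nodeId) {
    TrackedState previous = lastKnownState.get(nodeId);
    return previous == TrackedState.DECOMMISSIONING
        ? TrackedState.DECOMMISSIONING
        : TrackedState.RUNNING;
  }
}
{code}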

For the AM dealing with being notified of decommissioning, again I think this 
should just be treated like a strict preemption for the short term.  IMHO all 
the AM needs to know is that the RM is planning on taking away those 
containers, and what the AM should do about it is similar whether the reason 
for removal is preemption or decommissioning.
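For reference, this is roughly what the AM side already looks like with the 
existing preemption API (PreemptionMessage on the AllocateResponse); whether 
decommissioning ends up reusing this exact path is an assumption here, but the 
AM-side reaction would be the same:

{code:java}
import java.util.Set;
import org.apache.hadoop.yarn.api.protocolrecords.AllocateResponse;
import org.apache.hadoop.yarn.api.records.PreemptionContainer;
import org.apache.hadoop.yarn.api.records.PreemptionMessage;

public class PreemptionAwareAllocator {

  // Called for every AllocateResponse the AM receives from the RM.
  void handlePreemption(AllocateResponse response) {
    PreemptionMessage msg = response.getPreemptionMessage();
    if (msg == null || msg.getStrictContract() == null) {
      return;
    }
    // The strict contract lists containers the RM intends to take away.
    // The AM reacts the same way whether the cause is preemption or a
    // decommissioning node: checkpoint/drain work and stop relying on them.
    Set<PreemptionContainer> doomed = msg.getStrictContract().getContainers();
    for (PreemptionContainer c : doomed) {
      // e.g. save task state, avoid scheduling new work on c.getId()
    }
  }
}
{code}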

Back to the long running services delaying decommissioning concern, does YARN 
even know the difference between a long-running container and a "normal" 
container?  If it doesn't, how is it supposed to know a container is not going 
to complete anytime soon?  Even a "normal" container could run for many hours.  
It seems to me the first thing we would need before worrying about this 
scenario is the ability for YARN to know/predict the expected runtime of 
containers.

There's still an open question about tracking the timeout on the RM side 
instead of the NM side.  Sounds like the NM side is not going to be pursued at 
this point, and we're going with no built-in timeout support in YARN for the 
short term.

> Support graceful decommission of nodemanager
> --------------------------------------------
>
>                 Key: YARN-914
>                 URL: https://issues.apache.org/jira/browse/YARN-914
>             Project: Hadoop YARN
>          Issue Type: Improvement
>    Affects Versions: 2.0.4-alpha
>            Reporter: Luke Lu
>            Assignee: Junping Du
>         Attachments: Gracefully Decommission of NodeManager (v1).pdf, 
> Gracefully Decommission of NodeManager (v2).pdf
>
>
> When NMs are decommissioned for non-fault reasons (capacity change etc.), 
> it's desirable to minimize the impact to running applications.
> Currently if a NM is decommissioned, all running containers on the NM need to 
> be rescheduled on other NMs.  Furthermore, for finished map tasks, if their 
> map outputs have not been fetched by the reducers of the job, these map tasks 
> will need to be rerun as well.
> We propose to introduce a mechanism to optionally gracefully decommission a 
> node manager.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
