Jason Lowe commented on YARN-914:
Thanks for updating the doc, Junping. Additional comments:
Nit: How about DECOMMISSIONING instead of DECOMMISSION_IN_PROGRESS?
The design says when a node starts decommissioning we will remove its resources
from the cluster, but that's not really the case, correct? We should remove
its available (not total) resources from the cluster then continue to remove
available resources as containers complete on that node. Failing to do so will
result in weird metrics like more resources running on the cluster than the
cluster says it has, etc.
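To make the accounting concrete, here is a minimal sketch as a toy Python model, not YARN code. The Node and Cluster classes, their method names, and the memory-only resource model are all assumptions made for illustration. A decommissioning node contributes only what its running containers still hold, so cluster capacity shrinks as those containers complete and usage never exceeds it.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """Hypothetical node record; resources are modeled as memory in MB only."""
    total: int
    used: int = 0
    decommissioning: bool = False

    def contribution(self) -> int:
        # An active node counts its total capacity. A decommissioning node
        # counts only what its running containers still hold, i.e. its
        # available resources have been removed from the cluster.
        return self.used if self.decommissioning else self.total


class Cluster:
    """Toy stand-in for the RM's cluster-wide resource bookkeeping."""

    def __init__(self) -> None:
        self.nodes = {}

    def add_node(self, name: str, total: int) -> None:
        self.nodes[name] = Node(total=total)

    def capacity(self) -> int:
        return sum(n.contribution() for n in self.nodes.values())

    def used(self) -> int:
        return sum(n.used for n in self.nodes.values())

    def launch(self, name: str, size: int) -> None:
        node = self.nodes[name]
        # No new containers are placed on a decommissioning node.
        if node.decommissioning or node.used + size > node.total:
            raise ValueError(f"cannot place {size} MB on {name}")
        node.used += size

    def complete(self, name: str, size: int) -> None:
        # On a decommissioning node this also shrinks cluster capacity,
        # because contribution() tracks `used`.
        self.nodes[name].used -= size

    def start_decommission(self, name: str) -> None:
        self.nodes[name].decommissioning = True
```

With two 8192 MB nodes and one 3072 MB container running on the first, capacity drops from 16384 to 11264 when that node starts decommissioning, then to 8192 once the container completes.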
Are we only going to support graceful decommission via updates to the
include/exclude files and refresh? Not needed for the initial cut, but
thinking of a couple of use-cases and curious what others thought:
* Would be convenient to have an rmadmin command that does this in one step,
especially for a single node. Arguably if we are persisting cluster nodes in
the state store we can migrate the list there, and the include/exclude lists
simply become convenient ways to batch-update the cluster state.
* Will NMs be able to request a graceful decommission via their health check
script? There have been some cases in the past where it would have been nice
for the NM to request a ramp-down on containers but not instantly kill all of
them with an UNHEALTHY report.
As for the UI changes, initial thought is that decommissioning nodes should
still show up in the active nodes list since they are still running containers.
A separate decommissioning tab to filter for those nodes would be nice,
although I suppose users can also just use the jquery table to sort/search for
nodes in that state from the active nodes list if it's too crowded to add yet
another node state tab (or maybe get rid of some effectively dead tabs like the
reboot state tab).
For the NM restart open question, this should no longer be an issue now that the
NM is unaware of graceful decommission. All the RM needs to do is ensure that a
node that is rejoining the cluster when the RM thought it was already part of
it retains its previous running/decommissioning state. That way if an NM is
decommissioning before the restart it will continue to decommission after it
restarts.
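A minimal sketch of that RM-side rule, again as a toy Python model rather than YARN code. NodeTracker and its methods are hypothetical names, and the DECOMMISSIONING state name follows the suggestion earlier in this comment. Registering a node the RM already knows about leaves its recorded state untouched.

```python
from enum import Enum


class NodeState(Enum):
    RUNNING = "RUNNING"
    DECOMMISSIONING = "DECOMMISSIONING"


class NodeTracker:
    """Hypothetical RM-side table of node states, keyed by node id."""

    def __init__(self) -> None:
        self._states = {}

    def register(self, node_id: str) -> NodeState:
        # setdefault only inserts RUNNING for an unknown node. A node that
        # rejoins while the RM still tracks it keeps its previous state.
        return self._states.setdefault(node_id, NodeState.RUNNING)

    def start_decommission(self, node_id: str) -> None:
        if node_id not in self._states:
            raise KeyError(f"unknown node: {node_id}")
        self._states[node_id] = NodeState.DECOMMISSIONING

    def state(self, node_id: str) -> NodeState:
        return self._states[node_id]
```

Since the NM holds no decommission state of its own in this model, a restart followed by a second register() call is all that is needed for the node to keep draining.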
For the AM dealing with being notified of decommissioning, again I think this
should just be treated like a strict preemption for the short term. IMHO all
the AM needs to know is that the RM is planning on taking away those
containers, and what the AM should do about it is similar whether the reason
for removal is preemption or decommissioning.
Back to the long running services delaying decommissioning concern, does YARN
even know the difference between a long-running container and a "normal"
container? If it doesn't, how is it supposed to know a container is not going
to complete anytime soon? Even a "normal" container could run for many hours.
It seems to me the first thing we would need before worrying about this
scenario is the ability for YARN to know/predict the expected runtime of
containers.
There's still an open question about tracking the timeout on the RM side
instead of the NM side. Sounds like the NM side is not going to be pursued at
this point, and we're going with no built-in timeout support in YARN for the
short term.
> Support graceful decommission of nodemanager
> Key: YARN-914
> URL: https://issues.apache.org/jira/browse/YARN-914
> Project: Hadoop YARN
> Issue Type: Improvement
> Affects Versions: 2.0.4-alpha
> Reporter: Luke Lu
> Assignee: Junping Du
> Attachments: Gracefully Decommission of NodeManager (v1).pdf,
> Gracefully Decommission of NodeManager (v2).pdf
> When NMs are decommissioned for non-fault reasons (capacity change etc.),
> it's desirable to minimize the impact to running applications.
> Currently if a NM is decommissioned, all running containers on the NM need to
> be rescheduled on other NMs. Furthermore, for finished map tasks, if their
> map outputs are not fetched by the reducers of the job, these map tasks will
> need to be rerun as well.
> We propose to introduce a mechanism to optionally gracefully decommission a
> node manager.