[ https://issues.apache.org/jira/browse/YARN-3811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14589888#comment-14589888 ]
Karthik Kambatla commented on YARN-3811:
----------------------------------------
We should also consider graceful NM decommission. For graceful decommission,
the RM should refrain from assigning more tasks to the node in question. Should
we also prevent AMs that have already been assigned this node from starting new
containers? In that case, I guess we would not throw
{{NMNotYetReadyException}}, but a different {{YarnException}}, say
{{NMShuttingDownException}}?
On the client side (MR-AM in this case), we should probably treat any
{{YarnException}} as a system error and count the attempt as KILLED rather
than FAILED?
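A minimal sketch of the client-side rule floated above, assuming the AM only needs to decide between KILLED and FAILED for a failed container launch. The exception classes are local stand-ins so the sketch compiles without Hadoop on the classpath; {{NMShuttingDownException}} is only the name proposed in this comment, and {{AttemptOutcome}} and {{LaunchFailurePolicy}} are hypothetical names, not MR-AM classes.

```java
// Stand-ins for the real classes in org.apache.hadoop.yarn.exceptions,
// declared locally so this sketch has no Hadoop dependency.
class YarnException extends Exception {
    YarnException(String message) { super(message); }
}

class NMNotYetReadyException extends YarnException {
    NMNotYetReadyException(String message) { super(message); }
}

// Hypothetical: the exception name proposed in the comment, not an existing class.
class NMShuttingDownException extends YarnException {
    NMShuttingDownException(String message) { super(message); }
}

// Hypothetical outcome type; only FAILED is meant to count against the task.
enum AttemptOutcome { FAILED, KILLED }

final class LaunchFailurePolicy {
    private LaunchFailurePolicy() {}

    // Any YarnException is treated as a system error (KILLED);
    // every other cause is charged to the task attempt (FAILED).
    static AttemptOutcome classify(Throwable cause) {
        return (cause instanceof YarnException)
                ? AttemptOutcome.KILLED
                : AttemptOutcome.FAILED;
    }

    // Number of launch failures that would count against the task.
    static int countFailed(java.util.List<? extends Throwable> causes) {
        int failed = 0;
        for (Throwable cause : causes) {
            if (classify(cause) == AttemptOutcome.FAILED) {
                failed++;
            }
        }
        return failed;
    }

    public static void main(String[] args) {
        java.util.List<Exception> causes = java.util.Arrays.asList(
                new NMNotYetReadyException("NM has not registered with the RM yet"),
                new NMShuttingDownException("NM is being decommissioned"),
                new java.io.IOException("task-level error"));
        for (Exception cause : causes) {
            System.out.println(cause.getClass().getSimpleName() + " -> " + classify(cause));
        }
        System.out.println("counted against the task: " + countFailed(causes));
    }
}
```

With a rule like this, launch failures caused by an NM restart or decommission would no longer count toward the task's failure limit, while task-level errors still would.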
> NM restarts could lead to app failures
> --------------------------------------
>
> Key: YARN-3811
> URL: https://issues.apache.org/jira/browse/YARN-3811
> Project: Hadoop YARN
> Issue Type: Bug
> Components: nodemanager
> Affects Versions: 2.7.0
> Reporter: Karthik Kambatla
> Assignee: Karthik Kambatla
> Priority: Critical
>
> Consider the following scenario:
> 1. RM assigns a container on node N to an app A.
> 2. Node N is restarted.
> 3. A tries to launch container on node N.
> Step 3 could lead to an NMNotYetReadyException, depending on whether NM N has
> registered with the RM. In MR, this is considered a task attempt failure. A
> few of these could lead to a task/job failure.