[jira] [Commented] (YARN-3811) NM restarts could lead to app failures

Karthik Kambatla (JIRA) Wed, 17 Jun 2015 08:00:34 -0700

    [ 
https://issues.apache.org/jira/browse/YARN-3811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14589888#comment-14589888
 ]


Karthik Kambatla commented on YARN-3811:
----------------------------------------

We should also consider graceful NM decommission. For graceful decommission,
the RM should refrain from assigning more containers to the node in question.
Should we also prevent AMs that have already been assigned containers on this
node from starting them? In that case, I guess we would not throw
NMNotYetReadyException but a different YarnException, say
NMShuttingDownException?
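
A rough sketch of what that NM-side check could look like, to make the question
concrete. Everything here is an assumption for illustration: NodeState and
ContainerStartGate are hypothetical names, NMShuttingDownException is only the
name floated above, and YarnException and NMNotYetReadyException are redeclared
as stand-ins so the snippet compiles without Hadoop on the classpath.

```java
import java.util.concurrent.atomic.AtomicReference;

// Stand-in for org.apache.hadoop.yarn.exceptions.YarnException, redeclared
// so that this sketch compiles on its own.
class YarnException extends Exception {
    YarnException(String message) { super(message); }
}

// Stand-in for the existing YARN exception named in this issue.
class NMNotYetReadyException extends YarnException {
    NMNotYetReadyException(String message) { super(message); }
}

// HYPOTHETICAL: the name proposed in this comment, not assumed to exist in YARN.
class NMShuttingDownException extends YarnException {
    NMShuttingDownException(String message) { super(message); }
}

// HYPOTHETICAL simplified node lifecycle, not YARN's actual state machine.
enum NodeState { NOT_YET_REGISTERED, RUNNING, DECOMMISSIONING }

// HYPOTHETICAL gate an NM could consult before honouring a start request.
class ContainerStartGate {
    private final AtomicReference<NodeState> state =
        new AtomicReference<>(NodeState.NOT_YET_REGISTERED);

    void setState(NodeState newState) { state.set(newState); }

    // Throws a YarnException subtype that tells the AM why the start was refused.
    void checkCanStart(String containerId) throws YarnException {
        switch (state.get()) {
            case NOT_YET_REGISTERED:
                throw new NMNotYetReadyException(
                    "NM has not registered with the RM yet: " + containerId);
            case DECOMMISSIONING:
                throw new NMShuttingDownException(
                    "NM is being decommissioned: " + containerId);
            default:
                // RUNNING: the request may proceed.
                break;
        }
    }
}
```

With distinct exception types, an AM could tell "retry on this node later" apart
from "request a container elsewhere", which is the reason for not reusing
NMNotYetReadyException in the decommission case.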

On the client side (MR-AM in this case), we should probably treat any
{{YarnException}} as a system error and count the attempt as KILLED rather
than FAILED?
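
To illustrate why the classification matters, here is a minimal sketch that
assumes the usual MR behaviour of counting only FAILED attempts toward the
per-task attempt limit (mapreduce.map.maxattempts, default 4). The names
AttemptOutcome, LaunchFailureClassifier and TaskAttemptTally are hypothetical
and are not MR-AM classes; YarnException is a local stand-in for
org.apache.hadoop.yarn.exceptions.YarnException.

```java
// Stand-in for org.apache.hadoop.yarn.exceptions.YarnException, redeclared
// so that this sketch compiles on its own.
class YarnException extends Exception {
    YarnException(String message) { super(message); }
}

// HYPOTHETICAL: the two outcomes discussed in this comment.
enum AttemptOutcome { FAILED, KILLED }

// HYPOTHETICAL helper: any YarnException raised at launch is treated as a
// system error (KILLED); anything else is blamed on the attempt (FAILED).
class LaunchFailureClassifier {
    static AttemptOutcome classify(Throwable launchError) {
        return (launchError instanceof YarnException)
            ? AttemptOutcome.KILLED
            : AttemptOutcome.FAILED;
    }
}

// HYPOTHETICAL tally: only FAILED attempts count toward the attempt limit.
class TaskAttemptTally {
    private final int maxAttempts;
    private int failed;
    private int killed;

    TaskAttemptTally(int maxAttempts) { this.maxAttempts = maxAttempts; }

    void record(AttemptOutcome outcome) {
        if (outcome == AttemptOutcome.FAILED) {
            failed++;
        } else {
            killed++;
        }
    }

    int failedCount() { return failed; }

    int killedCount() { return killed; }

    // The task is failed once the FAILED attempts reach the limit.
    boolean taskFailed() { return failed >= maxAttempts; }
}
```

Under these assumptions, four launches rejected with a YarnException leave the
task alive, whereas four launches classified as FAILED would fail it.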

> NM restarts could lead to app failures
> --------------------------------------
>
>                 Key: YARN-3811
>                 URL: https://issues.apache.org/jira/browse/YARN-3811
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>    Affects Versions: 2.7.0
>            Reporter: Karthik Kambatla
>            Assignee: Karthik Kambatla
>            Priority: Critical
>
> Consider the following scenario:
> 1. RM assigns a container on node N to an app A.
> 2. Node N is restarted.
> 3. A tries to launch the container on node N.
> Step 3 could lead to an NMNotYetReadyException, depending on whether NM N has
> registered with the RM. In MR, this is considered a task attempt failure. A
> few of these could lead to a task/job failure.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
