[ https://issues.apache.org/jira/browse/YARN-3811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14588995#comment-14588995 ]
Karthik Kambatla commented on YARN-3811:
----------------------------------------

The issue is with counting container-launch-failures against the 4 task failures. We could potentially go about this in different ways:
# Support retries when launching containers. Start/stop containers are @AtMostOnce operations. This works okay for NM restart cases. When an NM goes down, this will lead to the job waiting longer before trying another node.
# On failure to launch container, return an error code that explicitly annotates it as a system error and not a user error. The AMs could choose to not count system errors against number of task attempt failures.
# Without any changes in Yarn, MR should identify exceptions on startContainers() different from failures captured in StartContainersResponse#getFailedRequests. That is, NMNotYetReadyException and IOException will not be counted against the number of allowed failures.

Option 2 seems like a cleaner approach to me.

> NM restarts could lead to app failures
> --------------------------------------
>
>                 Key: YARN-3811
>                 URL: https://issues.apache.org/jira/browse/YARN-3811
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>    Affects Versions: 2.7.0
>            Reporter: Karthik Kambatla
>            Assignee: Karthik Kambatla
>            Priority: Critical
>
> Consider the following scenario:
> 1. RM assigns a container on node N to an app A.
> 2. Node N is restarted
> 3. A tries to launch container on node N.
> 3 could lead to an NMNotYetReadyException depending on whether NM N has
> registered with the RM. In MR, this is considered a task attempt failure. A
> few of these could lead to a task/job failure.
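
For illustration, here is a rough sketch of the AM-side bookkeeping that options 2 and 3 in the comment above imply: launch errors classified as system errors are not counted against the allowed task attempt failures. It is plain Java with no Hadoop dependency, so every type in it is a hypothetical stand-in rather than YARN or MR code. In particular, the local NMNotYetReadyException only mirrors the name of the real exception, FailureKind and TaskFailureCounter are invented for this sketch, and treating every other thrown exception as countable is an assumption.

```java
import java.io.IOException;

public class LaunchFailureSketch {

    /** Hypothetical stand-in for YARN's NMNotYetReadyException (not the real class). */
    static class NMNotYetReadyException extends Exception {
        NMNotYetReadyException(String message) {
            super(message);
        }
    }

    /** Assumed classification: system errors are not the task's fault. */
    enum FailureKind { SYSTEM, USER }

    /**
     * Classifies an exception thrown by the launch call itself.
     * NMNotYetReadyException and IOException are treated as system errors;
     * anything else is assumed to count against the task.
     */
    static FailureKind classifyThrown(Exception e) {
        if (e instanceof NMNotYetReadyException || e instanceof IOException) {
            return FailureKind.SYSTEM;
        }
        return FailureKind.USER;
    }

    /** Counts only user-kind failures against the allowed number of attempts. */
    static class TaskFailureCounter {
        private final int maxFailures;
        private int counted;

        TaskFailureCounter(int maxFailures) {
            this.maxFailures = maxFailures;
        }

        /** Records one failed attempt; returns true once the task should be failed. */
        boolean record(FailureKind kind) {
            if (kind == FailureKind.USER) {
                counted++;
            }
            return counted >= maxFailures;
        }

        int counted() {
            return counted;
        }
    }
}
```

With a limit of 4, any number of launch failures caused by a restarting NM leaves the counter at zero under this scheme, and the task is failed only after four failures of the counted kind.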