[ https://issues.apache.org/jira/browse/YARN-3811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14589059#comment-14589059 ]
Karthik Kambatla commented on YARN-3811:
----------------------------------------
This wasn't as big an issue without work-preserving RM restart: the AM itself
would be restarted, so the window in which it could try launching containers
was fairly small.
bq. the right solution is for clients to retry NMNotYetReadyException
I kind of agree, but this is a remote exception for the client (MR-AM in this
case). What is the best way to handle remote exceptions?
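One possible shape for this, as a rough sketch only: have the client inspect the class name carried by the remote exception and retry with a short backoff when it names NMNotYetReadyException. Everything below is hypothetical and self-contained. The nested RemoteException is a stand-in for whatever wrapper the RPC layer hands to the client, callWithRetry and its parameters are made-up names, and the fully qualified class name is assumed to be where NMNotYetReadyException lives. It is not how MR-AM currently behaves.

```java
import java.util.concurrent.Callable;

public class RemoteRetrySketch {

    /** Hypothetical stand-in for an RPC wrapper that carries the server-side exception class name. */
    public static class RemoteException extends Exception {
        private final String className;

        public RemoteException(String className, String message) {
            super(message);
            this.className = className;
        }

        public String getClassName() {
            return className;
        }
    }

    /** Assumed fully qualified name of the exception the NM throws before it has registered with the RM. */
    public static final String NM_NOT_YET_READY =
        "org.apache.hadoop.yarn.exceptions.NMNotYetReadyException";

    /**
     * Runs the call, retrying only when the remote side reported NMNotYetReadyException.
     * Any other failure, or running out of attempts, rethrows the last exception.
     */
    public static <T> T callWithRetry(Callable<T> call, int maxAttempts, long backoffMillis)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        for (int attempt = 1; ; attempt++) {
            try {
                return call.call();
            } catch (RemoteException e) {
                boolean retriable = NM_NOT_YET_READY.equals(e.getClassName());
                if (!retriable || attempt >= maxAttempts) {
                    throw e;
                }
                // Linear backoff to give the NM time to register with the RM.
                Thread.sleep(backoffMillis * attempt);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        // Simulated container launch: the first two attempts hit an NM that is not ready yet.
        String result = callWithRetry(() -> {
            if (++calls[0] < 3) {
                throw new RemoteException(NM_NOT_YET_READY, "NM has not registered with the RM yet");
            }
            return "container launched";
        }, 5, 10L);
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```

Matching on the class name keeps the client free of a compile-time dependency on the server-side exception type, at the cost of a string comparison that breaks silently if the class is ever renamed. With something like this, a launch attempt that lands in the NM restart window would not have to count as a task attempt failure.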
> NM restarts could lead to app failures
> --------------------------------------
>
> Key: YARN-3811
> URL: https://issues.apache.org/jira/browse/YARN-3811
> Project: Hadoop YARN
> Issue Type: Bug
> Components: nodemanager
> Affects Versions: 2.7.0
> Reporter: Karthik Kambatla
> Assignee: Karthik Kambatla
> Priority: Critical
>
> Consider the following scenario:
> 1. RM assigns a container on node N to an app A.
> 2. Node N is restarted.
> 3. A tries to launch container on node N.
> Step 3 could lead to an NMNotYetReadyException depending on whether NM N has
> registered with the RM. In MR, this is considered a task attempt failure. A
> few of these could lead to a task/job failure.