Jian He commented on YARN-3811:

I'm wondering whether we still need NMNotYetReadyException at all. It is 
currently thrown when the NM has started its services but has not yet 
registered/re-registered with the RM. It may be OK to just launch the container. 

1. For work-preserving NM restart (the scenario in this jira), I think it's OK 
to just launch the container instead of throwing the exception.
2. For NM restart with no recovery support, startContainer will fail anyway 
because the NMToken is not valid.
3. For work-preserving RM restart, containers launched before the NM 
re-registers can be recovered on the RM when the NM sends the container 
statuses across. 
startContainer call after re-register will fail because the NMToken is not 
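
To make cases 1 and 2 above concrete, here is a rough toy model in Python. 
It is only a sketch of the reasoning, not NodeManager code: ToyNM, its 
fields, the two exception classes and outcome() are all made-up names, and 
the order of the two checks is an assumption for illustration.

```python
from dataclasses import dataclass


class NMNotYetReady(Exception):
    """Toy stand-in for NMNotYetReadyException (hypothetical class)."""


class InvalidNMToken(Exception):
    """Toy stand-in for an NMToken validation failure (hypothetical class)."""


@dataclass
class ToyNM:
    registered_with_rm: bool  # has the NM (re-)registered with the RM yet?
    nm_token_valid: bool      # does the caller's NMToken still verify?
    throw_if_not_ready: bool  # True = current behaviour, False = proposal

    def start_container(self, container_id: str) -> str:
        # Current behaviour: refuse until registration with the RM completes.
        if self.throw_if_not_ready and not self.registered_with_rm:
            raise NMNotYetReady(container_id)
        # Token validation applies whether or not the NM has registered.
        if not self.nm_token_valid:
            raise InvalidNMToken(container_id)
        return "launched"


def outcome(nm: ToyNM) -> str:
    """Return 'launched' or the name of the exception that was raised."""
    try:
        return nm.start_container("container_1")
    except (NMNotYetReady, InvalidNMToken) as e:
        return type(e).__name__


# Case 1, work-preserving NM restart, NM not yet re-registered:
print(outcome(ToyNM(False, True, throw_if_not_ready=True)))
print(outcome(ToyNM(False, True, throw_if_not_ready=False)))
# Case 2, NM restart without recovery, old NMToken no longer valid:
print(outcome(ToyNM(False, False, throw_if_not_ready=False)))
```

In this model, dropping the not-ready check turns case 1 from a failure 
into a launch, while case 2 still fails on the token check.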

> NM restarts could lead to app failures
> --------------------------------------
>                 Key: YARN-3811
>                 URL: https://issues.apache.org/jira/browse/YARN-3811
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>    Affects Versions: 2.7.0
>            Reporter: Karthik Kambatla
>            Assignee: Karthik Kambatla
>            Priority: Critical
> Consider the following scenario:
> 1. RM assigns a container on node N to an app A.
> 2. Node N is restarted
> 3. A tries to launch container on node N.
> 3 could lead to an NMNotYetReadyException depending on whether NM N has 
> registered with the RM. In MR, this is considered a task attempt failure. A 
> few of these could lead to a task/job failure.

This message was sent by Atlassian JIRA
