Vinod Kumar Vavilapalli commented on YARN-3811:
bq. We should also consider graceful NM decommission. For graceful
decommission, the RM should refrain from assigning more tasks to the node in
question. Should we also prevent AMs that have already been assigned this node
from starting new containers? In that case, I guess we would not be throwing
NMNotYetReadyException, but another YarnException - NMShuttingDownException?
[~kasha], we could. Let's file a separate JIRA?
bq. we should just avoid opening or processing the client port until we've
registered with the RM if it's really a problem in practice
[~jlowe], this is not possible, as the NM needs to report the RPC server
port during registration - so the server start has to happen before registration.
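The ordering constraint above can be sketched as follows. This is not the actual NodeManager code - `startRpcServer` and `registerWithRM` are stand-ins - but it shows why registration cannot come first: the registration request must carry the RPC port, which is only known once the server socket is bound.

```java
import java.io.IOException;
import java.net.ServerSocket;

public class StartThenRegister {

    // Stand-in for the NM's RPC server: bind to an ephemeral port.
    static ServerSocket startRpcServer() throws IOException {
        return new ServerSocket(0); // port 0 = let the OS pick
    }

    // Stand-in for the NM->RM registration call: the request needs
    // the concrete port, so this can only run after the bind.
    static String registerWithRM(int rpcPort) {
        return "registered node at port " + rpcPort;
    }

    public static void main(String[] args) throws IOException {
        try (ServerSocket rpc = startRpcServer()) {
            int port = rpc.getLocalPort(); // known only after start
            System.out.println(registerWithRM(port));
        }
    }
}
```

In this window between server start and successful registration, client calls can already arrive - which is exactly when the named exception is thrown.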
bq. 2. For NM restart with no recovery support, startContainer will fail
anyways because the NMToken is not valid.
bq. 3. For work-preserving RM restart, containers launched before NM
re-register can be recovered on RM when NM sends the container status across.
startContainer call after re-register will fail because the NMToken is not valid.
[~jianhe], these two errors will be much harder for apps to process and react
to than the current named exception.
Further, things like auxiliary services are also not yet set up by the time the
RPC server starts, and depending on how the service start order changes over
time, users may see different types of errors. Overall, I am in favor of keeping
the named exception, with clients explicitly retrying.
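The retry behavior argued for above could look roughly like the following. This is a hedged sketch, not the real YARN client code: `NMNotYetReadyExceptionStub`, `ContainerLauncher`, and `launchWithRetry` are hypothetical stand-ins (the actual exception is `org.apache.hadoop.yarn.exceptions.NMNotYetReadyException`). The point is that a named exception lets the client retry only that case, instead of treating it as a generic failure.

```java
public class RetryOnNmNotReady {

    // Stand-in for the named exception the NM throws before it has
    // registered with the RM.
    static class NMNotYetReadyExceptionStub extends Exception {}

    // Hypothetical container-launch hook: fails until the NM is ready.
    interface ContainerLauncher {
        void startContainer() throws NMNotYetReadyExceptionStub;
    }

    // Retry only on the named exception; any other failure surfaces
    // immediately instead of being counted as a task attempt failure.
    static int launchWithRetry(ContainerLauncher launcher,
                               int maxRetries,
                               long backoffMillis) throws Exception {
        int attempts = 0;
        while (true) {
            attempts++;
            try {
                launcher.startContainer();
                return attempts; // succeeded on this attempt
            } catch (NMNotYetReadyExceptionStub e) {
                if (attempts > maxRetries) {
                    throw e; // NM never became ready; give up
                }
                Thread.sleep(backoffMillis); // wait before retrying
            }
        }
    }
}
```

With an unnamed generic error, the client could not distinguish "NM briefly not ready" from "launch genuinely failed" and would have to burn a task attempt either way.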
> NM restarts could lead to app failures
> Key: YARN-3811
> URL: https://issues.apache.org/jira/browse/YARN-3811
> Project: Hadoop YARN
> Issue Type: Bug
> Components: nodemanager
> Affects Versions: 2.7.0
> Reporter: Karthik Kambatla
> Assignee: Karthik Kambatla
> Priority: Critical
> Consider the following scenario:
> 1. RM assigns a container on node N to an app A.
> 2. Node N is restarted
> 3. A tries to launch container on node N.
> Step 3 could lead to an NMNotYetReadyException, depending on whether NM N has
> registered with the RM. In MR, this is considered a task attempt failure; a
> few of these could lead to a task/job failure.