[jira] [Commented] (YARN-3811) NM restarts could lead to app failures
[ https://issues.apache.org/jira/browse/YARN-3811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14593852#comment-14593852 ]

Vinod Kumar Vavilapalli commented on YARN-3811:
-----------------------------------------------

bq. For NM work-preserving restart, I found the code already makes sure everything starts first before starting the containerManager server.

I didn't realize this, tx for pointing out. Had an offline discussion with [~jianhe] and couldn't come up with a case where not blocking the calls would be a problem. In all cases, whether the calls are blocked or not, they will eventually be rejected with an invalid-token error or a container-given-by-old-RM error. Even if the calls are not blocked, the same errors happen right away.

I am +1 now for not throwing this exception from the NM side. But given that it is part of the contract, I don't think we should remove the exception class, just in case.

> NM restarts could lead to app failures
> --------------------------------------
>
>                 Key: YARN-3811
>                 URL: https://issues.apache.org/jira/browse/YARN-3811
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>    Affects Versions: 2.7.0
>            Reporter: Karthik Kambatla
>            Assignee: Karthik Kambatla
>            Priority: Critical
>
> Consider the following scenario:
> 1. RM assigns a container on node N to an app A.
> 2. Node N is restarted.
> 3. A tries to launch a container on node N.
>
> Step 3 could lead to an NMNotYetReadyException depending on whether NM N has registered with the RM. In MR, this is considered a task attempt failure. A few of these could lead to a task/job failure.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[ https://issues.apache.org/jira/browse/YARN-3811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14590319#comment-14590319 ]

Jian He commented on YARN-3811:
-------------------------------

bq. this is not possible to do as the NM needs to report the RPC server port during registration - so, server start should happen before registration.

For RM work-preserving restart, this is not a problem as the NM remains as-is. For NM restart with no recovery, all outstanding containers allocated on this node are killed anyway. For NM work-preserving restart, I found the code already makes sure everything starts first before starting the containerManager server:
{code}
if (delayedRpcServerStart) {
  waitForRecoveredContainers();
  server.start();
}
{code}
Overall, I think it's fine to add a client retry fix in 2.7.1. But long term I'd like to revisit this; maybe I'm still missing something.
[ https://issues.apache.org/jira/browse/YARN-3811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14590207#comment-14590207 ]

Jason Lowe commented on YARN-3811:
----------------------------------

bq. this is not possible to do as the NM needs to report the RPC server port during registration - so, server start should happen before registration.

Yes, but that's a limitation in the RPC layer. If we could bind the server before we start it, then we could learn the port, register with the RM, and then start the server. IMHO the RPC layer should support this, but I understand we'll have to work around the lack of it in the interim. I think we can all agree the retry exception is just a hack being used because we can't keep the client service from serving too soon.
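The bind-before-serve ordering Jason describes can be illustrated with plain java.nio sockets. This is a generic sketch, not the Hadoop RPC layer: the channel is bound first (so the ephemeral port is known and could be reported during registration), and the accept loop would only start afterwards.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.channels.ServerSocketChannel;

public class BindBeforeServe {

    // Bind to port 0 so the kernel picks a free ephemeral port; the port is
    // known from this point on, even though nothing is accepting yet.
    public static int bindAndGetPort(ServerSocketChannel ch) throws IOException {
        ch.bind(new InetSocketAddress(0));
        return ((InetSocketAddress) ch.getLocalAddress()).getPort();
    }

    public static void main(String[] args) throws IOException {
        ServerSocketChannel ch = ServerSocketChannel.open();
        int port = bindAndGetPort(ch);
        // A NodeManager-like service could report `port` to the RM here,
        // and only start its accept loop after registration succeeds.
        System.out.println("bound to port " + port + " before serving");
        ch.close();
    }
}
```

In the interim the NM cannot do this, because the Hadoop RPC server only exposes its port once started, which is exactly the limitation being discussed.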
[ https://issues.apache.org/jira/browse/YARN-3811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14590164#comment-14590164 ]

Vinod Kumar Vavilapalli commented on YARN-3811:
-----------------------------------------------

bq. We should also consider graceful NM decommission. For graceful decommission, the RM should refrain from assigning more tasks to the node in question. Should we also prevent AMs that have already been assigned this node from starting new containers? In that case, I guess we would not be throwing NMNotYetReadyException, but another YarnException - NMShuttingDownException?

[~kasha], we could. Let's file a separate JIRA?

bq. we should just avoid opening or processing the client port until we've registered with the RM if it's really a problem in practice

[~jlowe], this is not possible to do as the NM needs to report the RPC server port during registration - so, server start should happen before registration.

bq. 2. For NM restart with no recovery support, startContainer will fail anyways because the NMToken is not valid.
bq. 3. For work-preserving RM restart, containers launched before NM re-register can be recovered on RM when NM sends the container status across. startContainer call after re-register will fail because the NMToken is not valid.

[~jianhe], these two errors will be much harder for apps to process and react to than the current named exception. Further, things like auxiliary services are also not yet set up by the time the RPC server starts, and depending on how the service order changes over time, users may get different types of errors.

Overall, I am in favor of keeping the named exception with clients explicitly retrying.
[ https://issues.apache.org/jira/browse/YARN-3811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14589893#comment-14589893 ]

Jason Lowe commented on YARN-3811:
----------------------------------

I agree with Jian that we probably don't need the not-yet-ready exception. I was never a fan of it in the first place, as IMHO we should just avoid opening or processing the client port until we've registered with the RM if it's really a problem in practice. As Jian points out, I think the NMToken will cover the cases where someone is trying to launch something they shouldn't be launching, so I don't think we need to wait for the RM registration.
[ https://issues.apache.org/jira/browse/YARN-3811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14589888#comment-14589888 ]

Karthik Kambatla commented on YARN-3811:
----------------------------------------

We should also consider graceful NM decommission. For graceful decommission, the RM should refrain from assigning more tasks to the node in question. Should we also prevent AMs that have already been assigned this node from starting new containers? In that case, I guess we would not be throwing NMNotYetReadyException, but another YarnException - NMShuttingDownException?

On the client side (the MR AM in this case), we should probably consider any {{YarnException}} a system error and count it against KILLED?
[ https://issues.apache.org/jira/browse/YARN-3811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14589256#comment-14589256 ]

Jian He commented on YARN-3811:
-------------------------------

I'm actually wondering whether we still need the NMNotYetReadyException. It is currently thrown when the NM has started its services but has not yet registered/re-registered with the RM. It may be ok to just launch the container:
1. For work-preserving NM restart (the scenario in this jira), I think it's ok to just launch the container instead of throwing the exception.
2. For NM restart with no recovery support, startContainer will fail anyway because the NMToken is not valid.
3. For work-preserving RM restart, containers launched before the NM re-registers can be recovered on the RM when the NM sends the container status across. A startContainer call after re-register will fail because the NMToken is not valid.
[ https://issues.apache.org/jira/browse/YARN-3811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14589240#comment-14589240 ]

Vinod Kumar Vavilapalli commented on YARN-3811:
-----------------------------------------------

bq. I kind of agree, but this is a remote exception for the client (MR-AM in this case). What is the best way to handle remote exceptions?

The client should already be unwrapping and throwing the right exception locally. The diagnostic message you posted also seems to point to the same.
[ https://issues.apache.org/jira/browse/YARN-3811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14589098#comment-14589098 ]

Karthik Kambatla commented on YARN-3811:
----------------------------------------

By the way, here is the stack trace:
{noformat}
2015-06-16 17:31:36,663 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Diagnostics report from attempt_1434500031312_0008_m_35_0: Container launch failed for container_e04_1434500031312_0008_01_37 : org.apache.hadoop.yarn.exceptions.NMNotYetReadyException: Rejecting new containers as NodeManager has not yet connected with ResourceManager
	at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.startContainers(ContainerManagerImpl.java:693)
	at org.apache.hadoop.yarn.api.impl.pb.service.ContainerManagementProtocolPBServiceImpl.startContainers(ContainerManagementProtocolPBServiceImpl.java:60)
	at org.apache.hadoop.yarn.proto.ContainerManagementProtocol$ContainerManagementProtocolService$2.callBlockingMethod(ContainerManagementProtocol.java:95)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:619)
	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1060)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2044)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2040)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:415)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2038)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
	at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
	at org.apache.hadoop.yarn.ipc.RPCUtil.instantiateException(RPCUtil.java:53)
	at org.apache.hadoop.yarn.ipc.RPCUtil.unwrapAndThrowException(RPCUtil.java:101)
	at org.apache.hadoop.yarn.api.impl.pb.client.ContainerManagementProtocolPBClientImpl.startContainers(ContainerManagementProtocolPBClientImpl.java:99)
	at sun.reflect.GeneratedMethodAccessor4.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:606)
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
	at com.sun.proxy.$Proxy40.startContainers(Unknown Source)
	at org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl$Container.launch(ContainerLauncherImpl.java:151)
	at org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl$EventProcessor.run(ContainerLauncherImpl.java:369)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.yarn.exceptions.NMNotYetReadyException): Rejecting new containers as NodeManager has not yet connected with ResourceManager
	at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.startContainers(ContainerManagerImpl.java:693)
	at org.apache.hadoop.yarn.api.impl.pb.service.ContainerManagementProtocolPBServiceImpl.startContainers(ContainerManagementProtocolPBServiceImpl.java:60)
	at org.apache.hadoop.yarn.proto.ContainerManagementProtocol$ContainerManagementProtocolService$2.callBlockingMethod(ContainerManagementProtocol.java:95)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:619)
	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1060)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2044)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2040)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:415)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2038)
	at org.apache.hadoop.ipc.Client.call(Client.java:1468)
	at org.apache.hadoop.ipc.Client.call(Client.java:1399)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
{noformat}
[ https://issues.apache.org/jira/browse/YARN-3811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14589059#comment-14589059 ]

Karthik Kambatla commented on YARN-3811:
----------------------------------------

This wasn't as big an issue without work-preserving RM restart, as the AM itself would be restarted and the window of opportunity for it to try launching containers was fairly small.

bq. the right solution is for clients to retry NMNotYetReadyException

I kind of agree, but this is a remote exception for the client (the MR AM in this case). What is the best way to handle remote exceptions?
[ https://issues.apache.org/jira/browse/YARN-3811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14589016#comment-14589016 ]

Vinod Kumar Vavilapalli commented on YARN-3811:
-----------------------------------------------

This is a long-standing issue - we added the exception in YARN-562. I think that instead of blanket retries (solution #1 above), the right solution is for clients to retry on NMNotYetReadyException. We can do that in the NMClient library for Java clients? /cc [~jianhe]
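A minimal sketch of the client-side retry Vinod proposes. `NotYetReadyException` below is a hypothetical stand-in for NMNotYetReadyException, and the fixed-sleep loop is only illustrative; it is not the actual NMClient or Hadoop RetryPolicy plumbing.

```java
import java.util.concurrent.Callable;

public class StartContainerRetry {

    /** Hypothetical stand-in for NMNotYetReadyException. */
    public static class NotYetReadyException extends Exception {
        public NotYetReadyException(String msg) { super(msg); }
    }

    /**
     * Retry a call while it fails with NotYetReadyException, sleeping a
     * fixed interval between attempts. Any other exception propagates
     * immediately, so real failures still surface to the caller.
     */
    public static <T> T callWithRetry(Callable<T> call, int maxAttempts, long sleepMs)
            throws Exception {
        NotYetReadyException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return call.call();
            } catch (NotYetReadyException e) {
                last = e;               // NM not registered yet: wait and retry
                Thread.sleep(sleepMs);
            }
        }
        throw last;                     // retries exhausted: surface the failure
    }
}
```

In the real fix this policy would live inside the client library, so AMs get the retry behavior without any application-side changes.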
[ https://issues.apache.org/jira/browse/YARN-3811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14588995#comment-14588995 ]

Karthik Kambatla commented on YARN-3811:
----------------------------------------

The issue is with counting container-launch failures against the 4 allowed task failures. We could potentially go about this in different ways:
# Support retries when launching containers. Start/stop containers are @AtMostOnce operations. This works okay for the NM-restart cases. When an NM goes down, this will lead to the job waiting longer before trying another node.
# On failure to launch a container, return an error code that explicitly annotates it as a system error and not a user error. The AMs could choose not to count system errors against the number of task-attempt failures.
# Without any changes in Yarn, MR should treat exceptions on startContainers() differently from failures captured in StartContainersResponse#getFailedRequests. That is, NMNotYetReadyException and IOException would not be counted against the number of allowed failures.

Option 2 seems like a cleaner approach to me.
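Option 2 could look roughly like the sketch below. The classifier class and the exception-name matching are hypothetical illustrations only; a real implementation would presumably carry a dedicated error code on the launch response rather than inspecting exception types on the client.

```java
import java.io.IOException;

public class LaunchFailureClassifier {

    public enum FailureKind { SYSTEM, USER }

    /**
     * Hypothetical classifier for option 2: system errors (infrastructure
     * hiccups such as an NM that hasn't registered yet, or transient I/O
     * failures) are not counted against the allowed task-attempt failures.
     * Matching by simple class name keeps this sketch dependency-free.
     */
    public static FailureKind classify(Throwable t) {
        String name = t.getClass().getSimpleName();
        if (name.equals("NMNotYetReadyException") || t instanceof IOException) {
            return FailureKind.SYSTEM;   // retry elsewhere, don't penalize the task
        }
        return FailureKind.USER;         // counts against allowed attempt failures
    }
}
```

With a classification like this, the AM's failure accounting would only increment on USER failures, which is the behavior options 2 and 3 are both aiming for.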
[ https://issues.apache.org/jira/browse/YARN-3811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14588984#comment-14588984 ]

Karthik Kambatla commented on YARN-3811:
----------------------------------------

We ran into this in our rolling-upgrade tests.