[jira] [Commented] (YARN-3811) NM restarts could lead to app failures
[ https://issues.apache.org/jira/browse/YARN-3811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14593852#comment-14593852 ]

Vinod Kumar Vavilapalli commented on YARN-3811:
-----------------------------------------------

bq. For NM work-preserving restart, I found the code already makes sure everything starts first before starting the containerManager server.

I didn't realize this, tx for pointing out. Had an offline discussion with [~jianhe] and couldn't come up with a case where not blocking the calls would be a problem. In all cases, whether the calls are blocked or not, they will eventually be rejected with an invalid-token error or a container-given-by-old-RM error. Even if the calls are not blocked, the same errors happen right away.

I am +1 now for not throwing this exception from the NM side. But given that it is part of the contract, I don't think we should remove the exception class, just in case.

> NM restarts could lead to app failures
> --------------------------------------
>
>                 Key: YARN-3811
>                 URL: https://issues.apache.org/jira/browse/YARN-3811
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>    Affects Versions: 2.7.0
>            Reporter: Karthik Kambatla
>            Assignee: Karthik Kambatla
>            Priority: Critical
>
> Consider the following scenario:
> 1. RM assigns a container on node N to an app A.
> 2. Node N is restarted.
> 3. A tries to launch a container on node N.
>
> Step 3 could lead to an NMNotYetReadyException depending on whether NM N has registered with the RM. In MR, this is considered a task attempt failure. A few of these could lead to a task/job failure.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[ https://issues.apache.org/jira/browse/YARN-3811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14590319#comment-14590319 ]

Jian He commented on YARN-3811:
-------------------------------

bq. this is not possible to do as the NM needs to report the RPC server port during registration - so, server start should happen before registration.

For RM work-preserving restart, this is not a problem as the NM remains as-is. For NM restart with no recovery, all outstanding containers allocated on this node are killed anyway. For NM work-preserving restart, I found the code already makes sure everything starts first before starting the containerManager server:
{code}
if (delayedRpcServerStart) {
  waitForRecoveredContainers();
  server.start();
}
{code}
Overall, I think it's fine to add a client retry fix in 2.7.1. But long term I'd like to revisit this; maybe I'm still missing something.
[ https://issues.apache.org/jira/browse/YARN-3811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14590207#comment-14590207 ]

Jason Lowe commented on YARN-3811:
----------------------------------

bq. this is not possible to do as the NM needs to report the RPC server port during registration - so, server start should happen before registration.

Yes, but that's a limitation in the RPC layer. If we could bind the server before we start it, then we could learn the port, register with the RM, and then start the server. IMHO the RPC layer should support this, but I understand we'll have to work around the lack of it in the interim. I think we can all agree the retry exception is just a hack being used because we can't keep the client service from serving too soon.
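The bind-before-serve ordering Jason describes can be illustrated with plain java.nio sockets. This is a generic sketch, not the Hadoop RPC layer: the channel is bound first (so the ephemeral port is known and could be reported during registration), and the accept loop would only start afterwards.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.channels.ServerSocketChannel;

public class BindBeforeServe {

    // Bind to port 0 so the kernel picks a free ephemeral port; the port is
    // known from this point on, even though nothing is accepting yet.
    public static int bindAndGetPort(ServerSocketChannel ch) throws IOException {
        ch.bind(new InetSocketAddress(0));
        return ((InetSocketAddress) ch.getLocalAddress()).getPort();
    }

    public static void main(String[] args) throws IOException {
        ServerSocketChannel ch = ServerSocketChannel.open();
        int port = bindAndGetPort(ch);
        // A NodeManager-like service could report `port` to the RM here,
        // and only start its accept loop after registration succeeds.
        System.out.println("bound to port " + port + " before serving");
        ch.close();
    }
}
```

In the interim the NM cannot do this, because the Hadoop RPC server only exposes its port once started, which is exactly the limitation being discussed.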
[ https://issues.apache.org/jira/browse/YARN-3811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14590164#comment-14590164 ]

Vinod Kumar Vavilapalli commented on YARN-3811:
-----------------------------------------------

bq. We should also consider graceful NM decommission. For graceful decommission, the RM should refrain from assigning more tasks to the node in question. Should we also prevent AMs that have already been assigned this node from starting new containers? In that case, I guess we would not be throwing NMNotYetReadyException, but another YarnException - NMShuttingDownException?

[~kasha], we could. Let's file a separate JIRA?

bq. we should just avoid opening or processing the client port until we've registered with the RM if it's really a problem in practice

[~jlowe], this is not possible to do as the NM needs to report the RPC server port during registration - so, server start should happen before registration.

bq. 2. For NM restart with no recovery support, startContainer will fail anyways because the NMToken is not valid.
bq. 3. For work-preserving RM restart, containers launched before NM re-register can be recovered on RM when NM sends the container status across. startContainer call after re-register will fail because the NMToken is not valid.

[~jianhe], these two errors will be much harder for apps to process and react to than the current named exception. Further, things like auxiliary services are also not yet set up by the time the RPC server starts, and depending on how the service order changes over time, users may get different types of errors.

Overall, I am in favor of keeping the named exception with clients explicitly retrying.
[ https://issues.apache.org/jira/browse/YARN-3811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14589893#comment-14589893 ]

Jason Lowe commented on YARN-3811:
----------------------------------

I agree with Jian that we probably don't need the not-yet-ready exception. I was never a fan of it in the first place, as IMHO we should just avoid opening or processing the client port until we've registered with the RM if it's really a problem in practice. As Jian points out, I think the NMToken will cover the cases where someone is trying to launch something they shouldn't be launching, so I don't think we need to wait for the RM registration.
[ https://issues.apache.org/jira/browse/YARN-3811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14589888#comment-14589888 ]

Karthik Kambatla commented on YARN-3811:
----------------------------------------

We should also consider graceful NM decommission. For graceful decommission, the RM should refrain from assigning more tasks to the node in question. Should we also prevent AMs that have already been assigned this node from starting new containers? In that case, I guess we would not be throwing NMNotYetReadyException, but another YarnException - NMShuttingDownException?

On the client side (the MR AM in this case), we should probably consider any {{YarnException}} a system error and count it against KILLED?
[ https://issues.apache.org/jira/browse/YARN-3811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14589256#comment-14589256 ]

Jian He commented on YARN-3811:
-------------------------------

I'm actually wondering whether we still need the NMNotYetReadyException. It is currently thrown when the NM has started its services but has not yet registered/re-registered with the RM. It may be ok to just launch the container:
1. For work-preserving NM restart (the scenario in this jira), I think it's ok to just launch the container instead of throwing the exception.
2. For NM restart with no recovery support, startContainer will fail anyway because the NMToken is not valid.
3. For work-preserving RM restart, containers launched before the NM re-registers can be recovered on the RM when the NM sends the container status across. A startContainer call after re-register will fail because the NMToken is not valid.
[ https://issues.apache.org/jira/browse/YARN-3811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14589240#comment-14589240 ]

Vinod Kumar Vavilapalli commented on YARN-3811:
-----------------------------------------------

bq. I kind of agree, but this is a remote exception for the client (MR-AM in this case). What is the best way to handle remote exceptions?

The client should already be unwrapping and throwing the right exception locally. The diagnostic message you posted also seems to point to the same.
[ https://issues.apache.org/jira/browse/YARN-3811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14589098#comment-14589098 ]

Karthik Kambatla commented on YARN-3811:
----------------------------------------

By the way, here is the stack trace:
{noformat}
2015-06-16 17:31:36,663 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Diagnostics report from attempt_1434500031312_0008_m_35_0: Container launch failed for container_e04_1434500031312_0008_01_37 : org.apache.hadoop.yarn.exceptions.NMNotYetReadyException: Rejecting new containers as NodeManager has not yet connected with ResourceManager
	at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.startContainers(ContainerManagerImpl.java:693)
	at org.apache.hadoop.yarn.api.impl.pb.service.ContainerManagementProtocolPBServiceImpl.startContainers(ContainerManagementProtocolPBServiceImpl.java:60)
	at org.apache.hadoop.yarn.proto.ContainerManagementProtocol$ContainerManagementProtocolService$2.callBlockingMethod(ContainerManagementProtocol.java:95)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:619)
	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1060)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2044)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2040)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:415)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2038)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
	at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
	at org.apache.hadoop.yarn.ipc.RPCUtil.instantiateException(RPCUtil.java:53)
	at org.apache.hadoop.yarn.ipc.RPCUtil.unwrapAndThrowException(RPCUtil.java:101)
	at org.apache.hadoop.yarn.api.impl.pb.client.ContainerManagementProtocolPBClientImpl.startContainers(ContainerManagementProtocolPBClientImpl.java:99)
	at sun.reflect.GeneratedMethodAccessor4.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:606)
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
	at com.sun.proxy.$Proxy40.startContainers(Unknown Source)
	at org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl$Container.launch(ContainerLauncherImpl.java:151)
	at org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl$EventProcessor.run(ContainerLauncherImpl.java:369)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.yarn.exceptions.NMNotYetReadyException): Rejecting new containers as NodeManager has not yet connected with ResourceManager
	at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.startContainers(ContainerManagerImpl.java:693)
	at org.apache.hadoop.yarn.api.impl.pb.service.ContainerManagementProtocolPBServiceImpl.startContainers(ContainerManagementProtocolPBServiceImpl.java:60)
	at org.apache.hadoop.yarn.proto.ContainerManagementProtocol$ContainerManagementProtocolService$2.callBlockingMethod(ContainerManagementProtocol.java:95)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:619)
	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1060)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2044)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2040)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:415)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2038)
	at org.apache.hadoop.ipc.Client.call(Client.java:1468)
	at org.apache.hadoop.ipc.Client.call(Client.java:1399)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
{noformat}
[ https://issues.apache.org/jira/browse/YARN-3811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14589059#comment-14589059 ]

Karthik Kambatla commented on YARN-3811:
----------------------------------------

This wasn't as big an issue without work-preserving RM restart, as the AM itself would be restarted and the window of opportunity for it to try launching containers was fairly small.

bq. the right solution is for clients to retry NMNotYetReadyException

I kind of agree, but this is a remote exception for the client (the MR AM in this case). What is the best way to handle remote exceptions?
[ https://issues.apache.org/jira/browse/YARN-3811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14589016#comment-14589016 ]

Vinod Kumar Vavilapalli commented on YARN-3811:
-----------------------------------------------

This is a long-standing issue - we added the exception in YARN-562. I think that instead of blanket retries (solution #1 above), the right solution is for clients to retry on NMNotYetReadyException. We can do that in the NMClient library for Java clients? /cc [~jianhe]
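A minimal sketch of the client-side retry Vinod proposes. `NotYetReadyException` below is a hypothetical stand-in for NMNotYetReadyException, and the fixed-sleep loop is only illustrative; it is not the actual NMClient or Hadoop RetryPolicy plumbing.

```java
import java.util.concurrent.Callable;

public class StartContainerRetry {

    /** Hypothetical stand-in for NMNotYetReadyException. */
    public static class NotYetReadyException extends Exception {
        public NotYetReadyException(String msg) { super(msg); }
    }

    /**
     * Retry a call while it fails with NotYetReadyException, sleeping a
     * fixed interval between attempts. Any other exception propagates
     * immediately, so real failures still surface to the caller.
     */
    public static <T> T callWithRetry(Callable<T> call, int maxAttempts, long sleepMs)
            throws Exception {
        NotYetReadyException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return call.call();
            } catch (NotYetReadyException e) {
                last = e;               // NM not registered yet: wait and retry
                Thread.sleep(sleepMs);
            }
        }
        throw last;                     // retries exhausted: surface the failure
    }
}
```

In the real fix this policy would live inside the client library, so AMs get the retry behavior without any application-side changes.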
[ https://issues.apache.org/jira/browse/YARN-3811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14588995#comment-14588995 ]

Karthik Kambatla commented on YARN-3811:
----------------------------------------

The issue is with counting container-launch failures against the 4 allowed task failures. We could potentially go about this in different ways:
# Support retries when launching containers. Start/stop containers are @AtMostOnce operations. This works okay for the NM-restart cases. When an NM goes down, this will lead to the job waiting longer before trying another node.
# On failure to launch a container, return an error code that explicitly annotates it as a system error and not a user error. The AMs could choose not to count system errors against the number of task-attempt failures.
# Without any changes in Yarn, MR should treat exceptions on startContainers() differently from failures captured in StartContainersResponse#getFailedRequests. That is, NMNotYetReadyException and IOException would not be counted against the number of allowed failures.

Option 2 seems like a cleaner approach to me.
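Option 2 could look roughly like the sketch below. The classifier class and the exception-name matching are hypothetical illustrations only; a real implementation would presumably carry a dedicated error code on the launch response rather than inspecting exception types on the client.

```java
import java.io.IOException;

public class LaunchFailureClassifier {

    public enum FailureKind { SYSTEM, USER }

    /**
     * Hypothetical classifier for option 2: system errors (infrastructure
     * hiccups such as an NM that hasn't registered yet, or transient I/O
     * failures) are not counted against the allowed task-attempt failures.
     * Matching by simple class name keeps this sketch dependency-free.
     */
    public static FailureKind classify(Throwable t) {
        String name = t.getClass().getSimpleName();
        if (name.equals("NMNotYetReadyException") || t instanceof IOException) {
            return FailureKind.SYSTEM;   // retry elsewhere, don't penalize the task
        }
        return FailureKind.USER;         // counts against allowed attempt failures
    }
}
```

With a classification like this, the AM's failure accounting would only increment on USER failures, which is the behavior options 2 and 3 are both aiming for.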
[ https://issues.apache.org/jira/browse/YARN-3811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14588984#comment-14588984 ]

Karthik Kambatla commented on YARN-3811:
----------------------------------------

We ran into this in our rolling-upgrade tests.