[jira] [Commented] (YARN-2578) NM does not failover timely if RM node network connection fails
[ https://issues.apache.org/jira/browse/YARN-2578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14630535#comment-14630535 ]

Masatake Iwasaki commented on YARN-2578:
----------------------------------------

bq. 2. Would you tell me why Client.getRpcTimeout returns 0 if ipc.client.ping is false?

Just to make it clear that the timeout has no effect without setting {{ipc.client.ping}} to true.

> NM does not failover timely if RM node network connection fails
> ---------------------------------------------------------------
>
>                 Key: YARN-2578
>                 URL: https://issues.apache.org/jira/browse/YARN-2578
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>    Affects Versions: 2.5.1
>            Reporter: Wilfred Spiegelenburg
>            Assignee: Wilfred Spiegelenburg
>         Attachments: YARN-2578.002.patch, YARN-2578.patch
>
>
> The NM does not fail over correctly when the network cable of the RM is
> unplugged, or when the failure is simulated by a "service network stop" or a
> firewall that drops all traffic on the node. The RM fails over to the standby
> node when the failure is detected, as expected. The NM should then re-register
> with the new active RM. This re-registration takes a long time (15 minutes or
> more). Until then the cluster has no nodes for processing and applications
> are stuck.
> Reproduction test case which can be used in any environment:
> - create a cluster with 3 nodes
>     node 1: ZK, NN, JN, ZKFC, DN, RM, NM
>     node 2: ZK, NN, JN, ZKFC, DN, RM, NM
>     node 3: ZK, JN, DN, NM
> - start all services and make sure they are in good health
> - kill the network connection of the active RM using one of the network kills from above
> - observe the NN and RM failover
> - the DNs fail over to the new active NN
> - the NM does not recover for a long time
> - the logs show a long delay and the stack traces show no change at all
> The stack traces of the NM all show the same set of threads. The main thread
> which should be used in the re-registration is the "Node Status Updater". This
> thread is stuck in:
> {code}
> "Node Status Updater" prio=10 tid=0x7f5a6cc99800 nid=0x18d0 in Object.wait() [0x7f5a51fc1000]
>    java.lang.Thread.State: WAITING (on object monitor)
>         at java.lang.Object.wait(Native Method)
>         - waiting on <0xed62f488> (a org.apache.hadoop.ipc.Client$Call)
>         at java.lang.Object.wait(Object.java:503)
>         at org.apache.hadoop.ipc.Client.call(Client.java:1395)
>         - locked <0xed62f488> (a org.apache.hadoop.ipc.Client$Call)
>         at org.apache.hadoop.ipc.Client.call(Client.java:1362)
>         at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
>         at com.sun.proxy.$Proxy26.nodeHeartbeat(Unknown Source)
>         at org.apache.hadoop.yarn.server.api.impl.pb.client.ResourceTrackerPBClientImpl.nodeHeartbeat(ResourceTrackerPBClientImpl.java:80)
> {code}
> The client connection which goes through the proxy can be traced back to the
> ResourceTrackerPBClientImpl. The generated proxy does not time out, and we
> should be using a version which takes the RPC timeout (from the
> configuration) as a parameter.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
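The hang in the stack trace above comes down to an unbounded {{Object.wait()}} on the in-flight call. A minimal stand-alone sketch (plain Java, not Hadoop code; the class and method names are illustrative only) of how a bounded wait lets the caller recover where an unbounded one blocks forever:

```java
public class HeartbeatWaitSketch {
    // Wait on the monitor for at most timeoutMs, tolerating spurious
    // wakeups, and return the elapsed time in milliseconds.
    static long boundedWait(Object call, long timeoutMs) throws InterruptedException {
        long start = System.nanoTime();
        long deadline = start + timeoutMs * 1_000_000L;
        synchronized (call) {
            long remaining;
            while ((remaining = deadline - System.nanoTime()) > 0) {
                // +1 so we never call wait(0), which would block forever
                call.wait(remaining / 1_000_000 + 1);
            }
        }
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) throws InterruptedException {
        // With no timeout, wait() on the never-notified call object would
        // block indefinitely -- the "Node Status Updater" hang above.
        // With a bound the wait returns and the caller can retry/failover.
        long elapsed = boundedWait(new Object(), 200); // hypothetical 200 ms rpcTimeout
        System.out.println(elapsed >= 200); // prints true: recovered, not hung
    }
}
```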
[jira] [Commented] (YARN-2578) NM does not failover timely if RM node network connection fails
[ https://issues.apache.org/jira/browse/YARN-2578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14630532#comment-14630532 ]

Masatake Iwasaki commented on YARN-2578:
----------------------------------------

Yes, it is the same fix. I agree it should be fixed in a hadoop-common JIRA. Thanks.
[jira] [Commented] (YARN-2578) NM does not failover timely if RM node network connection fails
[ https://issues.apache.org/jira/browse/YARN-2578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14629940#comment-14629940 ]

Ming Ma commented on YARN-2578:
-------------------------------

Thanks [~iwasakims]. Is it similar to HADOOP-11252? Given your latest patch is in hadoop-common, it might be better to fix it as a HADOOP jira instead.
[jira] [Commented] (YARN-2578) NM does not failover timely if RM node network connection fails
[ https://issues.apache.org/jira/browse/YARN-2578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14629250#comment-14629250 ]

Akira AJISAKA commented on YARN-2578:
-------------------------------------

Thanks [~iwasakims] for creating the patch. One comment and one question from me.

bq. The default value is 0 in order to keep current behaviour.

1. We would like to fix this bug, so a default of 1 min is good for me.
2. Would you tell me why {{Client.getRpcTimeout}} returns 0 if {{ipc.client.ping}} is false?
[jira] [Commented] (YARN-2578) NM does not failover timely if RM node network connection fails
[ https://issues.apache.org/jira/browse/YARN-2578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14578948#comment-14578948 ]

Masatake Iwasaki commented on YARN-2578:
----------------------------------------

Note: the patch works because {{PingInputStream}} throws an exception on socket timeout when {{rpcTimeout}} > 0. The {{rpcTimeout}} given as an argument of {{RPC#getProtocolProxy}} has an effect only if {{ipc.client.ping}} is true.

{code}
    private void handleTimeout(SocketTimeoutException e) throws IOException {
      if (shouldCloseConnection.get() || !running.get() || rpcTimeout > 0) {
        throw e;
      } else {
        sendPing();
      }
    }
{code}

{{Client}} already has a {{getTimeout}} method, and it returns -1 by default because the default value of {{ipc.client.ping}} is true.

{code}
  final public static int getTimeout(Configuration conf) {
    if (!conf.getBoolean(CommonConfigurationKeys.IPC_CLIENT_PING_KEY,
        CommonConfigurationKeys.IPC_CLIENT_PING_DEFAULT)) {
      return getPingInterval(conf);
    }
    return -1;
  }
{code}

I think changing this to always return the ping interval and using it as the default value of {{rpcTimeout}} is an option, but it has a wider effect because {{Client.getTimeout}} is used as {{DfsClientConf.hdfsTimeout}}.
{noformat}
getTimeout 1301 ../../../../../../test/java/org/apache/hadoop/ipc/TestIPC.java
    assertEquals(Client.getTimeout(config), -1);
getTimeout 413 ../../../../../../../../../hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/NameNodeProxies.java
    org.apache.hadoop.ipc.Client.getTimeout(conf), defaultPolicy,
getTimeout 106 ../../../../../../../../../hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/client/impl/DfsClientConf.java
    hdfsTimeout = Client.getTimeout(conf);
{noformat}
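The timeout-selection behaviour discussed in this comment can be mirrored in a tiny stand-alone sketch (not the actual Hadoop {{Client}} class; the method here only imitates the logic of the quoted {{getTimeout}} code): with {{ipc.client.ping}} enabled the helper reports no timeout (-1), so {{rpcTimeout}} stays 0 and the blocked read never fails; with ping disabled it falls back to the ping interval.

```java
public class TimeoutSelectionSketch {
    // Imitation of the quoted Client.getTimeout logic:
    // ping enabled  -> -1 (no socket timeout; the read can block forever)
    // ping disabled -> the ping interval doubles as a hard timeout
    static int getTimeout(boolean pingEnabled, int pingIntervalMs) {
        if (!pingEnabled) {
            return pingIntervalMs;
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(getTimeout(true, 60_000));  // prints -1
        System.out.println(getTimeout(false, 60_000)); // prints 60000
    }
}
```

This is why the proposed change (always return the ping interval) would make the heartbeat fail fast in both configurations, at the cost of changing {{DfsClientConf.hdfsTimeout}} as noted above.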
[jira] [Commented] (YARN-2578) NM does not failover timely if RM node network connection fails
[ https://issues.apache.org/jira/browse/YARN-2578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14572143#comment-14572143 ]

Masatake Iwasaki commented on YARN-2578:
----------------------------------------

Hi [~wilfreds], do you have any update on this? I saw the same issue in our cluster and the attached patch worked. I would like the fix to come in the next release. If you do not have enough time, I would like to take over. Otherwise we can commit the current patch and fix hadoop-common later. It still applies to trunk and branch-2.
[jira] [Commented] (YARN-2578) NM does not failover timely if RM node network connection fails
[ https://issues.apache.org/jira/browse/YARN-2578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14527160#comment-14527160 ]

Wilfred Spiegelenburg commented on YARN-2578:
---------------------------------------------

Yes, I am working on this at the moment. I am also travelling, but should have a new patch ready this weekend.
[jira] [Commented] (YARN-2578) NM does not failover timely if RM node network connection fails
[ https://issues.apache.org/jira/browse/YARN-2578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14516201#comment-14516201 ]

Hadoop QA commented on YARN-2578:
---------------------------------

| (x) *{color:red}-1 overall{color}* |

|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | pre-patch | 14m 40s | Pre-patch trunk compilation is healthy. |
| {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. |
| {color:red}-1{color} | tests included | 0m 0s | The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. |
| {color:green}+1{color} | whitespace | 0m 0s | The patch has no lines that end in whitespace. |
| {color:green}+1{color} | javac | 7m 36s | There were no new javac warning messages. |
| {color:green}+1{color} | javadoc | 9m 37s | There were no new javadoc warning messages. |
| {color:green}+1{color} | release audit | 0m 22s | The applied patch does not increase the total number of release audit warnings. |
| {color:red}-1{color} | checkstyle | 5m 24s | The applied patch generated 1 additional checkstyle issue. |
| {color:green}+1{color} | install | 1m 37s | mvn install still works. |
| {color:green}+1{color} | eclipse:eclipse | 0m 33s | The patch built with eclipse:eclipse. |
| {color:green}+1{color} | findbugs | 0m 52s | The patch does not introduce any new Findbugs (version 2.0.3) warnings. |
| {color:green}+1{color} | yarn tests | 0m 25s | Tests passed in hadoop-yarn-server-common. |
| | | 41m 9s | |

|| Subsystem || Report/Notes ||
| Patch URL | http://issues.apache.org/jira/secure/attachment/12670359/YARN-2578.patch |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | trunk / feb68cb |
| checkstyle | https://builds.apache.org/job/PreCommit-YARN-Build/7518/artifact/patchprocess/checkstyle-result-diff.txt |
| hadoop-yarn-server-common test log | https://builds.apache.org/job/PreCommit-YARN-Build/7518/artifact/patchprocess/testrun_hadoop-yarn-server-common.txt |
| Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/7518/testReport/ |
| Java | 1.7.0_55 |
| uname | Linux asf905.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | https://builds.apache.org/job/PreCommit-YARN-Build/7518/console |

This message was automatically generated.
[jira] [Commented] (YARN-2578) NM does not failover timely if RM node network connection fails
[ https://issues.apache.org/jira/browse/YARN-2578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14221917#comment-14221917 ]

Rohith commented on YARN-2578:
------------------------------

bq. We can consider scenario for simulation that mentioned in YARN-1081

The correct issue id is YARN-1061.
[jira] [Commented] (YARN-2578) NM does not failover timely if RM node network connection fails
[ https://issues.apache.org/jira/browse/YARN-2578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14221915#comment-14221915 ] Rohith commented on YARN-2578: -- Apologies that I linked to wrong issue previously. I deleted wrong link and linking to correct issue. > NM does not failover timely if RM node network connection fails > --- > > Key: YARN-2578 > URL: https://issues.apache.org/jira/browse/YARN-2578 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Affects Versions: 2.5.1 >Reporter: Wilfred Spiegelenburg > Attachments: YARN-2578.patch > > > The NM does not fail over correctly when the network cable of the RM is > unplugged or the failure is simulated by a "service network stop" or a > firewall that drops all traffic on the node. The RM fails over to the standby > node when the failure is detected as expected. The NM should than re-register > with the new active RM. This re-register takes a long time (15 minutes or > more). Until then the cluster has no nodes for processing and applications > are stuck. > Reproduction test case which can be used in any environment: > - create a cluster with 3 nodes > node 1: ZK, NN, JN, ZKFC, DN, RM, NM > node 2: ZK, NN, JN, ZKFC, DN, RM, NM > node 3: ZK, JN, DN, NM > - start all services make sure they are in good health > - kill the network connection of the RM that is active using one of the > network kills from above > - observe the NN and RM failover > - the DN's fail over to the new active NN > - the NM does not recover for a long time > - the logs show a long delay and traces show no change at all > The stack traces of the NM all show the same set of threads. 
[ https://issues.apache.org/jira/browse/YARN-2578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14221913#comment-14221913 ] Rohith commented on YARN-2578: -- We can consider the simulation scenario mentioned in YARN-1081: the active RM can be made to hang using the Linux command {{kill -STOP}} on the RM process.
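The {{kill -STOP}} simulation can be sketched as below; this is a hypothetical illustration (POSIX only, and {{FreezeProcessDemo}}, the use of {{/bin/sleep}} as a stand-in victim, and the {{ps}}-based state check are all mine, not part of the patch or of Hadoop):

```java
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: freeze a stand-in process (here "sleep") with SIGSTOP
// the way a hung RM would freeze. The stopped process keeps its sockets open
// but answers nothing, which is exactly the condition under which an RPC
// client with no timeout blocks forever.
public class FreezeProcessDemo {

    // 'ps -o stat=' prints a state starting with 'T' for a SIGSTOPped process.
    public static boolean isStopped(long pid) throws Exception {
        Process ps = new ProcessBuilder("ps", "-o", "stat=", "-p",
                Long.toString(pid)).start();
        ps.waitFor(5, TimeUnit.SECONDS);
        String out = new String(ps.getInputStream().readAllBytes()).trim();
        return out.startsWith("T");
    }

    public static void main(String[] args) throws Exception {
        Process victim = new ProcessBuilder("sleep", "300").start();
        long pid = victim.pid();
        // Freeze it, as "kill -STOP <rm-pid>" would freeze the RM.
        new ProcessBuilder("kill", "-STOP", Long.toString(pid)).start().waitFor();
        Thread.sleep(200);
        System.out.println(isStopped(pid));
        // Thaw and clean up.
        new ProcessBuilder("kill", "-CONT", Long.toString(pid)).start().waitFor();
        victim.destroy();
    }
}
```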
[ https://issues.apache.org/jira/browse/YARN-2578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14215235#comment-14215235 ] Karthik Kambatla commented on YARN-2578: Since the RM doesn't (yet) suffer from the GC issues that the NN does, we thought the embedded elector was good enough. So far it appears to be doing okay, and we didn't see a particular need for ZKFC. There is an open JIRA for ZKFC (and I have an implementation that needs cleaning up); we can get that done if the need arises.
[ https://issues.apache.org/jira/browse/YARN-2578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14214806#comment-14214806 ] Harsh J commented on YARN-2578: ---
bq. We never implemented health monitoring like in ZKFC with HDFS
Was this not desired for some reason, or just punted in the early implementation? It seems worthwhile to always have such a thing.
[ https://issues.apache.org/jira/browse/YARN-2578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14180144#comment-14180144 ] Ming Ma commented on YARN-2578: --- Yeah, it is more than just * -> RM; it could also be * -> NM and * -> AM. Agreed it is better to fix it at the Hadoop common layer. From HDFS-4858, it looks like the concern with fixing it in Hadoop common is test coverage. Is there any follow-up on Hadoop common? Perhaps we can fix the Hadoop common layer so that the RPC timeout is still off by default, but if ping is set to false, the RPC timeout is set to the ping value in the code Karthik refers to. That way YARN and MR don't need to change, and people can experiment with the RPC timeout. After enough test coverage we can then set the ping default value to false.
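Ming Ma's proposed fallback could be sketched roughly as follows; {{RpcTimeoutSketch}} and its method are illustrative names, not the actual {{org.apache.hadoop.ipc.Client}} code:

```java
// Minimal sketch of the proposed behaviour: an explicit rpcTimeout wins;
// otherwise, with ipc.client.ping disabled, fall back to the ping interval
// so a call can never block forever; with ping enabled, keep the current
// behaviour (0 = no timeout, rely on periodic pings to detect dead peers).
public class RpcTimeoutSketch {

    public static int effectiveRpcTimeout(int configuredTimeoutMs,
                                          boolean pingEnabled,
                                          int pingIntervalMs) {
        if (configuredTimeoutMs > 0) {
            return configuredTimeoutMs; // explicit timeout always wins
        }
        // proposed change: ping off means the ping interval becomes the timeout
        return pingEnabled ? 0 : pingIntervalMs;
    }

    public static void main(String[] args) {
        System.out.println(effectiveRpcTimeout(30000, true, 60000));  // 30000
        System.out.println(effectiveRpcTimeout(0, false, 60000));     // 60000
        System.out.println(effectiveRpcTimeout(0, true, 60000));      // 0
    }
}
```

With this shape, YARN and MR clients keep today's defaults untouched, while clusters that set {{ipc.client.ping}} to false get a bounded call instead of an indefinite {{Object.wait()}}.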
[ https://issues.apache.org/jira/browse/YARN-2578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14154084#comment-14154084 ] Vinod Kumar Vavilapalli commented on YARN-2578: ---
bq. Instead of fixing it everywhere, how about we fix this in RPC itself? In https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ipc/RPC.java#L488, instead of using 0 as the default value, the default could be looked up in the Configuration. No?
+1. The default from conf is 1 min. Assuming it all boils down to the ping interval, we should fix it in common.
[ https://issues.apache.org/jira/browse/YARN-2578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14148452#comment-14148452 ] Wilfred Spiegelenburg commented on YARN-2578: - I proposed fixing the RPC code, setting the timeout by default, in HDFS-4858, but there was no interest in fixing the client at that point in time. So we now have to fix it everywhere, unless we can get everyone on board and change the behaviour in the RPC code. The comments are still in that JIRA, and it would be a straightforward fix in the RPC code.
[ https://issues.apache.org/jira/browse/YARN-2578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14148381#comment-14148381 ] Karthik Kambatla commented on YARN-2578: Thanks for the clarification, Wilfred. Don't we need to do the same for AM->RM and Client->RM as well? Instead of fixing it everywhere, how about we fix this in RPC itself? In https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ipc/RPC.java#L488, instead of using 0 as the default value, the default could be looked up in the Configuration. No?
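The "look the default up in the Configuration" idea could look roughly like this; a minimal sketch in which {{java.util.Properties}} stands in for Hadoop's {{Configuration}}, and the class and key names are illustrative rather than the actual Hadoop API:

```java
import java.util.Properties;

// Sketch of reading the default rpcTimeout from configuration instead of
// hard-coding 0 at the RPC.getProxy call site. Properties stands in for
// org.apache.hadoop.conf.Configuration; the key name is illustrative.
public class RpcTimeoutFromConf {
    static final String RPC_TIMEOUT_KEY = "ipc.client.rpc-timeout.ms";
    static final int RPC_TIMEOUT_DEFAULT = 0; // 0 = wait forever (current behaviour)

    public static int getRpcTimeout(Properties conf) {
        String v = conf.getProperty(RPC_TIMEOUT_KEY);
        return v == null ? RPC_TIMEOUT_DEFAULT : Integer.parseInt(v);
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        // Key unset: the old behaviour, calls can block indefinitely.
        System.out.println(getRpcTimeout(conf)); // 0
        // Key set: heartbeats to a dead RM fail after the timeout instead.
        conf.setProperty(RPC_TIMEOUT_KEY, "60000");
        System.out.println(getRpcTimeout(conf)); // 60000
    }
}
```

The appeal of this shape is that every proxy created through the common RPC layer (NM->RM, AM->RM, Client->RM) picks up the timeout from one place, without each YARN client needing its own fix.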
[ https://issues.apache.org/jira/browse/YARN-2578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14144022#comment-14144022 ] Wilfred Spiegelenburg commented on YARN-2578: - I looked into automated testing, but as in HDFS-4858 I have not been able to find a way to test this using JUnit tests. Manual testing is straightforward using the reproduction scenario above.
[ https://issues.apache.org/jira/browse/YARN-2578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14144016#comment-14144016 ] Wilfred Spiegelenburg commented on YARN-2578: - To address [~vinodkv]'s comments: the active RM is completely cut off from the network, as are all the other services on the node, including ZooKeeper. The RM can update its local ZooKeeper instance, but that update is never propagated outside the node to the other ZooKeeper nodes, so it cannot be seen by the standby RM. The standby RM detects no updates in ZooKeeper for the timeout period and becomes the active node. That is the normal HA behaviour from the standby node, as if the RM had crashed.
[ https://issues.apache.org/jira/browse/YARN-2578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14144004#comment-14144004 ] Hadoop QA commented on YARN-2578: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12670359/YARN-2578.patch against trunk revision 23e17ce.
{color:green}+1 @author{color}. The patch does not contain any @author tags.
{color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch.
{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.
{color:green}+1 javadoc{color}. There were no new javadoc warning messages.
{color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse.
{color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings.
{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.
{color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common.
{color:green}+1 contrib tests{color}. The patch passed contrib unit tests.
Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5071//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5071//console
This message is automatically generated.
[jira] [Commented] (YARN-2578) NM does not failover timely if RM node network connection fails
[ https://issues.apache.org/jira/browse/YARN-2578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14143951#comment-14143951 ] Vinod Kumar Vavilapalli commented on YARN-2578:

bq. The NM does not fail over correctly when the network cable of the RM is unplugged or the failure is simulated by a "service network stop" or a firewall that drops all traffic on the node. The RM fails over to the standby node when the failure is detected as expected.

I am surprised that the RM itself fails over (in the context of a firewall rule that drops traffic) - we never implemented health monitoring like ZKFC does for HDFS. It seems that if the RPC port gets blocked, the RM will not fail over, as the embedded ZK client continues to use the local loopback and so does not detect the network failure.
[jira] [Commented] (YARN-2578) NM does not failover timely if RM node network connection fails
[ https://issues.apache.org/jira/browse/YARN-2578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14143890#comment-14143890 ] Karthik Kambatla commented on YARN-2578:

Would it be possible to add a test case for this?
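On the test-case question: the core behavior to exercise is the one shown in the "Node Status Updater" stack trace above - a blocking call with no timeout waits forever when the peer silently stops responding, whereas an explicit timeout surfaces the failure promptly so the caller can retry or fail over. A minimal Python sketch of that failure mode (plain sockets, not Hadoop IPC; the helper name and the 0.2s timeout are illustrative, not from the patch):

```python
import socket

def call_with_timeout(sock, timeout):
    """Read one reply, raising socket.timeout after `timeout` seconds.

    Passing 0/None here (the analogue of the proxy built without an RPC
    timeout) would make recv() block indefinitely, like the stuck
    nodeHeartbeat call in the thread dump.
    """
    sock.settimeout(timeout)
    return sock.recv(4096)

# Simulate an unresponsive RM: a listening socket that accepts the TCP
# connection (via the backlog) but never sends a reply.
server = socket.socket()
server.bind(("127.0.0.1", 0))
server.listen(1)

client = socket.socket()
client.connect(server.getsockname())

try:
    call_with_timeout(client, 0.2)
    timed_out = False
except socket.timeout:
    timed_out = True  # failure detected quickly; caller can now fail over

print(timed_out)
```

With the timeout set, the hang turns into a prompt `socket.timeout`, which is the behavior the proxy created with the RPC-timeout-taking variant should exhibit.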