[jira] [Commented] (YARN-2578) NM does not failover timely if RM node network connection fails
[ https://issues.apache.org/jira/browse/YARN-2578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14630535#comment-14630535 ]

Masatake Iwasaki commented on YARN-2578:
----------------------------------------

bq. 2. Would you tell me why Client.getRpcTimeout returns 0 if ipc.client.ping is false?

Just to make it clear that the timeout has no effect without setting {{ipc.client.ping}} to true.

> NM does not failover timely if RM node network connection fails
> ---------------------------------------------------------------
>
>                 Key: YARN-2578
>                 URL: https://issues.apache.org/jira/browse/YARN-2578
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>    Affects Versions: 2.5.1
>            Reporter: Wilfred Spiegelenburg
>            Assignee: Wilfred Spiegelenburg
>         Attachments: YARN-2578.002.patch, YARN-2578.patch
>
>
> The NM does not fail over correctly when the network cable of the RM is
> unplugged, or when the failure is simulated by a "service network stop" or a
> firewall that drops all traffic on the node. The RM fails over to the standby
> node when the failure is detected, as expected. The NM should then re-register
> with the new active RM. This re-registration takes a long time (15 minutes or
> more). Until then the cluster has no nodes for processing and applications
> are stuck.
> Reproduction test case which can be used in any environment:
> - create a cluster with 3 nodes
>     node 1: ZK, NN, JN, ZKFC, DN, RM, NM
>     node 2: ZK, NN, JN, ZKFC, DN, RM, NM
>     node 3: ZK, JN, DN, NM
> - start all services and make sure they are in good health
> - kill the network connection of the active RM using one of the network kills from above
> - observe the NN and RM failover
> - the DNs fail over to the new active NN
> - the NM does not recover for a long time
> - the logs show a long delay and the stack traces show no change at all
> The stack traces of the NM all show the same set of threads. The main thread
> which should be used in the re-registration is the "Node Status Updater". This
> thread is stuck in:
> {code}
> "Node Status Updater" prio=10 tid=0x7f5a6cc99800 nid=0x18d0 in Object.wait() [0x7f5a51fc1000]
>    java.lang.Thread.State: WAITING (on object monitor)
>         at java.lang.Object.wait(Native Method)
>         - waiting on <0xed62f488> (a org.apache.hadoop.ipc.Client$Call)
>         at java.lang.Object.wait(Object.java:503)
>         at org.apache.hadoop.ipc.Client.call(Client.java:1395)
>         - locked <0xed62f488> (a org.apache.hadoop.ipc.Client$Call)
>         at org.apache.hadoop.ipc.Client.call(Client.java:1362)
>         at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
>         at com.sun.proxy.$Proxy26.nodeHeartbeat(Unknown Source)
>         at org.apache.hadoop.yarn.server.api.impl.pb.client.ResourceTrackerPBClientImpl.nodeHeartbeat(ResourceTrackerPBClientImpl.java:80)
> {code}
> The client connection which goes through the proxy can be traced back to the
> ResourceTrackerPBClientImpl. The generated proxy does not time out, and we
> should be using a version which takes the RPC timeout (from the
> configuration) as a parameter.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
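The hang in the stack trace above comes down to an unbounded {{Object.wait()}} on the in-flight call. A minimal stand-alone sketch (plain Java, not Hadoop code; the class and method names are illustrative only) of how a bounded wait lets the caller recover where an unbounded one blocks forever:

```java
public class HeartbeatWaitSketch {
    // Wait on the monitor for at most timeoutMs, tolerating spurious
    // wakeups, and return the elapsed time in milliseconds.
    static long boundedWait(Object call, long timeoutMs) throws InterruptedException {
        long start = System.nanoTime();
        long deadline = start + timeoutMs * 1_000_000L;
        synchronized (call) {
            long remaining;
            while ((remaining = deadline - System.nanoTime()) > 0) {
                // +1 so we never call wait(0), which would block forever
                call.wait(remaining / 1_000_000 + 1);
            }
        }
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) throws InterruptedException {
        // With no timeout, wait() on the never-notified call object would
        // block indefinitely -- the "Node Status Updater" hang above.
        // With a bound the wait returns and the caller can retry/failover.
        long elapsed = boundedWait(new Object(), 200); // hypothetical 200 ms rpcTimeout
        System.out.println(elapsed >= 200); // prints true: recovered, not hung
    }
}
```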
[jira] [Commented] (YARN-2578) NM does not failover timely if RM node network connection fails
[ https://issues.apache.org/jira/browse/YARN-2578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14630532#comment-14630532 ]

Masatake Iwasaki commented on YARN-2578:
----------------------------------------

Yes, it is the same fix. I agree it should be fixed in a hadoop-common JIRA. Thanks.
[jira] [Commented] (YARN-2578) NM does not failover timely if RM node network connection fails
[ https://issues.apache.org/jira/browse/YARN-2578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14629940#comment-14629940 ]

Ming Ma commented on YARN-2578:
-------------------------------

Thanks [~iwasakims]. Is it similar to HADOOP-11252? Given your latest patch is in hadoop-common, it might be better to fix it as a HADOOP jira instead.
[jira] [Commented] (YARN-2578) NM does not failover timely if RM node network connection fails
[ https://issues.apache.org/jira/browse/YARN-2578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14629250#comment-14629250 ]

Akira AJISAKA commented on YARN-2578:
-------------------------------------

Thanks [~iwasakims] for creating the patch. One comment and one question from me.

bq. The default value is 0 in order to keep current behaviour.

1. We would like to fix this bug, so a default of 1 min is good for me.
2. Would you tell me why {{Client.getRpcTimeout}} returns 0 if {{ipc.client.ping}} is false?
[jira] [Commented] (YARN-2578) NM does not failover timely if RM node network connection fails
[ https://issues.apache.org/jira/browse/YARN-2578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14578948#comment-14578948 ]

Masatake Iwasaki commented on YARN-2578:
----------------------------------------

Note: the patch works because {{PingInputStream}} throws an exception on socket timeout when {{rpcTimeout}} > 0. The {{rpcTimeout}} given as an argument of {{RPC#getProtocolProxy}} has an effect only if {{ipc.client.ping}} is true.

{code}
    private void handleTimeout(SocketTimeoutException e) throws IOException {
      if (shouldCloseConnection.get() || !running.get() || rpcTimeout > 0) {
        throw e;
      } else {
        sendPing();
      }
    }
{code}

{{Client}} already has a {{getTimeout}} method, and it returns -1 by default because the default value of {{ipc.client.ping}} is true.

{code}
  final public static int getTimeout(Configuration conf) {
    if (!conf.getBoolean(CommonConfigurationKeys.IPC_CLIENT_PING_KEY,
        CommonConfigurationKeys.IPC_CLIENT_PING_DEFAULT)) {
      return getPingInterval(conf);
    }
    return -1;
  }
{code}

I think changing this to always return the ping interval and using it as the default value of {{rpcTimeout}} is an option, but it has a wider effect because {{Client.getTimeout}} is used as {{DfsClientConf.hdfsTimeout}}.
{noformat}
getTimeout 1301 ../../../../../../test/java/org/apache/hadoop/ipc/TestIPC.java
    assertEquals(Client.getTimeout(config), -1);
getTimeout 413 ../../../../../../../../../hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/NameNodeProxies.java
    org.apache.hadoop.ipc.Client.getTimeout(conf), defaultPolicy,
getTimeout 106 ../../../../../../../../../hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/client/impl/DfsClientConf.java
    hdfsTimeout = Client.getTimeout(conf);
{noformat}
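The timeout-selection behaviour discussed in this comment can be mirrored in a tiny stand-alone sketch (not the actual Hadoop {{Client}} class; the method here only imitates the logic of the quoted {{getTimeout}} code): with {{ipc.client.ping}} enabled the helper reports no timeout (-1), so {{rpcTimeout}} stays 0 and the blocked read never fails; with ping disabled it falls back to the ping interval.

```java
public class TimeoutSelectionSketch {
    // Imitation of the quoted Client.getTimeout logic:
    // ping enabled  -> -1 (no socket timeout; the read can block forever)
    // ping disabled -> the ping interval doubles as a hard timeout
    static int getTimeout(boolean pingEnabled, int pingIntervalMs) {
        if (!pingEnabled) {
            return pingIntervalMs;
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(getTimeout(true, 60_000));  // prints -1
        System.out.println(getTimeout(false, 60_000)); // prints 60000
    }
}
```

This is why the proposed change (always return the ping interval) would make the heartbeat fail fast in both configurations, at the cost of changing {{DfsClientConf.hdfsTimeout}} as noted above.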
[jira] [Commented] (YARN-2578) NM does not failover timely if RM node network connection fails
[ https://issues.apache.org/jira/browse/YARN-2578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14572143#comment-14572143 ]

Masatake Iwasaki commented on YARN-2578:
----------------------------------------

Hi [~wilfreds], do you have any update on this? I saw the same issue in our cluster and the attached patch worked. I would like the fix to come in the next release. If you do not have enough time, I would like to take over. Otherwise we can commit the current patch and fix hadoop-common later. It still applies to trunk and branch-2.
[jira] [Commented] (YARN-2578) NM does not failover timely if RM node network connection fails
[ https://issues.apache.org/jira/browse/YARN-2578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14527160#comment-14527160 ]

Wilfred Spiegelenburg commented on YARN-2578:
---------------------------------------------

Yes, I am working on this at the moment. I am also travelling, but should have a new patch ready this weekend.
[jira] [Commented] (YARN-2578) NM does not failover timely if RM node network connection fails
[ https://issues.apache.org/jira/browse/YARN-2578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14516201#comment-14516201 ]

Hadoop QA commented on YARN-2578:
---------------------------------

| (x) *{color:red}-1 overall{color}* |

|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | pre-patch | 14m 40s | Pre-patch trunk compilation is healthy. |
| {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. |
| {color:red}-1{color} | tests included | 0m 0s | The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. |
| {color:green}+1{color} | whitespace | 0m 0s | The patch has no lines that end in whitespace. |
| {color:green}+1{color} | javac | 7m 36s | There were no new javac warning messages. |
| {color:green}+1{color} | javadoc | 9m 37s | There were no new javadoc warning messages. |
| {color:green}+1{color} | release audit | 0m 22s | The applied patch does not increase the total number of release audit warnings. |
| {color:red}-1{color} | checkstyle | 5m 24s | The applied patch generated 1 additional checkstyle issue. |
| {color:green}+1{color} | install | 1m 37s | mvn install still works. |
| {color:green}+1{color} | eclipse:eclipse | 0m 33s | The patch built with eclipse:eclipse. |
| {color:green}+1{color} | findbugs | 0m 52s | The patch does not introduce any new Findbugs (version 2.0.3) warnings. |
| {color:green}+1{color} | yarn tests | 0m 25s | Tests passed in hadoop-yarn-server-common. |
| | | 41m 9s | |

|| Subsystem || Report/Notes ||
| Patch URL | http://issues.apache.org/jira/secure/attachment/12670359/YARN-2578.patch |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | trunk / feb68cb |
| checkstyle | https://builds.apache.org/job/PreCommit-YARN-Build/7518/artifact/patchprocess/checkstyle-result-diff.txt |
| hadoop-yarn-server-common test log | https://builds.apache.org/job/PreCommit-YARN-Build/7518/artifact/patchprocess/testrun_hadoop-yarn-server-common.txt |
| Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/7518/testReport/ |
| Java | 1.7.0_55 |
| uname | Linux asf905.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | https://builds.apache.org/job/PreCommit-YARN-Build/7518/console |

This message was automatically generated.
[jira] [Commented] (YARN-2578) NM does not failover timely if RM node network connection fails
[ https://issues.apache.org/jira/browse/YARN-2578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14221917#comment-14221917 ]

Rohith commented on YARN-2578:
------------------------------

bq. We can consider scenario for simulation that mentioned in YARN-1081

The correct issue id is YARN-1061.
[jira] [Commented] (YARN-2578) NM does not failover timely if RM node network connection fails
[ https://issues.apache.org/jira/browse/YARN-2578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14221915#comment-14221915 ] Rohith commented on YARN-2578: -- Apologies that I linked to wrong issue previously. I deleted wrong link and linking to correct issue. > NM does not failover timely if RM node network connection fails > --- > > Key: YARN-2578 > URL: https://issues.apache.org/jira/browse/YARN-2578 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager >Affects Versions: 2.5.1 >Reporter: Wilfred Spiegelenburg > Attachments: YARN-2578.patch > > > The NM does not fail over correctly when the network cable of the RM is > unplugged or the failure is simulated by a "service network stop" or a > firewall that drops all traffic on the node. The RM fails over to the standby > node when the failure is detected as expected. The NM should than re-register > with the new active RM. This re-register takes a long time (15 minutes or > more). Until then the cluster has no nodes for processing and applications > are stuck. > Reproduction test case which can be used in any environment: > - create a cluster with 3 nodes > node 1: ZK, NN, JN, ZKFC, DN, RM, NM > node 2: ZK, NN, JN, ZKFC, DN, RM, NM > node 3: ZK, JN, DN, NM > - start all services make sure they are in good health > - kill the network connection of the RM that is active using one of the > network kills from above > - observe the NN and RM failover > - the DN's fail over to the new active NN > - the NM does not recover for a long time > - the logs show a long delay and traces show no change at all > The stack traces of the NM all show the same set of threads. 
[ https://issues.apache.org/jira/browse/YARN-2578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14221913#comment-14221913 ] Rohith commented on YARN-2578: -- We can consider the simulation scenario mentioned in YARN-1081: the active RM can be made to hang using the Linux command {{kill -STOP}} on the RM process.
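The {{kill -STOP}} simulation can be sketched as below; this is a hypothetical illustration (POSIX only, and {{FreezeProcessDemo}}, the use of {{/bin/sleep}} as a stand-in victim, and the {{ps}}-based state check are all mine, not part of the patch or of Hadoop):

```java
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: freeze a stand-in process (here "sleep") with SIGSTOP
// the way a hung RM would freeze. The stopped process keeps its sockets open
// but answers nothing, which is exactly the condition under which an RPC
// client with no timeout blocks forever.
public class FreezeProcessDemo {

    // 'ps -o stat=' prints a state starting with 'T' for a SIGSTOPped process.
    public static boolean isStopped(long pid) throws Exception {
        Process ps = new ProcessBuilder("ps", "-o", "stat=", "-p",
                Long.toString(pid)).start();
        ps.waitFor(5, TimeUnit.SECONDS);
        String out = new String(ps.getInputStream().readAllBytes()).trim();
        return out.startsWith("T");
    }

    public static void main(String[] args) throws Exception {
        Process victim = new ProcessBuilder("sleep", "300").start();
        long pid = victim.pid();
        // Freeze it, as "kill -STOP <rm-pid>" would freeze the RM.
        new ProcessBuilder("kill", "-STOP", Long.toString(pid)).start().waitFor();
        Thread.sleep(200);
        System.out.println(isStopped(pid));
        // Thaw and clean up.
        new ProcessBuilder("kill", "-CONT", Long.toString(pid)).start().waitFor();
        victim.destroy();
    }
}
```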
[ https://issues.apache.org/jira/browse/YARN-2578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14215235#comment-14215235 ] Karthik Kambatla commented on YARN-2578: Since the RM doesn't (yet) suffer from the GC issues that the NN does, we thought the embedded elector was good enough. So far it appears to be doing okay, and we didn't see a particular need for ZKFC. There is an open JIRA for ZKFC (and I have an implementation that needs cleaning up); we can get that done if the need arises.
[ https://issues.apache.org/jira/browse/YARN-2578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14214806#comment-14214806 ] Harsh J commented on YARN-2578: ---
bq. We never implemented health monitoring like in ZKFC with HDFS
Was this not desired for some reason, or just punted in the early implementation? It seems worthwhile to always have such a thing.
[ https://issues.apache.org/jira/browse/YARN-2578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14180144#comment-14180144 ] Ming Ma commented on YARN-2578: --- Yeah, it is more than just * -> RM; it could also be * -> NM and * -> AM. Agreed it is better to fix it at the Hadoop common layer. From HDFS-4858, it looks like the concern with fixing it in Hadoop common is test coverage. Is there any follow-up on Hadoop common? Perhaps we can fix the Hadoop common layer so that the RPC timeout is still off by default, but if ping is set to false, the RPC timeout is set to the ping value in the code Karthik refers to. That way YARN and MR don't need to change, and people can experiment with the RPC timeout. After enough test coverage we can then set the ping default value to false.
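Ming Ma's proposed fallback could be sketched roughly as follows; {{RpcTimeoutSketch}} and its method are illustrative names, not the actual {{org.apache.hadoop.ipc.Client}} code:

```java
// Minimal sketch of the proposed behaviour: an explicit rpcTimeout wins;
// otherwise, with ipc.client.ping disabled, fall back to the ping interval
// so a call can never block forever; with ping enabled, keep the current
// behaviour (0 = no timeout, rely on periodic pings to detect dead peers).
public class RpcTimeoutSketch {

    public static int effectiveRpcTimeout(int configuredTimeoutMs,
                                          boolean pingEnabled,
                                          int pingIntervalMs) {
        if (configuredTimeoutMs > 0) {
            return configuredTimeoutMs; // explicit timeout always wins
        }
        // proposed change: ping off means the ping interval becomes the timeout
        return pingEnabled ? 0 : pingIntervalMs;
    }

    public static void main(String[] args) {
        System.out.println(effectiveRpcTimeout(30000, true, 60000));  // 30000
        System.out.println(effectiveRpcTimeout(0, false, 60000));     // 60000
        System.out.println(effectiveRpcTimeout(0, true, 60000));      // 0
    }
}
```

With this shape, YARN and MR clients keep today's defaults untouched, while clusters that set {{ipc.client.ping}} to false get a bounded call instead of an indefinite {{Object.wait()}}.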
[ https://issues.apache.org/jira/browse/YARN-2578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14154084#comment-14154084 ] Vinod Kumar Vavilapalli commented on YARN-2578: ---
bq. Instead of fixing it everywhere, how about we fix this in RPC itself? In https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ipc/RPC.java#L488, instead of using 0 as the default value, the default could be looked up in the Configuration. No?
+1. The default from conf is 1 min. Assuming it all boils down to the ping interval, we should fix it in common.
[ https://issues.apache.org/jira/browse/YARN-2578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14148452#comment-14148452 ] Wilfred Spiegelenburg commented on YARN-2578: - I proposed fixing the RPC code, setting the timeout by default, in HDFS-4858, but there was no interest in fixing the client at that point in time. So we now have to fix it everywhere, unless we can get everyone on board and change the behaviour in the RPC code. The comments are still in that JIRA, and it would be a straightforward fix in the RPC code.
[ https://issues.apache.org/jira/browse/YARN-2578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14148381#comment-14148381 ] Karthik Kambatla commented on YARN-2578: Thanks for the clarification, Wilfred. Don't we need to do the same for AM->RM and Client->RM as well? Instead of fixing it everywhere, how about we fix this in RPC itself? In https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ipc/RPC.java#L488, instead of using 0 as the default value, the default could be looked up in the Configuration. No?
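The "look the default up in the Configuration" idea could look roughly like this; a minimal sketch in which {{java.util.Properties}} stands in for Hadoop's {{Configuration}}, and the class and key names are illustrative rather than the actual Hadoop API:

```java
import java.util.Properties;

// Sketch of reading the default rpcTimeout from configuration instead of
// hard-coding 0 at the RPC.getProxy call site. Properties stands in for
// org.apache.hadoop.conf.Configuration; the key name is illustrative.
public class RpcTimeoutFromConf {
    static final String RPC_TIMEOUT_KEY = "ipc.client.rpc-timeout.ms";
    static final int RPC_TIMEOUT_DEFAULT = 0; // 0 = wait forever (current behaviour)

    public static int getRpcTimeout(Properties conf) {
        String v = conf.getProperty(RPC_TIMEOUT_KEY);
        return v == null ? RPC_TIMEOUT_DEFAULT : Integer.parseInt(v);
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        // Key unset: the old behaviour, calls can block indefinitely.
        System.out.println(getRpcTimeout(conf)); // 0
        // Key set: heartbeats to a dead RM fail after the timeout instead.
        conf.setProperty(RPC_TIMEOUT_KEY, "60000");
        System.out.println(getRpcTimeout(conf)); // 60000
    }
}
```

The appeal of this shape is that every proxy created through the common RPC layer (NM->RM, AM->RM, Client->RM) picks up the timeout from one place, without each YARN client needing its own fix.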
[ https://issues.apache.org/jira/browse/YARN-2578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14144022#comment-14144022 ] Wilfred Spiegelenburg commented on YARN-2578: - I looked into automated testing, but as in HDFS-4858 I have not been able to find a way to test this using JUnit tests. Manual testing is straightforward using the reproduction scenario above.
[ https://issues.apache.org/jira/browse/YARN-2578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14144016#comment-14144016 ] Wilfred Spiegelenburg commented on YARN-2578: - To address [~vinodkv]'s comments: the active RM is completely cut off from the network, as are all the other services on the node, including ZooKeeper. The RM can update its local ZooKeeper instance, but that update is never propagated outside the node to the other ZooKeeper nodes, so it cannot be seen by the standby RM. The standby RM detects no updates in ZooKeeper for the timeout period and becomes the active node. That is the normal HA behaviour from the standby node, as if the RM had crashed.
[ https://issues.apache.org/jira/browse/YARN-2578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14144004#comment-14144004 ] Hadoop QA commented on YARN-2578: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12670359/YARN-2578.patch against trunk revision 23e17ce.
{color:green}+1 @author{color}. The patch does not contain any @author tags.
{color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch.
{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.
{color:green}+1 javadoc{color}. There were no new javadoc warning messages.
{color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse.
{color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings.
{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.
{color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common.
{color:green}+1 contrib tests{color}. The patch passed contrib unit tests.
Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5071//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5071//console
This message is automatically generated.
[jira] [Commented] (YARN-2578) NM does not failover timely if RM node network connection fails
[ https://issues.apache.org/jira/browse/YARN-2578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14143951#comment-14143951 ] Vinod Kumar Vavilapalli commented on YARN-2578:

bq. The NM does not fail over correctly when the network cable of the RM is unplugged or the failure is simulated by a "service network stop" or a firewall that drops all traffic on the node. The RM fails over to the standby node when the failure is detected as expected.

I am surprised that the RM itself fails over (in the context of a firewall rule that drops traffic) - we never implemented health monitoring like ZKFC does for HDFS. It seems that if the RPC port gets blocked, the RM will not fail over, as the embedded ZK client continues to use the local loopback and so does not detect the network failure.
[jira] [Commented] (YARN-2578) NM does not failover timely if RM node network connection fails
[ https://issues.apache.org/jira/browse/YARN-2578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14143890#comment-14143890 ] Karthik Kambatla commented on YARN-2578:

Would it be possible to add a test case for this?
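On the test-case question: the core behavior to exercise is the one shown in the "Node Status Updater" stack trace above - a blocking call with no timeout waits forever when the peer silently stops responding, whereas an explicit timeout surfaces the failure promptly so the caller can retry or fail over. A minimal Python sketch of that failure mode (plain sockets, not Hadoop IPC; the helper name and the 0.2s timeout are illustrative, not from the patch):

```python
import socket

def call_with_timeout(sock, timeout):
    """Read one reply, raising socket.timeout after `timeout` seconds.

    Passing 0/None here (the analogue of the proxy built without an RPC
    timeout) would make recv() block indefinitely, like the stuck
    nodeHeartbeat call in the thread dump.
    """
    sock.settimeout(timeout)
    return sock.recv(4096)

# Simulate an unresponsive RM: a listening socket that accepts the TCP
# connection (via the backlog) but never sends a reply.
server = socket.socket()
server.bind(("127.0.0.1", 0))
server.listen(1)

client = socket.socket()
client.connect(server.getsockname())

try:
    call_with_timeout(client, 0.2)
    timed_out = False
except socket.timeout:
    timed_out = True  # failure detected quickly; caller can now fail over

print(timed_out)
```

With the timeout set, the hang turns into a prompt `socket.timeout`, which is the behavior the proxy created with the RPC-timeout-taking variant should exhibit.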