[ https://issues.apache.org/jira/browse/YARN-2578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14578948#comment-14578948 ]
Masatake Iwasaki commented on YARN-2578:
----------------------------------------

Note: The patch works because PingInputStream throws an exception on socket timeout when rpcTimeout > 0 is set. The {{rpcTimeout}} given as an argument of {{RPC#getProtocolProxy}} has effect only if {{ipc.client.ping}} is true.

{code}
private void handleTimeout(SocketTimeoutException e) throws IOException {
  if (shouldCloseConnection.get() || !running.get() || rpcTimeout > 0) {
    throw e;
  } else {
    sendPing();
  }
}
{code}

{{Client}} already has a {{getTimeout}} method, and it returns -1 by default because the default value of {{ipc.client.ping}} is true.

{code}
final public static int getTimeout(Configuration conf) {
  if (!conf.getBoolean(CommonConfigurationKeys.IPC_CLIENT_PING_KEY,
      CommonConfigurationKeys.IPC_CLIENT_PING_DEFAULT)) {
    return getPingInterval(conf);
  }
  return -1;
}
{code}

I think changing this to always return the ping interval and using it as the default value of {{rpcTimeout}} is an option, but it has a wider effect because {{Client.getTimeout}} is used as {{DfsClientConf.hdfsTimeout}}.
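The difference between the current behaviour and the proposed change can be sketched in isolation. This is a minimal, self-contained illustration with hypothetical method names; it deliberately avoids the real Hadoop {{Configuration}} API and just models the decision that {{Client.getTimeout}} makes:

{code}
// Sketch only: models Client.getTimeout's decision, not the real Hadoop API.
public class TimeoutSketch {
  // Hypothetical stand-in for the ipc.ping.interval default (one minute).
  static final int PING_INTERVAL_DEFAULT = 60_000;

  // Current behaviour: no socket timeout (-1) whenever ping is enabled,
  // which is the default, so callers get an unbounded wait.
  static int getTimeoutCurrent(boolean pingEnabled, int pingInterval) {
    return pingEnabled ? -1 : pingInterval;
  }

  // Proposed behaviour: always fall back to the ping interval, so the
  // connection fails within bounded time even with ping enabled.
  static int getTimeoutProposed(boolean pingEnabled, int pingInterval) {
    return pingInterval;
  }

  public static void main(String[] args) {
    System.out.println(getTimeoutCurrent(true, PING_INTERVAL_DEFAULT));  // -1
    System.out.println(getTimeoutProposed(true, PING_INTERVAL_DEFAULT)); // 60000
  }
}
{code}

The sketch makes the wider-effect concern concrete: any caller that relies on the -1 sentinel (such as the {{DfsClientConf.hdfsTimeout}} assignment noted below) would see a changed value.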
{noformat}
getTimeout 1301 ../../../../../../test/java/org/apache/hadoop/ipc/TestIPC.java
    assertEquals(Client.getTimeout(config), -1);
getTimeout 413 ../../../../../../../../../hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/NameNodeProxies.java
    org.apache.hadoop.ipc.Client.getTimeout(conf), defaultPolicy,
getTimeout 106 ../../../../../../../../../hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/client/impl/DfsClientConf.java
    hdfsTimeout = Client.getTimeout(conf);
{noformat}

> NM does not failover timely if RM node network connection fails
> ---------------------------------------------------------------
>
>                 Key: YARN-2578
>                 URL: https://issues.apache.org/jira/browse/YARN-2578
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>    Affects Versions: 2.5.1
>            Reporter: Wilfred Spiegelenburg
>            Assignee: Wilfred Spiegelenburg
>        Attachments: YARN-2578.patch
>
>
> The NM does not fail over correctly when the network cable of the RM is
> unplugged or the failure is simulated by a "service network stop" or a
> firewall that drops all traffic on the node. The RM fails over to the standby
> node when the failure is detected, as expected. The NM should then re-register
> with the new active RM. This re-registration takes a long time (15 minutes or
> more). Until then the cluster has no nodes for processing and applications
> are stuck.
> Reproduction test case which can be used in any environment:
> - create a cluster with 3 nodes
>   node 1: ZK, NN, JN, ZKFC, DN, RM, NM
>   node 2: ZK, NN, JN, ZKFC, DN, RM, NM
>   node 3: ZK, JN, DN, NM
> - start all services and make sure they are in good health
> - kill the network connection of the active RM using one of the network
>   kills from above
> - observe the NN and RM failover
> - the DNs fail over to the new active NN
> - the NM does not recover for a long time
> - the logs show a long delay and stack traces show no change at all
>
> The stack traces of the NM all show the same set of threads. The main thread
> which should be used in the re-registration is the "Node Status Updater". This
> thread is stuck in:
> {code}
> "Node Status Updater" prio=10 tid=0x00007f5a6cc99800 nid=0x18d0 in Object.wait() [0x00007f5a51fc1000]
>    java.lang.Thread.State: WAITING (on object monitor)
>         at java.lang.Object.wait(Native Method)
>         - waiting on <0x00000000ed62f488> (a org.apache.hadoop.ipc.Client$Call)
>         at java.lang.Object.wait(Object.java:503)
>         at org.apache.hadoop.ipc.Client.call(Client.java:1395)
>         - locked <0x00000000ed62f488> (a org.apache.hadoop.ipc.Client$Call)
>         at org.apache.hadoop.ipc.Client.call(Client.java:1362)
>         at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
>         at com.sun.proxy.$Proxy26.nodeHeartbeat(Unknown Source)
>         at org.apache.hadoop.yarn.server.api.impl.pb.client.ResourceTrackerPBClientImpl.nodeHeartbeat(ResourceTrackerPBClientImpl.java:80)
> {code}
> The client connection which goes through the proxy can be traced back to the
> ResourceTrackerPBClientImpl. The generated proxy does not time out, and we
> should be using a version which takes the RPC timeout (from the
> configuration) as a parameter.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)