Hey everyone, I've run into a troubling issue caused by a node in my cluster dying. A node died due to a hardware failure while several MR jobs were running on the cluster (which runs YARN), and those jobs took over four hours longer than expected to finish. After investigating, I found that the ApplicationMasters for these jobs spent those four hours retrying the connection to the dead node. The following is repeated in the AM logs for that entire period:
2015-03-14 07:07:28,435 INFO [ContainerLauncher #293] org.apache.hadoop.ipc.Client: Retrying connect to server: dead.node.host/dead.node.ip:47936. Already tried 16 time(s); maxRetries=20
2015-03-14 07:07:28,545 INFO [ContainerLauncher #235] org.apache.hadoop.ipc.Client: Retrying connect to server: dead.node.host/dead.node.ip:47936. Already tried 19 time(s); maxRetries=20
2015-03-14 07:07:29,202 INFO [ContainerLauncher #261] org.apache.hadoop.ipc.Client: Retrying connect to server: dead.node.host/dead.node.ip:47936. Already tried 0 time(s); maxRetries=20
2015-03-14 07:07:31,074 INFO [ContainerLauncher #283] org.apache.hadoop.ipc.Client: Retrying connect to server: dead.node.host/dead.node.ip:47936. Already tried 1 time(s); maxRetries=20
2015-03-14 07:07:31,110 INFO [ContainerLauncher #278] org.apache.hadoop.ipc.Client: Retrying connect to server: dead.node.host/dead.node.ip:47936. Already tried 0 time(s); maxRetries=20
2015-03-14 07:07:46,093 INFO [ContainerLauncher #167] org.apache.hadoop.ipc.Client: Retrying connect to server: dead.node.host/dead.node.ip:47936. Already tried 17 time(s); maxRetries=20
2015-03-14 07:07:48,455 INFO [ContainerLauncher #293] org.apache.hadoop.ipc.Client: Retrying connect to server: dead.node.host/dead.node.ip:47936. Already tried 17 time(s); maxRetries=20
2015-03-14 07:07:49,223 INFO [ContainerLauncher #261] org.apache.hadoop.ipc.Client: Retrying connect to server: dead.node.host/dead.node.ip:47936. Already tried 1 time(s); maxRetries=20
2015-03-14 07:07:51,095 INFO [ContainerLauncher #283] org.apache.hadoop.ipc.Client: Retrying connect to server: dead.node.host/dead.node.ip:47936. Already tried 2 time(s); maxRetries=20
2015-03-14 07:07:51,116 INFO [ContainerLauncher #278] org.apache.hadoop.ipc.Client: Retrying connect to server: dead.node.host/dead.node.ip:47936. Already tried 1 time(s); maxRetries=20
2015-03-14 07:08:06,097 INFO [ContainerLauncher #167] org.apache.hadoop.ipc.Client: Retrying connect to server: dead.node.host/dead.node.ip:47936. Already tried 18 time(s); maxRetries=20
2015-03-14 07:08:08,476 INFO [ContainerLauncher #293] org.apache.hadoop.ipc.Client: Retrying connect to server: dead.node.host/dead.node.ip:47936. Already tried 18 time(s); maxRetries=20
2015-03-14 07:08:09,243 INFO [ContainerLauncher #261] org.apache.hadoop.ipc.Client: Retrying connect to server: dead.node.host/dead.node.ip:47936. Already tried 2 time(s); maxRetries=20
2015-03-14 07:08:11,115 INFO [ContainerLauncher #283] org.apache.hadoop.ipc.Client: Retrying connect to server: dead.node.host/dead.node.ip:47936. Already tried 3 time(s); maxRetries=20
2015-03-14 07:08:11,120 INFO [ContainerLauncher #278] org.apache.hadoop.ipc.Client: Retrying connect to server: dead.node.host/dead.node.ip:47936. Already tried 2 time(s); maxRetries=20
2015-03-14 07:08:18,569 INFO [ContainerLauncher #235] org.apache.hadoop.ipc.Client: Retrying connect to server: dead.node.host/dead.node.ip:47936. Already tried 0 time(s); maxRetries=20
2015-03-14 07:08:26,118 INFO [ContainerLauncher #167] org.apache.hadoop.ipc.Client: Retrying connect to server: dead.node.host/dead.node.ip:47936. Already tried 19 time(s); maxRetries=20
2015-03-14 07:08:28,495 INFO [ContainerLauncher #293] org.apache.hadoop.ipc.Client: Retrying connect to server: dead.node.host/dead.node.ip:47936. Already tried 19 time(s); maxRetries=20
2015-03-14 07:08:29,264 INFO [ContainerLauncher #261] org.apache.hadoop.ipc.Client: Retrying connect to server: dead.node.host/dead.node.ip:47936. Already tried 3 time(s); maxRetries=20

Eventually the following exception appeared in the AM logs and the job completed successfully:

2015-03-14 07:23:09,239 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Diagnostics report from attempt_1423675803126_127109_m_001910_0: cleanup failed for container container_1423675803126_127109_01_004637 : org.apache.hadoop.net.ConnectTimeoutException: Call From am.node.host/am.node.ip to dead.node.host:47936 failed on socket timeout exception: org.apache.hadoop.net.ConnectTimeoutException: 20000 millis timeout while waiting for channel to be ready for connect. ch : java.nio.channels.SocketChannel[connection-pending remote=dead.node.host/dead.node.ip:47936]; For more details see: http://wiki.apache.org/hadoop/SocketTimeout
	at sun.reflect.GeneratedConstructorAccessor60.newInstance(Unknown Source)
	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
	at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
	at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:783)
	at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:749)
	at org.apache.hadoop.ipc.Client.call(Client.java:1415)
	at org.apache.hadoop.ipc.Client.call(Client.java:1364)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
	at com.sun.proxy.$Proxy39.stopContainers(Unknown Source)
	at org.apache.hadoop.yarn.api.impl.pb.client.ContainerManagementProtocolPBClientImpl.stopContainers(ContainerManagementProtocolPBClientImpl.java:110)
	at sun.reflect.GeneratedMethodAccessor14.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:606)
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
	at com.sun.proxy.$Proxy40.stopContainers(Unknown Source)
	at org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl$Container.kill(ContainerLauncherImpl.java:206)
	at org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl$EventProcessor.run(ContainerLauncherImpl.java:373)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.hadoop.net.ConnectTimeoutException: 20000 millis timeout while waiting for channel to be ready for connect. ch : java.nio.channels.SocketChannel[connection-pending remote=dead.node.host/dead.node.ip:47936]
	at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:532)
	at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:493)
	at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:606)
	at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:700)
	at org.apache.hadoop.ipc.Client$Connection.access$2800(Client.java:367)
	at org.apache.hadoop.ipc.Client.getConnection(Client.java:1463)
	at org.apache.hadoop.ipc.Client.call(Client.java:1382)
	... 15 more

It looks to me like the tasks that had been running on the dead node were restarted elsewhere, and the AM was then attempting to clean up the original attempts' containers. Since the node was dead, however, it could never connect. I have yarn.client.nodemanager-connect.max-wait-ms set to 300000 (5 minutes) and ipc.client.connect.max.retries.on.timeouts set to 20. I do see it retry the connection 20 times in the logs, but then the retry count starts over from 0. I would also expect the AM to give up the connection attempt much sooner: the ResourceManager marked the node as dead after 10 minutes, as expected, and I'd like to see the AM do the same. Has anyone encountered this behavior before? Thanks!

-- Andrew Johnson
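For what it's worth, the four-hour stall looks consistent with that counter reset: each connect cycle alone takes several minutes, so repeated restarts of the cycle add up quickly. A rough back-of-the-envelope sketch (the restart count here is purely illustrative, not something I measured):

```python
# Back-of-the-envelope timing for the retries shown in the logs above.
# Assumes each connect attempt blocks for the full 20s IPC connect
# timeout reported in the ConnectTimeoutException.
connect_timeout_s = 20           # "20000 millis timeout" from the exception
max_retries_on_timeouts = 20     # ipc.client.connect.max.retries.on.timeouts

# One full connect cycle before the IPC client gives up:
seconds_per_cycle = connect_timeout_s * max_retries_on_timeouts
print(seconds_per_cycle)         # 400 seconds, i.e. ~6.7 minutes per cycle

# If an outer retry layer restarts the cycle (counter back to 0, as the
# logs show), the total stall grows linearly with the number of restarts.
# For example, ~36 restarts would account for roughly four hours:
print(seconds_per_cycle * 36 / 3600)  # 4.0 (hours)
```

If that reading is right, the 5-minute yarn.client.nodemanager-connect.max-wait-ms is never being honored across cycles, only within one.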