Hey everyone, I've run into a troubling issue caused by a node in my cluster dying. A node died due to a hardware failure while several MR jobs were running on the cluster (which runs YARN), and those jobs took over four hours longer than expected to finish. After investigating, I found that the ApplicationMasters for these jobs spent those four hours retrying the connection to the dead node. The following is repeated in the AM logs for that entire period:
2015-03-14 07:07:28,435 INFO [ContainerLauncher #293] org.apache.hadoop.ipc.Client: Retrying connect to server: dead.node.host/dead.node.ip:47936. Already tried 16 time(s); maxRetries=20
2015-03-14 07:07:28,545 INFO [ContainerLauncher #235] org.apache.hadoop.ipc.Client: Retrying connect to server: dead.node.host/dead.node.ip:47936. Already tried 19 time(s); maxRetries=20
2015-03-14 07:07:29,202 INFO [ContainerLauncher #261] org.apache.hadoop.ipc.Client: Retrying connect to server: dead.node.host/dead.node.ip:47936. Already tried 0 time(s); maxRetries=20
2015-03-14 07:07:31,074 INFO [ContainerLauncher #283] org.apache.hadoop.ipc.Client: Retrying connect to server: dead.node.host/dead.node.ip:47936. Already tried 1 time(s); maxRetries=20
2015-03-14 07:07:31,110 INFO [ContainerLauncher #278] org.apache.hadoop.ipc.Client: Retrying connect to server: dead.node.host/dead.node.ip:47936. Already tried 0 time(s); maxRetries=20
2015-03-14 07:07:46,093 INFO [ContainerLauncher #167] org.apache.hadoop.ipc.Client: Retrying connect to server: dead.node.host/dead.node.ip:47936. Already tried 17 time(s); maxRetries=20
2015-03-14 07:07:48,455 INFO [ContainerLauncher #293] org.apache.hadoop.ipc.Client: Retrying connect to server: dead.node.host/dead.node.ip:47936. Already tried 17 time(s); maxRetries=20
2015-03-14 07:07:49,223 INFO [ContainerLauncher #261] org.apache.hadoop.ipc.Client: Retrying connect to server: dead.node.host/dead.node.ip:47936. Already tried 1 time(s); maxRetries=20
2015-03-14 07:07:51,095 INFO [ContainerLauncher #283] org.apache.hadoop.ipc.Client: Retrying connect to server: dead.node.host/dead.node.ip:47936. Already tried 2 time(s); maxRetries=20
2015-03-14 07:07:51,116 INFO [ContainerLauncher #278] org.apache.hadoop.ipc.Client: Retrying connect to server: dead.node.host/dead.node.ip:47936. Already tried 1 time(s); maxRetries=20
2015-03-14 07:08:06,097 INFO [ContainerLauncher #167] org.apache.hadoop.ipc.Client: Retrying connect to server: dead.node.host/dead.node.ip:47936. Already tried 18 time(s); maxRetries=20
2015-03-14 07:08:08,476 INFO [ContainerLauncher #293] org.apache.hadoop.ipc.Client: Retrying connect to server: dead.node.host/dead.node.ip:47936. Already tried 18 time(s); maxRetries=20
2015-03-14 07:08:09,243 INFO [ContainerLauncher #261] org.apache.hadoop.ipc.Client: Retrying connect to server: dead.node.host/dead.node.ip:47936. Already tried 2 time(s); maxRetries=20
2015-03-14 07:08:11,115 INFO [ContainerLauncher #283] org.apache.hadoop.ipc.Client: Retrying connect to server: dead.node.host/dead.node.ip:47936. Already tried 3 time(s); maxRetries=20
2015-03-14 07:08:11,120 INFO [ContainerLauncher #278] org.apache.hadoop.ipc.Client: Retrying connect to server: dead.node.host/dead.node.ip:47936. Already tried 2 time(s); maxRetries=20
2015-03-14 07:08:18,569 INFO [ContainerLauncher #235] org.apache.hadoop.ipc.Client: Retrying connect to server: dead.node.host/dead.node.ip:47936. Already tried 0 time(s); maxRetries=20
2015-03-14 07:08:26,118 INFO [ContainerLauncher #167] org.apache.hadoop.ipc.Client: Retrying connect to server: dead.node.host/dead.node.ip:47936. Already tried 19 time(s); maxRetries=20
2015-03-14 07:08:28,495 INFO [ContainerLauncher #293] org.apache.hadoop.ipc.Client: Retrying connect to server: dead.node.host/dead.node.ip:47936. Already tried 19 time(s); maxRetries=20
2015-03-14 07:08:29,264 INFO [ContainerLauncher #261] org.apache.hadoop.ipc.Client: Retrying connect to server: dead.node.host/dead.node.ip:47936. Already tried 3 time(s); maxRetries=20

Eventually the following exception appeared in the AM logs and the job completed successfully:

2015-03-14 07:23:09,239 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Diagnostics report from attempt_1423675803126_127109_m_001910_0: cleanup failed for container container_1423675803126_127109_01_004637 : org.apache.hadoop.net.ConnectTimeoutException: Call From am.node.host/am.node.ip to dead.node.host:47936 failed on socket timeout exception: org.apache.hadoop.net.ConnectTimeoutException: 20000 millis timeout while waiting for channel to be ready for connect. ch : java.nio.channels.SocketChannel[connection-pending remote=dead.node.host/dead.node.ip:47936]; For more details see: http://wiki.apache.org/hadoop/SocketTimeout
	at sun.reflect.GeneratedConstructorAccessor60.newInstance(Unknown Source)
	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
	at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
	at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:783)
	at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:749)
	at org.apache.hadoop.ipc.Client.call(Client.java:1415)
	at org.apache.hadoop.ipc.Client.call(Client.java:1364)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
	at com.sun.proxy.$Proxy39.stopContainers(Unknown Source)
	at org.apache.hadoop.yarn.api.impl.pb.client.ContainerManagementProtocolPBClientImpl.stopContainers(ContainerManagementProtocolPBClientImpl.java:110)
	at sun.reflect.GeneratedMethodAccessor14.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:606)
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
	at com.sun.proxy.$Proxy40.stopContainers(Unknown Source)
	at org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl$Container.kill(ContainerLauncherImpl.java:206)
	at org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl$EventProcessor.run(ContainerLauncherImpl.java:373)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.hadoop.net.ConnectTimeoutException: 20000 millis timeout while waiting for channel to be ready for connect. ch : java.nio.channels.SocketChannel[connection-pending remote=dead.node.host/dead.node.ip:47936]
	at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:532)
	at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:493)
	at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:606)
	at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:700)
	at org.apache.hadoop.ipc.Client$Connection.access$2800(Client.java:367)
	at org.apache.hadoop.ipc.Client.getConnection(Client.java:1463)
	at org.apache.hadoop.ipc.Client.call(Client.java:1382)
	... 15 more

It looks to me like the tasks that had been running on the dead node were restarted elsewhere, and the AM was then attempting to clean up the original attempts' containers. Since the node was dead, however, it could never connect. I have yarn.client.nodemanager-connect.max-wait-ms set to 300000 (5 minutes) and ipc.client.connect.max.retries.on.timeouts set to 20. I do see it retry the connection 20 times in the logs, but then the retry count starts over from 0. I would also expect the AM to give up the connection attempt much sooner: the ResourceManager marked the node as dead after 10 minutes, as expected, and I'd like to see the AM do the same. Has anyone encountered this behavior before? Thanks!

-- Andrew Johnson
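For what it's worth, the four-hour stall looks consistent with that counter reset: each connect cycle alone takes several minutes, so repeated restarts of the cycle add up quickly. A rough back-of-the-envelope sketch (the restart count here is purely illustrative, not something I measured):

```python
# Back-of-the-envelope timing for the retries shown in the logs above.
# Assumes each connect attempt blocks for the full 20s IPC connect
# timeout reported in the ConnectTimeoutException.
connect_timeout_s = 20           # "20000 millis timeout" from the exception
max_retries_on_timeouts = 20     # ipc.client.connect.max.retries.on.timeouts

# One full connect cycle before the IPC client gives up:
seconds_per_cycle = connect_timeout_s * max_retries_on_timeouts
print(seconds_per_cycle)         # 400 seconds, i.e. ~6.7 minutes per cycle

# If an outer retry layer restarts the cycle (counter back to 0, as the
# logs show), the total stall grows linearly with the number of restarts.
# For example, ~36 restarts would account for roughly four hours:
print(seconds_per_cycle * 36 / 3600)  # 4.0 (hours)
```

If that reading is right, the 5-minute yarn.client.nodemanager-connect.max-wait-ms is never being honored across cycles, only within one.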