[
https://issues.apache.org/jira/browse/YARN-8416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16509015#comment-16509015
]
Takanobu Asanuma commented on YARN-8416:
----------------------------------------
Moved this JIRA to the YARN project.
> YARN in HA not failing over to a new resource manager.
> ------------------------------------------------------
>
> Key: YARN-8416
> URL: https://issues.apache.org/jira/browse/YARN-8416
> Project: Hadoop YARN
> Issue Type: Bug
> Affects Versions: 2.7.1
> Reporter: Danil Serdyuchenko
> Priority: Major
>
> We are running YARN in HA mode (rm1 and rm2). We hit an issue when recreating
> one of the RMs:
> # Recreated a standby RM (rm2), which gave it a new IP.
> # Stopped the active RM (rm1).
> # NMs tried to fail over to rm2, but timed out connecting to its old IP.
> # NMs reached the configured 30 failover retries and shut down.
> We get the following logs:
> {noformat}
> 18/06/06 15:36:32 WARN ipc.Client: Address change detected. Old: yarnrm2/x.x.x.x:8031 New: yarnrm2/y.y.y.y:8031
> 18/06/06 15:36:32 INFO retry.RetryInvocationHandler: Exception while invoking nodeHeartbeat of class ResourceTrackerPBClientImpl over rm2 after 25 fail over attempts. Trying to fail over after sleeping for 37191ms.
> org.apache.hadoop.net.ConnectTimeoutException: Call From ip-a-a-a-a/a.a.a.a to yarnrm2:8031 failed on socket timeout exception: org.apache.hadoop.net.ConnectTimeoutException: 20000 millis timeout while waiting for channel to be ready for connect. ch : java.nio.channels.SocketChannel[connection-pending remote=yarnrm2/x.x.x.x:8031]; For more details see: http://wiki.apache.org/hadoop/SocketTimeout
>     at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>     at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
>     at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>     at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
>     at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:792)
>     at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:751)
>     at org.apache.hadoop.ipc.Client.call(Client.java:1480)
>     at org.apache.hadoop.ipc.Client.call(Client.java:1407)
>     at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
>     at com.sun.proxy.$Proxy28.nodeHeartbeat(Unknown Source)
>     at org.apache.hadoop.yarn.server.api.impl.pb.client.ResourceTrackerPBClientImpl.nodeHeartbeat(ResourceTrackerPBClientImpl.java:80)
>     at sun.reflect.GeneratedMethodAccessor2.invoke(Unknown Source)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>     at java.lang.reflect.Method.invoke(Method.java:606)
>     at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
>     at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
>     at com.sun.proxy.$Proxy29.nodeHeartbeat(Unknown Source)
>     at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl$1.run(NodeStatusUpdaterImpl.java:596)
>     at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.hadoop.net.ConnectTimeoutException: 20000 millis timeout while waiting for channel to be ready for connect. ch : java.nio.channels.SocketChannel[connection-pending remote=yarnrm2/x.x.x.x:8031]
>     at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:534)
>     at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:495)
>     at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:609)
>     at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:707)
>     at org.apache.hadoop.ipc.Client$Connection.access$2800(Client.java:370)
>     at org.apache.hadoop.ipc.Client.getConnection(Client.java:1529)
>     at org.apache.hadoop.ipc.Client.call(Client.java:1446)
>     ... 12 more
> {noformat}
> We get this, followed by a failover back to rm1, 30 times until:
> {noformat}
> 18/06/06 15:39:44 WARN retry.RetryInvocationHandler: Exception while invoking class org.apache.hadoop.yarn.server.api.impl.pb.client.ResourceTrackerPBClientImpl.nodeHeartbeat over rm1. Not retrying because failovers (30) exceeded maximum allowed (30)
> {noformat}
> From the logs, it appears that the timeouts happen because the client is still
> trying to connect to the old IP (x.x.x.x in the logs). Looking at the code of
> the Client class, after the updateAddress method call we would expect a retry
> against the new server IP (logged as "Retrying connect to server ...") up to
> ipc.client.connect.max.retries.on.timeouts times. However, we never see the
> retry logs, and the call just fails with the exception. That setting is left
> at its default of 45 on all of our NMs.
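
The behaviour expected in the last paragraph of the description (a retry that targets the new IP once the address change has been detected) can be sketched with plain JDK classes. This is only an illustration under stated assumptions, not Hadoop's ipc.Client code: the class AddressRefreshSketch, its methods, and the resolver and tryConnect parameters are hypothetical names, and the resolver stands in for DNS so that the sketch runs without a network. It relies on one JDK fact: a resolved java.net.InetSocketAddress keeps the IP it was built with, so picking up a new IP means building a new address object.

```java
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.UnknownHostException;
import java.util.function.Function;
import java.util.function.Predicate;

/** Hypothetical helper for illustration; this is not Hadoop's ipc.Client code. */
final class AddressRefreshSketch {

    /** Builds an InetAddress from a literal IPv4 address, without any DNS lookup. */
    static InetAddress ip(String host, int a, int b, int c, int d) {
        try {
            return InetAddress.getByAddress(
                    host, new byte[] {(byte) a, (byte) b, (byte) c, (byte) d});
        } catch (UnknownHostException e) {
            // Only thrown for an illegal address length, which cannot happen here.
            throw new IllegalStateException(e);
        }
    }

    /**
     * Re-resolves the host name behind a cached socket address.
     * The resolver is an assumed stand-in for DNS. Returns a new address if
     * the IP changed, otherwise the cached instance.
     */
    static InetSocketAddress refresh(InetSocketAddress cached,
                                     Function<String, InetAddress> resolver) {
        InetAddress current = resolver.apply(cached.getHostName());
        InetSocketAddress fresh = new InetSocketAddress(current, cached.getPort());
        return fresh.equals(cached) ? cached : fresh;
    }

    /**
     * Connect loop in which each retry targets the address returned by the
     * latest refresh, so a retry after an address change uses the new IP.
     * Makes one initial attempt plus up to maxRetries retries. Returns the
     * address that accepted the connection, or null if every attempt failed.
     */
    static InetSocketAddress connectWithRefresh(InetSocketAddress initial,
                                                Function<String, InetAddress> resolver,
                                                Predicate<InetSocketAddress> tryConnect,
                                                int maxRetries) {
        InetSocketAddress target = initial;
        for (int attempt = 0; attempt <= maxRetries; attempt++) {
            if (tryConnect.test(target)) {
                return target;
            }
            // The attempt failed: pick up any address change before retrying.
            target = refresh(target, resolver);
        }
        return null;
    }
}
```

With a cached address for yarnrm2 at a made-up old IP, a resolver that returns a made-up new IP, and a tryConnect that only succeeds for the new IP, connectWithRefresh returns the new address on the second attempt. With maxRetries set to 0 it returns null, because the only attempt goes to the stale IP.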