[ https://issues.apache.org/jira/browse/YARN-4288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14975226#comment-14975226 ]
Junping Du commented on YARN-4288:
----------------------------------

Thanks [~vinodkv] for the comments.

bq. Why isn't existing RMProxy framework taking care of this?
RMProxy is supposed to take care of this. However, RMProxy retries only on specific (known) exceptions and fails immediately on all others. In this case, the IOException that gets thrown fails directly without any retry (in the non-HA case). That is a little risky, since other exceptions could also be thrown while the RM is down. For this particular case, I can add IOException (other than RemoteException) to the exceptions that are handled, which sounds like an easy fix.

bq. Why are we putting special code in NodeStatusUpdater? Shouldn't we use something in the RMProxy framework? See ServerProxy for example that gets used by NMClients.
As I mentioned above, having a whitelist of exceptions to retry doesn't sound robust enough: if we hit an exception we haven't seen before, we could skip the retry unintentionally, couldn't we? Anyway, I can fix the problem within the existing retry policy framework, but hopefully we can improve the framework in another JIRA.

bq. Just looked at YARN-4132 too, we should definitely see if we can merge these two together.
This is a bug: the NM doesn't retry in some cases. YARN-4132 talks about a different problem, that the NM should retry for longer than the general RMProxy client does, which is a more general improvement. I think we'd better keep them separate. Thoughts?

> NodeManager restart should keep retrying to register to RM while connection
> exception happens during RM failed over.
> -------------------------------------------------------------------------------------------------------------------- > > Key: YARN-4288 > URL: https://issues.apache.org/jira/browse/YARN-4288 > Project: Hadoop YARN > Issue Type: Bug > Components: nodemanager > Affects Versions: 2.6.0 > Reporter: Junping Du > Assignee: Junping Du > Priority: Critical > Attachments: YARN-4288.patch > > > When NM get restarted, NodeStatusUpdaterImpl will try to register to RM with > RPC which could throw following exceptions when RM get restarted at the same > time, like following exception shows: > {noformat} > 2015-08-17 14:35:59,434 ERROR nodemanager.NodeStatusUpdaterImpl > (NodeStatusUpdaterImpl.java:rebootNodeStatusUpdaterAndRegisterWithRM(222)) - > Unexpected error rebooting NodeStatusUpdater > java.io.IOException: Failed on local exception: java.io.IOException: > Connection reset by peer; Host Details : local host is: "172.27.62.28"; > destination host is: "172.27.62.57":8025; > at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:772) > at org.apache.hadoop.ipc.Client.call(Client.java:1473) > at org.apache.hadoop.ipc.Client.call(Client.java:1400) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232) > at com.sun.proxy.$Proxy36.registerNodeManager(Unknown Source) > at > org.apache.hadoop.yarn.server.api.impl.pb.client.ResourceTrackerPBClientImpl.registerNodeManager(ResourceTrackerPBClientImpl.java:68) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102) > at 
com.sun.proxy.$Proxy37.registerNodeManager(Unknown Source) > at > org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.registerWithRM(NodeStatusUpdaterImpl.java:257) > at > org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.rebootNodeStatusUpdaterAndRegisterWithRM(NodeStatusUpdaterImpl.java:215) > at > org.apache.hadoop.yarn.server.nodemanager.NodeManager$2.run(NodeManager.java:304) > Caused by: java.io.IOException: Connection reset by peer > at sun.nio.ch.FileDispatcherImpl.read0(Native Method) > at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39) > at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223) > at sun.nio.ch.IOUtil.read(IOUtil.java:197) > at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379) > at > org.apache.hadoop.net.SocketInputStream$Reader.performIO(SocketInputStream.java:57) > at > org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142) > at > org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161) > at > org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131) > at java.io.FilterInputStream.read(FilterInputStream.java:133) > at java.io.FilterInputStream.read(FilterInputStream.java:133) > at > org.apache.hadoop.ipc.Client$Connection$PingInputStream.read(Client.java:514) > at java.io.BufferedInputStream.fill(BufferedInputStream.java:235) > at java.io.BufferedInputStream.read(BufferedInputStream.java:254) > at java.io.DataInputStream.readInt(DataInputStream.java:387) > at > org.apache.hadoop.ipc.Client$Connection.receiveRpcResponse(Client.java:1072) > at org.apache.hadoop.ipc.Client$Connection.run(Client.java:967) > 2015-08-17 14:35:59,436 FATAL nodemanager.NodeManager > (NodeManager.java:run(307)) - Error while rebooting NodeStatusUpdater. 
> org.apache.hadoop.yarn.exceptions.YarnRuntimeException: java.io.IOException: > Failed on local exception: java.io.IOException: Connection reset by peer; > Host Details : local host is: "172.27.62.28"; destination host is: > "172.27.62.57":8025; > at > org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.rebootNodeStatusUpdaterAndRegisterWithRM(NodeStatusUpdaterImpl.java:223) > at > org.apache.hadoop.yarn.server.nodemanager.NodeManager$2.run(NodeManager.java:304) > Caused by: java.io.IOException: Failed on local exception: > java.io.IOException: Connection reset by peer; Host Details : local host is: > "ebdp-ch2-172.27.62.28"; destination host is: "172.27.62.57":8025; > at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:772) > at org.apache.hadoop.ipc.Client.call(Client.java:1473) > at org.apache.hadoop.ipc.Client.call(Client.java:1400) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232) > at com.sun.proxy.$Proxy36.registerNodeManager(Unknown Source) > at > org.apache.hadoop.yarn.server.api.impl.pb.client.ResourceTrackerPBClientImpl.registerNodeManager(ResourceTrackerPBClientImpl.java:68) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102) > at com.sun.proxy.$Proxy37.registerNodeManager(Unknown Source) > at > org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.registerWithRM(NodeStatusUpdaterImpl.java:257) > at > org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.rebootNodeStatusUpdaterAndRegisterWithRM(NodeStatusUpdaterImpl.java:215) > ... 
1 more > Caused by: java.io.IOException: Connection reset by peer > at sun.nio.ch.FileDispatcherImpl.read0(Native Method) > at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39) > at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223) > at sun.nio.ch.IOUtil.read(IOUtil.java:197) > at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379) > at > org.apache.hadoop.net.SocketInputStream$Reader.performIO(SocketInputStream.java:57) > at > org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142) > at > org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161) > at > org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131) > at java.io.FilterInputStream.read(FilterInputStream.java:133) > at java.io.FilterInputStream.read(FilterInputStream.java:133) > at > org.apache.hadoop.ipc.Client$Connection$PingInputStream.read(Client.java:514) > at java.io.BufferedInputStream.fill(BufferedInputStream.java:235) > at java.io.BufferedInputStream.read(BufferedInputStream.java:254) > at java.io.DataInputStream.readInt(DataInputStream.java:387) > at > org.apache.hadoop.ipc.Client$Connection.receiveRpcResponse(Client.java:1072) > at org.apache.hadoop.ipc.Client$Connection.run(Client.java:967) > 2015-08-17 14:35:59,445 INFO mortbay.log (Slf4jLog.java:info(67)) - Stopped > HttpServer2$SelectChannelConnectorWithSafeStartup@0.0.0.0:8042 > 2015-08-17 14:35:59,547 INFO containermanager.ContainerManagerImpl > (ContainerManagerImpl.java:cleanUpApplicationsOnNMShutDown(512)) - > Applications still running : [application_1439417357296_45357, > application_1439417357296_45403, application_1439417357296_45355, > application_1439417357296_45111, application_1439417357296_45452, > application_1439417357296_45350, application_1439417357296_45499, > application_1439417357296_45205, application_1439417357296_21009] > 2015-08-17 14:35:59,548 INFO ipc.Server (Server.java:stop(2469)) - Stopping > server on 45454 > 2015-08-17 14:35:59,551 
INFO ipc.Server (Server.java:run(717)) - Stopping > IPC Server listener on 45454 > 2015-08-17 14:35:59,551 INFO logaggregation.LogAggregationService > (LogAggregationService.java:serviceStop(141)) - > org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService > waiting for pending aggregation during exit > 2015-08-17 14:35:59,552 INFO ipc.Server (Server.java:run(843)) - Stopping > IPC Server Responder > {noformat} > This causes the NM restart to fail. We should have a simple fix that lets this > registration with the RM retry on connection failures.
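
The retry concern raised in the comment can be illustrated with a minimal, self-contained Java sketch. This is not the RMProxy or NodeStatusUpdaterImpl code: `RegisterRetrySketch`, `RetryRule`, `whitelist`, `anyIOException`, `callWithRetry`, and `flakyRegister` are hypothetical names, and the whitelist is assumed to match on the exact exception class. Under those assumptions, a plain `IOException` such as "Connection reset by peer" is not on a list that only names `ConnectException`, so the call fails on the first attempt, while a rule that accepts any `IOException` keeps retrying.

```java
import java.io.IOException;
import java.net.ConnectException;
import java.util.Arrays;
import java.util.concurrent.Callable;

// Hypothetical sketch; none of these names come from Hadoop.
class RegisterRetrySketch {

  /** Decides whether a failed call should be attempted again. */
  interface RetryRule {
    boolean shouldRetry(Exception e);
  }

  /** Whitelist rule (assumed exact-class match): unlisted exceptions are not retried. */
  static RetryRule whitelist(final Class<?>... retriable) {
    return e -> Arrays.asList(retriable).contains(e.getClass());
  }

  /** Broader rule: retry on IOException and any of its subclasses. */
  static RetryRule anyIOException() {
    return e -> e instanceof IOException;
  }

  /** Runs the action up to maxAttempts times, sleeping between attempts. */
  static <T> T callWithRetry(Callable<T> action, RetryRule rule,
                             int maxAttempts, long sleepMillis) throws Exception {
    for (int attempt = 1; ; attempt++) {
      try {
        return action.call();
      } catch (Exception e) {
        // Give up when attempts are exhausted or the rule rejects the exception.
        if (attempt >= maxAttempts || !rule.shouldRetry(e)) {
          throw e;
        }
        Thread.sleep(sleepMillis);
      }
    }
  }

  /** Simulated registration that throws a plain IOException `failures` times, then succeeds. */
  static Callable<String> flakyRegister(final int failures, final int[] calls) {
    return () -> {
      calls[0]++;
      if (calls[0] <= failures) {
        throw new IOException("Connection reset by peer");
      }
      return "registered";
    };
  }

  public static void main(String[] args) throws Exception {
    int[] calls = {0};
    try {
      callWithRetry(flakyRegister(2, calls), whitelist(ConnectException.class), 5, 10L);
    } catch (IOException e) {
      System.out.println("whitelist gave up after " + calls[0] + " call(s)");
    }

    calls[0] = 0;
    String result = callWithRetry(flakyRegister(2, calls), anyIOException(), 5, 10L);
    System.out.println(result + " after " + calls[0] + " call(s)");
  }
}
```

The broader rule trades precision for robustness: it also retries on I/O failures that may never recover, which is why a bounded `maxAttempts` is kept in the sketch.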