[ 
https://issues.apache.org/jira/browse/YARN-4288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14975226#comment-14975226
 ] 

Junping Du commented on YARN-4288:
----------------------------------

Thanks [~vinodkv] for the comments.
bq. Why isn't existing RMProxy framework taking care of this?
RMProxy is supposed to take care of this. However, the way that RMProxy to do 
is to do retry on specific (known) exceptions but fail directly for other 
exceptions. Like this case, IOException get thrown will get failed directly 
without any retry (for non-HA case). We are a little risky if more potential 
exception could get thrown during RM down time. For this particular case, I can 
add the IOException (other than RemoteException) to be handled directly which 
sounds a easy way of fix.

bq. Why are we putting special code in NodeStatusUpdater? Shouldn't we use 
something in the RMProxy framework? See ServerProxy for example that gets used 
by NMClients.
As I mentioned above, having a white list of exceptions to retry doesn't sound 
robust enough: if any exception we don't meet before, we could skip the retry 
unintentionally. Isn't it? Anyway, I could fix the problem with following 
existing retry policy framework but hopefully we could improve the framework in 
other JIRA.

bq. Just looked at YARN-4132 too, we should definitely see if we can merge 
these two together.
This is a bug that NM doesn't retry in some cases. YARN-4132 talk about another 
problem that NM retry should be longer than general RMProxy client which is a 
more general improvement. I think we'd better separate them out. Thoughts?

> NodeManager restart should keep retrying to register to RM while connection 
> exception happens during RM failed over.
> --------------------------------------------------------------------------------------------------------------------
>
>                 Key: YARN-4288
>                 URL: https://issues.apache.org/jira/browse/YARN-4288
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>    Affects Versions: 2.6.0
>            Reporter: Junping Du
>            Assignee: Junping Du
>            Priority: Critical
>         Attachments: YARN-4288.patch
>
>
> When NM get restarted, NodeStatusUpdaterImpl will try to register to RM with 
> RPC which could throw following exceptions when RM get restarted at the same 
> time, like following exception shows:
> {noformat}
> 2015-08-17 14:35:59,434 ERROR nodemanager.NodeStatusUpdaterImpl 
> (NodeStatusUpdaterImpl.java:rebootNodeStatusUpdaterAndRegisterWithRM(222)) - 
> Unexpected error rebooting NodeStatusUpdater
> java.io.IOException: Failed on local exception: java.io.IOException: 
> Connection reset by peer; Host Details : local host is: "172.27.62.28"; 
> destination host is: "172.27.62.57":8025;
>         at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:772)
>         at org.apache.hadoop.ipc.Client.call(Client.java:1473)
>         at org.apache.hadoop.ipc.Client.call(Client.java:1400)
>         at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
>         at com.sun.proxy.$Proxy36.registerNodeManager(Unknown Source)
>         at 
> org.apache.hadoop.yarn.server.api.impl.pb.client.ResourceTrackerPBClientImpl.registerNodeManager(ResourceTrackerPBClientImpl.java:68)
>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>         at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>         at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>         at java.lang.reflect.Method.invoke(Method.java:606)
>         at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
>         at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
>         at com.sun.proxy.$Proxy37.registerNodeManager(Unknown Source)
>         at 
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.registerWithRM(NodeStatusUpdaterImpl.java:257)
>         at 
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.rebootNodeStatusUpdaterAndRegisterWithRM(NodeStatusUpdaterImpl.java:215)
>         at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager$2.run(NodeManager.java:304)
> Caused by: java.io.IOException: Connection reset by peer
>         at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
>         at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
>         at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
>         at sun.nio.ch.IOUtil.read(IOUtil.java:197)
>         at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379)
>         at 
> org.apache.hadoop.net.SocketInputStream$Reader.performIO(SocketInputStream.java:57)
>         at 
> org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
>         at 
> org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
>         at 
> org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
>         at java.io.FilterInputStream.read(FilterInputStream.java:133)
>         at java.io.FilterInputStream.read(FilterInputStream.java:133)
>         at 
> org.apache.hadoop.ipc.Client$Connection$PingInputStream.read(Client.java:514)
>         at java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
>         at java.io.BufferedInputStream.read(BufferedInputStream.java:254)
>         at java.io.DataInputStream.readInt(DataInputStream.java:387)
>         at 
> org.apache.hadoop.ipc.Client$Connection.receiveRpcResponse(Client.java:1072)
>         at org.apache.hadoop.ipc.Client$Connection.run(Client.java:967)
> 2015-08-17 14:35:59,436 FATAL nodemanager.NodeManager 
> (NodeManager.java:run(307)) - Error while rebooting NodeStatusUpdater.
> org.apache.hadoop.yarn.exceptions.YarnRuntimeException: java.io.IOException: 
> Failed on local exception: java.io.IOException: Connection reset by peer; 
> Host Details : local host is: "172.27.62.28"; destination host is: 
> "172.27.62.57":8025;
>         at 
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.rebootNodeStatusUpdaterAndRegisterWithRM(NodeStatusUpdaterImpl.java:223)
>         at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager$2.run(NodeManager.java:304)
> Caused by: java.io.IOException: Failed on local exception: 
> java.io.IOException: Connection reset by peer; Host Details : local host is: 
> "ebdp-ch2-172.27.62.28"; destination host is: "172.27.62.57":8025;
>         at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:772)
>         at org.apache.hadoop.ipc.Client.call(Client.java:1473)
>         at org.apache.hadoop.ipc.Client.call(Client.java:1400)
>         at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
>         at com.sun.proxy.$Proxy36.registerNodeManager(Unknown Source)
>         at 
> org.apache.hadoop.yarn.server.api.impl.pb.client.ResourceTrackerPBClientImpl.registerNodeManager(ResourceTrackerPBClientImpl.java:68)
>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>         at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>         at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>         at java.lang.reflect.Method.invoke(Method.java:606)
>         at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
>         at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
>         at com.sun.proxy.$Proxy37.registerNodeManager(Unknown Source)
>         at 
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.registerWithRM(NodeStatusUpdaterImpl.java:257)
>         at 
> org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.rebootNodeStatusUpdaterAndRegisterWithRM(NodeStatusUpdaterImpl.java:215)
>         ... 1 more
> Caused by: java.io.IOException: Connection reset by peer
>         at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
>         at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
>         at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
>         at sun.nio.ch.IOUtil.read(IOUtil.java:197)
>         at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379)
>         at 
> org.apache.hadoop.net.SocketInputStream$Reader.performIO(SocketInputStream.java:57)
>         at 
> org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
>         at 
> org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
>         at 
> org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
>         at java.io.FilterInputStream.read(FilterInputStream.java:133)
>         at java.io.FilterInputStream.read(FilterInputStream.java:133)
>         at 
> org.apache.hadoop.ipc.Client$Connection$PingInputStream.read(Client.java:514)
>         at java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
>         at java.io.BufferedInputStream.read(BufferedInputStream.java:254)
>         at java.io.DataInputStream.readInt(DataInputStream.java:387)
>         at 
> org.apache.hadoop.ipc.Client$Connection.receiveRpcResponse(Client.java:1072)
>         at org.apache.hadoop.ipc.Client$Connection.run(Client.java:967)
> 2015-08-17 14:35:59,445 INFO  mortbay.log (Slf4jLog.java:info(67)) - Stopped 
> HttpServer2$SelectChannelConnectorWithSafeStartup@0.0.0.0:8042
> 2015-08-17 14:35:59,547 INFO  containermanager.ContainerManagerImpl 
> (ContainerManagerImpl.java:cleanUpApplicationsOnNMShutDown(512)) - 
> Applications still running : [application_1439417357296_45357, 
> application_1439417357296_45403, application_1439417357296_45355, 
> application_1439417357296_45111, application_1439417357296_45452, 
> application_1439417357296_45350, application_1439417357296_45499, 
> application_1439417357296_45205, application_1439417357296_21009]
> 2015-08-17 14:35:59,548 INFO  ipc.Server (Server.java:stop(2469)) - Stopping 
> server on 45454
> 2015-08-17 14:35:59,551 INFO  ipc.Server (Server.java:run(717)) - Stopping 
> IPC Server listener on 45454
> 2015-08-17 14:35:59,551 INFO  logaggregation.LogAggregationService 
> (LogAggregationService.java:serviceStop(141)) - 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService
>  waiting for pending aggregation during exit
> 2015-08-17 14:35:59,552 INFO  ipc.Server (Server.java:run(843)) - Stopping 
> IPC Server Responder
> {noformat}
> It will make NM restart get failed. We should have a simple fix to allow this 
> register to RM can retry with connection failures.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to