Junping Du created YARN-4288:
--------------------------------

             Summary: NodeManager restart should keep retrying to register to 
RM while connection exception happens during RM restart
                 Key: YARN-4288
                 URL: https://issues.apache.org/jira/browse/YARN-4288
             Project: Hadoop YARN
          Issue Type: Bug
          Components: nodemanager
    Affects Versions: 2.6.0
            Reporter: Junping Du
            Assignee: Junping Du
            Priority: Critical


When the NM is restarted, NodeStatusUpdaterImpl tries to register with the RM 
over RPC. If the RM is being restarted at the same time, that call can throw 
an exception such as the following:
{noformat}
2015-08-17 14:35:59,434 ERROR nodemanager.NodeStatusUpdaterImpl (NodeStatusUpdaterImpl.java:rebootNodeStatusUpdaterAndRegisterWithRM(222)) - Unexpected error rebooting NodeStatusUpdater
java.io.IOException: Failed on local exception: java.io.IOException: Connection reset by peer; Host Details : local host is: "172.27.62.28"; destination host is: "172.27.62.57":8025;
        at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:772)
        at org.apache.hadoop.ipc.Client.call(Client.java:1473)
        at org.apache.hadoop.ipc.Client.call(Client.java:1400)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
        at com.sun.proxy.$Proxy36.registerNodeManager(Unknown Source)
        at org.apache.hadoop.yarn.server.api.impl.pb.client.ResourceTrackerPBClientImpl.registerNodeManager(ResourceTrackerPBClientImpl.java:68)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
        at com.sun.proxy.$Proxy37.registerNodeManager(Unknown Source)
        at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.registerWithRM(NodeStatusUpdaterImpl.java:257)
        at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.rebootNodeStatusUpdaterAndRegisterWithRM(NodeStatusUpdaterImpl.java:215)
        at org.apache.hadoop.yarn.server.nodemanager.NodeManager$2.run(NodeManager.java:304)
Caused by: java.io.IOException: Connection reset by peer
        at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
        at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
        at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
        at sun.nio.ch.IOUtil.read(IOUtil.java:197)
        at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379)
        at org.apache.hadoop.net.SocketInputStream$Reader.performIO(SocketInputStream.java:57)
        at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
        at java.io.FilterInputStream.read(FilterInputStream.java:133)
        at java.io.FilterInputStream.read(FilterInputStream.java:133)
        at org.apache.hadoop.ipc.Client$Connection$PingInputStream.read(Client.java:514)
        at java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
        at java.io.BufferedInputStream.read(BufferedInputStream.java:254)
        at java.io.DataInputStream.readInt(DataInputStream.java:387)
        at org.apache.hadoop.ipc.Client$Connection.receiveRpcResponse(Client.java:1072)
        at org.apache.hadoop.ipc.Client$Connection.run(Client.java:967)
2015-08-17 14:35:59,436 FATAL nodemanager.NodeManager (NodeManager.java:run(307)) - Error while rebooting NodeStatusUpdater.
org.apache.hadoop.yarn.exceptions.YarnRuntimeException: java.io.IOException: Failed on local exception: java.io.IOException: Connection reset by peer; Host Details : local host is: "172.27.62.28"; destination host is: "172.27.62.57":8025;
        at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.rebootNodeStatusUpdaterAndRegisterWithRM(NodeStatusUpdaterImpl.java:223)
        at org.apache.hadoop.yarn.server.nodemanager.NodeManager$2.run(NodeManager.java:304)
Caused by: java.io.IOException: Failed on local exception: java.io.IOException: Connection reset by peer; Host Details : local host is: "ebdp-ch2-172.27.62.28"; destination host is: "172.27.62.57":8025;
        at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:772)
        at org.apache.hadoop.ipc.Client.call(Client.java:1473)
        at org.apache.hadoop.ipc.Client.call(Client.java:1400)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
        at com.sun.proxy.$Proxy36.registerNodeManager(Unknown Source)
        at org.apache.hadoop.yarn.server.api.impl.pb.client.ResourceTrackerPBClientImpl.registerNodeManager(ResourceTrackerPBClientImpl.java:68)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
        at com.sun.proxy.$Proxy37.registerNodeManager(Unknown Source)
        at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.registerWithRM(NodeStatusUpdaterImpl.java:257)
        at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.rebootNodeStatusUpdaterAndRegisterWithRM(NodeStatusUpdaterImpl.java:215)
        ... 1 more
Caused by: java.io.IOException: Connection reset by peer
        at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
        at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
        at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
        at sun.nio.ch.IOUtil.read(IOUtil.java:197)
        at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379)
        at org.apache.hadoop.net.SocketInputStream$Reader.performIO(SocketInputStream.java:57)
        at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
        at java.io.FilterInputStream.read(FilterInputStream.java:133)
        at java.io.FilterInputStream.read(FilterInputStream.java:133)
        at org.apache.hadoop.ipc.Client$Connection$PingInputStream.read(Client.java:514)
        at java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
        at java.io.BufferedInputStream.read(BufferedInputStream.java:254)
        at java.io.DataInputStream.readInt(DataInputStream.java:387)
        at org.apache.hadoop.ipc.Client$Connection.receiveRpcResponse(Client.java:1072)
        at org.apache.hadoop.ipc.Client$Connection.run(Client.java:967)
2015-08-17 14:35:59,445 INFO  mortbay.log (Slf4jLog.java:info(67)) - Stopped HttpServer2$SelectChannelConnectorWithSafeStartup@0.0.0.0:8042
2015-08-17 14:35:59,547 INFO  containermanager.ContainerManagerImpl (ContainerManagerImpl.java:cleanUpApplicationsOnNMShutDown(512)) - Applications still running : [application_1439417357296_45357, application_1439417357296_45403, application_1439417357296_45355, application_1439417357296_45111, application_1439417357296_45452, application_1439417357296_45350, application_1439417357296_45499, application_1439417357296_45205, application_1439417357296_21009]
2015-08-17 14:35:59,548 INFO  ipc.Server (Server.java:stop(2469)) - Stopping server on 45454
2015-08-17 14:35:59,551 INFO  ipc.Server (Server.java:run(717)) - Stopping IPC Server listener on 45454
2015-08-17 14:35:59,551 INFO  logaggregation.LogAggregationService (LogAggregationService.java:serviceStop(141)) - org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService waiting for pending aggregation during exit
2015-08-17 14:35:59,552 INFO  ipc.Server (Server.java:run(843)) - Stopping IPC Server Responder
{noformat}
This causes the NM restart to fail. We should add a simple fix so that this 
registration with the RM is retried on connection failures.
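
To illustrate the kind of behaviour being asked for, here is a minimal sketch 
of a retry loop around a registration call. It is not the actual patch and 
does not use any Hadoop API: the class RegisterRetrySketch, the helper 
retryOnIOException, and the attempt/sleep parameters are all hypothetical 
names chosen for this example, and the "registration" is simulated.

```java
import java.io.IOException;
import java.util.concurrent.Callable;

// Hypothetical illustration only; none of these names exist in Hadoop.
final class RegisterRetrySketch {

    /**
     * Calls op, retrying when it throws an IOException (for example
     * "Connection reset by peer"). Gives up after maxAttempts and rethrows
     * the last IOException. Other exceptions are not retried.
     */
    static <T> T retryOnIOException(Callable<T> op, int maxAttempts, long sleepMillis)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        IOException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return op.call();
            } catch (IOException e) {
                last = e;
                // Sleep only if another attempt will follow.
                if (attempt < maxAttempts) {
                    Thread.sleep(sleepMillis);
                }
            }
        }
        throw last;
    }

    public static void main(String[] args) throws Exception {
        final int[] calls = {0};
        // Simulated registration: fails twice with a connection error, then succeeds.
        String result = retryOnIOException(() -> {
            if (++calls[0] < 3) {
                throw new IOException("Connection reset by peer");
            }
            return "registered";
        }, 5, 10L);
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```

In the NM itself, the wait time and retry interval would presumably come from 
the existing RM connection settings rather than hard-coded values; that is an 
assumption here, not something this report specifies.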



