Hi Dale,

Does it happen frequently? Does the RM work normally (can it still accept new jobs) when this happens? From the logs, it seems the AM hits errors when it heartbeats to the RM, and it keeps switching between the two RMs for a long time. It might be an RM issue; could you check the RM logs?
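If it helps narrow things down, here is a minimal sketch (an assumption on my side, not part of Tez or YARN) that checks whether the port the AM is being refused on accepts TCP connections at all. The hostnames and port 8030 are copied from the stack traces quoted below; `can_connect` and `check_all` are hypothetical helper names.

```python
import socket

# Hostnames and port taken from the ConnectException messages in the log.
RM_HOSTS = [
    "eu-lamp-prod-xl-0065-hadoop-sec-master",
    "eu-lamp-prod-xl-0064-hadoop-master",
]
RM_SCHEDULER_PORT = 8030


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts and name-resolution failures.
        return False


def check_all(hosts, port, timeout=3.0):
    """Map each host to whether it accepts connections on the given port."""
    return {host: can_connect(host, port, timeout) for host in hosts}
```

Calling `check_all(RM_HOSTS, RM_SCHEDULER_PORT)` from the node that runs the AM shows whether either RM is listening there. I would expect at least the active RM to accept the connection; if both are refused, the problem is more likely on the RM side than in Tez.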
Best Regards,
Jeff Zhang

From: <Bradman>, Dale <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Monday, October 19, 2015 at 8:35 PM
To: "[email protected]" <[email protected]>
Subject: Re: Tez Code 1 & Tez Code 2

I have attached the logs but the crux of it is:

2015-10-13 15:39:32,866 INFO [AMRM Heartbeater thread] retry.RetryInvocationHandler: Exception while invoking allocate of class ApplicationMasterProtocolPBClientImpl over rm1. Trying to fail over immediately.
java.net.ConnectException: Call From EU-LAMP-PROD-M-0068-HADOOP-SLAVE02/10.10.7.125 to eu-lamp-prod-xl-0065-hadoop-sec-master:8030 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
    at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:791)
    at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:731)
    at org.apache.hadoop.ipc.Client.call(Client.java:1473)
    at org.apache.hadoop.ipc.Client.call(Client.java:1400)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
    at com.sun.proxy.$Proxy39.allocate(Unknown Source)
    at org.apache.hadoop.yarn.api.impl.pb.client.ApplicationMasterProtocolPBClientImpl.allocate(ApplicationMasterProtocolPBClientImpl.java:77)
    at sun.reflect.GeneratedMethodAccessor5.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
    at com.sun.proxy.$Proxy40.allocate(Unknown Source)
    at org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl.allocate(AMRMClientImpl.java:278)
    at org.apache.hadoop.yarn.client.api.async.impl.AMRMClientAsyncImpl$HeartbeatThread.run(AMRMClientAsyncImpl.java:224)
Caused by: java.net.ConnectException: Connection refused
    at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
    at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739)
    at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
    at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:530)
    at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:494)
    at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:608)
    at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:706)
    at org.apache.hadoop.ipc.Client$Connection.access$2800(Client.java:369)
    at org.apache.hadoop.ipc.Client.getConnection(Client.java:1522)
    at org.apache.hadoop.ipc.Client.call(Client.java:1439)
    ... 12 more
2015-10-13 15:39:32,868 INFO [AMRM Heartbeater thread] client.ConfiguredRMFailoverProxyProvider: Failing over to rm2
2015-10-13 15:39:32,871 INFO [AMRM Heartbeater thread] retry.RetryInvocationHandler: Exception while invoking allocate of class ApplicationMasterProtocolPBClientImpl over rm2 after 1 fail over attempts. Trying to fail over after sleeping for 37818ms.
java.net.ConnectException: Call From EU-LAMP-PROD-M-0068-HADOOP-SLAVE02/10.10.7.125 to eu-lamp-prod-xl-0064-hadoop-master:8030 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
    at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:791)
    at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:731)
    at org.apache.hadoop.ipc.Client.call(Client.java:1473)
    at org.apache.hadoop.ipc.Client.call(Client.java:1400)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
    at com.sun.proxy.$Proxy39.allocate(Unknown Source)
    at org.apache.hadoop.yarn.api.impl.pb.client.ApplicationMasterProtocolPBClientImpl.allocate(ApplicationMasterProtocolPBClientImpl.java:77)
    at sun.reflect.GeneratedMethodAccessor5.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
    at com.sun.proxy.$Proxy40.allocate(Unknown Source)
    at org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl.allocate(AMRMClientImpl.java:278)
    at org.apache.hadoop.yarn.client.api.async.impl.AMRMClientAsyncImpl$HeartbeatThread.run(AMRMClientAsyncImpl.java:224)
Caused by: java.net.ConnectException: Connection refused
    at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
    at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739)
    at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
    at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:530)
    at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:494)
    at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:608)
    at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:706)
    at org.apache.hadoop.ipc.Client$Connection.access$2800(Client.java:369)
    at org.apache.hadoop.ipc.Client.getConnection(Client.java:1522)
    at org.apache.hadoop.ipc.Client.call(Client.java:1439)
    ... 12 more
2015-10-13 15:39:43,938 INFO [Timer-1] app.DAGAppMaster: Session timed out, lastDAGCompletionTime=1444746866023 ms, sessionTimeoutInterval=300000 ms
2015-10-13 15:39:43,938 INFO [Timer-1] rm.TaskSchedulerEventHandler: TaskScheduler notified that it should unregister from RM
2015-10-13 15:39:43,938 INFO [Timer-1] app.DAGAppMaster: No current running DAG, shutting down the AM
2015-10-13 15:39:43,938 INFO [Timer-1] app.DAGAppMaster: DAGAppMasterShutdownHandler invoked
2015-10-13 15:39:43,938 INFO [Timer-1] app.DAGAppMaster: Handling DAGAppMaster shutdown
2015-10-13 15:39:43,939 INFO [AMShutdownThread] app.DAGAppMaster: Sleeping for 5 seconds before shutting down
2015-10-13 15:39:48,939 INFO [AMShutdownThread] app.DAGAppMaster: Calling stop for all the services
2015-10-13 15:39:48,940 INFO [AMShutdownThread] history.HistoryEventHandler: Stopping HistoryEventHandler
2015-10-13 15:39:48,941 INFO [AMShutdownThread] recovery.RecoveryService: Stopping RecoveryService
2015-10-13 15:39:48,941 INFO [AMShutdownThread] recovery.RecoveryService: Closing Summary Stream
2015-10-13 15:39:48,941 INFO [RecoveryEventHandlingThread] recovery.RecoveryService: EventQueue take interrupted. Returning
2015-10-13 15:39:48,951 INFO [AMShutdownThread] ats.ATSHistoryLoggingService: Stopping ATSService, eventQueueBacklog=0
2015-10-13 15:39:48,952 INFO [DelayedContainerManager] rm.YarnTaskSchedulerService: AllocatedContainerManager Thread interrupted
2015-10-13 15:39:48,954 INFO [AMShutdownThread] rm.YarnTaskSchedulerService: Unregistering application from RM, exitStatus=SUCCEEDED, exitMessage=Session stats:submittedDAGs=1, successfulDAGs=1, failedDAGs=0, killedDAGs=0 , trackingURL=
2015-10-13 15:40:10,689 INFO [AMRM Heartbeater thread] client.ConfiguredRMFailoverProxyProvider: Failing over to rm1

As I said, I've got YARN HA installed and the Active/Standby ResourceManagers switched over last week.

Thanks,
Dale

On 19 Oct 2015, at 13:07, Jianfeng (Jeff) Zhang <[email protected]> wrote:

application_1444742140034_0009
