Re: Tez Code 1 & Tez Code 2

Jianfeng (Jeff) Zhang Mon, 19 Oct 2015 05:47:16 -0700

Hi Dale,

Does it happen frequently? Does the RM still work normally (can it still 
accept new jobs) when this happens?
From the logs, it seems the AM hits errors when heartbeating with the RM, and 
it keeps switching between the 2 RMs for a long time. It might be an RM issue; 
could you check the RM logs?
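
As a quick first check before digging into the RM logs, the snippet below is a 
minimal sketch that tests whether the RM scheduler port accepts TCP connections 
at all. The two host names and port 8030 are taken from the "Connection 
refused" messages in the log quoted below; the function name `can_connect` and 
the 3-second timeout are only illustrative assumptions, and the script should 
be run from a node that can resolve those host names.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts and name-resolution failures.
        return False


if __name__ == "__main__":
    # Hosts and port as they appear in the AM log; adjust for your cluster.
    rm_hosts = [
        "eu-lamp-prod-xl-0065-hadoop-sec-master",
        "eu-lamp-prod-xl-0064-hadoop-master",
    ]
    for host in rm_hosts:
        state = "reachable" if can_connect(host, 8030) else "NOT reachable"
        print(f"{host}:8030 is {state}")
```

If both RMs report as not reachable on 8030, that would point to the 
ResourceManager side (or the network) rather than to Tez itself.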



Best Regards,
Jeff Zhang


From: Bradman, Dale <[email protected]>
Reply-To: [email protected]
Date: Monday, October 19, 2015 at 8:35 PM
To: [email protected]
Subject: Re: Tez Code 1 & Tez Code 2

I have attached the logs but the crux of it is:

2015-10-13 15:39:32,866 INFO [AMRM Heartbeater thread] retry.RetryInvocationHandler: Exception while invoking allocate of class ApplicationMasterProtocolPBClientImpl over rm1. Trying to fail over immediately.
java.net.ConnectException: Call From EU-LAMP-PROD-M-0068-HADOOP-SLAVE02/10.10.7.125 to eu-lamp-prod-xl-0065-hadoop-sec-master:8030 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
    at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:791)
    at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:731)
    at org.apache.hadoop.ipc.Client.call(Client.java:1473)
    at org.apache.hadoop.ipc.Client.call(Client.java:1400)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
    at com.sun.proxy.$Proxy39.allocate(Unknown Source)
    at org.apache.hadoop.yarn.api.impl.pb.client.ApplicationMasterProtocolPBClientImpl.allocate(ApplicationMasterProtocolPBClientImpl.java:77)
    at sun.reflect.GeneratedMethodAccessor5.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
    at com.sun.proxy.$Proxy40.allocate(Unknown Source)
    at org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl.allocate(AMRMClientImpl.java:278)
    at org.apache.hadoop.yarn.client.api.async.impl.AMRMClientAsyncImpl$HeartbeatThread.run(AMRMClientAsyncImpl.java:224)
Caused by: java.net.ConnectException: Connection refused
    at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
    at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739)
    at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
    at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:530)
    at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:494)
    at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:608)
    at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:706)
    at org.apache.hadoop.ipc.Client$Connection.access$2800(Client.java:369)
    at org.apache.hadoop.ipc.Client.getConnection(Client.java:1522)
    at org.apache.hadoop.ipc.Client.call(Client.java:1439)
    ... 12 more
2015-10-13 15:39:32,868 INFO [AMRM Heartbeater thread] client.ConfiguredRMFailoverProxyProvider: Failing over to rm2
2015-10-13 15:39:32,871 INFO [AMRM Heartbeater thread] retry.RetryInvocationHandler: Exception while invoking allocate of class ApplicationMasterProtocolPBClientImpl over rm2 after 1 fail over attempts. Trying to fail over after sleeping for 37818ms.
java.net.ConnectException: Call From EU-LAMP-PROD-M-0068-HADOOP-SLAVE02/10.10.7.125 to eu-lamp-prod-xl-0064-hadoop-master:8030 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
    at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:791)
    at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:731)
    at org.apache.hadoop.ipc.Client.call(Client.java:1473)
    at org.apache.hadoop.ipc.Client.call(Client.java:1400)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
    at com.sun.proxy.$Proxy39.allocate(Unknown Source)
    at org.apache.hadoop.yarn.api.impl.pb.client.ApplicationMasterProtocolPBClientImpl.allocate(ApplicationMasterProtocolPBClientImpl.java:77)
    at sun.reflect.GeneratedMethodAccessor5.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
    at com.sun.proxy.$Proxy40.allocate(Unknown Source)
    at org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl.allocate(AMRMClientImpl.java:278)
    at org.apache.hadoop.yarn.client.api.async.impl.AMRMClientAsyncImpl$HeartbeatThread.run(AMRMClientAsyncImpl.java:224)
Caused by: java.net.ConnectException: Connection refused
    at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
    at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739)
    at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
    at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:530)
    at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:494)
    at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:608)
    at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:706)
    at org.apache.hadoop.ipc.Client$Connection.access$2800(Client.java:369)
    at org.apache.hadoop.ipc.Client.getConnection(Client.java:1522)
    at org.apache.hadoop.ipc.Client.call(Client.java:1439)
    ... 12 more
2015-10-13 15:39:43,938 INFO [Timer-1] app.DAGAppMaster: Session timed out, lastDAGCompletionTime=1444746866023 ms, sessionTimeoutInterval=300000 ms
2015-10-13 15:39:43,938 INFO [Timer-1] rm.TaskSchedulerEventHandler: TaskScheduler notified that it should unregister from RM
2015-10-13 15:39:43,938 INFO [Timer-1] app.DAGAppMaster: No current running DAG, shutting down the AM
2015-10-13 15:39:43,938 INFO [Timer-1] app.DAGAppMaster: DAGAppMasterShutdownHandler invoked
2015-10-13 15:39:43,938 INFO [Timer-1] app.DAGAppMaster: Handling DAGAppMaster shutdown
2015-10-13 15:39:43,939 INFO [AMShutdownThread] app.DAGAppMaster: Sleeping for 5 seconds before shutting down
2015-10-13 15:39:48,939 INFO [AMShutdownThread] app.DAGAppMaster: Calling stop for all the services
2015-10-13 15:39:48,940 INFO [AMShutdownThread] history.HistoryEventHandler: Stopping HistoryEventHandler
2015-10-13 15:39:48,941 INFO [AMShutdownThread] recovery.RecoveryService: Stopping RecoveryService
2015-10-13 15:39:48,941 INFO [AMShutdownThread] recovery.RecoveryService: Closing Summary Stream
2015-10-13 15:39:48,941 INFO [RecoveryEventHandlingThread] recovery.RecoveryService: EventQueue take interrupted. Returning
2015-10-13 15:39:48,951 INFO [AMShutdownThread] ats.ATSHistoryLoggingService: Stopping ATSService, eventQueueBacklog=0
2015-10-13 15:39:48,952 INFO [DelayedContainerManager] rm.YarnTaskSchedulerService: AllocatedContainerManager Thread interrupted
2015-10-13 15:39:48,954 INFO [AMShutdownThread] rm.YarnTaskSchedulerService: Unregistering application from RM, exitStatus=SUCCEEDED, exitMessage=Session stats:submittedDAGs=1, successfulDAGs=1, failedDAGs=0, killedDAGs=0
, trackingURL=
2015-10-13 15:40:10,689 INFO [AMRM Heartbeater thread] client.ConfiguredRMFailoverProxyProvider: Failing over to rm1
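
For anyone reading the "Session timed out" line above, here is a small sketch 
that converts lastDAGCompletionTime to a readable timestamp and adds 
sessionTimeoutInterval to it. Both numbers are copied from the log. The UTC+1 
offset used for the log's own timestamps is an assumption (the cluster's time 
zone is not stated anywhere in this thread), and the helper name `ms_to_utc` 
is only illustrative.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def ms_to_utc(ms: int) -> datetime:
    """Convert epoch milliseconds to an aware UTC datetime (no float rounding)."""
    return EPOCH + timedelta(milliseconds=ms)


# Values copied from the "Session timed out" log line.
last_dag_completion_ms = 1444746866023
session_timeout_ms = 300000

completed = ms_to_utc(last_dag_completion_ms)
expiry = ms_to_utc(last_dag_completion_ms + session_timeout_ms)

# Timestamp of the "Session timed out" log line, ASSUMING the log is in UTC+1.
assumed_log_tz = timezone(timedelta(hours=1))
observed = datetime(2015, 10, 13, 15, 39, 43, 938000, tzinfo=assumed_log_tz)

print("last DAG completed :", completed.isoformat())  # 2015-10-13T14:34:26.023000+00:00
print("session expires at :", expiry.isoformat())     # 2015-10-13T14:39:26.023000+00:00
print("timeout logged after expiry by",
      (observed - expiry).total_seconds(), "s")       # 17.915 under the UTC+1 assumption
```

Under that time-zone assumption, the AM shut down about 18 seconds after the 
5-minute idle window following the last DAG had passed, which would suggest a 
normal session timeout rather than a shutdown triggered by the failed 
heartbeats.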

As I said, I've got YARN HA installed and the Active/Standby ResourceManagers 
switched over last week.



Thanks,
Dale
On 19 Oct 2015, at 13:07, Jianfeng (Jeff) Zhang <[email protected]> wrote:

application_1444742140034_0009


