Hi Jeff,
The RM is fairly stable and reliable. As I said, the command works when I run it
through Beeline, just not in Hue.
Resource manager log snippet:
2015-10-19 09:51:51,811 INFO ipc.Server (Server.java:run(2060)) - IPC Server handler 32 on 8032, call org.apache.hadoop.yarn.api.ApplicationClientProtocolPB.getApplicationReport from 10.10.7.223:33554 Call#625674 Retry#0
org.apache.hadoop.yarn.exceptions.ApplicationNotFoundException: Application with id 'application_1444742140034_0009' doesn't exist in RM.
    at org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getApplicationReport(ClientRMService.java:324)
    at org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getApplicationReport(ApplicationClientProtocolPBServiceImpl.java:170)
    at org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:401)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:619)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:962)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2039)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2035)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2033)
2015-10-19 09:52:46,900 INFO ipc.Server (Server.java:run(2060)) - IPC Server handler 25 on 8032, call org.apache.hadoop.yarn.api.ApplicationClientProtocolPB.getApplicationReport from 10.10.7.223:33599 Call#625743 Retry#0
org.apache.hadoop.yarn.exceptions.ApplicationNotFoundException: Application with id 'application_1444742140034_0009' doesn't exist in RM.
    at org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getApplicationReport(ClientRMService.java:324)
    at org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getApplicationReport(ApplicationClientProtocolPBServiceImpl.java:170)
    at org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:401)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:619)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:962)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2039)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2035)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2033)
2015-10-19 09:53:45,479 INFO ipc.Server (Server.java:run(2060)) - IPC Server handler 10 on 8032, call org.apache.hadoop.yarn.api.ApplicationClientProtocolPB.getApplicationReport from 10.10.7.223:33651 Call#625830 Retry#0
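The application ID in the RM log carries its own timestamp. YARN application IDs have the form application_<clusterTimestamp>_<sequence>; to my understanding, the cluster timestamp is the start time, in milliseconds since the epoch, of the RM instance that issued the ID. A minimal Python sketch, assuming that format (parse_application_id and cluster_start_utc are hypothetical helpers, not YARN APIs):

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def parse_application_id(app_id: str) -> tuple[int, int]:
    """Split 'application_<clusterTimestamp>_<sequence>' into its two numbers."""
    prefix, cluster_ts, seq = app_id.split("_")
    if prefix != "application":
        raise ValueError(f"not a YARN application id: {app_id!r}")
    return int(cluster_ts), int(seq)


def cluster_start_utc(app_id: str) -> datetime:
    """Read the cluster timestamp (ms since the epoch) as a UTC datetime."""
    cluster_ts_ms, _ = parse_application_id(app_id)
    # Integer timedelta arithmetic avoids float rounding of the milliseconds.
    return EPOCH + timedelta(milliseconds=cluster_ts_ms)


if __name__ == "__main__":
    app_id = "application_1444742140034_0009"
    print(parse_application_id(app_id))  # (1444742140034, 9)
    print(cluster_start_utc(app_id))     # 2015-10-13 13:15:40.034000+00:00
```

Here the ID decodes to 2015-10-13 13:15:40 UTC, sequence 9. This does not explain the ApplicationNotFoundException by itself, but it gives a date to compare against when the active and standby RMs switched over.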
Thanks,
Dale
On 19 Oct 2015, at 13:46, Jianfeng (Jeff) Zhang <[email protected]> wrote:
Hi Dale,
Does it happen frequently? Does the RM work normally (can it still accept new
jobs) when this happens?
From the logs, it seems the AM hits errors when heartbeating with the RM, and it
switches between the 2 RMs for a long time. It might be an RM issue; could you
check the RM logs?
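To see how often and for how long the AM switched between the two RMs, the AM log can be scanned for the ConfiguredRMFailoverProxyProvider "Failing over to" records. A minimal Python sketch, assuming the raw (unwrapped) AM log with one record per line and the timestamp format shown in this thread; failover_events is a hypothetical helper, not part of Hadoop or Tez:

```python
import re
from datetime import datetime

# One AM log record per line, e.g.
# "2015-10-13 15:39:32,868 INFO [AMRM Heartbeater thread] client.ConfiguredRMFailoverProxyProvider: Failing over to rm2"
FAILOVER_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) .*Failing over to (\S+)"
)


def failover_events(lines):
    """Return (timestamp, target RM id) for each 'Failing over to' record."""
    events = []
    for line in lines:
        match = FAILOVER_RE.search(line)
        if match:
            ts = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S,%f")
            events.append((ts, match.group(2)))
    return events


if __name__ == "__main__":
    sample = [
        "2015-10-13 15:39:32,868 INFO [AMRM Heartbeater thread] "
        "client.ConfiguredRMFailoverProxyProvider: Failing over to rm2",
        "2015-10-13 15:39:43,938 INFO [Timer-1] app.DAGAppMaster: "
        "No current running DAG, shutting down the AM",
        "2015-10-13 15:40:10,689 INFO [AMRM Heartbeater thread] "
        "client.ConfiguredRMFailoverProxyProvider: Failing over to rm1",
    ]
    events = failover_events(sample)
    for ts, rm in events:
        print(ts, "->", rm)
    span = events[-1][0] - events[0][0]
    print(f"{len(events)} failovers over {span.total_seconds():.3f} s")
```

Passing an open file of the full AM log instead of the sample list gives the complete failover history for that application attempt.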
Best regards,
Jeff Zhang
From: "Bradman, Dale" <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Monday, October 19, 2015 at 8:35 PM
To: "[email protected]" <[email protected]>
Subject: Re: Tez Code 1 & Tez Code 2
I have attached the logs, but the crux of it is:
2015-10-13 15:39:32,866 INFO [AMRM Heartbeater thread] retry.RetryInvocationHandler: Exception while invoking allocate of class ApplicationMasterProtocolPBClientImpl over rm1. Trying to fail over immediately.
java.net.ConnectException: Call From EU-LAMP-PROD-M-0068-HADOOP-SLAVE02/10.10.7.125 to eu-lamp-prod-xl-0065-hadoop-sec-master:8030 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
    at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:791)
    at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:731)
    at org.apache.hadoop.ipc.Client.call(Client.java:1473)
    at org.apache.hadoop.ipc.Client.call(Client.java:1400)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
    at com.sun.proxy.$Proxy39.allocate(Unknown Source)
    at org.apache.hadoop.yarn.api.impl.pb.client.ApplicationMasterProtocolPBClientImpl.allocate(ApplicationMasterProtocolPBClientImpl.java:77)
    at sun.reflect.GeneratedMethodAccessor5.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
    at com.sun.proxy.$Proxy40.allocate(Unknown Source)
    at org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl.allocate(AMRMClientImpl.java:278)
    at org.apache.hadoop.yarn.client.api.async.impl.AMRMClientAsyncImpl$HeartbeatThread.run(AMRMClientAsyncImpl.java:224)
Caused by: java.net.ConnectException: Connection refused
    at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
    at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739)
    at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
    at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:530)
    at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:494)
    at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:608)
    at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:706)
    at org.apache.hadoop.ipc.Client$Connection.access$2800(Client.java:369)
    at org.apache.hadoop.ipc.Client.getConnection(Client.java:1522)
    at org.apache.hadoop.ipc.Client.call(Client.java:1439)
    ... 12 more
2015-10-13 15:39:32,868 INFO [AMRM Heartbeater thread] client.ConfiguredRMFailoverProxyProvider: Failing over to rm2
2015-10-13 15:39:32,871 INFO [AMRM Heartbeater thread] retry.RetryInvocationHandler: Exception while invoking allocate of class ApplicationMasterProtocolPBClientImpl over rm2 after 1 fail over attempts. Trying to fail over after sleeping for 37818ms.
java.net.ConnectException: Call From EU-LAMP-PROD-M-0068-HADOOP-SLAVE02/10.10.7.125 to eu-lamp-prod-xl-0064-hadoop-master:8030 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
    at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:791)
    at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:731)
    at org.apache.hadoop.ipc.Client.call(Client.java:1473)
    at org.apache.hadoop.ipc.Client.call(Client.java:1400)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
    at com.sun.proxy.$Proxy39.allocate(Unknown Source)
    at org.apache.hadoop.yarn.api.impl.pb.client.ApplicationMasterProtocolPBClientImpl.allocate(ApplicationMasterProtocolPBClientImpl.java:77)
    at sun.reflect.GeneratedMethodAccessor5.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
    at com.sun.proxy.$Proxy40.allocate(Unknown Source)
    at org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl.allocate(AMRMClientImpl.java:278)
    at org.apache.hadoop.yarn.client.api.async.impl.AMRMClientAsyncImpl$HeartbeatThread.run(AMRMClientAsyncImpl.java:224)
Caused by: java.net.ConnectException: Connection refused
    at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
    at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739)
    at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
    at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:530)
    at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:494)
    at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:608)
    at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:706)
    at org.apache.hadoop.ipc.Client$Connection.access$2800(Client.java:369)
    at org.apache.hadoop.ipc.Client.getConnection(Client.java:1522)
    at org.apache.hadoop.ipc.Client.call(Client.java:1439)
    ... 12 more
2015-10-13 15:39:43,938 INFO [Timer-1] app.DAGAppMaster: Session timed out, lastDAGCompletionTime=1444746866023 ms, sessionTimeoutInterval=300000 ms
2015-10-13 15:39:43,938 INFO [Timer-1] rm.TaskSchedulerEventHandler: TaskScheduler notified that it should unregister from RM
2015-10-13 15:39:43,938 INFO [Timer-1] app.DAGAppMaster: No current running DAG, shutting down the AM
2015-10-13 15:39:43,938 INFO [Timer-1] app.DAGAppMaster: DAGAppMasterShutdownHandler invoked
2015-10-13 15:39:43,938 INFO [Timer-1] app.DAGAppMaster: Handling DAGAppMaster shutdown
2015-10-13 15:39:43,939 INFO [AMShutdownThread] app.DAGAppMaster: Sleeping for 5 seconds before shutting down
2015-10-13 15:39:48,939 INFO [AMShutdownThread] app.DAGAppMaster: Calling stop for all the services
2015-10-13 15:39:48,940 INFO [AMShutdownThread] history.HistoryEventHandler: Stopping HistoryEventHandler
2015-10-13 15:39:48,941 INFO [AMShutdownThread] recovery.RecoveryService: Stopping RecoveryService
2015-10-13 15:39:48,941 INFO [AMShutdownThread] recovery.RecoveryService: Closing Summary Stream
2015-10-13 15:39:48,941 INFO [RecoveryEventHandlingThread] recovery.RecoveryService: EventQueue take interrupted. Returning
2015-10-13 15:39:48,951 INFO [AMShutdownThread] ats.ATSHistoryLoggingService: Stopping ATSService, eventQueueBacklog=0
2015-10-13 15:39:48,952 INFO [DelayedContainerManager] rm.YarnTaskSchedulerService: AllocatedContainerManager Thread interrupted
2015-10-13 15:39:48,954 INFO [AMShutdownThread] rm.YarnTaskSchedulerService: Unregistering application from RM, exitStatus=SUCCEEDED, exitMessage=Session stats:submittedDAGs=1, successfulDAGs=1, failedDAGs=0, killedDAGs=0
, trackingURL=
2015-10-13 15:40:10,689 INFO [AMRM Heartbeater thread] client.ConfiguredRMFailoverProxyProvider: Failing over to rm1
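The shutdown in this log appears to be the normal Tez session timeout rather than a crash: the AM logs "Session timed out" with no running DAG, then unregisters with exitStatus=SUCCEEDED. The check amounts to comparing the time since lastDAGCompletionTime against sessionTimeoutInterval (300000 ms). I believe that interval comes from tez.session.am.dag.submit.timeout.secs (default 300), but treat that as an assumption. A small Python sketch of the arithmetic with the logged values; session_timed_out and to_epoch_ms are hypothetical helpers, and reading the 15:39:43 log time as UTC+1 is an assumption about the cluster's timezone:

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_epoch_ms(dt: datetime) -> int:
    """Exact milliseconds since the Unix epoch (integer arithmetic, no float rounding)."""
    return (dt - EPOCH) // timedelta(milliseconds=1)


def session_timed_out(now_ms: int, last_dag_completion_ms: int, timeout_ms: int) -> bool:
    """True once the session AM has been idle longer than the session timeout."""
    return now_ms - last_dag_completion_ms > timeout_ms


# Values copied from the DAGAppMaster "Session timed out" log line.
LAST_DAG_COMPLETION_MS = 1444746866023
SESSION_TIMEOUT_MS = 300000  # 5 minutes

# Assumption: the log timestamp 2015-10-13 15:39:43,938 is UK summer time (UTC+1).
BST = timezone(timedelta(hours=1))
now_ms = to_epoch_ms(datetime(2015, 10, 13, 15, 39, 43, 938000, tzinfo=BST))

if __name__ == "__main__":
    idle_ms = now_ms - LAST_DAG_COMPLETION_MS
    print("idle for", idle_ms, "ms")
    print("timed out:", session_timed_out(now_ms, LAST_DAG_COMPLETION_MS, SESSION_TIMEOUT_MS))
    print("earliest timeout:", EPOCH + timedelta(milliseconds=LAST_DAG_COMPLETION_MS + SESSION_TIMEOUT_MS))
```

Under that assumption the AM had been idle for 317915 ms (about 5 minutes 18 seconds) when the timer fired, just past the 5-minute limit.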
As I said, I’ve got YARN HA installed, and the active/standby ResourceManagers
switched over last week.
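With YARN HA, the alternation rm1, rm2, rm1 in the AM log follows from how the client fails over. As far as I know, ConfiguredRMFailoverProxyProvider cycles through the ids in yarn.resourcemanager.ha.rm-ids in order, and Hadoop's failover retry policy fails over immediately the first time and afterwards sleeps for a randomized exponential backoff (base delay doubled per failover, capped, then scaled by a random factor in [0.5, 1.5)). The Python sketch below models both; it is not Hadoop code, and the 30000 ms base and cap are assumptions (I believe they default to yarn.resourcemanager.connect.retry-interval.ms unless yarn.client.failover-sleep-base-ms and yarn.client.failover-sleep-max-ms are set). The rm-id-to-host mapping is read off the connection errors above, not from yarn-site.xml.

```python
import random

# Hypothetical mapping, read off the "Call From ... to <host>:8030" errors in this thread.
RM_SCHEDULER_ADDRESSES = {
    "rm1": "eu-lamp-prod-xl-0065-hadoop-sec-master:8030",
    "rm2": "eu-lamp-prod-xl-0064-hadoop-master:8030",
}


class FailoverSketch:
    """Round-robin over RM ids with randomized exponential backoff (assumed model)."""

    def __init__(self, rm_ids, base_ms=30_000, max_ms=30_000, seed=None):
        self.rm_ids = list(rm_ids)
        self.index = 0            # start with the first configured id, e.g. rm1
        self.failovers = 0        # failovers performed so far
        self.base_ms = base_ms    # assumed default, see lead-in
        self.max_ms = max_ms      # assumed default, see lead-in
        self.rng = random.Random(seed)

    def current(self) -> str:
        return self.rm_ids[self.index]

    def next_sleep_ms(self) -> int:
        """Delay before the next failover: 0 for the first, jittered backoff afterwards."""
        if self.failovers == 0:
            return 0
        capped = min(self.base_ms * (1 << self.failovers), self.max_ms)
        # Jitter by a factor in [0.5, 1.5) so many clients do not retry in lockstep.
        return int(capped * (0.5 + self.rng.random()))

    def fail_over(self) -> tuple[str, int]:
        """Sleep (modeled), then advance to the next RM id, wrapping around."""
        sleep_ms = self.next_sleep_ms()
        self.index = (self.index + 1) % len(self.rm_ids)
        self.failovers += 1
        return self.current(), sleep_ms


if __name__ == "__main__":
    sketch = FailoverSketch(["rm1", "rm2"], seed=1)
    print("start on", sketch.current(), RM_SCHEDULER_ADDRESSES[sketch.current()])
    for _ in range(3):
        rm, slept = sketch.fail_over()
        print(f"slept {slept} ms, failing over to {rm} ({RM_SCHEDULER_ADDRESSES[rm]})")
```

Under these assumptions the second failover sleeps between 15000 and 45000 ms, which is consistent with the 37818 ms in the log and with "Failing over to rm1" appearing about 38 seconds after "Failing over to rm2". As far as I know, a standby RM does not serve the scheduler port, so a refused connection from the standby is expected; when both ids refuse, as at 15:39:32, neither RM was accepting AM heartbeats at that moment.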
Thanks,
Dale
On 19 Oct 2015, at 13:07, Jianfeng (Jeff) Zhang <[email protected]> wrote:
application_1444742140034_0009
________________________________
Capgemini is a trading name used by the Capgemini Group of companies which
includes Capgemini UK plc, a company registered in England and Wales (number
943935) whose registered office is at No. 1, Forge End, Woking, Surrey, GU21
6DB.
This message contains information that may be privileged or confidential and is
the property of the Capgemini Group. It is intended only for the person to whom
it is addressed. If you are not the intended recipient, you are not authorized
to read, print, retain, copy, disseminate, distribute, or use this message or
any part thereof. If you receive this message in error, please notify the
sender immediately and delete all copies of this message.
________________________________