Rahul Anand created YARN-8855:
---------------------------------
Summary: Application submission fails if one of the subclusters is down.
Key: YARN-8855
URL: https://issues.apache.org/jira/browse/YARN-8855
Project: Hadoop YARN
Issue Type: Bug
Reporter: Rahul Anand
If one of the subclusters is down, application submission keeps retrying
against it and eventually fails. About 30 failover attempts appear in the
logs. The detailed exception is below.
{code:java}
2018-10-08 14:21:21,245 | INFO | NM ContainerManager dispatcher | Container
container_e03_1538297667953_0005_01_000001 transitioned from
CONTAINER_CLEANEDUP_AFTER_KILL to DONE | ContainerImpl.java:2093
2018-10-08 14:21:21,245 | INFO | NM ContainerManager dispatcher | Removing
container_e03_1538297667953_0005_01_000001 from application
application_1538297667953_0005 | ApplicationImpl.java:512
2018-10-08 14:21:21,246 | INFO | NM ContainerManager dispatcher | Stopping
resource-monitoring for container_e03_1538297667953_0005_01_000001 |
ContainersMonitorImpl.java:932
2018-10-08 14:21:21,246 | INFO | NM ContainerManager dispatcher | Considering
container container_e03_1538297667953_0005_01_000001 for log-aggregation |
AppLogAggregatorImpl.java:538
2018-10-08 14:21:21,246 | INFO | NM ContainerManager dispatcher | Got event
CONTAINER_STOP for appId application_1538297667953_0005 | AuxServices.java:350
2018-10-08 14:21:21,246 | INFO | NM ContainerManager dispatcher | Stopping
container container_e03_1538297667953_0005_01_000001 |
YarnShuffleService.java:295
2018-10-08 14:21:21,247 | WARN | NM Event dispatcher | couldn't find container
container_e03_1538297667953_0005_01_000001 while processing FINISH_CONTAINERS
event | ContainerManagerImpl.java:1660
2018-10-08 14:21:22,248 | INFO | Node Status Updater | Removed completed
containers from NM context: [container_e03_1538297667953_0005_01_000001] |
NodeStatusUpdaterImpl.java:696
2018-10-08 14:21:26,734 | INFO | pool-16-thread-1 | Failing over to the
ResourceManager for SubClusterId: cluster2 |
FederationRMFailoverProxyProvider.java:124
2018-10-08 14:21:26,735 | INFO | pool-16-thread-1 | Flushing subClusters from
cache and rehydrating from store, most likely on account of RM failover. |
FederationStateStoreFacade.java:258
2018-10-08 14:21:26,738 | INFO | pool-16-thread-1 | Connecting to
/192.168.0.25:8032 subClusterId cluster2 with protocol
ApplicationClientProtocol as user root (auth:SIMPLE) |
FederationRMFailoverProxyProvider.java:145
2018-10-08 14:21:26,741 | INFO | pool-16-thread-1 | java.net.ConnectException:
Call From node-core-jIKcN/192.168.0.64 to node-master1-IYTxR:8032 failed on
connection exception: java.net.ConnectException: Connection refused; For more
details see: http://wiki.apache.org/hadoop/ConnectionRefused, while invoking
ApplicationClientProtocolPBClientImpl.submitApplication over cluster2 after 28
failover attempts. Trying to failover after sleeping for 15261ms. |
RetryInvocationHandler.java:411
2018-10-08 14:21:42,002 | INFO | pool-16-thread-1 | Failing over to the
ResourceManager for SubClusterId: cluster2 |
FederationRMFailoverProxyProvider.java:124
2018-10-08 14:21:42,003 | INFO | pool-16-thread-1 | Flushing subClusters from
cache and rehydrating from store, most likely on account of RM failover. |
FederationStateStoreFacade.java:258
2018-10-08 14:21:42,005 | INFO | pool-16-thread-1 | Connecting to
/192.168.0.25:8032 subClusterId cluster2 with protocol
ApplicationClientProtocol as user root (auth:SIMPLE) |
FederationRMFailoverProxyProvider.java:145
2018-10-08 14:21:42,007 | INFO | pool-16-thread-1 | java.net.ConnectException:
Call From node-core-jIKcN/192.168.0.64 to node-master1-IYTxR:8032 failed on
connection exception: java.net.ConnectException: Connection refused; For more
details see: http://wiki.apache.org/hadoop/ConnectionRefused, while invoking
ApplicationClientProtocolPBClientImpl.submitApplication over cluster2 after 29
failover attempts. Trying to failover after sleeping for 21175ms. |
RetryInvocationHandler.java:411
2018-10-08 14:22:03,183 | INFO | pool-16-thread-1 | Failing over to the
ResourceManager for SubClusterId: cluster2 |
FederationRMFailoverProxyProvider.java:124
2018-10-08 14:22:03,183 | INFO | pool-16-thread-1 | Flushing subClusters from
cache and rehydrating from store, most likely on account of RM failover. |
FederationStateStoreFacade.java:258
2018-10-08 14:22:03,186 | INFO | pool-16-thread-1 | Connecting to
/192.168.0.25:8032 subClusterId cluster2 with protocol
ApplicationClientProtocol as user root (auth:SIMPLE) |
FederationRMFailoverProxyProvider.java:145
2018-10-08 14:22:03,189 | ERROR | pool-16-thread-1 | Failed to register
application master: cluster2 Application: appattempt_1538297667953_0005_000001
| FederationInterceptor.java:1106
java.net.ConnectException: Call From node-core-jIKcN/192.168.0.64 to
node-master1-IYTxR:8032 failed on connection exception:
java.net.ConnectException: Connection refused; For more details see:
http://wiki.apache.org/hadoop/ConnectionRefused
    at sun.reflect.GeneratedConstructorAccessor59.newInstance(Unknown Source)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
    at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:831)
    at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:755)
    at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1517)
    at org.apache.hadoop.ipc.Client.call(Client.java:1459)
{code}
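For anyone who wants to confirm the retry count from their own NodeManager log, a minimal sketch like the one below could extract the attempt number and sleep interval from the RetryInvocationHandler messages. The class name FailoverLogScan and the regular expression are illustrative assumptions based only on the message text shown above. They are not part of Hadoop.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Hadoop: scans log text for the
// "after N failover attempts ... sleeping for Mms" messages quoted in this issue.
public class FailoverLogScan {

    // \s+ between words tolerates the line wrapping seen in the pasted log.
    private static final Pattern RETRY = Pattern.compile(
        "after\\s+(\\d+)\\s+failover\\s+attempts\\.\\s+Trying\\s+to\\s+failover"
        + "\\s+after\\s+sleeping\\s+for\\s+(\\d+)ms");

    /** Returns one {attempt, sleepMs} pair per retry message found in the text. */
    public static List<long[]> scan(String log) {
        List<long[]> out = new ArrayList<>();
        Matcher m = RETRY.matcher(log);
        while (m.find()) {
            out.add(new long[] {Long.parseLong(m.group(1)), Long.parseLong(m.group(2))});
        }
        return out;
    }

    public static void main(String[] args) {
        // Two excerpts copied from the log above, including the wrapped line breaks.
        String log =
            "ApplicationClientProtocolPBClientImpl.submitApplication over cluster2 after 28\n"
            + "failover attempts. Trying to failover after sleeping for 15261ms. |\n"
            + "ApplicationClientProtocolPBClientImpl.submitApplication over cluster2 after 29\n"
            + "failover attempts. Trying to failover after sleeping for 21175ms. |\n";
        for (long[] r : scan(log)) {
            System.out.println("attempt=" + r[0] + " sleepMs=" + r[1]);
        }
    }
}
```

Run against the two excerpts above, this reports attempts 28 and 29 with sleeps of 15261 ms and 21175 ms, which matches the timestamps in the log.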
cc [~botong]