[ 
https://issues.apache.org/jira/browse/YARN-5006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

luhuichun reassigned YARN-5006:
-------------------------------

    Assignee: luhuichun

> ResourceManager quit due to ApplicationStateData exceed the limit  size of 
> znode in zk
> --------------------------------------------------------------------------------------
>
>                 Key: YARN-5006
>                 URL: https://issues.apache.org/jira/browse/YARN-5006
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 2.6.0, 2.7.2
>            Reporter: dongtingting
>            Assignee: luhuichun
>            Priority: Critical
>
> A client submits a job that adds 10000 files to the DistributedCache. When the 
> job is submitted, the ResourceManager stores its ApplicationStateData in ZooKeeper. 
> The ApplicationStateData exceeds the znode size limit, so the store operation 
> fails and the RM exits with status 1.
> The related code in RMStateStore.java:
> {code}
>   private static class StoreAppTransition
>       implements SingleArcTransition<RMStateStore, RMStateStoreEvent> {
>     @Override
>     public void transition(RMStateStore store, RMStateStoreEvent event) {
>       if (!(event instanceof RMStateStoreAppEvent)) {
>         // should never happen
>         LOG.error("Illegal event type: " + event.getClass());
>         return;
>       }
>       ApplicationState appState =
>           ((RMStateStoreAppEvent) event).getAppState();
>       ApplicationId appId = appState.getAppId();
>       ApplicationStateData appStateData = ApplicationStateData
>           .newInstance(appState);
>       LOG.info("Storing info for app: " + appId);
>       try {  
>         // store the appStateData
>         store.storeApplicationStateInternal(appId, appStateData);
>         store.notifyApplication(new RMAppEvent(appId,
>                RMAppEventType.APP_NEW_SAVED));
>       } catch (Exception e) {
>         LOG.error("Error storing app: " + appId, e);
>         // handle the failure event; this leads to system exit
>         store.notifyStoreOperationFailed(e);
>       }
>     };
>   }
> {code}
> The Exception log:
> {code}
>  ...
> 2016-04-20 11:26:35,732 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore 
> AsyncDispatcher event handler: Maxed out ZK retries. Giving up!
> 2016-04-20 11:26:35,732 ERROR 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore 
> AsyncDispatcher event handler: Error storing app: 
> application_1461061795989_17671
> org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode 
> = ConnectionLoss
>         at 
> org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
>         at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:931)
>         at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:911)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:936)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:933)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1075)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1096)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doMultiWithRetries(ZKRMStateStore.java:933)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doMultiWithRetries(ZKRMStateStore.java:947)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.createWithRetries(ZKRMStateStore.java:956)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.storeApplicationStateInternal(ZKRMStateStore.java:626)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$StoreAppTransition.transition(RMStateStore.java:138)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$StoreAppTransition.transition(RMStateStore.java:123)
>         at 
> org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362)
>         at 
> org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
>         at 
> org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
>         at 
> org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:806)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:860)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:855)
>         at 
> org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173)
>         at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106)
>         at java.lang.Thread.run(Thread.java:724)
>    ...
> 2016-04-20 11:26:45,613 FATAL 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager AsyncDispatcher 
> event handler: Received a 
> org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type 
> STATE_STORE_OP_FAILED. Cause:
> org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode 
> = ConnectionLoss
>         at 
> org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
>         at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:931)
>         at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:911)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:936)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:933)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1075)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1096)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doMultiWithRetries(ZKRMStateStore.java:933)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doMultiWithRetries(ZKRMStateStore.java:947)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.createWithRetries(ZKRMStateStore.java:956)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.storeApplicationStateInternal(ZKRMStateStore.java:626)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$StoreAppTransition.transition(RMStateStore.java:138)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$StoreAppTransition.transition(RMStateStore.java:123)
>         at 
> org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362)
>         at 
> org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
>         at 
> org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
>         at 
> org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:806)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:860)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:855)
>         at 
> org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173)
>         at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106)
>         at java.lang.Thread.run(Thread.java:724)
> 2016-04-20 11:26:45,615 INFO org.apache.hadoop.util.ExitUtil AsyncDispatcher 
> event handler: Exiting with status 1
> 2016-04-20 11:26:45,622 ERROR 
> org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager
>  Thread[Thread-17,5,main]: ExpiredTokenRemover received 
> java.lang.InterruptedException: sleep interrupted
> 2016-04-20 11:26:45,623 INFO org.mortbay.log Thread-1: Stopped 
> HttpServer2$SelectChannelConnectorWithSafeStartup@10.0.0.1:9088
> 2016-04-20 11:26:45,623 ERROR 
> org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager
>  Thread[Thread-21,5,main]: ExpiredTokenRemover received 
> java.lang.InterruptedException: sleep interrupted
> 2016-04-20 11:26:45,624 ERROR 
> org.apache.hadoop.security.token.delegation.AbstractDelegationTokenSecretManager
>  Thread[Thread-19,5,main]: ExpiredTokenRemover received 
> java.lang.InterruptedException: sleep interrupted
> 2016-04-20 11:26:45,724 INFO org.apache.hadoop.ipc.Server Thread-1: Stopping 
> server on 9033
> 2016-04-20 11:26:45,725 INFO org.apache.hadoop.ipc.Server IPC Server listener 
> on 9033: Stopping IPC Server listener on 9033
> 2016-04-20 11:26:45,725 INFO org.apache.hadoop.ha.ActiveStandbyElector 
> Thread-1: Yielding from election
> 2016-04-20 11:26:45,725 INFO org.apache.hadoop.ipc.Server IPC Server 
> Responder: Stopping IPC Server Responder
> 2016-04-20 11:26:45,725 INFO org.apache.hadoop.ha.ActiveStandbyElector 
> Thread-1: Deleting bread-crumb of active node...
> 2016-04-20 11:26:45,729 INFO org.apache.zookeeper.ZooKeeper Thread-1: 
> Session: 0x2504c1df9409094 closed
> 2016-04-20 11:26:45,729 WARN org.apache.hadoop.ha.ActiveStandbyElector 
> main-EventThread: Ignoring stale result from old client with sessionId 
> 0x2504c1df9409094
> 2016-04-20 11:26:45,729 INFO org.apache.zookeeper.ClientCnxn 
> main-EventThread: EventThread shut down
> {code}
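
In the log above, the oversized write only surfaces as a ConnectionLoss after the ZK retries are exhausted, and the failure is then treated as fatal. One possible mitigation is to check the serialized size before the write is attempted, so that a single oversized application is rejected instead of taking down the RM. The following is a minimal, self-contained sketch of that idea, not the actual YARN code: the class ZnodeSizeGuard, the exception StateTooLargeException, and the method checkSize are hypothetical names, and the limit assumes ZooKeeper's default jute.maxbuffer of 0xfffff bytes, which a deployment may override.

```java
// Hypothetical sketch: reject oversized application state before writing it to ZooKeeper.
final class ZnodeSizeGuard {

    // Assumed limit: ZooKeeper's default jute.maxbuffer (0xfffff bytes, just under 1 MB).
    static final int DEFAULT_ZNODE_LIMIT_BYTES = 0xfffff;

    // Hypothetical exception type signalling that the state cannot fit in one znode.
    static final class StateTooLargeException extends RuntimeException {
        StateTooLargeException(String message) {
            super(message);
        }
    }

    // Throws if the serialized state is larger than the given limit; otherwise returns normally.
    static void checkSize(String appId, byte[] serializedState, int limitBytes) {
        if (serializedState.length > limitBytes) {
            throw new StateTooLargeException(
                "State for " + appId + " is " + serializedState.length
                    + " bytes, which exceeds the znode limit of " + limitBytes + " bytes");
        }
    }

    public static void main(String[] args) {
        byte[] small = new byte[1024];
        byte[] large = new byte[DEFAULT_ZNODE_LIMIT_BYTES + 1];

        // Fits within the limit, so no exception is thrown.
        checkSize("application_small", small, DEFAULT_ZNODE_LIMIT_BYTES);
        System.out.println("small state accepted");

        try {
            checkSize("application_large", large, DEFAULT_ZNODE_LIMIT_BYTES);
        } catch (StateTooLargeException e) {
            // The caller could fail only this application here instead of exiting.
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

With a guard like this, the catch block in StoreAppTransition could distinguish a size violation from a genuine store failure and fail only the offending application, though how that rejection should be reported back to the client is a separate design question.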



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
