[ https://issues.apache.org/jira/browse/YARN-2459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14113067#comment-14113067 ]
Karthik Kambatla commented on YARN-2459:
----------------------------------------

Stack Trace from Mayank:
{noformat}
2014-08-24 18:43:04,603 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: Skipping scheduling since node phxaishdc9dn0360.phx.ebay.com:58458 is reserved by application appattempt_1408727267637_12984_000001
2014-08-24 18:43:04,613 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: Trying to fulfill reservation for application application_1408727267637_12984 on node: phxaishdc9dn0816.phx.ebay.com:50443
2014-08-24 18:43:04,613 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt: Application application_1408727267637_12984 reserved container container_1408727267637_12984_01_003215 on node host: phxaishdc9dn0816.phx.ebay.com:50443 #containers=17 available=4224 used=63360, currently has 310 at priority 10; currentReservation 2618880
2014-08-24 18:43:04,613 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerNode: Updated reserved container container_1408727267637_12984_01_003215 on node host: phxaishdc9dn0816.phx.ebay.com:50443 #containers=17 available=4224 used=63360 for application org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp@2da03710
2014-08-24 18:43:04,613 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: Reserved container application=application_1408727267637_12984 resource=<memory:8448, vCores:1> queue=hdmi-set: capacity=0.2, absoluteCapacity=0.2, usedResources=<memory:34293248, vCores:7092>usedCapacity=1.4031365, absoluteUsedCapacity=0.28062728, numApps=12, numContainers=7092 usedCapacity=1.4031365 absoluteUsedCapacity=0.28062728 used=<memory:34293248, vCores:7092> cluster=<memory:122202112, vCores:14584>
2014-08-24 18:43:04,613 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: Skipping scheduling since node phxaishdc9dn0816.phx.ebay.com:50443 is reserved by application appattempt_1408727267637_12984_000001
2014-08-24 18:43:04,614 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received a org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type STATE_STORE_OP_FAILED. Cause:
org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:111)
        at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:945)
        at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:911)
        at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:852)
        at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:849)
        at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:948)
        at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:967)
        at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doMultiWithRetries(ZKRMStateStore.java:849)
        at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.removeApplicationStateInternal(ZKRMStateStore.java:642)
        at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$RemoveAppTransition.transition(RMStateStore.java:181)
        at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$RemoveAppTransition.transition(RMStateStore.java:167)
        at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362)
        at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
        at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
        at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
        at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:766)
        at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:837)
        at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:832)
        at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173)
        at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106)
        at java.lang.Thread.run(Thread.java:745)
2014-08-24 18:43:04,647 INFO org.apache.hadoop.util.ExitUtil: Exiting with status 1
2014-08-24 18:43:04,732 INFO org.mortbay.log: Stopped sslsocketconnec...@apollo-phx-rm-1.vip.ebay.com:50030
2014-08-24 18:43:04,847 INFO org.apache.hadoop.ipc.Server: Stopping server on 8033
2014-08-24 18:43:04,847 INFO org.apache.hadoop.ha.ActiveStandbyElector: Yielding from election
2014-08-24 18:43:04,847 INFO org.apache.hadoop.ipc.Server: Stopping IPC Server Responder
2014-08-24 18:43:04,847 INFO org.apache.hadoop.ha.ActiveStandbyElector: Deleting bread-crumb of active node...
2014-08-24 18:43:04,847 INFO org.apache.hadoop.ipc.Server: Stopping IPC Server listener on 8033
2014-08-24 18:43:04,860 INFO org.apache.zookeeper.ZooKeeper: Session: 0x247d490810c69e8 closed
2014-08-24 18:43:04,860 WARN org.apache.hadoop.ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0x247d490810c69e8
2014-08-24 18:43:04,861 INFO org.apache.zookeeper.ClientCnxn: EventThread shut down
2014-08-24 18:43:10,376 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: STARTUP_MSG:
{noformat}

> RM crashes if App gets rejected for any reason and HA is enabled
> ----------------------------------------------------------------
>
>                 Key: YARN-2459
>                 URL: https://issues.apache.org/jira/browse/YARN-2459
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 2.4.1
>            Reporter: Mayank Bansal
>            Assignee: Mayank Bansal
>         Attachments: YARN-2459-1.patch
>
>
> This happens when RM HA is enabled and the ZooKeeper store is used as the RM
> state store.
> If an app is rejected for any reason and goes directly from NEW to FAILED,
> the final transition adds it to the RMApps and completed-apps in-memory
> structures, but it never makes it to the state store.
> When the completed-apps default limit is reached, the RM starts deleting apps
> from memory and from the store. It then tries to delete this app from the
> store and fails, which causes the RM to crash.
> Thanks,
> Mayank

--
This message was sent by Atlassian JIRA
(v6.2#6252)
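
The failure sequence described in the issue can be illustrated with a small, self-contained Java model. This is only a sketch: `ToyStateStore`, `ToyAppManager`, and every method name below are hypothetical stand-ins, not the actual `ZKRMStateStore` or ResourceManager classes, and the "tolerant" delete is just one possible way to avoid the crash; the attached patch may take a different approach.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.Set;

/** Toy model of the failure mode; NOT the actual ResourceManager code. */
public class RejectedAppSketch {

    /** Hypothetical stand-in for the ZooKeeper-backed state store. */
    static class ToyStateStore {
        private final Set<String> znodes = new HashSet<>();

        void store(String appId) {
            znodes.add(appId);
        }

        /** Strict delete: fails when the node is missing (models NoNode). */
        void removeStrict(String appId) {
            if (!znodes.remove(appId)) {
                throw new IllegalStateException("NoNode: " + appId);
            }
        }

        /** Tolerant delete: a missing node is treated as already removed. */
        void removeTolerant(String appId) {
            znodes.remove(appId);
        }
    }

    /** Hypothetical stand-in for the RM's completed-apps bookkeeping. */
    static class ToyAppManager {
        final Deque<String> completedApps = new ArrayDeque<>();
        final ToyStateStore store = new ToyStateStore();
        final int maxCompletedApps;
        final boolean tolerant;

        ToyAppManager(int maxCompletedApps, boolean tolerant) {
            this.maxCompletedApps = maxCompletedApps;
            this.tolerant = tolerant;
        }

        /** Accepted app: present in the store, then tracked as completed. */
        void finishAcceptedApp(String appId) {
            store.store(appId);
            completedApps.addLast(appId);
            evictIfNeeded();
        }

        /** Rejected app (NEW -> FAILED): tracked in memory, never stored. */
        void finishRejectedApp(String appId) {
            completedApps.addLast(appId);
            evictIfNeeded();
        }

        /** Past the limit, drop the oldest app from memory and the store. */
        private void evictIfNeeded() {
            while (completedApps.size() > maxCompletedApps) {
                String oldest = completedApps.removeFirst();
                if (tolerant) {
                    store.removeTolerant(oldest);
                } else {
                    store.removeStrict(oldest);
                }
            }
        }
    }

    /** Returns true if evicting the rejected app raised the NoNode-style error. */
    static boolean crashes(boolean tolerant) {
        ToyAppManager rm = new ToyAppManager(1, tolerant);
        try {
            rm.finishRejectedApp("app_rejected");
            rm.finishAcceptedApp("app_ok"); // pushes app_rejected past the limit
            return false;
        } catch (IllegalStateException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("strict delete crashes:   " + crashes(false));
        System.out.println("tolerant delete crashes: " + crashes(true));
    }
}
```

In this model the strict variant fails as soon as the rejected app is evicted, which mirrors the `NoNodeException` raised under `removeApplicationStateInternal` in the stack trace, while the tolerant variant completes the eviction.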