[ 
https://issues.apache.org/jira/browse/YARN-2459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14113067#comment-14113067
 ] 

Karthik Kambatla commented on YARN-2459:
----------------------------------------

Stack Trace from Mayank: 
{noformat}
2014-08-24 18:43:04,603 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: Skipping scheduling since node phxaishdc9dn0360.phx.ebay.com:58458 is reserved by application appattempt_1408727267637_12984_000001
2014-08-24 18:43:04,613 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: Trying to fulfill reservation for application application_1408727267637_12984 on node: phxaishdc9dn0816.phx.ebay.com:50443
2014-08-24 18:43:04,613 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt: Application application_1408727267637_12984 reserved container container_1408727267637_12984_01_003215 on node host: phxaishdc9dn0816.phx.ebay.com:50443 #containers=17 available=4224 used=63360, currently has 310 at priority 10; currentReservation 2618880
2014-08-24 18:43:04,613 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerNode: Updated reserved container container_1408727267637_12984_01_003215 on node host: phxaishdc9dn0816.phx.ebay.com:50443 #containers=17 available=4224 used=63360 for application org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp@2da03710
2014-08-24 18:43:04,613 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: Reserved container application=application_1408727267637_12984 resource=<memory:8448, vCores:1> queue=hdmi-set: capacity=0.2, absoluteCapacity=0.2, usedResources=<memory:34293248, vCores:7092>usedCapacity=1.4031365, absoluteUsedCapacity=0.28062728, numApps=12, numContainers=7092 usedCapacity=1.4031365 absoluteUsedCapacity=0.28062728 used=<memory:34293248, vCores:7092> cluster=<memory:122202112, vCores:14584>
2014-08-24 18:43:04,613 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: Skipping scheduling since node phxaishdc9dn0816.phx.ebay.com:50443 is reserved by application appattempt_1408727267637_12984_000001
2014-08-24 18:43:04,614 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received a org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type STATE_STORE_OP_FAILED. Cause:
org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:111)
    at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:945)
    at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:911)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:852)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:849)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:948)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:967)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doMultiWithRetries(ZKRMStateStore.java:849)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.removeApplicationStateInternal(ZKRMStateStore.java:642)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$RemoveAppTransition.transition(RMStateStore.java:181)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$RemoveAppTransition.transition(RMStateStore.java:167)
    at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362)
    at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
    at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
    at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:766)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:837)
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:832)
    at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173)
    at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106)
    at java.lang.Thread.run(Thread.java:745)

2014-08-24 18:43:04,647 INFO org.apache.hadoop.util.ExitUtil: Exiting with status 1
2014-08-24 18:43:04,732 INFO org.mortbay.log: Stopped sslsocketconnec...@apollo-phx-rm-1.vip.ebay.com:50030
2014-08-24 18:43:04,847 INFO org.apache.hadoop.ipc.Server: Stopping server on 8033
2014-08-24 18:43:04,847 INFO org.apache.hadoop.ha.ActiveStandbyElector: Yielding from election
2014-08-24 18:43:04,847 INFO org.apache.hadoop.ipc.Server: Stopping IPC Server Responder
2014-08-24 18:43:04,847 INFO org.apache.hadoop.ha.ActiveStandbyElector: Deleting bread-crumb of active node...
2014-08-24 18:43:04,847 INFO org.apache.hadoop.ipc.Server: Stopping IPC Server listener on 8033
2014-08-24 18:43:04,860 INFO org.apache.zookeeper.ZooKeeper: Session: 0x247d490810c69e8 closed
2014-08-24 18:43:04,860 WARN org.apache.hadoop.ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0x247d490810c69e8
2014-08-24 18:43:04,861 INFO org.apache.zookeeper.ClientCnxn: EventThread shut down
2014-08-24 18:43:10,376 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: STARTUP_MSG:
{noformat} 

> RM crashes if App gets rejected for any reason and HA is enabled
> ----------------------------------------------------------------
>
>                 Key: YARN-2459
>                 URL: https://issues.apache.org/jira/browse/YARN-2459
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 2.4.1
>            Reporter: Mayank Bansal
>            Assignee: Mayank Bansal
>         Attachments: YARN-2459-1.patch
>
>
> If RM HA is enabled and the ZooKeeper store is used as the RM state store:
> If an app is rejected for any reason and goes directly from NEW to FAILED,
> the final transition adds it to the RMApps and completed-apps in-memory
> structures, but it never makes it to the state store.
> When the RMApps default limit is reached, the RM starts deleting apps from
> memory and the store. It then tries to delete this app from the store and
> fails, which causes the RM to crash.
> Thanks,
> Mayank
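
The sequence in the description can be sketched as a small, self-contained model. Everything here is hypothetical and simplified for illustration: `StateStore`, `AppManager`, `maxCompleted`, and the local `NoNodeException` are stand-ins, not the actual YARN or ZooKeeper classes. The `tolerateMissing` flag shows one possible way to avoid the crash; it is not necessarily what YARN-2459-1.patch does.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical, simplified model of the failure; not the actual YARN classes. */
public class RejectedAppRemovalSketch {

    /** Local stand-in for ZooKeeper's KeeperException.NoNodeException. */
    static class NoNodeException extends RuntimeException {
        NoNodeException(String path) {
            super("NoNode: " + path);
        }
    }

    /** Stand-in for the ZooKeeper-backed store: removing a missing entry fails. */
    static class StateStore {
        private final Set<String> nodes = new HashSet<>();

        void store(String appId) {
            nodes.add(appId);
        }

        void remove(String appId) {
            if (!nodes.remove(appId)) {
                throw new NoNodeException(appId);
            }
        }

        void removeIfPresent(String appId) {
            nodes.remove(appId);
        }
    }

    /** Stand-in for the RM's bookkeeping of completed applications. */
    static class AppManager {
        private final StateStore store;
        private final int maxCompleted;        // assumed limit on completed apps kept in memory
        private final boolean tolerateMissing; // assumed mitigation switch
        private final Deque<String> completed = new ArrayDeque<>();

        AppManager(StateStore store, int maxCompleted, boolean tolerateMissing) {
            this.store = store;
            this.maxCompleted = maxCompleted;
            this.tolerateMissing = tolerateMissing;
        }

        /** Normal path: the app is persisted before it completes. */
        void finishAccepted(String appId) {
            store.store(appId);
            complete(appId);
        }

        /** Rejected path (NEW to FAILED): tracked in memory but never persisted. */
        void finishRejected(String appId) {
            complete(appId);
        }

        private void complete(String appId) {
            completed.addLast(appId);
            // Once the limit is exceeded, evict the oldest apps from memory and the store.
            while (completed.size() > maxCompleted) {
                String evicted = completed.removeFirst();
                if (tolerateMissing) {
                    store.removeIfPresent(evicted);
                } else {
                    store.remove(evicted);
                }
            }
        }
    }

    /** Returns true if evicting the rejected app raises the NoNode failure. */
    static boolean crashes(boolean tolerateMissing) {
        AppManager rm = new AppManager(new StateStore(), 1, tolerateMissing);
        try {
            rm.finishRejected("app_rejected");
            rm.finishAccepted("app_ok"); // pushes app_rejected over the limit
            return false;
        } catch (NoNodeException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("strict removal crashes: " + crashes(false));
        System.out.println("tolerant removal crashes: " + crashes(true));
    }
}
```

In this model the strict removal path fails as soon as the rejected app is evicted, which mirrors the NoNode error under removeApplicationStateInternal in the trace above. An alternative to tolerating the missing entry would be to persist rejected apps as well, so that the store and the in-memory structures stay consistent.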



--
This message was sent by Atlassian JIRA
(v6.2#6252)
