[ https://issues.apache.org/jira/browse/YARN-2459?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mayank Bansal updated YARN-2459:
--------------------------------

    Description: 
When RM HA is enabled and the ZooKeeper store is used as the RM state store,
an app that is rejected for any reason goes directly from NEW to FAILED.
The final transition adds it to the RMApps and completed-apps in-memory
structures, but it is never written to the state store.
When the RMApps default limit is reached, the RM starts deleting apps from
memory and from the store. It then tries to delete this app from the store,
the delete fails, and that failure causes the RM to crash.
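
To make the sequence above concrete, here is a minimal, self-contained sketch
in plain Java. It is not the RM code and not the attached patch: the class and
method names (RejectedAppRemovalSketch, StateStore, AppManager,
finishRejectedApp, removeIfPresent) are hypothetical, and an
IllegalStateException stands in for the NoNodeException seen in the trace
below. It only models the bookkeeping mismatch: a rejected app is tracked in
memory but never stored, so a strict delete fails once the completed-apps
limit evicts it.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.Set;

// Hypothetical model of the failure; none of these names are Hadoop APIs.
class RejectedAppRemovalSketch {

    // Stand-in for the ZooKeeper-backed state store.
    static class StateStore {
        private final Set<String> stored = new HashSet<>();

        void store(String appId) {
            stored.add(appId);
        }

        // Strict removal: a missing entry is an error (models NoNode).
        void remove(String appId) {
            if (!stored.remove(appId)) {
                throw new IllegalStateException("NoNode: " + appId);
            }
        }

        // Tolerant removal: a missing entry counts as already removed.
        boolean removeIfPresent(String appId) {
            return stored.remove(appId);
        }
    }

    // Stand-in for the in-memory list of completed apps with a size limit.
    static class AppManager {
        private final Deque<String> completedApps = new ArrayDeque<>();
        private final StateStore store;
        private final int maxCompletedApps;
        private final boolean tolerateMissing;

        AppManager(StateStore store, int maxCompletedApps, boolean tolerateMissing) {
            this.store = store;
            this.maxCompletedApps = maxCompletedApps;
            this.tolerateMissing = tolerateMissing;
        }

        // An accepted app is written to the store before it completes.
        void finishAcceptedApp(String appId) {
            store.store(appId);
            completedApps.addLast(appId);
            evictIfNeeded();
        }

        // A rejected app (NEW -> FAILED) is tracked in memory only.
        void finishRejectedApp(String appId) {
            completedApps.addLast(appId);
            evictIfNeeded();
        }

        // Drop the oldest completed apps from memory and from the store.
        private void evictIfNeeded() {
            while (completedApps.size() > maxCompletedApps) {
                String oldest = completedApps.removeFirst();
                if (tolerateMissing) {
                    store.removeIfPresent(oldest);
                } else {
                    store.remove(oldest);
                }
            }
        }

        int completedCount() {
            return completedApps.size();
        }
    }

    public static void main(String[] args) {
        AppManager strict = new AppManager(new StateStore(), 1, false);
        strict.finishRejectedApp("app_1");
        try {
            strict.finishAcceptedApp("app_2"); // evicts app_1, which was never stored
        } catch (IllegalStateException e) {
            System.out.println("strict removal failed: " + e.getMessage());
        }

        AppManager tolerant = new AppManager(new StateStore(), 1, true);
        tolerant.finishRejectedApp("app_1");
        tolerant.finishAcceptedApp("app_2");
        System.out.println("tolerant removal kept " + tolerant.completedCount() + " app(s)");
    }
}
```

Tolerating the missing entry is only one way out of the mismatch; storing the
rejected app before it reaches the completed list would be another. The sketch
does not claim either is what the patch does.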

Stack Trace

2014-08-24 18:43:04,603 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: Skipping scheduling since node phxaishdc9dn0360.phx.ebay.com:58458 is reserved by application appattempt_1408727267637_12984_000001
2014-08-24 18:43:04,613 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: Trying to fulfill reservation for application application_1408727267637_12984 on node: phxaishdc9dn0816.phx.ebay.com:50443
2014-08-24 18:43:04,613 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt: Application application_1408727267637_12984 reserved container container_1408727267637_12984_01_003215 on node host: phxaishdc9dn0816.phx.ebay.com:50443 #containers=17 available=4224 used=63360, currently has 310 at priority 10; currentReservation 2618880
2014-08-24 18:43:04,613 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerNode: Updated reserved container container_1408727267637_12984_01_003215 on node host: phxaishdc9dn0816.phx.ebay.com:50443 #containers=17 available=4224 used=63360 for application org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp@2da03710
2014-08-24 18:43:04,613 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: Reserved container application=application_1408727267637_12984 resource=<memory:8448, vCores:1> queue=hdmi-set: capacity=0.2, absoluteCapacity=0.2, usedResources=<memory:34293248, vCores:7092>usedCapacity=1.4031365, absoluteUsedCapacity=0.28062728, numApps=12, numContainers=7092 usedCapacity=1.4031365 absoluteUsedCapacity=0.28062728 used=<memory:34293248, vCores:7092> cluster=<memory:122202112, vCores:14584>
2014-08-24 18:43:04,613 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: Skipping scheduling since node phxaishdc9dn0816.phx.ebay.com:50443 is reserved by application appattempt_1408727267637_12984_000001
2014-08-24 18:43:04,614 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received a org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type STATE_STORE_OP_FAILED. Cause:
org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode
at org.apache.zookeeper.KeeperException.create(KeeperException.java:111)
at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:945)
at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:911)
at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:852)
at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:849)
at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:948)
at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:967)
at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doMultiWithRetries(ZKRMStateStore.java:849)
at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.removeApplicationStateInternal(ZKRMStateStore.java:642)
at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$RemoveAppTransition.transition(RMStateStore.java:181)
at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$RemoveAppTransition.transition(RMStateStore.java:167)
at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362)
at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:766)
at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:837)
at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:832)
at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173)
at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106)
at java.lang.Thread.run(Thread.java:745)

2014-08-24 18:43:04,647 INFO org.apache.hadoop.util.ExitUtil: Exiting with status 1
2014-08-24 18:43:04,732 INFO org.mortbay.log: Stopped sslsocketconnec...@apollo-phx-rm-1.vip.ebay.com:50030
2014-08-24 18:43:04,847 INFO org.apache.hadoop.ipc.Server: Stopping server on 8033
2014-08-24 18:43:04,847 INFO org.apache.hadoop.ha.ActiveStandbyElector: Yielding from election
2014-08-24 18:43:04,847 INFO org.apache.hadoop.ipc.Server: Stopping IPC Server Responder
2014-08-24 18:43:04,847 INFO org.apache.hadoop.ha.ActiveStandbyElector: Deleting bread-crumb of active node...
2014-08-24 18:43:04,847 INFO org.apache.hadoop.ipc.Server: Stopping IPC Server listener on 8033
2014-08-24 18:43:04,860 INFO org.apache.zookeeper.ZooKeeper: Session: 0x247d490810c69e8 closed
2014-08-24 18:43:04,860 WARN org.apache.hadoop.ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0x247d490810c69e8
2014-08-24 18:43:04,861 INFO org.apache.zookeeper.ClientCnxn: EventThread shut down
2014-08-24 18:43:10,376 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: STARTUP_MSG:

Thanks,
Mayank

  was:
When RM HA is enabled and the ZooKeeper store is used as the RM state store,
an app that is rejected for any reason goes directly from NEW to FAILED.
The final transition adds it to the RMApps and completed-apps in-memory
structures, but it is never written to the state store.
When the RMApps default limit is reached, the RM starts deleting apps from
memory and from the store. It then tries to delete this app from the store,
the delete fails, and that failure causes the RM to crash.

Thanks,
Mayank


> RM crashes if App gets rejected for any reason and HA is enabled
> ----------------------------------------------------------------
>
>                 Key: YARN-2459
>                 URL: https://issues.apache.org/jira/browse/YARN-2459
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 2.4.1
>            Reporter: Mayank Bansal
>            Assignee: Mayank Bansal
>         Attachments: YARN-2459-1.patch



--
This message was sent by Atlassian JIRA
(v6.2#6252)
