[
https://issues.apache.org/jira/browse/YARN-3878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Varun Saxena updated YARN-3878:
-------------------------------
Description:
The sequence of events is as under :
# RM is stopped while putting a RMStateStore Event to RMStateStore's
AsyncDispatcher. This leads to an Interrupted Exception being thrown.
# As RM is being stopped, RMStateStore's AsyncDispatcher is also stopped. On
{{serviceStop}}, we will check if all events have been drained and wait for
event queue to drain(as RM State Store dispatcher is configured for queue to
drain on stop).
# This condition never becomes true and AsyncDispatcher keeps on waiting
incessantly for dispatcher event queue to drain till JVM exits.
*Initial exception while posting RM State store event to queue*
{noformat}
2015-06-27 20:08:35,922 DEBUG [main] service.AbstractService
(AbstractService.java:enterState(452)) - Service: Dispatcher entered state
STOPPED
2015-06-27 20:08:35,923 WARN [AsyncDispatcher event handler]
event.AsyncDispatcher (AsyncDispatcher.java:handle(247)) - AsyncDispatcher
thread interrupted
java.lang.InterruptedException
at
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireInterruptibly(AbstractQueuedSynchronizer.java:1219)
at
java.util.concurrent.locks.ReentrantLock.lockInterruptibly(ReentrantLock.java:340)
at
java.util.concurrent.LinkedBlockingQueue.put(LinkedBlockingQueue.java:338)
at
org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:244)
at
org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.updateApplicationAttemptState(RMStateStore.java:652)
at
org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.rememberTargetTransitionsAndStoreState(RMAppAttemptImpl.java:1173)
at
org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.access$3300(RMAppAttemptImpl.java:109)
at
org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl$ContainerFinishedTransition.transition(RMAppAttemptImpl.java:1650)
at
org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl$ContainerFinishedTransition.transition(RMAppAttemptImpl.java:1619)
at
org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385)
at
org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
at
org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
at
org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
at
org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:786)
at
org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:108)
at
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:838)
{noformat}
*JStack of AsyncDispatcher hanging on stop*
{noformat}
"AsyncDispatcher event handler" prio=10 tid=0x00007fb980222800 nid=0x4b1e
waiting on condition [0x00007fb9654e9000]
java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x0000000700b79250> (a
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
at
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043)
at
java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
at
org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:113)
at java.lang.Thread.run(Thread.java:744)
"main" prio=10 tid=0x00007fb98000a800 nid=0x49c3 in Object.wait()
[0x00007fb989851000]
java.lang.Thread.State: TIMED_WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
- waiting on <0x0000000700b79430> (a java.lang.Object)
at
org.apache.hadoop.yarn.event.AsyncDispatcher.serviceStop(AsyncDispatcher.java:156)
- locked <0x0000000700b79430> (a java.lang.Object)
at
org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221)
- locked <0x0000000700b79420> (a java.lang.Object)
at
org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.serviceStop(RMStateStore.java:515)
at
org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221)
- locked <0x0000000700b79630> (a java.lang.Object)
at
org.apache.hadoop.service.AbstractService.close(AbstractService.java:250)
at
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStop(ResourceManager.java:599)
{noformat}
was:
The sequence of events is as under :
# RM is stopped while putting a RMStateStore Event to RMStateStore's
AsyncDispatcher. This leads to an Interrupted Exception being thrown.
# As RM is being stopped, RMStateStore's AsyncDispatcher is also stopped. On
{{serviceStop}}, we will check if all events have been drained and wait for
event queue to drain(as RM State Store dispatcher is configured for queue to
drain on stop).
# This condition never becomes true and AsyncDispatcher keeps on waiting
incessantly for dispatcher event queue to drain till JVM exits.
*Initial exception while posting RM State store event to queue*
{noformat}
2015-06-27 20:08:35,922 DEBUG [main] service.AbstractService
(AbstractService.java:enterState(452)) - Service: Dispatcher entered state
STOPPED
2015-06-27 20:08:35,923 WARN [AsyncDispatcher event handler]
event.AsyncDispatcher (AsyncDispatcher.java:handle(247)) - AsyncDispatcher
thread interrupted
java.lang.InterruptedException
at
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireInterruptibly(AbstractQueuedSynchronizer.java:1219)
at
java.util.concurrent.locks.ReentrantLock.lockInterruptibly(ReentrantLock.java:340)
at
java.util.concurrent.LinkedBlockingQueue.put(LinkedBlockingQueue.java:338)
at
org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:244)
at
org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.updateApplicationAttemptState(RMStateStore.java:652)
at
org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.rememberTargetTransitionsAndStoreState(RMAppAttemptImpl.java:1173)
at
org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.access$3300(RMAppAttemptImpl.java:109)
at
org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl$ContainerFinishedTransition.transition(RMAppAttemptImpl.java:1650)
at
org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl$ContainerFinishedTransition.transition(RMAppAttemptImpl.java:1619)
at
org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385)
at
org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
at
org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
at
org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
at
org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:786)
at
org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:108)
at
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:838)
{noformat}
* JStack of AsyncDispatcher hanging on stop *
{noformat}
"AsyncDispatcher event handler" prio=10 tid=0x00007fb980222800 nid=0x4b1e
waiting on condition [0x00007fb9654e9000]
java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x0000000700b79250> (a
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
at
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043)
at
java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
at
org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:113)
at java.lang.Thread.run(Thread.java:744)
"main" prio=10 tid=0x00007fb98000a800 nid=0x49c3 in Object.wait()
[0x00007fb989851000]
java.lang.Thread.State: TIMED_WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
- waiting on <0x0000000700b79430> (a java.lang.Object)
at
org.apache.hadoop.yarn.event.AsyncDispatcher.serviceStop(AsyncDispatcher.java:156)
- locked <0x0000000700b79430> (a java.lang.Object)
at
org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221)
- locked <0x0000000700b79420> (a java.lang.Object)
at
org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.serviceStop(RMStateStore.java:515)
at
org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221)
- locked <0x0000000700b79630> (a java.lang.Object)
at
org.apache.hadoop.service.AbstractService.close(AbstractService.java:250)
at
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStop(ResourceManager.java:599)
{noformat}
> AsyncDispatcher can hang while stopping if it is configured for draining
> events on stop
> ---------------------------------------------------------------------------------------
>
> Key: YARN-3878
> URL: https://issues.apache.org/jira/browse/YARN-3878
> Project: Hadoop YARN
> Issue Type: Bug
> Affects Versions: 2.7.0
> Reporter: Varun Saxena
> Assignee: Varun Saxena
> Priority: Critical
>
> The sequence of events is as under :
> # RM is stopped while putting a RMStateStore Event to RMStateStore's
> AsyncDispatcher. This leads to an Interrupted Exception being thrown.
> # As RM is being stopped, RMStateStore's AsyncDispatcher is also stopped. On
> {{serviceStop}}, we will check if all events have been drained and wait for
> event queue to drain(as RM State Store dispatcher is configured for queue to
> drain on stop).
> # This condition never becomes true and AsyncDispatcher keeps on waiting
> incessantly for dispatcher event queue to drain till JVM exits.
> *Initial exception while posting RM State store event to queue*
> {noformat}
> 2015-06-27 20:08:35,922 DEBUG [main] service.AbstractService
> (AbstractService.java:enterState(452)) - Service: Dispatcher entered state
> STOPPED
> 2015-06-27 20:08:35,923 WARN [AsyncDispatcher event handler]
> event.AsyncDispatcher (AsyncDispatcher.java:handle(247)) - AsyncDispatcher
> thread interrupted
> java.lang.InterruptedException
> at
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireInterruptibly(AbstractQueuedSynchronizer.java:1219)
> at
> java.util.concurrent.locks.ReentrantLock.lockInterruptibly(ReentrantLock.java:340)
> at
> java.util.concurrent.LinkedBlockingQueue.put(LinkedBlockingQueue.java:338)
> at
> org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:244)
> at
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.updateApplicationAttemptState(RMStateStore.java:652)
> at
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.rememberTargetTransitionsAndStoreState(RMAppAttemptImpl.java:1173)
> at
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.access$3300(RMAppAttemptImpl.java:109)
> at
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl$ContainerFinishedTransition.transition(RMAppAttemptImpl.java:1650)
> at
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl$ContainerFinishedTransition.transition(RMAppAttemptImpl.java:1619)
> at
> org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385)
> at
> org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
> at
> org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
> at
> org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
> at
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:786)
> at
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:108)
> at
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:838)
> {noformat}
> *JStack of AsyncDispatcher hanging on stop*
> {noformat}
> "AsyncDispatcher event handler" prio=10 tid=0x00007fb980222800 nid=0x4b1e
> waiting on condition [0x00007fb9654e9000]
> java.lang.Thread.State: WAITING (parking)
> at sun.misc.Unsafe.park(Native Method)
> - parking to wait for <0x0000000700b79250> (a
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
> at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
> at
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043)
> at
> java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
> at
> org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:113)
> at java.lang.Thread.run(Thread.java:744)
> "main" prio=10 tid=0x00007fb98000a800 nid=0x49c3 in Object.wait()
> [0x00007fb989851000]
> java.lang.Thread.State: TIMED_WAITING (on object monitor)
> at java.lang.Object.wait(Native Method)
> - waiting on <0x0000000700b79430> (a java.lang.Object)
> at
> org.apache.hadoop.yarn.event.AsyncDispatcher.serviceStop(AsyncDispatcher.java:156)
> - locked <0x0000000700b79430> (a java.lang.Object)
> at
> org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221)
> - locked <0x0000000700b79420> (a java.lang.Object)
> at
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.serviceStop(RMStateStore.java:515)
> at
> org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221)
> - locked <0x0000000700b79630> (a java.lang.Object)
> at
> org.apache.hadoop.service.AbstractService.close(AbstractService.java:250)
> at
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStop(ResourceManager.java:599)
> {noformat}
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)