[
https://issues.apache.org/jira/browse/YARN-6647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16259007#comment-16259007
]
Bibin A Chundatt commented on YARN-6647:
----------------------------------------
[~jlowe]
Adding the analysis done as part of YARN-7515 to this jira.
{quote}
and the interrupt exception ended up bubbling all the way up to the dispatcher
which caused the JVM exit
{quote}
IIUC it's not the interrupted exception caused by the Zk operation interrupt
bubbling up that is causing the issue. The issue is the *RMFatalEvent* sent to
{{AsyncDispatcher#EventHandler}} from the *interrupted thread*, i.e.
{{AbstractDelegationTokenSecretManager#ExpiredTokenRemover}}, as a result of
the {{Zk operation interrupt}}. Please do correct me if I am wrong.
*Analysis*
{code}
try {
  eventQueue.put(event);
} catch (InterruptedException e) {
  if (!stopped) {
    LOG.warn(
        "AsyncDispatcher thread interrupted " + Thread.currentThread()
            .getName(), e);
  }
  // Need to reset drained flag to true if event queue is empty,
  // otherwise dispatcher will hang on stop.
  drained = eventQueue.isEmpty();
  throw new YarnRuntimeException(e);
}
{code}
A put operation on {{LinkedBlockingQueue}} from an interrupted thread throws
{{InterruptedException}}:
{code}
public void put(E e) throws InterruptedException {
  ..
  putLock.lockInterruptibly();
}
{code}
{code}
public final void acquireInterruptibly(int arg)
    throws InterruptedException {
  if (Thread.interrupted())
    throw new InterruptedException();
}
{code}
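To make the above concrete, here is a minimal standalone sketch (not RM code; the class and method names are made up for illustration) showing that {{LinkedBlockingQueue#put}} throws as soon as the calling thread's interrupt flag is set, even though the queue has free capacity:

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class InterruptedPutDemo {

    /**
     * Returns true if put() threw InterruptedException on a thread whose
     * interrupt flag was already set, and nothing was enqueued.
     */
    public static boolean putFailsWhenInterrupted() {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>();
        // Simulate a thread that was interrupted earlier and still carries the flag.
        Thread.currentThread().interrupt();
        try {
            // The queue is unbounded and empty, so put() would not block,
            // but lockInterruptibly() checks the flag first and throws.
            queue.put("event");
            return false;
        } catch (InterruptedException e) {
            // The event was never added to the queue.
            return queue.isEmpty();
        } finally {
            // Clear the flag so the caller is not affected by this demo.
            Thread.interrupted();
        }
    }

    public static void main(String[] args) {
        System.out.println("put failed on interrupted thread: "
            + putFailsWhenInterrupted());
    }
}
```

So any thread that still carries the interrupt flag cannot hand an event to the dispatcher queue.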
*RM switchover flow which could shut down the RM*
Resource manager {{transitionToStandby()}} --> {{RMActiveService.stop()}} -->
{{RMSecretManagerService#serviceStop()}} -->
{{rmDTSecretManager.stopThreads()}}
{code}
synchronized (noInterruptsLock) {
  tokenRemoverThread.interrupt();
}
{code}
{{ExpiredTokenRemover}}, when interrupted during {{rollMasterKey()}}, throws
{{InterruptedException}}, which causes {{notifyStoreOperationFailedInternal}}
to be called in {{RMStateStore#StoreRMDTMasterKeyTransition}}
{code}
try {
  LOG.info("Storing RMDTMasterKey.");
  store.storeRMDTMasterKeyState(dtEvent.getDelegationKey());
} catch (Exception e) {
  LOG.error("Error While Storing RMDTMasterKey.", e);
  isFenced = store.notifyStoreOperationFailedInternal(e);
}
{code}
{{store.notifyStoreOperationFailedInternal}} eventually fires {{RMFatalEvent}}
from the {{ExpiredTokenRemover}} thread, which is *interrupted*
{code}
rmDispatcher.getEventHandler().handle(
    new RMFatalEvent(RMFatalEventType.STATE_STORE_FENCED,
        failureCause));
{code}
eventually causing {{LinkedBlockingQueue#put}} to fail and the *RM to exit*.
*Solution:* We should skip {{notifyStoreOperationFailedInternal}} if the
current thread is interrupted, which should avoid this case. Thoughts?
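A rough sketch of the guard proposed above, outside of the RM code base. Everything here ({{FailureNotifier}}, {{storeOrNotify}}, the class name) is hypothetical and only stands in for the store transition and {{notifyStoreOperationFailedInternal}}; it is not a patch. Note that it relies on the interrupt flag still being set when the exception reaches the catch block.

```java
import java.util.concurrent.Callable;

public class SkipNotifyOnInterruptSketch {

    /** Hypothetical stand-in for the state-store failure callback. */
    public interface FailureNotifier {
        void notifyFailed(Exception cause);
    }

    /**
     * Runs the store operation. On failure, the notifier is invoked only if
     * the current thread is NOT interrupted. Returns true if it was invoked.
     */
    public static boolean storeOrNotify(Callable<Void> storeOp,
                                        FailureNotifier notifier) {
        try {
            storeOp.call();
            return false;
        } catch (Exception e) {
            // isInterrupted() reads the flag without clearing it.
            if (Thread.currentThread().isInterrupted()) {
                // Thread is being stopped; firing a fatal event from here
                // would fail on the dispatcher queue anyway, so skip it.
                return false;
            }
            notifier.notifyFailed(e);
            return true;
        }
    }

    public static void main(String[] args) {
        // Ordinary store failure: the notifier is called.
        boolean notified = storeOrNotify(
            () -> { throw new IllegalStateException("store failed"); },
            e -> System.out.println("notified: " + e.getMessage()));
        System.out.println("notifier invoked: " + notified);

        // Failure on an interrupted thread: the notifier is skipped.
        boolean notifiedWhenInterrupted = storeOrNotify(
            () -> {
                Thread.currentThread().interrupt();
                throw new Exception("operation interrupted");
            },
            e -> System.out.println("should not be printed"));
        Thread.interrupted(); // clear the flag left by the demo
        System.out.println("notifier invoked when interrupted: "
            + notifiedWhenInterrupted);
    }
}
```

A genuine store failure on a healthy thread still reaches the notifier, so fencing detection is unaffected in that case.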
*The issue exists only in 3.0.0-alpha+* since the curator version was changed to
{{2.12.0}}
{code}
public static<T> T callWithRetry(CuratorZookeeperClient client,
    Callable<T> proc) throws Exception
{
    T result = null;
    RetryLoop retryLoop = client.newRetryLoop();
    while ( retryLoop.shouldContinue() )
    {
        try
        {
            ..
        }
        catch ( Exception e )
        {
            ThreadUtils.checkInterrupted(e);  // <-- the relevant call
            retryLoop.takeException(e);
        }
    }
    return result;
}
{code}
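My reading, and it is an assumption rather than something verified against the Curator source here, is that {{ThreadUtils.checkInterrupted}} sets the interrupt flag again when the exception is an {{InterruptedException}}. The standalone sketch below uses a hypothetical helper with that behaviour to show the difference: without it the flag is cleared once the exception is caught, with it the flag stays set for whatever the thread does next.

```java
public class ReinterruptSketch {

    /**
     * Hypothetical helper modelled on the assumed behaviour: re-sets the
     * interrupt flag if the throwable is an InterruptedException.
     */
    public static boolean checkInterrupted(Throwable e) {
        if (e instanceof InterruptedException) {
            Thread.currentThread().interrupt();
            return true;
        }
        return false;
    }

    /**
     * Interrupts the current thread, lets a blocking call throw, optionally
     * re-interrupts in the catch block, and returns the resulting flag state.
     * The flag is cleared before returning.
     */
    public static boolean flagAfterCatch(boolean reinterrupt) {
        Thread.currentThread().interrupt();
        try {
            // Throws immediately because the flag is set; throwing clears it.
            Thread.sleep(10);
        } catch (InterruptedException e) {
            if (reinterrupt) {
                checkInterrupted(e);
            }
        }
        // Thread.interrupted() returns the flag and clears it.
        return Thread.interrupted();
    }

    public static void main(String[] args) {
        System.out.println("flag without re-interrupt: " + flagAfterCatch(false));
        System.out.println("flag with re-interrupt:    " + flagAfterCatch(true));
    }
}
```

With the flag left set, a later {{eventQueue.put(event)}} on the same thread fails as described in the analysis.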
Related jira: HADOOP-14187
> RM can crash during shutdown due to InterruptedException
> --------------------------------------------------------
>
> Key: YARN-6647
> URL: https://issues.apache.org/jira/browse/YARN-6647
> Project: Hadoop YARN
> Issue Type: Bug
> Components: resourcemanager
> Affects Versions: 3.0.0-alpha4
> Reporter: Jason Lowe
>
> Noticed some tests were failing due to the JVM shutting down early. I was
> able to reproduce this occasionally with TestKillApplicationWithRMHA.
> Stacktrace to follow.