[jira] [Commented] (YARN-6647) RM can crash during shutdown due to InterruptedException

Bibin A Chundatt (JIRA) Mon, 20 Nov 2017 01:32:17 -0800

    [ 
https://issues.apache.org/jira/browse/YARN-6647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16259007#comment-16259007
 ]


Bibin A Chundatt commented on YARN-6647:
----------------------------------------

[~jlowe]
Adding analysis done as part of YARN-7515 in this jira
{quote}
 and the interrupt exception ended up bubbling all the way up to the dispatcher 
which caused the JVM exit
{quote}
IIUC, it is not the {{InterruptedException}} from the interrupted ZK operation 
bubbling up that causes the issue. Rather, the {{ZK operation interrupt}} leads 
to an *RMFatalEvent* being sent to {{AsyncDispatcher#EventHandler}} from the 
*interrupted thread*, i.e. 
{{AbstractDelegationTokenSecretManager#ExpiredTokenRemover}}. Please correct me 
if I am wrong. 

*Analysis*

{code}
      try {
        eventQueue.put(event);
      } catch (InterruptedException e) {
        if (!stopped) {
          LOG.warn(
              "AsyncDispatcher thread interrupted " + Thread.currentThread()
                  .getName(), e);
        }
        // Need to reset drained flag to true if event queue is empty,
        // otherwise dispatcher will hang on stop.
        drained = eventQueue.isEmpty();
        throw new YarnRuntimeException(e);
      }
{code}
A {{put}} on {{LinkedBlockingQueue}} from an interrupted thread throws 
{{InterruptedException}}, because {{put}} acquires its lock interruptibly:
{code}
public void put(E e) throws InterruptedException {
..
     putLock.lockInterruptibly();
}
{code}
{code}
    public final void acquireInterruptibly(int arg)
            throws InterruptedException {
        if (Thread.interrupted())
            throw new InterruptedException();
..
    }
{code}
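
To illustrate the point above, here is a minimal standalone sketch (not Hadoop code; the class and method names are made up for the example). It sets the interrupt flag on the current thread and then calls {{put}} on an unbounded {{LinkedBlockingQueue}}, so the failure cannot be due to the queue being full.

```java
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical demo class, not part of Hadoop.
class InterruptedPutDemo {

    /**
     * Returns true if put() threw InterruptedException and nothing was
     * enqueued, even though the queue had free capacity.
     */
    static boolean putFailsWhenInterrupted() {
        // Unbounded queue: put() never has to wait for space.
        LinkedBlockingQueue<String> queue = new LinkedBlockingQueue<>();
        // Set the interrupt flag, as Thread#interrupt() from another
        // thread would do.
        Thread.currentThread().interrupt();
        try {
            queue.put("event");
            return false;
        } catch (InterruptedException e) {
            // The interruptible lock acquisition saw the flag and threw
            // before the element was added.
            return queue.isEmpty();
        } finally {
            // Make sure the flag is cleared before returning to the caller.
            Thread.interrupted();
        }
    }

    public static void main(String[] args) {
        System.out.println("put failed on interrupted thread: "
            + putFailsWhenInterrupted());
    }
}
```

In {{AsyncDispatcher}} the same exception is wrapped into a {{YarnRuntimeException}}, as in the first snippet.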

*RM switchover flow that could shut down the RM*

ResourceManager {{transitionToStandby()}} --> {{RMActiveService.stop()}} --> 
{{RMSecretManagerService#serviceStop()}} --> 
{{rmDTSecretManager.stopThreads()}}
{code}
      synchronized (noInterruptsLock) {
        tokenRemoverThread.interrupt();
      }
{code}
{{ExpiredTokenRemover}}, when interrupted during {{rollMasterKey()}}, throws 
{{InterruptedException}}, which triggers {{notifyStoreOperationFailedInternal}} 
in {{RMStateStore#StoreRMDTMasterKeyTransition}}:
{code}
      try {
        LOG.info("Storing RMDTMasterKey.");
        store.storeRMDTMasterKeyState(dtEvent.getDelegationKey());
      } catch (Exception e) {
        LOG.error("Error While Storing RMDTMasterKey.", e);
        isFenced = store.notifyStoreOperationFailedInternal(e);
      }
{code}
{{store.notifyStoreOperationFailedInternal}} eventually fires an 
{{RMFatalEvent}} from the {{ExpiredTokenRemover}} thread, which is 
*interrupted*:
{code}
    rmDispatcher.getEventHandler().handle(
          new RMFatalEvent(RMFatalEventType.STATE_STORE_FENCED,
              failureCause));
{code}
This eventually causes {{LinkedBlockingQueue#put}} to fail, and the *RM exits*.

*Solution:* We should skip {{notifyStoreOperationFailedInternal}} if the 
current thread is interrupted, which should avoid this case. Thoughts?
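
A rough sketch of what the proposed guard could look like, using stand-in types rather than the real {{RMStateStore}} classes. {{StoreOp}}, {{runStoreOp}}, {{notifyStoreOperationFailed}} and {{fatalEvents}} are all hypothetical names for illustration. It also assumes the interrupt flag is still set when the exception reaches the catch block.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the state-store transition; not Hadoop code.
class SkipNotifyWhenInterrupted {

    /** Stand-in for a store operation such as storing the master key. */
    interface StoreOp {
        void run() throws Exception;
    }

    /** Stand-in for the fatal events that would reach the dispatcher. */
    static final List<String> fatalEvents = new ArrayList<>();

    /** Stand-in for notifyStoreOperationFailedInternal. */
    static void notifyStoreOperationFailed(Exception cause) {
        fatalEvents.add("STATE_STORE_FENCED: " + cause);
    }

    static void runStoreOp(StoreOp op) {
        try {
            op.run();
        } catch (Exception e) {
            // Proposed guard: an interrupted thread is most likely being
            // stopped, so do not escalate the failure to a fatal event.
            if (Thread.currentThread().isInterrupted()) {
                return;
            }
            notifyStoreOperationFailed(e);
        }
    }

    public static void main(String[] args) {
        // Failure on an interrupted thread: skipped.
        runStoreOp(() -> {
            Thread.currentThread().interrupt();
            throw new InterruptedException("simulated ZK interrupt");
        });
        Thread.interrupted(); // clear the flag for the next case
        // Failure on a normal thread: still reported.
        runStoreOp(() -> {
            throw new java.io.IOException("simulated store failure");
        });
        System.out.println("fatal events: " + fatalEvents.size());
    }
}
```

{{isInterrupted()}} is used rather than {{Thread.interrupted()}} so that the check does not clear the flag for the rest of the shutdown path.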

*Issue exists only in 3.0.0 alpha+*, since the Curator version was changed to 
{{2.12.0}}:

{code}
    public static<T> T callWithRetry(CuratorZookeeperClient client, Callable<T> proc) throws Exception
    {
        T               result = null;
        RetryLoop       retryLoop = client.newRetryLoop();
        while ( retryLoop.shouldContinue() )
        {
            try
            {
                ..
            }
            catch ( Exception e )
            {
                ThreadUtils.checkInterrupted(e);  // <-- the relevant call
                retryLoop.takeException(e);
            }
        }
        return result;
    }
{code}
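
My understanding of why this call matters, as a standalone sketch: catching an {{InterruptedException}} normally leaves the interrupt flag cleared, so a later {{put}} succeeds. If the flag is set again after the catch, the later {{put}} fails. The local {{checkInterrupted}} helper below is an assumption about what the Curator method does (re-interrupt the thread when the exception is an {{InterruptedException}}). It is not Curator's code, and the class and method names are made up.

```java
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical demo class; does not use Curator or Hadoop.
class ReassertInterruptDemo {

    /** Assumed behaviour: set the flag again for InterruptedException. */
    static void checkInterrupted(Throwable e) {
        if (e instanceof InterruptedException) {
            Thread.currentThread().interrupt();
        }
    }

    /** Returns true if the follow-up put() succeeded. */
    static boolean putAfterInterruptedOp(boolean reassert) {
        LinkedBlockingQueue<String> queue = new LinkedBlockingQueue<>();
        try {
            try {
                // Simulates a ZK operation that was interrupted; the flag
                // is not set at this point, only the exception is thrown.
                throw new InterruptedException("simulated ZK interrupt");
            } catch (Exception e) {
                if (reassert) {
                    checkInterrupted(e);
                }
            }
            // Simulates the later RMFatalEvent being queued.
            queue.put("RMFatalEvent");
            return true;
        } catch (InterruptedException e) {
            return false;
        } finally {
            Thread.interrupted(); // leave the calling thread clean
        }
    }

    public static void main(String[] args) {
        System.out.println("without re-assert, put ok: "
            + putAfterInterruptedOp(false));
        System.out.println("with re-assert, put ok: "
            + putAfterInterruptedOp(true));
    }
}
```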

Related JIRA: HADOOP-14187 

> RM can crash during shutdown due to InterruptedException
> --------------------------------------------------------
>
>                 Key: YARN-6647
>                 URL: https://issues.apache.org/jira/browse/YARN-6647
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 3.0.0-alpha4
>            Reporter: Jason Lowe
>
> Noticed some tests were failing due to the JVM shutting down early.  I was 
> able to reproduce this occasionally with TestKillApplicationWithRMHA.  
> Stacktrace to follow.


