[ 
https://issues.apache.org/jira/browse/YARN-6647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16259398#comment-16259398
 ] 

Jason Lowe commented on YARN-6647:
----------------------------------

bq. IIUC it's not the interrupted exception bubbling caused by the ZK operation 
interrupt which is causing the issue.

Indirectly it is; otherwise this would not be related to a Curator change.  
The fatal event is being sent because the ZKRMStateStore reported the 
shutdown-related interrupt exception as a store failure.  However, looking at 
this again, I'm not sure the state store can correctly distinguish a spurious 
interrupted exception from a shutdown-related one, since I believe the state 
store itself isn't shut down yet.

bq. We should skip notifyStoreOperationFailedInternal if the current thread is 
interrupted, which should avoid this case. Thoughts?

I'm not a fan of this, since the state store would have to assume that any 
interrupted exception is caused by a shutdown, and that is not guaranteed to 
be the case.  It seems like this could be handled by the delegation token 
secret manager, which _does_ know it is shutting down at the time and is 
ultimately the one responsible for calling ExitUtil.  Specifically, I'm 
thinking of this code and others like it in RMDelegationTokenSecretManager:
{code}
  protected void storeNewMasterKey(DelegationKey newKey) {
    try {
      LOG.info("storing master key with keyID " + newKey.getKeyId());
      rm.getRMContext().getStateStore().storeRMDTMasterKey(newKey);
    } catch (Exception e) {
      LOG.error("Error in storing master key with KeyID: " + newKey.getKeyId());
      ExitUtil.terminate(1, e);
    }
  }
{code}
could change to look something like this:
{code}
  protected void storeNewMasterKey(DelegationKey newKey) {
    try {
      LOG.info("storing master key with keyID " + newKey.getKeyId());
      rm.getRMContext().getStateStore().storeRMDTMasterKey(newKey);
    } catch (Exception e) {
      if (!shouldIgnoreException(e)) {
        LOG.error("Error in storing master key with KeyID: "
            + newKey.getKeyId());
        ExitUtil.terminate(1, e);
      }
    }
  }

  private boolean shouldIgnoreException(Exception e) {
    return !running && e.getCause() instanceof InterruptedException;
  }
{code}
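
To make the intent of that check concrete, here is a rough, self-contained sketch of the same idea. This is not the actual RMDelegationTokenSecretManager code: the class name ShutdownInterruptCheck, the stop() method, and the causedByInterrupt helper are hypothetical names used only for illustration. It also assumes we might want to look through the whole cause chain rather than only at e.getCause(), in case the interrupt gets wrapped more than once on its way up from the store.

```java
// Hypothetical stand-in for the secret manager; only the pieces needed
// to illustrate the "ignore interrupts while stopping" check are shown.
class ShutdownInterruptCheck {
    // Assumed flag: true while the manager is active, false once it is stopping.
    private volatile boolean running = true;

    // Hypothetical method marking the start of shutdown.
    void stop() {
        running = false;
    }

    // Returns true if e, or any exception in its cause chain, is an
    // InterruptedException. The depth bound guards against a cyclic chain.
    static boolean causedByInterrupt(Throwable e) {
        int depth = 0;
        for (Throwable t = e; t != null && depth < 32; t = t.getCause(), depth++) {
            if (t instanceof InterruptedException) {
                return true;
            }
        }
        return false;
    }

    // Only swallow the failure when we know we are shutting down AND the
    // failure traces back to an interrupt; anything else stays fatal.
    boolean shouldIgnoreException(Exception e) {
        return !running && causedByInterrupt(e);
    }
}
```

The point of requiring both conditions is that an interrupt seen while the manager is still running, or a non-interrupt failure seen during shutdown, would still reach ExitUtil.terminate as before.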


> RM can crash during transitionToStandby due to InterruptedException
> -------------------------------------------------------------------
>
>                 Key: YARN-6647
>                 URL: https://issues.apache.org/jira/browse/YARN-6647
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 3.0.0-alpha4
>            Reporter: Jason Lowe
>            Priority: Critical
>
> Noticed some tests were failing due to the JVM shutting down early.  I was 
> able to reproduce this occasionally with TestKillApplicationWithRMHA.  
> Stacktrace to follow.


