[
https://issues.apache.org/jira/browse/YARN-6647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16259398#comment-16259398
]
Jason Lowe commented on YARN-6647:
----------------------------------
bq. IIUC it's not the interrupted exception bubbling up, caused by the ZK operation
interrupt, which is causing the issue.
Indirectly it is; otherwise this would not be related to a Curator change.
The fatal event is being sent because the ZKRMStateStore
reported the shutdown-related interrupted exception as a store failure. However,
looking at this again, I'm not sure the state store can correctly distinguish a
spurious interrupted exception from a shutdown-related one, since I
believe the state store itself isn't shut down yet.
bq. We should skip notifyStoreOperationFailedInternal if the current thread is
interrupted, which should avoid this case. Thoughts?
I'm not a fan of this, since the state store would have to assume
that any interrupted exception is caused by a shutdown, and that is not
guaranteed to be the case. It seems this could instead be handled by the delegation
token secret manager, which _does_ know it is shutting down at the time and
is ultimately the one responsible for calling ExitUtil. Specifically, I'm
thinking of this code and others like it in RMDelegationTokenSecretManager:
{code}
  protected void storeNewMasterKey(DelegationKey newKey) {
    try {
      LOG.info("storing master key with keyID " + newKey.getKeyId());
      rm.getRMContext().getStateStore().storeRMDTMasterKey(newKey);
    } catch (Exception e) {
      LOG.error("Error in storing master key with KeyID: " + newKey.getKeyId());
      ExitUtil.terminate(1, e);
    }
  }
{code}
could change to look something like this:
{code}
  protected void storeNewMasterKey(DelegationKey newKey) {
    try {
      LOG.info("storing master key with keyID " + newKey.getKeyId());
      rm.getRMContext().getStateStore().storeRMDTMasterKey(newKey);
    } catch (Exception e) {
      if (!shouldIgnoreException(e)) {
        LOG.error("Error in storing master key with KeyID: " +
            newKey.getKeyId());
        ExitUtil.terminate(1, e);
      }
    }
  }

  private boolean shouldIgnoreException(Exception e) {
    return !running && e.getCause() instanceof InterruptedException;
  }
{code}
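To make the intended behavior of that check concrete, here is a small standalone sketch that exercises the same condition outside of Hadoop. The class name ShutdownAwareCheckSketch, the stop() method, and the use of IOException as the wrapping exception are assumptions made purely for illustration; only the body of shouldIgnoreException is taken from the snippet above.

```java
import java.io.IOException;

// Hypothetical stand-in for the secret manager; not Hadoop code.
class ShutdownAwareCheckSketch {
  // Stand-in for the manager's "running" flag; volatile so a stop
  // issued from another thread is visible to the thread doing the store.
  private volatile boolean running = true;

  // Hypothetical helper that marks the manager as stopping.
  void stop() {
    running = false;
  }

  // Same condition as the proposed check: ignore the failure only when
  // we are stopping AND the direct cause is an InterruptedException.
  boolean shouldIgnoreException(Exception e) {
    return !running && e.getCause() instanceof InterruptedException;
  }

  public static void main(String[] args) {
    ShutdownAwareCheckSketch manager = new ShutdownAwareCheckSketch();
    Exception wrappedInterrupt =
        new IOException("store failed", new InterruptedException());
    Exception otherFailure = new IOException("store failed");

    // While running, an interrupt is unexpected and must stay fatal.
    System.out.println(manager.shouldIgnoreException(wrappedInterrupt));

    manager.stop();

    // After stop, a wrapped interrupt is treated as shutdown noise...
    System.out.println(manager.shouldIgnoreException(wrappedInterrupt));
    // ...but any other store failure is still fatal.
    System.out.println(manager.shouldIgnoreException(otherFailure));
    // An unwrapped InterruptedException has a null cause, so this
    // condition does not ignore it.
    System.out.println(
        manager.shouldIgnoreException(new InterruptedException()));
  }
}
```

One thing the sketch makes visible is that the check only looks at the direct cause, so it relies on the store wrapping the interrupt exactly one level deep. Whether that always holds for the state store is something I have not verified.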
> RM can crash during transitionToStandby due to InterruptedException
> -------------------------------------------------------------------
>
> Key: YARN-6647
> URL: https://issues.apache.org/jira/browse/YARN-6647
> Project: Hadoop YARN
> Issue Type: Bug
> Components: resourcemanager
> Affects Versions: 3.0.0-alpha4
> Reporter: Jason Lowe
> Priority: Critical
>
> Noticed some tests were failing due to the JVM shutting down early. I was
> able to reproduce this occasionally with TestKillApplicationWithRMHA.
> Stacktrace to follow.