[ https://issues.apache.org/jira/browse/YARN-2019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13990050#comment-13990050 ]
Tsuyoshi OZAWA commented on YARN-2019:
--------------------------------------

This means that all RMs can terminate when ZK cannot be accessed from the RMs. If we should retry until ZK comes up, one solution is to handle STATE_STORE_OP_FAILED in RMFatalEventDispatcher and go into standby state. Please see the attached patch.

> Retrospect on decision of making RM crashed if any exception throw in ZKRMStateStore
> ------------------------------------------------------------------------------------
>
> Key: YARN-2019
> URL: https://issues.apache.org/jira/browse/YARN-2019
> Project: Hadoop YARN
> Issue Type: Bug
> Reporter: Junping Du
> Priority: Critical
> Labels: ha
> Attachments: YARN-2019.1-wip.patch
>
>
> Currently, if anything abnormal happens in ZKRMStateStore, it throws a fatal
> exception that crashes the RM. As shown in YARN-1924, this could be due to an
> internal bug in RM HA itself rather than a truly fatal condition. We should
> revisit this decision, since the HA feature is designed to protect a key
> component, not to disrupt it.

--
This message was sent by Atlassian JIRA
(v6.2#6252)
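
For illustration only, the idea in the comment above (treat STATE_STORE_OP_FAILED as a reason to go to standby rather than terminate) could be sketched roughly as follows. This is not the attached patch and does not use the real Hadoop classes: FatalEventSketch, ResourceManagerModel, FatalEventDispatcher, HAState, and the OTHER_FATAL_ERROR event type are hypothetical stand-ins, and only the STATE_STORE_OP_FAILED name is taken from the comment.

```java
public class FatalEventSketch {

    // Hypothetical event types; only STATE_STORE_OP_FAILED comes from the comment.
    enum FatalEventType { STATE_STORE_OP_FAILED, OTHER_FATAL_ERROR }

    // Hypothetical, simplified view of the RM's HA state.
    enum HAState { ACTIVE, STANDBY, TERMINATED }

    // Stand-in for the ResourceManager; it only tracks the state.
    static class ResourceManagerModel {
        private HAState state = HAState.ACTIVE;

        HAState getState() { return state; }

        void transitionToStandby() { state = HAState.STANDBY; }

        void terminate() { state = HAState.TERMINATED; }
    }

    // Stand-in for a fatal-event dispatcher.
    static class FatalEventDispatcher {
        private final ResourceManagerModel rm;

        FatalEventDispatcher(ResourceManagerModel rm) { this.rm = rm; }

        void handle(FatalEventType type) {
            if (type == FatalEventType.STATE_STORE_OP_FAILED) {
                // State store (e.g. ZK) unreachable: step down instead of
                // exiting, so the RM can become active again later.
                rm.transitionToStandby();
            } else {
                // Any other fatal event still brings the RM down.
                rm.terminate();
            }
        }
    }

    public static void main(String[] args) {
        ResourceManagerModel rm = new ResourceManagerModel();
        new FatalEventDispatcher(rm).handle(FatalEventType.STATE_STORE_OP_FAILED);
        System.out.println("RM state after state-store failure: " + rm.getState());
    }
}
```

The trade-off in this sketch is that a state-store failure no longer kills the process, so whether and when the RM retries becoming active is left to whatever drives HA transitions; that part is outside the sketch.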