[ https://issues.apache.org/jira/browse/YARN-1602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13876808#comment-13876808 ]
Bikas Saha commented on YARN-1602:
----------------------------------
If it's a non-transient error then the RM should go into standby. It might be
good to let the second RM try the operation. In the worst case it will also go
to standby. But if the misconfiguration is local to only one RM then the
cluster will continue running. If it's a global issue then all RMs will be in
standby and hopefully that will alert the admin. The RMs will stop touching the
store and the admin can fix it. Then we can ask all RMs to transitionToActive
and participate in leader election. Sounds good?
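
To make the proposal concrete, here is a rough sketch of the behavior
described above. It is only a toy model: the class names, the
storeMisconfigured flag and the electLeader helper are all made up for
illustration and are not the actual YARN classes or APIs. It assumes a
non-transient store error simply keeps an RM in standby so the next RM gets
a chance.

```java
import java.util.Arrays;
import java.util.List;

public class StandbyFailoverSketch {
    enum HaState { ACTIVE, STANDBY }

    /** Hypothetical stand-in for a ResourceManager; not the real YARN class. */
    static final class Rm {
        final String name;
        // Assumption: models a non-transient, RM-local store misconfiguration.
        final boolean storeMisconfigured;
        HaState state = HaState.STANDBY;

        Rm(String name, boolean storeMisconfigured) {
            this.name = name;
            this.storeMisconfigured = storeMisconfigured;
        }

        /** Try the store operation; stay in standby instead of dying on failure. */
        boolean tryBecomeActive() {
            if (storeMisconfigured) {
                state = HaState.STANDBY; // stop touching the store
                return false;
            }
            state = HaState.ACTIVE;
            return true;
        }
    }

    /** Returns the name of the RM that became active, or null if all are in standby. */
    static String electLeader(List<Rm> rms) {
        for (Rm rm : rms) {
            if (rm.tryBecomeActive()) {
                return rm.name;
            }
        }
        return null; // global issue: every RM is in standby, admin must step in
    }

    public static void main(String[] args) {
        // Misconfiguration on one RM only: the second RM takes over.
        List<Rm> localIssue = Arrays.asList(new Rm("rm1", true), new Rm("rm2", false));
        System.out.println(electLeader(localIssue));  // prints "rm2"

        // Misconfiguration everywhere: nobody becomes active.
        List<Rm> globalIssue = Arrays.asList(new Rm("rm1", true), new Rm("rm2", true));
        System.out.println(electLeader(globalIssue)); // prints "null"
    }
}
```

Once the admin has fixed the store, the same election can be run again over
all RMs, which corresponds to asking them to transitionToActive.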
> All failed RMStateStore operations should not be RMFatalEvents
> --------------------------------------------------------------
>
> Key: YARN-1602
> URL: https://issues.apache.org/jira/browse/YARN-1602
> Project: Hadoop YARN
> Issue Type: Bug
> Components: resourcemanager
> Affects Versions: 2.4.0
> Reporter: Karthik Kambatla
> Assignee: Karthik Kambatla
> Priority: Critical
>
> Currently, if a state store operation fails, depending on the exception,
> either an RMFatalEvent.STATE_STORE_FENCED or an
> RMFatalEvent.STATE_STORE_OP_FAILED event is created. The latter results in
> the RM failing. Instead, we should probably kill the application
> corresponding to the store operation.