[jira] [Commented] (YARN-1602) All failed RMStateStore operations should not be RMFatalEvents

Bikas Saha (JIRA) Mon, 20 Jan 2014 12:05:45 -0800

    [ 
https://issues.apache.org/jira/browse/YARN-1602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13876808#comment-13876808
 ]


Bikas Saha commented on YARN-1602:
----------------------------------

If it's a non-transient error then the RM should go into standby. It might be 
good for the second RM to try the operation. In the worst case it will also go 
to standby. But if the misconfiguration is local to only one RM then the 
cluster will continue running. If it's a global issue then all RMs will be in 
standby and hopefully that will alert the admin. The RMs will stop touching the 
store and the admin can fix it. Then we can ask all RMs to transitionToActive 
and participate in leader election. Sounds good?
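
To make the proposal concrete, here is a rough sketch of the failover behaviour 
described above. It is only an illustration under stated assumptions: the types 
Rm, HaState and electActive are hypothetical and are not the real YARN classes, 
and a non-transient store error is modelled as a simple boolean per RM.

```java
import java.util.List;

// Hypothetical sketch of the proposal; not the actual ResourceManager code.
final class StandbyOnStoreFailureSketch {

    enum HaState { ACTIVE, STANDBY }

    /** Hypothetical stand-in for one ResourceManager in an HA pair. */
    static final class Rm {
        final String name;
        // Assumption: false models a non-transient store error seen by this RM,
        // for example a local misconfiguration.
        final boolean storeWorks;
        HaState state = HaState.STANDBY;

        Rm(String name, boolean storeWorks) {
            this.name = name;
            this.storeWorks = storeWorks;
        }

        /** Become active and try the store; drop back to standby on failure. */
        boolean tryBecomeActive() {
            state = HaState.ACTIVE;
            if (!storeWorks) {
                // Stop touching the store instead of treating the error as fatal.
                state = HaState.STANDBY;
                return false;
            }
            return true;
        }
    }

    /**
     * Lets each RM try in turn. Returns the name of the RM that stays active,
     * or null if every RM ended up in standby (the "global issue" case).
     */
    static String electActive(List<Rm> rms) {
        for (Rm rm : rms) {
            if (rm.tryBecomeActive()) {
                return rm.name;
            }
        }
        return null;
    }
}
```

With one misconfigured RM and one healthy RM, electActive returns the healthy 
one and the cluster keeps running. If the store is broken for every RM it 
returns null, all RMs are in standby, and nothing is writing to the store while 
the admin fixes it.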

> All failed RMStateStore operations should not be RMFatalEvents
> --------------------------------------------------------------
>
>                 Key: YARN-1602
>                 URL: https://issues.apache.org/jira/browse/YARN-1602
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 2.4.0
>            Reporter: Karthik Kambatla
>            Assignee: Karthik Kambatla
>            Priority: Critical
>
> Currently, if a state store operation fails, depending on the exception, 
> either an RMFatalEvent.STATE_STORE_FENCED or an 
> RMFatalEvent.STATE_STORE_OP_FAILED event is created. The latter results in 
> the RM failing. Instead, we should probably kill the application 
> corresponding to the store operation. 



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
