[ 
https://issues.apache.org/jira/browse/YARN-1602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13876826#comment-13876826
 ] 

Karthik Kambatla commented on YARN-1602:
----------------------------------------

The above error happens only after running a number of Oozie jobs on the RM for 
a while, so I don't think it is due to bad configuration. Transitioning both 
RMs to Standby would therefore only result in the two RMs alternating as Active 
until the application gets killed for exceeding max-attempts. The only downside 
I see is that other applications might also be killed in the process.

bq. The RMs will stop touching the store and the admin can fix it.
The admin might be able to fix it by explicitly deleting some znodes from the 
store, but that would require understanding the store layout. 

Let me investigate further and see what the underlying cause of this issue is. 
Maybe that would simplify what we should do in such cases.


> All failed RMStateStore operations should not be RMFatalEvents
> --------------------------------------------------------------
>
>                 Key: YARN-1602
>                 URL: https://issues.apache.org/jira/browse/YARN-1602
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 2.4.0
>            Reporter: Karthik Kambatla
>            Assignee: Karthik Kambatla
>            Priority: Critical
>
> Currently, if a state store operation fails, depending on the exception, 
> either an RMFatalEvent.STATE_STORE_FENCED or an 
> RMFatalEvent.STATE_STORE_OP_FAILED event is created. The latter results in 
> the RM failing. Instead, we should probably kill the application 
> corresponding to the store operation. 
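
As a rough illustration of the proposal in the description, the decision could
look something like the sketch below. This is not the actual YARN code: the
class, enum, and exception names are hypothetical stand-ins. Two things are
assumptions rather than statements from this issue: that the fenced case makes
the RM step down to standby, and that a failed operation not tied to any
application would still fail the RM.

```java
import java.util.Optional;

/** Hypothetical sketch; none of these types are the real YARN classes. */
final class StoreFailurePolicy {

    /** Possible reactions to a failed state-store operation. */
    enum Action { TRANSITION_TO_STANDBY, KILL_APPLICATION, FAIL_RM }

    /** Stand-in for the exception raised when another RM has fenced this one. */
    static final class StoreFencedException extends Exception {
        StoreFencedException(String message) {
            super(message);
        }
    }

    /**
     * Picks a reaction to a failed store operation.
     *
     * @param failure       the exception thrown by the store operation
     * @param applicationId the application the operation belonged to, if any
     */
    static Action decide(Exception failure, Optional<String> applicationId) {
        if (failure instanceof StoreFencedException) {
            // Assumption: a fenced RM gives up the Active role.
            return Action.TRANSITION_TO_STANDBY;
        }
        if (applicationId.isPresent()) {
            // Proposed behavior: kill only the affected application.
            return Action.KILL_APPLICATION;
        }
        // Assumption: with no application to blame, keep today's behavior.
        return Action.FAIL_RM;
    }

    public static void main(String[] args) {
        // "application_0001" is a made-up id used only for this demo.
        System.out.println(decide(new StoreFencedException("fenced"), Optional.empty()));
        System.out.println(decide(new RuntimeException("write failed"), Optional.of("application_0001")));
        System.out.println(decide(new RuntimeException("write failed"), Optional.empty()));
    }
}
```

Keeping the decision in one pure function means the policy can be checked
without a running store or RM.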



