[ 
https://issues.apache.org/jira/browse/YARN-1602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13873670#comment-13873670
 ] 

Bikas Saha commented on YARN-1602:
----------------------------------

Not all events are app related. Some store secret keys, which cannot be 
ignored. 
What errors are we seeing in the store? If these are non-transient errors, then 
the RM should probably stop. If these are transient errors, then I remember 
discussing this offline with [~vinodkv] and [~jianhe]. The summary is 
that the state store client (e.g. the HDFS client) should retry enough times to 
cover cases of transient errors in the store.
With HA states now, we should ideally not kill the RM but just call 
transitionToStandby().
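
A rough sketch of the policy described above, in case it helps the discussion. 
This is not the actual RMStateStore code: every name here (the class, 
TransientStoreException, Outcome, runStoreOp) is hypothetical, and it assumes 
the store client can tell transient failures apart from non-transient ones.

```java
import java.util.concurrent.Callable;

// Hypothetical sketch only; none of these types exist in YARN.
final class StoreFailurePolicySketch {

    /** Assumed marker for failures worth retrying, e.g. a brief store outage. */
    static class TransientStoreException extends Exception {
        TransientStoreException(String msg) { super(msg); }
    }

    /** Assumed outcomes of a store operation; not real YARN event types. */
    enum Outcome { STORED, TRANSITION_TO_STANDBY, STOP_RM }

    /**
     * Runs a store operation, retrying transient failures up to maxAttempts.
     * If the operation still fails, the RM steps down when HA is enabled
     * and stops otherwise.
     */
    static Outcome runStoreOp(Callable<Void> op, int maxAttempts, boolean haEnabled) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                op.call();
                return Outcome.STORED;
            } catch (TransientStoreException e) {
                // Transient error: fall through and try again.
            } catch (Exception e) {
                // Non-transient error: retrying will not help.
                break;
            }
        }
        return haEnabled ? Outcome.TRANSITION_TO_STANDBY : Outcome.STOP_RM;
    }
}
```

For example, runStoreOp(op, 5, true) would return TRANSITION_TO_STANDBY rather 
than STOP_RM if op keeps failing. A real client would likely also back off 
between attempts, which this sketch leaves out.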

> All failed RMStateStore operations should not be RMFatalEvents
> --------------------------------------------------------------
>
>                 Key: YARN-1602
>                 URL: https://issues.apache.org/jira/browse/YARN-1602
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 2.4.0
>            Reporter: Karthik Kambatla
>            Assignee: Karthik Kambatla
>            Priority: Critical
>
> Currently, if a state store operation fails, depending on the exception, 
> either an RMFatalEvent.STATE_STORE_FENCED or an 
> RMFatalEvent.STATE_STORE_OP_FAILED event is created. The latter results in 
> the RM failing. Instead, we should probably kill the application 
> corresponding to the store operation. 



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
