[ https://issues.apache.org/jira/browse/YARN-1602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13873670#comment-13873670 ]
Bikas Saha commented on YARN-1602:
----------------------------------
Not all events are app related. Some store secret keys, and those cannot be
ignored.
What errors are we seeing in the store? If these are non-transient errors, then
the RM should probably stop. If these are transient errors, then I remember
discussing this offline with [~vinodkv] and [~jianhe]. The summary is
that the state store client (e.g. the HDFS client) should retry enough times to
cover transient errors in the store.
With HA now, we should ideally not kill the RM but just
transitionToStandby().
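
A minimal sketch of the policy described above, in case it helps the discussion. Every name here (StoreFailurePolicy, TransientStoreException, withRetries, onFailure, Action) is hypothetical and is not an existing YARN or HDFS API. It assumes the store client can tell transient errors apart from non-transient ones, which is the part that would need real design work.

```java
import java.io.IOException;
import java.util.concurrent.Callable;

/** Hypothetical sketch only; none of these names are actual YARN classes. */
final class StoreFailurePolicy {

    /** What the RM does once a store operation has failed for good. */
    enum Action { TRANSITION_TO_STANDBY, SHUTDOWN }

    /** Assumed marker type for errors that are worth retrying. */
    static class TransientStoreException extends IOException {
        TransientStoreException(String message) {
            super(message);
        }
    }

    /**
     * Runs op, retrying only on TransientStoreException, at most
     * maxAttempts times in total. Any other exception propagates
     * immediately, since retrying a non-transient error does not help.
     */
    static <T> T withRetries(Callable<T> op, int maxAttempts, long sleepMillis)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        for (int attempt = 1; ; attempt++) {
            try {
                return op.call();
            } catch (TransientStoreException e) {
                if (attempt >= maxAttempts) {
                    throw e; // retries exhausted, hand the failure to the caller
                }
                Thread.sleep(sleepMillis);
            }
        }
    }

    /** With HA enabled, step down instead of killing the RM. */
    static Action onFailure(boolean haEnabled) {
        return haEnabled ? Action.TRANSITION_TO_STANDBY : Action.SHUTDOWN;
    }
}
```

A caller would wrap each store write in withRetries(...) and consult onFailure(...) only when that call finally throws, so transient store hiccups never reach the RM's fatal-event handling.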
> All failed RMStateStore operations should not be RMFatalEvents
> --------------------------------------------------------------
>
> Key: YARN-1602
> URL: https://issues.apache.org/jira/browse/YARN-1602
> Project: Hadoop YARN
> Issue Type: Bug
> Components: resourcemanager
> Affects Versions: 2.4.0
> Reporter: Karthik Kambatla
> Assignee: Karthik Kambatla
> Priority: Critical
>
> Currently, if a state store operation fails, then depending on the exception,
> either an RMFatalEvent.STATE_STORE_FENCED or an
> RMFatalEvent.STATE_STORE_OP_FAILED event is created. The latter results in
> the RM failing. Instead, we should probably kill the application
> corresponding to the store operation.