Junping Du commented on YARN-2019:

+1 on the general idea of YARN-3607. However, users actually have three 
options when facing ZKRMStateStore errors:
1. aggressive: fail the RM daemon;
2. conservative: only log these errors, without failing the RM daemon and any 
3. relatively conservative: do not fail the RM, but fail the application in 
some cases (e.g. when the RM gets restarted).
These choices suggest we may not want to force the handling of all failures 
into a single configuration, although I agree we should consolidate as many 
of them as possible, as proposed by YARN-3607. 
In this particular case, I would prefer to add a separate configuration 
(maybe a boolean value for 
"yarn.resourcemanager.state-store.exit-on-error" or an enum value for 
"yarn.resourcemanager.state-store.policy-on-error"?) to let users choose how 
to handle RM state store failures. Users would then still have other options 
for other failure cases.
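
To make the enum variant of the proposal concrete, here is a rough sketch 
of how a value for "yarn.resourcemanager.state-store.policy-on-error" could 
map onto the three options. Everything in it is an assumption for 
illustration: the type names StoreErrorPolicy and StoreErrorHandler, the 
value spellings (fail-rm, log-only, fail-app), and the default are all 
hypothetical and not existing YARN code.

```java
import java.util.Locale;

/** Hypothetical policy values for the proposed setting; not an existing YARN type. */
enum StoreErrorPolicy {
    FAIL_RM,   // option 1: fail the RM daemon
    LOG_ONLY,  // option 2: only log the error
    FAIL_APP;  // option 3: keep the RM up, fail the affected application

    /** Parses a raw config value such as "log-only"; unset falls back to FAIL_RM. */
    static StoreErrorPolicy parse(String value) {
        if (value == null || value.trim().isEmpty()) {
            // Assumed default: keep the crash-on-error behaviour the issue describes.
            return FAIL_RM;
        }
        return valueOf(value.trim().toUpperCase(Locale.ROOT).replace('-', '_'));
    }
}

/** Hypothetical dispatcher: turns a policy into the action to take on a store error. */
final class StoreErrorHandler {
    enum Action { EXIT_RM, LOG, FAIL_APPLICATION }

    static Action decide(StoreErrorPolicy policy) {
        switch (policy) {
            case LOG_ONLY:
                return Action.LOG;
            case FAIL_APP:
                return Action.FAIL_APPLICATION;
            default:
                return Action.EXIT_RM;
        }
    }
}
```

With the boolean variant ("exit-on-error"), only options 1 and 2 could be 
expressed; the enum leaves room for option 3 and for later additions.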

> Retrospect on decision of making RM crashed if any exception throw in 
> ZKRMStateStore
> ------------------------------------------------------------------------------------
>                 Key: YARN-2019
>                 URL: https://issues.apache.org/jira/browse/YARN-2019
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Junping Du
>            Assignee: Jian He
>            Priority: Critical
>              Labels: ha
>         Attachments: YARN-2019.1-wip.patch
> Currently, if anything abnormal happens in ZKRMStateStore, it throws a fatal 
> exception that crashes the RM. As shown in YARN-1924, the cause could be an 
> internal bug in RM HA itself rather than a truly fatal condition. We should 
> revisit this decision, as the HA feature is designed to protect key 
> components, not disrupt them.

