[ 
https://issues.apache.org/jira/browse/YARN-2019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14734100#comment-14734100
 ] 

Bikas Saha commented on YARN-2019:
----------------------------------

Sorry for coming in late on this. There would be 2 kinds of state store 
operations - reads and writes. Writes may be of 2 kinds - critical and 
non-critical. E.g. saving an application submission is critical. Saving node 
information is perhaps not critical. It would affect system correctness if 
critical write operation errors are allowed to be ignored. We end up with 
YARN-4118 and other such potential issues. Essentially we are turning a 
write-ahead log into something that is optional. I don't see how the system 
can make stable reliability guarantees if write-ahead log failures are made 
non-fatal.
On the other hand, read errors or non-critical write errors should not block 
RM progress but do potentially need to be retried. That also does not seem to 
be addressed in the patch.
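
To make the distinction concrete, here is a minimal sketch of the policy 
described above. It is not the RM or ZKRMStateStore code and not what the 
patch does. The names OpKind, FatalStoreError, run_store_op and 
drain_retries are hypothetical, and treating OSError as the store failure is 
an assumption made only for illustration. A failed critical write is 
escalated as fatal, while a failed read or non-critical write is queued for a 
later retry so that it does not block progress.

```python
from enum import Enum


class OpKind(Enum):
    CRITICAL_WRITE = "critical_write"          # e.g. saving an application submission
    NON_CRITICAL_WRITE = "non_critical_write"  # e.g. saving node information
    READ = "read"


class FatalStoreError(Exception):
    """Hypothetical error: a critical write could not be persisted."""


def run_store_op(kind, op, retry_queue):
    """Run op() once. Returns True on success, False if queued for retry.

    A failed critical write is never ignored: it is raised as fatal.
    Any other failed operation is appended to retry_queue instead.
    """
    try:
        op()
        return True
    except OSError as err:  # assumption: store failures surface as OSError
        if kind is OpKind.CRITICAL_WRITE:
            raise FatalStoreError("critical write failed") from err
        retry_queue.append((kind, op))
        return False


def drain_retries(retry_queue):
    """Retry every queued operation once; return how many are still pending."""
    pending = list(retry_queue)
    retry_queue.clear()
    for kind, op in pending:
        run_store_op(kind, op, retry_queue)  # re-queues itself on failure
    return len(retry_queue)
```

A caller would invoke drain_retries periodically, for example from a 
background thread. How often to retry and when to give up are left open here, 
as they are in the discussion above.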

> Retrospect on decision of making RM crashed if any exception throw in 
> ZKRMStateStore
> ------------------------------------------------------------------------------------
>
>                 Key: YARN-2019
>                 URL: https://issues.apache.org/jira/browse/YARN-2019
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Junping Du
>            Assignee: Jian He
>            Priority: Critical
>              Labels: ha
>             Fix For: 2.8.0, 2.7.2, 2.6.2
>
>         Attachments: YARN-2019.1-wip.patch, YARN-2019.patch, YARN-2019.patch
>
>
> Currently, if anything abnormal happens in ZKRMStateStore, it throws a fatal 
> exception that crashes the RM. As shown in YARN-1924, the cause could be an 
> internal RM HA bug rather than a truly fatal condition. We should revisit 
> this decision, as the HA feature is designed to protect key components, not 
> to disrupt them.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
