[jira] [Comment Edited] (YARN-5694) ZKRMStateStore should always start its verification thread to prevent accidental state store corruption

Jian He (JIRA) Thu, 10 Nov 2016 10:10:21 -0800

    [ 
https://issues.apache.org/jira/browse/YARN-5694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15654697#comment-15654697
 ]


Jian He edited comment on YARN-5694 at 11/10/16 6:09 PM:
---------------------------------------------------------

bq. If we agree that it's bad to have two RMs accidentally sharing the same 
state store, 
In non-HA mode, there is currently no protection in the ZKStore preventing 
two RMs from sharing the same store. All of the ACL-setting code is used only 
in HA mode. So with the current patch, I doubt the verifyThread will ever get 
a NoAuthException unless the user changes the ACLs manually, which means the 
handling code in this patch will not be triggered with the default settings. 
Maybe I'm wrong; you may want to try it on a real cluster. Also, I think 
setting ACLs for the RM is not a required step for deploying a non-HA cluster, 
so forcing this to be set is a behavior change.
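
To illustrate the concern, here is a minimal sketch using a toy in-memory 
model, not the real ZKRMStateStore or ZooKeeper API. The class names, the 
fence() step and the verifyOnce() helper are all hypothetical. It assumes the 
verification check can only fail once some RM has claimed exclusive write 
access, which per the above happens only on the HA path.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model only; NOT the real ZKRMStateStore. All names here are hypothetical. */
final class SharedStoreSketch {

    /** Stands in for ZooKeeper's NoAuthException in this toy model. */
    static final class NoAuthException extends Exception {
        NoAuthException(String msg) {
            super(msg);
        }
    }

    /** A store root whose write permission is either open or owned by one RM. */
    static final class Store {
        // null models open ACLs (assumed to be the non-HA default, as the comment argues).
        private String exclusiveOwner;
        private final Map<String, String> data = new HashMap<>();

        /** Models the HA-only step in which an RM claims exclusive write access. */
        void fence(String rmId) {
            exclusiveOwner = rmId;
        }

        /** Fails only if a different RM currently holds exclusive access. */
        void write(String rmId, String key, String value) throws NoAuthException {
            if (exclusiveOwner != null && !exclusiveOwner.equals(rmId)) {
                throw new NoAuthException(rmId + " may not write to this store");
            }
            data.put(key, value);
        }
    }

    /** One iteration of a hypothetical verification loop; true means rmId can still write. */
    static boolean verifyOnce(Store store, String rmId) {
        try {
            store.write(rmId, "verify-" + rmId, "ping");
            return true;
        } catch (NoAuthException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Non-HA model: nobody fenced the store, so both RMs pass the check.
        Store nonHa = new Store();
        System.out.println("non-HA rm1 ok: " + verifyOnce(nonHa, "rm1"));
        System.out.println("non-HA rm2 ok: " + verifyOnce(nonHa, "rm2"));

        // HA model: rm2 fenced the store, so rm1's check now fails.
        Store ha = new Store();
        ha.fence("rm2");
        System.out.println("HA rm1 ok: " + verifyOnce(ha, "rm1"));
    }
}
```

In this model the periodic check detects the second RM only in the fenced 
case. Whether the real store behaves the same way with default settings would 
still need to be confirmed on a cluster.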

bq. why would you not want to catch the issue as early as possible?
My point is, first, will this code work at all, given the above? Second, if 
there's no difference in terms of functionality, why do I need to start a 
thread that pings ZK continuously every few seconds? Of course, I might be 
missing something; please clarify.

Also, is the use case mainly about two clusters sharing the same ZK store with 
the same path? IMHO, this is not a primary use case to solve; if the user 
misconfigured it, it's the user's fault. There are many other places where 
things can go wrong, e.g. if two clusters configure the same path for anything 
on HDFS.

If the use case is about two RMs sharing the same ZK path in the same cluster 
in non-HA mode, I think the invalid RM will not take any workload in the first 
place: clients and NMs will not switch to that RM if HA is not configured 
properly. 



> ZKRMStateStore should always start its verification thread to prevent 
> accidental state store corruption
> -------------------------------------------------------------------------------------------------------
>
>                 Key: YARN-5694
>                 URL: https://issues.apache.org/jira/browse/YARN-5694
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 3.0.0-alpha1
>            Reporter: Daniel Templeton
>            Assignee: Daniel Templeton
>            Priority: Critical
>              Labels: oct16-medium
>         Attachments: YARN-5694.001.patch, YARN-5694.002.patch, 
> YARN-5694.003.patch, YARN-5694.004.patch, YARN-5694.004.patch, 
> YARN-5694.005.patch, YARN-5694.006.patch, YARN-5694.007.patch, 
> YARN-5694.branch-2.7.001.patch, YARN-5694.branch-2.7.002.patch
>
>
> There are two cases.  In branch-2.7, the 
> {{ZKRMStateStore.VerifyActiveStatusThread}} is always started, even when 
> using embedded or Curator failover.  In branch-2.8, the 
> {{ZKRMStateStore.VerifyActiveStatusThread}} is only started when HA is 
> disabled, which makes no sense.  Based on the JIRA that introduced that 
> change (YARN-4559), I believe the intent was to start it only when embedded 
> failover is disabled.
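
The three start conditions described in the issue can be written out as 
predicates. This is a sketch for comparison only: the enum, the method and the 
two boolean parameters are hypothetical names, not the actual ZKRMStateStore 
code or configuration keys, and the INTENDED case reflects only the 
reporter's stated belief about YARN-4559.

```java
/** Illustrative only; names are hypothetical and not taken from the YARN source. */
final class VerifyThreadPolicySketch {

    enum Policy {
        BRANCH_2_7, // described as: always started
        BRANCH_2_8, // described as: started only when HA is disabled
        INTENDED    // reporter's reading: started only when embedded failover is disabled
    }

    /**
     * Returns whether the verification thread would be started under the given policy.
     * Callers are assumed to pass embeddedFailoverEnabled = false when HA is off.
     */
    static boolean startsVerifyThread(Policy policy,
                                      boolean haEnabled,
                                      boolean embeddedFailoverEnabled) {
        switch (policy) {
            case BRANCH_2_7:
                return true;
            case BRANCH_2_8:
                return !haEnabled;
            case INTENDED:
                return !embeddedFailoverEnabled;
            default:
                throw new IllegalArgumentException("unknown policy: " + policy);
        }
    }

    public static void main(String[] args) {
        // HA enabled with a non-embedded failover mechanism: the case where
        // the branch-2.8 behavior and the intended behavior differ.
        for (Policy p : Policy.values()) {
            System.out.println(p + " -> " + startsVerifyThread(p, true, false));
        }
    }
}
```

The configurations where BRANCH_2_8 and INTENDED disagree are the ones the 
issue calls out.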


