[ https://issues.apache.org/jira/browse/YARN-4438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15069202#comment-15069202 ]

Karthik Kambatla commented on YARN-4438:
----------------------------------------

bq. And because ZKRMStateStore is currently in active service, it cannot be 
simply moved to AlwaysOn service. So, I'd like to do it separately to minimize 
the core change in this jira.
Fine with separate JIRA. Not sure I understand why ZKRMStateStore needs to be 
an AlwaysOn service. 

bq. I'd like to change this part for RM to not refresh the configs if shared 
storage based config provider is not enabled.
I was never a fan of the shared-storage-configuration stuff. Now that we have 
it, I don't think we can get rid of it until Hadoop 4. How would this change 
look? The RM has an instance of the elector; every time we transition to 
active, would either the RM or the elector check whether the shared-storage 
config provider is enabled and call refresh? 

But yeah, I do see the point of calling these methods directly from RM. 
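
To make the question concrete, here is a minimal sketch of what I have in mind, 
assuming the check lives in the RM's own transition path; the method and field 
names (becomeActive, adminService, refreshAll, requestInfo) are placeholders, 
not the actual patch:

{code:java}
// Sketch only: refresh configs on becoming active only when the
// shared-storage-based configuration provider is enabled.
// adminService, requestInfo and refreshAll() are assumed members/helpers.
void becomeActive() throws Exception {
  Configuration conf = getConfig();
  String provider =
      conf.get("yarn.resourcemanager.configuration.provider-class");
  // With the default local provider, the files the RM started with are
  // already current, so refreshing on every transition buys nothing.
  if (provider != null
      && provider.contains("FileSystemBasedConfigurationProvider")) {
    adminService.refreshAll();
  }
  adminService.transitionToActive(requestInfo);
}
{code}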

bq. To avoid a busy loop and rejoining immediately. 
If we rejoin immediately, one of the RMs would become Active. It is not like 
the RM is going to use the cycles for anything else if we sleep. Is the concern 
that Curator may be biased in picking an RM in certain conditions? 
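
For reference, the rejoin path I am picturing looks roughly like this; it is 
only a sketch, with curatorClient, latchPath, rmId and electorListener assumed 
to be fields of the elector:

{code:java}
private synchronized void rejoinElection() throws Exception {
  if (leaderLatch != null) {
    try {
      leaderLatch.close(LeaderLatch.CloseMode.SILENT);  // drop old candidacy
    } catch (IOException e) {
      LOG.warn("Error closing old leader latch", e);
    }
  }
  // No sleep before rejoining: if this RM cannot win, Curator elects the
  // other one; sleeping only delays the point at which some RM is active.
  leaderLatch = new LeaderLatch(curatorClient, latchPath, rmId);
  leaderLatch.addListener(electorListener);
  leaderLatch.start();
}
{code}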

bq. What do you mean by force give-up ? exit RM ?
If leaderLatch.close() throws an exception, when does Curator realize the RM is 
no longer participating in the election? If it doesn't, it might keep electing 
the same RM active. How do we handle this, and how long a wait is acceptable? 
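
To spell out what I mean by "force give-up", something along these lines; 
purely a sketch, and MAX_CLOSE_ATTEMPTS plus the exit are assumptions about 
policy, not Curator behaviour:

{code:java}
private void giveUpLeadership() {
  for (int attempt = 0; attempt < MAX_CLOSE_ATTEMPTS; attempt++) {
    try {
      // close() deletes our latch node so the other RM can win;
      // NOTIFY_LEADER additionally fires notLeader() on our listeners.
      leaderLatch.close(LeaderLatch.CloseMode.NOTIFY_LEADER);
      return;
    } catch (IOException e) {
      LOG.warn("Failed to close leader latch (attempt " + attempt + ")", e);
    }
  }
  // If close() keeps failing, our ephemeral node may linger until the ZK
  // session expires and Curator may keep electing this RM; exiting is the
  // blunt but safe fallback.
  ExitUtil.terminate(1, "Could not relinquish RM leadership");
}
{code}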

bq. Even though RM remains at standby, all services should be already shutdown, 
so there's no harm to the end users ?
Agree, there is no harm. My concern is about availability - having one of the 
RMs active "most" of the time. 

bq. I have one question about ActiveStandbyCheckThread. if we make zkStateStore 
and elector to share the same zkClient, do we still need the 
ActiveStandbyCheckThread ? the elector itself should get notification when the 
connection is lost.

Are you referring to the VerifyActiveStatusThread? The connection can be 
restored even after the RM has lost leadership. We could actively stop the 
store if it hasn't already stopped; since the store would already have been 
fenced, we don't run the risk of corrupting it. So, you are right, we might 
not need that thread.
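
In other words, a listener along these lines should cover what 
VerifyActiveStatusThread does today; this is only a sketch, and the transition 
calls stand in for the RM's existing transition logic:

{code:java}
leaderLatch.addListener(new LeaderLatchListener() {
  @Override
  public void isLeader() {
    // Curator says we won the election.
    transitionToActive();
  }

  @Override
  public void notLeader() {
    // Fired when leadership is lost (e.g. ZK session expiry). Going to
    // standby stops the active services, including the state store; the
    // store is fenced by then, so a later reconnect cannot corrupt it.
    transitionToStandby();
  }
});
{code}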

bq. This is currently what EmbeddedElectorService is doing. If the leadership 
is already lost from zk's perspective, the other RM should take up the 
leadership
You are right, it isn't a big deal. I just realized EmbeddedElectorService does 
the same today. I haven't looked at Curator's LeaderLatch code. What happens if 
this RM is subsequently elected leader? Does the transition to Active succeed 
just fine, or is it possible it gets stuck in a way that it can't transition to 
active? If it gets into such a situation, we should consider crashing it 
altogether. 
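
If it can get stuck, the isLeader() callback could guard against that roughly 
as below; this is an assumption about how we might wire it, not something 
Curator does for us:

{code:java}
@Override
public void isLeader() {
  try {
    transitionToActive();
  } catch (Exception e) {
    LOG.error("Elected leader but failed to transition to active", e);
    // If the RM can neither become active nor cleanly rejoin the election,
    // crashing lets the other RM (or a restart of this one) take over.
    ExitUtil.terminate(1, "RM stuck while transitioning to active");
  }
}
{code}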

bq. I think leaderLatch could never be null ?
Seeing all the NPEs we have in RM/Scheduler, I would like for us to err on the 
side of caution and do null-checks. If not, we at least need to make it 
consistent everywhere. 
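
For consistency, every use could go through a small guard like this (just a 
sketch):

{code:java}
private synchronized void closeLeaderLatch() {
  if (leaderLatch == null) {   // null-check before every use, consistently
    return;
  }
  try {
    leaderLatch.close();
  } catch (IOException e) {
    LOG.warn("Error closing leader latch", e);
  }
}
{code}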

bq. Why does it need to be called outside of if (state == 
HAServiceProtocol.HAServiceState.ACTIVE) ? This is a fresh start, it does not 
need to call reinitialize.
You are right. Sorry for the noise, clearly it has been a while since I looked 
at this code. 

> Implement RM leader election with curator
> -----------------------------------------
>
>                 Key: YARN-4438
>                 URL: https://issues.apache.org/jira/browse/YARN-4438
>             Project: Hadoop YARN
>          Issue Type: Improvement
>            Reporter: Jian He
>            Assignee: Jian He
>         Attachments: YARN-4438.1.patch, YARN-4438.2.patch, YARN-4438.3.patch
>
>
> This is to implement RM leader election with Curator instead of the 
> ActiveStandbyElector from the common package; this also avoids adding more 
> configs in common to suit RM's own needs. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
