[
https://issues.apache.org/jira/browse/YARN-4438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15069202#comment-15069202
]
Karthik Kambatla commented on YARN-4438:
----------------------------------------
bq. And because ZKRMStateStore is currently in active service, it cannot be
simply moved to AlwaysOn service. So, I'd like to do it separately to minimize
the core change in this jira.
Fine with a separate JIRA. I am not sure I understand why ZKRMStateStore needs
to be an AlwaysOn service.
bq. I'd like to change this part for RM to not refresh the configs if shared
storage based config provider is not enabled.
I was never a fan of the shared-storage-configuration stuff. Now that we have
it, I don't think we can get rid of it until Hadoop 4. How would this change
look? The RM has an instance of the elector; every time we transition to
active, will either the RM or the elector check whether
shared-storage-config-provider is enabled and, if so, call refresh?
But yeah, I do see the point of calling these methods directly from RM.
bq. To avoid a busy loop and rejoining immediately.
If we rejoin immediately, one of the RMs would become Active. It is not like
the RM is going to use the cycles for anything else if we sleep. Is the concern
that Curator may be biased in picking an RM in certain conditions?
bq. What do you mean by force give-up ? exit RM ?
If leaderLatch.close() throws an exception, when does Curator realize the RM is
no longer participating in the election? If it never does, it might keep
electing the same RM as active. How do we handle this, and how long a wait is
okay?
bq. Even though RM remains at standby, all services should be already shutdown,
so there's no harm to the end users ?
Agree, there is no harm. My concern is about availability - having one of the
RMs active "most" of the time.
bq. I have one question about ActiveStandbyCheckThread. if we make zkStateStore
and elector to share the same zkClient, do we still need the
ActiveStandbyCheckThread ? the elector itself should get notification when the
connection is lost.
Are you referring to the VerifyActiveStatusThread? When the RM loses
leadership, the connection can still be restored afterwards. We could actively
stop the store if it hasn't already stopped. The store would have already been
fenced, so we don't run the risk of corrupting it. So, you are right, we might
not need that thread.
bq. This is currently what EmbeddedElectorService is doing. If the leadership
is already lost from zk's perspective, the other RM should take up the
leadership
You are right, it isn't a big deal. I just realized EmbeddedElectorService does
the same today. I haven't looked at Curator's LeaderLatch code. What happens if
this RM is subsequently elected leader? Does the transition to Active succeed
just fine? Or is it possible for it to get stuck such that it can't transition
to active? If it gets into such a situation, we should consider crashing it
altogether.
bq. I think leaderLatch could never be null ?
Seeing all the NPEs we have in RM/Scheduler, I would like for us to err on the
side of caution and do null-checks. If not, we at least need to make it
consistent everywhere.
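A rough sketch of the kind of defensive close I have in mind, assuming the
latch can be treated as a plain java.io.Closeable. LatchCloser and
closeQuietly are hypothetical names for illustration, not something in the
patch:

```java
import java.io.Closeable;
import java.io.IOException;

public class LatchCloser {
    /**
     * Closes the latch if one exists.
     * Returns true if there was nothing to close or close() succeeded,
     * and false if close() threw, so the caller can decide how to react.
     */
    public static boolean closeQuietly(Closeable latch) {
        if (latch == null) {
            return true; // never started, nothing to give up
        }
        try {
            latch.close();
            return true;
        } catch (IOException e) {
            // The caller still has to decide what to do here (retry, exit, ...).
            return false;
        }
    }
}
```

What the caller does on a false return is exactly the open question above
about close() throwing.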
bq. Why does it need to be called outside of if (state ==
HAServiceProtocol.HAServiceState.ACTIVE) ? This is a fresh start, it does not
need to call reinitialize.
You are right. Sorry for the noise, clearly it has been a while since I looked
at this code.
> Implement RM leader election with curator
> -----------------------------------------
>
> Key: YARN-4438
> URL: https://issues.apache.org/jira/browse/YARN-4438
> Project: Hadoop YARN
> Issue Type: Improvement
> Reporter: Jian He
> Assignee: Jian He
> Attachments: YARN-4438.1.patch, YARN-4438.2.patch, YARN-4438.3.patch
>
>
> This is to implement the leader election with Curator instead of the
> ActiveStandbyElector from the common package. This also avoids adding more
> configs in common to suit RM's own needs.