[
https://issues.apache.org/jira/browse/YARN-1029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13856625#comment-13856625
]
Karthik Kambatla commented on YARN-1029:
----------------------------------------
Thanks Bikas.
bq. I see that the patch has increased the node manager connect time in the
test from 5s to 11s. It's not clear to me how the test earlier worked or works
now.
Good catch! I forgot to reset the node manager connect time to 5s, and even
that value is a guess. Based on your earlier comment about the ZK-interval
being higher than the time we wait for failover, I ran the test several times
and found that it fails intermittently, which makes sense. So, now,
{{MiniYARNCluster#getActiveRMIndex()}} waits for 10 seconds (by default) for
the RM to fail over, and the NM/client connection verification need not wait
for the failover itself.
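To illustrate the bounded wait described above, here is a minimal sketch of polling for an active index with a timeout. It is not the code from the patch: the class {{FailoverWait}}, the method {{waitForActiveIndex}}, and the probe parameter are hypothetical names, and returning -1 on timeout is an assumption made for the example.

```java
import java.util.function.IntSupplier;

public class FailoverWait {
    /**
     * Hypothetical helper (not part of MiniYARNCluster): polls the probe
     * until it reports a non-negative index or the timeout elapses.
     * Returns -1 if no active index was seen in time.
     */
    public static int waitForActiveIndex(IntSupplier probe, long timeoutMs, long pollMs) {
        long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
        while (true) {
            int index = probe.getAsInt();
            if (index >= 0) {
                return index;           // an active instance was found
            }
            if (System.nanoTime() >= deadline) {
                return -1;              // gave up: failover did not finish in time
            }
            try {
                Thread.sleep(pollMs);   // back off before probing again
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return -1;              // treat interruption like a timeout
            }
        }
    }
}
```

With the wait folded into the lookup like this, a caller that verifies NM/client connectivity only needs its own connect timeout, not an additional allowance for the failover.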
bq. Should we clear or fail to start? The data seems to be in error.
Good point! When using ZKFC, the format command takes care of clearing the
parentZNode. Ideally, we should have a similar option to be able to format the
znode, and the elector should fail if the znode is not safe. In the next patch,
let me fail to start and open another JIRA for adding the admin option.
bq. This method used to be synchronized
The method should have been synchronized in YARN-1481; [~vinodkv] and I thought
we could handle it here. I could do an addendum patch there instead, if that is
preferred.
bq. Is it necessary to mention ZKFC here?
Yes. The alternative is to add another RequestSource or rename the ZKFC source
to something else. I think it is okay to leave it as is.
bq. Can we share the terminate functionality with
RMStateStoreOperationFailedEventDispatcher in a common function?
Not sure how advantageous that will be. We'll end up calling the common method
instead of ExitUtil.terminate, only for the common method to call it. Also,
getCause() doesn't exist in AbstractEvent, requiring us to add a new kind of
event (CausedEvent?) that both these events extend. Seems too complicated for
the gain.
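For reference, a rough sketch of what the shared-handler alternative would involve. Everything here is hypothetical: {{BaseEvent}} is a stand-in for AbstractEvent, {{CausedEvent}} is the extra type floated above, the two event classes are placeholders, and {{Terminator}} stands in for the call to ExitUtil.terminate so the example is self-contained.

```java
public class CausedEventSketch {
    /** Stand-in for YARN's AbstractEvent; not the real class. */
    public abstract static class BaseEvent { }

    /** Hypothetical intermediate type whose only job is to carry a cause. */
    public abstract static class CausedEvent extends BaseEvent {
        private final Exception cause;
        protected CausedEvent(Exception cause) { this.cause = cause; }
        public Exception getCause() { return cause; }
    }

    /** Placeholder for the state-store failure event. */
    public static final class StoreFailedEvent extends CausedEvent {
        public StoreFailedEvent(Exception cause) { super(cause); }
    }

    /** Placeholder for an embedded-elector failure event. */
    public static final class ElectorFailedEvent extends CausedEvent {
        public ElectorFailedEvent(Exception cause) { super(cause); }
    }

    /** Stand-in for the process-exit call, so the sketch can run anywhere. */
    public interface Terminator {
        void terminate(int status, String message);
    }

    /** The "common function": a single line that forwards to the terminator. */
    public static void handleFatal(CausedEvent event, Terminator terminator) {
        terminator.terminate(1, String.valueOf(event.getCause().getMessage()));
    }
}
```

The shared method is one forwarding call, while the new base type and both event classes exist only to support it, which is the trade-off described above.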
bq. We probably need some unified method of notifying the RM about something
bad. One example being embedded leader election reporting an error. Else we may
end up with a proliferation of event handlers.
Agree 100%, but I would like to do it lazily, when another such case pops up.
[~vinodkv], [~sandyr] - can one of you take a look at least at the
MiniYARNCluster changes?
> Allow embedding leader election into the RM
> -------------------------------------------
>
> Key: YARN-1029
> URL: https://issues.apache.org/jira/browse/YARN-1029
> Project: Hadoop YARN
> Issue Type: Sub-task
> Reporter: Bikas Saha
> Assignee: Karthik Kambatla
> Attachments: embedded-zkfc-approach.patch, yarn-1029-0.patch,
> yarn-1029-0.patch, yarn-1029-1.patch, yarn-1029-2.patch, yarn-1029-3.patch,
> yarn-1029-approach.patch
>
>
> It should be possible to embed the common ActiveStandbyElector into the RM
> such that ZooKeeper-based leader election and notification is built in. In
> conjunction with a ZK state store, this configuration will be a simple
> deployment option.
--
This message was sent by Atlassian JIRA
(v6.1.5#6160)