[
https://issues.apache.org/jira/browse/YARN-1029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13853970#comment-13853970
]
Bikas Saha commented on YARN-1029:
----------------------------------
Why is fencing configurable when the ZK store is self-fenced? I don't think we need to
add any fencing-related code for the embedded FC, except for a dummy fencer to pass
into the elector code.
{code}+ public static final String RM_HA_FENCER = RM_HA_PREFIX +
"fencer";{code}
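The dummy fencer idea above can be sketched as follows. This is an illustrative stand-in, not Hadoop's actual FenceMethod/NodeFencer API; the local Fencer interface and class names are assumptions for the sketch.

```java
// Sketch: a no-op fencer for the embedded elector. The Fencer interface
// below is a local stand-in, not Hadoop's real fencing API.
interface Fencer {
    boolean fence(Object target);
}

// Since the ZK-based store is self-fenced, the embedded FC never needs to
// actually fence the old active; an "always succeeds" fencer satisfies
// the elector's contract.
class DummyFencer implements Fencer {
    @Override
    public boolean fence(Object target) {
        // Nothing to do: the ZK store rejects writes from a deposed active.
        return true;
    }
}

public class DummyFencerDemo {
    public static void main(String[] args) {
        Fencer f = new DummyFencer();
        System.out.println(f.fence(null)); // prints "true"
    }
}
```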
Can we please consolidate all ZK configs in one place in the file?
Isn't rmId alone enough, since the rest of this is available from config? The port is,
in any case, only one of many RM ports.
{code}+ required int32 port = 1;
+ required string hostname = 2;
+ required string clusterid = 3;
+ required string rmId = 4;{code}
There is a separate JIRA open to add a cluster-id.
Why was the synchronized dropped?
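To illustrate the "rmId is enough" point: the receiving side could resolve everything else from shared config keyed by rmId. A sketch, where the per-RM suffixed key name and the Map standing in for Configuration are assumptions:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch: resolving an RM's address from just its rmId plus shared config.
// The key scheme ("yarn.resourcemanager.address." + rmId) is modeled on
// YARN's per-RM suffixed config convention; Map stands in for Configuration.
public class RmIdLookup {
    public static String addressFor(Map<String, String> conf, String rmId) {
        return conf.get("yarn.resourcemanager.address." + rmId);
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        conf.put("yarn.resourcemanager.address.rm1", "host1:8032");
        conf.put("yarn.resourcemanager.address.rm2", "host2:8032");
        System.out.println(addressFor(conf, "rm1")); // prints "host1:8032"
    }
}
```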
{code}- private synchronized boolean isRMActive() {{code}
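For context on why the synchronized matters here: HA state is written by elector callbacks and read by RPC handlers on other threads. A minimal sketch with illustrative names (not the actual RM fields):

```java
// Sketch: the active/standby state is written by one thread (elector
// callback) and read by others (RPC handlers), so the check and the
// transition should share a lock, or the field must be safely published.
class HAState {
    private String state = "STANDBY";

    synchronized void transitionToActive() { state = "ACTIVE"; }

    // Without synchronized (or volatile), this read may observe a stale value.
    synchronized boolean isActive() { return "ACTIVE".equals(state); }
}

public class SyncDemo {
    public static void main(String[] args) {
        HAState s = new HAState();
        s.transitionToActive();
        System.out.println(s.isActive()); // prints "true"
    }
}
```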
There is no fencer in embedded election, right?
{code}+ @Override
+ public void becomeStandby() {
+ try {
+ rm.transitionToStandby(true);
+ } catch (Exception e) {
+ // Log the exception. The fencer should be able to fence this node
+ LOG.error("RM could not transition to Standby mode", e);
+ }
+ }{code}
This is probably not enough; we need to notify the RM.
{code}@Override
+ public void notifyFatalError(String errorMessage) {
+ LOG.fatal("Received " + errorMessage);
+ throw new YarnRuntimeException(errorMessage);
+ }{code}
This should be empty; there is no fencing in embedded election because the ZK store
is self-fenced.
{code}@Override
+ public void fenceOldActive(byte[] oldActiveData) {
+ RMHAServiceTarget target = dataToTarget(oldActiveData);
+
+ try {
+ target.checkFencingConfigured();
+ } catch (BadFencingConfigurationException e) {
+ throw new YarnBadConfigurationException(e.getMessage());
+ }
+
+ if (!target.getFencer().fence(target)) {
+ throw new YarnRuntimeException("Could not fence old active");
+ }
+ }{code}
I didn't quite get the purpose of the new thread. Why can we not call
elector.joinElection() in serviceStart()? There is no need for us to loop and
keep calling joinElection() in a thread.
Can we use the newly created HAUtil helper methods?
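The suggested single-call flow could look like the sketch below. MiniElector and ElectorService are stand-ins (the real class is Hadoop's ActiveStandbyElector, which re-joins on ZK session events itself, so no retry loop is needed on the caller's side):

```java
// Sketch: join the election once from serviceStart() instead of looping in a
// dedicated thread. MiniElector stands in for Hadoop's ActiveStandbyElector.
class MiniElector {
    boolean joined = false;
    void joinElection(byte[] data) { joined = true; }
}

class ElectorService {
    private final MiniElector elector = new MiniElector();
    private final byte[] localActiveData;

    ElectorService(byte[] localActiveData) { this.localActiveData = localActiveData; }

    // One call is enough: the elector re-joins on ZK session events itself.
    void serviceStart() {
        elector.joinElection(localActiveData);
    }

    boolean hasJoined() { return elector.joined; }
}

public class ServiceStartDemo {
    public static void main(String[] args) {
        ElectorService s = new ElectorService(new byte[0]);
        s.serviceStart();
        System.out.println(s.hasJoined()); // prints "true"
    }
}
```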
{code}
+ if (conf.getBoolean(YarnConfiguration.AUTO_FAILOVER_ENABLED,
+ YarnConfiguration.DEFAULT_AUTO_FAILOVER_ENABLED)) {
+ // Automatic failover enabled
+ if (conf.getBoolean(YarnConfiguration.AUTO_FAILOVER_EMBEDDED,
+ YarnConfiguration.DEFAULT_AUTO_FAILOVER_EMBEDDED)) {
+ // Embedded automatic failover enabled
+ electorService = createRMZKActiveStandbyElectorService();
+ addIfService(electorService);
{code}
In the embedded failover test, how do we know that the ZK-based failover is
being triggered? I did not understand how failover can happen so quickly when
the ZK session timeout is 10s.
IMO the ElectorService should not be calling RM.transitionToActive/Standby; it
should be calling AdminService.transitionToActive/Standby. The AdminService is
the only HA entry point into the system. By calling directly into the RM, we are
breaking the abstraction that everything else is going to follow.
Also, an alternative layering would be to make the ElectorService a member of
the AdminService. There is no need for the main body of the RM to know about
failover or failover controllers (FC). Interaction with any FC for failover is
abstracted in the AdminService. So IMO, if the FC is configured to be embedded,
we can maintain the abstraction and embed it into the AdminService.
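The layering suggested above can be sketched as follows. All names are illustrative stand-ins, not the actual YARN classes; the point is that the elector callback only ever touches the single HA entry point:

```java
// Sketch of the suggested layering: the elector callback talks only to
// AdminService, which owns the transitions; the RM core never sees the FC.
interface HAEntryPoint {
    void transitionToActive();
    void transitionToStandby();
}

class AdminService implements HAEntryPoint {
    String state = "STANDBY";
    // The embedded elector lives inside AdminService, so only this class
    // knows a failover controller exists.
    final ElectorCallback elector = new ElectorCallback(this);

    public void transitionToActive() { state = "ACTIVE"; }
    public void transitionToStandby() { state = "STANDBY"; }
}

class ElectorCallback {
    private final HAEntryPoint admin;
    ElectorCallback(HAEntryPoint admin) { this.admin = admin; }

    // Invoked by the leader-election machinery; it never calls into the RM
    // directly, only into the single HA entry point.
    void becomeActive() { admin.transitionToActive(); }
    void becomeStandby() { admin.transitionToStandby(); }
}

public class LayeringDemo {
    public static void main(String[] args) {
        AdminService admin = new AdminService();
        admin.elector.becomeActive();
        System.out.println(admin.state); // prints "ACTIVE"
    }
}
```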
> Allow embedding leader election into the RM
> -------------------------------------------
>
> Key: YARN-1029
> URL: https://issues.apache.org/jira/browse/YARN-1029
> Project: Hadoop YARN
> Issue Type: Sub-task
> Reporter: Bikas Saha
> Assignee: Karthik Kambatla
> Attachments: embedded-zkfc-approach.patch, yarn-1029-0.patch,
> yarn-1029-0.patch, yarn-1029-1.patch, yarn-1029-approach.patch
>
>
> It should be possible to embed common ActiveStandyElector into the RM such
> that ZooKeeper based leader election and notification is in-built. In
> conjunction with a ZK state store, this configuration will be a simple
> deployment option.