[jira] [Commented] (YARN-1029) Allow embedding leader election into the RM

Karthik Kambatla (JIRA) Wed, 25 Dec 2013 09:57:51 -0800

    [ 
https://issues.apache.org/jira/browse/YARN-1029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13856625#comment-13856625
 ]


Karthik Kambatla commented on YARN-1029:
----------------------------------------

Thanks Bikas.

bq. I see that the patch has increased the node manager connect time in the 
test from 5s to 11s. It's not clear to me how the test earlier worked or works 
now.
Good catch! I forgot to reset the node manager connect time to 5s, and even 
that value is a guess. Based on your earlier comment about the ZK interval 
being higher than the time we wait for failover, I ran the test several times 
and found that it fails intermittently, which makes sense. So now 
{{MiniYARNCluster#getActiveRMIndex()}} waits up to 10 seconds (by default) for 
the RM to fail over, and the NM/client connection verification need not wait 
for the failover itself. 
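
For illustration, the wait described above can be sketched as a bounded polling loop. This is a minimal, self-contained sketch and not the code in the patch: the class {{ActiveRmPoller}}, the method {{waitForActiveIndex}}, and the {{isActive}} callback are hypothetical names, and the 10-second default would be passed in as {{timeoutMs}}.

```java
import java.util.function.IntPredicate;

/** Hypothetical helper; it only mimics the idea of waiting for an active RM. */
final class ActiveRmPoller {
    private ActiveRmPoller() { }

    /**
     * Polls rmCount RMs until one reports active or timeoutMs elapses.
     * Returns the index of the active RM, or -1 if none became active in time.
     */
    static int waitForActiveIndex(int rmCount, IntPredicate isActive,
                                  long timeoutMs, long pollIntervalMs) {
        long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
        while (true) {
            for (int i = 0; i < rmCount; i++) {
                if (isActive.test(i)) {
                    return i;
                }
            }
            if (System.nanoTime() - deadline >= 0) {
                return -1; // timed out without seeing an active RM
            }
            try {
                Thread.sleep(pollIntervalMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve interrupt status
                return -1;
            }
        }
    }
}
```

With the failover wait isolated like this, the later NM/client connection check only has to cover its own connect time rather than the failover time as well.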

bq. Should we clear or fail to start? The data seems to be in error.
Good point! When using ZKFC, the format command takes care of clearing the 
parentZNode. Ideally, we should have a similar option to be able to format the 
znode, and the elector should fail if the znode is not safe. In the next patch, 
let me fail to start and open another JIRA for adding the admin option.

bq. This method used to be synchronized
The method should have been synchronized in YARN-1481; [~vinodkv] and I thought 
we could handle it here. I could do an addendum patch there instead if that is 
preferred.

bq. Is it necessary to mention ZKFC here?
Yes. The alternative is to add another RequestSource or rename the ZKFC source 
to something else. I think it is okay to leave it as is. 

bq. Can we share the terminate functionality with 
RMStateStoreOperationFailedEventDispatcher in a common function?
Not sure how advantageous that will be - we'll end up calling the common method 
instead of ExitUtil.terminate, only for the common method to call it? Also, 
getCause() doesn't exist in AbstractEvent, so we would need to add a new kind 
of event (CausedEvent?) that both these events extend. Seems too complicated 
for the gain.
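
To make the trade-off concrete, here is a rough sketch of what the shared path might require. Everything in it is hypothetical: {{BaseEvent}} stands in for the real event base class, {{CausedEvent}} is the speculative type named above, and the {{terminator}} callback stands in for the call to ExitUtil.terminate so that the sketch runs on its own.

```java
import java.util.function.IntConsumer;

/** Stand-in for the real event base class, which has no getCause(). */
abstract class BaseEvent { }

/** Hypothetical new event type that carries the failure cause. */
abstract class CausedEvent extends BaseEvent {
    private final Exception cause;

    CausedEvent(Exception cause) {
        this.cause = cause;
    }

    Exception getCause() {
        return cause;
    }
}

/** Both failure events would have to extend the new type. */
final class StoreOperationFailedEvent extends CausedEvent {
    StoreOperationFailedEvent(Exception cause) { super(cause); }
}

final class ElectorFailedEvent extends CausedEvent {
    ElectorFailedEvent(Exception cause) { super(cause); }
}

/** The shared handler does little more than forward to the terminator. */
final class FatalEventHandler {
    private final IntConsumer terminator;

    FatalEventHandler(IntConsumer terminator) {
        this.terminator = terminator;
    }

    void handle(CausedEvent event) {
        System.err.println("Fatal event: " + event.getCause());
        terminator.accept(1); // the RM would terminate here
    }
}
```

The shared handler saves one line per dispatcher at the cost of a new class in the event hierarchy, which is the imbalance described above.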

bq. We probably need some unified method of notifying the RM about something 
bad. One example being embedded leader election reporting an error. Else we may 
end up with a proliferation of event handlers.
Agree - 100%, but would like to do it lazily when another such case pops up. 

[~vinodkv], [~sandyr] - can one of you take a look at least at the 
MiniYARNCluster changes?

> Allow embedding leader election into the RM
> -------------------------------------------
>
>                 Key: YARN-1029
>                 URL: https://issues.apache.org/jira/browse/YARN-1029
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>            Reporter: Bikas Saha
>            Assignee: Karthik Kambatla
>         Attachments: embedded-zkfc-approach.patch, yarn-1029-0.patch, 
> yarn-1029-0.patch, yarn-1029-1.patch, yarn-1029-2.patch, yarn-1029-3.patch, 
> yarn-1029-approach.patch
>
>
> It should be possible to embed common ActiveStandbyElector into the RM such 
> that ZooKeeper based leader election and notification is in-built. In 
> conjunction with a ZK state store, this configuration will be a simple 
> deployment option.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
