[jira] [Commented] (YARN-1584) Support explicit failover when automatic failover is enabled

2014-01-15 Thread Bikas Saha (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13872514#comment-13872514
 ] 

Bikas Saha commented on YARN-1584:
--

And what use case does this solve other than making RM1 become standby? The 
downside is a lot of churn in clients that are currently using YARN as they 
switch from RM1 to RM2. If RM1 is going to go back and join the election 
anyway, then why are we doing this?
Secondly, I don't see how this works when there are 3 RMs in the leader 
election. I don't think we want to restrict our model to only support 2 RMs.
Lastly, isn't this something that can be done entirely on the client side? 
The client would first call transitionToStandby() on all RMs other than the 
one that the admin wants to make active. It would then determine whether the 
desired RM has become active (or time out), and then call 
transitionToActive() on the remaining RMs so that they rejoin the election.
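
A minimal sketch of the client-side sequence described above, written against 
an in-memory stand-in rather than a real cluster. The RmAdmin interface, the 
FakeElection class and the failoverTo() helper are hypothetical names 
introduced only for illustration; they are not the Hadoop HA API, and the 
election rule (the earliest remaining candidate takes over) is an assumption 
made to keep the example self-contained.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ClientSideFailover {

    /** Hypothetical admin handle for one RM; NOT the real Hadoop interface. */
    interface RmAdmin {
        void transitionToStandby();  // stop being active and leave the election
        void transitionToActive();   // under auto-failover: rejoin the election
        boolean isActive();
    }

    /** In-memory stand-in for a ZK-backed election, for demonstration only. */
    static class FakeElection {
        private final List<String> candidates = new ArrayList<>();
        private String leader;

        void join(String id) {
            if (!candidates.contains(id)) candidates.add(id);
            if (leader == null) leader = id;
        }

        void quit(String id) {
            candidates.remove(id);
            if (id.equals(leader)) {
                // Assumption: the earliest remaining candidate takes over.
                leader = candidates.isEmpty() ? null : candidates.get(0);
            }
        }

        String leader() { return leader; }

        RmAdmin adminFor(String id) {
            return new RmAdmin() {
                public void transitionToStandby() { quit(id); }
                public void transitionToActive() { join(id); }
                public boolean isActive() { return id.equals(leader); }
            };
        }
    }

    /**
     * Tries to make `target` the active RM. Returns true if it became active
     * before the timeout. The other RMs rejoin the election either way.
     */
    static boolean failoverTo(String target, Map<String, RmAdmin> rms,
                              long timeoutMs, long pollMs) {
        RmAdmin desired = rms.get(target);
        if (desired == null) {
            throw new IllegalArgumentException("unknown RM: " + target);
        }
        List<RmAdmin> others = new ArrayList<>();
        for (Map.Entry<String, RmAdmin> e : rms.entrySet()) {
            if (!e.getKey().equals(target)) others.add(e.getValue());
        }

        // Step 1: take every RM except the target out of contention.
        for (RmAdmin rm : others) rm.transitionToStandby();

        // Step 2: wait until the target is active, or give up at the deadline.
        long deadline = System.currentTimeMillis() + timeoutMs;
        boolean won = desired.isActive();
        while (!won && System.currentTimeMillis() < deadline) {
            try {
                Thread.sleep(pollMs);
            } catch (InterruptedException ie) {
                Thread.currentThread().interrupt();
                break;
            }
            won = desired.isActive();
        }

        // Step 3: let the remaining RMs rejoin the election.
        for (RmAdmin rm : others) rm.transitionToActive();
        return won;
    }

    public static void main(String[] args) {
        FakeElection election = new FakeElection();
        Map<String, RmAdmin> rms = new LinkedHashMap<>();
        for (String id : new String[] {"rm1", "rm2", "rm3"}) {
            rms.put(id, election.adminFor(id));
            election.join(id);
        }
        System.out.println("leader before: " + election.leader());
        boolean ok = failoverTo("rm3", rms, 1000, 10);
        System.out.println("leader after: " + election.leader() + ", ok=" + ok);
    }
}
```

In this toy model rm2 briefly becomes the leader after rm1 steps down and 
before rm2 itself is moved to standby, which is one concrete form of the 
client churn raised above.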

 Support explicit failover when automatic failover is enabled
 

 Key: YARN-1584
 URL: https://issues.apache.org/jira/browse/YARN-1584
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Affects Versions: 2.4.0
Reporter: Karthik Kambatla
Assignee: Karthik Kambatla

 YARN-1029 adds automatic failover support. However, users can't explicitly 
 ask for a failover from one RM to the other without stopping the other RM. 
 Stopping the RM until the other RM takes over and then restarting the first 
 RM is more involved and exposes the RM-ensemble to SPOF for a longer 
 duration. 
 It would be nice to allow explicit failover through the yarn rmadmin 
 -failover command.
 PS: HDFS supports the -failover option. 



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Commented] (YARN-1584) Support explicit failover when automatic failover is enabled

2014-01-14 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13871425#comment-13871425
 ] 

Karthik Kambatla commented on YARN-1584:


bq. The duration of failover depends on how long ZK needs to figure out that 
the leader is gone, then on notifying the new leader, and then on the new 
leader reading state.
Right. I agree that the failover takes the same time irrespective of whether it 
is through graceful failover or shutting the RM down.

bq. It's not clear to me how any of these steps is faster with an admin 
failover option.
I was referring to the duration after the failover for which only a single RM 
is up. In other words, the time it takes to recover the RM that was shut down. 
Firstly, it requires manually checking that the other RM has actually taken 
over, which in itself is slower than handling it automatically. Then there is 
the start-up time for the second RM; the start-up might become an issue 
if/when the Standby and the other services retain/pre-fetch state. 

IMO, the biggest gain of supporting -failover is the ease of use. What do you 
think of adding a config that controls whether graceful failover is supported? 
Maybe we can turn it off by default. 

bq. When the RM is asked to transition to standby via the AdminService with 
the (FORCE_USER) flag, the AdminService can transition it to standby and then 
notify the elector to quitElection(). That API is present on the elector for 
this specific purpose. The elector gives up participation in the leader 
election process. This RM will remain in standby (because the elector is not 
going to notify it anymore) until the admin asks it to 
transitionToActive(FORCE_USER). Later, when the AdminService is asked to 
transitionToActive(), it can call the joinElection API on the elector to rejoin 
the leader election and stay in the Standby state. The elector will join the 
election and notify the RM to transitionToActive if it wins the election.
The transitionToStandby() part sounds reasonable to me. However, 
transitionToActive(FORCE_USER) wouldn't actually transition the RM to Active, 
but would instead just make it ready to become Active? Users might find that 
confusing. 



[jira] [Commented] (YARN-1584) Support explicit failover when automatic failover is enabled

2014-01-14 Thread Bikas Saha (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13871759#comment-13871759
 ] 

Bikas Saha commented on YARN-1584:
--

bq. Firstly, it requires manually checking that the other RM has actually 
taken over, which in itself is slower than handling it automatically. Then 
there is the start-up time for the second RM; the start-up might become an 
issue if/when the Standby and the other services retain/pre-fetch state.
Is the proposal for the active RM to give up being the leader and then monitor 
whether someone else becomes the leader? Then do what? If no one else becomes 
the leader, what should it do? If someone else becomes the leader, does the 
one that just gave up try to participate in the election again? If yes, then 
why did we ask it to give up in the first place? If we did this to do some 
maintenance on the first RM, how is it different from shutting it down and 
letting auto-failover take its course? If we are doing maintenance on the first 
RM, we cannot avoid the single-RM risk unless we have 3 instances.

Under auto-failover, there is no way to force an RM to become active all by 
itself. So the documentation of transitionToActive(FORCE) should state that 
this puts the RM into the election but does not guarantee that it will win. 
transitionToStandby() can, however, guarantee that the RM stops being active.

Clearly, I am confused as to how this results in ease of use. Could I get some 
help in understanding the exact scenario where this is useful? Is there a 
specific example? What exactly is the chain of events that we think should 
happen?



[jira] [Commented] (YARN-1584) Support explicit failover when automatic failover is enabled

2014-01-10 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13868069#comment-13868069
 ] 

Karthik Kambatla commented on YARN-1584:


bq. Explicitly asking an RM to failover is where the RM gives up control and 
does not participate in leader election such that the other RM can take over. 
Right?
That is correct. Doing it through the -failover option would allow us to limit 
the window of SPOF to a pre-configured duration. Manually stopping the RM, 
checking through the Web UI or other means that the other RM has taken over, 
and restarting the first RM is more involved and would take a lot longer. 

Further, failover seems like a safer option to users than shutting down the 
Active RM - no? 

Another way to allow this manual failover would be for the admin to explicitly 
transition the first RM to standby and the second RM to active. However, that 
would lead to an inconsistency between the state of the ActiveStandbyElector 
used for automatic failover and the HA states of the participating RMs. 

 Support explicit failover when automatic failover is enabled
 

 Key: YARN-1584
 URL: https://issues.apache.org/jira/browse/YARN-1584
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Affects Versions: 2.4.0
Reporter: Karthik Kambatla
Assignee: Karthik Kambatla

 YARN-1029 adds automatic failover support. However, users can't explicitly 
 ask for a failover from one RM to the other without stopping the other RM. 
 Stopping the RM until the other RM takes over exposes the RM-ensemble to SPOF 
 for the duration. 
 It would be nice to allow explicit failover through the yarn rmadmin 
 -failover command.
 PS: HDFS supports the -failover option. 





[jira] [Commented] (YARN-1584) Support explicit failover when automatic failover is enabled

2014-01-10 Thread Bikas Saha (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13868231#comment-13868231
 ] 

Bikas Saha commented on YARN-1584:
--

The duration of failover depends on how long ZK needs to figure out that the 
leader is gone, then on notifying the new leader, and then on the new leader 
reading state. It's not clear to me how any of these steps is faster with an 
admin failover option.

Not quite. When the RM is asked to transition to standby via the AdminService 
with the (FORCE_USER) flag, the AdminService can transition it to standby and 
then notify the elector to quitElection(). That API is present on the elector 
for this specific purpose. The elector gives up participation in the leader 
election process. This RM will remain in standby (because the elector is not 
going to notify it anymore) until the admin asks it to 
transitionToActive(FORCE_USER). Later, when the AdminService is asked to 
transitionToActive(), it can call the joinElection API on the elector to rejoin 
the leader election and stay in the Standby state. The elector will join the 
election and notify the RM to transitionToActive if it wins the election.
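
The flow in the paragraph above could be sketched roughly as follows. This is 
an illustration under stated assumptions, not the ResourceManager code: the 
Elector interface here is a simplified, hypothetical stand-in (the real 
ActiveStandbyElector methods take arguments), and the AdminService class and 
its method names are invented for this example.

```java
import java.util.ArrayList;
import java.util.List;

public class ManualFailoverSketch {

    enum HaState { ACTIVE, STANDBY }

    /** Simplified, hypothetical elector; the real elector API differs. */
    interface Elector {
        void joinElection();
        void quitElection();
    }

    /** Hypothetical admin service for one RM under automatic failover. */
    static class AdminService {
        private final Elector elector;
        private HaState state = HaState.STANDBY;
        private boolean inElection = false;

        AdminService(Elector elector) { this.elector = elector; }

        /** Admin request: always ends in standby and out of the election. */
        void transitionToStandbyForced() {
            state = HaState.STANDBY;
            if (inElection) {
                elector.quitElection();
                inElection = false;
            }
        }

        /** Admin request: only rejoins the election; the RM stays standby. */
        void transitionToActiveForced() {
            if (!inElection) {
                elector.joinElection();
                inElection = true;
            }
        }

        /** Callback from the elector when this RM wins the election. */
        void becomeActive() {
            // Ignore stale notifications that arrive after quitting.
            if (inElection) state = HaState.ACTIVE;
        }

        /** Callback from the elector when this RM loses leadership. */
        void becomeStandby() { state = HaState.STANDBY; }

        HaState state() { return state; }
    }

    public static void main(String[] args) {
        List<String> calls = new ArrayList<>();
        Elector recording = new Elector() {
            public void joinElection() { calls.add("join"); }
            public void quitElection() { calls.add("quit"); }
        };
        AdminService rm = new AdminService(recording);

        rm.transitionToActiveForced();    // joins the election only
        System.out.println(rm.state());   // STANDBY
        rm.becomeActive();                // the elector reports a win
        System.out.println(rm.state());   // ACTIVE
        rm.transitionToStandbyForced();   // steps down and quits
        System.out.println(rm.state());   // STANDBY
        System.out.println(calls);        // [join, quit]
    }
}
```

The point of the sketch is the asymmetry: the standby request has a guaranteed 
outcome, while the active request only makes the RM eligible and leaves the 
decision to the election.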
