[
https://issues.apache.org/jira/browse/YARN-1861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13988529#comment-13988529
]
Tsuyoshi OZAWA commented on YARN-1861:
--------------------------------------
[~xgong] Great work. Xuan's test case verifies Karthik's fix by injecting
RMFatalEventType.STATE_STORE_FENCED directly.
My review comments are as follows:
{code}
// Transition to standby and reinit active services
LOG.info("Transitioning RM to Standby mode");
rm.transitionToStandby(true);
+ rm.adminService.resetLeaderElection();
return;
} catch (Exception e) {
{code}
We should call rm.adminService.resetLeaderElection() in a finally block. If
rm.transitionToStandby() fails while stopping the RM's services, the reset is
skipped and all RMs can get stuck.
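A minimal sketch of the suggestion above, not the actual patch. The Rm
interface and handleStateStoreFenced() are hypothetical stand-ins for the
ResourceManager, its adminService, and the fatal-event handler; the real
handler's failure path is not shown in the quoted diff, so the catch body here
is only a placeholder.

```java
public class FencedTransitionSketch {

    /** Hypothetical stand-in for the RM calls that appear in the quoted diff. */
    public interface Rm {
        void transitionToStandby(boolean initialize) throws Exception;
        void resetLeaderElection();
    }

    /**
     * Attempts the transition to standby and reports whether it succeeded.
     * Leader election is reset on both the success and the failure path.
     */
    public static boolean handleStateStoreFenced(Rm rm) {
        try {
            // Transition to standby and reinit active services
            rm.transitionToStandby(true);
            return true;
        } catch (Exception e) {
            // Placeholder: the real handler's failure handling is not shown
            // in the quoted patch.
            return false;
        } finally {
            // Runs even if transitionToStandby() throws, so this RM always
            // rejoins the election.
            rm.resetLeaderElection();
        }
    }
}
```

With the call placed after transitionToStandby() inside the try block, as in
the current patch, an exception would skip it entirely.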
{code}
+ int maxWaittingAttempt = 20;
+ while (maxWaittingAttempt -- > 0) {
{code}
maxWaittingAttempt should be maxWaitingAttempt.
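
For reference, a sketch of the bounded-wait pattern with the corrected name.
Only the counter and the loop condition come from the patch; the waitFor()
helper, the condition parameter, and the sleep interval are assumptions made
for illustration.

```java
import java.util.function.BooleanSupplier;

public class BoundedWaitSketch {

    /**
     * Polls the condition up to maxWaitingAttempt times, sleeping between
     * attempts. Returns true as soon as the condition holds, false if the
     * attempts run out first.
     */
    public static boolean waitFor(BooleanSupplier condition,
                                  int maxWaitingAttempt,
                                  long sleepMillis) throws InterruptedException {
        while (maxWaitingAttempt-- > 0) {
            if (condition.getAsBoolean()) {
                return true;
            }
            Thread.sleep(sleepMillis);
        }
        return false;
    }
}
```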
> Both RM stuck in standby mode when automatic failover is enabled
> ----------------------------------------------------------------
>
> Key: YARN-1861
> URL: https://issues.apache.org/jira/browse/YARN-1861
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: resourcemanager
> Affects Versions: 2.4.0
> Reporter: Arpit Gupta
> Assignee: Xuan Gong
> Priority: Blocker
> Attachments: YARN-1861.2.patch, YARN-1861.3.patch, YARN-1861.4.patch,
> YARN-1861.5.patch, yarn-1861-1.patch
>
>
> In our HA tests we noticed that the tests got stuck because both RMs went
> into standby state and neither became active.