[
https://issues.apache.org/jira/browse/YARN-1861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13995700#comment-13995700
]
Xuan Gong commented on YARN-1861:
---------------------------------
bq. I tried to just apply the test-case and run it without the core change and
was expecting the active RM to go to standby and the standby RM to go to active
once the originally active RM is fenced. Instead I get a NPE somewhere. Can the
test be fixed to do so?
In the test case, I manually send an RMFatalEvent of type
RMFatalEventType.STATE_STORE_FENCED to the current active RM (rm1). That RM
handles the event and transitions to standby. Both RMs are then in standby
state, while ZK still thinks that rm1 is active, so it does not trigger a
leader election. I think this mimics the behavior we described previously.
Without the core code change, this test case fails because the NM tries to
connect to the active RM but neither of the two RMs is active, so the NPE is
expected.
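To make the scenario above concrete, here is a minimal toy model. It is not
Hadoop or ZooKeeper code: ToyElector, ToyRm, onFenced and the resetElection
flag are hypothetical names, and it assumes the core change amounts to the
fenced RM quitting and rejoining the election.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model only: these classes are hypothetical, not Hadoop or ZooKeeper APIs. */
final class FencedRmSketch {
    enum HaState { ACTIVE, STANDBY }

    /** Stand-in for the ZK-based elector: remembers who it believes is the leader. */
    static final class ToyElector {
        private final List<ToyRm> candidates = new ArrayList<>();
        private ToyRm leader; // the elector's view, which can go stale

        void join(ToyRm rm) {
            if (!candidates.contains(rm)) {
                candidates.add(rm);
            }
            electIfVacant();
        }

        void quit(ToyRm rm) {
            candidates.remove(rm);
            if (leader == rm) {
                leader = null;
            }
            electIfVacant();
        }

        // A new leader is chosen only when the elector sees the seat as empty.
        private void electIfVacant() {
            if (leader == null && !candidates.isEmpty()) {
                leader = candidates.get(0);
                leader.state = HaState.ACTIVE;
            }
        }
    }

    static final class ToyRm {
        final String id;
        HaState state = HaState.STANDBY;

        ToyRm(String id) {
            this.id = id;
        }

        /** Models handling a STATE_STORE_FENCED fatal event. */
        void onFenced(ToyElector elector, boolean resetElection) {
            state = HaState.STANDBY;
            if (resetElection) {
                elector.quit(this); // give up the seat so the elector's view is not stale
                elector.join(this); // re-enter behind the other candidate
            }
        }
    }

    /** Returns the states of rm1 and rm2 after rm1 is fenced. */
    static String simulate(boolean resetElection) {
        ToyElector elector = new ToyElector();
        ToyRm rm1 = new ToyRm("rm1");
        ToyRm rm2 = new ToyRm("rm2");
        elector.join(rm1); // rm1 joins first and becomes active
        elector.join(rm2);
        rm1.onFenced(elector, resetElection);
        return rm1.id + "=" + rm1.state + "," + rm2.id + "=" + rm2.state;
    }

    public static void main(String[] args) {
        System.out.println("without reset: " + simulate(false));
        System.out.println("with reset:    " + simulate(true));
    }
}
```

Without the reset, the elector still counts rm1 as leader and both RMs stay in
standby; with the reset, rm2 is elected.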
bq. Also, we need to make sure that when automatic failover is enabled, all
external interventions like a fence like this bug (and forced-manual failover
from CLI?) do a similar reset into the leader election. There may not be cases
like this today though..
For automatic failover, the external interventions we have right now are
transitionToActive/transitionToStandby plus forcemanual from the CLI. The
current behavior is: if we run transitionToActive + forcemanual + the current
standby RM id, the standby RM transitions to active. In the meantime, it does
the fencing, and the current active RM transitions to standby. If there are
any exceptions, the RM is either terminated or goes back to standby state,
which resets the leader election. In both cases, ZK triggers a new round of
leader election.
If we run transitionToStandby + forcemanual + the current active RM id, both
RMs end up in standby state, and another transitionToActive command is needed.
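The two manual cases can be sketched as a small state table. This is only an
illustration of the behavior described above, with hypothetical method names;
it is not the Hadoop HA admin API and it ignores the exception paths.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Toy model only: hypothetical names, not the Hadoop HA admin API. */
final class ManualFailoverSketch {
    enum HaState { ACTIVE, STANDBY }

    private final Map<String, HaState> rms = new LinkedHashMap<>();

    ManualFailoverSketch() {
        rms.put("rm1", HaState.ACTIVE);
        rms.put("rm2", HaState.STANDBY);
    }

    /** Forced transitionToActive: the target becomes active, the old active is fenced. */
    void transitionToActive(String id) {
        for (Map.Entry<String, HaState> e : rms.entrySet()) {
            e.setValue(e.getKey().equals(id) ? HaState.ACTIVE : HaState.STANDBY);
        }
    }

    /** Forced transitionToStandby: only the target changes; nobody is promoted. */
    void transitionToStandby(String id) {
        rms.replace(id, HaState.STANDBY);
    }

    String states() {
        return rms.toString();
    }

    public static void main(String[] args) {
        ManualFailoverSketch cluster = new ManualFailoverSketch();
        cluster.transitionToStandby("rm1");
        System.out.println(cluster.states()); // both RMs are in standby here
        cluster.transitionToActive("rm2");    // the second command that is needed
        System.out.println(cluster.states());
    }
}
```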
> Both RM stuck in standby mode when automatic failover is enabled
> ----------------------------------------------------------------
>
> Key: YARN-1861
> URL: https://issues.apache.org/jira/browse/YARN-1861
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: resourcemanager
> Affects Versions: 2.4.0
> Reporter: Arpit Gupta
> Assignee: Karthik Kambatla
> Priority: Blocker
> Attachments: YARN-1861.2.patch, YARN-1861.3.patch, YARN-1861.4.patch,
> YARN-1861.5.patch, yarn-1861-1.patch, yarn-1861-6.patch
>
>
> In our HA tests we noticed that the tests got stuck because both RM's got
> into standby state and no one became active.