[ 
https://issues.apache.org/jira/browse/YARN-1861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13995700#comment-13995700
 ] 

Xuan Gong commented on YARN-1861:
---------------------------------

bq. I tried to just apply the test-case and run it without the core change and 
was expecting the active RM to go to standby and the standby RM to go to active 
once the originally active RM is fenced. Instead I get a NPE somewhere. Can the 
test be fixed to do so?

In the test case, I manually send an RMFatalEvent with 
RMFatalEventType.STATE_STORE_FENCED to the current active RM (rm1). This RM 
handles the event and transitions to standby. At that point both RMs are in 
standby state, while ZK still thinks that rm1 is active, so it does not 
trigger a leader election. I think this mimics the behavior we described 
previously. Without the core code change, this test case fails because the NM 
tries to connect to the active RM, but neither of the two RMs is active. So 
the NPE is expected. 
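
To make the stuck state concrete, here is a minimal toy model of the scenario 
above. It is only a sketch and not Hadoop code: ToyElector, ToyRM, 
onStateStoreFenced and the resetElection flag are hypothetical names, and the 
"reset" branch only assumes that the fix makes the fenced RM rejoin the leader 
election, as discussed in this thread.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Toy model of the fenced-RM scenario; every name here is hypothetical. */
public class FencedFailoverSketch {
    enum HAState { ACTIVE, STANDBY }

    /** Stand-in for the ZK-based elector: it remembers who it believes leads. */
    static class ToyElector {
        private final Deque<ToyRM> candidates = new ArrayDeque<>();
        private ToyRM leader;

        void join(ToyRM rm) {
            candidates.addLast(rm);
            electIfLeaderless();
        }

        /** Drop the RM's claim on leadership and queue it again as a candidate. */
        void rejoin(ToyRM rm) {
            candidates.remove(rm);
            if (leader == rm) {
                leader = null;
            }
            candidates.addLast(rm);
            electIfLeaderless();
        }

        /** An election only runs when the elector believes there is no leader. */
        private void electIfLeaderless() {
            if (leader == null && !candidates.isEmpty()) {
                leader = candidates.peekFirst();
                leader.state = HAState.ACTIVE;
            }
        }
    }

    static class ToyRM {
        final String name;
        final ToyElector elector;
        HAState state = HAState.STANDBY;

        ToyRM(String name, ToyElector elector) {
            this.name = name;
            this.elector = elector;
        }

        /** Models handling a STATE_STORE_FENCED fatal event. */
        void onStateStoreFenced(boolean resetElection) {
            state = HAState.STANDBY;      // the RM steps down locally
            if (resetElection) {
                elector.rejoin(this);     // assumed fix: tell the elector as well
            }
        }
    }

    /** Returns the name of the active RM after rm1 is fenced, or "none". */
    public static String activeAfterFence(boolean resetElection) {
        ToyElector elector = new ToyElector();
        ToyRM rm1 = new ToyRM("rm1", elector);
        ToyRM rm2 = new ToyRM("rm2", elector);
        elector.join(rm1);                // rm1 wins the first election
        elector.join(rm2);                // rm2 stays standby
        rm1.onStateStoreFenced(resetElection);
        for (ToyRM rm : new ToyRM[] {rm1, rm2}) {
            if (rm.state == HAState.ACTIVE) {
                return rm.name;
            }
        }
        return "none";
    }

    public static void main(String[] args) {
        System.out.println("without reset: " + activeAfterFence(false));
        System.out.println("with reset:    " + activeAfterFence(true));
    }
}
```

Without the reset, the toy elector still records rm1 as leader, so no election 
runs and the result is "none". With the reset, rm2 is elected.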

bq. Also, we need to make sure that when automatic failover is enabled, all 
external interventions like a fence like this bug (and forced-manual failover 
from CLI?) do a similar reset into the leader election. There may not be cases 
like this today though..

For external interventions under automatic failover, we currently have 
transitionToActive/transitionToStandby plus forcemanual from the CLI. The 
current behavior is: if we run transitionToActive + forcemanual + the current 
standby RM id, the standby RM transitions to active. In the meantime, it does 
the fencing, and the current active RM transitions to standby. If there are 
any exceptions, the RM is either terminated or goes back to standby state, 
which resets the leader election. In both cases, ZK triggers a new round of 
leader election.

If we run transitionToStandby + forcemanual + the current active RM id, both 
RMs end up in standby state, and another transitionToActive command is needed.



> Both RM stuck in standby mode when automatic failover is enabled
> ----------------------------------------------------------------
>
>                 Key: YARN-1861
>                 URL: https://issues.apache.org/jira/browse/YARN-1861
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: resourcemanager
>    Affects Versions: 2.4.0
>            Reporter: Arpit Gupta
>            Assignee: Karthik Kambatla
>            Priority: Blocker
>         Attachments: YARN-1861.2.patch, YARN-1861.3.patch, YARN-1861.4.patch, 
> YARN-1861.5.patch, yarn-1861-1.patch, yarn-1861-6.patch
>
>
> In our HA tests we noticed that the tests got stuck because both RM's got 
> into standby state and no one became active.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
