[ https://issues.apache.org/jira/browse/YARN-1861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Vinod Kumar Vavilapalli updated YARN-1861:
------------------------------------------
Component/s: resourcemanager
Assignee: Vinod Kumar Vavilapalli
I debugged this for a while with Arpit's help. I think this can happen because
the RM holds two separate ZooKeeper sessions, and one of them can fail while the
other is still alive.
In this case, RM1 lost the ZK session inside the ZKRMStateStore, but the session
inside the leader-election code was still active. RM1 thus got stuck in standby
mode while its leader-election lock stayed in place. RM2 was already in standby
mode, so neither RM could become active and the cluster was stuck.
When I manually deleted the ZK locks, leader election kicked back in and RM1
itself became active again.
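
To make the sequence concrete, here is a toy Java model of it. This is only a
sketch under stated assumptions: Session, ElectionLock, ResourceManager and all
of their methods are hypothetical names made up for illustration, and none of
this is the real ZooKeeper client, ZKRMStateStore, or elector code. It only
models "two independent sessions, one lock tied to the elector session".

```java
public class TwoSessionSketch {

    /** Hypothetical stand-in for a ZooKeeper session (not the real client API). */
    static final class Session {
        private boolean alive = true;
        void expire() { alive = false; }
        boolean isAlive() { return alive; }
    }

    /** Toy election lock: it stays held while the owner's elector session is alive. */
    static final class ElectionLock {
        private ResourceManager owner;

        boolean tryAcquire(ResourceManager rm) {
            // The lock is released only when the owner's elector session dies.
            if (owner != null && !owner.electorSession.isAlive()) {
                owner = null;
            }
            if (owner == null) {
                owner = rm;
            }
            return owner == rm;
        }

        /** Models deleting the lock by hand. */
        void forceDelete() { owner = null; }
    }

    /** Toy RM with two independent sessions, as described above. */
    static final class ResourceManager {
        final String name;
        final Session storeSession = new Session();   // state-store session
        final Session electorSession = new Session(); // leader-election session
        private boolean active;

        ResourceManager(String name) { this.name = name; }

        boolean isActive() { return active; }

        /** Run (or re-run) election; active only if the lock is acquired. */
        void joinElection(ElectionLock lock) { active = lock.tryAcquire(this); }

        /** The store session dies: the RM drops to standby, the elector session is untouched. */
        void loseStoreSession() {
            storeSession.expire();
            active = false;
        }
    }

    public static void main(String[] args) {
        ElectionLock lock = new ElectionLock();
        ResourceManager rm1 = new ResourceManager("RM1");
        ResourceManager rm2 = new ResourceManager("RM2");

        rm1.joinElection(lock);   // RM1 wins the lock and becomes active
        rm2.joinElection(lock);   // RM2 stays in standby

        rm1.loseStoreSession();   // only the store session fails
        rm2.joinElection(lock);   // lock is still held, RM2 cannot take over
        System.out.println("RM1 active: " + rm1.isActive()
                + ", RM2 active: " + rm2.isActive());

        lock.forceDelete();       // manual cleanup of the lock
        rm1.joinElection(lock);   // election kicks back in
        System.out.println("RM1 active: " + rm1.isActive()
                + ", RM2 active: " + rm2.isActive());
    }
}
```

In this model the lock outlives the store-session failure because it is tied
only to the elector session, which is why neither RM is active until the lock
is removed. The real code paths are more involved; treat this as an
illustration of the failure shape, not of the actual implementation.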
> Both RM stuck in standby mode when automatic failover is enabled
> ----------------------------------------------------------------
>
> Key: YARN-1861
> URL: https://issues.apache.org/jira/browse/YARN-1861
> Project: Hadoop YARN
> Issue Type: Bug
> Components: resourcemanager
> Affects Versions: 2.4.0
> Reporter: Arpit Gupta
> Assignee: Vinod Kumar Vavilapalli
>
> In our HA tests we noticed that the tests got stuck because both RMs got
> into standby state and neither became active.