[ 
https://issues.apache.org/jira/browse/YARN-1861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated YARN-1861:
------------------------------------------

    Component/s: resourcemanager
       Assignee: Vinod Kumar Vavilapalli

I debugged this for a while with Arpit's help. I think this can happen because 
we have two ZooKeeper sessions inside the RM, and one of them can fail while 
the other is still alive. 

In this case, RM1 lost the ZK session inside the ZKRMStateStore, but the 
session inside the leader-election code was still active. RM1 thus got stuck 
in standby mode. RM2 was already in standby mode, so the cluster was stuck 
with no active RM.

When I manually deleted the ZK locks, leader election kicked back in and RM1 
itself became active again.
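
To make the sequence above concrete, here is a minimal toy model of it in 
Java. It is only a sketch under stated assumptions: it uses no real ZooKeeper 
or YARN classes, every name in it (TwoSessionModel, Rm, runElection, 
loseStoreSession, deleteLock) is hypothetical, and it assumes that an RM 
which wins the election reopens its state-store session. The first candidate 
with a live elector session wins by construction, which is a simplification.

```java
// Toy model only: no real ZooKeeper or YARN classes are used.
// All names here are hypothetical and chosen for illustration.
class TwoSessionModel {
    enum HaState { ACTIVE, STANDBY }

    /** One ResourceManager holding two independent "sessions". */
    static final class Rm {
        final String name;
        HaState state = HaState.STANDBY;
        boolean storeSessionAlive = true;    // session used by the state store
        boolean electorSessionAlive = true;  // session used by leader election

        Rm(String name) { this.name = name; }
    }

    // Name of the RM whose elector session holds the lock; null if free.
    private String lockHolder;

    /** If the lock is free, the first candidate with a live elector session wins. */
    void runElection(Rm... candidates) {
        if (lockHolder != null) {
            return; // lock is still held, so nobody gets elected
        }
        for (Rm rm : candidates) {
            if (rm.electorSessionAlive) {
                lockHolder = rm.name;
                rm.storeSessionAlive = true; // assumption: going active reopens the store
                rm.state = HaState.ACTIVE;
                return;
            }
        }
    }

    /** The store session is lost: the RM steps down, but the lock is untouched. */
    void loseStoreSession(Rm rm) {
        rm.storeSessionAlive = false;
        rm.state = HaState.STANDBY;
    }

    /** Stand-in for manually deleting the ZK lock. */
    void deleteLock() {
        lockHolder = null;
    }

    /** True if at least one of the given RMs is active. */
    static boolean hasActive(Rm... rms) {
        for (Rm rm : rms) {
            if (rm.state == HaState.ACTIVE) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        TwoSessionModel zk = new TwoSessionModel();
        Rm rm1 = new Rm("RM1");
        Rm rm2 = new Rm("RM2");

        zk.runElection(rm1, rm2);   // RM1 takes the lock and goes active
        zk.loseStoreSession(rm1);   // only the store session fails
        zk.runElection(rm1, rm2);   // no effect: RM1's elector still holds the lock
        System.out.println("active RM after store-session loss: " + hasActive(rm1, rm2));

        zk.deleteLock();            // manual cleanup
        zk.runElection(rm1, rm2);   // election runs again
        System.out.println("RM1 state after lock deletion: " + rm1.state);
    }
}
```

The point of the model is that the lock is tied to the elector session alone, 
so losing only the store session demotes RM1 without ever releasing the lock.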

> Both RM stuck in standby mode when automatic failover is enabled
> ----------------------------------------------------------------
>
>                 Key: YARN-1861
>                 URL: https://issues.apache.org/jira/browse/YARN-1861
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 2.4.0
>            Reporter: Arpit Gupta
>            Assignee: Vinod Kumar Vavilapalli
>
> In our HA tests we noticed that the tests got stuck because both RMs got 
> into standby state and neither became active.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
