[jira] [Commented] (YARN-1861) Both RM stuck in standby mode when automatic failover is enabled
[ https://issues.apache.org/jira/browse/YARN-1861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13999020#comment-13999020 ]

Hudson commented on YARN-1861:

SUCCESS: Integrated in Hadoop-trunk-Commit #5605 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/5605/])
YARN-1861. Fixed a bug in RM to reset leader-election on fencing that was causing both RMs to be stuck in standby mode when automatic failover is enabled. Contributed by Karthik Kambatla and Xuan Gong. (vinodkv: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1594356)
* /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client/src/test/java/org/apache/hadoop/yarn/client/TestRMFailover.java
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/AdminService.java
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/EmbeddedElectorService.java
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ResourceManager.java
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-tests/src/test/java/org/apache/hadoop/yarn/server/MiniYARNCluster.java

> Both RM stuck in standby mode when automatic failover is enabled
>
> Key: YARN-1861
> URL: https://issues.apache.org/jira/browse/YARN-1861
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: resourcemanager
> Affects Versions: 2.4.0
> Reporter: Arpit Gupta
> Assignee: Karthik Kambatla
> Priority: Blocker
> Fix For: 2.4.1
> Attachments: YARN-1861.2.patch, YARN-1861.3.patch, YARN-1861.4.patch, YARN-1861.5.patch, YARN-1861.7.patch, yarn-1861-1.patch, yarn-1861-6.patch
>
> In our HA tests we noticed that the tests got stuck because both RM's got into standby state and no one became active.

--
This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1861) Both RM stuck in standby mode when automatic failover is enabled
[ https://issues.apache.org/jira/browse/YARN-1861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13993978#comment-13993978 ]

Karthik Kambatla commented on YARN-1861:

Thanks a bunch for writing the test for this, Xuan. A couple of nits in the test.

# Nit: I would start the string with a capital letter. (Okay with not fixing this.)
{code}
+RMFatalEvent event =
+new RMFatalEvent(RMFatalEventType.STATE_STORE_FENCED,
+ "fake RMFatalEvent");
{code}
# Nit: Typo in variable name and decrementing too

Since mail notifications are not working, let me post a patch fixing these nits.
[jira] [Commented] (YARN-1861) Both RM stuck in standby mode when automatic failover is enabled
[ https://issues.apache.org/jira/browse/YARN-1861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13998189#comment-13998189 ]

Hudson commented on YARN-1861:

FAILURE: Integrated in Hadoop-Mapreduce-trunk #1779 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1779/])
YARN-1861. Fixed a bug in RM to reset leader-election on fencing that was causing both RMs to be stuck in standby mode when automatic failover is enabled. Contributed by Karthik Kambatla and Xuan Gong. (vinodkv: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1594356)
* /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client/src/test/java/org/apache/hadoop/yarn/client/TestRMFailover.java
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/AdminService.java
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/EmbeddedElectorService.java
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ResourceManager.java
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-tests/src/test/java/org/apache/hadoop/yarn/server/MiniYARNCluster.java
[jira] [Commented] (YARN-1861) Both RM stuck in standby mode when automatic failover is enabled
[ https://issues.apache.org/jira/browse/YARN-1861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13998227#comment-13998227 ]

Hudson commented on YARN-1861:

FAILURE: Integrated in Hadoop-Hdfs-trunk #1753 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1753/])
YARN-1861. Fixed a bug in RM to reset leader-election on fencing that was causing both RMs to be stuck in standby mode when automatic failover is enabled. Contributed by Karthik Kambatla and Xuan Gong. (vinodkv: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1594356)
* /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client/src/test/java/org/apache/hadoop/yarn/client/TestRMFailover.java
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/AdminService.java
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/EmbeddedElectorService.java
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ResourceManager.java
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-tests/src/test/java/org/apache/hadoop/yarn/server/MiniYARNCluster.java
[jira] [Commented] (YARN-1861) Both RM stuck in standby mode when automatic failover is enabled
[ https://issues.apache.org/jira/browse/YARN-1861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13994014#comment-13994014 ]

Karthik Kambatla commented on YARN-1861:

I am obviously a +1 because I wrote the patch. Can someone other than Xuan and me take a look?
[jira] [Commented] (YARN-1861) Both RM stuck in standby mode when automatic failover is enabled
[ https://issues.apache.org/jira/browse/YARN-1861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13996807#comment-13996807 ]

Hadoop QA commented on YARN-1861:

{color:green}+1 overall{color}. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12644515/YARN-1861.7.patch
against trunk revision .

{color:green}+1 @author{color}. The patch does not contain any @author tags.
{color:green}+1 tests included{color}. The patch appears to include 2 new or modified test files.
{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.
{color:green}+1 javadoc{color}. There were no new javadoc warning messages.
{color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse.
{color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings.
{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.
{color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-tests.
{color:green}+1 contrib tests{color}. The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-YARN-Build/3744//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3744//console

This message is automatically generated.
[jira] [Commented] (YARN-1861) Both RM stuck in standby mode when automatic failover is enabled
[ https://issues.apache.org/jira/browse/YARN-1861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13996749#comment-13996749 ]

Vinod Kumar Vavilapalli commented on YARN-1861:

Okay, that's much better. +1. Will check this in once Jenkins says okay.
[jira] [Commented] (YARN-1861) Both RM stuck in standby mode when automatic failover is enabled
[ https://issues.apache.org/jira/browse/YARN-1861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13995700#comment-13995700 ]

Xuan Gong commented on YARN-1861:

bq. I tried to just apply the test-case and run it without the core change and was expecting the active RM to go to standby and the standby RM to go to active once the originally active RM is fenced. Instead I get a NPE somewhere. Can the test be fixed to do so?

In the test case, I manually send an RMFatalEvent of type RMFatalEventType.STATE_STORE_FENCED to the current active RM (rm1). That RM handles the event and transitions to standby. Both RMs are then in standby state, while ZK still thinks that rm1 is active, so no leader election is triggered. I think this mimics the behavior we described previously. Without the core code change, this test case fails: the NM tries to connect to the active RM, but neither of the two RMs is active, so the NPE is expected.

bq. Also, we need to make sure that when automatic failover is enabled, all external interventions like a fence like this bug (and forced-manual failover from CLI?) do a similar reset into the leader election. There may not be cases like this today though..

As for external interventions under automatic failover, right now we have transitionToActive/transitionToStandby plus forcemanual from the CLI. The current behavior is as follows.

If we run transitionToActive + forcemanual + the current standby RM id, the standby RM transitions to active. In the meantime, it performs the fence, and the current active RM transitions to standby. If there are any exceptions, the RM is either terminated or goes back to standby state, which resets the leader election. In both cases, ZK triggers a new round of leader election.

If we run transitionToStandby + forcemanual + the current active RM id, both RMs end up in standby state, and another transitionToActive command is needed.
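The scenario Xuan describes can be sketched with a toy model. This is not Hadoop code: ToyElector and ToyRM are hypothetical stand-ins for the ZK-backed elector and the RM, and the election is simplified so that a resigning RM goes to the back of the queue. Only the idea of resetting the leader election after a fence comes from the thread.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model, NOT Hadoop code: every class here is a hypothetical stand-in.
class StuckStandbySketch {
    enum HAState { ACTIVE, STANDBY }

    /** Stand-in for the ZK-backed elector: remembers who holds the leader lock. */
    static class ToyElector {
        private ToyRM leader; // the RM the elector believes is active
        private final List<ToyRM> candidates = new ArrayList<>();

        void join(ToyRM rm) {
            candidates.add(rm);
            electIfVacant();
        }

        /** Gives up the lock held by rm (if any) and re-runs the election. */
        void resetLeaderElection(ToyRM rm) {
            if (leader == rm) {
                leader = null;
                // Simplification: the resigning RM goes to the back of the queue.
                candidates.remove(rm);
                candidates.add(rm);
            }
            electIfVacant();
        }

        private void electIfVacant() {
            if (leader == null && !candidates.isEmpty()) {
                leader = candidates.get(0);
                leader.state = HAState.ACTIVE;
            }
        }
    }

    static class ToyRM {
        final String id;
        HAState state = HAState.STANDBY;

        ToyRM(String id) {
            this.id = id;
        }

        /** Models handling STATE_STORE_FENCED: step down, optionally tell the elector. */
        void handleFenced(ToyElector elector, boolean resetElection) {
            state = HAState.STANDBY;
            if (resetElection) {
                elector.resetLeaderElection(this);
            }
        }
    }

    /** Returns how many RMs are active after rm1 has been fenced. */
    static int activeCountAfterFence(boolean resetElection) {
        ToyElector elector = new ToyElector();
        ToyRM rm1 = new ToyRM("rm1");
        ToyRM rm2 = new ToyRM("rm2");
        elector.join(rm1); // rm1 wins the first election
        elector.join(rm2); // rm2 stays standby
        rm1.handleFenced(elector, resetElection);
        int active = 0;
        for (ToyRM rm : new ToyRM[] {rm1, rm2}) {
            if (rm.state == HAState.ACTIVE) {
                active++;
            }
        }
        return active;
    }

    public static void main(String[] args) {
        System.out.println("without reset: " + activeCountAfterFence(false));
        System.out.println("with reset:    " + activeCountAfterFence(true));
    }
}
```

Without the reset, the toy elector still records rm1 as leader, so nobody is promoted and both RMs stay in standby. With the reset, the lock is released and the other candidate becomes active.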
[jira] [Commented] (YARN-1861) Both RM stuck in standby mode when automatic failover is enabled
[ https://issues.apache.org/jira/browse/YARN-1861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13995779#comment-13995779 ]

Hadoop QA commented on YARN-1861:

{color:green}+1 overall{color}. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12644204/yarn-1861-6.patch
against trunk revision .

{color:green}+1 @author{color}. The patch does not contain any @author tags.
{color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files.
{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.
{color:green}+1 javadoc{color}. There were no new javadoc warning messages.
{color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse.
{color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings.
{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.
{color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.
{color:green}+1 contrib tests{color}. The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-YARN-Build/3737//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3737//console

This message is automatically generated.
[jira] [Commented] (YARN-1861) Both RM stuck in standby mode when automatic failover is enabled
[ https://issues.apache.org/jira/browse/YARN-1861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13995781#comment-13995781 ]

Karthik Kambatla commented on YARN-1861:

bq. That is what I was thinking, but I am concerned about locking etc. This code has become a little convoluted.

Agree. I did consider going that route, but was worried about the maintainability.
[jira] [Commented] (YARN-1861) Both RM stuck in standby mode when automatic failover is enabled
[ https://issues.apache.org/jira/browse/YARN-1861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13995760#comment-13995760 ]

Vinod Kumar Vavilapalli commented on YARN-1861:

bq. Without the core code change, this testcase will fail. Because NM is trying to connect the active RM, but neither of two RMs are active. So, the NPE is expected.

Can we make this explicit, instead of being an NPE? Like doing a client call to find the current active RM or something like that?

Tx for the explanation of all the cases, Xuan.

bq. That looks hacky, but doesn't require new external interventions to explicitly handle it. Vinod Kumar Vavilapalli - do you think that would be a better approach?

That is what I was thinking, but I am concerned about locking etc. This code has become a little convoluted. Per Xuan, we seem to be safe for now, so maybe look at this separately?
[jira] [Commented] (YARN-1861) Both RM stuck in standby mode when automatic failover is enabled
[ https://issues.apache.org/jira/browse/YARN-1861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13995743#comment-13995743 ]

Karthik Kambatla commented on YARN-1861:

bq. Also, we need to make sure that when automatic failover is enabled, all external interventions like a fence like this bug (and forced-manual failover from CLI?) do a similar reset into the leader election. There may not be cases like this today though.

One way to future-proof this is to call resetLeaderElection in ResourceManager#transitionToStandby itself. That looks hacky, but doesn't require new external interventions to explicitly handle it. [~vinodkv] - do you think that would be a better approach?
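A rough sketch of the future-proofing idea floated here, under heavy assumptions: ToyResourceManager, Elector and CountingElector are hypothetical, and only the method names transitionToStandby and resetLeaderElection are taken from the discussion. It is not the committed fix, which (per the commit message) resets the election on fencing.

```java
// Toy sketch, NOT the Hadoop ResourceManager. Only the method names
// transitionToStandby and resetLeaderElection come from the discussion.
class TransitionHookSketch {
    /** Hypothetical stand-in for the embedded elector. */
    interface Elector {
        void resetLeaderElection();
    }

    /** Counts resets so that the effect is observable. */
    static class CountingElector implements Elector {
        int resets = 0;

        @Override
        public void resetLeaderElection() {
            resets++;
        }
    }

    static class ToyResourceManager {
        private final Elector elector;
        private final boolean autoFailoverEnabled;
        private boolean active = true;

        ToyResourceManager(Elector elector, boolean autoFailoverEnabled) {
            this.elector = elector;
            this.autoFailoverEnabled = autoFailoverEnabled;
        }

        /** Every path into standby passes through here, so every path resets. */
        synchronized void transitionToStandby() {
            if (!active) {
                return; // already standby, nothing to give up
            }
            active = false;
            if (autoFailoverEnabled) {
                // Note: the elector is called while this object's lock is held.
                elector.resetLeaderElection();
            }
        }

        synchronized boolean isActive() {
            return active;
        }
    }

    public static void main(String[] args) {
        CountingElector elector = new CountingElector();
        ToyResourceManager rm = new ToyResourceManager(elector, true);
        rm.transitionToStandby(); // e.g. triggered by a fence
        rm.transitionToStandby(); // e.g. repeated from the CLI
        System.out.println("resets = " + elector.resets);
    }
}
```

The call into the elector happens while the RM's own lock is held. In a real system that kind of nested call is where lock-ordering problems tend to appear, which is one reason to treat this placement with care.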
[jira] [Commented] (YARN-1861) Both RM stuck in standby mode when automatic failover is enabled
[ https://issues.apache.org/jira/browse/YARN-1861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13995796#comment-13995796 ]

Xuan Gong commented on YARN-1861:

bq. Can we make this explicit, instead of being an NPE? Like doing a client call to find the current active RM or something like that?

Yes, we can do that. DONE

bq. That is what I was thinking, but I am concerned about locking etc. This code has become a little convoluted. Per Xuan, we seem to be safe for now, so may be look at this separately?

Yes. But I will make a note about it.
[jira] [Commented] (YARN-1861) Both RM stuck in standby mode when automatic failover is enabled
[ https://issues.apache.org/jira/browse/YARN-1861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13995812#comment-13995812 ]

Xuan Gong commented on YARN-1861:

Uploaded a new patch that explicitly throws an exception saying "Can not find the active RM" instead of failing with an NPE.
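The change described above, failing with a clear message rather than an NPE when no RM is active, might look roughly like the following. This is a sketch only: the class, the method names, and the choice of IllegalStateException are assumptions for illustration, and the actual patch to the test cluster may differ. Only the message text comes from the comment.

```java
// Toy sketch: names and the exception type are assumptions, not the actual patch.
class ActiveRmLookupSketch {
    /** Returns the index of the first active RM, or -1 if none is active. */
    static int getActiveRMIndex(boolean[] isActive) {
        for (int i = 0; i < isActive.length; i++) {
            if (isActive[i]) {
                return i;
            }
        }
        return -1;
    }

    /** Looks up the active RM's id, failing loudly when every RM is in standby. */
    static String activeRmId(String[] rmIds, boolean[] isActive) {
        int index = getActiveRMIndex(isActive);
        if (index < 0) {
            // An explicit, descriptive failure instead of a later NullPointerException.
            throw new IllegalStateException("Can not find the active RM");
        }
        return rmIds[index];
    }

    public static void main(String[] args) {
        String[] ids = {"rm1", "rm2"};
        System.out.println(activeRmId(ids, new boolean[] {false, true}));
        try {
            activeRmId(ids, new boolean[] {false, false});
        } catch (IllegalStateException e) {
            System.out.println("failed as intended: " + e.getMessage());
        }
    }
}
```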
[jira] [Commented] (YARN-1861) Both RM stuck in standby mode when automatic failover is enabled
[ https://issues.apache.org/jira/browse/YARN-1861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13988590#comment-13988590 ]

Karthik Kambatla commented on YARN-1861:

Please hold off until Sunday evening so that I can take a look at this.
[jira] [Commented] (YARN-1861) Both RM stuck in standby mode when automatic failover is enabled
[ https://issues.apache.org/jira/browse/YARN-1861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13988573#comment-13988573 ]

Tsuyoshi OZAWA commented on YARN-1861:

> We should call rm.adminService.resetLeaderElection() in the finally block. If
> rm.transitionToStandby() fails while stopping the RM's services, all RMs can get stuck.

Sorry, I noticed this is wrong. If rm.transitionToStandby() fails, the RM can be stuck until the ZK server detects the failure. We could call EmbeddedElectorService.stop() in the exception handler to shut down gracefully, but that is only one option.
[jira] [Commented] (YARN-1861) Both RM stuck in standby mode when automatic failover is enabled
[ https://issues.apache.org/jira/browse/YARN-1861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13988529#comment-13988529 ]

Tsuyoshi OZAWA commented on YARN-1861:

[~xgong] Great work. The test case by Xuan checks whether the fix by Karthik works well by injecting RMFatalEventType.STATE_STORE_FENCED directly. My review comments are as follows:

{code}
 // Transition to standby and reinit active services
 LOG.info("Transitioning RM to Standby mode");
 rm.transitionToStandby(true);
+rm.adminService.resetLeaderElection();
 return;
 } catch (Exception e) {
{code}
We should call rm.adminService.resetLeaderElection() in the finally block. If rm.transitionToStandby() fails while stopping the RM's services, all RMs can get stuck.

{code}
+int maxWaittingAttempt = 20;
+while (maxWaittingAttempt -- > 0) {
{code}
maxWaittingAttempt should be maxWaitingAttempt.
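The bounded wait loop quoted in the review, with the corrected spelling, can be sketched as a small helper. This is an illustration only: the helper, its parameters, and the use of BooleanSupplier are assumptions and are not taken from the patch; only the idea of counting down a fixed number of attempts comes from the quoted snippet.

```java
import java.util.function.BooleanSupplier;

// Toy sketch of a bounded polling loop; the helper itself is hypothetical.
class BoundedWaitSketch {
    /**
     * Polls condition up to maxWaitingAttempts times, sleeping between checks.
     * Returns true as soon as the condition holds, false once attempts run out.
     */
    static boolean waitFor(BooleanSupplier condition, int maxWaitingAttempts, long sleepMillis) {
        int remaining = maxWaitingAttempts;
        while (remaining-- > 0) {
            if (condition.getAsBoolean()) {
                return true;
            }
            try {
                Thread.sleep(sleepMillis);
            } catch (InterruptedException e) {
                // Restore the interrupt flag and give up waiting.
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        long deadline = System.currentTimeMillis() + 50;
        boolean reached = waitFor(() -> System.currentTimeMillis() >= deadline, 20, 10L);
        System.out.println("condition reached: " + reached);
    }
}
```

In a failover test, the condition would be something like "one of the RMs reports active"; the attempt limit keeps the test from hanging forever if neither RM ever does.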
[jira] [Commented] (YARN-1861) Both RM stuck in standby mode when automatic failover is enabled
[ https://issues.apache.org/jira/browse/YARN-1861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13988450#comment-13988450 ]

Tsuyoshi OZAWA commented on YARN-1861:

Thanks for updating the patch, Xuan. The TestClientRMService failure does not look related to the change, so I filed it as YARN-2018. I'll try to look at the latest patch.
[jira] [Commented] (YARN-1861) Both RM stuck in standby mode when automatic failover is enabled
[ https://issues.apache.org/jira/browse/YARN-1861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13988388#comment-13988388 ]

Hadoop QA commented on YARN-1861:

{color:red}-1 overall{color}. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12643124/YARN-1861.5.patch
against trunk revision .

{color:green}+1 @author{color}. The patch does not contain any @author tags.
{color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files.
{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.
{color:green}+1 javadoc{color}. There were no new javadoc warning messages.
{color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse.
{color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings.
{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.
{color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager:
org.apache.hadoop.yarn.server.resourcemanager.TestClientRMService
{color:green}+1 contrib tests{color}. The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-YARN-Build/3685//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3685//console

This message is automatically generated.
[jira] [Commented] (YARN-1861) Both RM stuck in standby mode when automatic failover is enabled
[ https://issues.apache.org/jira/browse/YARN-1861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13988071#comment-13988071 ]

Hadoop QA commented on YARN-1861:

{color:red}-1 overall{color}. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12643078/YARN-1861.4.patch
against trunk revision .

{color:green}+1 @author{color}. The patch does not contain any @author tags.
{color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files.
{color:red}-1 javac{color}. The applied patch generated 1279 javac compiler warnings (more than the trunk's current 1278 warnings).
{color:green}+1 javadoc{color}. There were no new javadoc warning messages.
{color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse.
{color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings.
{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.
{color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.
{color:green}+1 contrib tests{color}. The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-YARN-Build/3681//testReport/
Javac warnings: https://builds.apache.org/job/PreCommit-YARN-Build/3681//artifact/trunk/patchprocess/diffJavacWarnings.txt
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3681//console

This message is automatically generated.
[jira] [Commented] (YARN-1861) Both RM stuck in standby mode when automatic failover is enabled
[ https://issues.apache.org/jira/browse/YARN-1861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13987972#comment-13987972 ] Xuan Gong commented on YARN-1861: - [~ozawa] Thanks. Uploaded a new patch based on the latest trunk that fixes the -1 on findbugs. > Both RM stuck in standby mode when automatic failover is enabled > > > Key: YARN-1861 > URL: https://issues.apache.org/jira/browse/YARN-1861 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Affects Versions: 2.4.0 >Reporter: Arpit Gupta >Assignee: Xuan Gong >Priority: Blocker > Attachments: YARN-1861.2.patch, YARN-1861.3.patch, YARN-1861.4.patch, > yarn-1861-1.patch > > > In our HA tests we noticed that the tests got stuck because both RM's got > into standby state and no one became active. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1861) Both RM stuck in standby mode when automatic failover is enabled
[ https://issues.apache.org/jira/browse/YARN-1861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13987906#comment-13987906 ] Hadoop QA commented on YARN-1861: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12643061/YARN-1861.3.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:red}-1 findbugs{color}. The patch appears to introduce 1 new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/3679//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/3679//artifact/trunk/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-resourcemanager.html Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3679//console This message is automatically generated. 
> Both RM stuck in standby mode when automatic failover is enabled > > > Key: YARN-1861 > URL: https://issues.apache.org/jira/browse/YARN-1861 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Affects Versions: 2.4.0 >Reporter: Arpit Gupta >Assignee: Xuan Gong >Priority: Blocker > Attachments: YARN-1861.2.patch, YARN-1861.3.patch, yarn-1861-1.patch > > > In our HA tests we noticed that the tests got stuck because both RM's got > into standby state and no one became active. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1861) Both RM stuck in standby mode when automatic failover is enabled
[ https://issues.apache.org/jira/browse/YARN-1861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13983970#comment-13983970 ] Hadoop QA commented on YARN-1861: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12642387/YARN-1861.2.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:red}-1 javac{color}. The applied patch generated 1279 javac compiler warnings (more than the trunk's current 1278 warnings). {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/3653//testReport/ Javac warnings: https://builds.apache.org/job/PreCommit-YARN-Build/3653//artifact/trunk/patchprocess/diffJavacWarnings.txt Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3653//console This message is automatically generated. 
[jira] [Commented] (YARN-1861) Both RM stuck in standby mode when automatic failover is enabled
[ https://issues.apache.org/jira/browse/YARN-1861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13983875#comment-13983875 ] Xuan Gong commented on YARN-1861: - The solution provided by Karthik looks good to me. Uploaded a new patch that adds a test case. > Both RM stuck in standby mode when automatic failover is enabled > > > Key: YARN-1861 > URL: https://issues.apache.org/jira/browse/YARN-1861 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Affects Versions: 2.4.0 >Reporter: Arpit Gupta >Assignee: Xuan Gong >Priority: Blocker > Attachments: YARN-1861.2.patch, yarn-1861-1.patch > > > In our HA tests we noticed that the tests got stuck because both RM's got > into standby state and no one became active. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1861) Both RM stuck in standby mode when automatic failover is enabled
[ https://issues.apache.org/jira/browse/YARN-1861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13983796#comment-13983796 ] Xuan Gong commented on YARN-1861: - Taking this over for adding the testcases. > Both RM stuck in standby mode when automatic failover is enabled > > > Key: YARN-1861 > URL: https://issues.apache.org/jira/browse/YARN-1861 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Affects Versions: 2.4.0 >Reporter: Arpit Gupta >Assignee: Xuan Gong >Priority: Blocker > Attachments: yarn-1861-1.patch > > > In our HA tests we noticed that the tests got stuck because both RM's got > into standby state and no one became active. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1861) Both RM stuck in standby mode when automatic failover is enabled
[ https://issues.apache.org/jira/browse/YARN-1861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13983698#comment-13983698 ] Xuan Gong commented on YARN-1861: - Take this over > Both RM stuck in standby mode when automatic failover is enabled > > > Key: YARN-1861 > URL: https://issues.apache.org/jira/browse/YARN-1861 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Affects Versions: 2.4.0 >Reporter: Arpit Gupta >Assignee: Xuan Gong >Priority: Blocker > Attachments: yarn-1861-1.patch > > > In our HA tests we noticed that the tests got stuck because both RM's got > into standby state and no one became active. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1861) Both RM stuck in standby mode when automatic failover is enabled
[ https://issues.apache.org/jira/browse/YARN-1861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13979661#comment-13979661 ] Hadoop QA commented on YARN-1861: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12641673/yarn-1861-1.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:red}-1 findbugs{color}. The patch appears to introduce 3 new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/3621//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/3621//artifact/trunk/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-resourcemanager.html Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3621//console This message is automatically generated. 
> Both RM stuck in standby mode when automatic failover is enabled > > > Key: YARN-1861 > URL: https://issues.apache.org/jira/browse/YARN-1861 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Affects Versions: 2.4.0 >Reporter: Arpit Gupta >Assignee: Karthik Kambatla >Priority: Blocker > Attachments: yarn-1861-1.patch > > > In our HA tests we noticed that the tests got stuck because both RM's got > into standby state and no one became active. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1861) Both RM stuck in standby mode when automatic failover is enabled
[ https://issues.apache.org/jira/browse/YARN-1861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13979415#comment-13979415 ] Karthik Kambatla commented on YARN-1861: Taking this over. Figured out the issue - an Active RM doesn't notify the elector when it transitions itself to Standby, so the elector assumes everything is fine with the cluster. The fix is to call resetLeaderElection when the RM transitions itself to standby. Posting a patch that does that. Haven't written any tests yet. Will try to make time and write some. If I am not active enough, please feel free to take over the patch and the tests. > Both RM stuck in standby mode when automatic failover is enabled > > > Key: YARN-1861 > URL: https://issues.apache.org/jira/browse/YARN-1861 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Affects Versions: 2.4.0 >Reporter: Arpit Gupta >Assignee: Karthik Kambatla >Priority: Critical > > In our HA tests we noticed that the tests got stuck because both RM's got > into standby state and no one became active. -- This message was sent by Atlassian JIRA (v6.2#6252)
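The failure mode described in the comment above can be sketched with a toy model. This is not the Hadoop code and not the attached patch: ToyElector, ToyResourceManager, onStoreFenced and the notifyElector flag are hypothetical names used only for illustration, and the elector here simply picks the first candidate. The sketch assumes that the elector only re-runs an election when it is explicitly asked to.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model only; every class and method name here is hypothetical, not Hadoop API. */
public class FencingSketch {

    /** Stand-in for an elector that only re-runs an election when asked to. */
    static class ToyElector {
        private final List<ToyResourceManager> candidates = new ArrayList<>();
        private ToyResourceManager leader; // the RM the elector believes is active

        void join(ToyResourceManager rm) {
            candidates.add(rm);
            if (leader == null) {
                leader = rm;
                rm.becomeActive();
            }
        }

        /** Forget the current leader and elect again (first candidate wins here). */
        void resetLeaderElection() {
            leader = null;
            for (ToyResourceManager rm : candidates) {
                if (leader == null) {
                    leader = rm;
                    rm.becomeActive();
                }
            }
        }
    }

    /** Stand-in for an RM that can move itself to standby when its store is fenced. */
    static class ToyResourceManager {
        private final ToyElector elector;
        private final boolean notifyElector; // false models the reported bug
        private boolean active;

        ToyResourceManager(ToyElector elector, boolean notifyElector) {
            this.elector = elector;
            this.notifyElector = notifyElector;
        }

        void becomeActive() { active = true; }

        boolean isActive() { return active; }

        /** The RM transitions itself to standby after being fenced. */
        void onStoreFenced() {
            active = false;
            if (notifyElector) {
                elector.resetLeaderElection(); // the step the comment says was missing
            }
        }
    }

    /** Runs the scenario and reports whether any RM is active afterwards. */
    public static boolean clusterHasActiveAfterFencing(boolean notifyElector) {
        ToyElector elector = new ToyElector();
        ToyResourceManager rm1 = new ToyResourceManager(elector, notifyElector);
        ToyResourceManager rm2 = new ToyResourceManager(elector, notifyElector);
        elector.join(rm1); // rm1 becomes active
        elector.join(rm2); // rm2 stays in standby
        rm1.onStoreFenced();
        return rm1.isActive() || rm2.isActive();
    }

    public static void main(String[] args) {
        System.out.println("without reset: " + clusterHasActiveAfterFencing(false));
        System.out.println("with reset: " + clusterHasActiveAfterFencing(true));
    }
}
```

Without the notification, the toy elector still believes rm1 is the leader, so neither RM is active; with the reset, an election runs again and one RM becomes active.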
[jira] [Commented] (YARN-1861) Both RM stuck in standby mode when automatic failover is enabled
[ https://issues.apache.org/jira/browse/YARN-1861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13968520#comment-13968520 ] Tsuyoshi OZAWA commented on YARN-1861: -- Thank you for pointing that out, Karthik. I'll keep checking the code. > Both RM stuck in standby mode when automatic failover is enabled > > > Key: YARN-1861 > URL: https://issues.apache.org/jira/browse/YARN-1861 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Affects Versions: 2.4.0 >Reporter: Arpit Gupta >Assignee: Vinod Kumar Vavilapalli >Priority: Critical > > In our HA tests we noticed that the tests got stuck because both RM's got > into standby state and no one became active. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1861) Both RM stuck in standby mode when automatic failover is enabled
[ https://issues.apache.org/jira/browse/YARN-1861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13968512#comment-13968512 ] Karthik Kambatla commented on YARN-1861: While it is (theoretically) possible to run into the deadlock [~ozawa] mentioned, I don't think deleting zk locks would have fixed it. So, clearly, there is another issue lurking here. I'll take a look at this once I am done fixing the deadlock reported on YARN-1929. > Both RM stuck in standby mode when automatic failover is enabled > > > Key: YARN-1861 > URL: https://issues.apache.org/jira/browse/YARN-1861 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Affects Versions: 2.4.0 >Reporter: Arpit Gupta >Assignee: Vinod Kumar Vavilapalli >Priority: Critical > > In our HA tests we noticed that the tests got stuck because both RM's got > into standby state and no one became active. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1861) Both RM stuck in standby mode when automatic failover is enabled
[ https://issues.apache.org/jira/browse/YARN-1861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13968487#comment-13968487 ] Tsuyoshi OZAWA commented on YARN-1861: -- [~kasha], I think this problem looks very similar to YARN-1929 - a deadlock after losing the ZK session, between (*ASE#processResult* -> *EES#becomeStandby* -> *AS#transitionToStandby* -> *RM#transitionToStandby*) and (RM#serviceStop -> RM.super#serviceStop -> *RM.super#stop* -> AS#stop -> *AS#serviceStop* -> *EES#serviceStop* -> *ASE#quitElection*). IIUC, Karthik's patch on YARN-1929 partially solves this problem, but not completely. Please correct me if I am wrong. Thanks. > Both RM stuck in standby mode when automatic failover is enabled > > > Key: YARN-1861 > URL: https://issues.apache.org/jira/browse/YARN-1861 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Affects Versions: 2.4.0 >Reporter: Arpit Gupta >Assignee: Vinod Kumar Vavilapalli >Priority: Critical > > In our HA tests we noticed that the tests got stuck because both RM's got > into standby state and no one became active. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1861) Both RM stuck in standby mode when automatic failover is enabled
[ https://issues.apache.org/jira/browse/YARN-1861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13968211#comment-13968211 ] Rohith commented on YARN-1861: -- Oops, I too ran into both RMs being stuck in standby state forever :-( The trace is the same as the one Arpit Gupta gave in his comment. Another observation matches Vinod's: after deleting the lock, leader election started and the same RM became active. > Both RM stuck in standby mode when automatic failover is enabled > > > Key: YARN-1861 > URL: https://issues.apache.org/jira/browse/YARN-1861 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Affects Versions: 2.4.0 >Reporter: Arpit Gupta >Assignee: Vinod Kumar Vavilapalli >Priority: Critical > > In our HA tests we noticed that the tests got stuck because both RM's got > into standby state and no one became active. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1861) Both RM stuck in standby mode when automatic failover is enabled
[ https://issues.apache.org/jira/browse/YARN-1861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13942624#comment-13942624 ] Arpit Gupta commented on YARN-1861: --- It was stuck for a while. It is hard to determine the time period now, but my guesstimate would be at least a few hours, if not more. > Both RM stuck in standby mode when automatic failover is enabled > > > Key: YARN-1861 > URL: https://issues.apache.org/jira/browse/YARN-1861 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Affects Versions: 2.4.0 >Reporter: Arpit Gupta >Assignee: Vinod Kumar Vavilapalli >Priority: Critical > > In our HA tests we noticed that the tests got stuck because both RM's got > into standby state and no one became active. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1861) Both RM stuck in standby mode when automatic failover is enabled
[ https://issues.apache.org/jira/browse/YARN-1861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13942619#comment-13942619 ] Karthik Kambatla commented on YARN-1861: Interesting. Good catch, Arpit. I am surprised we can run into this. Just curious, how long have they been stuck? When an RM transitions to standby, the RM is supposed to automatically re-enroll itself in leader-election even if it doesn't lose its own ZK session. If this is not the case, we should fix it. If the RM does that, there shouldn't be a reason for both to be stuck in Standby mode. > Both RM stuck in standby mode when automatic failover is enabled > > > Key: YARN-1861 > URL: https://issues.apache.org/jira/browse/YARN-1861 > Project: Hadoop YARN > Issue Type: Sub-task > Components: resourcemanager >Affects Versions: 2.4.0 >Reporter: Arpit Gupta >Assignee: Vinod Kumar Vavilapalli > > In our HA tests we noticed that the tests got stuck because both RM's got > into standby state and no one became active. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1861) Both RM stuck in standby mode when automatic failover is enabled
[ https://issues.apache.org/jira/browse/YARN-1861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13942554#comment-13942554 ] Arpit Gupta commented on YARN-1861: --- Here is a snippet from the log
{code}
2014-03-18 09:39:42,544 INFO zookeeper.ClientCnxn (ClientCnxn.java:logStartConnect(966)) - Opening socket connection to server h2-ha-suse-uns-1395117052-2.cs1cloud.internal/172.18.145.62:2181. Will not attempt to authenticate using SASL (unknown error)
2014-03-18 09:39:42,545 INFO zookeeper.ClientCnxn (ClientCnxn.java:primeConnection(849)) - Socket connection established to h2-ha-suse-uns-1395117052-2.cs1cloud.internal/172.18.145.62:2181, initiating session
2014-03-18 09:39:45,437 INFO zookeeper.ClientCnxn (ClientCnxn.java:onConnected(1211)) - Session establishment complete on server h2-ha-suse-uns-1395117052-2.cs1cloud.internal/172.18.145.62:2181, sessionid = 0x144d394247b0005, negotiated timeout = 1
2014-03-18 09:39:47,326 INFO recovery.ZKRMStateStore (ZKRMStateStore.java:processWatchEvent(737)) - Watcher event type: None with state:Disconnected for path:null for Service org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore in state org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: STARTED
2014-03-18 09:39:47,326 INFO recovery.ZKRMStateStore (ZKRMStateStore.java:processWatchEvent(755)) - ZKRMStateStore Session disconnected
2014-03-18 09:39:47,326 INFO recovery.ZKRMStateStore (ZKRMStateStore.java:processWatchEvent(737)) - Watcher event type: None with state:SyncConnected for path:null for Service org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore in state org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: STARTED
2014-03-18 09:39:47,327 INFO recovery.ZKRMStateStore (ZKRMStateStore.java:processWatchEvent(745)) - ZKRMStateStore Session connected
2014-03-18 09:39:47,327 INFO recovery.ZKRMStateStore (ZKRMStateStore.java:processWatchEvent(751)) - ZKRMStateStore Session restored
2014-03-18 09:39:47,327 INFO recovery.ZKRMStateStore (ZKRMStateStore.java:processWatchEvent(737)) - Watcher event type: None with state:Disconnected for path:null for Service org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore in state org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: STARTED
2014-03-18 09:39:47,327 INFO recovery.ZKRMStateStore (ZKRMStateStore.java:processWatchEvent(755)) - ZKRMStateStore Session disconnected
2014-03-18 09:39:47,327 INFO recovery.ZKRMStateStore (ZKRMStateStore.java:processWatchEvent(737)) - Watcher event type: None with state:SyncConnected for path:null for Service org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore in state org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: STARTED
2014-03-18 09:39:47,327 INFO recovery.ZKRMStateStore (ZKRMStateStore.java:processWatchEvent(745)) - ZKRMStateStore Session connected
2014-03-18 09:39:47,327 INFO recovery.ZKRMStateStore (ZKRMStateStore.java:processWatchEvent(751)) - ZKRMStateStore Session restored
2014-03-18 09:39:47,328 FATAL resourcemanager.ResourceManager (ResourceManager.java:handle(652)) - Received a org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type STATE_STORE_FENCED. Cause: org.apache.hadoop.yarn.server.resourcemanager.recovery.StoreFencedException: RMStateStore has been fenced
    at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$VerifyActiveStatusThread.run(ZKRMStateStore.java:880)
2014-03-18 09:39:47,328 INFO resourcemanager.ResourceManager (ResourceManager.java:handle(656)) - RMStateStore has been fenced
2014-03-18 09:39:47,328 INFO resourcemanager.ResourceManager (ResourceManager.java:handle(660)) - Transitioning RM to Standby mode
2014-03-18 09:39:47,328 INFO resourcemanager.ResourceManager (ResourceManager.java:transitionToStandby(872)) - Transitioning to standby state
{code}
> Both RM stuck in standby mode when automatic failover is enabled > > > Key: YARN-1861 > URL: https://issues.apache.org/jira/browse/YARN-1861 > Project: Hadoop YARN > Issue Type: Bug >Affects Versions: 2.4.0 >Reporter: Arpit Gupta > > In our HA tests we noticed that the tests got stuck because both RM's got > into standby state and no one became active. -- This message was sent by Atlassian JIRA (v6.2#6252)