[jira] [Commented] (YARN-1861) Both RM stuck in standby mode when automatic failover is enabled

2014-05-16 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13999020#comment-13999020
 ] 

Hudson commented on YARN-1861:
--

SUCCESS: Integrated in Hadoop-trunk-Commit #5605 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/5605/])
YARN-1861. Fixed a bug in RM to reset leader-election on fencing that was 
causing both RMs to be stuck in standby mode when automatic failover is 
enabled. Contributed by Karthik Kambatla and Xuan Gong. (vinodkv: 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1594356)
* /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client/src/test/java/org/apache/hadoop/yarn/client/TestRMFailover.java
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/AdminService.java
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/EmbeddedElectorService.java
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ResourceManager.java
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-tests/src/test/java/org/apache/hadoop/yarn/server/MiniYARNCluster.java


> Both RM stuck in standby mode when automatic failover is enabled
> 
>
> Key: YARN-1861
> URL: https://issues.apache.org/jira/browse/YARN-1861
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Affects Versions: 2.4.0
>Reporter: Arpit Gupta
>Assignee: Karthik Kambatla
>Priority: Blocker
> Fix For: 2.4.1
>
> Attachments: YARN-1861.2.patch, YARN-1861.3.patch, YARN-1861.4.patch, 
> YARN-1861.5.patch, YARN-1861.7.patch, yarn-1861-1.patch, yarn-1861-6.patch
>
>
> In our HA tests we noticed that the tests got stuck because both RM's got 
> into standby state and no one became active.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1861) Both RM stuck in standby mode when automatic failover is enabled

2014-05-15 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13993978#comment-13993978
 ] 

Karthik Kambatla commented on YARN-1861:


Thanks a bunch for writing the test for this, Xuan. A couple of nits in the 
test.

# Nit: Would start the string with a capital letter. (Okay with not fixing this)
{code}
+RMFatalEvent event =
+new RMFatalEvent(RMFatalEventType.STATE_STORE_FENCED,
+  "fake RMFatalEvent");
{code}
# Nit: Typo in variable name and decrementing too

Since mail notifications are not working, let me post a patch fixing these 
nits.

> Both RM stuck in standby mode when automatic failover is enabled
> 
>
> Key: YARN-1861
> URL: https://issues.apache.org/jira/browse/YARN-1861
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Affects Versions: 2.4.0
>Reporter: Arpit Gupta
>Assignee: Xuan Gong
>Priority: Blocker
> Attachments: YARN-1861.2.patch, YARN-1861.3.patch, YARN-1861.4.patch, 
> YARN-1861.5.patch, yarn-1861-1.patch, yarn-1861-6.patch
>
>
> In our HA tests we noticed that the tests got stuck because both RM's got 
> into standby state and no one became active.





[jira] [Commented] (YARN-1861) Both RM stuck in standby mode when automatic failover is enabled

2014-05-15 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13998189#comment-13998189
 ] 

Hudson commented on YARN-1861:
--

FAILURE: Integrated in Hadoop-Mapreduce-trunk #1779 (See 
[https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1779/])
YARN-1861. Fixed a bug in RM to reset leader-election on fencing that was 
causing both RMs to be stuck in standby mode when automatic failover is 
enabled. Contributed by Karthik Kambatla and Xuan Gong. (vinodkv: 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1594356)
* /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client/src/test/java/org/apache/hadoop/yarn/client/TestRMFailover.java
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/AdminService.java
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/EmbeddedElectorService.java
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ResourceManager.java
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-tests/src/test/java/org/apache/hadoop/yarn/server/MiniYARNCluster.java




[jira] [Commented] (YARN-1861) Both RM stuck in standby mode when automatic failover is enabled

2014-05-15 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13998227#comment-13998227
 ] 

Hudson commented on YARN-1861:
--

FAILURE: Integrated in Hadoop-Hdfs-trunk #1753 (See 
[https://builds.apache.org/job/Hadoop-Hdfs-trunk/1753/])
YARN-1861. Fixed a bug in RM to reset leader-election on fencing that was 
causing both RMs to be stuck in standby mode when automatic failover is 
enabled. Contributed by Karthik Kambatla and Xuan Gong. (vinodkv: 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1594356)
* /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client/src/test/java/org/apache/hadoop/yarn/client/TestRMFailover.java
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/AdminService.java
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/EmbeddedElectorService.java
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ResourceManager.java
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-tests/src/test/java/org/apache/hadoop/yarn/server/MiniYARNCluster.java




[jira] [Commented] (YARN-1861) Both RM stuck in standby mode when automatic failover is enabled

2014-05-14 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13994014#comment-13994014
 ] 

Karthik Kambatla commented on YARN-1861:


I am obviously a +1 because I wrote the patch. Can someone other than Xuan and 
me take a look? 



[jira] [Commented] (YARN-1861) Both RM stuck in standby mode when automatic failover is enabled

2014-05-13 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13996807#comment-13996807
 ] 

Hadoop QA commented on YARN-1861:
-

{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12644515/YARN-1861.7.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 2 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager
 hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-tests.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/3744//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3744//console

This message is automatically generated.

> Both RM stuck in standby mode when automatic failover is enabled
> 
>
> Key: YARN-1861
> URL: https://issues.apache.org/jira/browse/YARN-1861
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Affects Versions: 2.4.0
>Reporter: Arpit Gupta
>Assignee: Karthik Kambatla
>Priority: Blocker
> Attachments: YARN-1861.2.patch, YARN-1861.3.patch, YARN-1861.4.patch, 
> YARN-1861.5.patch, YARN-1861.7.patch, yarn-1861-1.patch, yarn-1861-6.patch
>
>
> In our HA tests we noticed that the tests got stuck because both RM's got 
> into standby state and no one became active.





[jira] [Commented] (YARN-1861) Both RM stuck in standby mode when automatic failover is enabled

2014-05-13 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13996749#comment-13996749
 ] 

Vinod Kumar Vavilapalli commented on YARN-1861:
---

Okay, that's much better. +1. Will check this in once Jenkins says okay.



[jira] [Commented] (YARN-1861) Both RM stuck in standby mode when automatic failover is enabled

2014-05-13 Thread Xuan Gong (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13995700#comment-13995700
 ] 

Xuan Gong commented on YARN-1861:
-

bq. I tried to just apply the test-case and run it without the core change and 
was expecting the active RM to go to standby and the standby RM to go to active 
once the originally active RM is fenced. Instead I get a NPE somewhere. Can the 
test be fixed to do so?

In the test case, I manually send an RMFatalEvent of type 
RMFatalEventType.STATE_STORE_FENCED to the currently active RM (rm1). That RM 
handles the event and transitions to standby. Both RMs are then in standby 
state, while ZK still thinks rm1 is active, so it does not trigger a leader 
election. I think this mimics the behavior we described previously. Without 
the core code change, the test case fails: the NM tries to connect to the 
active RM, but neither of the two RMs is active, so the NPE is expected. 
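
The stuck state described above can be illustrated with a small toy model. This is only a sketch and not Hadoop code: `ToyLock`, `ToyRm`, and all of their methods are hypothetical stand-ins for the ZK leader lock and the RMs, and the model assumes that an election only happens when the lock is free.

```java
import java.util.ArrayList;
import java.util.List;

class FencingDemo {
    public static void main(String[] args) {
        ToyLock lock = new ToyLock();
        ToyRm rm1 = new ToyRm("rm1");
        ToyRm rm2 = new ToyRm("rm2");
        lock.join(rm1); // rm1 wins the first election
        lock.join(rm2); // rm2 waits as standby

        // Fence rm1 without resetting the election: nobody is active.
        rm1.onStateStoreFenced(lock, false);
        System.out.println("without reset: rm1 active=" + rm1.active
            + ", rm2 active=" + rm2.active);

        // Fence again, this time resetting the election.
        rm1.onStateStoreFenced(lock, true);
        System.out.println("with reset: rm1 active=" + rm1.active
            + ", rm2 active=" + rm2.active);
    }
}

// Hypothetical stand-in for the ZK leader lock; not the Hadoop or ZK API.
class ToyLock {
    private final List<ToyRm> candidates = new ArrayList<>();
    private ToyRm holder; // the RM that "ZK" believes is active

    void join(ToyRm rm) {
        candidates.add(rm);
        electIfFree();
    }

    // Give up the lock and rejoin at the back of the queue, so that
    // another candidate can win the next election.
    void reset(ToyRm rm) {
        if (holder == rm) {
            holder = null;
            candidates.remove(rm);
            candidates.add(rm);
        }
        electIfFree();
    }

    private void electIfFree() {
        if (holder == null && !candidates.isEmpty()) {
            holder = candidates.get(0);
            holder.active = true;
        }
    }

    ToyRm holder() {
        return holder;
    }
}

class ToyRm {
    final String id;
    boolean active;

    ToyRm(String id) {
        this.id = id;
    }

    // Fencing forces this RM to standby. Without resetElection the lock is
    // still held by this RM, so no new election is ever triggered.
    void onStateStoreFenced(ToyLock lock, boolean resetElection) {
        active = false;
        if (resetElection) {
            lock.reset(this);
        }
    }
}
```

In this model the first fencing call leaves both RMs in standby while the lock still names rm1, which is the situation the test injects. The second call releases the lock, and rm2 takes over.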

bq. Also, we need to make sure that when automatic failover is enabled, all 
external interventions like a fence like this bug (and forced-manual failover 
from CLI?) do a similar reset into the leader election. There may not be cases 
like this today though..

For external interventions under automatic failover, right now we have 
transitionToActive/transitionToStandby plus forcemanual from the CLI. The 
current behavior is as follows. If we run transitionToActive + forcemanual + 
the current standby RM id, the standby RM transitions to active. In the 
meantime, it performs the fencing, and the current active RM transitions to 
standby. If there are any exceptions, the RM is either terminated or goes 
back to standby state, which resets the leader election. In both cases, ZK 
triggers a new round of leader election.

If we run transitionToStandby + forcemanual + the current active RM id, both 
RMs end up in standby state, and another transitionToActive command is needed.



> Both RM stuck in standby mode when automatic failover is enabled
> 
>
> Key: YARN-1861
> URL: https://issues.apache.org/jira/browse/YARN-1861
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Affects Versions: 2.4.0
>Reporter: Arpit Gupta
>Assignee: Karthik Kambatla
>Priority: Blocker
> Attachments: YARN-1861.2.patch, YARN-1861.3.patch, YARN-1861.4.patch, 
> YARN-1861.5.patch, yarn-1861-1.patch, yarn-1861-6.patch
>
>
> In our HA tests we noticed that the tests got stuck because both RM's got 
> into standby state and no one became active.





[jira] [Commented] (YARN-1861) Both RM stuck in standby mode when automatic failover is enabled

2014-05-12 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13995779#comment-13995779
 ] 

Hadoop QA commented on YARN-1861:
-

{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12644204/yarn-1861-6.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/3737//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3737//console

This message is automatically generated.



[jira] [Commented] (YARN-1861) Both RM stuck in standby mode when automatic failover is enabled

2014-05-12 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13995781#comment-13995781
 ] 

Karthik Kambatla commented on YARN-1861:


bq. That is what I was thinking, but I am concerned about locking etc. This 
code has become a little convoluted.
Agree. I did consider going that route, but was worried about the 
maintainability. 



[jira] [Commented] (YARN-1861) Both RM stuck in standby mode when automatic failover is enabled

2014-05-12 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13995760#comment-13995760
 ] 

Vinod Kumar Vavilapalli commented on YARN-1861:
---

bq. Without the core code change, this testcase will fail. Because NM is trying 
to connect the active RM, but neither of two RMs are active. So, the NPE is 
expected.
Can we make this explicit, instead of being an NPE? Like doing a client call to 
find the current active RM or something like that?

Tx for the explanation of all the cases, Xuan.

bq. That looks hacky, but doesn't require new external interventions to 
explicitly handle it. Vinod Kumar Vavilapalli - do you think that would be a 
better approach?
That is what I was thinking, but I am concerned about locking etc. This code 
has become a little convoluted. Per Xuan, we seem to be safe for now, so maybe 
look at this separately?



[jira] [Commented] (YARN-1861) Both RM stuck in standby mode when automatic failover is enabled

2014-05-12 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13995743#comment-13995743
 ] 

Karthik Kambatla commented on YARN-1861:


bq. Also, we need to make sure that when automatic failover is enabled, all 
external interventions like a fence like this bug (and forced-manual failover 
from CLI?) do a similar reset into the leader election. There may not be cases 
like this today though.
One way to future-proof this is to call resetLeaderElection in 
ResourceManager#transitionToStandby itself. That looks hacky, but doesn't 
require new external interventions to explicitly handle it. [~vinodkv] - do you 
think that would be a better approach?
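
A rough sketch of that idea, for illustration only. The class and method names below echo the ones used in this thread, but `SketchResourceManager` and `SketchElector` are hypothetical and do not reflect the actual Hadoop classes or their signatures. The assumption is that the elector is absent when automatic failover is disabled.

```java
class TransitionHookDemo {
    public static void main(String[] args) {
        SketchElector elector = new SketchElector();
        SketchResourceManager rm = new SketchResourceManager(elector);
        // Any caller (fencing, admin CLI, ...) goes through the same method,
        // so the election is reset without the caller knowing about it.
        rm.transitionToStandby();
        System.out.println("active=" + rm.isActive() + ", resets=" + elector.resets);
    }
}

// Hypothetical elector that only counts how often it was reset.
class SketchElector {
    int resets = 0;

    void resetLeaderElection() {
        resets++;
    }
}

// Hypothetical RM that resets the election inside transitionToStandby itself.
class SketchResourceManager {
    private final SketchElector elector; // null when automatic failover is off
    private boolean active = true;

    SketchResourceManager(SketchElector elector) {
        this.elector = elector;
    }

    synchronized void transitionToStandby() {
        active = false;
        if (elector != null) {
            // Note: this call happens while holding the RM's monitor.
            elector.resetLeaderElection();
        }
    }

    synchronized boolean isActive() {
        return active;
    }
}
```

The trade-off in this shape is that the elector is called while the RM's own lock is held, so lock ordering between the two objects would need care in a real implementation.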



[jira] [Commented] (YARN-1861) Both RM stuck in standby mode when automatic failover is enabled

2014-05-12 Thread Xuan Gong (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13995796#comment-13995796
 ] 

Xuan Gong commented on YARN-1861:
-

bq. Can we make this explicit, instead of being an NPE? Like doing a client 
call to find the current active RM or something like that?

Yes, we can do that. DONE

bq. That is what I was thinking, but I am concerned about locking etc. This 
code has become a little convoluted. Per Xuan, we seem to be safe for now, so 
may be look at this separately?

Yes. But I will make a note about it. 




[jira] [Commented] (YARN-1861) Both RM stuck in standby mode when automatic failover is enabled

2014-05-12 Thread Xuan Gong (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13995812#comment-13995812
 ] 

Xuan Gong commented on YARN-1861:
-

Uploaded a new patch that explicitly throws an exception saying "Can not find 
the active RM" instead of an NPE.
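
As an illustration of the change in shape (not the code in the patch), a lookup that fails loudly could look roughly like this. `RmStatus`, `ActiveRmFinder`, and `findActiveRmIndex` are hypothetical names, and the use of `IllegalStateException` is an assumption; only the message text comes from this comment.

```java
import java.util.Arrays;
import java.util.List;

class ActiveRmDemo {
    public static void main(String[] args) {
        List<RmStatus> rms = Arrays.asList(new RmStatus(false), new RmStatus(true));
        System.out.println("active index: " + ActiveRmFinder.findActiveRmIndex(rms));
    }
}

// Hypothetical view of one RM: just whether it is currently active.
class RmStatus {
    final boolean active;

    RmStatus(boolean active) {
        this.active = active;
    }
}

class ActiveRmFinder {
    // Returns the index of the first active RM, or throws a descriptive
    // exception instead of letting a later dereference fail with an NPE.
    static int findActiveRmIndex(List<RmStatus> rms) {
        for (int i = 0; i < rms.size(); i++) {
            if (rms.get(i).active) {
                return i;
            }
        }
        throw new IllegalStateException("Can not find the active RM");
    }
}
```

With both RMs in standby, the caller sees the message directly rather than a stack trace from somewhere deeper in the test.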



[jira] [Commented] (YARN-1861) Both RM stuck in standby mode when automatic failover is enabled

2014-05-02 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13988590#comment-13988590
 ] 

Karthik Kambatla commented on YARN-1861:


Please wait until Sunday evening so that I can take a look at this. 

> Both RM stuck in standby mode when automatic failover is enabled
> 
>
> Key: YARN-1861
> URL: https://issues.apache.org/jira/browse/YARN-1861
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Affects Versions: 2.4.0
>Reporter: Arpit Gupta
>Assignee: Xuan Gong
>Priority: Blocker
> Attachments: YARN-1861.2.patch, YARN-1861.3.patch, YARN-1861.4.patch, 
> YARN-1861.5.patch, yarn-1861-1.patch
>
>
> In our HA tests we noticed that the tests got stuck because both RM's got 
> into standby state and no one became active.





[jira] [Commented] (YARN-1861) Both RM stuck in standby mode when automatic failover is enabled

2014-05-02 Thread Tsuyoshi OZAWA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13988573#comment-13988573
 ] 

Tsuyoshi OZAWA commented on YARN-1861:
--

> We should call rm.adminService.resetLeaderElection() in the finally block. If 
> rm.transitionToStandby() fails while stopping the RM's services, all RMs can get stuck.

Sorry, I noticed this is wrong. If rm.transitionToStandby() fails, the RM can 
be stuck until the ZK server detects the failure. We could call 
EmbeddedElectorService.stop() in the exception handler to shut down 
gracefully, but this is just one option.
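
A minimal sketch of that option, under assumptions: `SketchElectorService` and `StandbyTransition` are hypothetical, the transition is modeled as a plain `Runnable`, and the real `EmbeddedElectorService.stop()` may behave differently from the flag-setting stub used here.

```java
class StopOnFailureDemo {
    public static void main(String[] args) {
        SketchElectorService elector = new SketchElectorService();
        boolean ok = StandbyTransition.toStandby(() -> {
            throw new RuntimeException("simulated failure while stopping services");
        }, elector);
        System.out.println("transition ok=" + ok + ", elector stopped=" + elector.stopped);
    }
}

// Hypothetical stub: stop() stands for leaving the election so that the
// other RM does not have to wait for ZK to notice the failure.
class SketchElectorService {
    boolean stopped = false;

    void stop() {
        stopped = true;
    }
}

class StandbyTransition {
    // Runs the transition; if it fails, stops the elector in the exception
    // handler and reports the failure to the caller.
    static boolean toStandby(Runnable transitionToStandby, SketchElectorService elector) {
        try {
            transitionToStandby.run();
            return true;
        } catch (RuntimeException e) {
            elector.stop();
            return false;
        }
    }
}
```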



[jira] [Commented] (YARN-1861) Both RM stuck in standby mode when automatic failover is enabled

2014-05-02 Thread Tsuyoshi OZAWA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13988529#comment-13988529
 ] 

Tsuyoshi OZAWA commented on YARN-1861:
--

[~xgong] Great work. The test case by Xuan checks whether the fix by Karthik 
works well by injecting RMFatalEventType.STATE_STORE_FENCED directly.

My review comments are as follows:
{code}
 // Transition to standby and reinit active services
 LOG.info("Transitioning RM to Standby mode");
 rm.transitionToStandby(true);
+rm.adminService.resetLeaderElection();
 return;
   } catch (Exception e) {
{code}

We should call rm.adminService.resetLeaderElection() in the finally block. If 
rm.transitionToStandby() fails while stopping the RM's services, all RMs can get stuck.

{code}
+int maxWaittingAttempt = 20;
+while (maxWaittingAttempt -- > 0) {
{code}

maxWaittingAttempt should be maxWaitingAttempt.
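
For reference, a bounded polling loop of the kind the test uses might be written as below. This is a generic sketch, not the test's code: `BoundedWait.waitFor` is a hypothetical helper, and the sleep interval is an assumed parameter that the snippet above does not show.

```java
import java.util.function.BooleanSupplier;

class BoundedWaitDemo {
    public static void main(String[] args) {
        long deadline = System.currentTimeMillis() + 50;
        boolean met = BoundedWait.waitFor(
            () -> System.currentTimeMillis() >= deadline, 20, 10L);
        System.out.println("condition met: " + met);
    }
}

class BoundedWait {
    // Polls the condition at most maxWaitingAttempts times, sleeping
    // sleepMillis between polls. Returns true as soon as the condition
    // holds, false if the attempts run out or the thread is interrupted.
    static boolean waitFor(BooleanSupplier condition, int maxWaitingAttempts,
                           long sleepMillis) {
        for (int attempt = 0; attempt < maxWaitingAttempts; attempt++) {
            if (condition.getAsBoolean()) {
                return true;
            }
            try {
                Thread.sleep(sleepMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve interrupt status
                return false;
            }
        }
        return false;
    }
}
```

Counting upward with a named limit avoids decrementing the limit variable itself, which keeps the configured maximum available for an error message.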



[jira] [Commented] (YARN-1861) Both RM stuck in standby mode when automatic failover is enabled

2014-05-02 Thread Tsuyoshi OZAWA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13988450#comment-13988450
 ] 

Tsuyoshi OZAWA commented on YARN-1861:
--

Thanks for updating the patch, Xuan. The TestClientRMService failure does not 
look related to the change, so I filed it as YARN-2018. I'll try to look at 
the latest patch.



[jira] [Commented] (YARN-1861) Both RM stuck in standby mode when automatic failover is enabled

2014-05-02 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13988388#comment-13988388
 ] 

Hadoop QA commented on YARN-1861:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12643124/YARN-1861.5.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 core tests{color}.  The patch failed these unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager:

  
org.apache.hadoop.yarn.server.resourcemanager.TestClientRMService

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/3685//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3685//console

This message is automatically generated.

> Both RM stuck in standby mode when automatic failover is enabled
> 
>
> Key: YARN-1861
> URL: https://issues.apache.org/jira/browse/YARN-1861
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Affects Versions: 2.4.0
>Reporter: Arpit Gupta
>Assignee: Xuan Gong
>Priority: Blocker
> Attachments: YARN-1861.2.patch, YARN-1861.3.patch, YARN-1861.4.patch, 
> YARN-1861.5.patch, yarn-1861-1.patch
>
>
> In our HA tests we noticed that the tests got stuck because both RM's got 
> into standby state and no one became active.





[jira] [Commented] (YARN-1861) Both RM stuck in standby mode when automatic failover is enabled

2014-05-02 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13988071#comment-13988071
 ] 

Hadoop QA commented on YARN-1861:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12643078/YARN-1861.4.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

  {color:red}-1 javac{color}.  The applied patch generated 1279 javac 
compiler warnings (more than the trunk's current 1278 warnings).

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/3681//testReport/
Javac warnings: 
https://builds.apache.org/job/PreCommit-YARN-Build/3681//artifact/trunk/patchprocess/diffJavacWarnings.txt
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3681//console

This message is automatically generated.

> Both RM stuck in standby mode when automatic failover is enabled
> 
>
> Key: YARN-1861
> URL: https://issues.apache.org/jira/browse/YARN-1861
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Affects Versions: 2.4.0
>Reporter: Arpit Gupta
>Assignee: Xuan Gong
>Priority: Blocker
> Attachments: YARN-1861.2.patch, YARN-1861.3.patch, YARN-1861.4.patch, 
> yarn-1861-1.patch
>
>
> In our HA tests we noticed that the tests got stuck because both RM's got 
> into standby state and no one became active.





[jira] [Commented] (YARN-1861) Both RM stuck in standby mode when automatic failover is enabled

2014-05-02 Thread Xuan Gong (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13987972#comment-13987972
 ] 

Xuan Gong commented on YARN-1861:
-

[~ozawa] Thanks.

Uploaded a new patch based on the latest trunk that fixes the -1 on findbugs.

> Both RM stuck in standby mode when automatic failover is enabled
> 
>
> Key: YARN-1861
> URL: https://issues.apache.org/jira/browse/YARN-1861
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Affects Versions: 2.4.0
>Reporter: Arpit Gupta
>Assignee: Xuan Gong
>Priority: Blocker
> Attachments: YARN-1861.2.patch, YARN-1861.3.patch, YARN-1861.4.patch, 
> yarn-1861-1.patch
>
>
> In our HA tests we noticed that the tests got stuck because both RM's got 
> into standby state and no one became active.





[jira] [Commented] (YARN-1861) Both RM stuck in standby mode when automatic failover is enabled

2014-05-02 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13987906#comment-13987906
 ] 

Hadoop QA commented on YARN-1861:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12643061/YARN-1861.3.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:red}-1 findbugs{color}.  The patch appears to introduce 1 new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/3679//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-YARN-Build/3679//artifact/trunk/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-resourcemanager.html
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3679//console

This message is automatically generated.

> Both RM stuck in standby mode when automatic failover is enabled
> 
>
> Key: YARN-1861
> URL: https://issues.apache.org/jira/browse/YARN-1861
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Affects Versions: 2.4.0
>Reporter: Arpit Gupta
>Assignee: Xuan Gong
>Priority: Blocker
> Attachments: YARN-1861.2.patch, YARN-1861.3.patch, yarn-1861-1.patch
>
>
> In our HA tests we noticed that the tests got stuck because both RM's got 
> into standby state and no one became active.





[jira] [Commented] (YARN-1861) Both RM stuck in standby mode when automatic failover is enabled

2014-04-28 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13983970#comment-13983970
 ] 

Hadoop QA commented on YARN-1861:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12642387/YARN-1861.2.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

  {color:red}-1 javac{color}.  The applied patch generated 1279 javac 
compiler warnings (more than the trunk's current 1278 warnings).

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/3653//testReport/
Javac warnings: 
https://builds.apache.org/job/PreCommit-YARN-Build/3653//artifact/trunk/patchprocess/diffJavacWarnings.txt
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3653//console

This message is automatically generated.

> Both RM stuck in standby mode when automatic failover is enabled
> 
>
> Key: YARN-1861
> URL: https://issues.apache.org/jira/browse/YARN-1861
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Affects Versions: 2.4.0
>Reporter: Arpit Gupta
>Assignee: Xuan Gong
>Priority: Blocker
> Attachments: YARN-1861.2.patch, yarn-1861-1.patch
>
>
> In our HA tests we noticed that the tests got stuck because both RM's got 
> into standby state and no one became active.





[jira] [Commented] (YARN-1861) Both RM stuck in standby mode when automatic failover is enabled

2014-04-28 Thread Xuan Gong (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13983875#comment-13983875
 ] 

Xuan Gong commented on YARN-1861:
-

The solution provided by Karthik looks good to me. Uploaded a new patch that 
adds a test case.

> Both RM stuck in standby mode when automatic failover is enabled
> 
>
> Key: YARN-1861
> URL: https://issues.apache.org/jira/browse/YARN-1861
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Affects Versions: 2.4.0
>Reporter: Arpit Gupta
>Assignee: Xuan Gong
>Priority: Blocker
> Attachments: YARN-1861.2.patch, yarn-1861-1.patch
>
>
> In our HA tests we noticed that the tests got stuck because both RM's got 
> into standby state and no one became active.





[jira] [Commented] (YARN-1861) Both RM stuck in standby mode when automatic failover is enabled

2014-04-28 Thread Xuan Gong (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13983796#comment-13983796
 ] 

Xuan Gong commented on YARN-1861:
-

Taking this over to add the test cases.

> Both RM stuck in standby mode when automatic failover is enabled
> 
>
> Key: YARN-1861
> URL: https://issues.apache.org/jira/browse/YARN-1861
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Affects Versions: 2.4.0
>Reporter: Arpit Gupta
>Assignee: Xuan Gong
>Priority: Blocker
> Attachments: yarn-1861-1.patch
>
>
> In our HA tests we noticed that the tests got stuck because both RM's got 
> into standby state and no one became active.





[jira] [Commented] (YARN-1861) Both RM stuck in standby mode when automatic failover is enabled

2014-04-28 Thread Xuan Gong (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13983698#comment-13983698
 ] 

Xuan Gong commented on YARN-1861:
-

Taking this over.

> Both RM stuck in standby mode when automatic failover is enabled
> 
>
> Key: YARN-1861
> URL: https://issues.apache.org/jira/browse/YARN-1861
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Affects Versions: 2.4.0
>Reporter: Arpit Gupta
>Assignee: Xuan Gong
>Priority: Blocker
> Attachments: yarn-1861-1.patch
>
>
> In our HA tests we noticed that the tests got stuck because both RM's got 
> into standby state and no one became active.





[jira] [Commented] (YARN-1861) Both RM stuck in standby mode when automatic failover is enabled

2014-04-24 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13979661#comment-13979661
 ] 

Hadoop QA commented on YARN-1861:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12641673/yarn-1861-1.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:red}-1 tests included{color}.  The patch doesn't appear to include 
any new or modified tests.
Please justify why no new tests are needed for this 
patch.
Also please list what manual steps were performed to 
verify this patch.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:red}-1 findbugs{color}.  The patch appears to introduce 3 new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/3621//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-YARN-Build/3621//artifact/trunk/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-resourcemanager.html
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3621//console

This message is automatically generated.

> Both RM stuck in standby mode when automatic failover is enabled
> 
>
> Key: YARN-1861
> URL: https://issues.apache.org/jira/browse/YARN-1861
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Affects Versions: 2.4.0
>Reporter: Arpit Gupta
>Assignee: Karthik Kambatla
>Priority: Blocker
> Attachments: yarn-1861-1.patch
>
>
> In our HA tests we noticed that the tests got stuck because both RM's got 
> into standby state and no one became active.





[jira] [Commented] (YARN-1861) Both RM stuck in standby mode when automatic failover is enabled

2014-04-24 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13979415#comment-13979415
 ] 

Karthik Kambatla commented on YARN-1861:


Taking this over. Figured out the issue - an Active RM doesn't notify the 
elector when it transitions itself to Standby, so the elector assumes everything 
is fine with the cluster. The fix is to resetLeaderElection when the RM 
transitions itself to standby. Posting a patch that does that. 

Haven't written any tests yet. Will try to make time and write some. If I am 
not active enough, please feel free to take it over and add the tests.
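
A minimal sketch of the failure mode described above, assuming a toy in-memory elector in place of the real ZooKeeper-backed one. ToyElector, ToyRM, onFenced and simulate are hypothetical names for illustration only and are not the classes or methods touched by the patch. It only models the idea that an RM which moves itself to standby while still holding the election lock leaves nobody active, whereas quitting and rejoining the election lets another RM take over.

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class ElectionResetSketch {
    enum State { ACTIVE, STANDBY }

    /** Toy elector (hypothetical): the candidate at the head of the queue holds the lock. */
    static class ToyElector {
        private final Deque<ToyRM> candidates = new ArrayDeque<>();

        void join(ToyRM rm) {
            if (!candidates.contains(rm)) {
                candidates.addLast(rm);
            }
            promoteLeader();
        }

        void quit(ToyRM rm) {
            candidates.remove(rm);
            promoteLeader();
        }

        // Only runs on join/quit: the elector never learns about state
        // changes an RM makes on its own.
        private void promoteLeader() {
            ToyRM leader = candidates.peekFirst();
            if (leader != null) {
                leader.state = State.ACTIVE;
            }
        }
    }

    /** Toy RM (hypothetical): starts in standby until the elector promotes it. */
    static class ToyRM {
        final ToyElector elector;
        State state = State.STANDBY;

        ToyRM(ToyElector elector) {
            this.elector = elector;
        }

        /** The RM moves itself to standby; optionally it also resets the election. */
        void onFenced(boolean resetElection) {
            state = State.STANDBY;
            if (resetElection) {
                elector.quit(this);  // give up the lock so another RM can win
                elector.join(this);  // re-enroll as a candidate
            }
        }
    }

    /** Runs the two-RM scenario and returns how many RMs end up active. */
    public static int simulate(boolean resetElection) {
        ToyElector elector = new ToyElector();
        ToyRM rm1 = new ToyRM(elector);
        ToyRM rm2 = new ToyRM(elector);
        elector.join(rm1);  // rm1 wins and becomes active
        elector.join(rm2);  // rm2 waits in standby
        rm1.onFenced(resetElection);
        int active = 0;
        for (ToyRM rm : new ToyRM[] {rm1, rm2}) {
            if (rm.state == State.ACTIVE) {
                active++;
            }
        }
        return active;
    }

    public static void main(String[] args) {
        System.out.println("without reset: " + simulate(false) + " active RM(s)");
        System.out.println("with reset: " + simulate(true) + " active RM(s)");
    }
}
```

In this toy model, simulate(false) leaves both RMs in standby, while simulate(true) ends with exactly one active RM.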

> Both RM stuck in standby mode when automatic failover is enabled
> 
>
> Key: YARN-1861
> URL: https://issues.apache.org/jira/browse/YARN-1861
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Affects Versions: 2.4.0
>Reporter: Arpit Gupta
>Assignee: Karthik Kambatla
>Priority: Critical
>
> In our HA tests we noticed that the tests got stuck because both RM's got 
> into standby state and no one became active.





[jira] [Commented] (YARN-1861) Both RM stuck in standby mode when automatic failover is enabled

2014-04-14 Thread Tsuyoshi OZAWA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13968520#comment-13968520
 ] 

Tsuyoshi OZAWA commented on YARN-1861:
--

Thank you for pointing that out, Karthik. I'll continue checking the code.

> Both RM stuck in standby mode when automatic failover is enabled
> 
>
> Key: YARN-1861
> URL: https://issues.apache.org/jira/browse/YARN-1861
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Affects Versions: 2.4.0
>Reporter: Arpit Gupta
>Assignee: Vinod Kumar Vavilapalli
>Priority: Critical
>
> In our HA tests we noticed that the tests got stuck because both RM's got 
> into standby state and no one became active.





[jira] [Commented] (YARN-1861) Both RM stuck in standby mode when automatic failover is enabled

2014-04-14 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13968512#comment-13968512
 ] 

Karthik Kambatla commented on YARN-1861:


While it is (theoretically) possible to run into the deadlock [~ozawa] 
mentioned, I don't think deleting zk locks would have fixed it. So, clearly, 
there is another issue lurking here. I'll take a look at this once I am done 
fixing the deadlock reported on YARN-1929.

> Both RM stuck in standby mode when automatic failover is enabled
> 
>
> Key: YARN-1861
> URL: https://issues.apache.org/jira/browse/YARN-1861
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Affects Versions: 2.4.0
>Reporter: Arpit Gupta
>Assignee: Vinod Kumar Vavilapalli
>Priority: Critical
>
> In our HA tests we noticed that the tests got stuck because both RM's got 
> into standby state and no one became active.





[jira] [Commented] (YARN-1861) Both RM stuck in standby mode when automatic failover is enabled

2014-04-14 Thread Tsuyoshi OZAWA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13968487#comment-13968487
 ] 

Tsuyoshi OZAWA commented on YARN-1861:
--

[~kasha], I think this problem looks very similar to YARN-1929 - deadlock after 
losing ZK session.

(*ASE#processResult* -> *EES#becomeStandby* -> *AS#transitionToStandby* -> 
*RM#transitionToStandby*) and (RM#serviceStop -> RM.super#serviceStop -> 
*RM.super#stop* -> AS#stop -> *AS#serviceStop* -> *EES#serviceStop* -> 
*ASE#quitElection*)

IIUC, Karthik's patch on YARN-1929 partially solves this problem, but not 
completely. Please correct me if I am wrong. Thanks.
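
A rough sketch of why the two call chains above can deadlock, under the assumption (not confirmed in this thread) that each emphasized step takes a lock on its owning object. With that reading, the first chain acquires ASE, EES, AS, RM in that order and the second acquires RM, AS, EES, ASE. The helper below is hypothetical and only reports the pairs of locks that the two chains take in opposite order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class LockOrderSketch {
    /** Returns "A<->B" for every pair of locks the two chains acquire in opposite order. */
    public static List<String> inversions(List<String> first, List<String> second) {
        List<String> found = new ArrayList<>();
        for (int i = 0; i < first.size(); i++) {
            for (int j = i + 1; j < first.size(); j++) {
                String a = first.get(i);
                String b = first.get(j);
                int posA = second.indexOf(a);
                int posB = second.indexOf(b);
                // first takes a before b; an inversion means second takes b before a
                if (posA >= 0 && posB >= 0 && posB < posA) {
                    found.add(a + "<->" + b);
                }
            }
        }
        return found;
    }

    public static void main(String[] args) {
        // Assumed lock orders, read off the emphasized steps in the two chains.
        List<String> electorThread = Arrays.asList("ASE", "EES", "AS", "RM");
        List<String> stopThread = Arrays.asList("RM", "AS", "EES", "ASE");
        System.out.println(inversions(electorThread, stopThread));
    }
}
```

Any reported pair is a potential deadlock if the two threads interleave, which is why a consistent acquisition order (or not holding a lock across the call) removes the problem.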

> Both RM stuck in standby mode when automatic failover is enabled
> 
>
> Key: YARN-1861
> URL: https://issues.apache.org/jira/browse/YARN-1861
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Affects Versions: 2.4.0
>Reporter: Arpit Gupta
>Assignee: Vinod Kumar Vavilapalli
>Priority: Critical
>
> In our HA tests we noticed that the tests got stuck because both RM's got 
> into standby state and no one became active.





[jira] [Commented] (YARN-1861) Both RM stuck in standby mode when automatic failover is enabled

2014-04-14 Thread Rohith (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13968211#comment-13968211
 ] 

Rohith commented on YARN-1861:
--

Oops, I too ran into both RMs being stuck in standby state forever :-(  

The trace is the same as the one Arpit Gupta gave in his comment, and my other 
observation matches Vinod's: after deleting the lock, leader election started 
and the same RM became active.

> Both RM stuck in standby mode when automatic failover is enabled
> 
>
> Key: YARN-1861
> URL: https://issues.apache.org/jira/browse/YARN-1861
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Affects Versions: 2.4.0
>Reporter: Arpit Gupta
>Assignee: Vinod Kumar Vavilapalli
>Priority: Critical
>
> In our HA tests we noticed that the tests got stuck because both RM's got 
> into standby state and no one became active.





[jira] [Commented] (YARN-1861) Both RM stuck in standby mode when automatic failover is enabled

2014-03-20 Thread Arpit Gupta (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13942624#comment-13942624
 ] 

Arpit Gupta commented on YARN-1861:
---

It was stuck for a while. It is hard to determine the time period now, but my 
guesstimate would be at least a few hours, if not more.

> Both RM stuck in standby mode when automatic failover is enabled
> 
>
> Key: YARN-1861
> URL: https://issues.apache.org/jira/browse/YARN-1861
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Affects Versions: 2.4.0
>Reporter: Arpit Gupta
>Assignee: Vinod Kumar Vavilapalli
>Priority: Critical
>
> In our HA tests we noticed that the tests got stuck because both RM's got 
> into standby state and no one became active.





[jira] [Commented] (YARN-1861) Both RM stuck in standby mode when automatic failover is enabled

2014-03-20 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13942619#comment-13942619
 ] 

Karthik Kambatla commented on YARN-1861:


Interesting. Good catch, Arpit. I am surprised we can run into this. Just 
curious, how long have they been stuck? 

When an RM transitions to standby, the RM is supposed to automatically 
re-enroll itself in leader-election even if it doesn't lose its own ZK session. 
If this is not the case, we should fix it. If the RM does that, there shouldn't 
be a reason for both to be stuck in Standby mode. 

> Both RM stuck in standby mode when automatic failover is enabled
> 
>
> Key: YARN-1861
> URL: https://issues.apache.org/jira/browse/YARN-1861
> Project: Hadoop YARN
>  Issue Type: Sub-task
>  Components: resourcemanager
>Affects Versions: 2.4.0
>Reporter: Arpit Gupta
>Assignee: Vinod Kumar Vavilapalli
>
> In our HA tests we noticed that the tests got stuck because both RM's got 
> into standby state and no one became active.





[jira] [Commented] (YARN-1861) Both RM stuck in standby mode when automatic failover is enabled

2014-03-20 Thread Arpit Gupta (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13942554#comment-13942554
 ] 

Arpit Gupta commented on YARN-1861:
---

Here is a snippet from the log

{code}
2014-03-18 09:39:42,544 INFO  zookeeper.ClientCnxn (ClientCnxn.java:logStartConnect(966)) - Opening socket connection to server h2-ha-suse-uns-1395117052-2.cs1cloud.internal/172.18.145.62:2181. Will not attempt to authenticate using SASL (unknown error)
2014-03-18 09:39:42,545 INFO  zookeeper.ClientCnxn (ClientCnxn.java:primeConnection(849)) - Socket connection established to h2-ha-suse-uns-1395117052-2.cs1cloud.internal/172.18.145.62:2181, initiating session
2014-03-18 09:39:45,437 INFO  zookeeper.ClientCnxn (ClientCnxn.java:onConnected(1211)) - Session establishment complete on server h2-ha-suse-uns-1395117052-2.cs1cloud.internal/172.18.145.62:2181, sessionid = 0x144d394247b0005, negotiated timeout = 1
2014-03-18 09:39:47,326 INFO  recovery.ZKRMStateStore (ZKRMStateStore.java:processWatchEvent(737)) - Watcher event type: None with state:Disconnected for path:null for Service org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore in state org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: STARTED
2014-03-18 09:39:47,326 INFO  recovery.ZKRMStateStore (ZKRMStateStore.java:processWatchEvent(755)) - ZKRMStateStore Session disconnected
2014-03-18 09:39:47,326 INFO  recovery.ZKRMStateStore (ZKRMStateStore.java:processWatchEvent(737)) - Watcher event type: None with state:SyncConnected for path:null for Service org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore in state org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: STARTED
2014-03-18 09:39:47,327 INFO  recovery.ZKRMStateStore (ZKRMStateStore.java:processWatchEvent(745)) - ZKRMStateStore Session connected
2014-03-18 09:39:47,327 INFO  recovery.ZKRMStateStore (ZKRMStateStore.java:processWatchEvent(751)) - ZKRMStateStore Session restored
2014-03-18 09:39:47,327 INFO  recovery.ZKRMStateStore (ZKRMStateStore.java:processWatchEvent(737)) - Watcher event type: None with state:Disconnected for path:null for Service org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore in state org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: STARTED
2014-03-18 09:39:47,327 INFO  recovery.ZKRMStateStore (ZKRMStateStore.java:processWatchEvent(755)) - ZKRMStateStore Session disconnected
2014-03-18 09:39:47,327 INFO  recovery.ZKRMStateStore (ZKRMStateStore.java:processWatchEvent(737)) - Watcher event type: None with state:SyncConnected for path:null for Service org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore in state org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: STARTED
2014-03-18 09:39:47,327 INFO  recovery.ZKRMStateStore (ZKRMStateStore.java:processWatchEvent(745)) - ZKRMStateStore Session connected
2014-03-18 09:39:47,327 INFO  recovery.ZKRMStateStore (ZKRMStateStore.java:processWatchEvent(751)) - ZKRMStateStore Session restored
2014-03-18 09:39:47,328 FATAL resourcemanager.ResourceManager (ResourceManager.java:handle(652)) - Received a org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type STATE_STORE_FENCED. Cause:
org.apache.hadoop.yarn.server.resourcemanager.recovery.StoreFencedException: RMStateStore has been fenced
        at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$VerifyActiveStatusThread.run(ZKRMStateStore.java:880)

2014-03-18 09:39:47,328 INFO  resourcemanager.ResourceManager (ResourceManager.java:handle(656)) - RMStateStore has been fenced
2014-03-18 09:39:47,328 INFO  resourcemanager.ResourceManager (ResourceManager.java:handle(660)) - Transitioning RM to Standby mode
2014-03-18 09:39:47,328 INFO  resourcemanager.ResourceManager (ResourceManager.java:transitionToStandby(872)) - Transitioning to standby state
{code}

> Both RM stuck in standby mode when automatic failover is enabled
> 
>
> Key: YARN-1861
> URL: https://issues.apache.org/jira/browse/YARN-1861
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.4.0
>Reporter: Arpit Gupta
>
> In our HA tests we noticed that the tests got stuck because both RM's got 
> into standby state and no one became active.


