[jira] [Commented] (YARN-2579) Both RM's state is Active , but 1 RM is not really active.

2014-11-05 Thread Jian He (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14199489#comment-14199489
 ] 

Jian He commented on YARN-2579:
---

[~kasha], thanks for your review. I'm committing this.

> Both RM's state is Active , but 1 RM is not really active.
> --
>
> Key: YARN-2579
> URL: https://issues.apache.org/jira/browse/YARN-2579
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.5.1
>Reporter: Rohith
>Assignee: Rohith
>Priority: Blocker
> Attachments: YARN-2579-20141105.1.patch, YARN-2579-20141105.2.patch, 
> YARN-2579-20141105.3.patch, YARN-2579-20141105.patch, YARN-2579.patch, 
> YARN-2579.patch
>
>
> I encountered a situation where both RMs' web pages were accessible and each 
> displayed its state as Active, but one of the RMs' ActiveServices were 
> stopped.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-2579) Both RM's state is Active , but 1 RM is not really active.

2014-11-05 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14199475#comment-14199475
 ] 

Karthik Kambatla commented on YARN-2579:


Glad to see we are getting rid of the RMFatalEventDispatcher. I am assuming we 
want to keep the changes to a minimum in this patch, and do a follow-up JIRA to 
clean this up better. I would love to work on the follow-up; I noticed a few 
discrepancies while working on YARN-2010 that continue to exist with this 
patch as well. 

Functionally, the patch looks good to me. In the interest of unblocking 2.6, I 
am +1 to committing it as well, but would like to point out some follow-up work 
that I see. Filed YARN-2814 to work on these items.

I see the following follow-up items to simplify the surrounding code and 
improve readability, if we do commit the existing patch.
# Get rid of RMFatalEventDispatcher and RMFatalEvent* altogether.
# Given all other events are specific to RMActiveServices, we should move the 
dispatcher also into RMActiveServices.
# I am not a fan of having a pointer to the RM in the store as well, 
particularly since we have RMContext primarily to hold the information other 
classes need. I am concerned about more classes needing this information in the 
future. 
# Add a shutdownOrTransitionToStandby method in the RM to transparently handle 
non-HA and HA cases.
# Unrelated to this patch: we should make the existing 
{{transitionToStandby(boolean)}} private, and add a package-private 
{{transitionToStandby()}} to be called from AdminService and 
EmbeddedElectorService. 
# Instead of calling ExitUtil#terminate at multiple places in the RM, we should 
have a {{protected shutdown()}} method that does this and can be overridden in 
MockRM for better testing. 
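
Items 4 and 6 above can be sketched roughly as follows. This is a minimal illustration, not the actual ResourceManager code: SketchRM and SketchMockRM are hypothetical stand-ins, the shutdown(int) signature is an assumption, and System.exit stands in for ExitUtil#terminate.

```java
public class ShutdownHookSketch {

  /** Hypothetical stand-in for the ResourceManager; not the real class. */
  static class SketchRM {
    private final boolean haEnabled;
    private boolean standby = false;

    SketchRM(boolean haEnabled) {
      this.haEnabled = haEnabled;
    }

    /** Single exit point; a test subclass can override it instead of killing the JVM. */
    protected void shutdown(int status) {
      System.exit(status); // assumption: the real code would use ExitUtil#terminate
    }

    void transitionToStandby() {
      standby = true;
    }

    boolean isStandby() {
      return standby;
    }

    /** HA: fall back to standby. Non-HA: there is no peer to fail over to, so exit. */
    void shutdownOrTransitionToStandby() {
      if (haEnabled) {
        transitionToStandby();
      } else {
        shutdown(1);
      }
    }
  }

  /** Hypothetical test double in the spirit of MockRM: records the exit status. */
  static class SketchMockRM extends SketchRM {
    int exitStatus = -1;

    SketchMockRM(boolean haEnabled) {
      super(haEnabled);
    }

    @Override
    protected void shutdown(int status) {
      exitStatus = status;
    }
  }

  public static void main(String[] args) {
    SketchMockRM ha = new SketchMockRM(true);
    ha.shutdownOrTransitionToStandby();

    SketchMockRM nonHa = new SketchMockRM(false);
    nonHa.shutdownOrTransitionToStandby();

    System.out.println("HA standby=" + ha.isStandby()
        + ", non-HA exit status=" + nonHa.exitStatus);
  }
}
```

With a single overridable exit point, callers do not need to know whether HA is enabled, and a test can assert on the recorded status instead of terminating the test JVM.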



[jira] [Commented] (YARN-2579) Both RM's state is Active , but 1 RM is not really active.

2014-11-05 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14199414#comment-14199414
 ] 

Karthik Kambatla commented on YARN-2579:


Was caught up all day. Looking now.



[jira] [Commented] (YARN-2579) Both RM's state is Active , but 1 RM is not really active.

2014-11-05 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14199408#comment-14199408
 ] 

Vinod Kumar Vavilapalli commented on YARN-2579:
---

[~jianhe] / [~kasha] / [~rohithsharma], can we get this in now?



[jira] [Commented] (YARN-2579) Both RM's state is Active , but 1 RM is not really active.

2014-11-05 Thread Jian He (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14199187#comment-14199187
 ] 

Jian He commented on YARN-2579:
---

+1. [~kasha], wanna take a look?



[jira] [Commented] (YARN-2579) Both RM's state is Active , but 1 RM is not really active.

2014-11-05 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14198960#comment-14198960
 ] 

Hadoop QA commented on YARN-2579:
-

{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  
http://issues.apache.org/jira/secure/attachment/12679587/YARN-2579-20141105.3.patch
  against trunk revision 1831280.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 2 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/5740//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5740//console

This message is automatically generated.



[jira] [Commented] (YARN-2579) Both RM's state is Active , but 1 RM is not really active.

2014-11-05 Thread Rohith (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14198910#comment-14198910
 ] 

Rohith commented on YARN-2579:
--

I checked the test failures; they are not related to this patch. I also ran the 
tests on my machine, and they pass!



[jira] [Commented] (YARN-2579) Both RM's state is Active , but 1 RM is not really active.

2014-11-05 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14198886#comment-14198886
 ] 

Hadoop QA commented on YARN-2579:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  
http://issues.apache.org/jira/secure/attachment/12679578/YARN-2579-20141105.2.patch
  against trunk revision 6e8722e.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 2 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 core tests{color}.  The patch failed these unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager:

  org.apache.hadoop.yarn.server.resourcemanager.TestRM
  org.apache.hadoop.yarn.server.resourcemanager.TestAppManager

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/5739//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5739//console

This message is automatically generated.



[jira] [Commented] (YARN-2579) Both RM's state is Active , but 1 RM is not really active.

2014-11-05 Thread Rohith (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14198806#comment-14198806
 ] 

Rohith commented on YARN-2579:
--

I got the refactored change. :-)



[jira] [Commented] (YARN-2579) Both RM's state is Active , but 1 RM is not really active.

2014-11-05 Thread Jian He (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14198789#comment-14198789
 ] 

Jian He commented on YARN-2579:
---

Thanks Rohith, looks good overall. I went ahead and did some very minor 
refactoring in the {{RMStateStore#notifyStoreOperationFailed}} method.



[jira] [Commented] (YARN-2579) Both RM's state is Active , but 1 RM is not really active.

2014-11-05 Thread Arun C Murthy (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14198729#comment-14198729
 ] 

Arun C Murthy commented on YARN-2579:
-

Can we please get this in today? Tx.



[jira] [Commented] (YARN-2579) Both RM's state is Active , but 1 RM is not really active.

2014-11-05 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14198459#comment-14198459
 ] 

Karthik Kambatla commented on YARN-2579:


I would like to take a look at the final patch before it gets committed. 



[jira] [Commented] (YARN-2579) Both RM's state is Active , but 1 RM is not really active.

2014-11-05 Thread Rohith (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14197952#comment-14197952
 ] 

Rohith commented on YARN-2579:
--

bq. We should avoid these dispatcher events trying to close the dispatchers - 
that was why I suggested a separate thread (my point 2.2 in the proposal).
I missed this in YARN-2579-20141105.patch. I later updated the patch to use a 
separate thread, i.e. YARN-2579-20141105.1.patch. 
{code}
-  type = RMFatalEventType.STATE_STORE_FENCED;
+  Thread standByTransitionThread =
+  new Thread(new StandByTransitionThread());
+  standByTransitionThread.setName("StandByTransitionThread Handler");
+  standByTransitionThread.start();
{code}




[jira] [Commented] (YARN-2579) Both RM's state is Active , but 1 RM is not really active.

2014-11-05 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14197945#comment-14197945
 ] 

Hadoop QA commented on YARN-2579:
-

{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  
http://issues.apache.org/jira/secure/attachment/12679489/YARN-2579-20141105.1.patch
  against trunk revision 73068f6.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/5736//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5736//console

This message is automatically generated.



[jira] [Commented] (YARN-2579) Both RM's state is Active , but 1 RM is not really active.

2014-11-05 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14197890#comment-14197890
 ] 

Vinod Kumar Vavilapalli commented on YARN-2579:
---

Tx for working on this [~rohithsharma]!

The abstractions are all broken in this part of the code-base, but it's not 
your fault. Given this is a blocker, your approach to minimize the changes is 
good!

One comment: This is still invoking transitionToStandby in the RMStateStore's 
dispatcher. So what we will see is the following:
{code}
RMStateStoreDispatcher.handle() -> store fails in the event, generates a 
notifyStoreOperationFailed -> invokes resourceManager.handleTransitionToStandBy 
-> calls transitionToStandby(boolean) -> activeServices.stop() -> 
stateStore.close() -> RMStateStoreDispatcher.stop()
{code}

We should avoid these dispatcher events trying to close the dispatchers - that 
was why I suggested a separate thread (my point 2.2 in the proposal).
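
The hand-off to a separate thread can be sketched as follows. This is a minimal illustration under stated assumptions, not Hadoop code: SketchDispatcher is a hypothetical single-threaded dispatcher whose stop() joins its own event thread, and the thread name StandByTransitionThread is borrowed loosely from the patch discussion. The handler for the failed store event never calls stop() itself; it starts another thread that does.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

public class StandbyHandoffSketch {

  /** Hypothetical single-threaded dispatcher; not Hadoop's dispatcher. */
  static class SketchDispatcher {
    private final BlockingQueue<Runnable> queue = new LinkedBlockingQueue<>();
    private volatile boolean stopped = false;
    private final Thread eventThread = new Thread(() -> {
      while (!stopped) {
        try {
          queue.take().run();
        } catch (InterruptedException e) {
          return; // stop() interrupted the event loop
        }
      }
    }, "SketchDispatcher");

    void start() {
      eventThread.start();
    }

    void dispatch(Runnable event) {
      queue.add(event);
    }

    /** Joins the event thread, so it must not be called from that thread. */
    void stop() throws InterruptedException {
      stopped = true;
      eventThread.interrupt();
      eventThread.join();
    }
  }

  /** Fails one "store" event and returns the name of the thread that ran stop(). */
  static String runOnce() {
    SketchDispatcher dispatcher = new SketchDispatcher();
    CountDownLatch done = new CountDownLatch(1);
    String[] stoppedBy = new String[1];

    dispatcher.start();
    dispatcher.dispatch(() -> {
      // Pretend the store write failed here. Hand the transition off
      // instead of stopping the dispatcher from its own event thread.
      new Thread(() -> {
        try {
          dispatcher.stop();
          stoppedBy[0] = Thread.currentThread().getName();
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
        } finally {
          done.countDown();
        }
      }, "StandByTransitionThread").start();
    });

    try {
      if (!done.await(5, TimeUnit.SECONDS)) {
        throw new IllegalStateException("dispatcher did not stop in time");
      }
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    }
    return stoppedBy[0];
  }

  public static void main(String[] args) {
    System.out.println("dispatcher stopped by " + runOnce());
  }
}
```

Because stop() runs on its own thread, the event handler returns normally and the dispatcher thread is free to exit when it is interrupted and joined.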



[jira] [Commented] (YARN-2579) Both RM's state is Active , but 1 RM is not really active.

2014-11-05 Thread Rohith (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14197842#comment-14197842
 ] 

Rohith commented on YARN-2579:
--

I think we should keep this class for exiting the JVM, instead of calling exit 
in different places.



[jira] [Commented] (YARN-2579) Both RM's state is Active , but 1 RM is not really active.

2014-11-05 Thread Jian He (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14197828#comment-14197828
 ] 

Jian He commented on YARN-2579:
---

From a quick scan: could we just remove the RMFatalEventDispatcher class completely?



[jira] [Commented] (YARN-2579) Both RM's state is Active , but 1 RM is not really active.

2014-11-04 Thread Rohith (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14197805#comment-14197805
 ] 

Rohith commented on YARN-2579:
--

Updated the test as per the new code. Please review the latest patch.



[jira] [Commented] (YARN-2579) Both RM's state is Active , but 1 RM is not really active.

2014-11-04 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14197804#comment-14197804
 ] 

Hadoop QA commented on YARN-2579:
-

{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  
http://issues.apache.org/jira/secure/attachment/12679478/YARN-2579-20141105.patch
  against trunk revision 73068f6.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/5734//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5734//console

This message is automatically generated.



[jira] [Commented] (YARN-2579) Both RM's state is Active , but 1 RM is not really active.

2014-11-04 Thread Rohith (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14197671#comment-14197671
 ] 

Rohith commented on YARN-2579:
--

Thanks [~jianhe] for clarifying my doubt.
I updated the patch with the above changes. Please review the patch.



[jira] [Commented] (YARN-2579) Both RM's state is Active , but 1 RM is not really active.

2014-11-04 Thread Jian He (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14197634#comment-14197634
 ] 

Jian He commented on YARN-2579:
---

bq. The main event dispatcher should be limited to handle events coming from 
active service.
EmbeddedElectorService is currently not inside the active services. 
{{EmbeddedElectorService#notifyFatalError}} currently sends an event to the main 
dispatcher, which locks the RM and does the transitionToStandBy; it could just 
do that synchronously.
I think the main point is to get rid of 
{{ResourceManager#RMFatalEventDispatcher}}. Also, RMStateStore could create a 
separate thread to do the transition.



[jira] [Commented] (YARN-2579) Both RM's state is Active , but 1 RM is not really active.

2014-11-04 Thread Rohith (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14197621#comment-14197621
 ] 

Rohith commented on YARN-2579:
--

Thanks Vinod!!
I am trying to understand the proposal.
bq. 1.The main event dispatcher should be limited to handle events coming from 
active service. That way none of those events lock the resourcemanager itself
Currently, too, the main event dispatcher is handling events from active services. 
Am I missing anything?



[jira] [Commented] (YARN-2579) Both RM's state is Active , but 1 RM is not really active.

2014-11-04 Thread Vinod Kumar Vavilapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14197409#comment-14197409
 ] 

Vinod Kumar Vavilapalli commented on YARN-2579:
---

Just got a summary of this from [~jianhe].

I think the fundamental problem is that the main event dispatcher handles 
events (RMFatalEventType) that can take a lock on the ResourceManager.

I propose the following:
 # The main event dispatcher should be limited to handling events coming from 
active services. That way none of those events lock the ResourceManager itself.
 # The state store and the embedded elector should NOT use the dispatcher to 
transition the RM (because the dispatcher itself is an active service).
## The embedded elector can always transition RM state synchronously.
## The state store can spawn a separate thread to transition RM state. We can 
take a shortcut by transitioning RM state inside the StateStore's dispatcher 
itself, but eventually that event will try to close the StateStore, so we 
should avoid this.
 # It doesn't make sense for the StateStore to send out a fatal event and then 
proceed with more state-store writes. Once the StateStore sees a fatal event, 
it should go into an RMStateStoreState.SHUTDOWN state and stop processing any 
more events.

We can do (3) in a separate patch to reduce scope.
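
To illustrate item 2 of the proposal, here is a minimal sketch (not the 
committed patch) of handing the RM transition to a separate thread instead of 
running it on the dispatcher thread. The class OffloadedTransitionDemo, the 
rmLock monitor, and the thread names are hypothetical stand-ins for the 
ResourceManager and its dispatcher; only the locking pattern is the point.

```java
/**
 * Hypothetical stand-in for the RM: the fatal-event handler hands the
 * standby transition to its own thread instead of running it on the
 * dispatcher thread, so stopping the dispatcher never waits on the RM lock.
 */
final class OffloadedTransitionDemo {
    private final Object rmLock = new Object(); // stands in for the RM monitor
    private volatile boolean standby;

    /** Needs the RM lock, like a synchronized transitionToStandby(). */
    void transitionToStandby() {
        synchronized (rmLock) {
            standby = true;
        }
    }

    /** Runs on the dispatcher thread and returns right after spawning the worker. */
    Thread onFatalEvent() {
        Thread worker = new Thread(this::transitionToStandby, "rm-standby-transition");
        worker.setDaemon(true);
        worker.start();
        return worker;
    }

    /** Returns true if the dispatcher could be joined under the lock and the RM reached standby. */
    static boolean run() {
        OffloadedTransitionDemo rm = new OffloadedTransitionDemo();
        Thread[] worker = new Thread[1];
        Thread dispatcher = new Thread(() -> worker[0] = rm.onFatalEvent(), "dispatcher");
        dispatcher.setDaemon(true);
        try {
            synchronized (rm.rmLock) {
                dispatcher.start();
                // The dispatcher thread never asks for rmLock, so this join completes.
                dispatcher.join(2000);
                if (dispatcher.isAlive()) {
                    return false;
                }
            }
            // Lock released: the worker can now take it and finish the transition.
            worker[0].join(2000);
            return rm.standby;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("clean stop and standby reached: " + run());
    }
}
```

The trade-off in this sketch is that the transition becomes asynchronous: the 
caller that stops the dispatcher no longer knows when the worker thread has 
finished, so real code would need to track or bound that thread.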

> Both RM's state is Active , but 1 RM is not really active.
> --
>
> Key: YARN-2579
> URL: https://issues.apache.org/jira/browse/YARN-2579
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.5.1
>Reporter: Rohith
>Assignee: Rohith
>Priority: Blocker
> Attachments: YARN-2579.patch, YARN-2579.patch
>
>
> I encountered a situation where both RMs' web pages were accessible and 
> each displayed its state as Active, but one RM's ActiveServices were 
> stopped.





[jira] [Commented] (YARN-2579) Both RM's state is Active , but 1 RM is not really active.

2014-10-29 Thread Rohith (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14189583#comment-14189583
 ] 

Rohith commented on YARN-2579:
--

Thanks, Karthik!
bq. (Service)Dispatcher.stop() wait for draining out RMFatalEventDispatcher 
event
I meant that {{rmDispatcher.stop()}} waits in {{eventHandlerThread.join}} for 
the event that is already being handled, i.e. the RMFatalEvent, to finish.

bq. {{dispatch(event)}} in AsyncDispatcher#createThread doesn't have a 
try-catch block 
The {{dispatch(event)}} method catches Throwable and exits the JVM. But I see 
that if handlers are not registered, we do need a try-catch block. Did you mean 
this scenario?

bq. {{eventHandlerThread.join}} in serviceStop should take a timeout as well
+1 for this approach too; it also fixes the hang. With the attached patch, the 
RM likewise no longer hangs in this deadlock-like state.

bq. With the current patch, I wonder if there are any unexpected side-effects
I have verified many switchover scenarios, as I mentioned in my previous 
comment, and more on a real cluster deployment. It works fine with 
work-preserving restart too.

> Both RM's state is Active , but 1 RM is not really active.
> --
>
> Key: YARN-2579
> URL: https://issues.apache.org/jira/browse/YARN-2579
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.5.1
>Reporter: Rohith
>Assignee: Rohith
>Priority: Blocker
> Attachments: YARN-2579.patch, YARN-2579.patch
>
>
> I encountered a situation where both RMs' web pages were accessible and 
> each displayed its state as Active, but one RM's ActiveServices were 
> stopped.





[jira] [Commented] (YARN-2579) Both RM's state is Active , but 1 RM is not really active.

2014-10-29 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14189195#comment-14189195
 ] 

Karthik Kambatla commented on YARN-2579:


Thanks, [~rohithsharma]. Looking at the tests and your explanation, I think I 
see what you are saying. 

However, looking into the code, I am not convinced it is draining out that is 
causing this issue. {{rmDispatcher}} is an {{AsyncDispatcher}}, with 
{{drainEventsOnStop}} always false. So, {{rmDispatcher.stop()}} shouldn't lead 
to any draining of events. I noticed a couple of other issues in the 
AsyncDispatcher code:
# {{eventHandlerThread.join}} in serviceStop should take a timeout as well
# {{dispatch(event)}} in AsyncDispatcher#createThread doesn't have a try-catch 
block 

With the current patch, I wonder if there are any unexpected side-effects. 
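
The two AsyncDispatcher points above can be sketched as follows. This is a 
hypothetical, much-simplified dispatcher (MiniDispatcher is not Hadoop's 
AsyncDispatcher, and its method names are made up for illustration). It shows 
a bounded join in stop() and a try-catch around the handler call; logging and 
continuing after a handler failure is a choice made for this sketch only.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;

/** Hypothetical, much-simplified dispatcher; not Hadoop's AsyncDispatcher. */
final class MiniDispatcher {
    private final BlockingQueue<Runnable> queue = new LinkedBlockingQueue<>();
    private volatile boolean stopped;
    private final Thread eventHandlerThread = new Thread(this::eventLoop, "mini-dispatcher");

    private void eventLoop() {
        while (!stopped) {
            try {
                Runnable event = queue.take();
                try {
                    event.run();
                } catch (Throwable t) {
                    // A failing handler is reported instead of killing the loop.
                    System.err.println("handler failed: " + t);
                }
            } catch (InterruptedException e) {
                return; // interrupted while idle: shut down
            }
        }
    }

    void start() {
        eventHandlerThread.setDaemon(true);
        eventHandlerThread.start();
    }

    void post(Runnable event) {
        queue.add(event);
    }

    /**
     * Stops the loop. Returns true only if the handler thread exited within
     * timeoutMs, which must be positive (join(0) would wait forever).
     */
    boolean stop(long timeoutMs) {
        stopped = true;
        eventHandlerThread.interrupt();
        try {
            eventHandlerThread.join(timeoutMs); // bounded wait instead of join()
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return !eventHandlerThread.isAlive();
    }

    /** Demo: the caller holds a lock that the in-flight event needs, so stop() times out. */
    static boolean stopsInTimeWhenHandlerIsBlocked() {
        Object lock = new Object();
        CountDownLatch entered = new CountDownLatch(1);
        MiniDispatcher d = new MiniDispatcher();
        d.start();
        synchronized (lock) {
            d.post(() -> {
                entered.countDown();
                synchronized (lock) {
                    // needs the caller's lock, like a handler calling into the RM
                }
            });
            try {
                entered.await();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            return d.stop(200);
        }
    }
}
```

With a bounded join the caller gets control back and can release its lock, at 
the cost of having to decide what to do about a handler thread that is still 
running.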

> Both RM's state is Active , but 1 RM is not really active.
> --
>
> Key: YARN-2579
> URL: https://issues.apache.org/jira/browse/YARN-2579
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.5.1
>Reporter: Rohith
>Assignee: Rohith
> Attachments: YARN-2579.patch, YARN-2579.patch
>
>
> I encountered a situation where both RMs' web pages were accessible and 
> each displayed its state as Active, but one RM's ActiveServices were 
> stopped.





[jira] [Commented] (YARN-2579) Both RM's state is Active , but 1 RM is not really active.

2014-10-20 Thread Rohith (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14177886#comment-14177886
 ] 

Rohith commented on YARN-2579:
--

bq. Under what conditions, can resetDispatcher be called by two threads 
simultaneously? 
resetDispatcher is called only once, inside a synchronized block 
(transitionToStandby or transitionToActive).

The problem here is:
*Thread-1:* If an RMFatalEvent is raised just before the active services are 
stopped in transitionToStandby(), RMFatalEventDispatcher waits to obtain the 
lock for transitionToStandby(). RMFatalEventDispatcher is BLOCKED on 
transitionToStandby().
*Thread-2:* From the elector, transitionToStandby() stops the dispatcher in the 
resetDispatcher() method. (Service)Dispatcher.stop() waits for the 
RMFatalEventDispatcher event to finish, so this thread is WAITING for the 
"AsyncDispatcher event handler" thread to end.
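
The interaction between the two threads can be reproduced outside the RM with 
a minimal sketch. Everything here is a hypothetical stand-in: rmLock plays the 
ResourceManager monitor, "event-handler" plays the AsyncDispatcher thread, and 
the calling thread plays the one running transitionToStandby(). A bounded join 
is used only so that the demo terminates; with a plain join() the caller would 
wait forever.

```java
import java.util.concurrent.CountDownLatch;

/** Reproduces the pattern: a lock holder joins a thread that needs the same lock. */
final class JoinUnderLockDemo {

    /** Returns true if the handler thread was still stuck when the bounded join gave up. */
    static boolean handlerStuckWhileStopping(long joinTimeoutMs) {
        final Object rmLock = new Object();
        final CountDownLatch lockHeld = new CountDownLatch(1);

        Thread handler = new Thread(() -> {
            try {
                lockHeld.await(); // start only once the stopper owns rmLock
            } catch (InterruptedException e) {
                return;
            }
            synchronized (rmLock) {
                // BLOCKED until here; the fatal-event handler would transition to standby
            }
        }, "event-handler");
        handler.setDaemon(true);
        handler.start();

        try {
            boolean stuck;
            synchronized (rmLock) {            // like the synchronized transitionToStandby()
                lockHeld.countDown();
                handler.join(joinTimeoutMs);   // like the join in the dispatcher's stop
                stuck = handler.isAlive();
            }
            handler.join(); // lock released, so the handler can now finish
            return stuck;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("handler stuck during stop: " + handlerStuckWhileStopping(200));
    }
}
```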


> Both RM's state is Active , but 1 RM is not really active.
> --
>
> Key: YARN-2579
> URL: https://issues.apache.org/jira/browse/YARN-2579
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.5.1
>Reporter: Rohith
>Assignee: Rohith
> Attachments: YARN-2579.patch, YARN-2579.patch
>
>
> I encountered a situation where both RMs' web pages were accessible and 
> each displayed its state as Active, but one RM's ActiveServices were 
> stopped.





[jira] [Commented] (YARN-2579) Both RM's state is Active , but 1 RM is not really active.

2014-10-20 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14177003#comment-14177003
 ] 

Karthik Kambatla commented on YARN-2579:


[~rohithsharma] - can you help me understand the issue here better?

{{resetDispatcher}} is called in either transitionToStandby or 
transitionToActive, both of which are synchronized methods. Under what 
conditions can {{resetDispatcher}} be called by two threads simultaneously? 

> Both RM's state is Active , but 1 RM is not really active.
> --
>
> Key: YARN-2579
> URL: https://issues.apache.org/jira/browse/YARN-2579
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.5.1
>Reporter: Rohith
>Assignee: Rohith
> Attachments: YARN-2579.patch, YARN-2579.patch
>
>
> I encountered a situation where both RMs' web pages were accessible and 
> each displayed its state as Active, but one RM's ActiveServices were 
> stopped.





[jira] [Commented] (YARN-2579) Both RM's state is Active , but 1 RM is not really active.

2014-10-15 Thread Rohith (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14173273#comment-14173273
 ] 

Rohith commented on YARN-2579:
--

Hi [~vinodkv], [~kasha], [~jianhe] 
Can the fix for this issue go into release 2.6.0, please? If the problem 
occurs, one of the RMs ends up hung. It is a kind of deadlock between two 
threads that jstack does not report as a deadlock.
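
A likely reason the hang is not reported: JVM deadlock detection looks for 
cycles of threads waiting to acquire locks, while a thread inside 
Thread.join() is in Object.wait() and is not waiting to acquire anything. The 
sketch below checks this with ThreadMXBean.findMonitorDeadlockedThreads(); 
that this mirrors what jstack does is an assumption, and the class, lock, and 
thread names are hypothetical.

```java
import java.lang.management.ManagementFactory;
import java.util.concurrent.CountDownLatch;

/** Builds the join-under-lock hang and asks the JVM whether it sees a deadlock. */
final class UndetectedHangDemo {

    /** Returns true if the monitor-deadlock detector reports any thread for this hang. */
    static boolean detectorReportsHang() {
        final Object rmLock = new Object();
        final CountDownLatch lockHeld = new CountDownLatch(1);

        Thread handler = new Thread(() -> {
            try {
                lockHeld.await();
            } catch (InterruptedException e) {
                return;
            }
            synchronized (rmLock) {
                // BLOCKED here while the stopper owns rmLock
            }
        }, "event-handler");

        Thread stopper = new Thread(() -> {
            synchronized (rmLock) {
                lockHeld.countDown();
                try {
                    handler.join(); // WAITING forever unless interrupted
                } catch (InterruptedException e) {
                    // interrupted by the cleanup below; fall through and release rmLock
                }
            }
        }, "stopper");

        handler.setDaemon(true);
        stopper.setDaemon(true);
        handler.start();
        stopper.start();

        try {
            // Wait (up to 5 s) until both threads are in the hung states.
            long deadline = System.nanoTime() + 5_000_000_000L;
            while (handler.getState() != Thread.State.BLOCKED
                    || stopper.getState() != Thread.State.WAITING) {
                if (System.nanoTime() - deadline > 0) {
                    throw new IllegalStateException("threads never reached the hung state");
                }
                Thread.sleep(10);
            }
            long[] ids = ManagementFactory.getThreadMXBean().findMonitorDeadlockedThreads();
            return ids != null; // null means no monitor deadlock was found
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } finally {
            stopper.interrupt(); // break the hang so both demo threads can exit
        }
    }

    public static void main(String[] args) {
        System.out.println("detector reports the hang: " + detectorReportsHang());
    }
}
```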

> Both RM's state is Active , but 1 RM is not really active.
> --
>
> Key: YARN-2579
> URL: https://issues.apache.org/jira/browse/YARN-2579
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.5.1
>Reporter: Rohith
>Assignee: Rohith
> Attachments: YARN-2579.patch, YARN-2579.patch
>
>
> I encountered a situation where both RMs' web pages were accessible and 
> each displayed its state as Active, but one RM's ActiveServices were 
> stopped.





[jira] [Commented] (YARN-2579) Both RM's state is Active , but 1 RM is not really active.

2014-10-15 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14172442#comment-14172442
 ] 

Hadoop QA commented on YARN-2579:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12674998/YARN-2579.patch
  against trunk revision 128ace1.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

  {color:red}-1 javac{color}.  The applied patch generated 1267 javac 
compiler warnings (more than the trunk's current 1266 warnings).

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 core tests{color}.  The patch failed these unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager:

  
org.apache.hadoop.yarn.server.resourcemanager.applicationsmanager.TestAMRestart

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/5406//testReport/
Javac warnings: 
https://builds.apache.org/job/PreCommit-YARN-Build/5406//artifact/patchprocess/diffJavacWarnings.txt
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5406//console

This message is automatically generated.

> Both RM's state is Active , but 1 RM is not really active.
> --
>
> Key: YARN-2579
> URL: https://issues.apache.org/jira/browse/YARN-2579
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.5.1
>Reporter: Rohith
>Assignee: Rohith
> Attachments: YARN-2579.patch, YARN-2579.patch
>
>
> I encountered a situation where both RMs' web pages were accessible and 
> each displayed its state as Active, but one RM's ActiveServices were 
> stopped.





[jira] [Commented] (YARN-2579) Both RM's state is Active , but 1 RM is not really active.

2014-10-15 Thread Rohith (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14172369#comment-14172369
 ] 

Rohith commented on YARN-2579:
--

I updated the patch with a test that simulates how transitionToStandby causes 
the RM to hang under a specific flow of events from ZK.
Please review the patch.

> Both RM's state is Active , but 1 RM is not really active.
> --
>
> Key: YARN-2579
> URL: https://issues.apache.org/jira/browse/YARN-2579
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.5.1
>Reporter: Rohith
>Assignee: Rohith
> Attachments: YARN-2579.patch, YARN-2579.patch
>
>
> I encountered a situation where both RMs' web pages were accessible and 
> each displayed its state as Active, but one RM's ActiveServices were 
> stopped.





[jira] [Commented] (YARN-2579) Both RM's state is Active , but 1 RM is not really active.

2014-10-07 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14163057#comment-14163057
 ] 

Hadoop QA commented on YARN-2579:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12673338/YARN-2579.patch
  against trunk revision 1efd9c9.

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:red}-1 tests included{color}.  The patch doesn't appear to include 
any new or modified tests.
Please justify why no new tests are needed for this 
patch.
Also please list what manual steps were performed to 
verify this patch.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 2.0.3) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/5320//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5320//console

This message is automatically generated.

> Both RM's state is Active , but 1 RM is not really active.
> --
>
> Key: YARN-2579
> URL: https://issues.apache.org/jira/browse/YARN-2579
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.5.1
>Reporter: Rohith
>Assignee: Rohith
> Attachments: YARN-2579.patch
>
>
> I encountered a situation where both RMs' web pages were accessible and 
> each displayed its state as Active, but one RM's ActiveServices were 
> stopped.





[jira] [Commented] (YARN-2579) Both RM's state is Active , but 1 RM is not really active.

2014-10-07 Thread Rohith (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14161781#comment-14161781
 ] 

Rohith commented on YARN-2579:
--

I verified the following scenarios manually, using Eclipse debug breakpoints:
1. transitionToStandby is called from the admin service and obtains the RM 
lock while, at the same time, RMFatalEventDispatcher waits for the RM lock to 
call transitionToStandby (the scenario in this issue).
2. transitionToStandby is called from RMFatalEventDispatcher and obtains the RM 
lock while, at the same time, the admin service waits for the RM lock to call 
transitionToStandby.

Please review the patch.

> Both RM's state is Active , but 1 RM is not really active.
> --
>
> Key: YARN-2579
> URL: https://issues.apache.org/jira/browse/YARN-2579
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.5.1
>Reporter: Rohith
>Assignee: Rohith
> Attachments: YARN-2579.patch
>
>
> I encountered a situation where both RMs' web pages were accessible and 
> each displayed its state as Active, but one RM's ActiveServices were 
> stopped.





[jira] [Commented] (YARN-2579) Both RM's state is Active , but 1 RM is not really active.

2014-10-07 Thread Rohith (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14161774#comment-14161774
 ] 

Rohith commented on YARN-2579:
--

Considering the first approach feasible, I have attached a patch. I am still 
thinking about how to write tests for it.

> Both RM's state is Active , but 1 RM is not really active.
> --
>
> Key: YARN-2579
> URL: https://issues.apache.org/jira/browse/YARN-2579
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.5.1
>Reporter: Rohith
>Assignee: Rohith
> Attachments: YARN-2579.patch
>
>
> I encountered a situation where both RMs' web pages were accessible and 
> each displayed its state as Active, but one RM's ActiveServices were 
> stopped.





[jira] [Commented] (YARN-2579) Both RM's state is Active , but 1 RM is not really active.

2014-09-23 Thread Rohith (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14144550#comment-14144550
 ] 

Rohith commented on YARN-2579:
--

To fix this, the approaches I can think of are:
1. Call ((Service) rmDispatcher).stop(); in a separate thread, so that the 
current lock held by transitionToStandby() is released and 
RMFatalEventDispatcher can take the lock. By that time, the RM is already in 
standby state.

2. Instead of creating a new async dispatcher on reset, maintain a single 
dispatcher for the lifetime of the JVM. There should be a mechanism for 
clearing queued events in the dispatcher, so that it does not process them.

3. Use a separate dispatcher thread for all RMStateStore events.

Please share your opinions on these fixes.
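
Approach 1 can be sketched in isolation as follows. This is only an 
illustration of the locking pattern, not the attached patch: rmLock, 
"event-handler", and "dispatcher-stopper" are hypothetical stand-ins for the 
ResourceManager monitor, the dispatcher's event thread, and the thread that 
would call stop() on the dispatcher.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicBoolean;

/** Hands the blocking stop to another thread so the lock holder can release its lock. */
final class AsyncStopDemo {

    /** Returns true if the handler ran and the background stop completed. */
    static boolean run() {
        final Object rmLock = new Object();
        final CountDownLatch lockHeld = new CountDownLatch(1);
        final AtomicBoolean handled = new AtomicBoolean();

        Thread handler = new Thread(() -> {
            try {
                lockHeld.await();
            } catch (InterruptedException e) {
                return;
            }
            synchronized (rmLock) {
                handled.set(true); // the fatal-event handler gets the lock eventually
            }
        }, "event-handler");
        handler.setDaemon(true);
        handler.start();

        // Stands in for stopping the dispatcher: waits for the handler to end.
        Thread stopper = new Thread(() -> {
            try {
                handler.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "dispatcher-stopper");
        stopper.setDaemon(true);

        try {
            synchronized (rmLock) {
                lockHeld.countDown();
                stopper.start(); // hand off the stop; do not join while holding rmLock
            }
            stopper.join(2000);
            return handled.get() && !stopper.isAlive();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("stop completed without hanging: " + run());
    }
}
```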

> Both RM's state is Active , but 1 RM is not really active.
> --
>
> Key: YARN-2579
> URL: https://issues.apache.org/jira/browse/YARN-2579
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.5.1
>Reporter: Rohith
>Assignee: Rohith
>
> I encountered a situation where both RMs' web pages were accessible and 
> each displayed its state as Active, but one RM's ActiveServices were 
> stopped.





[jira] [Commented] (YARN-2579) Both RM's state is Active , but 1 RM is not really active.

2014-09-22 Thread Rohith (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14143200#comment-14143200
 ] 

Rohith commented on YARN-2579:
--

This scenario can occur if two threads try to enter 
ResourceManager#transitionToStandby(): first 
AdminService#transitionToStandby() and then RMFatalEventDispatcher, which also 
calls transitionToStandby(). I simulated this using debug breakpoints.
The main problem is that resetting the dispatcher stops it. If AdminService is 
stopping the dispatcher while the dispatcher thread is blocked waiting to 
acquire the lock on ResourceManager, the ResourceManager never transitions to 
standby. It waits forever.

{code}
"AsyncDispatcher event handler" daemon prio=10 tid=0x007ea000 nid=0x39d1 waiting for monitor entry [0x7fe0a77f6000]
   java.lang.Thread.State: BLOCKED (on object monitor)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToStandby(ResourceManager.java:976)
    - waiting to lock <0xc1f7d438> (a org.apache.hadoop.yarn.server.resourcemanager.ResourceManager)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMFatalEventDispatcher.handle(ResourceManager.java:701)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMFatalEventDispatcher.handle(ResourceManager.java:678)
    at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173)
    at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106)
    at java.lang.Thread.run(Thread.java:745)

"IPC Server handler 0 on 45021" daemon prio=10 tid=0x7fe0a9026800 nid=0x30ab in Object.wait() [0x7fe0a7cfa000]
   java.lang.Thread.State: WAITING (on object monitor)
    at java.lang.Object.wait(Native Method)
    - waiting on <0xeb3310e8> (a java.lang.Thread)
    at java.lang.Thread.join(Thread.java:1281)
    - locked <0xeb3310e8> (a java.lang.Thread)
    at java.lang.Thread.join(Thread.java:1355)
    at org.apache.hadoop.yarn.event.AsyncDispatcher.serviceStop(AsyncDispatcher.java:150)
    at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221)
    - locked <0xeb32fef8> (a java.lang.Object)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.resetDispatcher(ResourceManager.java:1166)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToStandby(ResourceManager.java:987)
    - locked <0xc1f7d438> (a org.apache.hadoop.yarn.server.resourcemanager.ResourceManager)
    at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToStandby(AdminService.java:308)
    - locked <0xc2038d10> (a org.apache.hadoop.yarn.server.resourcemanager.AdminService)
    at org.apache.hadoop.ha.protocolPB.HAServiceProtocolServerSideTranslatorPB.transitionToStandby(HAServiceProtocolServerSideTranslatorPB.java:119)
    at org.apache.hadoop.ha.proto.HAServiceProtocolProtos$HAServiceProtocolService$2.callBlockingMethod(HAServiceProtocolProtos.java:4462)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1557)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
{code}


> Both RM's state is Active , but 1 RM is not really active.
> --
>
> Key: YARN-2579
> URL: https://issues.apache.org/jira/browse/YARN-2579
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.5.1
>Reporter: Rohith
>
> I encountered a situation where both RMs' web pages were accessible and 
> each displayed its state as Active, but one RM's ActiveServices were 
> stopped.


