[jira] [Commented] (YARN-2579) Both RM's state is Active , but 1 RM is not really active.
[ https://issues.apache.org/jira/browse/YARN-2579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14199489#comment-14199489 ]

Jian He commented on YARN-2579:

[~kasha], thanks for your review. I'm committing this.

> Both RM's state is Active , but 1 RM is not really active.
> ----------------------------------------------------------
>
>                 Key: YARN-2579
>                 URL: https://issues.apache.org/jira/browse/YARN-2579
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 2.5.1
>            Reporter: Rohith
>            Assignee: Rohith
>            Priority: Blocker
>         Attachments: YARN-2579-20141105.1.patch, YARN-2579-20141105.2.patch, YARN-2579-20141105.3.patch, YARN-2579-20141105.patch, YARN-2579.patch, YARN-2579.patch
>
> I encountered a situation where both RMs' web pages were accessible and each displayed its state as Active, but one of the RMs' ActiveServices were stopped.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2579) Both RM's state is Active , but 1 RM is not really active.
[ https://issues.apache.org/jira/browse/YARN-2579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14199475#comment-14199475 ]

Karthik Kambatla commented on YARN-2579:

Glad to see we are getting rid of the RMFatalEventDispatcher. I am assuming we want to keep the changes to a minimum in this patch, and do a follow-up JIRA to clean this up better. I would love to work on the follow-up; I noticed a few discrepancies while working on YARN-2010 that continue to exist with this patch as well.

Functionally, the patch looks good to me. In the interest of unblocking 2.6, I am +1 to committing it as well, but would like to point out some follow-up work that I see. Filed YARN-2814 to work on these items.

I see the following follow-up items to simplify the surrounding code and improve readability, if we do commit the existing patch.
# Get rid of RMFatalEventDispatcher and RMFatalEvent* altogether.
# Given all other events are specific to RMActiveServices, we should move the dispatcher also into RMActiveServices.
# I am not a fan of having a pointer to the RM in the store as well, particularly since we have RMContext primarily to hold the information other classes need. I am concerned about more classes needing this information in the future.
# Add a shutdownOrTransitionToStandby method in the RM to transparently handle non-HA and HA cases.
# Unrelated to this patch: we should make the existing {{transitionToStandby(boolean)}} private, and add a package-private {{transitionToStandby()}} to be called from AdminService and EmbeddedElectorService.
# Instead of calling ExitUtil#terminate at multiple places in the RM, we should have a {{protected shutdown()}} method that does this and can be overridden in MockRM for better testing.
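Follow-up items 4 and 6 above can be illustrated with a small sketch. Everything in it is a hypothetical stand-in: {{ManagerSketch}} and {{MockManagerSketch}} are not the real ResourceManager or MockRM, and {{System.exit}} stands in for ExitUtil#terminate. It only shows the shape of the idea: one overridable exit point, and one method that picks between standby and shutdown depending on whether HA is enabled.

```java
// Hypothetical stand-ins for illustration; not the real ResourceManager or MockRM.
class ManagerSketch {
    private final boolean haEnabled;
    boolean standby = false;

    ManagerSketch(boolean haEnabled) {
        this.haEnabled = haEnabled;
    }

    /** Single exit point. Assumption: production code would terminate the JVM here. */
    protected void shutdown(int status) {
        System.exit(status);
    }

    void transitionToStandby() {
        standby = true;
    }

    /** With HA, step down to standby; without HA, exit through the single exit point. */
    void shutdownOrTransitionToStandby() {
        if (haEnabled) {
            transitionToStandby();
        } else {
            shutdown(1);
        }
    }
}

/** Test double that records the exit status instead of killing the JVM. */
class MockManagerSketch extends ManagerSketch {
    int recordedStatus = -1;

    MockManagerSketch(boolean haEnabled) {
        super(haEnabled);
    }

    @Override
    protected void shutdown(int status) {
        recordedStatus = status;
    }
}
```

Because the mock overrides {{shutdown}}, a test can drive the fatal-error path in both modes and inspect {{recordedStatus}} and {{standby}} without the test JVM exiting.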
[jira] [Commented] (YARN-2579) Both RM's state is Active , but 1 RM is not really active.
[ https://issues.apache.org/jira/browse/YARN-2579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14199414#comment-14199414 ]

Karthik Kambatla commented on YARN-2579:

Was caught up all day. Looking now.
[jira] [Commented] (YARN-2579) Both RM's state is Active , but 1 RM is not really active.
[ https://issues.apache.org/jira/browse/YARN-2579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14199408#comment-14199408 ]

Vinod Kumar Vavilapalli commented on YARN-2579:

[~jianhe] / [~kasha] / [~rohithsharma], can we get this in now?
[jira] [Commented] (YARN-2579) Both RM's state is Active , but 1 RM is not really active.
[ https://issues.apache.org/jira/browse/YARN-2579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14199187#comment-14199187 ]

Jian He commented on YARN-2579:

+1. [~kasha], wanna take a look?
[jira] [Commented] (YARN-2579) Both RM's state is Active , but 1 RM is not really active.
[ https://issues.apache.org/jira/browse/YARN-2579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14198960#comment-14198960 ]

Hadoop QA commented on YARN-2579:

{color:green}+1 overall{color}. Here are the results of testing the latest attachment
  http://issues.apache.org/jira/secure/attachment/12679587/YARN-2579-20141105.3.patch
  against trunk revision 1831280.

    {color:green}+1 @author{color}. The patch does not contain any @author tags.
    {color:green}+1 tests included{color}. The patch appears to include 2 new or modified test files.
    {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.
    {color:green}+1 javadoc{color}. There were no new javadoc warning messages.
    {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse.
    {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings.
    {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.
    {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.
    {color:green}+1 contrib tests{color}. The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5740//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5740//console

This message is automatically generated.
[jira] [Commented] (YARN-2579) Both RM's state is Active , but 1 RM is not really active.
[ https://issues.apache.org/jira/browse/YARN-2579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14198910#comment-14198910 ]

Rohith commented on YARN-2579:

I checked the test failures; they are not related to this patch. The tests pass when I run them on my machine!
[jira] [Commented] (YARN-2579) Both RM's state is Active , but 1 RM is not really active.
[ https://issues.apache.org/jira/browse/YARN-2579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14198886#comment-14198886 ]

Hadoop QA commented on YARN-2579:

{color:red}-1 overall{color}. Here are the results of testing the latest attachment
  http://issues.apache.org/jira/secure/attachment/12679578/YARN-2579-20141105.2.patch
  against trunk revision 6e8722e.

    {color:green}+1 @author{color}. The patch does not contain any @author tags.
    {color:green}+1 tests included{color}. The patch appears to include 2 new or modified test files.
    {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.
    {color:green}+1 javadoc{color}. There were no new javadoc warning messages.
    {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse.
    {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings.
    {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.
    {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager:

                  org.apache.hadoop.yarn.server.resourcemanager.TestRM
                  org.apache.hadoop.yarn.server.resourcemanager.TestAppManager

    {color:green}+1 contrib tests{color}. The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5739//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5739//console

This message is automatically generated.
[jira] [Commented] (YARN-2579) Both RM's state is Active , but 1 RM is not really active.
[ https://issues.apache.org/jira/browse/YARN-2579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14198806#comment-14198806 ]

Rohith commented on YARN-2579:

I got the refactored change. :-)
[jira] [Commented] (YARN-2579) Both RM's state is Active , but 1 RM is not really active.
[ https://issues.apache.org/jira/browse/YARN-2579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14198789#comment-14198789 ]

Jian He commented on YARN-2579:

Thanks Rohith, looks good overall. I went ahead and did some very minor refactoring in the {{RMStateStore#notifyStoreOperationFailed}} method.
[jira] [Commented] (YARN-2579) Both RM's state is Active , but 1 RM is not really active.
[ https://issues.apache.org/jira/browse/YARN-2579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14198729#comment-14198729 ]

Arun C Murthy commented on YARN-2579:

Can we please get this in today? Tx.
[jira] [Commented] (YARN-2579) Both RM's state is Active , but 1 RM is not really active.
[ https://issues.apache.org/jira/browse/YARN-2579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14198459#comment-14198459 ]

Karthik Kambatla commented on YARN-2579:

I would like to take a look at the final patch before it gets committed.
[jira] [Commented] (YARN-2579) Both RM's state is Active , but 1 RM is not really active.
[ https://issues.apache.org/jira/browse/YARN-2579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14197952#comment-14197952 ]

Rohith commented on YARN-2579:

bq. We should avoid these dispatcher events trying to close the dispatchers - that was why I suggested a separate thread (my point 2.2 in the proposal).

I missed this in YARN-2579-20141105.patch. I later updated the patch to use a separate thread, i.e. YARN-2579-20141105.1.patch.
{code}
-    type = RMFatalEventType.STATE_STORE_FENCED;
+    Thread standByTransitionThread =
+        new Thread(new StandByTransitionThread());
+    standByTransitionThread.setName("StandByTransitionThread Handler");
+    standByTransitionThread.start();
{code}
[jira] [Commented] (YARN-2579) Both RM's state is Active , but 1 RM is not really active.
[ https://issues.apache.org/jira/browse/YARN-2579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14197945#comment-14197945 ]

Hadoop QA commented on YARN-2579:

{color:green}+1 overall{color}. Here are the results of testing the latest attachment
  http://issues.apache.org/jira/secure/attachment/12679489/YARN-2579-20141105.1.patch
  against trunk revision 73068f6.

    {color:green}+1 @author{color}. The patch does not contain any @author tags.
    {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files.
    {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.
    {color:green}+1 javadoc{color}. There were no new javadoc warning messages.
    {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse.
    {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings.
    {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.
    {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.
    {color:green}+1 contrib tests{color}. The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5736//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5736//console

This message is automatically generated.
[jira] [Commented] (YARN-2579) Both RM's state is Active , but 1 RM is not really active.
[ https://issues.apache.org/jira/browse/YARN-2579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14197890#comment-14197890 ]

Vinod Kumar Vavilapalli commented on YARN-2579:

Tx for working on this [~rohithsharma]!

The abstractions are all broken in this part of the code-base, but it's not your fault. Given this is a blocker, your approach to minimize the changes is good!

One comment: This still is invoking transitionToStandby in the RMStateStore's dispatcher. So what we will see is the following
{code}
RMStateStoreDispatcher.handle()
 -> store fails in the event, generates a notifyStoreOperationFailed
 -> invokes resourceManager.handleTransitionToStandBy
 -> calls transitionToStandby(boolean)
 -> activeServices.stop()
 -> stateStore.close()
 -> RMStateStoreDispatcher.stop()
{code}

We should avoid these dispatcher events trying to close the dispatchers - that was why I suggested a separate thread (my point 2.2 in the proposal).
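The hand-off suggested in the comment above can be sketched in isolation. This is a minimal illustration under stated assumptions, not the Hadoop code: {{MiniDispatcher}} and {{StandbyHandoffDemo}} are hypothetical names, and the dispatcher here is a bare single-threaded loop rather than the real AsyncDispatcher. The point it shows is that an event handler which needs the dispatcher stopped starts a separate thread to do it, so the dispatcher thread never tries to stop (and join) itself.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

// Hypothetical miniature dispatcher for illustration; not the Hadoop AsyncDispatcher.
class MiniDispatcher {
    private final BlockingQueue<Runnable> queue = new LinkedBlockingQueue<>();
    private volatile boolean stopped = false;
    private final Thread worker = new Thread(() -> {
        while (!stopped) {
            try {
                Runnable event = queue.poll(50, TimeUnit.MILLISECONDS);
                if (event != null) {
                    event.run();
                }
            } catch (InterruptedException e) {
                return; // interrupted by stop(): leave the loop
            }
        }
    }, "mini-dispatcher");

    void start() { worker.start(); }

    void post(Runnable event) { queue.add(event); }

    boolean isAlive() { return worker.isAlive(); }

    /** Stops the loop and waits for the worker; refuses to run on the worker itself. */
    void stop() throws InterruptedException {
        if (Thread.currentThread() == worker) {
            throw new IllegalStateException("stop() called from the dispatcher thread");
        }
        stopped = true;
        worker.interrupt();
        worker.join();
    }
}

class StandbyHandoffDemo {
    /** Returns true if the dispatcher was stopped cleanly from a separate thread. */
    static boolean run() {
        MiniDispatcher dispatcher = new MiniDispatcher();
        CountDownLatch done = new CountDownLatch(1);
        dispatcher.start();
        // The "failing store" event: instead of stopping the dispatcher inline,
        // the handler hands the stop to a dedicated thread and returns at once.
        dispatcher.post(() -> {
            Thread transition = new Thread(() -> {
                try {
                    dispatcher.stop();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                } finally {
                    done.countDown();
                }
            }, "standby-transition");
            transition.start();
        });
        try {
            return done.await(5, TimeUnit.SECONDS) && !dispatcher.isAlive();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("stopped cleanly: " + run());
    }
}
```

The guard in {{stop()}} is a design choice of this sketch: it turns an accidental self-stop into an immediate, visible error instead of a silent hang or a half-finished shutdown.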
[jira] [Commented] (YARN-2579) Both RM's state is Active , but 1 RM is not really active.
[ https://issues.apache.org/jira/browse/YARN-2579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14197842#comment-14197842 ]

Rohith commented on YARN-2579:

I think we should let this class remain as the one place that exits the JVM, instead of calling exit in different places.
[jira] [Commented] (YARN-2579) Both RM's state is Active , but 1 RM is not really active.
[ https://issues.apache.org/jira/browse/YARN-2579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14197828#comment-14197828 ]

Jian He commented on YARN-2579:

From a quick scan: could we just remove the RMFatalEventDispatcher class completely?
[jira] [Commented] (YARN-2579) Both RM's state is Active , but 1 RM is not really active.
[ https://issues.apache.org/jira/browse/YARN-2579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14197805#comment-14197805 ]

Rohith commented on YARN-2579:

Updated the test as per the new code. Please review the latest patch.
[jira] [Commented] (YARN-2579) Both RM's state is Active , but 1 RM is not really active.
[ https://issues.apache.org/jira/browse/YARN-2579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14197804#comment-14197804 ]

Hadoop QA commented on YARN-2579:

{color:green}+1 overall{color}. Here are the results of testing the latest attachment
  http://issues.apache.org/jira/secure/attachment/12679478/YARN-2579-20141105.patch
  against trunk revision 73068f6.

    {color:green}+1 @author{color}. The patch does not contain any @author tags.
    {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files.
    {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.
    {color:green}+1 javadoc{color}. There were no new javadoc warning messages.
    {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse.
    {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings.
    {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.
    {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.
    {color:green}+1 contrib tests{color}. The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5734//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5734//console

This message is automatically generated.
[jira] [Commented] (YARN-2579) Both RM's state is Active , but 1 RM is not really active.
[ https://issues.apache.org/jira/browse/YARN-2579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14197671#comment-14197671 ]

Rohith commented on YARN-2579:

Thanks [~jianhe] for clarifying my doubt. I updated the patch with the above changes. Please review the patch.
[jira] [Commented] (YARN-2579) Both RM's state is Active , but 1 RM is not really active.
[ https://issues.apache.org/jira/browse/YARN-2579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14197634#comment-14197634 ]

Jian He commented on YARN-2579:

bq. The main event dispatcher should be limited to handle events coming from active service.

EmbeddedElectorService is no longer inside the active services. {{EmbeddedElectorService#notifyFatalError}} currently sends an event to the main dispatcher, which locks the RM and does the transitionToStandBy; it could just do that synchronously. I think the main point is to get rid of {{ResourceManager#RMFatalEventDispatcher}}. Also, RMStateStore could create a separate thread to do the transition.
[jira] [Commented] (YARN-2579) Both RM's state is Active , but 1 RM is not really active.
[ https://issues.apache.org/jira/browse/YARN-2579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14197621#comment-14197621 ]

Rohith commented on YARN-2579:

Thanks Vinod!! I am trying to understand the proposal.

bq. 1. The main event dispatcher should be limited to handle events coming from active service. That way none of those events lock the resourcemanager itself

Currently too, the main event dispatcher is handling events from active services. Am I missing anything?
[jira] [Commented] (YARN-2579) Both RM's state is Active , but 1 RM is not really active.
[ https://issues.apache.org/jira/browse/YARN-2579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14197409#comment-14197409 ]

Vinod Kumar Vavilapalli commented on YARN-2579:

Just got a summary of this from [~jianhe]. I think the fundamental problem is the main event dispatcher handling events (RMFatalEventType) that can take a lock on ResourceManager.

I propose the following
# The main event dispatcher should be limited to handle events coming from active service. That way none of those events lock the resourcemanager itself.
# State Store and Embedded elector DO NOT use the dispatcher to transition RM (This is because Dispatcher itself is an active service).
## Embedded elector can always synchronously transition RM state
## State store can spawn a separate thread to transition RM state. We can take a short-cut by transitioning RM state inside the StateStore's dispatcher itself, but eventually that event will try to close the StateStore - so we should avoid this.
# StateStore sending out a fatal event and then proceeding ahead to do more state-store writes doesn't make sense. Once the StateStore sees a fatal event, it should go into a RMStateStoreState.SHUTDOWN state and stop processing any more events.

We can do (3) in a separate patch to reduce scope.
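Point 3 of the proposal above can be sketched as a tiny state machine. This is a hedged illustration only: {{FencedStoreSketch}} and its members are hypothetical, the two-value enum merely mirrors the SHUTDOWN state named in the comment, and a {{failWrite}} flag stands in for a real store failure. Once a write fails, the store reports the failure once, moves to SHUTDOWN, and silently drops every later event.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical store for illustration; not the real RMStateStore.
class FencedStoreSketch {
    enum State { ACTIVE, SHUTDOWN }

    private State state = State.ACTIVE;
    private final List<String> written = new ArrayList<>();
    private int fatalNotifications = 0;

    /**
     * Handles one store event. Assumption: failWrite simulates a failed
     * write (for example, the store having been fenced by another RM).
     */
    synchronized void handle(String record, boolean failWrite) {
        if (state == State.SHUTDOWN) {
            return; // already failed: do not attempt further writes
        }
        if (failWrite) {
            state = State.SHUTDOWN;
            fatalNotifications++; // report the failure exactly once
            return;
        }
        written.add(record);
    }

    synchronized List<String> written() { return new ArrayList<>(written); }

    synchronized int fatalNotifications() { return fatalNotifications; }

    synchronized State state() { return state; }
}
```

For example, handling "a" successfully, then a failing "b", then "c" leaves only "a" written and exactly one fatal notification, which is the behaviour the proposal asks for.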
[jira] [Commented] (YARN-2579) Both RM's state is Active , but 1 RM is not really active.
[ https://issues.apache.org/jira/browse/YARN-2579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14189583#comment-14189583 ]

Rohith commented on YARN-2579:

Thanks Karthik!

bq. (Service)Dispatcher.stop() wait for draining out RMFatalEventDispatcher event

What I meant is that {{rmDispatcher.stop()}} waits in {{eventHandlerThread.join}} for the in-flight event, i.e. the RMFatalEvent, to finish.

bq. {{dispatch(event)}} in AsyncDispatcher#createThread doesn't have a try-catch block

The {{dispatch(event)}} method catches Throwable and exits the JVM. But I see that if handlers are not registered, we do need a try-catch block. Did you mean this scenario?

bq. {{eventHandlerThread.join}} in serviceStop should take a timeout as well

+1 for this approach too; it also fixes the hang. The attached patch likewise keeps the RM from hanging in this deadlock-like state.

bq. With the current patch, I wonder if there are any unexpected side-effects

I have verified many switching scenarios, as I mentioned in my previous comment, and more on a real cluster deployment. It works fine with work-preserving restart too.
[jira] [Commented] (YARN-2579) Both RM's state is Active , but 1 RM is not really active.
[ https://issues.apache.org/jira/browse/YARN-2579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14189195#comment-14189195 ]

Karthik Kambatla commented on YARN-2579:

Thanks, [~rohithsharma]. Looking at the tests and your explanation, I think I see what you are saying. However, looking into the code, I am not convinced it is draining out that is causing this issue. {{rmDispatcher}} is an {{AsyncDispatcher}}, with {{drainEventsOnStop}} always false. So, {{rmDispatcher.stop()}} shouldn't lead to any draining of events.

I noticed a couple of other issues in the AsyncDispatcher code:
# {{eventHandlerThread.join}} in serviceStop should take a timeout as well
# {{dispatch(event)}} in AsyncDispatcher#createThread doesn't have a try-catch block

With the current patch, I wonder if there are any unexpected side-effects.
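The two AsyncDispatcher points above can be sketched with a toy dispatcher. This is not the AsyncDispatcher code: {{MiniDispatcher}} and all of its members are hypothetical, and counting failures is a deliberate simplification of whatever the real dispatcher does when a handler throws. The sketch shows a bounded join in the stop path and a try-catch around each dispatched event.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical toy dispatcher; illustrates a bounded join and a guarded dispatch loop. */
public class MiniDispatcher {
    private final BlockingQueue<Runnable> queue = new LinkedBlockingQueue<>();
    private final AtomicInteger failures = new AtomicInteger();
    private volatile boolean stopped;
    private final Thread worker = new Thread(this::loop, "mini-dispatcher");

    private void loop() {
        while (!stopped) {
            try {
                Runnable event = queue.take();
                try {
                    event.run();
                } catch (Throwable t) {
                    // Assumption for the sketch: record the failure and keep going.
                    failures.incrementAndGet();
                }
            } catch (InterruptedException e) {
                return; // interrupted while idle: stop() asked us to exit
            }
        }
    }

    public void start() { worker.start(); }

    public void post(Runnable event) { queue.add(event); }

    public int failures() { return failures.get(); }

    /** Waits at most timeoutMillis (must be positive) and reports whether the worker exited. */
    public boolean stop(long timeoutMillis) {
        stopped = true;
        worker.interrupt();
        try {
            worker.join(timeoutMillis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return !worker.isAlive();
    }

    /** Posts one failing and one normal event, then stops with a timeout. */
    public static String demo() {
        MiniDispatcher d = new MiniDispatcher();
        d.start();
        CountDownLatch done = new CountDownLatch(1);
        d.post(() -> { throw new IllegalStateException("boom"); });
        d.post(done::countDown);
        try {
            done.await(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        boolean exited = d.stop(1000);
        return "failures=" + d.failures() + " exited=" + exited;
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

A caller that gets {{false}} back from {{stop}} knows the worker is still stuck and can decide what to do, instead of blocking in an unbounded join.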
[jira] [Commented] (YARN-2579) Both RM's state is Active , but 1 RM is not really active.
[ https://issues.apache.org/jira/browse/YARN-2579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14177886#comment-14177886 ]

Rohith commented on YARN-2579:

bq. Under what conditions, can resetDispatcher be called by two threads simultaneously?

resetDispatcher is called only once, inside a synchronized method (transitionToStandby or transitionToActive). The problem here is the following.

*Thread-1:* If an RMFatalEvent is raised just before the active services are stopped in transitionToStandby(), RMFatalEventDispatcher waits to obtain the lock for transitionToStandby(). RMFatalEventDispatcher is BLOCKED on transitionToStandby().

*Thread-2:* From the elector, transitionToStandby() stops the dispatcher in the resetDispatcher() method. (Service)Dispatcher.stop() waits for the RMFatalEventDispatcher event to drain out. So this thread, which holds the lock, is WAITING for the "AsyncDispatcher event handler" thread to finish, and that thread is the one blocked above.
[jira] [Commented] (YARN-2579) Both RM's state is Active , but 1 RM is not really active.
[ https://issues.apache.org/jira/browse/YARN-2579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14177003#comment-14177003 ]

Karthik Kambatla commented on YARN-2579:

[~rohithsharma] - can you help me understand the issue here better? {{resetDispatcher}} is called in either transitionToStandby or transitionToActive, both of which are synchronized methods. Under what conditions, can {{resetDispatcher}} be called by two threads simultaneously?
[jira] [Commented] (YARN-2579) Both RM's state is Active , but 1 RM is not really active.
[ https://issues.apache.org/jira/browse/YARN-2579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14173273#comment-14173273 ]

Rohith commented on YARN-2579:

Hi [~vinodkv], [~kasha], [~jianhe]

Can the fix for this issue go into release 2.6.0, please? If the problem occurs, one of the RMs hangs. It is a kind of deadlock between two threads that jstack does not report as a deadlock.
[jira] [Commented] (YARN-2579) Both RM's state is Active , but 1 RM is not really active.
[ https://issues.apache.org/jira/browse/YARN-2579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14172442#comment-14172442 ]

Hadoop QA commented on YARN-2579:

{color:red}-1 overall{color}. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12674998/YARN-2579.patch
against trunk revision 128ace1.

{color:green}+1 @author{color}. The patch does not contain any @author tags.
{color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files.
{color:red}-1 javac{color}. The applied patch generated 1267 javac compiler warnings (more than the trunk's current 1266 warnings).
{color:green}+1 javadoc{color}. There were no new javadoc warning messages.
{color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse.
{color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings.
{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.
{color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager:
org.apache.hadoop.yarn.server.resourcemanager.applicationsmanager.TestAMRestart
{color:green}+1 contrib tests{color}. The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5406//testReport/
Javac warnings: https://builds.apache.org/job/PreCommit-YARN-Build/5406//artifact/patchprocess/diffJavacWarnings.txt
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5406//console

This message is automatically generated.
[jira] [Commented] (YARN-2579) Both RM's state is Active , but 1 RM is not really active.
[ https://issues.apache.org/jira/browse/YARN-2579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14172369#comment-14172369 ]

Rohith commented on YARN-2579:

I updated the patch with a test that simulates the specific flow of events from ZK in which transitionToStandby causes the RM to hang. Please review the patch.
[jira] [Commented] (YARN-2579) Both RM's state is Active , but 1 RM is not really active.
[ https://issues.apache.org/jira/browse/YARN-2579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14163057#comment-14163057 ]

Hadoop QA commented on YARN-2579:

{color:red}-1 overall{color}. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12673338/YARN-2579.patch
against trunk revision 1efd9c9.

{color:green}+1 @author{color}. The patch does not contain any @author tags.
{color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch.
{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.
{color:green}+1 javadoc{color}. There were no new javadoc warning messages.
{color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse.
{color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings.
{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.
{color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.
{color:green}+1 contrib tests{color}. The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5320//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5320//console

This message is automatically generated.
[jira] [Commented] (YARN-2579) Both RM's state is Active , but 1 RM is not really active.
[ https://issues.apache.org/jira/browse/YARN-2579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14161781#comment-14161781 ]

Rohith commented on YARN-2579:

I verified the following cases manually, using Eclipse debug points:
1. transitionToStandby is called from the admin service, which obtains the RM lock, while RMFatalEventDispatcher waits for the RM lock to call transitionToStandby (the scenario in this issue).
2. transitionToStandby is called from RMFatalEventDispatcher, which obtains the RM lock, while the admin service waits for the RM lock to call transitionToStandby.

Please review the patch.
[jira] [Commented] (YARN-2579) Both RM's state is Active , but 1 RM is not really active.
[ https://issues.apache.org/jira/browse/YARN-2579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14161774#comment-14161774 ]

Rohith commented on YARN-2579:

Considering the first approach feasible, I have attached a patch. I am still thinking about how to write tests for it.
[jira] [Commented] (YARN-2579) Both RM's state is Active , but 1 RM is not really active.
[ https://issues.apache.org/jira/browse/YARN-2579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14144550#comment-14144550 ]

Rohith commented on YARN-2579:

The approaches I can think of for fixing this are:
1. Call {{((Service) rmDispatcher).stop();}} in a separate thread, so that the current lock held by transitionToStandby() is released and RMFatalEventDispatcher can take the lock. By that time, the RM is already in standby state.
2. Instead of resetting to a new async dispatcher, maintain a single dispatcher for the life of the JVM. There should be a mechanism for clearing queued events in the dispatcher, so that the dispatcher does not process them.
3. Use a separate dispatcher thread for all RMStateStore events.

Please share your opinion on this bug fix.
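Approach 1 can be sketched with plain threads. This is an illustration only, not the attached patch: {{AsyncStopSketch}}, {{transitionCompletes}} and the thread names are hypothetical, and {{handler.join()}} merely stands in for stopping the dispatcher. The point is that the join happens on a separate thread, so the caller can leave its synchronized block and the event handler can then take the lock and finish.

```java
/** Hypothetical sketch of approach 1: stop the dispatcher from a separate thread. */
public class AsyncStopSketch {

    /** Returns true if the stop completed within waitMillis (must be positive). */
    public static boolean transitionCompletes(long waitMillis) {
        final Object rmLock = new Object();

        // Stands in for the event handler thread that needs the RM lock to handle a fatal event.
        final Thread handler = new Thread(() -> {
            synchronized (rmLock) {
                // the handler's transition work would happen here
            }
        }, "event-handler");

        // Stands in for the dispatcher stop call: it waits for the handler thread to finish.
        final Thread stopper = new Thread(() -> {
            try {
                handler.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "dispatcher-stopper");

        synchronized (rmLock) {
            handler.start();  // the handler now blocks on rmLock
            stopper.start();  // the join runs elsewhere, not under the lock
        }                     // lock released here, so the handler can proceed

        try {
            stopper.join(waitMillis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
        return !stopper.isAlive();
    }

    public static void main(String[] args) {
        System.out.println("stop completed: " + transitionCompletes(5000));
    }
}
```

The trade-off is that the caller returns before the old dispatcher has fully stopped, so anything that must happen after the stop has to be sequenced some other way.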
[jira] [Commented] (YARN-2579) Both RM's state is Active , but 1 RM is not really active.
[ https://issues.apache.org/jira/browse/YARN-2579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14143200#comment-14143200 ]

Rohith commented on YARN-2579:

This scenario can occur if two threads try to enter ResourceManager#transitionToStandby(): first AdminService#transitionToStandby, and then RMFatalEventDispatcher#handle, which also calls transitionToStandby(). I simulated this using debug points.

The main problem is that resetting the dispatcher stops the dispatcher. If AdminService is stopping the dispatcher while the dispatcher thread is blocked trying to acquire the lock on ResourceManager, the ResourceManager never transitions to standby. It waits indefinitely.

{code}
"AsyncDispatcher event handler" daemon prio=10 tid=0x007ea000 nid=0x39d1 waiting for monitor entry [0x7fe0a77f6000]
   java.lang.Thread.State: BLOCKED (on object monitor)
	at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToStandby(ResourceManager.java:976)
	- waiting to lock <0xc1f7d438> (a org.apache.hadoop.yarn.server.resourcemanager.ResourceManager)
	at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMFatalEventDispatcher.handle(ResourceManager.java:701)
	at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMFatalEventDispatcher.handle(ResourceManager.java:678)
	at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173)
	at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106)
	at java.lang.Thread.run(Thread.java:745)

"IPC Server handler 0 on 45021" daemon prio=10 tid=0x7fe0a9026800 nid=0x30ab in Object.wait() [0x7fe0a7cfa000]
   java.lang.Thread.State: WAITING (on object monitor)
	at java.lang.Object.wait(Native Method)
	- waiting on <0xeb3310e8> (a java.lang.Thread)
	at java.lang.Thread.join(Thread.java:1281)
	- locked <0xeb3310e8> (a java.lang.Thread)
	at java.lang.Thread.join(Thread.java:1355)
	at org.apache.hadoop.yarn.event.AsyncDispatcher.serviceStop(AsyncDispatcher.java:150)
	at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221)
	- locked <0xeb32fef8> (a java.lang.Object)
	at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.resetDispatcher(ResourceManager.java:1166)
	at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToStandby(ResourceManager.java:987)
	- locked <0xc1f7d438> (a org.apache.hadoop.yarn.server.resourcemanager.ResourceManager)
	at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToStandby(AdminService.java:308)
	- locked <0xc2038d10> (a org.apache.hadoop.yarn.server.resourcemanager.AdminService)
	at org.apache.hadoop.ha.protocolPB.HAServiceProtocolServerSideTranslatorPB.transitionToStandby(HAServiceProtocolServerSideTranslatorPB.java:119)
	at org.apache.hadoop.ha.proto.HAServiceProtocolProtos$HAServiceProtocolService$2.callBlockingMethod(HAServiceProtocolProtos.java:4462)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)
	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2013)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2009)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:415)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1557)
	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2007)
{code}
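The pattern in the two stack traces (one thread holds a monitor and joins a second thread, while the second thread is blocked on that same monitor) can be reproduced with plain Java threads. The sketch below is an assumption-laden miniature, not ResourceManager code: {{JoinUnderLockSketch}} and {{handlerStuckWhileLockHeld}} are hypothetical names, and a bounded join is used so that the demo terminates instead of hanging the way the RM does.

```java
/** Hypothetical miniature of the hang: joining a thread while holding the lock it needs. */
public class JoinUnderLockSketch {

    /**
     * Returns true if the handler thread is still alive after a bounded join
     * made while the lock is held. joinMillis must be positive; a value of 0
     * would wait forever, which is the hang itself.
     */
    public static boolean handlerStuckWhileLockHeld(long joinMillis) {
        final Object rmLock = new Object();

        // Plays the "AsyncDispatcher event handler" role: it needs the lock to make progress.
        final Thread handler = new Thread(() -> {
            synchronized (rmLock) {
                // stands in for the synchronized transitionToStandby() call
            }
        }, "event-handler");

        boolean stuck;
        try {
            synchronized (rmLock) {
                // Plays the "IPC Server handler" role: holds the lock and waits for the handler.
                handler.start();
                handler.join(joinMillis);
                stuck = handler.isAlive(); // the handler cannot finish while we hold rmLock
            }
            handler.join(); // lock released, so the handler can now finish
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        }
        return stuck;
    }

    public static void main(String[] args) {
        System.out.println("handler stuck while lock held: " + handlerStuckWhileLockHeld(200));
    }
}
```

One thread sits in a monitor wait inside {{Thread.join}} while the other is blocked on monitor entry, which is the same pair of states shown in the trace above.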