[jira] [Commented] (YARN-3893) Both RM in active state when Admin#transitionToActive failure from refeshAll()
[ https://issues.apache.org/jira/browse/YARN-3893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15086403#comment-15086403 ]

Hudson commented on YARN-3893:
------------------------------

FAILURE: Integrated in Hadoop-trunk-Commit #9060 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/9060/])
Add YARN-2975, YARN-3893, YARN-2902 and YARN-4354 to Release 2.6.4 entry (junping_du: rev b6c9d3fab9c76b03abd664858f64a4ebf3c2bb20)
* hadoop-yarn-project/CHANGES.txt

> Both RM in active state when Admin#transitionToActive failure from refeshAll()
> ------------------------------------------------------------------------------
>
>                 Key: YARN-3893
>                 URL: https://issues.apache.org/jira/browse/YARN-3893
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: resourcemanager
>    Affects Versions: 2.7.1
>            Reporter: Bibin A Chundatt
>            Assignee: Bibin A Chundatt
>            Priority: Critical
>             Fix For: 2.7.2, 2.6.4
>
>         Attachments: 0001-YARN-3893.patch, 0002-YARN-3893.patch,
> 0003-YARN-3893.patch, 0004-YARN-3893.patch, 0005-YARN-3893.patch,
> 0006-YARN-3893.patch, 0007-YARN-3893.patch, 0008-YARN-3893.patch,
> 0009-YARN-3893.patch, 0010-YARN-3893.patch, yarn-site.xml
>
>
> Cases that can cause this.
> # Capacity scheduler xml is wrongly configured during switch
> # Refresh ACL failure due to configuration
> # Refresh User group failure due to configuration
> Continuously both RM will try to be active
> {code}
> dsperf@host-10-128:/opt/bibin/dsperf/OPENSOURCE_3_0/install/hadoop/resourcemanager/bin> ./yarn rmadmin -getServiceState rm1
> 15/07/07 19:08:10 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
> active
> dsperf@host-128:/opt/bibin/dsperf/OPENSOURCE_3_0/install/hadoop/resourcemanager/bin> ./yarn rmadmin -getServiceState rm2
> 15/07/07 19:08:12 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
> active
> {code}
> # Both Web UI active
> # Status shown as active for both RM

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
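
The check shown in the report (querying each RM with `yarn rmadmin -getServiceState`) could be scripted roughly as follows. This is only a sketch and not part of the YARN-3893 patch: the function names `count_active`, `rm_state` and `check_split_brain` are hypothetical, the RM ids `rm1` and `rm2` are the ones used in the report, and it assumes the state is the last line on stdout while the NativeCodeLoader warning goes to stderr.

```shell
#!/bin/sh
# Hypothetical helpers; not part of Hadoop or of the YARN-3893 patch.

# Print how many of the given HA state strings are exactly "active".
count_active() {
  n=0
  for state in "$@"; do
    if [ "$state" = "active" ]; then
      n=$((n + 1))
    fi
  done
  echo "$n"
}

# Print the HA state of one RM, given its id (e.g. rm1).
# Assumption: log warnings go to stderr and the state is the last stdout line.
rm_state() {
  yarn rmadmin -getServiceState "$1" 2>/dev/null | tail -n 1
}

# Return 1 (and warn on stderr) if more than one of the given RM ids is active.
check_split_brain() {
  states=""
  for id in "$@"; do
    states="$states $(rm_state "$id")"
  done
  # Word splitting of $states is intentional: one word per RM.
  active=$(count_active $states)
  if [ "$active" -gt 1 ]; then
    echo "WARNING: $active ResourceManagers report active" >&2
    return 1
  fi
  return 0
}
```

On a cluster configured like the one in the report, `check_split_brain rm1 rm2` would be expected to return non-zero while both RMs answer `active`.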
[jira] [Commented] (YARN-3893) Both RM in active state when Admin#transitionToActive failure from refeshAll()
[ https://issues.apache.org/jira/browse/YARN-3893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15081150#comment-15081150 ]

Junping Du commented on YARN-3893:
----------------------------------

Thanks [~rohithsharma]!
[jira] [Commented] (YARN-3893) Both RM in active state when Admin#transitionToActive failure from refeshAll()
[ https://issues.apache.org/jira/browse/YARN-3893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15080545#comment-15080545 ]

Junping Du commented on YARN-3893:
----------------------------------

Hi [~bibinchundatt], [~rohithsharma] and [~xgong], is this bug also valid on branch-2.6? If so, maybe we should consider backporting it to branch-2.6?
[jira] [Commented] (YARN-3893) Both RM in active state when Admin#transitionToActive failure from refeshAll()
[ https://issues.apache.org/jira/browse/YARN-3893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15080685#comment-15080685 ]

Rohith Sharma K S commented on YARN-3893:
-----------------------------------------

I think it should be there in 2.6 too; let me cross-confirm it. If it exists, I will backport this to 2.6.
[jira] [Commented] (YARN-3893) Both RM in active state when Admin#transitionToActive failure from refeshAll()
[ https://issues.apache.org/jira/browse/YARN-3893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15080740#comment-15080740 ]

Rohith Sharma K S commented on YARN-3893:
-----------------------------------------

This issue is valid for branch-2.6. I have backported it to 2.6.4.
[jira] [Commented] (YARN-3893) Both RM in active state when Admin#transitionToActive failure from refeshAll()
[ https://issues.apache.org/jira/browse/YARN-3893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14727187#comment-14727187 ]

Hudson commented on YARN-3893:
------------------------------

SUCCESS: Integrated in Hadoop-Yarn-trunk-Java8 #342 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/342/])
YARN-3893. Both RM in active state when Admin#transitionToActive failure from refeshAll() (Bibin A Chundatt via rohithsharmaks) (rohithsharmaks: rev 7d6687fe76f6152a577ff2298c358dd30fce41fb)
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/AdminService.java
* hadoop-yarn-project/CHANGES.txt
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/RMFatalEventType.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestRMHA.java
[jira] [Commented] (YARN-3893) Both RM in active state when Admin#transitionToActive failure from refeshAll()
[ https://issues.apache.org/jira/browse/YARN-3893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14727185#comment-14727185 ]

Hudson commented on YARN-3893:
------------------------------

FAILURE: Integrated in Hadoop-Yarn-trunk #1070 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/1070/])
YARN-3893. Both RM in active state when Admin#transitionToActive failure from refeshAll() (Bibin A Chundatt via rohithsharmaks) (rohithsharmaks: rev 7d6687fe76f6152a577ff2298c358dd30fce41fb)
* hadoop-yarn-project/CHANGES.txt
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestRMHA.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/AdminService.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/RMFatalEventType.java
[jira] [Commented] (YARN-3893) Both RM in active state when Admin#transitionToActive failure from refeshAll()
[ https://issues.apache.org/jira/browse/YARN-3893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14727346#comment-14727346 ]

Hudson commented on YARN-3893:
------------------------------

SUCCESS: Integrated in Hadoop-Mapreduce-trunk-Java8 #335 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/335/])
YARN-3893. Both RM in active state when Admin#transitionToActive failure from refeshAll() (Bibin A Chundatt via rohithsharmaks) (rohithsharmaks: rev 7d6687fe76f6152a577ff2298c358dd30fce41fb)
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/AdminService.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestRMHA.java
* hadoop-yarn-project/CHANGES.txt
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/RMFatalEventType.java
[jira] [Commented] (YARN-3893) Both RM in active state when Admin#transitionToActive failure from refeshAll()
[ https://issues.apache.org/jira/browse/YARN-3893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14727115#comment-14727115 ]

Varun Saxena commented on YARN-3893:
------------------------------------

+1...lgtm too
[jira] [Commented] (YARN-3893) Both RM in active state when Admin#transitionToActive failure from refeshAll()
[ https://issues.apache.org/jira/browse/YARN-3893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14727134#comment-14727134 ]

Hudson commented on YARN-3893:
------------------------------

FAILURE: Integrated in Hadoop-trunk-Commit #8387 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/8387/])
YARN-3893. Both RM in active state when Admin#transitionToActive failure from refeshAll() (Bibin A Chundatt via rohithsharmaks) (rohithsharmaks: rev 7d6687fe76f6152a577ff2298c358dd30fce41fb)
* hadoop-yarn-project/CHANGES.txt
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestRMHA.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/RMFatalEventType.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/AdminService.java
[jira] [Commented] (YARN-3893) Both RM in active state when Admin#transitionToActive failure from refeshAll()
[ https://issues.apache.org/jira/browse/YARN-3893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14727444#comment-14727444 ]

Hudson commented on YARN-3893:
------------------------------

SUCCESS: Integrated in Hadoop-Mapreduce-trunk #2284 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/2284/])
YARN-3893. Both RM in active state when Admin#transitionToActive failure from refeshAll() (Bibin A Chundatt via rohithsharmaks) (rohithsharmaks: rev 7d6687fe76f6152a577ff2298c358dd30fce41fb)
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/RMFatalEventType.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/AdminService.java
* hadoop-yarn-project/CHANGES.txt
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestRMHA.java
[jira] [Commented] (YARN-3893) Both RM in active state when Admin#transitionToActive failure from refeshAll()
[ https://issues.apache.org/jira/browse/YARN-3893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14727397#comment-14727397 ]

Hudson commented on YARN-3893:
------------------------------

FAILURE: Integrated in Hadoop-Hdfs-trunk #2264 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/2264/])
YARN-3893. Both RM in active state when Admin#transitionToActive failure from refeshAll() (Bibin A Chundatt via rohithsharmaks) (rohithsharmaks: rev 7d6687fe76f6152a577ff2298c358dd30fce41fb)
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/RMFatalEventType.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/AdminService.java
* hadoop-yarn-project/CHANGES.txt
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestRMHA.java
[jira] [Commented] (YARN-3893) Both RM in active state when Admin#transitionToActive failure from refeshAll()
[ https://issues.apache.org/jira/browse/YARN-3893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14727419#comment-14727419 ]

Hudson commented on YARN-3893:
------------------------------

FAILURE: Integrated in Hadoop-Hdfs-trunk-Java8 #325 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/325/])
YARN-3893. Both RM in active state when Admin#transitionToActive failure from refeshAll() (Bibin A Chundatt via rohithsharmaks) (rohithsharmaks: rev 7d6687fe76f6152a577ff2298c358dd30fce41fb)
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestRMHA.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/AdminService.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/RMFatalEventType.java
* hadoop-yarn-project/CHANGES.txt
[jira] [Commented] (YARN-3893) Both RM in active state when Admin#transitionToActive failure from refeshAll()
[ https://issues.apache.org/jira/browse/YARN-3893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14723297#comment-14723297 ]

Rohith Sharma K S commented on YARN-3893:
-----------------------------------------

+1 lgtm. I will commit it tomorrow if there are no objections/comments from other folks.
[jira] [Commented] (YARN-3893) Both RM in active state when Admin#transitionToActive failure from refeshAll()
[ https://issues.apache.org/jira/browse/YARN-3893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14716784#comment-14716784 ]

Bibin A Chundatt commented on YARN-3893:

Testcase failures are not related to this patch.
[jira] [Commented] (YARN-3893) Both RM in active state when Admin#transitionToActive failure from refeshAll()
[ https://issues.apache.org/jira/browse/YARN-3893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14716779#comment-14716779 ]

Hadoop QA commented on YARN-3893:

\\ \\
| (x) *{color:red}-1 overall{color}* |
\\ \\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | pre-patch | 16m 26s | Pre-patch trunk compilation is healthy. |
| {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. |
| {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 1 new or modified test files. |
| {color:green}+1{color} | javac | 7m 42s | There were no new javac warning messages. |
| {color:green}+1{color} | javadoc | 10m 2s | There were no new javadoc warning messages. |
| {color:green}+1{color} | release audit | 0m 23s | The applied patch does not increase the total number of release audit warnings. |
| {color:green}+1{color} | checkstyle | 0m 51s | There were no new checkstyle issues. |
| {color:green}+1{color} | whitespace | 0m 0s | The patch has no lines that end in whitespace. |
| {color:green}+1{color} | install | 1m 30s | mvn install still works. |
| {color:green}+1{color} | eclipse:eclipse | 0m 33s | The patch built with eclipse:eclipse. |
| {color:green}+1{color} | findbugs | 1m 32s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. |
| {color:red}-1{color} | yarn tests | 53m 42s | Tests failed in hadoop-yarn-server-resourcemanager. |
| | | 92m 44s | |
\\ \\
|| Reason || Tests ||
| Failed unit tests | hadoop.yarn.server.resourcemanager.scheduler.fair.TestAllocationFileLoaderService |
| | hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesHttpStaticUserPermissions |
\\ \\
|| Subsystem || Report/Notes ||
| Patch URL | http://issues.apache.org/jira/secure/attachment/12752740/0010-YARN-3893.patch |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | trunk / 0bf2854 |
| hadoop-yarn-server-resourcemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/8929/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt |
| Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/8929/testReport/ |
| Java | 1.7.0_55 |
| uname | Linux asf905.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | https://builds.apache.org/job/PreCommit-YARN-Build/8929/console |

This message was automatically generated.
[jira] [Commented] (YARN-3893) Both RM in active state when Admin#transitionToActive failure from refeshAll()
[ https://issues.apache.org/jira/browse/YARN-3893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14716642#comment-14716642 ]

Naganarasimha G R commented on YARN-3893:

Hi [~bibinchundatt],

2. There are test cases related to the transition in TestRMAdminService.testRMHAWithFileSystemBasedConfiguration, but most of them are in TestRMHA, so I think it should be fine.

3. IMHO it would be better to handle this with the latter approach I suggested: {{refreshAll}} is just a private method, while the operation that actually failed is transitionToActive, so a name that says so is more readable than {{ACTIVE_REFRESH_FAIL}}.
[jira] [Commented] (YARN-3893) Both RM in active state when Admin#transitionToActive failure from refeshAll()
[ https://issues.apache.org/jira/browse/YARN-3893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14716645#comment-14716645 ]

Naganarasimha G R commented on YARN-3893:

Oops, saw this message late!
[jira] [Commented] (YARN-3893) Both RM in active state when Admin#transitionToActive failure from refeshAll()
[ https://issues.apache.org/jira/browse/YARN-3893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14716543#comment-14716543 ]

Naganarasimha G R commented on YARN-3893:

Hi [~bibinchundatt],

Thanks for the patch. The test cases ran fine, and the approach and the test case look fine, but I have a few comments:
# A timeout of 90 is on the higher side. Is that much required, or was it for local testing?
# Instead of adding the test case to RMHA, can we think of adding it to TestRMAdminService, as the failure is related to the transition to active?
# When throwing RMFatalEvent, it may be better to wrap the existing exception in another exception whose message says that the transition to active failed, so that the RM logs clearly show during which operation it exited. Alternatively, instead of the event type {{ACTIVE_REFRESH_FAIL}} we could use the more intuitive name {{TRANSITION_TO_ACTIVE_FAILED}}.
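The wrapping suggested in the third review point can be sketched roughly as follows. This is only an illustration under stated assumptions: `FatalEvent`, `FatalEventType`, the boolean `configValid` flag and both method bodies are hypothetical stand-ins, not the ResourceManager's real classes or the content of any attached patch. The point is just that the outer exception names the failed operation while the original cause stays attached.

```java
import java.io.IOException;

// Hypothetical sketch: none of these types are Hadoop classes.
final class TransitionFailureSketch {

    /** The two event-type names discussed in the thread. */
    enum FatalEventType { ACTIVE_REFRESH_FAIL, TRANSITION_TO_ACTIVE_FAILED }

    /** Minimal stand-in for a fatal event: a type plus the exception that triggered it. */
    static final class FatalEvent {
        final FatalEventType type;
        final Exception cause;

        FatalEvent(FatalEventType type, Exception cause) {
            this.type = type;
            this.cause = cause;
        }

        /** Renders the operation-level message followed by the root cause, as a log line might. */
        String describe() {
            Throwable root = cause.getCause();
            String rootText = (root == null) ? "" : " (caused by: " + root.getMessage() + ")";
            return type + ": " + cause.getMessage() + rootText;
        }
    }

    /** Stand-in for the refresh step; fails when the configuration is invalid. */
    static void refreshAll(boolean configValid) throws IOException {
        if (!configValid) {
            throw new IOException("invalid capacity-scheduler.xml");
        }
    }

    /** Returns null on success, or a fatal event describing the failed transition. */
    static FatalEvent transitionToActive(boolean configValid) {
        try {
            refreshAll(configValid);
            return null;
        } catch (IOException e) {
            // Wrap the original exception so the log names the failed operation.
            Exception wrapped =
                new IllegalStateException("transition to active failed in refreshAll", e);
            return new FatalEvent(FatalEventType.TRANSITION_TO_ACTIVE_FAILED, wrapped);
        }
    }
}
```

With an invalid configuration, `describe()` yields one line that carries both the operation-level message and the underlying configuration error, which is the log clarity the review comment asks for.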
[jira] [Commented] (YARN-3893) Both RM in active state when Admin#transitionToActive failure from refeshAll()
[ https://issues.apache.org/jira/browse/YARN-3893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14716563#comment-14716563 ]

Bibin A Chundatt commented on YARN-3893:

Hi Naga,

Thanks for looking into the patch.

{quote}
timeout of 90 is on the higher side is that much req or was it for local testing ?
{quote}
Will update the same.

{quote}
instead of test case in RMHA can we think of adding it to TestRMAdminService as the failure is related to transition to Active ?
{quote}
As I understand it, all transitionToActive HA-related test cases are added in the same class.

3. With {{TRANSITION_TO_ACTIVE_FAILED}}: it is not actually the transition that fails, it is {{refreshAll}}, right? That is the reason for the specific name.

Points 2 and 3 are not mandatory fix items, right?
[jira] [Commented] (YARN-3893) Both RM in active state when Admin#transitionToActive failure from refeshAll()
[ https://issues.apache.org/jira/browse/YARN-3893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14712619#comment-14712619 ]

Varun Saxena commented on YARN-3893:

I do not have any concern about exiting the JVM. If fail fast is true (the default behavior), the JVM will exit anyway. I was wondering whether it would be semantically appropriate to make the JVM exit in some cases when somebody has explicitly changed the fail fast config to false. Logs can also fill up if yarn-site.xml is wrong on both RMs.

I am not sure about the webapp part, though. Does it require the client RM service to be initialized? AFAIK, if the RM is standby, the request will hit the webapp filter and be redirected to the other RM (which may be active). I haven't tested the UI after applying the previous patches, so maybe Bibin can tell. If there are issues with the webapp, we will have to exit the JVM when the transition to standby fails, because there may be no other way out then.

I will discuss this further with you offline.
[jira] [Commented] (YARN-3893) Both RM in active state when Admin#transitionToActive failure from refeshAll()
[ https://issues.apache.org/jira/browse/YARN-3893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14712637#comment-14712637 ]

Varun Saxena commented on YARN-3893:

In fact, in my view, we could crash the RM every time the config is wrong, because until the config is corrected, the RM with the wrong config cannot become active (and hence will be unusable). In that case, the fail fast config would not even be required. So should we change the behavior to keep the RM in standby (but up) if fail fast is set to false? Anyway, we can discuss this in more detail face to face.
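The two behaviors weighed in this comment can be compared in a small sketch. Everything here is hypothetical: the `HaState` enum, the method name and the two boolean flags are illustrative and do not come from the ResourceManager or from the patches on this issue. It only encodes the invariant under discussion, namely that an RM whose refresh fails must not end up reporting itself as active, whichever way the fail fast setting points.

```java
// Hypothetical sketch: names and behavior are illustrative, not the ResourceManager's code.
final class FailFastSketch {

    enum HaState { STANDBY, ACTIVE, STOPPED }

    /**
     * Models one attempt to become active.
     *
     * @param configValid whether the refresh of queues, ACLs and user groups would succeed
     * @param failFast    whether a failure should stop the process instead of falling back
     * @return the state the RM ends up in after the attempt
     */
    static HaState tryTransitionToActive(boolean configValid, boolean failFast) {
        if (configValid) {
            return HaState.ACTIVE;
        }
        // Refresh failed: the RM must not keep reporting itself as active.
        // STOPPED models a JVM exit without actually calling System.exit.
        return failFast ? HaState.STOPPED : HaState.STANDBY;
    }
}
```

Under this model the only open question is the one raised above: whether the `STANDBY` branch is worth keeping, given that an RM with a broken configuration cannot become active until the configuration is fixed.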
[jira] [Commented] (YARN-3893) Both RM in active state when Admin#transitionToActive failure from refeshAll()
[ https://issues.apache.org/jira/browse/YARN-3893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14712896#comment-14712896 ]

Varun Saxena commented on YARN-3893:

The latest patch, 0008-YARN-3893.patch, LGTM. +1 pending Jenkins.
[jira] [Commented] (YARN-3893) Both RM in active state when Admin#transitionToActive failure from refeshAll()
[ https://issues.apache.org/jira/browse/YARN-3893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14712745#comment-14712745 ]

Sunil G commented on YARN-3893:

As I see it, exiting the JVM is reasonable, as Rohith proposed earlier. In most cases the scheduler configuration is wrong, and there is no need to switch to standby, consult fail-fast, and so on. If we exit the JVM directly, it will be clean, and the logs will contain enough information to analyze why the configuration failed.
[jira] [Commented] (YARN-3893) Both RM in active state when Admin#transitionToActive failure from refeshAll()
[ https://issues.apache.org/jira/browse/YARN-3893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14712942#comment-14712942 ]

Hadoop QA commented on YARN-3893:

\\ \\
| (x) *{color:red}-1 overall{color}* |
\\ \\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | pre-patch | 16m 15s | Pre-patch trunk compilation is healthy. |
| {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. |
| {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 1 new or modified test files. |
| {color:green}+1{color} | javac | 7m 41s | There were no new javac warning messages. |
| {color:green}+1{color} | javadoc | 9m 59s | There were no new javadoc warning messages. |
| {color:green}+1{color} | release audit | 0m 22s | The applied patch does not increase the total number of release audit warnings. |
| {color:green}+1{color} | checkstyle | 0m 50s | There were no new checkstyle issues. |
| {color:green}+1{color} | whitespace | 0m 0s | The patch has no lines that end in whitespace. |
| {color:green}+1{color} | install | 1m 27s | mvn install still works. |
| {color:green}+1{color} | eclipse:eclipse | 0m 33s | The patch built with eclipse:eclipse. |
| {color:green}+1{color} | findbugs | 1m 32s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. |
| {color:red}-1{color} | yarn tests | 52m 9s | Tests failed in hadoop-yarn-server-resourcemanager. |
| | | 90m 50s | |
\\ \\
|| Reason || Tests ||
| Failed unit tests | hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesAppsModification |
\\ \\
|| Subsystem || Report/Notes ||
| Patch URL | http://issues.apache.org/jira/secure/attachment/12752428/0006-YARN-3893.patch |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | trunk / a4d9acc |
| hadoop-yarn-server-resourcemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/8914/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt |
| Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/8914/testReport/ |
| Java | 1.7.0_55 |
| uname | Linux asf905.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | https://builds.apache.org/job/PreCommit-YARN-Build/8914/console |

This message was automatically generated.
[jira] [Commented] (YARN-3893) Both RM in active state when Admin#transitionToActive failure from refeshAll()
[ https://issues.apache.org/jira/browse/YARN-3893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14713001#comment-14713001 ]

Bibin A Chundatt commented on YARN-3893:

Test failures are not related to this patch. I have looked into the failed test cases:
{{hadoop.yarn.server.resourcemanager.security.TestClientToAMTokens}} - failed due to a bind exception
{{hadoop.yarn.server.resourcemanager.scheduler.fair.TestAllocationFileLoaderService}} - verified locally; it runs fine and passes
{{hadoop.yarn.server.resourcemanager.TestClientRMService}} - ran locally in Eclipse; it works fine
[jira] [Commented] (YARN-3893) Both RM in active state when Admin#transitionToActive failure from refeshAll()
[ https://issues.apache.org/jira/browse/YARN-3893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14712974#comment-14712974 ]

Hadoop QA commented on YARN-3893:

\\
\\
| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | pre-patch | 16m 32s | Pre-patch trunk compilation is healthy. |
| {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. |
| {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 1 new or modified test files. |
| {color:green}+1{color} | javac | 7m 46s | There were no new javac warning messages. |
| {color:green}+1{color} | javadoc | 9m 48s | There were no new javadoc warning messages. |
| {color:green}+1{color} | release audit | 0m 23s | The applied patch does not increase the total number of release audit warnings. |
| {color:green}+1{color} | checkstyle | 0m 51s | There were no new checkstyle issues. |
| {color:green}+1{color} | whitespace | 0m 0s | The patch has no lines that end in whitespace. |
| {color:green}+1{color} | install | 1m 29s | mvn install still works. |
| {color:green}+1{color} | eclipse:eclipse | 0m 33s | The patch built with eclipse:eclipse. |
| {color:green}+1{color} | findbugs | 1m 29s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. |
| {color:red}-1{color} | yarn tests | 53m 19s | Tests failed in hadoop-yarn-server-resourcemanager. |
| | | 92m 14s | |
\\
\\
|| Reason || Tests ||
| Failed unit tests | hadoop.yarn.server.resourcemanager.security.TestClientToAMTokens |
| | hadoop.yarn.server.resourcemanager.scheduler.fair.TestAllocationFileLoaderService |
| | hadoop.yarn.server.resourcemanager.TestClientRMService |
\\
\\
|| Subsystem || Report/Notes ||
| Patch URL | http://issues.apache.org/jira/secure/attachment/12752434/0007-YARN-3893.patch |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | trunk / a4d9acc |
| hadoop-yarn-server-resourcemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/8915/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt |
| Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/8915/testReport/ |
| Java | 1.7.0_55 |
| uname | Linux asf905.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | https://builds.apache.org/job/PreCommit-YARN-Build/8915/console |

This message was automatically generated.
[jira] [Commented] (YARN-3893) Both RM in active state when Admin#transitionToActive failure from refeshAll()
[ https://issues.apache.org/jira/browse/YARN-3893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14713002#comment-14713002 ]

Bibin A Chundatt commented on YARN-3893:

The above comments refer to https://builds.apache.org/job/PreCommit-YARN-Build/8915/testReport/
[jira] [Commented] (YARN-3893) Both RM in active state when Admin#transitionToActive failure from refeshAll()
[ https://issues.apache.org/jira/browse/YARN-3893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14712999#comment-14712999 ]

Hadoop QA commented on YARN-3893:

\\
\\
| (/) *{color:green}+1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | pre-patch | 16m 43s | Pre-patch trunk compilation is healthy. |
| {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. |
| {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 1 new or modified test files. |
| {color:green}+1{color} | javac | 7m 55s | There were no new javac warning messages. |
| {color:green}+1{color} | javadoc | 10m 8s | There were no new javadoc warning messages. |
| {color:green}+1{color} | release audit | 0m 23s | The applied patch does not increase the total number of release audit warnings. |
| {color:green}+1{color} | checkstyle | 0m 51s | There were no new checkstyle issues. |
| {color:green}+1{color} | whitespace | 0m 0s | The patch has no lines that end in whitespace. |
| {color:green}+1{color} | install | 1m 31s | mvn install still works. |
| {color:green}+1{color} | eclipse:eclipse | 0m 34s | The patch built with eclipse:eclipse. |
| {color:green}+1{color} | findbugs | 1m 31s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. |
| {color:green}+1{color} | yarn tests | 53m 39s | Tests passed in hadoop-yarn-server-resourcemanager. |
| | | 93m 18s | |
\\
\\
|| Subsystem || Report/Notes ||
| Patch URL | http://issues.apache.org/jira/secure/attachment/12752437/0008-YARN-3893.patch |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | trunk / a4d9acc |
| hadoop-yarn-server-resourcemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/8916/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt |
| Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/8916/testReport/ |
| Java | 1.7.0_55 |
| uname | Linux asf908.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | https://builds.apache.org/job/PreCommit-YARN-Build/8916/console |

This message was automatically generated.
[jira] [Commented] (YARN-3893) Both RM in active state when Admin#transitionToActive failure from refeshAll()
[ https://issues.apache.org/jira/browse/YARN-3893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14712767#comment-14712767 ]

Varun Saxena commented on YARN-3893:

Yes, I agree. We can exit the JVM directly; there is no need to use fail-fast.
[jira] [Commented] (YARN-3893) Both RM in active state when Admin#transitionToActive failure from refeshAll()
[ https://issues.apache.org/jira/browse/YARN-3893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14711290#comment-14711290 ]

Varun Saxena commented on YARN-3893:

I saw your comments above. We can't do what we were doing earlier because, as you say, the WebApp should be up even in standby. Let me think about whether something else can be done.
[jira] [Commented] (YARN-3893) Both RM in active state when Admin#transitionToActive failure from refeshAll()
[ https://issues.apache.org/jira/browse/YARN-3893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14710942#comment-14710942 ]

Rohith Sharma K S commented on YARN-3893:

Thanks [~bibinchundatt] for updating the patch. The patch is mostly reasonable. Some comments on the patch:
# Is the {{isRMActive()}} check required? refreshAll is executed only if transitionedToActive succeeds. In any case, if you do add it, the check should be common to both, i.e. *_if_else*.
# In the test, is the code below expecting transitionToActive to fail? If so, the RM should not be in the Active state. Why would the RM be Active if adminService fails to transition?
{code}
+try {
+  rm.adminService.transitionToActive(requestInfo);
+} catch (Exception e) {
+  assertTrue("Error when transitioning to Active mode".contains(e
+      .getMessage()));
+}
+assertEquals(HAServiceState.ACTIVE, rm.getRMContext().getHAServiceState());
{code}
# Have you verified the test locally? I suspect the test may exit midway, since you are changing the scheduler configuration. The scheduler configuration is loaded during transitionedToStandby, which fails to load, and *System.exit* is called.
[jira] [Commented] (YARN-3893) Both RM in active state when Admin#transitionToActive failure from refeshAll()
[ https://issues.apache.org/jira/browse/YARN-3893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14711036#comment-14711036 ]

Varun Saxena commented on YARN-3893:

A few additional comments:
* In the exception block after the call to refreshAll (below), if {{YarnConfiguration.shouldRMFailFast(getConfig())}} is true, we merely post a fatal event and do not return or throw an exception. This leads to the success audit log for the transition to active being printed, which does not look correct, because we hit a problem during the transition. We should either return or throw a ServiceFailedException here as well. Both are OK because the RM would be down later anyway, but I would prefer the exception.
{code}
} catch (Exception e) {
  if (isRMActive() && YarnConfiguration.shouldRMFailFast(getConfig())) {
    rmContext.getDispatcher().getEventHandler()
        .handle(new RMFatalEvent(RMFatalEventType.ACTIVE_REFRESH_FAIL, e));
  }else{
    rm.handleTransitionToStandBy();
    throw new ServiceFailedException(
        "Error on refreshAll during transistion to Active", e);
  }
}
RMAuditLogger.logSuccess(user.getShortUserName(), "transitionToActive",
    "RMHAProtocolService");
}
{code}
* In TestRMHA, the import below is unused.
{code}
import io.netty.channel.MessageSizeEstimator.Handle;
{code}
* A nit: there should be a space before else.
{code}
}else{
  rm.handleTransitionToStandBy();
{code}
* In the added test, the assert is not required in the exception block after the first call to transitionToActive.
* Maybe we can add an assert in the test that the service state is STANDBY after a call to transitionToActive with an incorrect capacity scheduler config and fail-fast set to false.
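A minimal sketch of the control flow suggested in Varun Saxena's comment above: throw after the failed refresh in both branches, so that the success audit entry is never written for a failed transition. This is a simplified model, not the Hadoop code. The class {{TransitionSketch}}, the exception {{TransitionFailedException}}, the two in-memory lists, and the plain {{failFast}} flag (standing in for the {{isRMActive()}} and {{shouldRMFailFast}} check) are all hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Simplified stand-in for the flow under review; none of these types are Hadoop classes. */
class TransitionSketch {
    enum HAState { STANDBY, ACTIVE }

    /** Hypothetical stand-in for ServiceFailedException. */
    static class TransitionFailedException extends Exception {
        TransitionFailedException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    HAState state = HAState.STANDBY;
    final List<String> auditLog = new ArrayList<>();
    final List<String> fatalEvents = new ArrayList<>();

    void transitionToActive(Runnable refreshAll, boolean failFast)
            throws TransitionFailedException {
        state = HAState.ACTIVE;
        try {
            refreshAll.run();
        } catch (RuntimeException e) {
            if (failFast) {
                // Record a fatal event; in the real RM this is what leads to shutdown.
                fatalEvents.add("ACTIVE_REFRESH_FAIL");
            } else {
                // Fall back to standby so that two RMs never both report active.
                state = HAState.STANDBY;
            }
            // Throw in both branches so the success audit entry below is skipped.
            throw new TransitionFailedException(
                "refreshAll failed during transition to active", e);
        }
        auditLog.add("transitionToActive SUCCESS");
    }
}
```

With {{failFast}} set to false and a refresh that throws, the model ends in STANDBY with an empty audit log, which is also the assertion the last bullet above proposes for the test.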
[jira] [Commented] (YARN-3893) Both RM in active state when Admin#transitionToActive failure from refeshAll()
[ https://issues.apache.org/jira/browse/YARN-3893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14710955#comment-14710955 ]

Rohith Sharma K S commented on YARN-3893:

To be clearer on the 3rd point: the {{handleTransitionToStandBy}} call will exit if transitionToStandby fails. This transition may fail because the active services are initialized during the transition. CS initialization loads the new capacity-scheduler conf, which contains a wrong default queue capacity value, so the standby transition fails.

4. Instead of having a separate class FatalEventCountDispatcher, can it be made inline?
[jira] [Commented] (YARN-3893) Both RM in active state when Admin#transitionToActive failure from refeshAll()
[ https://issues.apache.org/jira/browse/YARN-3893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14711164#comment-14711164 ]

Varun Saxena commented on YARN-3893:

Moreover, the fail-fast configuration doesn't quite work as expected here. If the capacity scheduler configuration is wrong, initialization will fail again and the JVM will exit, which in essence is exactly the same as the other case. We can handle the fail-fast true case the same way as earlier, IMO.

The reason it works in the test (the JVM does not exit) is that you have passed a CapacitySchedulerConfiguration object to MockRM. As CapacitySchedulerConfiguration is not an instance of YarnConfiguration, a new YarnConfiguration object is created and passed to ResourceManager. When you change the configuration in the test and set the queue capacity to 200, the change is not reflected in the Configuration object in the ResourceManager class. That is why the JVM does not exit when we transition to standby.
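The copy-versus-reference effect described in Varun Saxena's comment above can be sketched as follows. This is an illustration under stated assumptions, not the MockRM code: {{BaseConf}}, {{YarnConf}}, {{SchedulerConf}}, {{toYarnConf}} and the key {{queue.capacity}} are hypothetical stand-ins for the configuration classes and the wrapping step mentioned in the comment.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical model of "wrap in a new configuration unless it already has the right type". */
class ConfCopySketch {
    /** Stand-in for a generic key/value configuration. */
    static class BaseConf {
        final Map<String, String> props = new HashMap<>();

        BaseConf() {}

        /** Copy constructor: snapshots the other object's current values. */
        BaseConf(BaseConf other) {
            props.putAll(other.props);
        }

        void set(String key, String value) { props.put(key, value); }

        String get(String key) { return props.get(key); }
    }

    /** Stand-in for the configuration type the resource manager expects. */
    static class YarnConf extends BaseConf {
        YarnConf() {}

        YarnConf(BaseConf other) { super(other); }
    }

    /** Stand-in for a scheduler configuration that is NOT a YarnConf. */
    static class SchedulerConf extends BaseConf {}

    /** Reuse the object only if it is already a YarnConf; otherwise wrap a copy. */
    static YarnConf toYarnConf(BaseConf conf) {
        return (conf instanceof YarnConf) ? (YarnConf) conf : new YarnConf(conf);
    }
}
```

In this model, a value set on a {{SchedulerConf}} after {{toYarnConf}} has run is not visible through the returned object, because the returned object is a copy. Passing a {{YarnConf}} instead returns the same instance, so later changes would be visible.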
[jira] [Commented] (YARN-3893) Both RM in active state when Admin#transitionToActive failure from refeshAll()
[ https://issues.apache.org/jira/browse/YARN-3893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14711171#comment-14711171 ]

Varun Saxena commented on YARN-3893:

Sorry, I meant we can handle the case of the fail-fast config being *false* the same way as we were doing in earlier patches. Otherwise checking for fail-fast doesn't make any difference, because both code paths lead to the same result.
[jira] [Commented] (YARN-3893) Both RM in active state when Admin#transitionToActive failure from refeshAll()
[ https://issues.apache.org/jira/browse/YARN-3893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14711201#comment-14711201 ]

Rohith Sharma K S commented on YARN-3893:

There are 2 types of refresh that can happen: 1. yarn-site.xml refresh, 2. scheduler configuration refresh. Scheduler configurations are reloaded on every service initialization, which is by design. If there is any issue in the scheduler configuration, the fail-fast configuration behaves the same for both true and false. The fail-fast configuration is useful when the admin makes a mistake in configuring yarn-site.xml. With a wrong configuration in yarn-site.xml, the RM service can be up, whereas with a wrong scheduler configuration the service can NOT be up at all. *On a best-effort basis to bring the service up*, exception handling for yarn-site.xml and for the scheduler configuration differ.

BTW, making the RM state standby would fill up the logs very soon, because the elector continuously tries to make it active. For any configuration issue, it is better to exit the JVM and notify the admin that the RM is down, so that the admin can check the logs and identify the problem.
[jira] [Commented] (YARN-3893) Both RM in active state when Admin#transitionToActive failure from refeshAll()
[ https://issues.apache.org/jira/browse/YARN-3893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14711274#comment-14711274 ]

Varun Saxena commented on YARN-3893:

Hmm... my point of view was based on the fact that the service cannot be up if at least one RM is not active. A standby RM is not going to serve anything anyway. Until the configurations of this RM are corrected, whether yarn-site or scheduler configurations, this RM can't become active anyway (refreshAll will always fail). And you can say there might be some silly mistake in the scheduler configuration too.

What we were doing before in the patch won't fill up the logs if the configuration is OK on the other RM. And if it's not OK on the other RM, the logs will fill up even if refreshAll fails because of something other than the scheduler config (and fail-fast is false). Fail-fast is true by default, and if the admin sets it to false, he will know what to expect.

But you can say an RM shutting down is a far more alarming thing for an admin, and scheduler configurations are more important. I agree with that. Maybe we can keep an RM with a wrong configuration down at all times, because until he corrects the config (whether yarn-site or scheduler config), this RM can't become active.

Let us take the opinion of a couple of others as well on this. We can do whatever the consensus is.
[jira] [Commented] (YARN-3893) Both RM in active state when Admin#transitionToActive failure from refeshAll()
[ https://issues.apache.org/jira/browse/YARN-3893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14711278#comment-14711278 ]

Varun Saxena commented on YARN-3893:

In previous patches, we were delaying reinitialization until the next attempt to transition to active, rather than attempting it immediately as we have done here. Do you expect any issues with that?
[jira] [Commented] (YARN-3893) Both RM in active state when Admin#transitionToActive failure from refeshAll()
[ https://issues.apache.org/jira/browse/YARN-3893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14712308#comment-14712308 ]

Rohith Sharma K S commented on YARN-3893:

Hi [~varun_saxena], trying to understand your point: my suggestion is to exit the RM on any configuration issue during refreshAll in {{AdminService#transitionToActive}}. I gave the reason for bringing the RM JVM down rather than keeping it alive in an earlier [comment|https://issues.apache.org/jira/browse/YARN-3893?focusedCommentId=14711201&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14711201]. Do you have any concern about exiting the RM for configuration issues?
[jira] [Commented] (YARN-3893) Both RM in active state when Admin#transitionToActive failure from refeshAll()
[ https://issues.apache.org/jira/browse/YARN-3893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14708470#comment-14708470 ]
Rohith Sharma K S commented on YARN-3893:
I had a closer look at both of the solutions above. Each has a potential issue:
# Moving createAndInitService to just before starting activeServices in transitionToActive.
## Switch time will be impacted, since every transitionToActive initializes the active services.
## RMWebApp depends on clientRMService for starting the webapps. Without clientRMService initialization, RMWebApp cannot be started.
# Moving refreshAll before transitionToActive in AdminService is the same as triggering RMAdminCli on the standby node. That call throws a StandbyException and is retried against the active RM in RMAdminCli. In AdminService#transitionedToActive(), refreshing before {{rm.transitionedToActive}} throws a standby exception.
[jira] [Commented] (YARN-3893) Both RM in active state when Admin#transitionToActive failure from refeshAll()
[ https://issues.apache.org/jira/browse/YARN-3893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14708471#comment-14708471 ]
Rohith Sharma K S commented on YARN-3893:
I think that for any configuration issue hit while transitioning to active, AdminService should not allow the JVM to continue. If AdminService throws an exception back to the elector, the elector tries again to make the RM active, which loops forever and fills the logs. Two calls can be points of failure: first {{rm.transitionedToActive}}, second {{refreshAll()}}.
# If {{rm.transitionedToActive}} fails, the RM services are stopped and the RM is in STANDBY state.
# If {{refreshAll()}} fails, BOTH RMs are in ACTIVE state, as per this defect.
Continuing to run RM services with an invalid configuration is not a good idea. Moreover, the user should be notified of an invalid configuration immediately. So it would be better to make use of the fail-fast configuration to exit the RM JVM. If this configuration is set to false, then call {{rm.handleTransitionToStandBy}}.
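The fail-fast proposal in the comment above can be sketched roughly as follows. This is only an illustrative model, not Hadoop code: the class {{FailFastTransitionSketch}}, the {{Refresher}} interface, the {{exitHook}} parameter and the {{HaState}} enum are all hypothetical names invented for the sketch. The exit is injected as a callback so the behaviour can be exercised without terminating a JVM.

```java
import java.util.function.IntConsumer;

// Hypothetical, simplified model of the fail-fast idea; none of these types are Hadoop classes.
class FailFastTransitionSketch {
    enum HaState { STANDBY, ACTIVE }

    /** Stand-in for the refreshAll() step; throws when the configuration is invalid. */
    interface Refresher {
        void refreshAll() throws Exception;
    }

    private HaState state = HaState.STANDBY;
    private final boolean failFast;
    private final Refresher refresher;
    private final IntConsumer exitHook; // e.g. System::exit in a real process (assumption)

    FailFastTransitionSketch(boolean failFast, Refresher refresher, IntConsumer exitHook) {
        this.failFast = failFast;
        this.refresher = refresher;
        this.exitHook = exitHook;
    }

    HaState getState() {
        return state;
    }

    void transitionToActive() throws Exception {
        state = HaState.ACTIVE; // stands in for the first call, rm.transitionedToActive
        try {
            refresher.refreshAll(); // second call, where a bad configuration surfaces
        } catch (Exception e) {
            if (failFast) {
                // Fail fast: stop the process so the invalid configuration is noticed at once.
                exitHook.accept(1);
            } else {
                // Otherwise fall back to standby so two nodes never both report ACTIVE.
                state = HaState.STANDBY;
            }
            throw e; // not reached when exitHook really terminates the JVM
        }
    }

    public static void main(String[] args) {
        FailFastTransitionSketch rm = new FailFastTransitionSketch(
                false,
                () -> { throw new IllegalStateException("invalid scheduler configuration"); },
                code -> { });
        try {
            rm.transitionToActive();
        } catch (Exception e) {
            System.out.println("transition failed, state = " + rm.getState());
        }
    }
}
```

With fail-fast disabled, the sketch leaves the node in STANDBY after a failed refresh; with it enabled, the exit callback is invoked instead.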
[jira] [Commented] (YARN-3893) Both RM in active state when Admin#transitionToActive failure from refeshAll()
[ https://issues.apache.org/jira/browse/YARN-3893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14708425#comment-14708425 ]
Bibin A Chundatt commented on YARN-3893:
Hi [~rohithsharma], any comments on this?
[jira] [Commented] (YARN-3893) Both RM in active state when Admin#transitionToActive failure from refeshAll()
[ https://issues.apache.org/jira/browse/YARN-3893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14708491#comment-14708491 ]
Varun Saxena commented on YARN-3893:
Using fail fast makes sense.
[jira] [Commented] (YARN-3893) Both RM in active state when Admin#transitionToActive failure from refeshAll()
[ https://issues.apache.org/jira/browse/YARN-3893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14703197#comment-14703197 ]
Sunil G commented on YARN-3893:
Hi [~rohithsharma], thank you for restarting this thread. The idea of calling {{createAndInitActiveServices}} from both ResourceManager#transitionToActive() and transitionToStandby is good. In this case, we can remove the call to {{refreshAll}} from AdminService#transitionToStandby.
[jira] [Commented] (YARN-3893) Both RM in active state when Admin#transitionToActive failure from refeshAll()
[ https://issues.apache.org/jira/browse/YARN-3893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14701416#comment-14701416 ]
Bibin A Chundatt commented on YARN-3893:
Hi [~rohithsharma], thank you for your review comments. I will update the patch accordingly and upload it soon.
[jira] [Commented] (YARN-3893) Both RM in active state when Admin#transitionToActive failure from refeshAll()
[ https://issues.apache.org/jira/browse/YARN-3893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14698908#comment-14698908 ]
Rohith Sharma K S commented on YARN-3893:
Sorry for coming in very late. This issue has become stale; we need to move forward! Regarding the patch:
# Instead of setting a boolean flag for reinitActiveServices in AdminService and making the other changes, moving {{createAndInitActiveServices();}} from transitionedToStandby to just before starting activeServices would solve such issues. On an exception while transitioning to active, handle it by adding a stopActiveServices call in ResourceManager#transitioningToActive() only.
# We can probably remove refreshAll() from AdminService#transitionToActive if we take the above approach.
Any thoughts?
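The first suggestion in the comment above (create the active services just before starting them, and stop them if the start fails) might look roughly like the sketch below. It is a toy model under stated assumptions: {{LazyActiveServicesSketch}}, {{Service}} and {{ServiceFactory}} are hypothetical names, not the ResourceManager's real types, and the factory stands in for {{createAndInitActiveServices()}}.

```java
// Hypothetical model of "create and init active services right before starting them".
class LazyActiveServicesSketch {
    /** Stand-in for the composite of active services. */
    interface Service {
        void start() throws Exception;
        void stop();
    }

    /** Stand-in for createAndInitActiveServices(); reads configuration afresh on each call. */
    interface ServiceFactory {
        Service create();
    }

    private final ServiceFactory factory;
    private Service activeServices; // null while in standby
    private boolean active;

    LazyActiveServicesSketch(ServiceFactory factory) {
        this.factory = factory;
    }

    boolean isActive() {
        return active;
    }

    void transitionToActive() throws Exception {
        if (active) {
            return;
        }
        // A fresh instance per attempt means a failed attempt leaves nothing half-started behind.
        Service fresh = factory.create();
        try {
            fresh.start();
        } catch (Exception e) {
            fresh.stop(); // clean up here, inside the transition itself
            throw e;
        }
        activeServices = fresh;
        active = true;
    }

    void transitionToStandby() {
        if (activeServices != null) {
            activeServices.stop();
            activeServices = null;
        }
        active = false;
    }
}
```

The trade-off raised elsewhere in this thread still applies: initializing on every transition to active lengthens the switch time.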
[jira] [Commented] (YARN-3893) Both RM in active state when Admin#transitionToActive failure from refeshAll()
[ https://issues.apache.org/jira/browse/YARN-3893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14641743#comment-14641743 ]
Bibin A Chundatt commented on YARN-3893:
{quote}Instead of checking for exception message in test, can you check for ServiceFailedException{quote}
The same is already verified in many test cases using messages.
{quote}Can you add a verification in the test to check whether active services were stopped ?{quote}
IMO it's not required.
[jira] [Commented] (YARN-3893) Both RM in active state when Admin#transitionToActive failure from refeshAll()
[ https://issues.apache.org/jira/browse/YARN-3893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14632793#comment-14632793 ]
Varun Saxena commented on YARN-3893:
[~bibinchundatt]
# Instead of checking for exception message in test, can you check for {{ServiceFailedException}} ?
# Can you add a verification in the test to check whether active services were stopped ?
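As an illustration of the first review point above, a test can assert on the exception type rather than on its message text, so that rewording the message does not break the test. The helper below is a minimal, framework-free sketch; {{ExceptionTypeAssertionSketch}}, {{expectThrows}} and {{ServiceFailedLikeException}} are hypothetical names used only here, with the last one standing in for the real {{ServiceFailedException}}.

```java
// Hypothetical helper showing a type-based exception check; not taken from the patch.
class ExceptionTypeAssertionSketch {
    /** Stand-in for ServiceFailedException in this self-contained sketch. */
    static class ServiceFailedLikeException extends Exception {
        ServiceFailedLikeException(String message) {
            super(message);
        }
    }

    interface ThrowingRunnable {
        void run() throws Exception;
    }

    /** Runs the body and returns the thrown exception if it has the expected type. */
    static <T extends Exception> T expectThrows(Class<T> expected, ThrowingRunnable body) {
        try {
            body.run();
        } catch (Exception e) {
            if (expected.isInstance(e)) {
                return expected.cast(e);
            }
            throw new AssertionError("unexpected exception type: " + e.getClass().getName(), e);
        }
        throw new AssertionError("expected " + expected.getName() + " but nothing was thrown");
    }

    public static void main(String[] args) {
        boolean[] activeServicesStopped = {false};
        expectThrows(ServiceFailedLikeException.class, () -> {
            activeServicesStopped[0] = true; // stand-in for the cleanup the second point asks to verify
            throw new ServiceFailedLikeException("wording may change freely");
        });
        if (!activeServicesStopped[0]) {
            throw new AssertionError("active services were not stopped");
        }
        System.out.println("type-based check passed");
    }
}
```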
[jira] [Commented] (YARN-3893) Both RM in active state when Admin#transitionToActive failure from refeshAll()
[ https://issues.apache.org/jira/browse/YARN-3893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14627767#comment-14627767 ]
Hadoop QA commented on YARN-3893:
\\
\\
| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | pre-patch | 18m 40s | Pre-patch trunk compilation is healthy. |
| {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. |
| {color:red}-1{color} | tests included | 0m 0s | The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. |
| {color:green}+1{color} | javac | 8m 52s | There were no new javac warning messages. |
| {color:green}+1{color} | javadoc | 11m 1s | There were no new javadoc warning messages. |
| {color:green}+1{color} | release audit | 0m 22s | The applied patch does not increase the total number of release audit warnings. |
| {color:green}+1{color} | checkstyle | 0m 57s | There were no new checkstyle issues. |
| {color:red}-1{color} | whitespace | 0m 0s | The patch has 1 line(s) that end in whitespace. Use git apply --whitespace=fix. |
| {color:green}+1{color} | install | 1m 31s | mvn install still works. |
| {color:green}+1{color} | eclipse:eclipse | 0m 36s | The patch built with eclipse:eclipse. |
| {color:green}+1{color} | findbugs | 1m 44s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. |
| {color:green}+1{color} | yarn tests | 51m 58s | Tests passed in hadoop-yarn-server-resourcemanager. |
| | | 95m 44s | |
\\
\\
|| Subsystem || Report/Notes ||
| Patch URL | http://issues.apache.org/jira/secure/attachment/12745400/0003-YARN-3893.patch |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | trunk / edcaae4 |
| whitespace | https://builds.apache.org/job/PreCommit-YARN-Build/8540/artifact/patchprocess/whitespace.txt |
| hadoop-yarn-server-resourcemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/8540/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt |
| Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/8540/testReport/ |
| Java | 1.7.0_55 |
| uname | Linux asf907.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | https://builds.apache.org/job/PreCommit-YARN-Build/8540/console |
This message was automatically generated.
[jira] [Commented] (YARN-3893) Both RM in active state when Admin#transitionToActive failure from refeshAll()
[ https://issues.apache.org/jira/browse/YARN-3893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14627950#comment-14627950 ]
Varun Saxena commented on YARN-3893:
Thanks for the patch [~bibinchundatt]. A few comments.
# Nit: the message should be "Exception in state transition".
{code}
throw new ServiceFailedException(
    "Exception in state transistion", re);
{code}
# IMO, there is no need to throw ServiceFailedException when catching the exception from the call to reinitialize. The throw below it should suffice. Just set the flag. In my view, we should retain the original exception.
# Add a comment indicating what the flag does.
# Maybe rename the flag to reinitActiveServices instead of reinitialize.
# Semantically speaking, I don't think the flag quite belongs to AdminService. It can be in ResourceManager or RMContext. Thoughts ?
# Can you add a test to verify the fix ?
# I think that instead of relying on transitionToStandby to change the state to standby, we can explicitly change the state in AdminService. That is because even stopActiveServices can throw an exception, and if it does, the state won't change to STANDBY. This call to stop should not throw an exception, but as services keep getting added you never know how a particular service may behave. We should be immune to it. Try something like below.
{code}
((RMContextImpl)rmContext).setHAServiceState(HAServiceProtocol.HAServiceState.STANDBY);
{code}
# Just a suggestion. If we do the above, maybe call stopActiveServices and reinitialize directly instead of calling transitionToStandby. This is because, as I said in a comment above, transitionToStandby would print an audit log saying the transition is successful, but reinitialize may subsequently fail. Not printing this audit log would be consistent with transitionToActive failing while starting active services. Thoughts ?
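Two of the review points above (retain the original exception, and set the state to STANDBY explicitly so that a failing stop cannot leave the node ACTIVE) can be sketched as below. This is a toy model rather than the patch itself: {{ExplicitStandbySketch}}, {{Step}}, {{HaState}} and {{TransitionFailedException}} are hypothetical names, and attaching the stop failure as a suppressed exception is an assumption of this sketch, not something the review asked for.

```java
// Hypothetical model of the review suggestions; none of these types are Hadoop classes.
class ExplicitStandbySketch {
    enum HaState { STANDBY, ACTIVE }

    interface Step {
        void run() throws Exception;
    }

    /** Stand-in for the exception reported to the caller of the transition. */
    static class TransitionFailedException extends Exception {
        TransitionFailedException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    private HaState state = HaState.STANDBY;

    HaState getState() {
        return state;
    }

    void transitionToActive(Step refreshAll, Step stopActiveServices)
            throws TransitionFailedException {
        state = HaState.ACTIVE;
        try {
            refreshAll.run();
        } catch (Exception original) {
            // Change the state first: even if stopping services throws, we are not left ACTIVE.
            state = HaState.STANDBY;
            try {
                stopActiveServices.run();
            } catch (Exception stopFailure) {
                // Keep the secondary failure visible without hiding the root cause.
                original.addSuppressed(stopFailure);
            }
            // One throw only, with the original exception retained as the cause.
            throw new TransitionFailedException("Exception in state transition", original);
        }
    }
}
```

A caller can then inspect {{getCause()}} to find the configuration error that actually triggered the failure.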
[jira] [Commented] (YARN-3893) Both RM in active state when Admin#transitionToActive failure from refeshAll()
[ https://issues.apache.org/jira/browse/YARN-3893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14628299#comment-14628299 ]
Hadoop QA commented on YARN-3893:
\\
\\
| (/) *{color:green}+1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | pre-patch | 18m 30s | Pre-patch trunk compilation is healthy. |
| {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. |
| {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 1 new or modified test files. |
| {color:green}+1{color} | javac | 8m 51s | There were no new javac warning messages. |
| {color:green}+1{color} | javadoc | 10m 50s | There were no new javadoc warning messages. |
| {color:green}+1{color} | release audit | 0m 22s | The applied patch does not increase the total number of release audit warnings. |
| {color:green}+1{color} | checkstyle | 0m 54s | There were no new checkstyle issues. |
| {color:green}+1{color} | whitespace | 0m 0s | The patch has no lines that end in whitespace. |
| {color:green}+1{color} | install | 1m 29s | mvn install still works. |
| {color:green}+1{color} | eclipse:eclipse | 0m 34s | The patch built with eclipse:eclipse. |
| {color:green}+1{color} | findbugs | 1m 40s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. |
| {color:green}+1{color} | yarn tests | 52m 6s | Tests passed in hadoop-yarn-server-resourcemanager. |
| | | 95m 19s | |
\\
\\
|| Subsystem || Report/Notes ||
| Patch URL | http://issues.apache.org/jira/secure/attachment/12745452/0004-YARN-3893.patch |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | trunk / edcaae4 |
| hadoop-yarn-server-resourcemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/8547/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt |
| Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/8547/testReport/ |
| Java | 1.7.0_55 |
| uname | Linux asf907.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | https://builds.apache.org/job/PreCommit-YARN-Build/8547/console |
This message was automatically generated.
[jira] [Commented] (YARN-3893) Both RM in active state when Admin#transitionToActive failure from refeshAll()
[ https://issues.apache.org/jira/browse/YARN-3893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14626438#comment-14626438 ]
Sunil G commented on YARN-3893:
Hi [~varun_saxena],
bq. Reinitialization of Active Services is required.
For this, I think calling {{rm.transitionToStandby(true)}} is not a good idea, because the same exception can occur while initializing CapacityScheduler (cs config file).
[jira] [Commented] (YARN-3893) Both RM in active state when Admin#transitionToActive failure from refeshAll()
[ https://issues.apache.org/jira/browse/YARN-3893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14626437#comment-14626437 ]
Sunil G commented on YARN-3893:
Hi [~varun_saxena],
bq. Reinitialization of Active Services is required.
For this, I think calling {{rm.transitionToStandby(true)}} is not a good idea, because the same exception can occur while initializing CapacityScheduler (cs config file).
[jira] [Commented] (YARN-3893) Both RM in active state when Admin#transitionToActive failure from refeshAll()
[ https://issues.apache.org/jira/browse/YARN-3893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14626664#comment-14626664 ]

Xuan Gong commented on YARN-3893:

Sorry, should call stopAndReinitiate().
[jira] [Commented] (YARN-3893) Both RM in active state when Admin#transitionToActive failure from refeshAll()
[ https://issues.apache.org/jira/browse/YARN-3893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14626688#comment-14626688 ]

Varun Saxena commented on YARN-3893:

Yes, but we do need to reinitialize the services. Otherwise the transition to active will not succeed, even when everything else is fine.
[jira] [Commented] (YARN-3893) Both RM in active state when Admin#transitionToActive failure from refeshAll()
[ https://issues.apache.org/jira/browse/YARN-3893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14626661#comment-14626661 ]

Xuan Gong commented on YARN-3893:

How about first calling rm.transitionToStandby(false) and then calling activeService.reinitiate() (we would probably need to create this function)? At least the RM will transition to standby. Even if reinitiate() throws an exception, the leader elector will handle it.
[jira] [Commented] (YARN-3893) Both RM in active state when Admin#transitionToActive failure from refeshAll()
[ https://issues.apache.org/jira/browse/YARN-3893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14626677#comment-14626677 ]

Sunil G commented on YARN-3893:

Hi [~xgong]
Yes, we can do that. But currently we call {{rmContext.setHAServiceState(HAServiceProtocol.HAServiceState.STANDBY);}} as the last statement in {{transitionToStandby}}. So if an exception happens in {{reinitialize}}, the code flow won't reach the point where the state is set to standby. We may therefore also need to set the state in the context to standby explicitly.
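The ordering concern in the comment above can be illustrated with a minimal sketch. It assumes a simplified model in which the standby state is recorded in a {{finally}} block, so a failure during reinitialization cannot leave the state as active. The names {{HaState}}, {{StandbyTransitionSketch}} and {{reinitFails}} are hypothetical and are not the ResourceManager API; this is not the patch that was committed.

```java
// Hypothetical stand-ins for illustration only; these are not the ResourceManager classes.
enum HaState { ACTIVE, STANDBY }

class StandbyTransitionSketch {
    private HaState state = HaState.ACTIVE;
    private final boolean reinitFails;

    StandbyTransitionSketch(boolean reinitFails) {
        this.reinitFails = reinitFails;
    }

    HaState getState() {
        return state;
    }

    // Simulates a reinitialization step that fails on a bad configuration.
    private void reinitialize() {
        if (reinitFails) {
            throw new IllegalStateException("simulated configuration error");
        }
    }

    // The finally block records STANDBY even when reinitialize() throws;
    // the exception still propagates to the caller.
    void transitionToStandby(boolean initialize) {
        try {
            if (initialize) {
                reinitialize();
            }
        } finally {
            state = HaState.STANDBY;
        }
    }
}

class StandbyTransitionDemo {
    public static void main(String[] args) {
        StandbyTransitionSketch rm = new StandbyTransitionSketch(true);
        try {
            rm.transitionToStandby(true);
        } catch (IllegalStateException e) {
            System.out.println("reinit failed: " + e.getMessage());
        }
        System.out.println("state = " + rm.getState());
    }
}
```

In this sketch the caller still sees the failure, but a state query afterwards reports standby rather than active.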
[jira] [Commented] (YARN-3893) Both RM in active state when Admin#transitionToActive failure from refeshAll()
[ https://issues.apache.org/jira/browse/YARN-3893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14626694#comment-14626694 ]

Varun Saxena commented on YARN-3893:

We do need to stop the active services, because many threads would have been spawned during the attempt to transition to active. Frankly, we could add a flag in the RM indicating that reinitialization of services is required, and attempt it when trying to transition to active. We can stop the services beforehand, because there is no point in having these threads running in standby. Thoughts?
We can do something like below:
{code}
// Exception was thrown in the call to refreshAll().
if (rmContext.getHAServiceState() == HAServiceProtocol.HAServiceState.ACTIVE) {
  ((RMContextImpl) rmContext).setHAServiceState(HAServiceProtocol.HAServiceState.STANDBY);
  try {
    rm.stopActiveServices();
    // Set a flag in the RM (maybe in the RM context) indicating that reinit of
    // services is required when next trying to transition to active, despite
    // the state being standby.
  } catch (Exception ex) {
    // Exception handling omitted in this sketch.
  }
}
{code}
[jira] [Commented] (YARN-3893) Both RM in active state when Admin#transitionToActive failure from refeshAll()
[ https://issues.apache.org/jira/browse/YARN-3893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14626909#comment-14626909 ]

Varun Saxena commented on YARN-3893:

[~sunilg], I was also suggesting earlier in my comment that we reinit only while transitioning to active. But then I thought that if we reinit on standby and there is a problem in initialization, the failure can prompt the admin to correct the config, since an audit log entry for the failure will be printed. If we do not reinit, a success audit log would be printed on transition to standby, which may suggest to the admin that there is no problem in the config. Thoughts?
We can defer reiniting until the transition to active as well, but IMHO it is better to indicate a failure even on standby. I do not see any harm in it.
I am fine either way, because reiniting really matters only when transitioning to active.
[jira] [Commented] (YARN-3893) Both RM in active state when Admin#transitionToActive failure from refeshAll()
[ https://issues.apache.org/jira/browse/YARN-3893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14626911#comment-14626911 ]

Varun Saxena commented on YARN-3893:

I am fine either way though because, as you said, reiniting really matters when transitioning to active.
[jira] [Commented] (YARN-3893) Both RM in active state when Admin#transitionToActive failure from refeshAll()
[ https://issues.apache.org/jira/browse/YARN-3893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14627550#comment-14627550 ]

Hadoop QA commented on YARN-3893:

\\
\\
| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | pre-patch | 16m 8s | Pre-patch trunk compilation is healthy. |
| {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. |
| {color:red}-1{color} | tests included | 0m 0s | The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. |
| {color:green}+1{color} | javac | 7m 42s | There were no new javac warning messages. |
| {color:green}+1{color} | javadoc | 9m 39s | There were no new javadoc warning messages. |
| {color:green}+1{color} | release audit | 0m 23s | The applied patch does not increase the total number of release audit warnings. |
| {color:green}+1{color} | checkstyle | 0m 46s | There were no new checkstyle issues. |
| {color:green}+1{color} | whitespace | 0m 0s | The patch has no lines that end in whitespace. |
| {color:green}+1{color} | install | 1m 20s | mvn install still works. |
| {color:green}+1{color} | eclipse:eclipse | 0m 33s | The patch built with eclipse:eclipse. |
| {color:green}+1{color} | findbugs | 1m 25s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. |
| {color:red}-1{color} | yarn tests | 51m 13s | Tests failed in hadoop-yarn-server-resourcemanager. |
| | | 89m 13s | |
\\
\\
|| Reason || Tests ||
| Failed unit tests | hadoop.yarn.server.resourcemanager.TestRMHA |
\\
\\
|| Subsystem || Report/Notes ||
| Patch URL | http://issues.apache.org/jira/secure/attachment/12745382/0002-YARN-3893.patch |
| Optional Tests | javadoc javac unit findbugs checkstyle |
| git revision | trunk / 0a16ee6 |
| hadoop-yarn-server-resourcemanager test log | https://builds.apache.org/job/PreCommit-YARN-Build/8539/artifact/patchprocess/testrun_hadoop-yarn-server-resourcemanager.txt |
| Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/8539/testReport/ |
| Java | 1.7.0_55 |
| uname | Linux asf905.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux |
| Console output | https://builds.apache.org/job/PreCommit-YARN-Build/8539/console |

This message was automatically generated.

Both RM in active state when Admin#transitionToActive failure from refeshAll()
--
Key: YARN-3893
URL: https://issues.apache.org/jira/browse/YARN-3893
Project: Hadoop YARN
Issue Type: Bug
Components: resourcemanager
Reporter: Bibin A Chundatt
Assignee: Bibin A Chundatt
Priority: Critical
Attachments: 0001-YARN-3893.patch, 0002-YARN-3893.patch, yarn-site.xml

Cases that can cause this.
# Capacity scheduler xml is wrongly configured during switch
# Refresh ACL failure due to configuration
# Refresh User group failure due to configuration
Continuously both RM will try to be active
{code}
dsperf@host-10-128:/opt/bibin/dsperf/OPENSOURCE_3_0/install/hadoop/resourcemanager/bin ./yarn rmadmin -getServiceState rm1
15/07/07 19:08:10 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
active
dsperf@host-128:/opt/bibin/dsperf/OPENSOURCE_3_0/install/hadoop/resourcemanager/bin ./yarn rmadmin -getServiceState rm2
15/07/07 19:08:12 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
active
{code}
# Both Web UI active
# Status shown as active for both RM

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Commented] (YARN-3893) Both RM in active state when Admin#transitionToActive failure from refeshAll()
[ https://issues.apache.org/jira/browse/YARN-3893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14626833#comment-14626833 ]

Varun Saxena commented on YARN-3893:

OK, let's add a flag. In my opinion, we need to check this flag and do the reinit on transitionToStandby as well, even if the state is already standby.
[jira] [Commented] (YARN-3893) Both RM in active state when Admin#transitionToActive failure from refeshAll()
[ https://issues.apache.org/jira/browse/YARN-3893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14626826#comment-14626826 ]

Xuan Gong commented on YARN-3893:

Thanks, [~varun_saxena] and [~sunilg].
I am fine with adding a new internal state, although it might be too complex. If we can handle it correctly, I am fine with it.
For this specific issue, I think there are at least two things we should do here:
1) stop all active services
2) transition to standby (basically, set the RM state in RMContext to standby)
But we also need to reinitiate all the active services to prepare for the transitionToActive call. At least, we should do:
{code}
rm.transitionToStandby(false);
reinitiateActiveService();
{code}
Here reinitiateActiveService() can throw the same exception, and I can see why this does not solve the whole problem.
How about we introduce a new AtomicBoolean flag to track whether we need to reinitiate the active services? We could add the following to the transitionToActive logic
{code}
if (reinitiateRequired) {
  reinitiateActiveService();
}
{code}
before we start all the active services.
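The flag idea in the comment above could look roughly like the following. This is only a sketch under stated assumptions: {{ReinitFlagSketch}}, {{onRefreshFailure}}, {{configValid}} and the counters are hypothetical names used for illustration, and the real active-service handling is replaced by a simulated reinit that fails while the configuration is invalid.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch; names are illustrative, not the ResourceManager API.
class ReinitFlagSketch {
    private final AtomicBoolean reinitRequired = new AtomicBoolean(false);
    private boolean configValid;
    private boolean active = false;
    private int reinitCount = 0;

    ReinitFlagSketch(boolean configValid) {
        this.configValid = configValid;
    }

    void setConfigValid(boolean configValid) {
        this.configValid = configValid;
    }

    boolean isActive() {
        return active;
    }

    boolean isReinitRequired() {
        return reinitRequired.get();
    }

    int getReinitCount() {
        return reinitCount;
    }

    // Simulated reinit of the active services; fails while the config is invalid.
    private void reinitActiveServices() {
        if (!configValid) {
            throw new IllegalStateException("simulated configuration error");
        }
        reinitCount++;
    }

    // Called when the refresh step fails: fall back to standby and remember
    // that the active services must be reinitialized before the next start.
    void onRefreshFailure() {
        active = false;
        reinitRequired.set(true);
    }

    void transitionToActive() {
        if (reinitRequired.get()) {
            reinitActiveServices();    // may throw; the flag stays set so a later attempt retries
            reinitRequired.set(false); // cleared only after a successful reinit
        }
        active = true;
    }
}

class ReinitFlagDemo {
    public static void main(String[] args) {
        ReinitFlagSketch rm = new ReinitFlagSketch(false);
        rm.onRefreshFailure();
        try {
            rm.transitionToActive();
        } catch (IllegalStateException e) {
            System.out.println("still standby, reinit pending: " + rm.isReinitRequired());
        }
        rm.setConfigValid(true);
        rm.transitionToActive();
        System.out.println("active: " + rm.isActive() + ", reinits: " + rm.getReinitCount());
    }
}
```

Clearing the flag only after a successful reinit is a design choice of this sketch: a failed attempt leaves the flag set, so the next transition attempt retries the reinit instead of starting stopped services.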
[jira] [Commented] (YARN-3893) Both RM in active state when Admin#transitionToActive failure from refeshAll()
[ https://issues.apache.org/jira/browse/YARN-3893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14626873#comment-14626873 ]

Sunil G commented on YARN-3893:

+1 for using an AtomicBoolean flag. Do we really need to call {{reinitiateActiveService}} from {{transitionToStandby}}? I think it can be done when we invoke {{transitionToActive}}, which is when it matters.
[jira] [Commented] (YARN-3893) Both RM in active state when Admin#transitionToActive failure from refeshAll()
[ https://issues.apache.org/jira/browse/YARN-3893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14626716#comment-14626716 ]

Sunil G commented on YARN-3893:

I remember an earlier suggestion of a new HAServiceState. Introducing a state such as WAITING_FOR_ACTIVE may help us do the reinit and any other initialization when we try to move to ACTIVE. Also, as mentioned earlier, this can be a hidden internal state. It may look cleaner than a flag. So, along with the above solution, could we also add this new state?
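One way to picture a hidden internal state is the sketch below. Only the name WAITING_FOR_ACTIVE comes from the comment above; the enum {{InternalHaState}}, its methods, and the choice to report the hidden state externally as "standby" are assumptions made for illustration and are not part of HAServiceProtocol.

```java
// Hypothetical sketch: only the name WAITING_FOR_ACTIVE comes from the discussion;
// the enum, its methods and the external mapping are assumptions.
enum InternalHaState {
    ACTIVE("active", false),
    STANDBY("standby", false),
    // Hidden state: reported as standby, but marks that a reinit is pending.
    WAITING_FOR_ACTIVE("standby", true);

    private final String reported;
    private final boolean reinitNeeded;

    InternalHaState(String reported, boolean reinitNeeded) {
        this.reported = reported;
        this.reinitNeeded = reinitNeeded;
    }

    // What a state query would show to the admin in this sketch.
    String reportedState() {
        return reported;
    }

    // Whether the active services must be reinitialized before going active.
    boolean needsReinitBeforeActive() {
        return reinitNeeded;
    }
}

class InternalHaStateDemo {
    public static void main(String[] args) {
        for (InternalHaState s : InternalHaState.values()) {
            System.out.println(s + " -> reported as " + s.reportedState()
                + ", reinit needed: " + s.needsReinitBeforeActive());
        }
    }
}
```

Compared with a separate boolean flag, this keeps the "reinit pending" information inside the state itself, at the cost of touching a sensitive state machine.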
[jira] [Commented] (YARN-3893) Both RM in active state when Admin#transitionToActive failure from refeshAll()
[ https://issues.apache.org/jira/browse/YARN-3893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14626725#comment-14626725 ]

Varun Saxena commented on YARN-3893:

[~xgong], the issue with reinitialization is that if an exception is thrown during initialization, all the active services will be stopped. Then, on transitionToActive, we will directly attempt to start the active services, which will fail because the services are in the STOPPED state.
I think we can forcibly set the state to standby and set a flag in RMContext indicating that reinit is required whenever a transition to standby or active is attempted. This way we let leader election handle the exception.
[jira] [Commented] (YARN-3893) Both RM in active state when Admin#transitionToActive failure from refeshAll()
[ https://issues.apache.org/jira/browse/YARN-3893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14626728#comment-14626728 ]

Varun Saxena commented on YARN-3893:

Yeah [~sunilg], a new state can also be introduced, as I suggested. It would act similarly to a flag.
[jira] [Commented] (YARN-3893) Both RM in active state when Admin#transitionToActive failure from refeshAll()
[ https://issues.apache.org/jira/browse/YARN-3893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14626741#comment-14626741 ]

Varun Saxena commented on YARN-3893:

[~xgong], any thoughts about introducing a new internal state? We have to be very careful when we do this, though, as this is a very sensitive piece of code.
[jira] [Commented] (YARN-3893) Both RM in active state when Admin#transitionToActive failure from refeshAll()
[ https://issues.apache.org/jira/browse/YARN-3893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14625866#comment-14625866 ]

Varun Saxena commented on YARN-3893:

*Reinitialization of Active Services is required*. When you call stop active services, the service state of all the services changes to STOPPED. If this RM were to become active again, we would try to start all the active services, and services can't transition to the START state from the STOPPED state. They can only do so from the INIT state.
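The lifecycle constraint described in the comment above can be sketched as a small state machine. This is a simplified model with hypothetical names ({{LifecycleState}}, {{SketchService}}), not Hadoop's service classes; it only illustrates why a stopped service cannot be started again and why a fresh, initialized instance is needed.

```java
import java.util.EnumMap;
import java.util.EnumSet;
import java.util.Map;

// Hypothetical, simplified lifecycle model; not Hadoop's actual service classes.
enum LifecycleState { NOTINITED, INITED, STARTED, STOPPED }

class SketchService {
    // Allowed transitions in this simplified model.
    private static final Map<LifecycleState, EnumSet<LifecycleState>> ALLOWED =
        new EnumMap<>(LifecycleState.class);
    static {
        ALLOWED.put(LifecycleState.NOTINITED,
            EnumSet.of(LifecycleState.INITED, LifecycleState.STOPPED));
        ALLOWED.put(LifecycleState.INITED,
            EnumSet.of(LifecycleState.STARTED, LifecycleState.STOPPED));
        ALLOWED.put(LifecycleState.STARTED, EnumSet.of(LifecycleState.STOPPED));
        ALLOWED.put(LifecycleState.STOPPED, EnumSet.noneOf(LifecycleState.class));
    }

    private LifecycleState state = LifecycleState.NOTINITED;

    LifecycleState getState() {
        return state;
    }

    private void moveTo(LifecycleState next) {
        if (!ALLOWED.get(state).contains(next)) {
            throw new IllegalStateException("cannot go from " + state + " to " + next);
        }
        state = next;
    }

    void init() { moveTo(LifecycleState.INITED); }
    void start() { moveTo(LifecycleState.STARTED); }
    void stop() { moveTo(LifecycleState.STOPPED); }
}

class LifecycleDemo {
    public static void main(String[] args) {
        SketchService service = new SketchService();
        service.init();
        service.start();
        service.stop();
        try {
            service.start(); // a stopped service cannot be restarted
        } catch (IllegalStateException e) {
            System.out.println("restart rejected: " + e.getMessage());
        }
        // "Reinitialization" here means creating and initializing a fresh instance.
        SketchService fresh = new SketchService();
        fresh.init();
        fresh.start();
        System.out.println("fresh instance state: " + fresh.getState());
    }
}
```

In this model the only way back to a startable service is to build a new one and initialize it, which is the reinitialization step the thread is debating where to place.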
[jira] [Commented] (YARN-3893) Both RM in active state when Admin#transitionToActive failure from refeshAll()
[ https://issues.apache.org/jira/browse/YARN-3893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14625304#comment-14625304 ]

Bibin A Chundatt commented on YARN-3893:

[~varun_saxena] and [~sunilg], we only need to call {{rm.transitionToStandby(false)}} on exception, since it handles the transition to standby in the RM context and stops the active services without reinitializing the queues.
[jira] [Commented] (YARN-3893) Both RM in active state when Admin#transitionToActive failure from refeshAll()
[ https://issues.apache.org/jira/browse/YARN-3893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14624246#comment-14624246 ]

Bibin A Chundatt commented on YARN-3893:

Thanks [~varun_saxena] and [~sunilg]. The first option looks good and is easier to implement. Both RMs could end up in standby state, but it still looks like the best option.
[jira] [Commented] (YARN-3893) Both RM in active state when Admin#transitionToActive failure from refeshAll()
[ https://issues.apache.org/jira/browse/YARN-3893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14624156#comment-14624156 ]

Sunil G commented on YARN-3893:

Thanks [~varun_saxena] for sharing the detailed analysis. In fact, we must change the state in the context. IMO we can stop active services and move the RM state to standby. With this, the RM becomes another candidate for election. If the same RM is later selected as active and the config is good by then, the existing call flow will invoke *startActiveServices*, so it should be fine in that case. In the UI, both RMs will be shown as standby too.
[jira] [Commented] (YARN-3893) Both RM in active state when Admin#transitionToActive failure from refeshAll()
[ https://issues.apache.org/jira/browse/YARN-3893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14624233#comment-14624233 ]

Varun Saxena commented on YARN-3893:

Yeah, let's go with the first option I suggested then, i.e. set the RM context to standby and stop active services, followed by initialization. That will be easier to implement and will resolve the issue.
[jira] [Commented] (YARN-3893) Both RM in active state when Admin#transitionToActive failure from refeshAll()
[ https://issues.apache.org/jira/browse/YARN-3893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14623502#comment-14623502 ]

Sunil G commented on YARN-3893:

refreshAll() performs a whole set of refresh operations, and the exception may come at any stage. It is better to close those gracefully. So setting the state directly won't help much; we may need to go through part of transitionToStandby.
[jira] [Commented] (YARN-3893) Both RM in active state when Admin#transitionToActive failure from refeshAll()
[ https://issues.apache.org/jira/browse/YARN-3893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14623567#comment-14623567 ]

Varun Saxena commented on YARN-3893:

[~sunilg] We can do the cleanup (i.e. stop active services) when we switch to standby. We do this already. Cleanup is also done when we stop the RM, so this shouldn't be an issue.

What is happening is as follows. Let us assume there are RM1 and RM2. When the exception occurs, RM1 waits for RM2 to become active and joins the leader election again. As both RMs have the wrong configuration, RM1 will try to become active again (and not switch to standby) after RM2 has tried the same. Because the problem is in the call to {{refreshAll}}, both RMs are marked ACTIVE in their respective RM contexts: we set the state to ACTIVE before calling refreshAll.

*The problem reported here is that an RM is shown as active when it is not actually active, i.e. the UI is accessible and getServiceState returns active for both RMs. When we access the UI or get the service state, we check the state in the RM context, and that is ACTIVE.*

So for anyone accessing the RM from the command line or via the UI, the RM is active (*because the RM context says so*) when it is not really active. Both RMs are just trying incessantly to become active and failing.

That is why I suggested that we update the RM context. In fact, changing the RM context is necessary. We can decide when to stop active services, if at all. So there are 2 options:
# Set the RM context to standby when the exception occurs and stop active services. If we do this, we will have to redo the work of starting active services if this RM were to become ACTIVE.
# Introduce a new state (say WAITING_FOR_ACTIVE), set it when the exception is thrown, and check it to stop active services when switching to standby, without starting the services again when switching to ACTIVE.

Thoughts, [~sunilg], [~xgong]?
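The root cause described above, and the first option, can be sketched in a few lines. This is a minimal illustration under stated assumptions: {{HaTransitionSketch}}, {{Context}} and {{Refresher}} are hypothetical stand-ins, not the real RM classes, and the sketch leaves out leader election entirely.

```java
/** Hypothetical stand-ins for the RM context and admin service; not Hadoop code. */
class HaTransitionSketch {
  enum HaState { STANDBY, ACTIVE }

  /** What the web UI and getServiceState would consult in this sketch. */
  static final class Context {
    HaState haState = HaState.STANDBY;
    boolean activeServicesRunning;
  }

  /** Stand-in for refreshAll(), which fails when the configuration is wrong. */
  interface Refresher {
    void refreshAll() throws Exception;
  }

  /** Behaviour reported in the issue: the context stays ACTIVE after a failed refresh. */
  static void transitionToActiveReported(Context ctx, Refresher refresher) throws Exception {
    ctx.activeServicesRunning = true;
    ctx.haState = HaState.ACTIVE;   // state is set before the refresh
    refresher.refreshAll();         // throws on bad config; nothing resets the state
  }

  /** Option 1: on failure, stop active services and put the context back to STANDBY. */
  static void transitionToActiveOption1(Context ctx, Refresher refresher) throws Exception {
    ctx.activeServicesRunning = true;
    ctx.haState = HaState.ACTIVE;
    try {
      refresher.refreshAll();
    } catch (Exception e) {
      ctx.activeServicesRunning = false;
      ctx.haState = HaState.STANDBY;
      throw e;                      // the caller still sees the failure and can rejoin election
    }
  }
}
```

With a refresher that always throws, the first method leaves the context reporting ACTIVE, which matches the symptom in the issue, while the second leaves it reporting STANDBY at the cost of having to start the active services again later.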
[jira] [Commented] (YARN-3893) Both RM in active state when Admin#transitionToActive failure from refeshAll()
[ https://issues.apache.org/jira/browse/YARN-3893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14623676#comment-14623676 ]

Varun Saxena commented on YARN-3893:

For the 2nd option, we will have to return STANDBY to the client if the state is WAITING_FOR_ACTIVE, so it can primarily be an RM-internal state.
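The mapping suggested for the second option could look roughly like the sketch below. WAITING_FOR_ACTIVE is the name proposed in the thread; the two enums and the {{reportedState}} method are hypothetical and only illustrate hiding an internal state from clients.

```java
/** Hypothetical sketch of an RM-internal state that clients never see. */
class InternalStateSketch {
  /** States tracked inside the RM in this sketch. */
  enum InternalState { STANDBY, WAITING_FOR_ACTIVE, ACTIVE }

  /** States a client (UI or getServiceState) can observe. */
  enum ReportedState { STANDBY, ACTIVE }

  /** Only a fully active RM is reported as ACTIVE; everything else reads as STANDBY. */
  static ReportedState reportedState(InternalState internal) {
    return internal == InternalState.ACTIVE ? ReportedState.ACTIVE : ReportedState.STANDBY;
  }
}
```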
[jira] [Commented] (YARN-3893) Both RM in active state when Admin#transitionToActive failure from refeshAll()
[ https://issues.apache.org/jira/browse/YARN-3893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14621739#comment-14621739 ]

Varun Saxena commented on YARN-3893:

Maybe set the HA service state in the RM context to STANDBY upon throwing the exception, or do not set it to ACTIVE until all active services are actually started. We primarily check the RM context to decide whether the RM is in standby or active state.
[jira] [Commented] (YARN-3893) Both RM in active state when Admin#transitionToActive failure from refeshAll()
[ https://issues.apache.org/jira/browse/YARN-3893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14621703#comment-14621703 ]

Xuan Gong commented on YARN-3893:

How about adding rm.transitionToStandby(true) before we throw the ServiceFailedException in the catch block?
{code}
try {
  rm.transitionToActive();
  // call all refresh*s for active RM to get the updated configurations.
  refreshAll();
  RMAuditLogger.logSuccess(user.getShortUserName(),
      "transitionToActive", "RMHAProtocolService");
} catch (Exception e) {
  RMAuditLogger.logFailure(user.getShortUserName(), "transitionToActive",
      "", "RMHAProtocolService",
      "Exception transitioning to active");
  throw new ServiceFailedException(
      "Error when transitioning to Active mode", e);
}
{code}
In that case, we would transition the RM to standby, and since we throw the ServiceFailedException, this RM will rejoin the leader election process.
[jira] [Commented] (YARN-3893) Both RM in active state when Admin#transitionToActive failure from refeshAll()
[ https://issues.apache.org/jira/browse/YARN-3893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14621720#comment-14621720 ]

Sunil G commented on YARN-3893:

Hi [~xgong], thank you for the update. I have a doubt here. If we call rm.transitionToStandby(true), it will result in a call to ResourceManager#createAndInitActiveServices(). So is it possible that we get the same exception we got from the refreshAll call earlier, specifically from queue reinitialization? Currently CS#serviceInit calls parseQueues. As mentioned here, [~bibinchundatt] used a wrong CS xml file.
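The doubt raised above can be illustrated with a small sketch. It assumes, as the comment describes, that passing true re-creates the active services and therefore re-parses the queue configuration; {{StandbyFlagSketch}}, {{Rm}} and {{QueueConfig}} are hypothetical names, and the ordering inside the method is an assumption rather than the real ResourceManager logic.

```java
/** Hypothetical stand-in for the RM; the names mirror the thread but this is not Hadoop code. */
class StandbyFlagSketch {
  /** Stand-in for the queue parsing done when active services are re-created. */
  interface QueueConfig {
    void parseQueues();
  }

  static final class Rm {
    boolean active = true;
    boolean activeServicesRunning = true;
    private final QueueConfig config;

    Rm(QueueConfig config) {
      this.config = config;
    }

    void transitionToStandby(boolean initialize) {
      active = false;                 // assumed order: flip the state first
      activeServicesRunning = false;  // then stop active services
      if (initialize) {
        config.parseQueues();         // re-creating services re-reads the config and may throw again
      }
    }
  }
}
```

With a misconfigured queue file, the call with true fails in the same place refreshAll() did, while the call with false reaches standby without touching the configuration.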
[jira] [Commented] (YARN-3893) Both RM in active state when Admin#transitionToActive failure from refeshAll()
[ https://issues.apache.org/jira/browse/YARN-3893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14618522#comment-14618522 ]

Bibin A Chundatt commented on YARN-3893:

Sorry typo. Both RM active can happen in many cases.

> Both RM in active state when Admin#transitionToActive failure from refeshAll()
>
> Key: YARN-3893
> URL: https://issues.apache.org/jira/browse/YARN-3893
> Project: Hadoop YARN
> Issue Type: Bug
> Components: resourcemanager
> Reporter: Bibin A Chundatt
> Assignee: Bibin A Chundatt
> Priority: Critical
> Attachments: yarn-site.xml
>
> Cases that can cause failure
> # Capacity scheduler xml is wrongly configured in switch
> # Refresh ACL failure due to configuration
> # Refresh User group failure due to configuration
> Capacity failure condition have given logs below
> Continuously both RM will try to be active
> {code}
> 2015-07-07 19:18:25,655 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler: Initialized queue: default: capacity=0.5, absoluteCapacity=0.5, usedResources=memory:0, vCores:0, usedCapacity=0.0, absoluteUsedCapacity=0.0, numApps=0, numContainers=0
> 2015-07-07 19:18:25,656 WARN org.apache.hadoop.yarn.server.resourcemanager.AdminService: Exception refresh queues.
> java.io.IOException: Failed to re-init queues
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.reinitialize(CapacityScheduler.java:383)
>     at org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshQueues(AdminService.java:376)
>     at org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshAll(AdminService.java:605)
>     at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:314)
>     at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:126)
>     at org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:824)
>     at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:420)
>     at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:599)
>     at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
> Caused by: java.lang.IllegalArgumentException: Illegal capacity of 0.5 for children of queue root for label=node2
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.setChildQueues(ParentQueue.java:159)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.parseQueue(CapacityScheduler.java:639)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.reinitializeQueues(CapacityScheduler.java:503)
>     at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.reinitialize(CapacityScheduler.java:379)
>     ... 8 more
> 2015-07-07 19:18:25,656 WARN org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=dsperf OPERATION=refreshQueues TARGET=AdminService RESULT=FAILURE DESCRIPTION=Exception refresh queues. PERMISSIONS=
> 2015-07-07 19:18:25,656 WARN org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=dsperf OPERATION=transitionToActive TARGET=RMHAProtocolService RESULT=FAILURE DESCRIPTION=Exception transitioning to active PERMISSIONS=
> 2015-07-07 19:18:25,656 WARN org.apache.hadoop.ha.ActiveStandbyElector: Exception handling the winning of election
> org.apache.hadoop.ha.ServiceFailedException: RM could not transition to Active
>     at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:128)
>     at org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:824)
>     at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:420)
>     at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:599)
>     at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
> Caused by: org.apache.hadoop.ha.ServiceFailedException: Error when transitioning to Active mode
>     at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToActive(AdminService.java:321)
>     at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeActive(EmbeddedElectorService.java:126)
>     ... 4 more
> Caused by: org.apache.hadoop.ha.ServiceFailedException: java.io.IOException: Failed to re-init queues
>     at org.apache.hadoop.yarn.server.resourcemanager.AdminService.refreshAll(AdminService.java:617)
>     at
[jira] [Commented] (YARN-3893) Both RM in active state when Admin#transitionToActive failure from refeshAll()
[ https://issues.apache.org/jira/browse/YARN-3893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14618605#comment-14618605 ]

Bibin A Chundatt commented on YARN-3893:

Thanks [~sunilg] for checking the issue. In this JIRA we should decide how to handle a refreshAll() failure during the transition to active. Configuration mistakes in capacity-scheduler.xml, the ACLs, or the user group mapping can leave both RMs active during a switch, probably triggered by a ZK connection error. I am not sure we will be able to recover at runtime once this happens. The capacity scheduler misconfiguration is just one of the causes. YARN-3894 contains the CS xml.
[jira] [Commented] (YARN-3893) Both RM in active state when Admin#transitionToActive failure from refeshAll()
[ https://issues.apache.org/jira/browse/YARN-3893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14618590#comment-14618590 ]

Sunil G commented on YARN-3893:

Thank you [~bibinchundatt]. Could you please attach the CS xml too?