[jira] [Updated] (YARN-1932) Javascript injection on the job status page
[ https://issues.apache.org/jira/browse/YARN-1932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arun C Murthy updated YARN-1932: Priority: Blocker (was: Critical) Javascript injection on the job status page --- Key: YARN-1932 URL: https://issues.apache.org/jira/browse/YARN-1932 Project: Hadoop YARN Issue Type: Bug Affects Versions: 3.0.0, 0.23.9, 2.5.0 Reporter: Mit Desai Assignee: Mit Desai Priority: Blocker Attachments: YARN-1932.patch Scripts can be injected into the job status page because the diagnostics field is not sanitized. Whatever string you set there will show up on the job status page as-is, i.e., if you put in any script commands, they will be executed in the browser of the user who opens the page. We need to escape the diagnostics string so that the scripts are not executed. -- This message was sent by Atlassian JIRA (v6.2#6252)
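The standard fix for this class of bug is to HTML-escape the value before rendering it. A minimal sketch of the idea, assuming a hypothetical helper built on Commons Lang rather than the actual contents of YARN-1932.patch:

{code}
import org.apache.commons.lang.StringEscapeUtils;

// Hypothetical helper illustrating the fix: escape the diagnostics string
// before it is written into the page, so <, >, &, and quotes render as
// literal text instead of being interpreted as markup or script.
public final class DiagnosticsEscaper {
  private DiagnosticsEscaper() {}

  public static String escapeDiagnostics(String diagnostics) {
    return diagnostics == null ? "" : StringEscapeUtils.escapeHtml(diagnostics);
  }
}
{code}

Rendered this way, an injected script tag in the diagnostics shows up on the page as text rather than executing in the viewer's browser.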
[jira] [Assigned] (YARN-1935) Security for timeline service
[ https://issues.apache.org/jira/browse/YARN-1935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen reassigned YARN-1935: - Assignee: Zhijie Shen Security for timeline service - Key: YARN-1935 URL: https://issues.apache.org/jira/browse/YARN-1935 Project: Hadoop YARN Issue Type: Sub-task Reporter: Arun C Murthy Assignee: Zhijie Shen Jira to track work to secure the ATS -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1935) Security for timeline service
[ https://issues.apache.org/jira/browse/YARN-1935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13968121#comment-13968121 ] Zhijie Shen commented on YARN-1935: --- I'm going to take care of the security issues of the timeline server. Security for timeline service - Key: YARN-1935 URL: https://issues.apache.org/jira/browse/YARN-1935 Project: Hadoop YARN Issue Type: Sub-task Reporter: Arun C Murthy Assignee: Zhijie Shen Jira to track work to secure the ATS -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1935) Security for timeline server
[ https://issues.apache.org/jira/browse/YARN-1935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-1935: -- Summary: Security for timeline server (was: Security for timeline service) Security for timeline server Key: YARN-1935 URL: https://issues.apache.org/jira/browse/YARN-1935 Project: Hadoop YARN Issue Type: New Feature Reporter: Arun C Murthy Assignee: Zhijie Shen Jira to track work to secure the ATS -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1935) Security for timeline service
[ https://issues.apache.org/jira/browse/YARN-1935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-1935: -- Issue Type: New Feature (was: Sub-task) Parent: (was: YARN-1530) Security for timeline service - Key: YARN-1935 URL: https://issues.apache.org/jira/browse/YARN-1935 Project: Hadoop YARN Issue Type: New Feature Reporter: Arun C Murthy Assignee: Zhijie Shen Jira to track work to secure the ATS -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-996) REST API support for node resource configuration
[ https://issues.apache.org/jira/browse/YARN-996?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13968123#comment-13968123 ] Kenji Kikushima commented on YARN-996: -- [~tgraves], thanks for your comment. Certainly, we should add a test for admin ACLs, but updateNodeResource isn't protected; I think we should check ACLs in updateNodeResource. I searched for a related JIRA but couldn't find one. Can I create a JIRA for this issue? [~djp], please let us know what you think. REST API support for node resource configuration Key: YARN-996 URL: https://issues.apache.org/jira/browse/YARN-996 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager, scheduler Reporter: Junping Du Assignee: Kenji Kikushima Attachments: YARN-996-sample.patch Besides admin protocol and CLI, REST API should also be supported for node resource configuration -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (YARN-1936) Secured timeline client
Zhijie Shen created YARN-1936: - Summary: Secured timeline client Key: YARN-1936 URL: https://issues.apache.org/jira/browse/YARN-1936 Project: Hadoop YARN Issue Type: Sub-task Reporter: Zhijie Shen Assignee: Zhijie Shen -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (YARN-1937) Access control of per-framework data
Zhijie Shen created YARN-1937: - Summary: Access control of per-framework data Key: YARN-1937 URL: https://issues.apache.org/jira/browse/YARN-1937 Project: Hadoop YARN Issue Type: Sub-task Reporter: Zhijie Shen Assignee: Zhijie Shen -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (YARN-1938) Kerberos authentication for the timeline server
Zhijie Shen created YARN-1938: - Summary: Kerberos authentication for the timeline server Key: YARN-1938 URL: https://issues.apache.org/jira/browse/YARN-1938 Project: Hadoop YARN Issue Type: Sub-task Reporter: Zhijie Shen Assignee: Zhijie Shen -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-996) REST API support for node resource configuration
[ https://issues.apache.org/jira/browse/YARN-996?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13968165#comment-13968165 ] Junping Du commented on YARN-996: - Thanks to you both for the good points on adding an ACL check for this API, which we missed before. [~kj-ki], feel free to file a separate JIRA for it under YARN-291 and link it with YARN-312. Thx! REST API support for node resource configuration Key: YARN-996 URL: https://issues.apache.org/jira/browse/YARN-996 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager, scheduler Reporter: Junping Du Assignee: Kenji Kikushima Attachments: YARN-996-sample.patch Besides admin protocol and CLI, REST API should also be supported for node resource configuration -- This message was sent by Atlassian JIRA (v6.2#6252)
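For context, the missing guard being discussed would look roughly like the ACL checks other admin operations already perform. A hedged sketch; the method body and field names here are illustrative, not the actual AdminService code:

{code}
// Illustrative only: gate updateNodeResource on the YARN admin ACL the same
// way other ResourceManagerAdministrationProtocol methods are protected.
public UpdateNodeResourceResponse updateNodeResource(
    UpdateNodeResourceRequest request) throws YarnException, IOException {
  UserGroupInformation user = UserGroupInformation.getCurrentUser();
  if (!adminAcl.isUserAllowed(user)) {   // assumed AccessControlList field
    throw RPCUtil.getRemoteException(new AccessControlException(
        "User " + user.getShortUserName()
            + " is not authorized to call updateNodeResource"));
  }
  // ... apply the requested resource capability to the node ...
  return recordFactory.newRecordInstance(UpdateNodeResourceResponse.class);
}
{code}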
[jira] [Commented] (YARN-1861) Both RM stuck in standby mode when automatic failover is enabled
[ https://issues.apache.org/jira/browse/YARN-1861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13968211#comment-13968211 ] Rohith commented on YARN-1861: -- Oops, I too encountered both RMs stuck in standby state forever :-( The trace is the same as the one Arpit Gupta gave in his comment, and another observation matches Vinod's. After deleting the lock, leader election started and the same RM became active. Both RM stuck in standby mode when automatic failover is enabled Key: YARN-1861 URL: https://issues.apache.org/jira/browse/YARN-1861 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.4.0 Reporter: Arpit Gupta Assignee: Vinod Kumar Vavilapalli Priority: Critical In our HA tests we noticed that the tests got stuck because both RMs got into standby state and neither became active. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1933) TestAMRestart and TestNodeHealthService failing sometimes on Windows
[ https://issues.apache.org/jira/browse/YARN-1933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13968243#comment-13968243 ] Hudson commented on YARN-1933: -- FAILURE: Integrated in Hadoop-Yarn-trunk #540 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/540/]) YARN-1933. Fixed test issues with TestAMRestart and TestNodeHealthService. Contributed by Jian He. (vinodkv: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1587104) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestNodeHealthService.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/recovery/ZKRMStateStore.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/applicationsmanager/TestAMRestart.java TestAMRestart and TestNodeHealthService failing sometimes on Windows Key: YARN-1933 URL: https://issues.apache.org/jira/browse/YARN-1933 Project: Hadoop YARN Issue Type: Bug Reporter: Jian He Assignee: Jian He Fix For: 2.4.1 Attachments: YARN-1933.1.patch, YARN-1933.2.patch TestNodeHealthService failures: testNodeHealthScript(org.apache.hadoop.yarn.server.nodemanager.TestNodeHealthService) Time elapsed: 1.405 sec ERROR! java.io.FileNotFoundException: C:\Users\Administrator\Documents\hadoop-common\hadoop-yarn-project\hadoop-yarn\hadoop-yarn-server\hadoop-yarn-server-nodemanager\target\org.apache.hadoop.yarn.server.nodemanager.TestNodeHealthService-localDir\failingscript.cmd (The process cannot access the file because it is being used by another process) at java.io.FileOutputStream.open(Native Method) at java.io.FileOutputStream.<init>(FileOutputStream.java:221) at java.io.FileOutputStream.<init>(FileOutputStream.java:171) at org.apache.hadoop.yarn.server.nodemanager.TestNodeHealthService.writeNodeHealthScriptFile(TestNodeHealthService.java:82) at org.apache.hadoop.yarn.server.nodemanager.TestNodeHealthService.testNodeHealthScript(TestNodeHealthService.java:154) testNodeHealthScriptShouldRun(org.apache.hadoop.yarn.server.nodemanager.TestNodeHealthService) Time elapsed: 0 sec ERROR! java.io.FileNotFoundException: C:\Users\Administrator\Documents\hadoop-common\hadoop-yarn-project\hadoop-yarn\hadoop-yarn-server\hadoop-yarn-server-nodemanager\target\org.apache.hadoop.yarn.server.nodemanager.TestNodeHealthService-localDir\failingscript.cmd (Access is denied) at java.io.FileOutputStream.open(Native Method) at java.io.FileOutputStream.<init>(FileOutputStream.java:221) at java.io.FileOutputStream.<init>(FileOutputStream.java:171) at org.apache.hadoop.yarn.server.nodemanager.TestNodeHealthService.writeNodeHealthScriptFile(TestNodeHealthService.java:82) at org.apache.hadoop.yarn.server.nodemanager.TestNodeHealthService.testNodeHealthScriptShouldRun(TestNodeHealthService.java:103) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1928) TestAMRMRPCNodeUpdates fails occasionally
[ https://issues.apache.org/jira/browse/YARN-1928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13968245#comment-13968245 ] Hudson commented on YARN-1928: -- FAILURE: Integrated in Hadoop-Yarn-trunk #540 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/540/]) YARN-1928. Fixed a race condition in TestAMRMRPCNodeUpdates which caused it to fail occasionally. Contributed by Zhijie Shen. (vinodkv: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1587114) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/applicationsmanager/TestAMRMRPCNodeUpdates.java TestAMRMRPCNodeUpdates fails occasionally - Key: YARN-1928 URL: https://issues.apache.org/jira/browse/YARN-1928 Project: Hadoop YARN Issue Type: Bug Reporter: Zhijie Shen Assignee: Zhijie Shen Fix For: 2.4.1 Attachments: YARN-1928.1.patch {code} junit.framework.AssertionFailedError: expected:<0> but was:<4> at junit.framework.Assert.fail(Assert.java:50) at junit.framework.Assert.failNotEquals(Assert.java:287) at junit.framework.Assert.assertEquals(Assert.java:67) at junit.framework.Assert.assertEquals(Assert.java:199) at junit.framework.Assert.assertEquals(Assert.java:205) at org.apache.hadoop.yarn.server.resourcemanager.applicationsmanager.TestAMRMRPCNodeUpdates.testAMRMUnusableNodes(TestAMRMRPCNodeUpdates.java:136) {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1928) TestAMRMRPCNodeUpdates fails occasionally
[ https://issues.apache.org/jira/browse/YARN-1928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13968361#comment-13968361 ] Hudson commented on YARN-1928: -- FAILURE: Integrated in Hadoop-Hdfs-trunk #1732 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1732/]) YARN-1928. Fixed a race condition in TestAMRMRPCNodeUpdates which caused it to fail occasionally. Contributed by Zhijie Shen. (vinodkv: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1587114) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/applicationsmanager/TestAMRMRPCNodeUpdates.java TestAMRMRPCNodeUpdates fails occasionally - Key: YARN-1928 URL: https://issues.apache.org/jira/browse/YARN-1928 Project: Hadoop YARN Issue Type: Bug Reporter: Zhijie Shen Assignee: Zhijie Shen Fix For: 2.4.1 Attachments: YARN-1928.1.patch {code} junit.framework.AssertionFailedError: expected:<0> but was:<4> at junit.framework.Assert.fail(Assert.java:50) at junit.framework.Assert.failNotEquals(Assert.java:287) at junit.framework.Assert.assertEquals(Assert.java:67) at junit.framework.Assert.assertEquals(Assert.java:199) at junit.framework.Assert.assertEquals(Assert.java:205) at org.apache.hadoop.yarn.server.resourcemanager.applicationsmanager.TestAMRMRPCNodeUpdates.testAMRMUnusableNodes(TestAMRMRPCNodeUpdates.java:136) {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1933) TestAMRestart and TestNodeHealthService failing sometimes on Windows
[ https://issues.apache.org/jira/browse/YARN-1933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13968359#comment-13968359 ] Hudson commented on YARN-1933: -- FAILURE: Integrated in Hadoop-Hdfs-trunk #1732 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1732/]) YARN-1933. Fixed test issues with TestAMRestart and TestNodeHealthService. Contributed by Jian He. (vinodkv: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1587104) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestNodeHealthService.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/recovery/ZKRMStateStore.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/applicationsmanager/TestAMRestart.java TestAMRestart and TestNodeHealthService failing sometimes on Windows Key: YARN-1933 URL: https://issues.apache.org/jira/browse/YARN-1933 Project: Hadoop YARN Issue Type: Bug Reporter: Jian He Assignee: Jian He Fix For: 2.4.1 Attachments: YARN-1933.1.patch, YARN-1933.2.patch TestNodeHealthService failures: testNodeHealthScript(org.apache.hadoop.yarn.server.nodemanager.TestNodeHealthService) Time elapsed: 1.405 sec ERROR! java.io.FileNotFoundException: C:\Users\Administrator\Documents\hadoop-common\hadoop-yarn-project\hadoop-yarn\hadoop-yarn-server\hadoop-yarn-server-nodemanager\target\org.apache.hadoop.yarn.server.nodemanager.TestNodeHealthService-localDir\failingscript.cmd (The process cannot access the file because it is being used by another process) at java.io.FileOutputStream.open(Native Method) at java.io.FileOutputStream.<init>(FileOutputStream.java:221) at java.io.FileOutputStream.<init>(FileOutputStream.java:171) at org.apache.hadoop.yarn.server.nodemanager.TestNodeHealthService.writeNodeHealthScriptFile(TestNodeHealthService.java:82) at org.apache.hadoop.yarn.server.nodemanager.TestNodeHealthService.testNodeHealthScript(TestNodeHealthService.java:154) testNodeHealthScriptShouldRun(org.apache.hadoop.yarn.server.nodemanager.TestNodeHealthService) Time elapsed: 0 sec ERROR! java.io.FileNotFoundException: C:\Users\Administrator\Documents\hadoop-common\hadoop-yarn-project\hadoop-yarn\hadoop-yarn-server\hadoop-yarn-server-nodemanager\target\org.apache.hadoop.yarn.server.nodemanager.TestNodeHealthService-localDir\failingscript.cmd (Access is denied) at java.io.FileOutputStream.open(Native Method) at java.io.FileOutputStream.<init>(FileOutputStream.java:221) at java.io.FileOutputStream.<init>(FileOutputStream.java:171) at org.apache.hadoop.yarn.server.nodemanager.TestNodeHealthService.writeNodeHealthScriptFile(TestNodeHealthService.java:82) at org.apache.hadoop.yarn.server.nodemanager.TestNodeHealthService.testNodeHealthScriptShouldRun(TestNodeHealthService.java:103) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1929) DeadLock in RM when automatic failover is enabled.
[ https://issues.apache.org/jira/browse/YARN-1929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla updated YARN-1929: --- Attachment: yarn-1929-1.patch Here is a first-cut patch that removes unnecessary synchronization from EmbeddedElectorService, AdminService, and CompositeService. I'm thinking about the best way to write a unit test for this to avoid regressions in the future. We could maybe override becomeActive to sleep for some time and then try to shut the RM down; if it doesn't shut down within a particular amount of time, fail the test? Any other ideas? DeadLock in RM when automatic failover is enabled. -- Key: YARN-1929 URL: https://issues.apache.org/jira/browse/YARN-1929 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Environment: Yarn HA cluster Reporter: Rohith Assignee: Karthik Kambatla Priority: Blocker Attachments: yarn-1929-1.patch Dead lock detected in RM when automatic failover is enabled. {noformat} Found one Java-level deadlock: = Thread-2: waiting to lock monitor 0x7fb514303cf0 (object 0xef153fd0, a org.apache.hadoop.ha.ActiveStandbyElector), which is held by main-EventThread main-EventThread: waiting to lock monitor 0x7fb514750a48 (object 0xef154020, a org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService), which is held by Thread-2 {noformat} -- This message was sent by Atlassian JIRA (v6.2#6252)
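The timeout-based idea sketched above might look like the following JUnit fragment (the test name and the helper are hypothetical; the eventual patch may structure this differently):

{code}
// Hypothetical regression test: if stop() ever blocks behind a slow
// becomeActive() again, the stopper thread stays alive and the test fails.
@Test(timeout = 60000)
public void testRMStopDoesNotDeadlockWithElector() throws Exception {
  final ResourceManager rm = startRMWithSlowBecomeActive(); // assumed helper
  Thread stopper = new Thread(new Runnable() {
    @Override
    public void run() {
      rm.stop();
    }
  });
  stopper.start();
  stopper.join(10000); // give stop() ten seconds to finish
  Assert.assertFalse("RM stop() appears to be deadlocked", stopper.isAlive());
}
{code}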
[jira] [Commented] (YARN-1928) TestAMRMRPCNodeUpdates fails occasionally
[ https://issues.apache.org/jira/browse/YARN-1928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13968413#comment-13968413 ] Hudson commented on YARN-1928: -- SUCCESS: Integrated in Hadoop-Mapreduce-trunk #1757 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1757/]) YARN-1928. Fixed a race condition in TestAMRMRPCNodeUpdates which caused it to fail occasionally. Contributed by Zhijie Shen. (vinodkv: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1587114) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/applicationsmanager/TestAMRMRPCNodeUpdates.java TestAMRMRPCNodeUpdates fails occasionally - Key: YARN-1928 URL: https://issues.apache.org/jira/browse/YARN-1928 Project: Hadoop YARN Issue Type: Bug Reporter: Zhijie Shen Assignee: Zhijie Shen Fix For: 2.4.1 Attachments: YARN-1928.1.patch {code} junit.framework.AssertionFailedError: expected:<0> but was:<4> at junit.framework.Assert.fail(Assert.java:50) at junit.framework.Assert.failNotEquals(Assert.java:287) at junit.framework.Assert.assertEquals(Assert.java:67) at junit.framework.Assert.assertEquals(Assert.java:199) at junit.framework.Assert.assertEquals(Assert.java:205) at org.apache.hadoop.yarn.server.resourcemanager.applicationsmanager.TestAMRMRPCNodeUpdates.testAMRMUnusableNodes(TestAMRMRPCNodeUpdates.java:136) {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1933) TestAMRestart and TestNodeHealthService failing sometimes on Windows
[ https://issues.apache.org/jira/browse/YARN-1933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13968411#comment-13968411 ] Hudson commented on YARN-1933: -- SUCCESS: Integrated in Hadoop-Mapreduce-trunk #1757 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1757/]) YARN-1933. Fixed test issues with TestAMRestart and TestNodeHealthService. Contributed by Jian He. (vinodkv: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1587104) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestNodeHealthService.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/recovery/ZKRMStateStore.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/applicationsmanager/TestAMRestart.java TestAMRestart and TestNodeHealthService failing sometimes on Windows Key: YARN-1933 URL: https://issues.apache.org/jira/browse/YARN-1933 Project: Hadoop YARN Issue Type: Bug Reporter: Jian He Assignee: Jian He Fix For: 2.4.1 Attachments: YARN-1933.1.patch, YARN-1933.2.patch TestNodeHealthService failures: testNodeHealthScript(org.apache.hadoop.yarn.server.nodemanager.TestNodeHealthService) Time elapsed: 1.405 sec ERROR! java.io.FileNotFoundException: C:\Users\Administrator\Documents\hadoop-common\hadoop-yarn-project\hadoop-yarn\hadoop-yarn-server\hadoop-yarn-server-nodemanager\target\org.apache.hadoop.yarn.server.nodemanager.TestNodeHealthService-localDir\failingscript.cmd (The process cannot access the file because it is being used by another process) at java.io.FileOutputStream.open(Native Method) at java.io.FileOutputStream.<init>(FileOutputStream.java:221) at java.io.FileOutputStream.<init>(FileOutputStream.java:171) at org.apache.hadoop.yarn.server.nodemanager.TestNodeHealthService.writeNodeHealthScriptFile(TestNodeHealthService.java:82) at org.apache.hadoop.yarn.server.nodemanager.TestNodeHealthService.testNodeHealthScript(TestNodeHealthService.java:154) testNodeHealthScriptShouldRun(org.apache.hadoop.yarn.server.nodemanager.TestNodeHealthService) Time elapsed: 0 sec ERROR! java.io.FileNotFoundException: C:\Users\Administrator\Documents\hadoop-common\hadoop-yarn-project\hadoop-yarn\hadoop-yarn-server\hadoop-yarn-server-nodemanager\target\org.apache.hadoop.yarn.server.nodemanager.TestNodeHealthService-localDir\failingscript.cmd (Access is denied) at java.io.FileOutputStream.open(Native Method) at java.io.FileOutputStream.<init>(FileOutputStream.java:221) at java.io.FileOutputStream.<init>(FileOutputStream.java:171) at org.apache.hadoop.yarn.server.nodemanager.TestNodeHealthService.writeNodeHealthScriptFile(TestNodeHealthService.java:82) at org.apache.hadoop.yarn.server.nodemanager.TestNodeHealthService.testNodeHealthScriptShouldRun(TestNodeHealthService.java:103) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1929) DeadLock in RM when automatic failover is enabled.
[ https://issues.apache.org/jira/browse/YARN-1929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13968448#comment-13968448 ] Tsuyoshi OZAWA commented on YARN-1929: -- bq. I'm thinking about the best way to write a unit test for this to avoid regressions in the future. Your approach looks reasonable to me. In addition to overriding EES#becomeActive, we can override synchronized methods or change their behaviour (CompositeService#stop, AS#transitionToActive, RM#transitionToActive) to sleep while holding the lock (a bit different, but like TestRetryCacheWithHA#DummyRetryInvocationHandler). Then we can reproduce the deadlock situation step by step in test cases. IMHO, we shouldn't touch ASE, because it's also used in NameNode HA. DeadLock in RM when automatic failover is enabled. -- Key: YARN-1929 URL: https://issues.apache.org/jira/browse/YARN-1929 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Environment: Yarn HA cluster Reporter: Rohith Assignee: Karthik Kambatla Priority: Blocker Attachments: yarn-1929-1.patch Dead lock detected in RM when automatic failover is enabled. {noformat} Found one Java-level deadlock: = Thread-2: waiting to lock monitor 0x7fb514303cf0 (object 0xef153fd0, a org.apache.hadoop.ha.ActiveStandbyElector), which is held by main-EventThread main-EventThread: waiting to lock monitor 0x7fb514750a48 (object 0xef154020, a org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService), which is held by Thread-2 {noformat} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1929) DeadLock in RM when automatic failover is enabled.
[ https://issues.apache.org/jira/browse/YARN-1929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13968458#comment-13968458 ] Hadoop QA commented on YARN-1929: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12640071/yarn-1929-1.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:red}-1 findbugs{color}. The patch appears to introduce 1 new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-common-project/hadoop-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/3566//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/3566//artifact/trunk/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-resourcemanager.html Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3566//console This message is automatically generated. DeadLock in RM when automatic failover is enabled. -- Key: YARN-1929 URL: https://issues.apache.org/jira/browse/YARN-1929 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Environment: Yarn HA cluster Reporter: Rohith Assignee: Karthik Kambatla Priority: Blocker Attachments: yarn-1929-1.patch Dead lock detected in RM when automatic failover is enabled. {noformat} Found one Java-level deadlock: = Thread-2: waiting to lock monitor 0x7fb514303cf0 (object 0xef153fd0, a org.apache.hadoop.ha.ActiveStandbyElector), which is held by main-EventThread main-EventThread: waiting to lock monitor 0x7fb514750a48 (object 0xef154020, a org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService), which is held by Thread-2 {noformat} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1337) Recover active container state upon nodemanager restart
[ https://issues.apache.org/jira/browse/YARN-1337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ravi Prakash updated YARN-1337: --- Assignee: (was: Ravi Prakash) Recover active container state upon nodemanager restart --- Key: YARN-1337 URL: https://issues.apache.org/jira/browse/YARN-1337 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Affects Versions: 2.3.0 Reporter: Jason Lowe To support work-preserving NM restart we need to recover the state of the containers that were active when the nodemanager went down. This includes informing the RM of containers that have exited in the interim and a strategy for dealing with the exit codes from those containers along with how to reacquire the active containers and determine their exit codes when they terminate. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1861) Both RM stuck in standby mode when automatic failover is enabled
[ https://issues.apache.org/jira/browse/YARN-1861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13968487#comment-13968487 ] Tsuyoshi OZAWA commented on YARN-1861: -- [~kasha], I think this problem looks very similar to YARN-1929 - a deadlock after losing the ZK session. (*ASE#processResult* -> *EES#becomeStandby* -> *AS#transitionToStandby* -> *RM#transitionToStandby*) and (RM#serviceStop -> RM.super#serviceStop -> *RM.super#stop* -> AS#stop -> *AS#serviceStop* -> *EES#serviceStop* -> *ASE#quitElection*) IIUC, Karthik's patch on YARN-1929 partially solves this problem, but not completely. Please correct me if I'm wrong. Thanks. Both RM stuck in standby mode when automatic failover is enabled Key: YARN-1861 URL: https://issues.apache.org/jira/browse/YARN-1861 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.4.0 Reporter: Arpit Gupta Assignee: Vinod Kumar Vavilapalli Priority: Critical In our HA tests we noticed that the tests got stuck because both RMs got into standby state and neither became active. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1281) TestZKRMStateStoreZKClientConnections fails intermittently
[ https://issues.apache.org/jira/browse/YARN-1281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13968480#comment-13968480 ] Mit Desai commented on YARN-1281: - [~kasha], are you still seeing this test failing? TestZKRMStateStoreZKClientConnections fails intermittently -- Key: YARN-1281 URL: https://issues.apache.org/jira/browse/YARN-1281 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Reporter: Karthik Kambatla Assignee: Karthik Kambatla The test fails intermittently - haven't been able to reproduce the failure deterministically. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Assigned] (YARN-1355) Recover application ACLs upon nodemanager restart
[ https://issues.apache.org/jira/browse/YARN-1355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jason Lowe reassigned YARN-1355: Assignee: Jason Lowe Recover application ACLs upon nodemanager restart - Key: YARN-1355 URL: https://issues.apache.org/jira/browse/YARN-1355 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Affects Versions: 2.3.0 Reporter: Jason Lowe Assignee: Jason Lowe The ACLs for applications need to be recovered for work-preserving nodemanager restart. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Assigned] (YARN-1354) Recover applications upon nodemanager restart
[ https://issues.apache.org/jira/browse/YARN-1354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jason Lowe reassigned YARN-1354: Assignee: Jason Lowe Recover applications upon nodemanager restart - Key: YARN-1354 URL: https://issues.apache.org/jira/browse/YARN-1354 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Affects Versions: 2.3.0 Reporter: Jason Lowe Assignee: Jason Lowe The set of active applications in the nodemanager context needs to be recovered for work-preserving nodemanager restart. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Assigned] (YARN-1352) Recover LogAggregationService upon nodemanager restart
[ https://issues.apache.org/jira/browse/YARN-1352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jason Lowe reassigned YARN-1352: Assignee: Jason Lowe Recover LogAggregationService upon nodemanager restart -- Key: YARN-1352 URL: https://issues.apache.org/jira/browse/YARN-1352 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Affects Versions: 2.3.0 Reporter: Jason Lowe Assignee: Jason Lowe LogAggregationService state needs to be recovered as part of the work-preserving nodemanager restart feature. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Assigned] (YARN-1337) Recover active container state upon nodemanager restart
[ https://issues.apache.org/jira/browse/YARN-1337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jason Lowe reassigned YARN-1337: Assignee: Jason Lowe Recover active container state upon nodemanager restart --- Key: YARN-1337 URL: https://issues.apache.org/jira/browse/YARN-1337 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Affects Versions: 2.3.0 Reporter: Jason Lowe Assignee: Jason Lowe To support work-preserving NM restart we need to recover the state of the containers that were active when the nodemanager went down. This includes informing the RM of containers that have exited in the interim and a strategy for dealing with the exit codes from those containers along with how to reacquire the active containers and determine their exit codes when they terminate. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1281) TestZKRMStateStoreZKClientConnections fails intermittently
[ https://issues.apache.org/jira/browse/YARN-1281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13968501#comment-13968501 ] Karthik Kambatla commented on YARN-1281: Yes, almost in every nightly run. I have been caught up with other things and haven't been able to look into this. I temporarily marked it Unassigned so someone else can pick it up; I will take it back when I get a chance to fix it. TestZKRMStateStoreZKClientConnections fails intermittently -- Key: YARN-1281 URL: https://issues.apache.org/jira/browse/YARN-1281 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Reporter: Karthik Kambatla The test fails intermittently - haven't been able to reproduce the failure deterministically. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1281) TestZKRMStateStoreZKClientConnections fails intermittently
[ https://issues.apache.org/jira/browse/YARN-1281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla updated YARN-1281: --- Assignee: (was: Karthik Kambatla) TestZKRMStateStoreZKClientConnections fails intermittently -- Key: YARN-1281 URL: https://issues.apache.org/jira/browse/YARN-1281 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Reporter: Karthik Kambatla The test fails intermittently - haven't been able to reproduce the failure deterministically. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (YARN-1939) Improve the packaging of AmIpFilter
Thomas Graves created YARN-1939: --- Summary: Improve the packaging of AmIpFilter Key: YARN-1939 URL: https://issues.apache.org/jira/browse/YARN-1939 Project: Hadoop YARN Issue Type: Improvement Components: api, webapp Affects Versions: 2.4.0 Reporter: Thomas Graves It is recommended for applications to use the AmIpFilter to properly secure any WebUI that is specific to that application. The AmIpFilter is packaged in org.apache.hadoop.yarn.server.webproxy.amfilter, which requires an application to pull in yarn-server as a dependency; that isn't very user-friendly for applications wanting to pick up the bare minimum. We should improve the packaging so it can be pulled in independently. We do need to be careful to keep it backwards compatible, at least in the 2.x release line. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1861) Both RM stuck in standby mode when automatic failover is enabled
[ https://issues.apache.org/jira/browse/YARN-1861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13968520#comment-13968520 ] Tsuyoshi OZAWA commented on YARN-1861: -- Thank you for pointing that out, Karthik. I'll continue to check the code. Both RM stuck in standby mode when automatic failover is enabled Key: YARN-1861 URL: https://issues.apache.org/jira/browse/YARN-1861 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.4.0 Reporter: Arpit Gupta Assignee: Vinod Kumar Vavilapalli Priority: Critical In our HA tests we noticed that the tests got stuck because both RMs got into standby state and neither became active. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1897) Define SignalContainerRequest and SignalContainerResponse
[ https://issues.apache.org/jira/browse/YARN-1897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13968521#comment-13968521 ] Xuan Gong commented on YARN-1897: - [~mingma] bq. For SignalContainerResponse, what is the semantics of isCMDCompleted? If we want to support a synchronous signal container call and this flag indicates whether the ContainerExecutor has signaled the container successfully, that will require the RM to wait for the response from the NM after the NM finishes the work; it implies ApplicationClientProtocol's signalContainer method will hold up an RPC handler for some period of time; we can have a timeout or rate limiting on the signalContainer call to make sure applications won't be able to consume all of the RM's RPC handlers. If isCMDCompleted means that the command has been submitted to the RM successfully, then it is ok; or we can use an exception to indicate failure of the request. OK. We should try our best to do it asynchronously. We will rely on the node heartbeat to send the container command to the related NM. After the NM executes the command, it can send the response (whether the command finished successfully) back to the RM with the node heartbeat, too. But this brings us another question: because we cannot control how long the NM needs to execute the commands and send the result back to the RM, we cannot give a precise time for how long the client should wait for the response. Also, we need to consider RM restart, RM failover, etc. To make progress, I think that right now, checking whether the command has been submitted to the RM successfully (whether the container exists, whether it has already been killed, etc.) might be fine. So, keep isCMDCompleted in SignalContainerResponse? What do you think? Define SignalContainerRequest and SignalContainerResponse - Key: YARN-1897 URL: https://issues.apache.org/jira/browse/YARN-1897 Project: Hadoop YARN Issue Type: Sub-task Components: api Reporter: Ming Ma We need to define SignalContainerRequest and SignalContainerResponse first as they are needed by other sub tasks. SignalContainerRequest should use OS-independent commands and provide a way for the application to specify a reason for diagnosis. SignalContainerResponse might be empty. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1939) Improve the packaging of AmIpFilter
[ https://issues.apache.org/jira/browse/YARN-1939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13968531#comment-13968531 ] Thomas Graves commented on YARN-1939: - Looks like I was mistaken; the web proxy is actually in its own jar. Closing this. Improve the packaging of AmIpFilter --- Key: YARN-1939 URL: https://issues.apache.org/jira/browse/YARN-1939 Project: Hadoop YARN Issue Type: Improvement Components: api, webapp Affects Versions: 2.4.0 Reporter: Thomas Graves It is recommended for applications to use the AmIpFilter to properly secure any WebUI that is specific to that application. The AmIpFilter is packaged in org.apache.hadoop.yarn.server.webproxy.amfilter, which requires an application to pull in yarn-server as a dependency; that isn't very user-friendly for applications wanting to pick up the bare minimum. We should improve the packaging so it can be pulled in independently. We do need to be careful to keep it backwards compatible, at least in the 2.x release line. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (YARN-1939) Improve the packaging of AmIpFilter
[ https://issues.apache.org/jira/browse/YARN-1939?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves resolved YARN-1939. - Resolution: Invalid Improve the packaging of AmIpFilter --- Key: YARN-1939 URL: https://issues.apache.org/jira/browse/YARN-1939 Project: Hadoop YARN Issue Type: Improvement Components: api, webapp Affects Versions: 2.4.0 Reporter: Thomas Graves It is recommended for applications to use the AmIpFilter to properly secure any WebUI that is specific to that application. The AmIpFilter is packaged in org.apache.hadoop.yarn.server.webproxy.amfilter, which requires an application to pull in yarn-server as a dependency; that isn't very user-friendly for applications wanting to pick up the bare minimum. We should improve the packaging so it can be pulled in independently. We do need to be careful to keep it backwards compatible, at least in the 2.x release line. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1897) Define SignalContainerRequest and SignalContainerResponse
[ https://issues.apache.org/jira/browse/YARN-1897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13968544#comment-13968544 ] Ming Ma commented on YARN-1897: --- Sounds good. How about IsCMDSubmissionCompleted? Define SignalContainerRequest and SignalContainerResponse - Key: YARN-1897 URL: https://issues.apache.org/jira/browse/YARN-1897 Project: Hadoop YARN Issue Type: Sub-task Components: api Reporter: Ming Ma We need to define SignalContainerRequest and SignalContainerResponse first as they are needed by other sub tasks. SignalContainerRequest should use OS-independent commands and provide a way for the application to specify a reason for diagnosis. SignalContainerResponse might be empty. -- This message was sent by Atlassian JIRA (v6.2#6252)
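To make the shapes under discussion concrete, the records might end up looking something like this purely illustrative sketch (YARN-1897 is still defining them, so nothing here is final):

{code}
// Illustrative shapes only - not the final YARN-1897 API.
public abstract class SignalContainerRequest {
  public abstract ContainerId getContainerId();
  public abstract SignalContainerCommand getCommand(); // OS-independent command
  public abstract String getDiagnostics();             // reason, for diagnosis
}

public abstract class SignalContainerResponse {
  // Per the thread: true once the RM has accepted the request (the container
  // exists and has not already been killed); delivery to the NM then happens
  // asynchronously on the node heartbeat.
  public abstract boolean getIsCMDSubmissionCompleted();
}
{code}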
[jira] [Commented] (YARN-435) Make it easier to access cluster topology information in an AM
[ https://issues.apache.org/jira/browse/YARN-435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13968558#comment-13968558 ] Bikas Saha commented on YARN-435: - Pasting the description from YARN-1722 that was closed as a dup of this. {code}There is no way for an AM to find out the names of all the nodes in the cluster via the AMRMProtocol. An AM can only at best ask for containers at * location. The only way to get that information is via the ClientRMProtocol but that is secured by Kerberos or RMDelegationToken while the AM has an AMRMToken. This is a pretty important piece of missing functionality. There are other jiras opened about getting cluster topology etc. but they haven't been addressed, perhaps due to the lack of a clear definition of cluster topology. Adding a means to at least get the node information would be a good first step.{code} This jira may have stalled in trying to figure out how to lay out the topology. YARN-1722 simply asks for the list of nodes in the cluster. While defining a way to generically describe topology may be tricky, all such methods must list all the nodes in the cluster, so YARN-1722 is a much simpler problem. Whatever object we define for topology can start with a simple list of nodes and then use the integer id of the nodes (for compaction) in the list to reference them in the other objects describing the hierarchy. Make it easier to access cluster topology information in an AM -- Key: YARN-435 URL: https://issues.apache.org/jira/browse/YARN-435 Project: Hadoop YARN Issue Type: Sub-task Reporter: Hitesh Shah Assignee: Omkar Vinit Joshi ClientRMProtocol exposes a getClusterNodes api that provides a report on all nodes in the cluster including their rack information. However, this requires the AM to open and establish a separate connection to the RM in addition to one for the AMRMProtocol. -- This message was sent by Atlassian JIRA (v6.2#6252)
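The workaround the description refers to (a second, separately authenticated connection) looks like this with the client-side API, assuming the caller holds Kerberos credentials or an RM delegation token, which is exactly what an AM holding only an AMRMToken lacks:

{code}
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.NodeReport;
import org.apache.hadoop.yarn.api.records.NodeState;
import org.apache.hadoop.yarn.client.api.YarnClient;

// List the cluster's nodes (host and rack) over the client protocol.
public static void printClusterNodes() throws Exception {
  YarnClient yarnClient = YarnClient.createYarnClient();
  yarnClient.init(new Configuration());
  yarnClient.start();
  try {
    List<NodeReport> nodes = yarnClient.getNodeReports(NodeState.RUNNING);
    for (NodeReport node : nodes) {
      System.out.println(node.getNodeId().getHost() + " on rack "
          + node.getRackName());
    }
  } finally {
    yarnClient.stop();
  }
}
{code}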
[jira] [Commented] (YARN-1929) DeadLock in RM when automatic failover is enabled.
[ https://issues.apache.org/jira/browse/YARN-1929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13968653#comment-13968653 ] Steve Loughran commented on YARN-1929: -- I'm +1 on the change to composite service, as well as making the serviceXYZ operations desynchronized (the state entry point in the public method is synchronized to prevent re-entrancy). I'll leave it to others to look at the remaining code and comment. Now, there is one little quirk by desynchronizing the serviceStart() and serviceStop() methods. Although it is still impossible to have more than one thread successfully entering either method, there is the sequence {code} Thread 1 : service.start() Thread 1: service.serviceStart() begins Thread 2 : service.stop() Thread 2: service.serviceStop() begins Thread 2: service.serviceStop() completes Thread 1: service start completes {code} That's because we're not making any attempt to include transitive states; it generally makes things too complex - and that includes handling the problem of what the policy is if I try to call stop midway through starting. DeadLock in RM when automatic failover is enabled. -- Key: YARN-1929 URL: https://issues.apache.org/jira/browse/YARN-1929 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Environment: Yarn HA cluster Reporter: Rohith Assignee: Karthik Kambatla Priority: Blocker Attachments: yarn-1929-1.patch Dead lock detected in RM when automatic failover is enabled. {noformat} Found one Java-level deadlock: = Thread-2: waiting to lock monitor 0x7fb514303cf0 (object 0xef153fd0, a org.apache.hadoop.ha.ActiveStandbyElector), which is held by main-EventThread main-EventThread: waiting to lock monitor 0x7fb514750a48 (object 0xef154020, a org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService), which is held by Thread-2 {noformat} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1879) Mark Idempotent/AtMostOnce annotations to ApplicationMasterProtocol
[ https://issues.apache.org/jira/browse/YARN-1879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13968659#comment-13968659 ] Jian He commented on YARN-1879: --- RetryCache is used to handle retries at the *RPC level* to return the previous response for duplicate non-idempotent requests. The Allocate() call already has a similar retry cache mechanism by checking the request Id of the request, but that also serves the *App level* retry. [~vinodkv], is it fine to keep both? Some comments on the patch: testAPIsWithRetryCache: - Assert exception type inside catch block. {code} } catch (InvalidApplicationMasterRequestException e) { } } catch (InvalidApplicationMasterRequestException e) { // InvalidApplicationMasterRequestException is thrown // after expiring RetryCache } {code} - TestNamenodeRetryCache class comment is useful for TestApplicationMasterServiceRetryCache too, we can copy that over. - org.junit.Assert.assertEquals -> assertEquals Some suggestions on the configs to conform with existing YarnConfigs: - RM_APPMASTER_ENABLE_RETRY_CACHE_KEY -> RM_RETRY_CACHE_ENABLED - Remove the APPMASTER in the config name so that the same config can be used for other services later on? - Use RM_PREFIX instead of YARN_PREFIX; - RM_APPMASTER_ENABLE_RETRY_CACHE_DEFAULT -> DEFAULT_RM_RETRY_CACHE_ENABLED to conform with the naming of other default configs. - RM_APPMASTER_RETRY_CACHE_HEAP_PERCENT_KEY -> RM_RETRY_CACHE_HEAP_PERCENT - RM_APPMASTER_RETRY_CACHE_EXPIRYTIME_MILLIS_KEY -> RM_RETRY_CACHE_EXPIRY_MS Mark Idempotent/AtMostOnce annotations to ApplicationMasterProtocol --- Key: YARN-1879 URL: https://issues.apache.org/jira/browse/YARN-1879 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Jian He Assignee: Tsuyoshi OZAWA Priority: Critical Attachments: YARN-1879.1.patch, YARN-1879.1.patch, YARN-1879.2-wip.patch, YARN-1879.2.patch, YARN-1879.3.patch, YARN-1879.4.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1929) DeadLock in RM when automatic failover is enabled.
[ https://issues.apache.org/jira/browse/YARN-1929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla updated YARN-1929: --- Attachment: yarn-1929-2.patch Here is a new patch that adds a test. Thanks Steve for taking a look at the patch and for the offline input on removing the synchronization from CompositeService#stop. [~jianhe], [~xgong] - will either of you be able to take a look at the patch? DeadLock in RM when automatic failover is enabled. -- Key: YARN-1929 URL: https://issues.apache.org/jira/browse/YARN-1929 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Environment: Yarn HA cluster Reporter: Rohith Assignee: Karthik Kambatla Priority: Blocker Attachments: yarn-1929-1.patch, yarn-1929-2.patch Dead lock detected in RM when automatic failover is enabled. {noformat} Found one Java-level deadlock: = Thread-2: waiting to lock monitor 0x7fb514303cf0 (object 0xef153fd0, a org.apache.hadoop.ha.ActiveStandbyElector), which is held by main-EventThread main-EventThread: waiting to lock monitor 0x7fb514750a48 (object 0xef154020, a org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService), which is held by Thread-2 {noformat} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-558) Add ability to completely remove nodemanager from resourcemanager.
[ https://issues.apache.org/jira/browse/YARN-558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13968800#comment-13968800 ] Christian Smith commented on YARN-558: -- +1 for this as well. This is essential for any public (or private) cloud scenario where one wants elastic clusters. Add ability to completely remove nodemanager from resourcemanager. -- Key: YARN-558 URL: https://issues.apache.org/jira/browse/YARN-558 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Reporter: Garth Goodson Priority: Minor Labels: feature I would like to add the ability to completely remove a nodemanager from the resourcemanager's state. I run a cloud service where I want to dynamically bring up nodes to act as nodemanagers and then bring them down again when not needed. These nodes have dynamically assigned IPs, thus the alternative of decommissioning them via an excludes file leads to a large (unbounded) list of decommissioned nodes that may never be commissioned again. I would like the ability to move a node from a decommissioned state to completely removing it from the resource manager. I have thought of two ways of implementing this. 1) Add an optional timeout between the decommission state -> being removed from the nodemanager. 2) Add an explicit RPC to remove a node that is decommissioned. Any additional thoughts/discussion are welcome. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1929) DeadLock in RM when automatic failover is enabled.
[ https://issues.apache.org/jira/browse/YARN-1929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13968808#comment-13968808 ] Karthik Kambatla commented on YARN-1929: bq. Now, there is one little quirk by desynchronizing the serviceStart() and serviceStop() methods. Although it is still impossible to have more than one thread successfully entering either method, there is the sequence AbstractService.stateChangeLock appears to allow a single thread to make any state change at a given point in time. In the example, Thread 2's stop would wait for Thread 1's start() to complete. No? DeadLock in RM when automatic failover is enabled. -- Key: YARN-1929 URL: https://issues.apache.org/jira/browse/YARN-1929 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Environment: Yarn HA cluster Reporter: Rohith Assignee: Karthik Kambatla Priority: Blocker Attachments: yarn-1929-1.patch, yarn-1929-2.patch Dead lock detected in RM when automatic failover is enabled. {noformat} Found one Java-level deadlock: = Thread-2: waiting to lock monitor 0x7fb514303cf0 (object 0xef153fd0, a org.apache.hadoop.ha.ActiveStandbyElector), which is held by main-EventThread main-EventThread: waiting to lock monitor 0x7fb514750a48 (object 0xef154020, a org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService), which is held by Thread-2 {noformat} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1927) Preemption message shouldn’t be created multiple times for same container-id in ProportionalCapacityPreemptionPolicy
[ https://issues.apache.org/jira/browse/YARN-1927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13968830#comment-13968830 ] Carlo Curino commented on YARN-1927: I agree with Chris's explanation; we are sustaining an ask, sort of continuously publishing our latest belief about preemption (e.g., a massive job could release lots of resources and relieve the pressure that caused us to ask for containers back)... The choice of how to propagate this information to the AM is somewhat irrelevant. The current solution tries to be more or less aligned with the general style of the protocols, and it has a couple of nice properties: 1) losing a message is not a problem, and 2) it is simple for the AM to reconstruct the latest RM opinion about resource preemption (simply re-stated in each message). Repeating the ask confirms to the AM that the need for preemption is still there... this does convey some useful information to the AM. Given the scale/frequency of operations, this doesn't seem to be a perf concern either. It is a matter of taste whether one prefers sustaining an ask or start-stopping it. Since this is a public protocol, I would suggest we consider changing it only if there is a substantial gain in expressivity or performance... I don't see one at the moment (but I might be missing your point). What you propose seems a plausible alternative, but not substantially better than what's already there, so I would lean towards leaving it be. Preemption message shouldn’t be created multiple times for same container-id in ProportionalCapacityPreemptionPolicy Key: YARN-1927 URL: https://issues.apache.org/jira/browse/YARN-1927 Project: Hadoop YARN Issue Type: Bug Components: capacityscheduler Affects Versions: 2.4.0 Reporter: Wangda Tan Assignee: Wangda Tan Priority: Minor Attachments: YARN-1927.patch Currently, after each editSchedule() call, a preemption message will be created and sent to the scheduler. ProportionalCapacityPreemptionPolicy should only send a preemption message once for each container. -- This message was sent by Atlassian JIRA (v6.2#6252)
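Because the RM re-states its full preemption view in each message, AM-side handling can simply overwrite its stored state on every heartbeat. A minimal sketch using the public records (the per-application reaction logic is, of course, application-specific):

{code}
import org.apache.hadoop.yarn.api.protocolrecords.AllocateResponse;
import org.apache.hadoop.yarn.api.records.PreemptionMessage;

// Sketch of AM-side handling: on every heartbeat, replace the stored view
// with whatever the RM sent, since each message re-states the full ask.
private PreemptionMessage latestPreemptionView;

private void onHeartbeat(AllocateResponse response) {
  PreemptionMessage msg = response.getPreemptionMessage();
  if (msg == null) {
    return;                        // no outstanding preemption ask
  }
  latestPreemptionView = msg;      // overwrite, don't accumulate
  if (msg.getStrictContract() != null) {
    // containers the RM will reclaim regardless of what we do
  }
  if (msg.getContract() != null) {
    // negotiable ask: free this much capacity, our choice of containers
  }
}
{code}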
[jira] [Created] (YARN-1940) deleteAsUser() terminates early without deleting more files on error
Kihwal Lee created YARN-1940: Summary: deleteAsUser() terminates early without deleting more files on error Key: YARN-1940 URL: https://issues.apache.org/jira/browse/YARN-1940 Project: Hadoop YARN Issue Type: Bug Reporter: Kihwal Lee In container-executor.c, delete_path() returns early when unlink() against a file or a symlink fails. We have seen many cases of the error being ENOENT, which can safely be ignored during delete. This is what we saw recently: an app mistakenly created a large number of files in the local directory, and the deletion service failed to delete a significant portion of them due to this bug. Repeatedly hitting this on the same node led to exhaustion of inodes in one of the partitions. Besides ignoring ENOENT, delete_path() could simply skip the failed entry and continue in some cases, rather than aborting and leaving files behind. -- This message was sent by Atlassian JIRA (v6.2#6252)
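The fix itself belongs in container-executor.c, but the intended semantics (treat ENOENT as success and keep going past other per-entry failures) can be sketched in Java as an illustration:

{code}
import java.io.IOException;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.attribute.BasicFileAttributes;

// Illustration only, not the native code: a best-effort recursive delete
// where a vanished entry (the ENOENT case) is fine, and a single failure
// does not abort the rest of the walk.
public class BestEffortDelete {
  public static void deleteRecursively(Path root) throws IOException {
    Files.walkFileTree(root, new SimpleFileVisitor<Path>() {
      @Override
      public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) {
        try {
          Files.deleteIfExists(file);  // no error if the file is already gone
        } catch (IOException e) {
          // log and continue with the remaining entries
        }
        return FileVisitResult.CONTINUE;
      }

      @Override
      public FileVisitResult postVisitDirectory(Path dir, IOException exc) {
        try {
          Files.deleteIfExists(dir);   // may legitimately fail if not empty
        } catch (IOException e) {
          // skip and keep going
        }
        return FileVisitResult.CONTINUE;
      }
    });
  }
}
{code}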
[jira] [Commented] (YARN-1929) DeadLock in RM when automatic failover is enabled.
[ https://issues.apache.org/jira/browse/YARN-1929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13968874#comment-13968874 ] Hadoop QA commented on YARN-1929: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12640136/yarn-1929-2.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-common-project/hadoop-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/3567//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3567//console This message is automatically generated. DeadLock in RM when automatic failover is enabled. -- Key: YARN-1929 URL: https://issues.apache.org/jira/browse/YARN-1929 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Environment: Yarn HA cluster Reporter: Rohith Assignee: Karthik Kambatla Priority: Blocker Attachments: yarn-1929-1.patch, yarn-1929-2.patch Dead lock detected in RM when automatic failover is enabled. {noformat} Found one Java-level deadlock: = Thread-2: waiting to lock monitor 0x7fb514303cf0 (object 0xef153fd0, a org.apache.hadoop.ha.ActiveStandbyElector), which is held by main-EventThread main-EventThread: waiting to lock monitor 0x7fb514750a48 (object 0xef154020, a org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService), which is held by Thread-2 {noformat} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (YARN-1190) enabling uber mode with 0 reducer still requires mapreduce.reduce.memory.mb to be less than yarn.app.mapreduce.am.resource.mb
[ https://issues.apache.org/jira/browse/YARN-1190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siqi Li resolved YARN-1190. --- Resolution: Duplicate Assignee: Siqi Li enabling uber mode with 0 reducer still requires mapreduce.reduce.memory.mb to be less than yarn.app.mapreduce.am.resource.mb - Key: YARN-1190 URL: https://issues.apache.org/jira/browse/YARN-1190 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.1.1-beta, 2.0.6-alpha Reporter: Siqi Li Assignee: Siqi Li Priority: Minor Attachments: YARN-1190_v1.patch.txt Since there are no reducers, the memory allocated to the reducer is irrelevant to enabling uber mode for a job -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1902) Allocation of too many containers when a second request is done with the same resource capability
[ https://issues.apache.org/jira/browse/YARN-1902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sietse T. Au updated YARN-1902: --- Affects Version/s: 2.4.0 Allocation of too many containers when a second request is done with the same resource capability - Key: YARN-1902 URL: https://issues.apache.org/jira/browse/YARN-1902 Project: Hadoop YARN Issue Type: Bug Components: client Affects Versions: 2.2.0, 2.3.0, 2.4.0 Reporter: Sietse T. Au Labels: patch Attachments: YARN-1902.patch, YARN-1902.v2.patch, YARN-1902.v3.patch Regarding AMRMClientImpl: Scenario 1: Given a ContainerRequest x with Resource y, when addContainerRequest is called z times with x, allocate is called, and at least one of the z allocated containers is started, then if another addContainerRequest call is made and subsequently an allocate call is sent to the RM, (z+1) containers will be allocated, where 1 container is expected. Scenario 2: No containers are started between the allocate calls. Analyzing debug logs of the AMRMClientImpl, I have found that (z+1) containers are indeed requested in both scenarios, but the correct behavior is observed only in the second scenario. Looking at the implementation, I have found that this (z+1) request is caused by the structure of the remoteRequestsTable. The consequence of the Map<Resource, ResourceRequestInfo> structure is that ResourceRequestInfo does not hold any information about whether a request has already been sent to the RM. There are workarounds for this, such as releasing the excess containers received. The solution implemented is to initialize a new ResourceRequest in ResourceRequestInfo when a request has been successfully sent to the RM. The patch includes a test covering scenario one. -- This message was sent by Atlassian JIRA (v6.2#6252)
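A rough sketch of Scenario 1 against the AMRMClient API (a sketch only: z, the Configuration, and the capability and progress values are illustrative placeholders, not from the patch):
{noformat}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.protocolrecords.AllocateResponse;
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;

AMRMClient<ContainerRequest> amClient = AMRMClient.createAMRMClient();
amClient.init(new Configuration());
amClient.start();

Resource capability = Resource.newInstance(1024, 1);
Priority priority = Priority.newInstance(0);
int z = 3;  // number of initial identical requests

// Ask for z containers with the same capability.
for (int i = 0; i < z; i++) {
  amClient.addContainerRequest(
      new ContainerRequest(capability, null, null, priority));
}
AllocateResponse first = amClient.allocate(0.1f);   // z containers granted
// ... start at least one of the granted containers (Scenario 1) ...

// One more request with the identical capability:
amClient.addContainerRequest(
    new ContainerRequest(capability, null, null, priority));
// The next heartbeat hands back z+1 containers instead of 1, because the
// remoteRequestsTable re-sends the full ask without tracking what was sent.
AllocateResponse second = amClient.allocate(0.2f);
{noformat}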
[jira] [Commented] (YARN-1391) Lost node list should be identified by NodeId
[ https://issues.apache.org/jira/browse/YARN-1391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13968906#comment-13968906 ] Siqi Li commented on YARN-1391: --- I think this patch can still be applied to trunk Lost node list should be identified by NodeId --- Key: YARN-1391 URL: https://issues.apache.org/jira/browse/YARN-1391 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.0.5-alpha Reporter: Siqi Li Assignee: Siqi Li Attachments: YARN-1391.v1.patch In the case of multiple node managers on a single machine, each of them should be identified by NodeId, which is more specific than the host name alone -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (YARN-1415) In scheduler UI, including used memory in Memory Total seems to be inaccurate
[ https://issues.apache.org/jira/browse/YARN-1415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siqi Li resolved YARN-1415. --- Resolution: Not a Problem In scheduler UI, including used memory in Memory Total seems to be inaccurate --- Key: YARN-1415 URL: https://issues.apache.org/jira/browse/YARN-1415 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager, scheduler Reporter: Siqi Li Fix For: 2.1.0-beta Attachments: 1.png, 2.png Memory Total is currently the sum of availableMB, allocatedMB, and reservedMB. It seems that the term availableMB actually means total memory, since it does not decrease when jobs use a certain amount of memory. For example, if a 100 GB cluster kept reporting availableMB = 100 GB while 30 GB was allocated, the UI would show a Memory Total of 130 GB. Hence, either Memory Total should not include allocatedMB, or availableMB is not being updated properly. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1414) with Fair Scheduler reserved MB in WebUI is leaking when killing waiting jobs
[ https://issues.apache.org/jira/browse/YARN-1414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13968909#comment-13968909 ] Siqi Li commented on YARN-1414: --- For now, this patch is able to update the leaf queues and the root queue correctly. Can someone take a look at this and merge it into the latest branch? with Fair Scheduler reserved MB in WebUI is leaking when killing waiting jobs - Key: YARN-1414 URL: https://issues.apache.org/jira/browse/YARN-1414 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager, scheduler Affects Versions: 2.0.5-alpha Reporter: Siqi Li Assignee: Siqi Li Fix For: 2.2.0 Attachments: YARN-1221-subtask.v1.patch.txt, YARN-1221-v2.patch -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1929) DeadLock in RM when automatic failover is enabled.
[ https://issues.apache.org/jira/browse/YARN-1929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13968966#comment-13968966 ] Jian He commented on YARN-1929: --- Patch looks good to me; waiting for others to also take a look. DeadLock in RM when automatic failover is enabled. -- Key: YARN-1929 URL: https://issues.apache.org/jira/browse/YARN-1929 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Environment: Yarn HA cluster Reporter: Rohith Assignee: Karthik Kambatla Priority: Blocker Attachments: yarn-1929-1.patch, yarn-1929-2.patch Dead lock detected in RM when automatic failover is enabled. {noformat}
Found one Java-level deadlock:
==============================
Thread-2:
  waiting to lock monitor 0x7fb514303cf0 (object 0xef153fd0, a org.apache.hadoop.ha.ActiveStandbyElector),
  which is held by main-EventThread
main-EventThread:
  waiting to lock monitor 0x7fb514750a48 (object 0xef154020, a org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService),
  which is held by Thread-2
{noformat} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1934) Potential NPE in ZKRMStateStore caused by handling Disconnected event from ZK.
[ https://issues.apache.org/jira/browse/YARN-1934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13968965#comment-13968965 ] Karthik Kambatla commented on YARN-1934: Ideally, we should guard access to zkClient with a getZkClient() method that attempts a new connection if zkClient is null. Potential NPE in ZKRMStateStore caused by handling Disconnected event from ZK. -- Key: YARN-1934 URL: https://issues.apache.org/jira/browse/YARN-1934 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.4.0 Reporter: Rohith Assignee: Karthik Kambatla Priority: Critical Attachments: RM.txt For a ZK Disconnected event, zkClient is set to null, which makes it very prone to throwing an NPE. {noformat}
case Disconnected:
  LOG.info("ZKRMStateStore Session disconnected");
  oldZkClient = zkClient;
  zkClient = null;
  break;
{noformat} -- This message was sent by Atlassian JIRA (v6.2#6252)
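A minimal sketch of the suggested guard (assuming a hypothetical createConnection() helper that opens a new ZK session; neither it nor the synchronized choice comes from an actual patch):
{noformat}
import java.io.IOException;
import org.apache.zookeeper.ZooKeeper;

private ZooKeeper zkClient;  // set to null on a Disconnected event

private synchronized ZooKeeper getZkClient() throws IOException {
  if (zkClient == null) {
    // Re-establish the session instead of letting callers hit an NPE.
    zkClient = createConnection();  // hypothetical helper
  }
  return zkClient;
}
{noformat}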
[jira] [Commented] (YARN-1934) Potential NPE in ZKRMStateStore caused by handling Disconnected event from ZK.
[ https://issues.apache.org/jira/browse/YARN-1934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13968967#comment-13968967 ] Karthik Kambatla commented on YARN-1934: Or, move all uses to runWithCheck(). Potential NPE in ZKRMStateStore caused by handling Disconnected event from ZK. -- Key: YARN-1934 URL: https://issues.apache.org/jira/browse/YARN-1934 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.4.0 Reporter: Rohith Assignee: Karthik Kambatla Priority: Critical Attachments: RM.txt For a ZK Disconnected event, zkClient is set to null, which makes it very prone to throwing an NPE. {noformat}
case Disconnected:
  LOG.info("ZKRMStateStore Session disconnected");
  oldZkClient = zkClient;
  zkClient = null;
  break;
{noformat} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1934) Potential NPE in ZKRMStateStore caused by handling Disconnected event from ZK.
[ https://issues.apache.org/jira/browse/YARN-1934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13968980#comment-13968980 ] Jian He commented on YARN-1934: --- I think we should wrap all uses of zkClient in runWithCheck() Potential NPE in ZKRMStateStore caused by handling Disconnected event from ZK. -- Key: YARN-1934 URL: https://issues.apache.org/jira/browse/YARN-1934 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.4.0 Reporter: Rohith Assignee: Karthik Kambatla Priority: Critical Attachments: RM.txt For a ZK Disconnected event, zkClient is set to null, which makes it very prone to throwing an NPE. {noformat}
case Disconnected:
  LOG.info("ZKRMStateStore Session disconnected");
  oldZkClient = zkClient;
  zkClient = null;
  break;
{noformat} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1934) Potential NPE in ZKRMStateStore caused by handling Disconnected event from ZK.
[ https://issues.apache.org/jira/browse/YARN-1934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13968987#comment-13968987 ] Karthik Kambatla commented on YARN-1934: Working on a patch; will try to post something later today or tomorrow. Potential NPE in ZKRMStateStore caused by handling Disconnected event from ZK. -- Key: YARN-1934 URL: https://issues.apache.org/jira/browse/YARN-1934 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.4.0 Reporter: Rohith Assignee: Karthik Kambatla Priority: Critical Attachments: RM.txt For a ZK Disconnected event, zkClient is set to null, which makes it very prone to throwing an NPE. {noformat}
case Disconnected:
  LOG.info("ZKRMStateStore Session disconnected");
  oldZkClient = zkClient;
  zkClient = null;
  break;
{noformat} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1934) Potential NPE in ZKRMStateStore caused by handling Disconnected event from ZK.
[ https://issues.apache.org/jira/browse/YARN-1934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla updated YARN-1934: --- Priority: Blocker (was: Critical) Potential NPE in ZKRMStateStore caused by handling Disconnected event from ZK. -- Key: YARN-1934 URL: https://issues.apache.org/jira/browse/YARN-1934 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.4.0 Reporter: Rohith Assignee: Karthik Kambatla Priority: Blocker Attachments: RM.txt For a ZK Disconnected event, zkClient is set to null, which makes it very prone to throwing an NPE. {noformat}
case Disconnected:
  LOG.info("ZKRMStateStore Session disconnected");
  oldZkClient = zkClient;
  zkClient = null;
  break;
{noformat} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1934) Potential NPE in ZKRMStateStore caused by handling Disconnected event from ZK.
[ https://issues.apache.org/jira/browse/YARN-1934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla updated YARN-1934: --- Attachment: yarn-1934-0.patch Here is a patch that replaces all direct references to zkClient with calls through runWithCheck. ZKRMStateStore has become very unwieldy and hard to manage. We should definitely clean it up. Thinking about it, I think we should just have RMZooKeeper and/or RMFencingZooKeeper classes that extend/wrap ZooKeeper and override the methods we need. That would make the remaining ZK code much easier to read and maintain. I would like to work on this in a separate JIRA targeted for 2.5. I haven't added a test in the patch; I am tempted not to, given the intention to clean up/revamp the ZK interactions. I can add one if folks insist. Potential NPE in ZKRMStateStore caused by handling Disconnected event from ZK. -- Key: YARN-1934 URL: https://issues.apache.org/jira/browse/YARN-1934 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.4.0 Reporter: Rohith Assignee: Karthik Kambatla Priority: Blocker Attachments: RM.txt, yarn-1934-0.patch For a ZK Disconnected event, zkClient is set to null, which makes it very prone to throwing an NPE. {noformat}
case Disconnected:
  LOG.info("ZKRMStateStore Session disconnected");
  oldZkClient = zkClient;
  zkClient = null;
  break;
{noformat} -- This message was sent by Atlassian JIRA (v6.2#6252)
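To make the pattern concrete, here is an illustrative sketch of the runWithCheck-style access the patch moves to (the ZKAction shape and the waitForConnection() guard are modeled on this discussion, not copied from yarn-1934-0.patch; path is a caller-supplied znode):
{noformat}
import java.io.IOException;
import org.apache.zookeeper.KeeperException;

abstract class ZKAction<T> {
  abstract T run() throws KeeperException, InterruptedException;

  T runWithCheck() throws KeeperException, InterruptedException, IOException {
    // Single choke point: make sure the session exists (reconnecting if
    // necessary) before touching ZK, so run() never sees a null zkClient.
    waitForConnection();  // hypothetical guard
    return run();
  }
}

// Callers go through the wrapper instead of dereferencing the field directly:
byte[] data = new ZKAction<byte[]>() {
  @Override
  byte[] run() throws KeeperException, InterruptedException {
    return zkClient.getData(path, false, null);
  }
}.runWithCheck();
{noformat}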
[jira] [Commented] (YARN-1934) Potential NPE in ZKRMStateStore caused by handling Disconnected event from ZK.
[ https://issues.apache.org/jira/browse/YARN-1934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13969092#comment-13969092 ] Hadoop QA commented on YARN-1934: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12640175/yarn-1934-0.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/3568//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3568//console This message is automatically generated. Potential NPE in ZKRMStateStore caused by handling Disconnected event from ZK. -- Key: YARN-1934 URL: https://issues.apache.org/jira/browse/YARN-1934 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.4.0 Reporter: Rohith Assignee: Karthik Kambatla Priority: Blocker Attachments: RM.txt, yarn-1934-0.patch For a ZK Disconnected event, zkClient is set to null, which makes it very prone to throwing an NPE. {noformat}
case Disconnected:
  LOG.info("ZKRMStateStore Session disconnected");
  oldZkClient = zkClient;
  zkClient = null;
  break;
{noformat} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1934) Potential NPE in ZKRMStateStore caused by handling Disconnected event from ZK.
[ https://issues.apache.org/jira/browse/YARN-1934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13969225#comment-13969225 ] Rohith commented on YARN-1934: -- +1, the patch looks good to me :-) Potential NPE in ZKRMStateStore caused by handling Disconnected event from ZK. -- Key: YARN-1934 URL: https://issues.apache.org/jira/browse/YARN-1934 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.4.0 Reporter: Rohith Assignee: Karthik Kambatla Priority: Blocker Attachments: RM.txt, yarn-1934-0.patch For a ZK Disconnected event, zkClient is set to null, which makes it very prone to throwing an NPE. {noformat}
case Disconnected:
  LOG.info("ZKRMStateStore Session disconnected");
  oldZkClient = zkClient;
  zkClient = null;
  break;
{noformat} -- This message was sent by Atlassian JIRA (v6.2#6252)