[jira] [Updated] (YARN-1932) Javascript injection on the job status page

2014-04-14 Thread Arun C Murthy (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arun C Murthy updated YARN-1932:


Priority: Blocker  (was: Critical)

 Javascript injection on the job status page
 ---

 Key: YARN-1932
 URL: https://issues.apache.org/jira/browse/YARN-1932
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 3.0.0, 0.23.9, 2.5.0
Reporter: Mit Desai
Assignee: Mit Desai
Priority: Blocker
 Attachments: YARN-1932.patch


 Scripts can be injected into the job status page because the diagnostics field
 is not sanitized. Whatever string you set there shows up on the jobs page as
 is, i.e. if you put in any script commands, they will be executed in the
 browser of the user who opens the page.
 We need to escape the diagnostic string so that the scripts are not run.
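
For illustration only, a minimal sketch of the escaping the description asks for, 
assuming Apache Commons Lang 2.x (StringEscapeUtils) is available; the helper 
class and the exact point in the web layer where escaping would happen are 
assumptions, not the attached patch:

{code}
import org.apache.commons.lang.StringEscapeUtils;

public final class DiagnosticsEscaper {
  private DiagnosticsEscaper() {}

  /** Escape raw diagnostics before rendering, so embedded script tags
   *  show up as text instead of executing in the user's browser. */
  public static String escapeDiagnostics(String rawDiagnostics) {
    return StringEscapeUtils.escapeHtml(rawDiagnostics == null ? "" : rawDiagnostics);
  }
}
{code}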



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Assigned] (YARN-1935) Security for timeline service

2014-04-14 Thread Zhijie Shen (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhijie Shen reassigned YARN-1935:
-

Assignee: Zhijie Shen

 Security for timeline service
 -

 Key: YARN-1935
 URL: https://issues.apache.org/jira/browse/YARN-1935
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Arun C Murthy
Assignee: Zhijie Shen

 Jira to track work to secure the ATS



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1935) Security for timeline service

2014-04-14 Thread Zhijie Shen (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13968121#comment-13968121
 ] 

Zhijie Shen commented on YARN-1935:
---

I'm going to take care of the security issues of the timeline server

 Security for timeline service
 -

 Key: YARN-1935
 URL: https://issues.apache.org/jira/browse/YARN-1935
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Arun C Murthy
Assignee: Zhijie Shen

 Jira to track work to secure the ATS



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-1935) Security for timeline server

2014-04-14 Thread Zhijie Shen (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhijie Shen updated YARN-1935:
--

Summary: Security for timeline server  (was: Security for timeline service)

 Security for timeline server
 

 Key: YARN-1935
 URL: https://issues.apache.org/jira/browse/YARN-1935
 Project: Hadoop YARN
  Issue Type: New Feature
Reporter: Arun C Murthy
Assignee: Zhijie Shen

 Jira to track work to secure the ATS



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-1935) Security for timeline service

2014-04-14 Thread Zhijie Shen (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhijie Shen updated YARN-1935:
--

Issue Type: New Feature  (was: Sub-task)
Parent: (was: YARN-1530)

 Security for timeline service
 -

 Key: YARN-1935
 URL: https://issues.apache.org/jira/browse/YARN-1935
 Project: Hadoop YARN
  Issue Type: New Feature
Reporter: Arun C Murthy
Assignee: Zhijie Shen

 Jira to track work to secure the ATS



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-996) REST API support for node resource configuration

2014-04-14 Thread Kenji Kikushima (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-996?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13968123#comment-13968123
 ] 

Kenji Kikushima commented on YARN-996:
--

[~tgraves], thanks for your comment. Certainly, we should add a test for admin 
acls, but updateNodeResource isn't protected. I think we should check acls in 
updateNodeResource. I searched for a related JIRA but couldn't find one. Can I 
create a JIRA for this issue? [~djp], please let us know what you think.
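
As a rough illustration of the kind of admin ACL check being discussed (the 
helper name and where it would be wired into updateNodeResource are 
assumptions, not an actual patch):

{code}
import java.io.IOException;
import org.apache.hadoop.security.UserGroupInformation;
import org.apache.hadoop.security.authorize.AccessControlList;

public final class AdminAclCheck {
  private AdminAclCheck() {}

  /** Reject the call if the caller is not in the configured admin ACL;
   *  a check along these lines is what updateNodeResource currently lacks. */
  public static void checkAdminAccess(AccessControlList adminAcl)
      throws IOException {
    UserGroupInformation caller = UserGroupInformation.getCurrentUser();
    if (!adminAcl.isUserAllowed(caller)) {
      throw new IOException("User " + caller.getShortUserName()
          + " is not authorized to update node resources");
    }
  }
}
{code}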

 REST API support for node resource configuration
 

 Key: YARN-996
 URL: https://issues.apache.org/jira/browse/YARN-996
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: nodemanager, scheduler
Reporter: Junping Du
Assignee: Kenji Kikushima
 Attachments: YARN-996-sample.patch


 Besides admin protocol and CLI, REST API should also be supported for node 
 resource configuration



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (YARN-1936) Secured timeline client

2014-04-14 Thread Zhijie Shen (JIRA)
Zhijie Shen created YARN-1936:
-

 Summary: Secured timeline client
 Key: YARN-1936
 URL: https://issues.apache.org/jira/browse/YARN-1936
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Zhijie Shen
Assignee: Zhijie Shen






--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (YARN-1937) Access control of per-framework data

2014-04-14 Thread Zhijie Shen (JIRA)
Zhijie Shen created YARN-1937:
-

 Summary: Access control of per-framework data
 Key: YARN-1937
 URL: https://issues.apache.org/jira/browse/YARN-1937
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Zhijie Shen
Assignee: Zhijie Shen






--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (YARN-1938) Kerberos authentication for the timeline server

2014-04-14 Thread Zhijie Shen (JIRA)
Zhijie Shen created YARN-1938:
-

 Summary: Kerberos authentication for the timeline server
 Key: YARN-1938
 URL: https://issues.apache.org/jira/browse/YARN-1938
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Zhijie Shen
Assignee: Zhijie Shen






--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-996) REST API support for node resource configuration

2014-04-14 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-996?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13968165#comment-13968165
 ] 

Junping Du commented on YARN-996:
-

Thanks to both of you for the good points on adding an ACL check for this API, 
which we missed before. [~kj-ki], feel free to file a separate JIRA for it 
under YARN-291 and link it with YARN-312. Thx! 

 REST API support for node resource configuration
 

 Key: YARN-996
 URL: https://issues.apache.org/jira/browse/YARN-996
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: nodemanager, scheduler
Reporter: Junping Du
Assignee: Kenji Kikushima
 Attachments: YARN-996-sample.patch


 Besides admin protocol and CLI, REST API should also be supported for node 
 resource configuration



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1861) Both RM stuck in standby mode when automatic failover is enabled

2014-04-14 Thread Rohith (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13968211#comment-13968211
 ] 

Rohith commented on YARN-1861:
--

Oops, I also encountered both RMs being stuck in standby state forever :-(

The stack trace is the same as the one Arpit Gupta gave in his comment, and 
another observation matches Vinod's: after deleting the lock, leader election 
started and the same RM became active.

 Both RM stuck in standby mode when automatic failover is enabled
 

 Key: YARN-1861
 URL: https://issues.apache.org/jira/browse/YARN-1861
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Affects Versions: 2.4.0
Reporter: Arpit Gupta
Assignee: Vinod Kumar Vavilapalli
Priority: Critical

 In our HA tests we noticed that the tests got stuck because both RM's got 
 into standby state and no one became active.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1933) TestAMRestart and TestNodeHealthService failing sometimes on Windows

2014-04-14 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13968243#comment-13968243
 ] 

Hudson commented on YARN-1933:
--

FAILURE: Integrated in Hadoop-Yarn-trunk #540 (See 
[https://builds.apache.org/job/Hadoop-Yarn-trunk/540/])
YARN-1933. Fixed test issues with TestAMRestart and TestNodeHealthService. 
Contributed by Jian He. (vinodkv: 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1587104)
* /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestNodeHealthService.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/recovery/ZKRMStateStore.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/applicationsmanager/TestAMRestart.java


 TestAMRestart and TestNodeHealthService failing sometimes on Windows
 

 Key: YARN-1933
 URL: https://issues.apache.org/jira/browse/YARN-1933
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Jian He
Assignee: Jian He
 Fix For: 2.4.1

 Attachments: YARN-1933.1.patch, YARN-1933.2.patch


 TestNodeHealthService failures:
 testNodeHealthScript(org.apache.hadoop.yarn.server.nodemanager.TestNodeHealthService)
   Time elapsed: 1.405 sec  <<< ERROR!
 java.io.FileNotFoundException: 
 C:\Users\Administrator\Documents\hadoop-common\hadoop-yarn-project\hadoop-yarn\hadoop-yarn-server\hadoop-yarn-server-nodemanager\target\org.apache.hadoop.yarn.server.nodemanager.TestNodeHealthService-localDir\failingscript.cmd
  (The process cannot access the file because it is being used by another 
 process)
   at java.io.FileOutputStream.open(Native Method)
   at java.io.FileOutputStream.<init>(FileOutputStream.java:221)
   at java.io.FileOutputStream.<init>(FileOutputStream.java:171)
   at 
 org.apache.hadoop.yarn.server.nodemanager.TestNodeHealthService.writeNodeHealthScriptFile(TestNodeHealthService.java:82)
   at 
 org.apache.hadoop.yarn.server.nodemanager.TestNodeHealthService.testNodeHealthScript(TestNodeHealthService.java:154)
 testNodeHealthScriptShouldRun(org.apache.hadoop.yarn.server.nodemanager.TestNodeHealthService)
   Time elapsed: 0 sec  <<< ERROR!
 java.io.FileNotFoundException: 
 C:\Users\Administrator\Documents\hadoop-common\hadoop-yarn-project\hadoop-yarn\hadoop-yarn-server\hadoop-yarn-server-nodemanager\target\org.apache.hadoop.yarn.server.nodemanager.TestNodeHealthService-localDir\failingscript.cmd
  (Access is denied)
   at java.io.FileOutputStream.open(Native Method)
   at java.io.FileOutputStream.<init>(FileOutputStream.java:221)
   at java.io.FileOutputStream.<init>(FileOutputStream.java:171)
   at 
 org.apache.hadoop.yarn.server.nodemanager.TestNodeHealthService.writeNodeHealthScriptFile(TestNodeHealthService.java:82)
   at 
 org.apache.hadoop.yarn.server.nodemanager.TestNodeHealthService.testNodeHealthScriptShouldRun(TestNodeHealthService.java:103)



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1928) TestAMRMRPCNodeUpdates fails occasionally

2014-04-14 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13968245#comment-13968245
 ] 

Hudson commented on YARN-1928:
--

FAILURE: Integrated in Hadoop-Yarn-trunk #540 (See 
[https://builds.apache.org/job/Hadoop-Yarn-trunk/540/])
YARN-1928. Fixed a race condition in TestAMRMRPCNodeUpdates which caused it to 
fail occasionally. Contributed by Zhijie Shen. (vinodkv: 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1587114)
* /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/applicationsmanager/TestAMRMRPCNodeUpdates.java


 TestAMRMRPCNodeUpdates fails occasionally
 -

 Key: YARN-1928
 URL: https://issues.apache.org/jira/browse/YARN-1928
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Zhijie Shen
Assignee: Zhijie Shen
 Fix For: 2.4.1

 Attachments: YARN-1928.1.patch


 {code}
 junit.framework.AssertionFailedError: expected:<0> but was:<4>
   at junit.framework.Assert.fail(Assert.java:50)
   at junit.framework.Assert.failNotEquals(Assert.java:287)
   at junit.framework.Assert.assertEquals(Assert.java:67)
   at junit.framework.Assert.assertEquals(Assert.java:199)
   at junit.framework.Assert.assertEquals(Assert.java:205)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.applicationsmanager.TestAMRMRPCNodeUpdates.testAMRMUnusableNodes(TestAMRMRPCNodeUpdates.java:136)
 {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1928) TestAMRMRPCNodeUpdates fails occasionally

2014-04-14 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13968361#comment-13968361
 ] 

Hudson commented on YARN-1928:
--

FAILURE: Integrated in Hadoop-Hdfs-trunk #1732 (See 
[https://builds.apache.org/job/Hadoop-Hdfs-trunk/1732/])
YARN-1928. Fixed a race condition in TestAMRMRPCNodeUpdates which caused it to 
fail occasionally. Contributed by Zhijie Shen. (vinodkv: 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1587114)
* /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/applicationsmanager/TestAMRMRPCNodeUpdates.java


 TestAMRMRPCNodeUpdates fails occasionally
 -

 Key: YARN-1928
 URL: https://issues.apache.org/jira/browse/YARN-1928
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Zhijie Shen
Assignee: Zhijie Shen
 Fix For: 2.4.1

 Attachments: YARN-1928.1.patch


 {code}
 junit.framework.AssertionFailedError: expected:<0> but was:<4>
   at junit.framework.Assert.fail(Assert.java:50)
   at junit.framework.Assert.failNotEquals(Assert.java:287)
   at junit.framework.Assert.assertEquals(Assert.java:67)
   at junit.framework.Assert.assertEquals(Assert.java:199)
   at junit.framework.Assert.assertEquals(Assert.java:205)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.applicationsmanager.TestAMRMRPCNodeUpdates.testAMRMUnusableNodes(TestAMRMRPCNodeUpdates.java:136)
 {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1933) TestAMRestart and TestNodeHealthService failing sometimes on Windows

2014-04-14 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13968359#comment-13968359
 ] 

Hudson commented on YARN-1933:
--

FAILURE: Integrated in Hadoop-Hdfs-trunk #1732 (See 
[https://builds.apache.org/job/Hadoop-Hdfs-trunk/1732/])
YARN-1933. Fixed test issues with TestAMRestart and TestNodeHealthService. 
Contributed by Jian He. (vinodkv: 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1587104)
* /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestNodeHealthService.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/recovery/ZKRMStateStore.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/applicationsmanager/TestAMRestart.java


 TestAMRestart and TestNodeHealthService failing sometimes on Windows
 

 Key: YARN-1933
 URL: https://issues.apache.org/jira/browse/YARN-1933
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Jian He
Assignee: Jian He
 Fix For: 2.4.1

 Attachments: YARN-1933.1.patch, YARN-1933.2.patch


 TestNodeHealthService failures:
 testNodeHealthScript(org.apache.hadoop.yarn.server.nodemanager.TestNodeHealthService)
   Time elapsed: 1.405 sec  <<< ERROR!
 java.io.FileNotFoundException: 
 C:\Users\Administrator\Documents\hadoop-common\hadoop-yarn-project\hadoop-yarn\hadoop-yarn-server\hadoop-yarn-server-nodemanager\target\org.apache.hadoop.yarn.server.nodemanager.TestNodeHealthService-localDir\failingscript.cmd
  (The process cannot access the file because it is being used by another 
 process)
   at java.io.FileOutputStream.open(Native Method)
   at java.io.FileOutputStream.<init>(FileOutputStream.java:221)
   at java.io.FileOutputStream.<init>(FileOutputStream.java:171)
   at 
 org.apache.hadoop.yarn.server.nodemanager.TestNodeHealthService.writeNodeHealthScriptFile(TestNodeHealthService.java:82)
   at 
 org.apache.hadoop.yarn.server.nodemanager.TestNodeHealthService.testNodeHealthScript(TestNodeHealthService.java:154)
 testNodeHealthScriptShouldRun(org.apache.hadoop.yarn.server.nodemanager.TestNodeHealthService)
   Time elapsed: 0 sec  <<< ERROR!
 java.io.FileNotFoundException: 
 C:\Users\Administrator\Documents\hadoop-common\hadoop-yarn-project\hadoop-yarn\hadoop-yarn-server\hadoop-yarn-server-nodemanager\target\org.apache.hadoop.yarn.server.nodemanager.TestNodeHealthService-localDir\failingscript.cmd
  (Access is denied)
   at java.io.FileOutputStream.open(Native Method)
   at java.io.FileOutputStream.<init>(FileOutputStream.java:221)
   at java.io.FileOutputStream.<init>(FileOutputStream.java:171)
   at 
 org.apache.hadoop.yarn.server.nodemanager.TestNodeHealthService.writeNodeHealthScriptFile(TestNodeHealthService.java:82)
   at 
 org.apache.hadoop.yarn.server.nodemanager.TestNodeHealthService.testNodeHealthScriptShouldRun(TestNodeHealthService.java:103)



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-1929) DeadLock in RM when automatic failover is enabled.

2014-04-14 Thread Karthik Kambatla (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karthik Kambatla updated YARN-1929:
---

Attachment: yarn-1929-1.patch

Here is a first-cut patch that removes unnecessary synchronization from 
EmbeddedElectorService, AdminService and CompositeService.

I'm thinking about the best way to write a unit test for this to avoid 
regressions in the future. We could perhaps override becomeActive to sleep for 
some time and then try to shut the RM down; if it doesn't shut down within a 
particular amount of time, fail the test? Any other ideas? 
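
As a rough sketch of that timed-shutdown idea (not part of the attached patch; 
'rm' stands for whatever ResourceManager instance the test has set up with its 
transition-to-active path blocked):

{code}
import java.util.concurrent.*;
import org.junit.Assert;

// 'rm' is the ResourceManager instance under test (assumed to exist in scope).
// Run stop() on a separate thread; a deadlock between the elector and the
// service-stop path would make the wait below time out instead of returning.
ExecutorService exec = Executors.newSingleThreadExecutor();
Future<?> stopTask = exec.submit(new Runnable() {
  @Override
  public void run() {
    rm.stop();
  }
});
try {
  stopTask.get(30, TimeUnit.SECONDS);
} catch (TimeoutException e) {
  Assert.fail("RM did not shut down within 30s; possible deadlock");
} finally {
  exec.shutdownNow();
}
{code}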

 DeadLock in RM when automatic failover is enabled.
 --

 Key: YARN-1929
 URL: https://issues.apache.org/jira/browse/YARN-1929
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
 Environment: Yarn HA cluster
Reporter: Rohith
Assignee: Karthik Kambatla
Priority: Blocker
 Attachments: yarn-1929-1.patch


 Dead lock detected  in RM when automatic failover is enabled.
 {noformat}
 Found one Java-level deadlock:
 =
 Thread-2:
   waiting to lock monitor 0x7fb514303cf0 (object 0xef153fd0, a 
 org.apache.hadoop.ha.ActiveStandbyElector),
   which is held by main-EventThread
 main-EventThread:
   waiting to lock monitor 0x7fb514750a48 (object 0xef154020, a 
 org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService),
   which is held by Thread-2
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1928) TestAMRMRPCNodeUpdates fails occasionally

2014-04-14 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13968413#comment-13968413
 ] 

Hudson commented on YARN-1928:
--

SUCCESS: Integrated in Hadoop-Mapreduce-trunk #1757 (See 
[https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1757/])
YARN-1928. Fixed a race condition in TestAMRMRPCNodeUpdates which caused it to 
fail occasionally. Contributed by Zhijie Shen. (vinodkv: 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1587114)
* /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/applicationsmanager/TestAMRMRPCNodeUpdates.java


 TestAMRMRPCNodeUpdates fails occasionally
 -

 Key: YARN-1928
 URL: https://issues.apache.org/jira/browse/YARN-1928
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Zhijie Shen
Assignee: Zhijie Shen
 Fix For: 2.4.1

 Attachments: YARN-1928.1.patch


 {code}
 junit.framework.AssertionFailedError: expected:<0> but was:<4>
   at junit.framework.Assert.fail(Assert.java:50)
   at junit.framework.Assert.failNotEquals(Assert.java:287)
   at junit.framework.Assert.assertEquals(Assert.java:67)
   at junit.framework.Assert.assertEquals(Assert.java:199)
   at junit.framework.Assert.assertEquals(Assert.java:205)
   at 
 org.apache.hadoop.yarn.server.resourcemanager.applicationsmanager.TestAMRMRPCNodeUpdates.testAMRMUnusableNodes(TestAMRMRPCNodeUpdates.java:136)
 {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1933) TestAMRestart and TestNodeHealthService failing sometimes on Windows

2014-04-14 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13968411#comment-13968411
 ] 

Hudson commented on YARN-1933:
--

SUCCESS: Integrated in Hadoop-Mapreduce-trunk #1757 (See 
[https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1757/])
YARN-1933. Fixed test issues with TestAMRestart and TestNodeHealthService. 
Contributed by Jian He. (vinodkv: 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1587104)
* /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestNodeHealthService.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/recovery/ZKRMStateStore.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/applicationsmanager/TestAMRestart.java


 TestAMRestart and TestNodeHealthService failing sometimes on Windows
 

 Key: YARN-1933
 URL: https://issues.apache.org/jira/browse/YARN-1933
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Jian He
Assignee: Jian He
 Fix For: 2.4.1

 Attachments: YARN-1933.1.patch, YARN-1933.2.patch


 TestNodeHealthService failures:
 testNodeHealthScript(org.apache.hadoop.yarn.server.nodemanager.TestNodeHealthService)
   Time elapsed: 1.405 sec  <<< ERROR!
 java.io.FileNotFoundException: 
 C:\Users\Administrator\Documents\hadoop-common\hadoop-yarn-project\hadoop-yarn\hadoop-yarn-server\hadoop-yarn-server-nodemanager\target\org.apache.hadoop.yarn.server.nodemanager.TestNodeHealthService-localDir\failingscript.cmd
  (The process cannot access the file because it is being used by another 
 process)
   at java.io.FileOutputStream.open(Native Method)
   at java.io.FileOutputStream.<init>(FileOutputStream.java:221)
   at java.io.FileOutputStream.<init>(FileOutputStream.java:171)
   at 
 org.apache.hadoop.yarn.server.nodemanager.TestNodeHealthService.writeNodeHealthScriptFile(TestNodeHealthService.java:82)
   at 
 org.apache.hadoop.yarn.server.nodemanager.TestNodeHealthService.testNodeHealthScript(TestNodeHealthService.java:154)
 testNodeHealthScriptShouldRun(org.apache.hadoop.yarn.server.nodemanager.TestNodeHealthService)
   Time elapsed: 0 sec  <<< ERROR!
 java.io.FileNotFoundException: 
 C:\Users\Administrator\Documents\hadoop-common\hadoop-yarn-project\hadoop-yarn\hadoop-yarn-server\hadoop-yarn-server-nodemanager\target\org.apache.hadoop.yarn.server.nodemanager.TestNodeHealthService-localDir\failingscript.cmd
  (Access is denied)
   at java.io.FileOutputStream.open(Native Method)
   at java.io.FileOutputStream.<init>(FileOutputStream.java:221)
   at java.io.FileOutputStream.<init>(FileOutputStream.java:171)
   at 
 org.apache.hadoop.yarn.server.nodemanager.TestNodeHealthService.writeNodeHealthScriptFile(TestNodeHealthService.java:82)
   at 
 org.apache.hadoop.yarn.server.nodemanager.TestNodeHealthService.testNodeHealthScriptShouldRun(TestNodeHealthService.java:103)



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1929) DeadLock in RM when automatic failover is enabled.

2014-04-14 Thread Tsuyoshi OZAWA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13968448#comment-13968448
 ] 

Tsuyoshi OZAWA commented on YARN-1929:
--

 Thinking about the best way to write a unit test for this to avoid 
 regressions in the future.

Your approach looks reasonable to me. In addition to overriding 
EES#becomeActive, we can override the synchronized methods or change their 
behaviour (CompositeService#stop, AS#transitionToActive, RM#transitionToActive) 
to sleep while holding the lock (a bit different, but like 
TestRetryCacheWithHA#DummyRetryInvocationHandler). Then we can reproduce the 
deadlock situation step by step in test cases.

IMHO, we shouldn't touch ASE, because it's also used in NameNode HA.

 DeadLock in RM when automatic failover is enabled.
 --

 Key: YARN-1929
 URL: https://issues.apache.org/jira/browse/YARN-1929
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
 Environment: Yarn HA cluster
Reporter: Rohith
Assignee: Karthik Kambatla
Priority: Blocker
 Attachments: yarn-1929-1.patch


 Dead lock detected  in RM when automatic failover is enabled.
 {noformat}
 Found one Java-level deadlock:
 =
 Thread-2:
   waiting to lock monitor 0x7fb514303cf0 (object 0xef153fd0, a 
 org.apache.hadoop.ha.ActiveStandbyElector),
   which is held by main-EventThread
 main-EventThread:
   waiting to lock monitor 0x7fb514750a48 (object 0xef154020, a 
 org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService),
   which is held by Thread-2
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1929) DeadLock in RM when automatic failover is enabled.

2014-04-14 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13968458#comment-13968458
 ] 

Hadoop QA commented on YARN-1929:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12640071/yarn-1929-1.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:red}-1 tests included{color}.  The patch doesn't appear to include 
any new or modified tests.
Please justify why no new tests are needed for this 
patch.
Also please list what manual steps were performed to 
verify this patch.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:red}-1 findbugs{color}.  The patch appears to introduce 1 new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-common-project/hadoop-common 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/3566//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-YARN-Build/3566//artifact/trunk/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-resourcemanager.html
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3566//console

This message is automatically generated.

 DeadLock in RM when automatic failover is enabled.
 --

 Key: YARN-1929
 URL: https://issues.apache.org/jira/browse/YARN-1929
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
 Environment: Yarn HA cluster
Reporter: Rohith
Assignee: Karthik Kambatla
Priority: Blocker
 Attachments: yarn-1929-1.patch


 Dead lock detected  in RM when automatic failover is enabled.
 {noformat}
 Found one Java-level deadlock:
 =
 Thread-2:
   waiting to lock monitor 0x7fb514303cf0 (object 0xef153fd0, a 
 org.apache.hadoop.ha.ActiveStandbyElector),
   which is held by main-EventThread
 main-EventThread:
   waiting to lock monitor 0x7fb514750a48 (object 0xef154020, a 
 org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService),
   which is held by Thread-2
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-1337) Recover active container state upon nodemanager restart

2014-04-14 Thread Ravi Prakash (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ravi Prakash updated YARN-1337:
---

Assignee: (was: Ravi Prakash)

 Recover active container state upon nodemanager restart
 ---

 Key: YARN-1337
 URL: https://issues.apache.org/jira/browse/YARN-1337
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: nodemanager
Affects Versions: 2.3.0
Reporter: Jason Lowe

 To support work-preserving NM restart we need to recover the state of the 
 containers that were active when the nodemanager went down.  This includes 
 informing the RM of containers that have exited in the interim and a strategy 
 for dealing with the exit codes from those containers along with how to 
 reacquire the active containers and determine their exit codes when they 
 terminate.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1861) Both RM stuck in standby mode when automatic failover is enabled

2014-04-14 Thread Tsuyoshi OZAWA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13968487#comment-13968487
 ] 

Tsuyoshi OZAWA commented on YARN-1861:
--

[~kasha], I think this problem looks very similar to YARN-1929 - a deadlock 
after losing the ZK session.

(*ASE#processResult* -> *EES#becomeStandby* -> *AS#transitionToStandby* -> 
*RM#transitionToStandby*) and (RM#serviceStop -> RM.super#serviceStop -> 
*RM.super#stop* -> AS#stop -> *AS#serviceStop* -> *EES#serviceStop* -> 
*ASE#quitElection*)

IIUC, Karthik's patch on YARN-1929 partially solves this problem, but not 
completely. Please correct me if I'm wrong. Thanks.

 Both RM stuck in standby mode when automatic failover is enabled
 

 Key: YARN-1861
 URL: https://issues.apache.org/jira/browse/YARN-1861
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Affects Versions: 2.4.0
Reporter: Arpit Gupta
Assignee: Vinod Kumar Vavilapalli
Priority: Critical

 In our HA tests we noticed that the tests got stuck because both RM's got 
 into standby state and no one became active.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1281) TestZKRMStateStoreZKClientConnections fails intermittently

2014-04-14 Thread Mit Desai (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13968480#comment-13968480
 ] 

Mit Desai commented on YARN-1281:
-

[~kasha], are you still seeing this test failing?

 TestZKRMStateStoreZKClientConnections fails intermittently
 --

 Key: YARN-1281
 URL: https://issues.apache.org/jira/browse/YARN-1281
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Reporter: Karthik Kambatla
Assignee: Karthik Kambatla

 The test fails intermittently - haven't been able to reproduce the failure 
 deterministically. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Assigned] (YARN-1355) Recover application ACLs upon nodemanager restart

2014-04-14 Thread Jason Lowe (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Lowe reassigned YARN-1355:


Assignee: Jason Lowe

 Recover application ACLs upon nodemanager restart
 -

 Key: YARN-1355
 URL: https://issues.apache.org/jira/browse/YARN-1355
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: nodemanager
Affects Versions: 2.3.0
Reporter: Jason Lowe
Assignee: Jason Lowe

 The ACLs for applications need to be recovered for work-preserving 
 nodemanager restart.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Assigned] (YARN-1354) Recover applications upon nodemanager restart

2014-04-14 Thread Jason Lowe (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Lowe reassigned YARN-1354:


Assignee: Jason Lowe

 Recover applications upon nodemanager restart
 -

 Key: YARN-1354
 URL: https://issues.apache.org/jira/browse/YARN-1354
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: nodemanager
Affects Versions: 2.3.0
Reporter: Jason Lowe
Assignee: Jason Lowe

 The set of active applications in the nodemanager context needs to be 
 recovered for work-preserving nodemanager restart



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Assigned] (YARN-1352) Recover LogAggregationService upon nodemanager restart

2014-04-14 Thread Jason Lowe (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Lowe reassigned YARN-1352:


Assignee: Jason Lowe

 Recover LogAggregationService upon nodemanager restart
 --

 Key: YARN-1352
 URL: https://issues.apache.org/jira/browse/YARN-1352
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: nodemanager
Affects Versions: 2.3.0
Reporter: Jason Lowe
Assignee: Jason Lowe

 LogAggregationService state needs to be recovered as part of the 
 work-preserving nodemanager restart feature.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Assigned] (YARN-1337) Recover active container state upon nodemanager restart

2014-04-14 Thread Jason Lowe (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Lowe reassigned YARN-1337:


Assignee: Jason Lowe

 Recover active container state upon nodemanager restart
 ---

 Key: YARN-1337
 URL: https://issues.apache.org/jira/browse/YARN-1337
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: nodemanager
Affects Versions: 2.3.0
Reporter: Jason Lowe
Assignee: Jason Lowe

 To support work-preserving NM restart we need to recover the state of the 
 containers that were active when the nodemanager went down.  This includes 
 informing the RM of containers that have exited in the interim and a strategy 
 for dealing with the exit codes from those containers along with how to 
 reacquire the active containers and determine their exit codes when they 
 terminate.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1281) TestZKRMStateStoreZKClientConnections fails intermittently

2014-04-14 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13968501#comment-13968501
 ] 

Karthik Kambatla commented on YARN-1281:


Yes, almost in every nightly run. I have been caught up with other things and 
haven't been able to look into this. I temporarily marked it Unassigned so 
someone else can pick it up; I'll take it back when I get a chance to fix it. 

 TestZKRMStateStoreZKClientConnections fails intermittently
 --

 Key: YARN-1281
 URL: https://issues.apache.org/jira/browse/YARN-1281
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Reporter: Karthik Kambatla

 The test fails intermittently - haven't been able to reproduce the failure 
 deterministically. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-1281) TestZKRMStateStoreZKClientConnections fails intermittently

2014-04-14 Thread Karthik Kambatla (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karthik Kambatla updated YARN-1281:
---

Assignee: (was: Karthik Kambatla)

 TestZKRMStateStoreZKClientConnections fails intermittently
 --

 Key: YARN-1281
 URL: https://issues.apache.org/jira/browse/YARN-1281
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Reporter: Karthik Kambatla

 The test fails intermittently - haven't been able to reproduce the failure 
 deterministically. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (YARN-1939) Improve the packaging of AmIpFilter

2014-04-14 Thread Thomas Graves (JIRA)
Thomas Graves created YARN-1939:
---

 Summary: Improve the packaging of AmIpFilter
 Key: YARN-1939
 URL: https://issues.apache.org/jira/browse/YARN-1939
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: api, webapp
Affects Versions: 2.4.0
Reporter: Thomas Graves


It is recommended that applications use the AmIpFilter to properly secure any 
WebUI that is specific to that application.  The AmIpFilter is packaged in 
org.apache.hadoop.yarn.server.webproxy.amfilter, which requires an application 
to pull in yarn-server as a dependency; that isn't very user friendly for 
applications wanting to pick up the bare minimum.

We should improve the packaging so that it can be pulled in independently.  
We do need to be careful to keep it backwards compatible, at least in the 2.x 
release line.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1861) Both RM stuck in standby mode when automatic failover is enabled

2014-04-14 Thread Tsuyoshi OZAWA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13968520#comment-13968520
 ] 

Tsuyoshi OZAWA commented on YARN-1861:
--

Thank you for pointing that out, Karthik. I'll continue checking the code.

 Both RM stuck in standby mode when automatic failover is enabled
 

 Key: YARN-1861
 URL: https://issues.apache.org/jira/browse/YARN-1861
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Affects Versions: 2.4.0
Reporter: Arpit Gupta
Assignee: Vinod Kumar Vavilapalli
Priority: Critical

 In our HA tests we noticed that the tests got stuck because both RM's got 
 into standby state and no one became active.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1897) Define SignalContainerRequest and SignalContainerResponse

2014-04-14 Thread Xuan Gong (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13968521#comment-13968521
 ] 

Xuan Gong commented on YARN-1897:
-

[~mingma]
bq. For SignalContainerResponse, what is the semantics of isCMDCompleted? If we 
want to support synchronous signal container call and this flag indicates 
whether ContainerExecutor has signaled on the container successfully, that will 
require RM to wait for the response from NM after NM finishes the work; it 
implies ApplicationClientProtocol's signalContainer method will hold up a RPC 
handler for some period of time; we can have some time out or rate limiting on 
signalContainer call to make sure applications won't be able to consume all 
RM's RPC handlers. If isCMDCompleted means if the command has been submitted to 
RM successfully, then it is ok; or we can use exception to indicate failure of 
the request.

OK. We should try our best to do it asynchronously. We will rely on the node 
heartbeat to send the container command to the related NM. After the NM 
executes the command, it can send the response (whether the command finished 
successfully) back to the RM with the node heartbeat, too. But this brings up 
another question: because we cannot control how long the NM takes to execute 
the command and send the result back to the RM, we cannot give a definite time 
for how long the client should wait for the response. Also, we need to consider 
RM restart, RM failover, etc. 

To make progress, I think that for now checking whether the command has been 
submitted to the RM successfully (check whether the container exists, whether 
the container has already been killed, etc.) might be fine. 

So, keep isCMDCompleted in SignalContainerResponse? What do you think?

 Define SignalContainerRequest and SignalContainerResponse
 -

 Key: YARN-1897
 URL: https://issues.apache.org/jira/browse/YARN-1897
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: api
Reporter: Ming Ma

 We need to define SignalContainerRequest and SignalContainerResponse first as 
 they are needed by other sub tasks. SignalContainerRequest should use 
 OS-independent commands and provide a way to application to specify reason 
 for diagnosis. SignalContainerResponse might be empty.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1939) Improve the packaging of AmIpFilter

2014-04-14 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13968531#comment-13968531
 ] 

Thomas Graves commented on YARN-1939:
-

Looks like I was mistaken. The web proxy is actually in its own jar. Closing 
this.

 Improve the packaging of AmIpFilter
 ---

 Key: YARN-1939
 URL: https://issues.apache.org/jira/browse/YARN-1939
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: api, webapp
Affects Versions: 2.4.0
Reporter: Thomas Graves

 It is recommended that applications use the AmIpFilter to properly secure 
 any WebUI that is specific to that application.  The AmIpFilter is packaged 
 in org.apache.hadoop.yarn.server.webproxy.amfilter, which requires an 
 application to pull in yarn-server as a dependency; that isn't very user 
 friendly for applications wanting to pick up the bare minimum.
 We should improve the packaging so that it can be pulled in independently. 
  We do need to be careful to keep it backwards compatible, at least in the 2.x 
 release line.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (YARN-1939) Improve the packaging of AmIpFilter

2014-04-14 Thread Thomas Graves (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1939?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves resolved YARN-1939.
-

Resolution: Invalid

 Improve the packaging of AmIpFilter
 ---

 Key: YARN-1939
 URL: https://issues.apache.org/jira/browse/YARN-1939
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: api, webapp
Affects Versions: 2.4.0
Reporter: Thomas Graves

 It is recommended that applications use the AmIpFilter to properly secure 
 any WebUI that is specific to that application.  The AmIpFilter is packaged 
 in org.apache.hadoop.yarn.server.webproxy.amfilter, which requires an 
 application to pull in yarn-server as a dependency; that isn't very user 
 friendly for applications wanting to pick up the bare minimum.
 We should improve the packaging so that it can be pulled in independently. 
  We do need to be careful to keep it backwards compatible, at least in the 2.x 
 release line.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1897) Define SignalContainerRequest and SignalContainerResponse

2014-04-14 Thread Ming Ma (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13968544#comment-13968544
 ] 

Ming Ma commented on YARN-1897:
---

Sounds good. How about IsCMDSubmissionCompleted?
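
For concreteness, a hypothetical sketch of the response record shape being 
discussed in this thread; the names and semantics below are only proposals 
from this discussion, not committed API:

{code}
// Hypothetical shape of the record under discussion.
public abstract class SignalContainerResponse {
  /**
   * True once the signal command has been accepted by the RM (the container
   * exists, has not already been killed, etc.). It does NOT mean the NM has
   * finished executing the command.
   */
  public abstract boolean isCMDSubmissionCompleted();

  public abstract void setCMDSubmissionCompleted(boolean submitted);
}
{code}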

 Define SignalContainerRequest and SignalContainerResponse
 -

 Key: YARN-1897
 URL: https://issues.apache.org/jira/browse/YARN-1897
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: api
Reporter: Ming Ma

 We need to define SignalContainerRequest and SignalContainerResponse first as 
 they are needed by other sub tasks. SignalContainerRequest should use 
 OS-independent commands and provide a way to application to specify reason 
 for diagnosis. SignalContainerResponse might be empty.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-435) Make it easier to access cluster topology information in an AM

2014-04-14 Thread Bikas Saha (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13968558#comment-13968558
 ] 

Bikas Saha commented on YARN-435:
-

Pasting the description from YARN-1722 that was closed as a dup of this.
{code}There is no way for an AM to find out the names of all the nodes in the 
cluster via the AMRMProtocol. An AM can only at best ask for containers at * 
location. The only way to get that information is via the ClientRMProtocol but 
that is secured by Kerberos or RMDelegationToken while the AM has an AMRMToken. 
This is a pretty important piece of missing functionality. There are other 
jiras opened about getting cluster topology etc. but they haven't been 
addressed, perhaps due to the lack of a clear definition of cluster topology. 
Adding a means to at least get the node information would be a good first 
step.{code}
This jira may have stalled in trying to figure out how to lay out the topology. 
YARN-1722 simply asks for the list of nodes in the cluster. While defining a 
way to generically describe topology may be tricky, all such methods must list 
all the nodes in the cluster, so YARN-1722 is a much simpler problem. Whatever 
object we define for topology can start with a simple list of nodes and then 
use the integer ids of the nodes in the list (for compaction) to reference them 
in the other objects describing the hierarchy.
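
For reference, the existing client-side route mentioned above (the one an AM 
holding only an AMRMToken cannot use) looks roughly like this with YarnClient; 
this is just a sketch of the status quo, not the proposed AMRMProtocol change:

{code}
import java.util.List;
import org.apache.hadoop.yarn.api.records.NodeReport;
import org.apache.hadoop.yarn.api.records.NodeState;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class ListClusterNodes {
  public static void main(String[] args) throws Exception {
    // ClientRMProtocol path: requires Kerberos credentials or an
    // RMDelegationToken, which is exactly what an AM does not have.
    YarnClient client = YarnClient.createYarnClient();
    client.init(new YarnConfiguration());
    client.start();
    try {
      List<NodeReport> nodes = client.getNodeReports(NodeState.RUNNING);
      for (NodeReport node : nodes) {
        System.out.println(node.getNodeId() + " rack=" + node.getRackName());
      }
    } finally {
      client.stop();
    }
  }
}
{code}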

 Make it easier to access cluster topology information in an AM
 --

 Key: YARN-435
 URL: https://issues.apache.org/jira/browse/YARN-435
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Hitesh Shah
Assignee: Omkar Vinit Joshi

 ClientRMProtocol exposes a getClusterNodes api that provides a report on all 
 nodes in the cluster including their rack information. 
 However, this requires the AM to open and establish a separate connection to 
 the RM in addition to one for the AMRMProtocol. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1929) DeadLock in RM when automatic failover is enabled.

2014-04-14 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13968653#comment-13968653
 ] 

Steve Loughran commented on YARN-1929:
--

I'm +1 to the change to CompositeService, as well as making the serviceXYZ 
operations desynchronized (the state entry point in the public method is 
synchronized to prevent re-entrancy).

I'll leave it to others to look at the remaining code and comment.

Now, there is one little quirk from desynchronizing the serviceStart() and 
serviceStop() methods. Although it is still impossible to have >1 thread 
successfully entering either method, there is the sequence
{code}

Thread 1 : service.start()
Thread 1:  service.serviceStart() begins

Thread 2 : service.stop()
Thread 2:  service.serviceStop() begins
Thread 2:  service.serviceStop() completes

Thread 1: service start completes
{code}

That's because we're not making any attempt to include transitive states; it 
generally makes things too complex, and that includes handling the problem of 
what the policy is if I try to call stop midway through starting.

 DeadLock in RM when automatic failover is enabled.
 --

 Key: YARN-1929
 URL: https://issues.apache.org/jira/browse/YARN-1929
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
 Environment: Yarn HA cluster
Reporter: Rohith
Assignee: Karthik Kambatla
Priority: Blocker
 Attachments: yarn-1929-1.patch


 Dead lock detected  in RM when automatic failover is enabled.
 {noformat}
 Found one Java-level deadlock:
 =
 Thread-2:
   waiting to lock monitor 0x7fb514303cf0 (object 0xef153fd0, a 
 org.apache.hadoop.ha.ActiveStandbyElector),
   which is held by main-EventThread
 main-EventThread:
   waiting to lock monitor 0x7fb514750a48 (object 0xef154020, a 
 org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService),
   which is held by Thread-2
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1879) Mark Idempotent/AtMostOnce annotations to ApplicationMasterProtocol

2014-04-14 Thread Jian He (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13968659#comment-13968659
 ] 

Jian He commented on YARN-1879:
---

RetryCache is used to handle retries at *RPC level* to return the previous 
response for duplicate non-idempotent requests. Allocate() call already has a 
similar retry cache mechanism by checking the request Id of the request, but 
that also serves the *App level* retry.  [~vinodkv], is it fine to keep both?

some comments on the patch:
testAPIsWithRetryCache:
- Assert exception type inside catch block.
{code}
  } catch (InvalidApplicationMasterRequestException e) {
  }
} catch (InvalidApplicationMasterRequestException e) {
// InvalidApplicationMasterRequestException is thrown
// after expiring RetryCache
  }
{code}
- TestNamenodeRetryCache class comment is useful for  
TestApplicationMasterServiceRetryCache too, we can copy that over.
- org.junit.Assert.assertEquals -> assertEquals

some suggestions on the configs to conform with existing YarnConfigs (a naming 
sketch follows below):
- RM_APPMASTER_ENABLE_RETRY_CACHE_KEY -> RM_RETRY_CACHE_ENABLED
- Remove the APPMASTER in the config name so that the same config can be used 
for other services later on?
- Use RM_PREFIX instead of YARN_PREFIX;
- RM_APPMASTER_ENABLE_RETRY_CACHE_DEFAULT -> DEFAULT_RM_RETRY_CACHE_ENABLED to 
conform with the naming of other default configs.
- RM_APPMASTER_RETRY_CACHE_HEAP_PERCENT_KEY -> RM_RETRY_CACHE_HEAP_PERCENT
- RM_APPMASTER_RETRY_CACHE_EXPIRYTIME_MILLIS_KEY -> RM_RETRY_CACHE_EXPIRY_MS
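
As a quick illustration of the naming convention suggested above; the key 
strings and default values here are placeholders, not the committed names:

{code}
// Hypothetical constants following the suggested convention: RM_PREFIX-based
// keys and DEFAULT_-prefixed defaults, with no APPMASTER in the names
// (assuming these live in YarnConfiguration, where RM_PREFIX is defined).
public static final String RM_RETRY_CACHE_ENABLED =
    RM_PREFIX + "retry-cache.enabled";
public static final boolean DEFAULT_RM_RETRY_CACHE_ENABLED = false;

public static final String RM_RETRY_CACHE_HEAP_PERCENT =
    RM_PREFIX + "retry-cache.heap-percent";
public static final float DEFAULT_RM_RETRY_CACHE_HEAP_PERCENT = 0.03f;

public static final String RM_RETRY_CACHE_EXPIRY_MS =
    RM_PREFIX + "retry-cache.expiry-ms";
public static final long DEFAULT_RM_RETRY_CACHE_EXPIRY_MS = 600000L;
{code}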

 Mark Idempotent/AtMostOnce annotations to ApplicationMasterProtocol
 ---

 Key: YARN-1879
 URL: https://issues.apache.org/jira/browse/YARN-1879
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager
Reporter: Jian He
Assignee: Tsuyoshi OZAWA
Priority: Critical
 Attachments: YARN-1879.1.patch, YARN-1879.1.patch, 
 YARN-1879.2-wip.patch, YARN-1879.2.patch, YARN-1879.3.patch, YARN-1879.4.patch






--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-1929) DeadLock in RM when automatic failover is enabled.

2014-04-14 Thread Karthik Kambatla (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karthik Kambatla updated YARN-1929:
---

Attachment: yarn-1929-2.patch

Here is a new patch that adds a test.

Thanks Steve for taking a look at the patch and for the offline input on 
removing the synchronization from CompositeService#stop.

[~jianhe], [~xgong] - will either of you be able to take a look at the patch?

 DeadLock in RM when automatic failover is enabled.
 --

 Key: YARN-1929
 URL: https://issues.apache.org/jira/browse/YARN-1929
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
 Environment: Yarn HA cluster
Reporter: Rohith
Assignee: Karthik Kambatla
Priority: Blocker
 Attachments: yarn-1929-1.patch, yarn-1929-2.patch


 Dead lock detected  in RM when automatic failover is enabled.
 {noformat}
 Found one Java-level deadlock:
 =
 Thread-2:
   waiting to lock monitor 0x7fb514303cf0 (object 0xef153fd0, a 
 org.apache.hadoop.ha.ActiveStandbyElector),
   which is held by main-EventThread
 main-EventThread:
   waiting to lock monitor 0x7fb514750a48 (object 0xef154020, a 
 org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService),
   which is held by Thread-2
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-558) Add ability to completely remove nodemanager from resourcemanager.

2014-04-14 Thread Christian Smith (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13968800#comment-13968800
 ] 

Christian Smith commented on YARN-558:
--

+1 For this as well.  

This is essential for any public (or private) cloud scenario where one wants 
elastic clusters.

 Add ability to completely remove nodemanager from resourcemanager.
 --

 Key: YARN-558
 URL: https://issues.apache.org/jira/browse/YARN-558
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: resourcemanager
Reporter: Garth Goodson
Priority: Minor
  Labels: feature

 I would like to add the ability to completely remove a nodemanager from the 
 resourcemanager's state.
 I run a cloud service where I want to dynamically bring up nodes to act as 
 nodemanagers and then bring them down again when not needed.  These nodes 
 have dynamically assigned IPs, thus the alternative of decommissioning them 
 via an excludes file leads to a large (unbounded) list of decommissioned 
 nodes that may never be commissioned again. I would like the ability to move 
 a node from a decommissioned state to completely removing it from the 
 resource manager.
 I have thought of two ways of implementing this.
 1) Add an optional timeout between the decommission state - being removed 
 from the nodemanager.
 2) Add an explicit RPC to remove a node that is decommissioned.
 Any additional thoughts/discussion are welcome.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1929) DeadLock in RM when automatic failover is enabled.

2014-04-14 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13968808#comment-13968808
 ] 

Karthik Kambatla commented on YARN-1929:


bq. Now, there is one little quirk from desynchronizing the serviceStart() and 
serviceStop() methods. Although it is still impossible to have >1 thread 
successfully entering either method, there is the sequence
AbstractService.stateChangeLock appears to allow a single thread to make any 
state change at a given point in time. In the example, Thread 2's stop would 
wait for Thread 1's start() to complete. No? 

 DeadLock in RM when automatic failover is enabled.
 --

 Key: YARN-1929
 URL: https://issues.apache.org/jira/browse/YARN-1929
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
 Environment: Yarn HA cluster
Reporter: Rohith
Assignee: Karthik Kambatla
Priority: Blocker
 Attachments: yarn-1929-1.patch, yarn-1929-2.patch


 Dead lock detected  in RM when automatic failover is enabled.
 {noformat}
 Found one Java-level deadlock:
 =
 Thread-2:
   waiting to lock monitor 0x7fb514303cf0 (object 0xef153fd0, a 
 org.apache.hadoop.ha.ActiveStandbyElector),
   which is held by main-EventThread
 main-EventThread:
   waiting to lock monitor 0x7fb514750a48 (object 0xef154020, a 
 org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService),
   which is held by Thread-2
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1927) Preemption message shouldn’t be created multiple times for same container-id in ProportionalCapacityPreemptionPolicy

2014-04-14 Thread Carlo Curino (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13968830#comment-13968830
 ] 

Carlo Curino commented on YARN-1927:


I agree with Chris's explanation: we are sustaining an ask, sort of 
continuously publishing our latest belief about preemption (e.g., a massive job 
could release lots of resources and relieve the pressure that caused us to ask 
for containers back)... The choice of how to propagate this information to the AM 
is somewhat irrelevant. The current solution tries to be more or less aligned 
with the general style of the protocols, and it has a couple of nice properties: 1) 
losing a message is not a problem, and 2) it is simple for the AM to 
reconstruct the latest RM opinion about resource preemption (it is simply re-stated 
in each message). Repeating the ask confirms to the AM that the need for 
preemption is still there, so it does convey some useful information to the 
AM. Given the scale/frequency of operations, this doesn't seem to be a perf 
concern either. It is a matter of taste whether one prefers sustaining an 
ask or starting and stopping it.

Since this is a public protocol, I would suggest we consider changing it 
only if there is a substantial gain in expressivity or performance... I don't 
see one at the moment (but I might be missing your point). What you propose 
seems a plausible alternative, but not substantially better than what's already 
there, so I would lean towards leaving it be.  
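
To make property (2) concrete, here is a hedged sketch of how an AM can rebuild 
the RM's latest preemption ask on every heartbeat; the PreemptionMessage and 
StrictPreemptionContract accessors are named from memory and should be checked 
against the AllocateResponse API of the release in use:

{noformat}
// Hedged sketch: the AM replaces, rather than accumulates, its view of which
// containers the RM currently wants back. Dropping one heartbeat is harmless
// because the next response restates the full ask.
private Set<ContainerId> askedBack = new HashSet<ContainerId>();

void onAllocateResponse(AllocateResponse response) {
  Set<ContainerId> latest = new HashSet<ContainerId>();
  PreemptionMessage msg = response.getPreemptionMessage();
  if (msg != null && msg.getStrictContract() != null) {
    for (PreemptionContainer c : msg.getStrictContract().getContainers()) {
      latest.add(c.getId());
    }
  }
  askedBack = latest;   // latest RM opinion, re-stated in each message
}
{noformat}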



 Preemption message shouldn’t be created multiple times for same container-id 
 in ProportionalCapacityPreemptionPolicy
 

 Key: YARN-1927
 URL: https://issues.apache.org/jira/browse/YARN-1927
 Project: Hadoop YARN
  Issue Type: Bug
  Components: capacityscheduler
Affects Versions: 2.4.0
Reporter: Wangda Tan
Assignee: Wangda Tan
Priority: Minor
 Attachments: YARN-1927.patch


 Currently, after each editSchedule() call, a preemption message is 
 created and sent to the scheduler. ProportionalCapacityPreemptionPolicy should 
 only send the preemption message once for each container.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (YARN-1940) deleteAsUser() terminates early without deleting more files on error

2014-04-14 Thread Kihwal Lee (JIRA)
Kihwal Lee created YARN-1940:


 Summary: deleteAsUser() terminates early without deleting more 
files on error
 Key: YARN-1940
 URL: https://issues.apache.org/jira/browse/YARN-1940
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Kihwal Lee


In container-executor.c, delete_path() returns early when unlink() against a 
file or a symlink fails. We have seen many cases of the error being ENOENT, 
which can safely be ignored during delete.  

This is what we saw recently: An app mistakenly created a large number of files 
in the local directory and the deletion service failed to delete a significant 
portion of them due to this bug. Repeatedly hitting this on the same node led 
to exhaustion of inodes in one of the partitions.

Besides ignoring ENOENT, delete_path() can simply skip the failed entry and 
continue in some cases, rather than aborting and leaving files behind.
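
Since container-executor itself is C code, the following is only a hedged Java 
illustration of the intended skip-and-continue semantics: "already gone" is 
treated as success, and other per-entry failures are logged and skipped rather 
than aborting the whole tree.

{noformat}
// Hedged Java illustration of best-effort recursive delete semantics,
// not the C delete_path() implementation.
import java.io.IOException;
import java.nio.file.*;
import java.nio.file.attribute.BasicFileAttributes;

public class BestEffortDelete {
  public static void deleteTree(Path root) throws IOException {
    Files.walkFileTree(root, new SimpleFileVisitor<Path>() {
      @Override
      public FileVisitResult visitFile(Path f, BasicFileAttributes a) {
        try {
          Files.deleteIfExists(f);          // the ENOENT equivalent is not an error
        } catch (IOException e) {
          System.err.println("skipping " + f + ": " + e);  // skip and continue
        }
        return FileVisitResult.CONTINUE;
      }
      @Override
      public FileVisitResult postVisitDirectory(Path d, IOException e) {
        try {
          Files.deleteIfExists(d);          // directory may already be gone
        } catch (IOException ex) {
          System.err.println("skipping " + d + ": " + ex);
        }
        return FileVisitResult.CONTINUE;
      }
      @Override
      public FileVisitResult visitFileFailed(Path f, IOException e) {
        return FileVisitResult.CONTINUE;    // e.g. entry vanished mid-walk
      }
    });
  }
}
{noformat}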



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1929) DeadLock in RM when automatic failover is enabled.

2014-04-14 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13968874#comment-13968874
 ] 

Hadoop QA commented on YARN-1929:
-

{color:green}+1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12640136/yarn-1929-2.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 1 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-common-project/hadoop-common 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/3567//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3567//console

This message is automatically generated.

 DeadLock in RM when automatic failover is enabled.
 --

 Key: YARN-1929
 URL: https://issues.apache.org/jira/browse/YARN-1929
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
 Environment: Yarn HA cluster
Reporter: Rohith
Assignee: Karthik Kambatla
Priority: Blocker
 Attachments: yarn-1929-1.patch, yarn-1929-2.patch


 Deadlock detected in RM when automatic failover is enabled.
 {noformat}
 Found one Java-level deadlock:
 =
 Thread-2:
   waiting to lock monitor 0x7fb514303cf0 (object 0xef153fd0, a 
 org.apache.hadoop.ha.ActiveStandbyElector),
   which is held by main-EventThread
 main-EventThread:
   waiting to lock monitor 0x7fb514750a48 (object 0xef154020, a 
 org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService),
   which is held by Thread-2
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (YARN-1190) enabling uber mode with 0 reducer still requires mapreduce.reduce.memory.mb to be less than yarn.app.mapreduce.am.resource.mb

2014-04-14 Thread Siqi Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Siqi Li resolved YARN-1190.
---

Resolution: Duplicate
  Assignee: Siqi Li

 enabling uber mode with 0 reducer still requires mapreduce.reduce.memory.mb 
 to be less than yarn.app.mapreduce.am.resource.mb
 -

 Key: YARN-1190
 URL: https://issues.apache.org/jira/browse/YARN-1190
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 2.1.1-beta, 2.0.6-alpha
Reporter: Siqi Li
Assignee: Siqi Li
Priority: Minor
 Attachments: YARN-1190_v1.patch.txt


 Since there is no reducer, the memory allocated to the reducer is irrelevant 
 for enabling uber mode for a job.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-1902) Allocation of too many containers when a second request is done with the same resource capability

2014-04-14 Thread Sietse T. Au (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sietse T. Au updated YARN-1902:
---

Affects Version/s: 2.4.0

 Allocation of too many containers when a second request is done with the same 
 resource capability
 -

 Key: YARN-1902
 URL: https://issues.apache.org/jira/browse/YARN-1902
 Project: Hadoop YARN
  Issue Type: Bug
  Components: client
Affects Versions: 2.2.0, 2.3.0, 2.4.0
Reporter: Sietse T. Au
  Labels: patch
 Attachments: YARN-1902.patch, YARN-1902.v2.patch, YARN-1902.v3.patch


 Regarding AMRMClientImpl
 Scenario 1:
 Given a ContainerRequest x with Resource y: addContainerRequest is 
 called z times with x, allocate is called, and at least one of the z allocated 
 containers is started. If another addContainerRequest call is then made, 
 followed by an allocate call to the RM, (z+1) containers will be allocated 
 where 1 container is expected.
 Scenario 2:
 No containers are started between the allocate calls.
 Analyzing debug logs of the AMRMClientImpl, I have found that (z+1) containers 
 are indeed requested in both scenarios, but the correct behavior is observed 
 only in the second scenario.
 Looking at the implementation, I have found that this (z+1) request is caused 
 by the structure of the remoteRequestsTable. The consequence of the 
 Map<Resource, ResourceRequestInfo> structure is that ResourceRequestInfo does 
 not hold any information about whether a request has already been sent to the RM.
 There are workarounds for this, such as releasing the excess containers 
 received.
 The solution implemented is to initialize a new ResourceRequest in 
 ResourceRequestInfo when a request has been successfully sent to the RM.
 The patch includes a test in which scenario one is tested.
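
For anyone hitting this before a fix lands, here is a hedged sketch of the 
release-the-excess workaround mentioned above; launch() is a hypothetical 
application callback, and the AMRMClient/AllocateResponse method names are as 
recalled and should be verified against the version in use.

{noformat}
// Hedged sketch of the "release the excess" workaround: keep a count of
// containers the AM still needs, and hand back any surplus the RM allocates.
private int containersStillNeeded;   // maintained by the AM as work is queued

void handleAllocation(AllocateResponse response) {
  for (Container c : response.getAllocatedContainers()) {
    if (containersStillNeeded > 0) {
      containersStillNeeded--;
      launch(c);                                       // hypothetical app callback
    } else {
      amRMClient.releaseAssignedContainer(c.getId());  // give back the surplus
    }
  }
}
{noformat}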



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1391) Lost node list should be identify by NodeId

2014-04-14 Thread Siqi Li (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13968906#comment-13968906
 ] 

Siqi Li commented on YARN-1391:
---

I think this patch can still be applied to the trunk

 Lost node list should be identify by NodeId
 ---

 Key: YARN-1391
 URL: https://issues.apache.org/jira/browse/YARN-1391
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.0.5-alpha
Reporter: Siqi Li
Assignee: Siqi Li
 Attachments: YARN-1391.v1.patch


 In the case of multiple node managers on a single machine, each of them should be 
 identified by NodeId, which is more specific than just the host name.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (YARN-1415) In scheduler UI, including used memory in Memory Total seems to be inaccurate

2014-04-14 Thread Siqi Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Siqi Li resolved YARN-1415.
---

Resolution: Not a Problem

 In scheduler UI, including used memory in Memory Total seems to be 
 inaccurate
 ---

 Key: YARN-1415
 URL: https://issues.apache.org/jira/browse/YARN-1415
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager, scheduler
Reporter: Siqi Li
 Fix For: 2.1.0-beta

 Attachments: 1.png, 2.png


 Memory Total is currently a sum of availableMB, allocatedMB, and 
 reservedMB. 
 It seems that the term availableMB actually means total memory, since it 
 doesn't decrease when jobs use a certain amount of memory.
 Hence, either Memory Total should not include allocatedMB, or availableMB 
 is not being updated properly.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1414) with Fair Scheduler reserved MB in WebUI is leaking when killing waiting jobs

2014-04-14 Thread Siqi Li (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13968909#comment-13968909
 ] 

Siqi Li commented on YARN-1414:
---

For now, this patch is able to update the leaf queues and the root queue correctly. Can 
someone take a look at this and merge it into the latest branch?

 with Fair Scheduler reserved MB in WebUI is leaking when killing waiting jobs
 -

 Key: YARN-1414
 URL: https://issues.apache.org/jira/browse/YARN-1414
 Project: Hadoop YARN
  Issue Type: Sub-task
  Components: resourcemanager, scheduler
Affects Versions: 2.0.5-alpha
Reporter: Siqi Li
Assignee: Siqi Li
 Fix For: 2.2.0

 Attachments: YARN-1221-subtask.v1.patch.txt, YARN-1221-v2.patch






--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1929) DeadLock in RM when automatic failover is enabled.

2014-04-14 Thread Jian He (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13968966#comment-13968966
 ] 

Jian He commented on YARN-1929:
---

Patch looks good to me; I'll wait for others to also take a look.

 DeadLock in RM when automatic failover is enabled.
 --

 Key: YARN-1929
 URL: https://issues.apache.org/jira/browse/YARN-1929
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
 Environment: Yarn HA cluster
Reporter: Rohith
Assignee: Karthik Kambatla
Priority: Blocker
 Attachments: yarn-1929-1.patch, yarn-1929-2.patch


 Deadlock detected in RM when automatic failover is enabled.
 {noformat}
 Found one Java-level deadlock:
 =
 Thread-2:
   waiting to lock monitor 0x7fb514303cf0 (object 0xef153fd0, a 
 org.apache.hadoop.ha.ActiveStandbyElector),
   which is held by main-EventThread
 main-EventThread:
   waiting to lock monitor 0x7fb514750a48 (object 0xef154020, a 
 org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService),
   which is held by Thread-2
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1934) Potential NPE in ZKRMStateStore caused by handling Disconnected event from ZK.

2014-04-14 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13968965#comment-13968965
 ] 

Karthik Kambatla commented on YARN-1934:


Ideally, we should guard access to zkClient with a getZkClient() method 
that attempts a new connection if zkClient is null.
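
A hedged sketch of that guard; createConnection() here stands in for whatever 
(re)connect logic ZKRMStateStore actually uses and is named only for illustration.

{noformat}
// Hedged sketch of the suggested guard: never touch the field directly, and
// re-establish the session if the Disconnected handler nulled it out.
private synchronized ZooKeeper getZkClient() throws IOException, InterruptedException {
  if (zkClient == null) {
    createConnection();   // hypothetical (re)connect helper
  }
  return zkClient;
}
{noformat}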

 Potential NPE in ZKRMStateStore caused by handling Disconnected event from ZK.
 --

 Key: YARN-1934
 URL: https://issues.apache.org/jira/browse/YARN-1934
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.4.0
Reporter: Rohith
Assignee: Karthik Kambatla
Priority: Critical
 Attachments: RM.txt


 On a ZK Disconnected event, zkClient is set to null, which makes subsequent 
 accesses prone to throwing an NPE.
 {noformat}
 case Disconnected:
   LOG.info("ZKRMStateStore Session disconnected");
   oldZkClient = zkClient;
   zkClient = null;
   break;
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1934) Potential NPE in ZKRMStateStore caused by handling Disconnected event from ZK.

2014-04-14 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13968967#comment-13968967
 ] 

Karthik Kambatla commented on YARN-1934:


Or, move all uses to runWithCheck().

 Potential NPE in ZKRMStateStore caused by handling Disconnected event from ZK.
 --

 Key: YARN-1934
 URL: https://issues.apache.org/jira/browse/YARN-1934
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.4.0
Reporter: Rohith
Assignee: Karthik Kambatla
Priority: Critical
 Attachments: RM.txt


 On a ZK Disconnected event, zkClient is set to null, which makes subsequent 
 accesses prone to throwing an NPE.
 {noformat}
 case Disconnected:
   LOG.info("ZKRMStateStore Session disconnected");
   oldZkClient = zkClient;
   zkClient = null;
   break;
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1934) Potential NPE in ZKRMStateStore caused by handling Disconnected event from ZK.

2014-04-14 Thread Jian He (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13968980#comment-13968980
 ] 

Jian He commented on YARN-1934:
---

I think we should have all uses of zkClient go through runWithCheck().

 Potential NPE in ZKRMStateStore caused by handling Disconnected event from ZK.
 --

 Key: YARN-1934
 URL: https://issues.apache.org/jira/browse/YARN-1934
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.4.0
Reporter: Rohith
Assignee: Karthik Kambatla
Priority: Critical
 Attachments: RM.txt


 On a ZK Disconnected event, zkClient is set to null, which makes subsequent 
 accesses prone to throwing an NPE.
 {noformat}
 case Disconnected:
   LOG.info("ZKRMStateStore Session disconnected");
   oldZkClient = zkClient;
   zkClient = null;
   break;
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1934) Potential NPE in ZKRMStateStore caused by handling Disconnected event from ZK.

2014-04-14 Thread Karthik Kambatla (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13968987#comment-13968987
 ] 

Karthik Kambatla commented on YARN-1934:


Working on a patch, will try to post something later today or tomorrow. 

 Potential NPE in ZKRMStateStore caused by handling Disconnected event from ZK.
 --

 Key: YARN-1934
 URL: https://issues.apache.org/jira/browse/YARN-1934
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.4.0
Reporter: Rohith
Assignee: Karthik Kambatla
Priority: Critical
 Attachments: RM.txt


 On a ZK Disconnected event, zkClient is set to null, which makes subsequent 
 accesses prone to throwing an NPE.
 {noformat}
 case Disconnected:
   LOG.info("ZKRMStateStore Session disconnected");
   oldZkClient = zkClient;
   zkClient = null;
   break;
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-1934) Potential NPE in ZKRMStateStore caused by handling Disconnected event from ZK.

2014-04-14 Thread Karthik Kambatla (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karthik Kambatla updated YARN-1934:
---

Priority: Blocker  (was: Critical)

 Potential NPE in ZKRMStateStore caused by handling Disconnected event from ZK.
 --

 Key: YARN-1934
 URL: https://issues.apache.org/jira/browse/YARN-1934
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.4.0
Reporter: Rohith
Assignee: Karthik Kambatla
Priority: Blocker
 Attachments: RM.txt


 On a ZK Disconnected event, zkClient is set to null, which makes subsequent 
 accesses prone to throwing an NPE.
 {noformat}
 case Disconnected:
   LOG.info("ZKRMStateStore Session disconnected");
   oldZkClient = zkClient;
   zkClient = null;
   break;
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (YARN-1934) Potential NPE in ZKRMStateStore caused by handling Disconnected event from ZK.

2014-04-14 Thread Karthik Kambatla (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karthik Kambatla updated YARN-1934:
---

Attachment: yarn-1934-0.patch

Here is a patch that replaces all direct references to zkClient and uses 
runWithCheck instead. 

ZKRMStateStore has become very unwieldy and hard to manage. We should 
definitely clean it up. Thinking about it, I think we should just have 
RMZooKeeper and/or RMFencingZooKeeper classes that extend/wrap ZooKeeper and 
override the methods we need. That would make the remaining ZK code much easier 
to read and maintain. I would like to work on this in a separate JIRA targeted for 
2.5.

I haven't added a test in the patch. I am tempted not to, given the intention to 
clean up/revamp the ZK interactions. I can add one if needed. 
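
For readers not familiar with the pattern, a hedged sketch of what a 
runWithCheck-style wrapper looks like; the field and method names here are 
illustrative, not the exact ZKRMStateStore code.

{noformat}
// Every ZK operation goes through a single method that waits for a live
// session instead of dereferencing zkClient directly.
private <T> T runWithCheck(ZKOperation<T> op) throws Exception {
  long deadline = System.currentTimeMillis() + zkSessionTimeout;
  ZooKeeper zk;
  synchronized (this) {
    while (zkClient == null) {                       // nulled on Disconnected
      long remaining = deadline - System.currentTimeMillis();
      if (remaining <= 0) {
        throw new IOException("Wait for ZK connection timed out");
      }
      wait(remaining);                               // woken when reconnected
    }
    zk = zkClient;
  }
  return op.run(zk);                                 // never dereference a null client
}

interface ZKOperation<T> {
  T run(ZooKeeper zk) throws Exception;
}
{noformat}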



 Potential NPE in ZKRMStateStore caused by handling Disconnected event from ZK.
 --

 Key: YARN-1934
 URL: https://issues.apache.org/jira/browse/YARN-1934
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.4.0
Reporter: Rohith
Assignee: Karthik Kambatla
Priority: Blocker
 Attachments: RM.txt, yarn-1934-0.patch


 On a ZK Disconnected event, zkClient is set to null, which makes subsequent 
 accesses prone to throwing an NPE.
 {noformat}
 case Disconnected:
   LOG.info("ZKRMStateStore Session disconnected");
   oldZkClient = zkClient;
   zkClient = null;
   break;
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1934) Potential NPE in ZKRMStateStore caused by handling Disconnected event from ZK.

2014-04-14 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13969092#comment-13969092
 ] 

Hadoop QA commented on YARN-1934:
-

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12640175/yarn-1934-0.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:red}-1 tests included{color}.  The patch doesn't appear to include 
any new or modified tests.
Please justify why no new tests are needed for this 
patch.
Also please list what manual steps were performed to 
verify this patch.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/3568//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3568//console

This message is automatically generated.

 Potential NPE in ZKRMStateStore caused by handling Disconnected event from ZK.
 --

 Key: YARN-1934
 URL: https://issues.apache.org/jira/browse/YARN-1934
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.4.0
Reporter: Rohith
Assignee: Karthik Kambatla
Priority: Blocker
 Attachments: RM.txt, yarn-1934-0.patch


 On a ZK Disconnected event, zkClient is set to null, which makes subsequent 
 accesses prone to throwing an NPE.
 {noformat}
 case Disconnected:
   LOG.info("ZKRMStateStore Session disconnected");
   oldZkClient = zkClient;
   zkClient = null;
   break;
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (YARN-1934) Potential NPE in ZKRMStateStore caused by handling Disconnected event from ZK.

2014-04-14 Thread Rohith (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13969225#comment-13969225
 ] 

Rohith commented on YARN-1934:
--

+1 patch looks good to me :-)

 Potential NPE in ZKRMStateStore caused by handling Disconnected event from ZK.
 --

 Key: YARN-1934
 URL: https://issues.apache.org/jira/browse/YARN-1934
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.4.0
Reporter: Rohith
Assignee: Karthik Kambatla
Priority: Blocker
 Attachments: RM.txt, yarn-1934-0.patch


 On a ZK Disconnected event, zkClient is set to null, which makes subsequent 
 accesses prone to throwing an NPE.
 {noformat}
 case Disconnected:
   LOG.info("ZKRMStateStore Session disconnected");
   oldZkClient = zkClient;
   zkClient = null;
   break;
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.2#6252)