[jira] [Commented] (YARN-181) capacity-scheduler.xml move breaks Eclipse import

2012-10-24 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13483194#comment-13483194
 ] 

Hudson commented on YARN-181:
-

Integrated in Hadoop-Hdfs-trunk #1205 (See 
[https://builds.apache.org/job/Hadoop-Hdfs-trunk/1205/])
YARN-181. Fixed eclipse settings broken by capacity-scheduler.xml move via 
YARN-140. Contributed by Siddharth Seth. (Revision 1401504)

 Result = FAILURE
vinodkv : 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1401504
Files : 
* 
/hadoop/common/trunk/hadoop-assemblies/src/main/resources/assemblies/hadoop-yarn-dist.xml
* /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/conf/capacity-scheduler.xml
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/conf
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/conf/capacity-scheduler.xml
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/pom.xml


 capacity-scheduler.xml move breaks Eclipse import
 -

 Key: YARN-181
 URL: https://issues.apache.org/jira/browse/YARN-181
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.0.2-alpha
Reporter: Siddharth Seth
Assignee: Siddharth Seth
Priority: Critical
 Fix For: 2.0.3-alpha

 Attachments: YARN181_jenkins.txt, YARN181_postSvnMv.txt, 
 YARN181_svn_mv.sh


 Eclipse doesn't seem to handle testResources which resolve to an absolute 
 path. YARN-140 moved capacity-scheduler.cfg a couple of levels up to the 
 hadoop-yarn project.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-140) Add capacity-scheduler-default.xml to provide a default set of configurations for the capacity scheduler.

2012-10-24 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13483196#comment-13483196
 ] 

Hudson commented on YARN-140:
-

Integrated in Hadoop-Hdfs-trunk #1205 (See 
[https://builds.apache.org/job/Hadoop-Hdfs-trunk/1205/])
YARN-181. Fixed eclipse settings broken by capacity-scheduler.xml move via 
YARN-140. Contributed by Siddharth Seth. (Revision 1401504)

 Result = FAILURE
vinodkv : 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1401504
Files : 
* 
/hadoop/common/trunk/hadoop-assemblies/src/main/resources/assemblies/hadoop-yarn-dist.xml
* /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/conf/capacity-scheduler.xml
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/conf
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/conf/capacity-scheduler.xml
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/pom.xml


 Add capacity-scheduler-default.xml to provide a default set of configurations 
 for the capacity scheduler.
 -

 Key: YARN-140
 URL: https://issues.apache.org/jira/browse/YARN-140
 Project: Hadoop YARN
  Issue Type: Bug
  Components: capacityscheduler
Reporter: Ahmed Radwan
Assignee: Ahmed Radwan
 Fix For: 2.0.3-alpha

 Attachments: YARN-140.patch, YARN-140_rev2.patch, 
 YARN-140_rev3.patch, YARN-140_rev4.patch, YARN-140_rev5_onlyForJenkins.patch, 
 YARN-140_rev5.patch, YARN-140_rev5_svn_mv.patch, 
 YARN-140_rev6_onlyForJenkins.patch, YARN-140_rev6.patch, 
 YARN-140_rev7_onlyForJenkins.patch, YARN-140_rev8_onlyForJenkins.patch, 
 YARN-140_rev9.patch, YARN-140_rev9_svn_mv.patch


 When setting up the capacity scheduler users are faced with problems like:
 {code}
 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error 
 starting ResourceManager
 java.lang.IllegalArgumentException: Illegal capacity of -1 for queue root
 {code}
 This basically arises from missing basic configurations which, in many 
 cases, do not need to be provided explicitly because a default configuration 
 would be sufficient. For example, to address the error above, the user needs 
 to add a capacity of 100 to the root queue.
 So we need to add a capacity-scheduler-default.xml; it will provide the basic 
 set of default configurations required to run the capacity scheduler. The 
 user can still override existing configurations or provide new ones in 
 capacity-scheduler.xml. This is similar to *-default.xml vs. *-site.xml for 
 yarn, core, mapred, hdfs, etc.
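
A hedged sketch of the fallback behaviour the description argues for, written against the plain Configuration API; the property key yarn.scheduler.capacity.root.capacity is the standard capacity-scheduler key, but the class and flow here are illustrative, not the committed patch:

{code}
import org.apache.hadoop.conf.Configuration;

// Minimal sketch: ensure the root queue has a capacity even when the user's
// capacity-scheduler.xml omits it, mirroring the *-default.xml idea above.
public class CapacityDefaultSketch {
  public static void main(String[] args) {
    Configuration conf = new Configuration(false);
    conf.addResource("capacity-scheduler.xml");   // user-provided overrides, if any
    // Without a default, a missing value surfaces as the "-1 capacity" error above.
    if (conf.get("yarn.scheduler.capacity.root.capacity") == null) {
      conf.setFloat("yarn.scheduler.capacity.root.capacity", 100.0f);
    }
    System.out.println(conf.getFloat("yarn.scheduler.capacity.root.capacity", -1.0f));
  }
}
{code}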

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-177) CapacityScheduler - adding a queue while the RM is running has wacky results

2012-10-24 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13483211#comment-13483211
 ] 

Thomas Graves commented on YARN-177:


+1. Thanks Arun!  I'll commit this shortly.

 CapacityScheduler - adding a queue while the RM is running has wacky results
 

 Key: YARN-177
 URL: https://issues.apache.org/jira/browse/YARN-177
 Project: Hadoop YARN
  Issue Type: Bug
  Components: capacityscheduler
Affects Versions: 0.23.3
Reporter: Thomas Graves
Assignee: Arun C Murthy
Priority: Critical
 Fix For: 2.0.3-alpha, 0.23.5

 Attachments: YARN-177.patch, YARN-177.patch, YARN-177.patch, 
 YARN-177.patch


 Adding a queue to the capacity scheduler while the RM is running, and then 
 running a job in the newly added queue, results in very strange behavior. The 
 cluster's total memory can either decrease or increase. We had a cluster where 
 total memory decreased to almost 1/6th of its capacity. On a small test 
 cluster, the capacity went up simply by adding a queue and running wordcount.
 Looking at the RM logs, used memory can go negative even though other logs 
 show the number as positive:
 2012-10-21 22:56:44,796 [ResourceManager Event Processor] INFO 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
 assignedContainer queue=root usedCapacity=0.0375 absoluteUsedCapacity=0.0375 
 used=memory: 7680 cluster=memory: 204800
 2012-10-21 22:56:45,831 [ResourceManager Event Processor] INFO 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
 completedContainer queue=root usedCapacity=-0.0225 
 absoluteUsedCapacity=-0.0225 used=memory: -4608 cluster=memory: 204800
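
Purely as an illustration of the log lines above (not scheduler code), the usedCapacity values are consistent with a plain used-memory/cluster-memory ratio, which is how the negative figure shows up:

{code}
// Illustrative arithmetic only, reproducing the ratios in the log lines above;
// it assumes usedCapacity = usedMemory / clusterMemory, nothing more.
public class UsedCapacityRatios {
  public static void main(String[] args) {
    float clusterMemory = 204800f;
    System.out.println(7680f / clusterMemory);   // 0.0375  (assignedContainer line)
    System.out.println(-4608f / clusterMemory);  // -0.0225 (completedContainer line)
  }
}
{code}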
   

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-177) CapacityScheduler - adding a queue while the RM is running has wacky results

2012-10-24 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13483221#comment-13483221
 ] 

Hudson commented on YARN-177:
-

Integrated in Hadoop-trunk-Commit #2920 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/2920/])
YARN-177. CapacityScheduler - adding a queue while the RM is running has 
wacky results (acmurthy via tgraves) (Revision 1401668)

 Result = SUCCESS
tgraves : 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1401668
Files : 
* /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CSQueue.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/LeafQueue.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/ParentQueue.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/TestCapacityScheduler.java


 CapacityScheduler - adding a queue while the RM is running has wacky results
 

 Key: YARN-177
 URL: https://issues.apache.org/jira/browse/YARN-177
 Project: Hadoop YARN
  Issue Type: Bug
  Components: capacityscheduler
Affects Versions: 0.23.3
Reporter: Thomas Graves
Assignee: Arun C Murthy
Priority: Critical
 Fix For: 2.0.3-alpha, 0.23.5

 Attachments: YARN-177.patch, YARN-177.patch, YARN-177.patch, 
 YARN-177.patch


 Adding a queue to the capacity scheduler while the RM is running, and then 
 running a job in the newly added queue, results in very strange behavior. The 
 cluster's total memory can either decrease or increase. We had a cluster where 
 total memory decreased to almost 1/6th of its capacity. On a small test 
 cluster, the capacity went up simply by adding a queue and running wordcount.
 Looking at the RM logs, used memory can go negative even though other logs 
 show the number as positive:
 2012-10-21 22:56:44,796 [ResourceManager Event Processor] INFO 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
 assignedContainer queue=root usedCapacity=0.0375 absoluteUsedCapacity=0.0375 
 used=memory: 7680 cluster=memory: 204800
 2012-10-21 22:56:45,831 [ResourceManager Event Processor] INFO 
 org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: 
 completedContainer queue=root usedCapacity=-0.0225 
 absoluteUsedCapacity=-0.0225 used=memory: -4608 cluster=memory: 204800
   

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-181) capacity-scheduler.xml move breaks Eclipse import

2012-10-24 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13483239#comment-13483239
 ] 

Hudson commented on YARN-181:
-

Integrated in Hadoop-Mapreduce-trunk #1235 (See 
[https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1235/])
YARN-181. Fixed eclipse settings broken by capacity-scheduler.xml move via 
YARN-140. Contributed by Siddharth Seth. (Revision 1401504)

 Result = SUCCESS
vinodkv : 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1401504
Files : 
* 
/hadoop/common/trunk/hadoop-assemblies/src/main/resources/assemblies/hadoop-yarn-dist.xml
* /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/conf/capacity-scheduler.xml
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/conf
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/conf/capacity-scheduler.xml
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/pom.xml


 capacity-scheduler.xml move breaks Eclipse import
 -

 Key: YARN-181
 URL: https://issues.apache.org/jira/browse/YARN-181
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.0.2-alpha
Reporter: Siddharth Seth
Assignee: Siddharth Seth
Priority: Critical
 Fix For: 2.0.3-alpha

 Attachments: YARN181_jenkins.txt, YARN181_postSvnMv.txt, 
 YARN181_svn_mv.sh


 Eclipse doesn't seem to handle testResources which resolve to an absolute 
 path. YARN-140 moved capacity-scheduler.cfg a couple of levels up to the 
 hadoop-yarn project.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-140) Add capacity-scheduler-default.xml to provide a default set of configurations for the capacity scheduler.

2012-10-24 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13483241#comment-13483241
 ] 

Hudson commented on YARN-140:
-

Integrated in Hadoop-Mapreduce-trunk #1235 (See 
[https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1235/])
YARN-181. Fixed eclipse settings broken by capacity-scheduler.xml move via 
YARN-140. Contributed by Siddharth Seth. (Revision 1401504)

 Result = SUCCESS
vinodkv : 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1401504
Files : 
* 
/hadoop/common/trunk/hadoop-assemblies/src/main/resources/assemblies/hadoop-yarn-dist.xml
* /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/conf/capacity-scheduler.xml
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/conf
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/conf/capacity-scheduler.xml
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/pom.xml


 Add capacity-scheduler-default.xml to provide a default set of configurations 
 for the capacity scheduler.
 -

 Key: YARN-140
 URL: https://issues.apache.org/jira/browse/YARN-140
 Project: Hadoop YARN
  Issue Type: Bug
  Components: capacityscheduler
Reporter: Ahmed Radwan
Assignee: Ahmed Radwan
 Fix For: 2.0.3-alpha

 Attachments: YARN-140.patch, YARN-140_rev2.patch, 
 YARN-140_rev3.patch, YARN-140_rev4.patch, YARN-140_rev5_onlyForJenkins.patch, 
 YARN-140_rev5.patch, YARN-140_rev5_svn_mv.patch, 
 YARN-140_rev6_onlyForJenkins.patch, YARN-140_rev6.patch, 
 YARN-140_rev7_onlyForJenkins.patch, YARN-140_rev8_onlyForJenkins.patch, 
 YARN-140_rev9.patch, YARN-140_rev9_svn_mv.patch


 When setting up the capacity scheduler users are faced with problems like:
 {code}
 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error 
 starting ResourceManager
 java.lang.IllegalArgumentException: Illegal capacity of -1 for queue root
 {code}
 This basically arises from missing basic configurations which, in many 
 cases, do not need to be provided explicitly because a default configuration 
 would be sufficient. For example, to address the error above, the user needs 
 to add a capacity of 100 to the root queue.
 So we need to add a capacity-scheduler-default.xml; it will provide the basic 
 set of default configurations required to run the capacity scheduler. The 
 user can still override existing configurations or provide new ones in 
 capacity-scheduler.xml. This is similar to *-default.xml vs. *-site.xml for 
 yarn, core, mapred, hdfs, etc.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-179) Bunch of test failures on trunk

2012-10-24 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13483242#comment-13483242
 ] 

Hudson commented on YARN-179:
-

Integrated in Hadoop-Mapreduce-trunk #1235 (See 
[https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1235/])
YARN-179. Fix some unit test failures. (Contributed by Vinod Kumar 
Vavilapalli) (Revision 1401481)

 Result = SUCCESS
sseth : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1401481
Files : 
* /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-unmanaged-am-launcher/pom.xml
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-unmanaged-am-launcher/src/main/java/org/apache/hadoop/yarn/applications/unmanagedamlauncher/UnmanagedAMLauncher.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-unmanaged-am-launcher/src/test/java/org/apache/hadoop/yarn/applications/unmanagedamlauncher/TestUnmanagedAMLauncher.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client/pom.xml
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacityScheduler.java


 Bunch of test failures on trunk
 ---

 Key: YARN-179
 URL: https://issues.apache.org/jira/browse/YARN-179
 Project: Hadoop YARN
  Issue Type: Bug
  Components: capacityscheduler
Affects Versions: 2.0.2-alpha
Reporter: Vinod Kumar Vavilapalli
Assignee: Vinod Kumar Vavilapalli
Priority: Blocker
 Fix For: 2.0.3-alpha

 Attachments: YARN-179-20121022.3.txt, YARN-179-20121022.4.txt


 {{CapacityScheduler.setConf()}} mandates a YarnConfiguration. It doesn't need 
 to; throughout YARN, components depend only on Configuration and rely on the 
 callers to provide the correct configuration.
 This is causing multiple tests to fail.
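
A hedged sketch of the direction described above, not the committed YARN-179 patch: accept any Configuration in setConf() and wrap it when YARN defaults are needed, instead of mandating a YarnConfiguration from the caller.

{code}
import org.apache.hadoop.conf.Configurable;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

// Illustrative class name; shows the "depend on Configuration only" idea.
public class LenientSetConfSketch implements Configurable {
  private Configuration conf;

  @Override
  public void setConf(Configuration conf) {
    // Wrap instead of mandating: YarnConfiguration(Configuration) layers the
    // YARN defaults on top of whatever the caller handed in.
    this.conf = (conf instanceof YarnConfiguration) ? conf : new YarnConfiguration(conf);
  }

  @Override
  public Configuration getConf() {
    return conf;
  }
}
{code}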

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Moved] (YARN-185) Add preemption to CS

2012-10-24 Thread Arun C Murthy (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arun C Murthy moved MAPREDUCE-3938 to YARN-185:
---

Component/s: (was: mrv2)
Key: YARN-185  (was: MAPREDUCE-3938)
Project: Hadoop YARN  (was: Hadoop Map/Reduce)

 Add preemption to CS
 

 Key: YARN-185
 URL: https://issues.apache.org/jira/browse/YARN-185
 Project: Hadoop YARN
  Issue Type: New Feature
Reporter: Arun C Murthy
Assignee: Arun C Murthy

 Umbrella jira to track adding preemption to CS; let's track the work via sub-tasks.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-147) Add support for CPU isolation/monitoring of containers

2012-10-24 Thread Arun C Murthy (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13483255#comment-13483255
 ] 

Arun C Murthy commented on YARN-147:


Cancelling patch while comments are addressed, particularly the one Sid raised 
- we can't break LCE.

Also, we need to make sure this continues to work on RHEL5/CentOS5, which 
don't have cgroups.

One more thing - can we please do reviews/discussions on YARN-3 to ensure we 
keep track in one place? Thanks.

 Add support for CPU isolation/monitoring of containers
 --

 Key: YARN-147
 URL: https://issues.apache.org/jira/browse/YARN-147
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.0.3-alpha
Reporter: Alejandro Abdelnur
Assignee: Andrew Ferguson
 Fix For: 2.0.3-alpha

 Attachments: YARN-147-v1.patch, YARN-147-v2.patch, YARN-147-v3.patch, 
 YARN-147-v4.patch, YARN-147-v5.patch, YARN-3.patch


 This is a clone for YARN-3 to be able to submit the patch as YARN-3 does not 
 show the SUBMIT PATCH button.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-180) Capacity scheduler - containers that get reserved create container token too early

2012-10-24 Thread Robert Joseph Evans (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13483259#comment-13483259
 ] 

Robert Joseph Evans commented on YARN-180:
--

Thanks for the review Tom, I'll check it in now.  Also the port to 0.23 looks 
clean, a simple refactoring, so +1 for that too.

 Capacity scheduler - containers that get reserved create container token too 
 early
 -

 Key: YARN-180
 URL: https://issues.apache.org/jira/browse/YARN-180
 Project: Hadoop YARN
  Issue Type: Bug
  Components: capacityscheduler
Affects Versions: 0.23.3
Reporter: Thomas Graves
Assignee: Arun C Murthy
Priority: Critical
 Fix For: 2.0.3-alpha, 0.23.5

 Attachments: YARN-180-branch_0.23.patch, YARN-180.patch, 
 YARN-180.patch, YARN-180.patch


 The capacity scheduler has the ability to 'reserve' containers.
 Unfortunately, before it decides that a container goes to reserved rather than 
 assigned, the Container object is created, which creates a container token 
 that expires in roughly 10 minutes by default.
 This means that by the time the NM frees up enough space on that node for the 
 container to move to assigned, the container token may have expired.
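
To make the timing concrete, a toy illustration; the 10-minute figure is the default expiry mentioned above, and everything else (names, timings) is hypothetical rather than RM code:

{code}
import java.util.concurrent.TimeUnit;

// Toy timeline: a token minted at reservation time can be expired by the time
// the node frees enough space and the container finally moves to "assigned".
public class ReservedTokenTimeline {
  public static void main(String[] args) {
    long reservedAt = 0L;                                          // token minted here
    long tokenExpiry = reservedAt + TimeUnit.MINUTES.toMillis(10); // default expiry (approx.)
    long assignedAt = reservedAt + TimeUnit.MINUTES.toMillis(12);  // NM freed space late
    System.out.println("token still valid at assignment? " + (assignedAt < tokenExpiry)); // false
  }
}
{code}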

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-178) Fix custom ProcessTree instance creation

2012-10-24 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13483262#comment-13483262
 ] 

Hudson commented on YARN-178:
-

Integrated in Hadoop-trunk-Commit #2921 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/2921/])
YARN-178. Fix custom ProcessTree instance creation (Radim Kolar via bobby) 
(Revision 1401698)

 Result = SUCCESS
bobby : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1401698
Files : 
* /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/util/ProcfsBasedProcessTree.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/util/ResourceCalculatorProcessTree.java
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/util/TestResourceCalculatorProcessTree.java


 Fix custom ProcessTree instance creation
 

 Key: YARN-178
 URL: https://issues.apache.org/jira/browse/YARN-178
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 3.0.0, 0.23.5
Reporter: Radim Kolar
Assignee: Radim Kolar
Priority: Critical
 Fix For: 3.0.0, 2.0.3-alpha, 0.23.5

 Attachments: pstree-instance2.txt, pstree-instance.txt


 1. Currently, the pluggable ResourceCalculatorProcessTree mechanism does not 
 pass the root process id to the custom implementation, making it unusable.
 2. The process tree does not extend Configured as it should.
 Added a constructor with a pid argument, along with a test suite. Also added 
 a test that the process tree is correctly configured.
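
A hedged sketch of the plugin pattern being fixed, using hypothetical names rather than the real ResourceCalculatorProcessTree API: the custom class is expected to expose a constructor that takes the root process id, and the framework instantiates it reflectively.

{code}
import java.lang.reflect.Constructor;

// Hypothetical helper: instantiate a pluggable process-tree class via its
// (String pid) constructor, which is the piece the description says was missing.
public class ProcessTreePluginSketch {
  public static Object newProcessTree(Class<?> pluginClass, String rootPid) throws Exception {
    Constructor<?> ctor = pluginClass.getConstructor(String.class); // requires a (String pid) constructor
    return ctor.newInstance(rootPid);                               // pass the root process id through
  }
}
{code}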

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-3) Add support for CPU isolation/monitoring of containers

2012-10-24 Thread Andrew Ferguson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13483266#comment-13483266
 ] 

Andrew Ferguson commented on YARN-3:


(replying to comments on YARN-147 here instead as per [~acmurthy]'s request)

thanks for catching that bug [~sseth]! I've updated my git repo [1], and will 
post a new patch after addressing the review from [~vinodkone]. I successfully 
tested it quite a bit with and without cgroups back in the summer, but it seems 
the patch has shifted enough since the testing that I should do it again.

[1] https://github.com/adferguson/hadoop-common/commits/adf-yarn-147

 Add support for CPU isolation/monitoring of containers
 --

 Key: YARN-3
 URL: https://issues.apache.org/jira/browse/YARN-3
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Arun C Murthy
Assignee: Andrew Ferguson
 Attachments: mapreduce-4334-design-doc.txt, 
 mapreduce-4334-design-doc-v2.txt, MAPREDUCE-4334-executor-v1.patch, 
 MAPREDUCE-4334-executor-v2.patch, MAPREDUCE-4334-executor-v3.patch, 
 MAPREDUCE-4334-executor-v4.patch, MAPREDUCE-4334-pre1.patch, 
 MAPREDUCE-4334-pre2.patch, MAPREDUCE-4334-pre2-with_cpu.patch, 
 MAPREDUCE-4334-pre3.patch, MAPREDUCE-4334-pre3-with_cpu.patch, 
 MAPREDUCE-4334-v1.patch, MAPREDUCE-4334-v2.patch, YARN-3-lce_only-v1.patch




--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-147) Add support for CPU isolation/monitoring of containers

2012-10-24 Thread Andrew Ferguson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13483268#comment-13483268
 ] 

Andrew Ferguson commented on YARN-147:
--

hi [~acmurthy], I've started posting replies on YARN-3 instead. the LCE bug is 
fixed and I'll post a new patch after addressing [~vinodkv]'s comments. thanks!

 Add support for CPU isolation/monitoring of containers
 --

 Key: YARN-147
 URL: https://issues.apache.org/jira/browse/YARN-147
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.0.3-alpha
Reporter: Alejandro Abdelnur
Assignee: Andrew Ferguson
 Fix For: 2.0.3-alpha

 Attachments: YARN-147-v1.patch, YARN-147-v2.patch, YARN-147-v3.patch, 
 YARN-147-v4.patch, YARN-147-v5.patch, YARN-3.patch


 This is a clone for YARN-3 to be able to submit the patch as YARN-3 does not 
 show the SUBMIT PATCH button.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-180) Capacity scheduler - containers that get reserved create container token to early

2012-10-24 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13483270#comment-13483270
 ] 

Hudson commented on YARN-180:
-

Integrated in Hadoop-trunk-Commit #2922 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/2922/])
YARN-180. Capacity scheduler - containers that get reserved create 
container token too early (acmurthy and bobby) (Revision 1401703)

 Result = SUCCESS
bobby : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1401703
Files : 
* /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/LeafQueue.java


 Capacity scheduler - containers that get reserved create container token too 
 early
 -

 Key: YARN-180
 URL: https://issues.apache.org/jira/browse/YARN-180
 Project: Hadoop YARN
  Issue Type: Bug
  Components: capacityscheduler
Affects Versions: 0.23.3
Reporter: Thomas Graves
Assignee: Arun C Murthy
Priority: Critical
 Fix For: 3.0.0, 2.0.3-alpha, 0.23.5

 Attachments: YARN-180-branch_0.23.patch, YARN-180.patch, 
 YARN-180.patch, YARN-180.patch


 The capacity scheduler has the ability to 'reserve' containers.
 Unfortunately, before it decides that a container goes to reserved rather than 
 assigned, the Container object is created, which creates a container token 
 that expires in roughly 10 minutes by default.
 This means that by the time the NM frees up enough space on that node for the 
 container to move to assigned, the container token may have expired.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-139) Interrupted Exception within AsyncDispatcher leads to user confusion

2012-10-24 Thread Jason Lowe (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13483298#comment-13483298
 ] 

Jason Lowe commented on YARN-139:
-

+1, thanks Vinod!  I'll commit this shortly.

 Interrupted Exception within AsyncDispatcher leads to user confusion
 

 Key: YARN-139
 URL: https://issues.apache.org/jira/browse/YARN-139
 Project: Hadoop YARN
  Issue Type: Bug
  Components: api
Affects Versions: 2.0.2-alpha, 0.23.4
Reporter: Nathan Roberts
Assignee: Vinod Kumar Vavilapalli
 Attachments: YARN-139-20121019.1.txt, YARN-139-20121019.txt, 
 YARN-139-20121023.txt, YARN-139.txt


 Successful applications tend to get InterruptedExceptions during shutdown. 
 The exception is harmless but it leads to lots of user confusion and 
 therefore could be cleaned up.
 2012-09-28 14:50:12,477 WARN [AsyncDispatcher event handler] 
 org.apache.hadoop.yarn.event.AsyncDispatcher: Interrupted Exception while 
 stopping
 java.lang.InterruptedException
   at java.lang.Object.wait(Native Method)
   at java.lang.Thread.join(Thread.java:1143)
   at java.lang.Thread.join(Thread.java:1196)
   at 
 org.apache.hadoop.yarn.event.AsyncDispatcher.stop(AsyncDispatcher.java:105)
   at 
 org.apache.hadoop.yarn.service.CompositeService.stop(CompositeService.java:99)
   at 
 org.apache.hadoop.yarn.service.CompositeService.stop(CompositeService.java:89)
   at 
 org.apache.hadoop.mapreduce.v2.app.MRAppMaster$JobFinishEventHandler.handle(MRAppMaster.java:437)
   at 
 org.apache.hadoop.mapreduce.v2.app.MRAppMaster$JobFinishEventHandler.handle(MRAppMaster.java:402)
   at 
 org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:126)
   at 
 org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:75)
   at java.lang.Thread.run(Thread.java:619)
 2012-09-28 14:50:12,477 INFO [AsyncDispatcher event handler] 
 org.apache.hadoop.yarn.service.AbstractService: Service:Dispatcher is stopped.
 2012-09-28 14:50:12,477 INFO [AsyncDispatcher event handler] 
 org.apache.hadoop.yarn.service.AbstractService: 
 Service:org.apache.hadoop.mapreduce.v2.app.MRAppMaster is stopped.
 2012-09-28 14:50:12,477 INFO [AsyncDispatcher event handler] 
 org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Exiting MR AppMaster..GoodBye
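
A minimal sketch of the kind of cleanup the description asks for, assuming a 'stopping' flag is available at the call site; this is illustrative, not the committed YARN-139 patch:

{code}
// Sketch: when the dispatcher is already being stopped, an InterruptedException
// from Thread.join() is expected and need not be logged as a WARN with a stack trace.
public class QuietJoinSketch {
  public static void quietJoin(Thread eventHandlingThread, boolean stopping) {
    try {
      eventHandlingThread.join();
    } catch (InterruptedException ie) {
      if (stopping) {
        System.out.println("Interrupted while stopping the dispatcher (expected, ignored)");
      } else {
        Thread.currentThread().interrupt(); // preserve interrupt status otherwise
      }
    }
  }
}
{code}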

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-167) AM stuck in KILL_WAIT for days

2012-10-24 Thread Robert Joseph Evans (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13483305#comment-13483305
 ] 

Robert Joseph Evans commented on YARN-167:
--

I am still nervous about pulling in a big change like MAPREDUCE-3353 just to 
fix a Major bug.  I am not going to block this going in if you come up with a 
patch, but I really want to beat on the patch before we pull it into 0.23.  I 
just want to be sure that it fixes the issue, and does not destabilize 
anything. This is only a Major bug because the only time the job gets stuck is 
when a user sends it a kill command, so the user already wants the job to go 
away.  The job's tasks do go away, but the AM gets stuck and is taking up a 
small amount of resources on the queue, which is bad, but not the end of the 
world.

bq. There isn't anything like a missed state that is causing this issue if I 
understand Ravi's issue description correctly.
bq. Obviously, this could be wrong.

You are correct that the task attempt's state machine cannot really fix this 
unless it lies, which would be an ugly hack, but it seems that it is not the 
Task Attempt that is getting stuck.  I was thinking that KILL_WAIT is waiting 
for the wrong things.  In TaskImpl, KILL_WAIT ignores T_ATTEMPT_FAILED and 
T_ATTEMPT_SUCCEEDED, when it should actually keep track of all pending 
attempts and exit KILL_WAIT only when all of them have exited, whether with a 
kill, success, or failure.  It is a bug for TaskImpl to assume that, as soon as 
it sends a KILL to the task attempt, it will beat out all other events and 
kill the attempt.  JobImpl's state machine appears to do something like this 
already.
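
A hedged sketch of that bookkeeping, kept outside the real TaskImpl/state-machine code; all names here are hypothetical:

{code}
import java.util.HashSet;
import java.util.Set;

// Track in-flight attempts while in KILL_WAIT and leave the state only once
// every attempt has terminated (killed, failed, or succeeded), instead of
// ignoring T_ATTEMPT_FAILED / T_ATTEMPT_SUCCEEDED as the current table does.
public class KillWaitTracker {
  private final Set<String> pendingAttemptIds = new HashSet<String>();

  public void enterKillWait(Set<String> liveAttemptIds) {
    pendingAttemptIds.addAll(liveAttemptIds);
  }

  // Called for killed, failed, and succeeded attempts alike.
  public boolean attemptTerminated(String attemptId) {
    pendingAttemptIds.remove(attemptId);
    return pendingAttemptIds.isEmpty(); // true => the task may exit KILL_WAIT
  }
}
{code}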


 AM stuck in KILL_WAIT for days
 --

 Key: YARN-167
 URL: https://issues.apache.org/jira/browse/YARN-167
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 0.23.3
Reporter: Ravi Prakash
Assignee: Vinod Kumar Vavilapalli
 Attachments: TaskAttemptStateGraph.jpg


 We found some jobs were stuck in KILL_WAIT for days on end. The RM shows them 
 as RUNNING. When you go to the AM, it shows it in the KILL_WAIT state, and a 
 few maps running. All these maps were scheduled on nodes which are now in the 
 RM's Lost nodes list. The running maps are in the FAIL_CONTAINER_CLEANUP state

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-139) Interrupted Exception within AsyncDispatcher leads to user confusion

2012-10-24 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13483313#comment-13483313
 ] 

Hudson commented on YARN-139:
-

Integrated in Hadoop-trunk-Commit #2923 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/2923/])
YARN-139. Interrupted Exception within AsyncDispatcher leads to user 
confusion. Contributed by Vinod Kumar Vavilapalli (Revision 1401726)

 Result = SUCCESS
jlowe : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1401726
Files : 
* 
/hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/MRAppMaster.java
* 
/hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/TestStagingCleanup.java
* /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
* 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/event/AsyncDispatcher.java


 Interrupted Exception within AsyncDispatcher leads to user confusion
 

 Key: YARN-139
 URL: https://issues.apache.org/jira/browse/YARN-139
 Project: Hadoop YARN
  Issue Type: Bug
  Components: api
Affects Versions: 2.0.2-alpha, 0.23.4
Reporter: Nathan Roberts
Assignee: Vinod Kumar Vavilapalli
 Fix For: 2.0.3-alpha, 0.23.5

 Attachments: YARN-139-20121019.1.txt, YARN-139-20121019.txt, 
 YARN-139-20121023.txt, YARN-139.txt


 Successful applications tend to get InterruptedExceptions during shutdown. 
 The exception is harmless but it leads to lots of user confusion and 
 therefore could be cleaned up.
 2012-09-28 14:50:12,477 WARN [AsyncDispatcher event handler] 
 org.apache.hadoop.yarn.event.AsyncDispatcher: Interrupted Exception while 
 stopping
 java.lang.InterruptedException
   at java.lang.Object.wait(Native Method)
   at java.lang.Thread.join(Thread.java:1143)
   at java.lang.Thread.join(Thread.java:1196)
   at 
 org.apache.hadoop.yarn.event.AsyncDispatcher.stop(AsyncDispatcher.java:105)
   at 
 org.apache.hadoop.yarn.service.CompositeService.stop(CompositeService.java:99)
   at 
 org.apache.hadoop.yarn.service.CompositeService.stop(CompositeService.java:89)
   at 
 org.apache.hadoop.mapreduce.v2.app.MRAppMaster$JobFinishEventHandler.handle(MRAppMaster.java:437)
   at 
 org.apache.hadoop.mapreduce.v2.app.MRAppMaster$JobFinishEventHandler.handle(MRAppMaster.java:402)
   at 
 org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:126)
   at 
 org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:75)
   at java.lang.Thread.run(Thread.java:619)
 2012-09-28 14:50:12,477 INFO [AsyncDispatcher event handler] 
 org.apache.hadoop.yarn.service.AbstractService: Service:Dispatcher is stopped.
 2012-09-28 14:50:12,477 INFO [AsyncDispatcher event handler] 
 org.apache.hadoop.yarn.service.AbstractService: 
 Service:org.apache.hadoop.mapreduce.v2.app.MRAppMaster is stopped.
 2012-09-28 14:50:12,477 INFO [AsyncDispatcher event handler] 
 org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Exiting MR AppMaster..GoodBye

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-167) AM stuck in KILL_WAIT for days

2012-10-24 Thread Ravi Prakash (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13483339#comment-13483339
 ] 

Ravi Prakash commented on YARN-167:
---

bq. This is fine. Job waits for all tasks and taskAttempts to 'finish', not 
just killed. In this case, TA will succeed and inform the job about the same, 
so that the job doesn't wait for this task anymore.

Vinod! I'm sorry I might not be understanding how this happens. In TaskImpl : 
{noformat}
// Ignore-able transitions.
.addTransition(
    TaskStateInternal.KILL_WAIT,
    TaskStateInternal.KILL_WAIT,
    EnumSet.of(TaskEventType.T_KILL,
        TaskEventType.T_ATTEMPT_LAUNCHED,
        TaskEventType.T_ATTEMPT_COMMIT_PENDING,
        TaskEventType.T_ATTEMPT_FAILED,
        TaskEventType.T_ATTEMPT_SUCCEEDED,
        TaskEventType.T_ADD_SPEC_ATTEMPT))
{noformat}
So when the TaskAttemptImpl does indeed send T_ATTEMPT_SUCCEEDED, it is ignored 
by the TaskImpl, and its state stays KILL_WAIT. Am I missing something? Can you 
please point me to the code path?

 AM stuck in KILL_WAIT for days
 --

 Key: YARN-167
 URL: https://issues.apache.org/jira/browse/YARN-167
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 0.23.3
Reporter: Ravi Prakash
Assignee: Vinod Kumar Vavilapalli
 Attachments: TaskAttemptStateGraph.jpg


 We found some jobs were stuck in KILL_WAIT for days on end. The RM shows them 
 as RUNNING. When you go to the AM, it shows it in the KILL_WAIT state, and a 
 few maps running. All these maps were scheduled on nodes which are now in the 
 RM's Lost nodes list. The running maps are in the FAIL_CONTAINER_CLEANUP state

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-167) AM stuck in KILL_WAIT for days

2012-10-24 Thread Robert Joseph Evans (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13483413#comment-13483413
 ] 

Robert Joseph Evans commented on YARN-167:
--

Looking at the UI for one of the jobs that is stuck in this state and a heap 
dump for that AM, I can see that the JOB is in KILL_WAIT and so are many of its 
tasks.  But for all of the tasks in KILL_WAIT that I looked at, the task 
attempts are all in FAILED, and none of them failed because of a node that 
disappeared.  It looks very much like TaskImpl just needs to be able to handle 
T_ATTEMPT_FAILED and T_ATTEMPT_SUCCEEDED in the KILL_WAIT state, instead of 
ignoring them.  I will look to see if this also exists in 2.0.  I think all we 
need to do to reproduce this is to launch a large job that will have most of 
its tasks fail, and then try to kill it before the job fails on its own.

This particular job had 2645 map tasks: 634 of them got stuck in KILL_WAIT, 
1347 of them were successfully killed, and 623 of the tasks finished with a 
SUCCESS. This was running on a 2,000 node cluster.  The failed tasks appeared 
to take about 20 seconds before they failed, but the last attempts to fail all 
ended within a second of each other.

 AM stuck in KILL_WAIT for days
 --

 Key: YARN-167
 URL: https://issues.apache.org/jira/browse/YARN-167
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 0.23.3
Reporter: Ravi Prakash
Assignee: Vinod Kumar Vavilapalli
 Attachments: TaskAttemptStateGraph.jpg


 We found some jobs were stuck in KILL_WAIT for days on end. The RM shows them 
 as RUNNING. When you go to the AM, it shows it in the KILL_WAIT state, and a 
 few maps running. All these maps were scheduled on nodes which are now in the 
 RM's Lost nodes list. The running maps are in the FAIL_CONTAINER_CLEANUP state

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-3) Add support for CPU isolation/monitoring of containers

2012-10-24 Thread Andrew Ferguson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13483412#comment-13483412
 ] 

Andrew Ferguson commented on YARN-3:


thanks for the review [~vinodkv]. I'll post an updated patch on YARN-147. 
there's a lot of food for thought here (design questions), so here are some 
comments:

bq. yarn.nodemanager.linux-container-executor.cgroups.mount has different 
defaults in code and in yarn-default.xml

yeah -- personally, I think the default should be false since it's not clear 
what a sensible default mount path is. I had changed the line in the code in 
response to Tucu's comment [1], but I'm changing it back to false since true 
doesn't seem sensible to me. if anyone in the community has a sensible default 
mount path, then we can surely change the default to true in both the code and 
yarn-default.xml :-/

bq. Can you explain this? Is this sleep necessary. Depending on its importance, 
we'll need to fix the following Id check, AMs don't always have ID equaling one.

the sleep is necessary as sometimes the LCE reports that the container has 
exited, even though the AM process has not terminated. hence, because the 
process is still running, we can't remove the cgroup yet; therefore, the code 
sleeps briefly.
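
A rough illustration of that behaviour, with a hypothetical path and timings (the real logic lives in the LCE resources handler, not here): an empty cgroup directory can only be removed once the kernel has reaped every task in it, so deletion is retried after short sleeps.

{code}
import java.io.File;
import java.util.concurrent.TimeUnit;

// Illustrative retry loop: rmdir on a cgroup only succeeds once no tasks remain.
public class CgroupCleanupSketch {
  public static boolean deleteCgroup(File cgroupDir) throws InterruptedException {
    for (int attempt = 0; attempt < 5; attempt++) {
      if (cgroupDir.delete()) {              // succeeds only when the cgroup is empty
        return true;
      }
      TimeUnit.MILLISECONDS.sleep(100);      // exit reported before the process is fully gone
    }
    return false;                            // give up and leave it for later cleanup
  }

  public static void main(String[] args) throws InterruptedException {
    deleteCgroup(new File("/sys/fs/cgroup/cpu/hadoop-yarn/container_example")); // hypothetical path
  }
}
{code}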

since the AM doesn't always have the ID of 1, what do you suggest I do to 
determine whether the container has the AM or not? if there isn't a good rule, 
the code can just always sleep before removing the cgroup.

bq. container-executor.c: If a mount-point is already mounted, mount gives a 
EBUSY error, mount_cgroup() will need to be fixed to support remounts (for e.g. 
on NM restarts). We could unmount cgroup fs on shutdown but that isn't always 
guaranteed.

great catch! thanks! I've made this non-fatal. now, the NM will attempt to 
re-mount the cgroup, will print a message that it can't do that because it's 
already mounted, and everything will work, just as in the case where the 
cluster admin has already mounted the cgroups.

bq. Not sure of the benefit of configurable 
yarn.nodemanager.linux-container-executor.cgroups.mount-path. Couldn't NM just 
always mount to a path that it creates and owns? Similar comment for the 
hierarchy-prefix.

for the hierarchy-prefix, this needs to be configurable since, in the scenario 
where the admin creates the cgroups in advance, the NM doesn't have privileges 
to create its own hierarchy.

for the mount-path, this is a good question. Linux distributions mount the 
cgroup controllers in various locations, so I thought it was better to keep it 
configurable, since I figured it would be confusing if the OS had already 
mounted some of the cgroup controllers on /cgroup/ or /sys/fs/cgroup/, and then 
the NM started mounting additional controllers in /path/nm/owns/cgroup/.

bq. CgroupsLCEResourcesHandler is swallowing exceptions and errors in multiple 
places - updateCgroup() and createCgroup(). In the latter, if cgroups are 
enabled, and we can't create the file, it is a critical error?

I'm fine either way. what would people prefer to see? is it better to launch a 
container even if we can't enforce the limits? or is it better to prevent the 
container from launching? happy to make the necessary quick change.

bq. Make ResourcesHandler top level. I'd like to merge the ContainersMonitor 
functionality with this so as to monitor/enforce memory limits also. 
ContainersMinotor is top-level, we should make ResourcesHandler also top-level 
so that other platforms don't need to create this type-hierarchy all over again 
when they wish to implement some or all of this functionality.

if I'm reading this correctly, yes, that is what I first wanted to do when I 
started this patch (see discussions at the top of this YARN-3 thread, the early 
patches for MAPREDUCE-4334, and the current YARN-4). however, it seems we have 
decided to go another way.



thank you,
Andrew


[1] 
https://issues.apache.org/jira/browse/YARN-147?focusedCommentId=13470926&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13470926

 Add support for CPU isolation/monitoring of containers
 --

 Key: YARN-3
 URL: https://issues.apache.org/jira/browse/YARN-3
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Arun C Murthy
Assignee: Andrew Ferguson
 Attachments: mapreduce-4334-design-doc.txt, 
 mapreduce-4334-design-doc-v2.txt, MAPREDUCE-4334-executor-v1.patch, 
 MAPREDUCE-4334-executor-v2.patch, MAPREDUCE-4334-executor-v3.patch, 
 MAPREDUCE-4334-executor-v4.patch, MAPREDUCE-4334-pre1.patch, 
 MAPREDUCE-4334-pre2.patch, MAPREDUCE-4334-pre2-with_cpu.patch, 
 MAPREDUCE-4334-pre3.patch, MAPREDUCE-4334-pre3-with_cpu.patch, 
 MAPREDUCE-4334-v1.patch, MAPREDUCE-4334-v2.patch, YARN-3-lce_only-v1.patch

[jira] [Updated] (YARN-147) Add support for CPU isolation/monitoring of containers

2012-10-24 Thread Andrew Ferguson (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Ferguson updated YARN-147:
-

Attachment: YARN-147-v6.patch

updated as per reviews on comments here and on YARN-3.

 Add support for CPU isolation/monitoring of containers
 --

 Key: YARN-147
 URL: https://issues.apache.org/jira/browse/YARN-147
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.0.3-alpha
Reporter: Alejandro Abdelnur
Assignee: Andrew Ferguson
 Fix For: 2.0.3-alpha

 Attachments: YARN-147-v1.patch, YARN-147-v2.patch, YARN-147-v3.patch, 
 YARN-147-v4.patch, YARN-147-v5.patch, YARN-147-v6.patch, YARN-3.patch


 This is a clone for YARN-3 to be able to submit the patch as YARN-3 does not 
 show the SUBMIT PATCH button.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-167) AM stuck in KILL_WAIT for days

2012-10-24 Thread Robert Joseph Evans (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13483417#comment-13483417
 ] 

Robert Joseph Evans commented on YARN-167:
--

Yes it looks very much like this can also happen in branch-2, and trunk.  I 
also wanted to mention that the stack traces showed more or less nothing.  All 
of the threads were waiting on I/O or event queues. Nothing was actually 
processing any data or deadlocked holding some locks.

 AM stuck in KILL_WAIT for days
 --

 Key: YARN-167
 URL: https://issues.apache.org/jira/browse/YARN-167
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 0.23.3
Reporter: Ravi Prakash
Assignee: Vinod Kumar Vavilapalli
 Attachments: TaskAttemptStateGraph.jpg


 We found some jobs were stuck in KILL_WAIT for days on end. The RM shows them 
 as RUNNING. When you go to the AM, it shows it in the KILL_WAIT state, and a 
 few maps running. All these maps were scheduled on nodes which are now in the 
 RM's Lost nodes list. The running maps are in the FAIL_CONTAINER_CLEANUP state

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-3) Add support for CPU isolation/monitoring of containers

2012-10-24 Thread Alejandro Abdelnur (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-3?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13483467#comment-13483467
 ] 

Alejandro Abdelnur commented on YARN-3:
---

bq. CgroupsLCEResourcesHandler is swallowing exceptions 

The user expectation is that if Hadoop is configured to use cgroups, then 
Hadoop is using cgroups.

IMO, if we configure Hadoop to use cgroups, and for some reason it cannot, it 
should be treated as fatal. 

bq. Make ResourcesHandler top level

I'd defer this to a follow up patch.

 Add support for CPU isolation/monitoring of containers
 --

 Key: YARN-3
 URL: https://issues.apache.org/jira/browse/YARN-3
 Project: Hadoop YARN
  Issue Type: Sub-task
Reporter: Arun C Murthy
Assignee: Andrew Ferguson
 Attachments: mapreduce-4334-design-doc.txt, 
 mapreduce-4334-design-doc-v2.txt, MAPREDUCE-4334-executor-v1.patch, 
 MAPREDUCE-4334-executor-v2.patch, MAPREDUCE-4334-executor-v3.patch, 
 MAPREDUCE-4334-executor-v4.patch, MAPREDUCE-4334-pre1.patch, 
 MAPREDUCE-4334-pre2.patch, MAPREDUCE-4334-pre2-with_cpu.patch, 
 MAPREDUCE-4334-pre3.patch, MAPREDUCE-4334-pre3-with_cpu.patch, 
 MAPREDUCE-4334-v1.patch, MAPREDUCE-4334-v2.patch, YARN-3-lce_only-v1.patch




--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (YARN-129) Simplify classpath construction for mini YARN tests

2012-10-24 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13483874#comment-13483874
 ] 

Hadoop QA commented on YARN-129:


{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  http://issues.apache.org/jira/secure/attachment/12550482/YARN-129.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 2 new 
or modified test files.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 javadoc{color}.  The javadoc tool did not generate any 
warning messages.

{color:green}+1 eclipse:eclipse{color}.  The patch built with 
eclipse:eclipse.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 core tests{color}.  The patch failed these unit tests in 
hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app 
hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-common 
hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient
 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell
 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-unmanaged-am-launcher:

  org.apache.hadoop.mapred.TestClusterMRNotification

{color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-YARN-Build/124//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/124//console

This message is automatically generated.

 Simplify classpath construction for mini YARN tests
 ---

 Key: YARN-129
 URL: https://issues.apache.org/jira/browse/YARN-129
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: client
Reporter: Tom White
Assignee: Tom White
 Attachments: YARN-129.patch, YARN-129.patch, YARN-129.patch


 The test classpath includes a special file called 'mrapp-generated-classpath' 
 (or similar in distributed shell) that is constructed at build time, and 
 whose contents are a classpath with all the dependencies needed to run the 
 tests. When the classpath for a container (e.g. the AM) is constructed, the 
 contents of mrapp-generated-classpath are read and added to the classpath, and 
 the file itself is then added to the classpath so that later, when the AM 
 constructs a classpath for a task container, it can propagate the test 
 classpath correctly.
 This mechanism can be drastically simplified by propagating the system 
 classpath of the current JVM (read from the java.class.path property) to a 
 launched JVM, but only when running in the context of the mini YARN cluster. 
 Any tests that use the mini YARN cluster will automatically work with this 
 change, although any that explicitly deal with mrapp-generated-classpath can 
 be simplified further.
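
A hedged sketch of the propagation idea described above; the helper and the environment-map shape are hypothetical rather than the committed patch, and the mini-cluster check is assumed to come from a config flag (e.g. yarn.is.minicluster).

{code}
import java.io.File;
import java.util.Map;

// Forward the current JVM's classpath to the launched container, but only when
// running under the mini YARN cluster, replacing mrapp-generated-classpath.
public class MiniClusterClasspathSketch {
  public static void maybePropagate(Map<String, String> containerEnv, boolean isMiniCluster) {
    if (!isMiniCluster) {
      return;                                            // normal clusters are untouched
    }
    String existing = containerEnv.get("CLASSPATH");
    String jvmClasspath = System.getProperty("java.class.path");
    containerEnv.put("CLASSPATH",
        existing == null ? jvmClasspath : existing + File.pathSeparator + jvmClasspath);
  }
}
{code}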

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira