[jira] [Commented] (YARN-181) capacity-scheduler.xml move breaks Eclipse import
[ https://issues.apache.org/jira/browse/YARN-181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13483194#comment-13483194 ]

Hudson commented on YARN-181:
-----------------------------

Integrated in Hadoop-Hdfs-trunk #1205 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1205/])
YARN-181. Fixed eclipse settings broken by capacity-scheduler.xml move via YARN-140. Contributed by Siddharth Seth. (Revision 1401504)

Result = FAILURE
vinodkv : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1401504
Files :
* /hadoop/common/trunk/hadoop-assemblies/src/main/resources/assemblies/hadoop-yarn-dist.xml
* /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/conf/capacity-scheduler.xml
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/conf
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/conf/capacity-scheduler.xml
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/pom.xml

capacity-scheduler.xml move breaks Eclipse import
-------------------------------------------------
Key: YARN-181
URL: https://issues.apache.org/jira/browse/YARN-181
Project: Hadoop YARN
Issue Type: Bug
Components: resourcemanager
Affects Versions: 2.0.2-alpha
Reporter: Siddharth Seth
Assignee: Siddharth Seth
Priority: Critical
Fix For: 2.0.3-alpha
Attachments: YARN181_jenkins.txt, YARN181_postSvnMv.txt, YARN181_svn_mv.sh

Eclipse doesn't seem to handle testResources which resolve to an absolute path. YARN-140 moved capacity-scheduler.cfg a couple of levels up to the hadoop-yarn project.

--
This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators.
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-140) Add capacity-scheduler-default.xml to provide a default set of configurations for the capacity scheduler.
[ https://issues.apache.org/jira/browse/YARN-140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13483196#comment-13483196 ]

Hudson commented on YARN-140:
-----------------------------

Integrated in Hadoop-Hdfs-trunk #1205 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1205/])
YARN-181. Fixed eclipse settings broken by capacity-scheduler.xml move via YARN-140. Contributed by Siddharth Seth. (Revision 1401504)

Result = FAILURE
vinodkv : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1401504
Files :
* /hadoop/common/trunk/hadoop-assemblies/src/main/resources/assemblies/hadoop-yarn-dist.xml
* /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/conf/capacity-scheduler.xml
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/conf
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/conf/capacity-scheduler.xml
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/pom.xml

Add capacity-scheduler-default.xml to provide a default set of configurations for the capacity scheduler.
----------------------------------------------------------------------------------------------------------
Key: YARN-140
URL: https://issues.apache.org/jira/browse/YARN-140
Project: Hadoop YARN
Issue Type: Bug
Components: capacityscheduler
Reporter: Ahmed Radwan
Assignee: Ahmed Radwan
Fix For: 2.0.3-alpha
Attachments: YARN-140.patch, YARN-140_rev2.patch, YARN-140_rev3.patch, YARN-140_rev4.patch, YARN-140_rev5_onlyForJenkins.patch, YARN-140_rev5.patch, YARN-140_rev5_svn_mv.patch, YARN-140_rev6_onlyForJenkins.patch, YARN-140_rev6.patch, YARN-140_rev7_onlyForJenkins.patch, YARN-140_rev8_onlyForJenkins.patch, YARN-140_rev9.patch, YARN-140_rev9_svn_mv.patch

When setting up the capacity scheduler, users are faced with problems like:

{code}
FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error starting ResourceManager
java.lang.IllegalArgumentException: Illegal capacity of -1 for queue root
{code}

This basically arises from missing basic configurations which, in many cases, do not need to be provided explicitly; a default configuration is sufficient. For example, to address the error above, the user needs to add a capacity of 100 to the root queue. So we need to add a capacity-scheduler-default.xml; this will be helpful to provide the basic set of default configurations required to run the capacity scheduler. The user can still override existing configurations or provide new ones in capacity-scheduler.xml. This is similar to *-default.xml vs *-site.xml for yarn, core, mapred, hdfs, etc.

--
This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators.
For more information on JIRA, see: http://www.atlassian.com/software/jira
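To make the error above concrete, here is a minimal sketch of the kind of defaults such a file would supply, expressed as programmatic Configuration entries rather than the actual capacity-scheduler-default.xml contents (which are defined by the patch and not reproduced here). The property keys are the standard CapacityScheduler keys; the class name and everything else is illustrative.

{code}
import org.apache.hadoop.conf.Configuration;

// Minimal sketch, not the committed defaults file: the "Illegal capacity of -1 for
// queue root" failure disappears once the root queue (and a leaf queue under it)
// is given a capacity, which is exactly what a *-default.xml supplies out of the box.
public class MinimalCapacitySchedulerDefaults {
  public static Configuration create() {
    Configuration conf = new Configuration(false);
    conf.set("yarn.scheduler.capacity.root.queues", "default");        // one leaf queue under root
    conf.setInt("yarn.scheduler.capacity.root.capacity", 100);         // fixes the error above
    conf.setInt("yarn.scheduler.capacity.root.default.capacity", 100); // leaf queue gets all of it
    return conf;
  }
}
{code}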
[jira] [Commented] (YARN-177) CapacityScheduler - adding a queue while the RM is running has wacky results
[ https://issues.apache.org/jira/browse/YARN-177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13483211#comment-13483211 ]

Thomas Graves commented on YARN-177:
------------------------------------

+1. Thanks Arun! I'll commit this shortly.

CapacityScheduler - adding a queue while the RM is running has wacky results
-----------------------------------------------------------------------------
Key: YARN-177
URL: https://issues.apache.org/jira/browse/YARN-177
Project: Hadoop YARN
Issue Type: Bug
Components: capacityscheduler
Affects Versions: 0.23.3
Reporter: Thomas Graves
Assignee: Arun C Murthy
Priority: Critical
Fix For: 2.0.3-alpha, 0.23.5
Attachments: YARN-177.patch, YARN-177.patch, YARN-177.patch, YARN-177.patch

Adding a queue to the capacity scheduler while the RM is running, and then running a job in the newly added queue, results in very strange behavior. The cluster Total Memory can either decrease or increase. We had a cluster where total memory decreased to almost 1/6th the capacity. Running on a small test cluster resulted in the capacity going up by simply adding a queue and running wordcount.

Looking at the RM logs, used memory can go negative but other logs show the number positive:

2012-10-21 22:56:44,796 [ResourceManager Event Processor] INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: assignedContainer queue=root usedCapacity=0.0375 absoluteUsedCapacity=0.0375 used=memory: 7680 cluster=memory: 204800
2012-10-21 22:56:45,831 [ResourceManager Event Processor] INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: completedContainer queue=root usedCapacity=-0.0225 absoluteUsedCapacity=-0.0225 used=memory: -4608 cluster=memory: 204800

--
This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators.
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-177) CapacityScheduler - adding a queue while the RM is running has wacky results
[ https://issues.apache.org/jira/browse/YARN-177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13483221#comment-13483221 ]

Hudson commented on YARN-177:
-----------------------------

Integrated in Hadoop-trunk-Commit #2920 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/2920/])
YARN-177. CapacityScheduler - adding a queue while the RM is running has wacky results (acmurthy via tgraves) (Revision 1401668)

Result = SUCCESS
tgraves : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1401668
Files :
* /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CSQueue.java
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/LeafQueue.java
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/ParentQueue.java
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/TestCapacityScheduler.java

CapacityScheduler - adding a queue while the RM is running has wacky results
-----------------------------------------------------------------------------
Key: YARN-177
URL: https://issues.apache.org/jira/browse/YARN-177
Project: Hadoop YARN
Issue Type: Bug
Components: capacityscheduler
Affects Versions: 0.23.3
Reporter: Thomas Graves
Assignee: Arun C Murthy
Priority: Critical
Fix For: 2.0.3-alpha, 0.23.5
Attachments: YARN-177.patch, YARN-177.patch, YARN-177.patch, YARN-177.patch

Adding a queue to the capacity scheduler while the RM is running, and then running a job in the newly added queue, results in very strange behavior. The cluster Total Memory can either decrease or increase. We had a cluster where total memory decreased to almost 1/6th the capacity. Running on a small test cluster resulted in the capacity going up by simply adding a queue and running wordcount.

Looking at the RM logs, used memory can go negative but other logs show the number positive:

2012-10-21 22:56:44,796 [ResourceManager Event Processor] INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: assignedContainer queue=root usedCapacity=0.0375 absoluteUsedCapacity=0.0375 used=memory: 7680 cluster=memory: 204800
2012-10-21 22:56:45,831 [ResourceManager Event Processor] INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue: completedContainer queue=root usedCapacity=-0.0225 absoluteUsedCapacity=-0.0225 used=memory: -4608 cluster=memory: 204800

--
This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators.
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-181) capacity-scheduler.xml move breaks Eclipse import
[ https://issues.apache.org/jira/browse/YARN-181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13483239#comment-13483239 ]

Hudson commented on YARN-181:
-----------------------------

Integrated in Hadoop-Mapreduce-trunk #1235 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1235/])
YARN-181. Fixed eclipse settings broken by capacity-scheduler.xml move via YARN-140. Contributed by Siddharth Seth. (Revision 1401504)

Result = SUCCESS
vinodkv : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1401504
Files :
* /hadoop/common/trunk/hadoop-assemblies/src/main/resources/assemblies/hadoop-yarn-dist.xml
* /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/conf/capacity-scheduler.xml
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/conf
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/conf/capacity-scheduler.xml
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/pom.xml

capacity-scheduler.xml move breaks Eclipse import
-------------------------------------------------
Key: YARN-181
URL: https://issues.apache.org/jira/browse/YARN-181
Project: Hadoop YARN
Issue Type: Bug
Components: resourcemanager
Affects Versions: 2.0.2-alpha
Reporter: Siddharth Seth
Assignee: Siddharth Seth
Priority: Critical
Fix For: 2.0.3-alpha
Attachments: YARN181_jenkins.txt, YARN181_postSvnMv.txt, YARN181_svn_mv.sh

Eclipse doesn't seem to handle testResources which resolve to an absolute path. YARN-140 moved capacity-scheduler.cfg a couple of levels up to the hadoop-yarn project.

--
This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators.
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-140) Add capacity-scheduler-default.xml to provide a default set of configurations for the capacity scheduler.
[ https://issues.apache.org/jira/browse/YARN-140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13483241#comment-13483241 ]

Hudson commented on YARN-140:
-----------------------------

Integrated in Hadoop-Mapreduce-trunk #1235 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1235/])
YARN-181. Fixed eclipse settings broken by capacity-scheduler.xml move via YARN-140. Contributed by Siddharth Seth. (Revision 1401504)

Result = SUCCESS
vinodkv : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1401504
Files :
* /hadoop/common/trunk/hadoop-assemblies/src/main/resources/assemblies/hadoop-yarn-dist.xml
* /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/conf/capacity-scheduler.xml
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/conf
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/conf/capacity-scheduler.xml
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/pom.xml

Add capacity-scheduler-default.xml to provide a default set of configurations for the capacity scheduler.
----------------------------------------------------------------------------------------------------------
Key: YARN-140
URL: https://issues.apache.org/jira/browse/YARN-140
Project: Hadoop YARN
Issue Type: Bug
Components: capacityscheduler
Reporter: Ahmed Radwan
Assignee: Ahmed Radwan
Fix For: 2.0.3-alpha
Attachments: YARN-140.patch, YARN-140_rev2.patch, YARN-140_rev3.patch, YARN-140_rev4.patch, YARN-140_rev5_onlyForJenkins.patch, YARN-140_rev5.patch, YARN-140_rev5_svn_mv.patch, YARN-140_rev6_onlyForJenkins.patch, YARN-140_rev6.patch, YARN-140_rev7_onlyForJenkins.patch, YARN-140_rev8_onlyForJenkins.patch, YARN-140_rev9.patch, YARN-140_rev9_svn_mv.patch

When setting up the capacity scheduler, users are faced with problems like:

{code}
FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error starting ResourceManager
java.lang.IllegalArgumentException: Illegal capacity of -1 for queue root
{code}

This basically arises from missing basic configurations which, in many cases, do not need to be provided explicitly; a default configuration is sufficient. For example, to address the error above, the user needs to add a capacity of 100 to the root queue. So we need to add a capacity-scheduler-default.xml; this will be helpful to provide the basic set of default configurations required to run the capacity scheduler. The user can still override existing configurations or provide new ones in capacity-scheduler.xml. This is similar to *-default.xml vs *-site.xml for yarn, core, mapred, hdfs, etc.

--
This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators.
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-179) Bunch of test failures on trunk
[ https://issues.apache.org/jira/browse/YARN-179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13483242#comment-13483242 ]

Hudson commented on YARN-179:
-----------------------------

Integrated in Hadoop-Mapreduce-trunk #1235 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1235/])
YARN-179. Fix some unit test failures. (Contributed by Vinod Kumar Vavilapalli) (Revision 1401481)

Result = SUCCESS
sseth : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1401481
Files :
* /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-unmanaged-am-launcher/pom.xml
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-unmanaged-am-launcher/src/main/java/org/apache/hadoop/yarn/applications/unmanagedamlauncher/UnmanagedAMLauncher.java
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-unmanaged-am-launcher/src/test/java/org/apache/hadoop/yarn/applications/unmanagedamlauncher/TestUnmanagedAMLauncher.java
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client/pom.xml
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacityScheduler.java

Bunch of test failures on trunk
-------------------------------
Key: YARN-179
URL: https://issues.apache.org/jira/browse/YARN-179
Project: Hadoop YARN
Issue Type: Bug
Components: capacityscheduler
Affects Versions: 2.0.2-alpha
Reporter: Vinod Kumar Vavilapalli
Assignee: Vinod Kumar Vavilapalli
Priority: Blocker
Fix For: 2.0.3-alpha
Attachments: YARN-179-20121022.3.txt, YARN-179-20121022.4.txt

{{CapacityScheduler.setConf()}} mandates a YarnConfiguration. It doesn't need to: throughout all of YARN, components depend only on Configuration and rely on the callers to provide the correct configuration. This is causing multiple tests to fail.

--
This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators.
For more information on JIRA, see: http://www.atlassian.com/software/jira
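For context, a hedged sketch of the looser pattern the description argues for. The class below is illustrative and not the actual CapacityScheduler change: the idea is simply to accept any Configuration and wrap it when a YarnConfiguration is needed internally, rather than rejecting callers (such as tests) that pass a plain Configuration.

{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

// Sketch only: the component keeps working when a test hands it a plain Configuration.
class SchedulerConfSketch {
  private Configuration conf;

  public void setConf(Configuration conf) {
    // Wrap instead of mandating that the caller pass a YarnConfiguration instance.
    this.conf = (conf instanceof YarnConfiguration) ? conf : new YarnConfiguration(conf);
  }

  public Configuration getConf() {
    return conf;
  }
}
{code}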
[jira] [Moved] (YARN-185) Add preemption to CS
[ https://issues.apache.org/jira/browse/YARN-185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Arun C Murthy moved MAPREDUCE-3938 to YARN-185:
------------------------------------------------
Component/s: (was: mrv2)
Key: YARN-185 (was: MAPREDUCE-3938)
Project: Hadoop YARN (was: Hadoop Map/Reduce)

Add preemption to CS
--------------------
Key: YARN-185
URL: https://issues.apache.org/jira/browse/YARN-185
Project: Hadoop YARN
Issue Type: New Feature
Reporter: Arun C Murthy
Assignee: Arun C Murthy

Umbrella jira to track adding preemption to CS, let's track via sub-tasks.

--
This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators.
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-147) Add support for CPU isolation/monitoring of containers
[ https://issues.apache.org/jira/browse/YARN-147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13483255#comment-13483255 ]

Arun C Murthy commented on YARN-147:
------------------------------------

Cancelling patch while comments are addressed, particularly the one Sid raised - we can't break LCE.

Also, we need to make sure this continues to work on RHEL5/CentOS5 which doesn't have cgroups.

One more thing - can we please do reviews/discussions on YARN-3 to ensure we keep track in one place? Thanks.

Add support for CPU isolation/monitoring of containers
-------------------------------------------------------
Key: YARN-147
URL: https://issues.apache.org/jira/browse/YARN-147
Project: Hadoop YARN
Issue Type: Bug
Components: nodemanager
Affects Versions: 2.0.3-alpha
Reporter: Alejandro Abdelnur
Assignee: Andrew Ferguson
Fix For: 2.0.3-alpha
Attachments: YARN-147-v1.patch, YARN-147-v2.patch, YARN-147-v3.patch, YARN-147-v4.patch, YARN-147-v5.patch, YARN-3.patch

This is a clone for YARN-3 to be able to submit the patch as YARN-3 does not show the SUBMIT PATCH button.

--
This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators.
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-180) Capacity scheduler - containers that get reserved create container token too early
[ https://issues.apache.org/jira/browse/YARN-180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13483259#comment-13483259 ]

Robert Joseph Evans commented on YARN-180:
------------------------------------------

Thanks for the review Tom, I'll check it in now. Also the port to 0.23 looks clean, a simple refactoring, so +1 for that too.

Capacity scheduler - containers that get reserved create container token too early
-----------------------------------------------------------------------------------
Key: YARN-180
URL: https://issues.apache.org/jira/browse/YARN-180
Project: Hadoop YARN
Issue Type: Bug
Components: capacityscheduler
Affects Versions: 0.23.3
Reporter: Thomas Graves
Assignee: Arun C Murthy
Priority: Critical
Fix For: 2.0.3-alpha, 0.23.5
Attachments: YARN-180-branch_0.23.patch, YARN-180.patch, YARN-180.patch, YARN-180.patch

The capacity scheduler has the ability to 'reserve' containers. Unfortunately, before it decides that a container goes to reserved rather than assigned, the Container object is created, which creates a container token that expires in roughly 10 minutes by default. This means that by the time the NM frees up enough space on that node for the container to move to assigned, the container token may have expired.

--
This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators.
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-178) Fix custom ProcessTree instance creation
[ https://issues.apache.org/jira/browse/YARN-178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13483262#comment-13483262 ]

Hudson commented on YARN-178:
-----------------------------

Integrated in Hadoop-trunk-Commit #2921 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/2921/])
YARN-178. Fix custom ProcessTree instance creation (Radim Kolar via bobby) (Revision 1401698)

Result = SUCCESS
bobby : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1401698
Files :
* /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/util/ProcfsBasedProcessTree.java
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/util/ResourceCalculatorProcessTree.java
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/util/TestResourceCalculatorProcessTree.java

Fix custom ProcessTree instance creation
----------------------------------------
Key: YARN-178
URL: https://issues.apache.org/jira/browse/YARN-178
Project: Hadoop YARN
Issue Type: Bug
Affects Versions: 3.0.0, 0.23.5
Reporter: Radim Kolar
Assignee: Radim Kolar
Priority: Critical
Fix For: 3.0.0, 2.0.3-alpha, 0.23.5
Attachments: pstree-instance2.txt, pstree-instance.txt

1. Currently, the pluggable ResourceCalculatorProcessTree is not passed the root process id to the custom implementation, making it unusable.
2. The process tree does not extend Configured as it should.

Added a constructor with a pid argument, with a test suite. Also added a test that the process tree is correctly configured.

--
This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators.
For more information on JIRA, see: http://www.atlassian.com/software/jira
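To illustrate point 1 above, a hedged sketch of how a framework can hand the root process id to a pluggable process-tree implementation. The names here are illustrative and this is not the exact ResourceCalculatorProcessTree API; it only shows the reflection pattern the fix relies on.

{code}
import java.lang.reflect.Constructor;

// Sketch only: a custom process-tree plugin is unusable unless it is constructed with
// the root pid of the container it should walk, hence the (String pid) constructor.
final class ProcessTreeFactorySketch {
  static Object newInstance(Class<?> pluginClass, String rootPid) throws Exception {
    Constructor<?> ctor = pluginClass.getConstructor(String.class);
    return ctor.newInstance(rootPid);  // pass the pid in, rather than using a no-arg constructor
  }
}
{code}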
[jira] [Commented] (YARN-3) Add support for CPU isolation/monitoring of containers
[ https://issues.apache.org/jira/browse/YARN-3?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13483266#comment-13483266 ]

Andrew Ferguson commented on YARN-3:
------------------------------------

(replying to comments on YARN-147 here instead as per [~acmurthy]'s request)

thanks for catching that bug [~sseth]! I've updated my git repo [1], and will post a new patch after addressing the review from [~vinodkone].

I successfully tested it quite a bit with and without cgroups back in the summer, but it seems the patch has shifted enough since the testing that I should do it again.

[1] https://github.com/adferguson/hadoop-common/commits/adf-yarn-147

Add support for CPU isolation/monitoring of containers
-------------------------------------------------------
Key: YARN-3
URL: https://issues.apache.org/jira/browse/YARN-3
Project: Hadoop YARN
Issue Type: Sub-task
Reporter: Arun C Murthy
Assignee: Andrew Ferguson
Attachments: mapreduce-4334-design-doc.txt, mapreduce-4334-design-doc-v2.txt, MAPREDUCE-4334-executor-v1.patch, MAPREDUCE-4334-executor-v2.patch, MAPREDUCE-4334-executor-v3.patch, MAPREDUCE-4334-executor-v4.patch, MAPREDUCE-4334-pre1.patch, MAPREDUCE-4334-pre2.patch, MAPREDUCE-4334-pre2-with_cpu.patch, MAPREDUCE-4334-pre3.patch, MAPREDUCE-4334-pre3-with_cpu.patch, MAPREDUCE-4334-v1.patch, MAPREDUCE-4334-v2.patch, YARN-3-lce_only-v1.patch

--
This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators.
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-147) Add support for CPU isolation/monitoring of containers
[ https://issues.apache.org/jira/browse/YARN-147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13483268#comment-13483268 ]

Andrew Ferguson commented on YARN-147:
--------------------------------------

hi [~acmurthy], I've started posting replies on YARN-3 instead. the LCE bug is fixed and I'll post a new patch after addressing [~vinodkv]'s comments. thanks!

Add support for CPU isolation/monitoring of containers
-------------------------------------------------------
Key: YARN-147
URL: https://issues.apache.org/jira/browse/YARN-147
Project: Hadoop YARN
Issue Type: Bug
Components: nodemanager
Affects Versions: 2.0.3-alpha
Reporter: Alejandro Abdelnur
Assignee: Andrew Ferguson
Fix For: 2.0.3-alpha
Attachments: YARN-147-v1.patch, YARN-147-v2.patch, YARN-147-v3.patch, YARN-147-v4.patch, YARN-147-v5.patch, YARN-3.patch

This is a clone for YARN-3 to be able to submit the patch as YARN-3 does not show the SUBMIT PATCH button.

--
This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators.
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-180) Capacity scheduler - containers that get reserved create container token too early
[ https://issues.apache.org/jira/browse/YARN-180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13483270#comment-13483270 ]

Hudson commented on YARN-180:
-----------------------------

Integrated in Hadoop-trunk-Commit #2922 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/2922/])
YARN-180. Capacity scheduler - containers that get reserved create container token too early (acmurthy and bobby) (Revision 1401703)

Result = SUCCESS
bobby : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1401703
Files :
* /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/LeafQueue.java

Capacity scheduler - containers that get reserved create container token too early
-----------------------------------------------------------------------------------
Key: YARN-180
URL: https://issues.apache.org/jira/browse/YARN-180
Project: Hadoop YARN
Issue Type: Bug
Components: capacityscheduler
Affects Versions: 0.23.3
Reporter: Thomas Graves
Assignee: Arun C Murthy
Priority: Critical
Fix For: 3.0.0, 2.0.3-alpha, 0.23.5
Attachments: YARN-180-branch_0.23.patch, YARN-180.patch, YARN-180.patch, YARN-180.patch

The capacity scheduler has the ability to 'reserve' containers. Unfortunately, before it decides that a container goes to reserved rather than assigned, the Container object is created, which creates a container token that expires in roughly 10 minutes by default. This means that by the time the NM frees up enough space on that node for the container to move to assigned, the container token may have expired.

--
This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators.
For more information on JIRA, see: http://www.atlassian.com/software/jira
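As an aside, here is a hedged sketch of the shape of fix the description implies. The real change lives in LeafQueue.java and is not reproduced here; every name below is illustrative. The point is only that the time-limited container token is minted on the assignment path rather than when the request is merely reserved.

{code}
// Sketch only, not the actual LeafQueue code: mint the container token late, at
// assignment time, so a long stay in the "reserved" state cannot expire it.
class AssignmentSketch {
  enum Outcome { ASSIGNED, RESERVED }

  Outcome assignOrReserve(boolean nodeHasEnoughResources) {
    if (nodeHasEnoughResources) {
      ContainerToken token = createContainerToken();  // created only when truly assigned
      launch(token);
      return Outcome.ASSIGNED;
    }
    // Reserve without creating a token; one will be created once space frees up.
    return Outcome.RESERVED;
  }

  ContainerToken createContainerToken() { return new ContainerToken(); }
  void launch(ContainerToken token) { /* hand off to the node manager */ }

  static class ContainerToken { }
}
{code}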
[jira] [Commented] (YARN-139) Interrupted Exception within AsyncDispatcher leads to user confusion
[ https://issues.apache.org/jira/browse/YARN-139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13483298#comment-13483298 ]

Jason Lowe commented on YARN-139:
---------------------------------

+1, thanks Vinod! I'll commit this shortly.

Interrupted Exception within AsyncDispatcher leads to user confusion
---------------------------------------------------------------------
Key: YARN-139
URL: https://issues.apache.org/jira/browse/YARN-139
Project: Hadoop YARN
Issue Type: Bug
Components: api
Affects Versions: 2.0.2-alpha, 0.23.4
Reporter: Nathan Roberts
Assignee: Vinod Kumar Vavilapalli
Attachments: YARN-139-20121019.1.txt, YARN-139-20121019.txt, YARN-139-20121023.txt, YARN-139.txt

Successful applications tend to get InterruptedExceptions during shutdown. The exception is harmless but it leads to lots of user confusion and therefore could be cleaned up.

2012-09-28 14:50:12,477 WARN [AsyncDispatcher event handler] org.apache.hadoop.yarn.event.AsyncDispatcher: Interrupted Exception while stopping
java.lang.InterruptedException
        at java.lang.Object.wait(Native Method)
        at java.lang.Thread.join(Thread.java:1143)
        at java.lang.Thread.join(Thread.java:1196)
        at org.apache.hadoop.yarn.event.AsyncDispatcher.stop(AsyncDispatcher.java:105)
        at org.apache.hadoop.yarn.service.CompositeService.stop(CompositeService.java:99)
        at org.apache.hadoop.yarn.service.CompositeService.stop(CompositeService.java:89)
        at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$JobFinishEventHandler.handle(MRAppMaster.java:437)
        at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$JobFinishEventHandler.handle(MRAppMaster.java:402)
        at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:126)
        at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:75)
        at java.lang.Thread.run(Thread.java:619)
2012-09-28 14:50:12,477 INFO [AsyncDispatcher event handler] org.apache.hadoop.yarn.service.AbstractService: Service:Dispatcher is stopped.
2012-09-28 14:50:12,477 INFO [AsyncDispatcher event handler] org.apache.hadoop.yarn.service.AbstractService: Service:org.apache.hadoop.mapreduce.v2.app.MRAppMaster is stopped.
2012-09-28 14:50:12,477 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Exiting MR AppMaster..GoodBye

--
This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators.
For more information on JIRA, see: http://www.atlassian.com/software/jira
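For readers following along, a hedged sketch of the kind of cleanup the issue calls for. This is not the actual AsyncDispatcher.stop() change (the class and field names are illustrative); it only shows the idea of treating an interrupt while joining the event-handling thread as an expected part of shutdown rather than surfacing a stack trace.

{code}
// Sketch only: swallow the expected InterruptedException during shutdown,
// restore the interrupt flag, and avoid alarming the user with a WARN + stack trace.
class DispatcherStopSketch {
  private final Thread eventHandlingThread;

  DispatcherStopSketch(Thread eventHandlingThread) {
    this.eventHandlingThread = eventHandlingThread;
  }

  public void stop() {
    eventHandlingThread.interrupt();
    try {
      eventHandlingThread.join(1000L);   // bounded wait for the event queue to drain
    } catch (InterruptedException ie) {
      // Expected on shutdown: restore the flag and continue stopping quietly.
      Thread.currentThread().interrupt();
    }
  }
}
{code}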
[jira] [Commented] (YARN-167) AM stuck in KILL_WAIT for days
[ https://issues.apache.org/jira/browse/YARN-167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13483305#comment-13483305 ]

Robert Joseph Evans commented on YARN-167:
------------------------------------------

I am still nervous about pulling in a big change like MAPREDUCE-3353 just to fix a Major bug. I am not going to block this going in if you come up with a patch, but I really want to beat on the patch before we pull it into 0.23. I just want to be sure that it fixes the issue, and does not destabilize anything.

This is only a Major bug because the only time the job gets stuck is when a user sends it a kill command, so the user already wants the job to go away. The job's tasks do go away, but the AM gets stuck and is taking up a small amount of resources on the queue, which is bad, but not the end of the world.

bq. There isn't anything like a missed state that is causing this issue if I understand Ravi's issue description correctly.

bq. Obviously, this could be wrong.

You are correct that the task attempt's state machine cannot really fix this unless it lies, which would be an ugly hack, but it seems that it is not the Task Attempt that is getting stuck. I was thinking that KILL_WAIT is waiting for the wrong things. In TaskImpl, KILL_WAIT ignores T_ATTEMPT_FAILED and T_ATTEMPT_SUCCEEDED, when it should actually be keeping track of all pending attempts and exit KILL_WAIT when all pending attempts have exited, either with a kill, success or failure. It is a bug for TaskImpl to assume that as soon as it sends a KILL to the task attempt that it will beat out all other events and kill the attempt. JobImpl's state machine appears to do something like this already.

AM stuck in KILL_WAIT for days
------------------------------
Key: YARN-167
URL: https://issues.apache.org/jira/browse/YARN-167
Project: Hadoop YARN
Issue Type: Bug
Affects Versions: 0.23.3
Reporter: Ravi Prakash
Assignee: Vinod Kumar Vavilapalli
Attachments: TaskAttemptStateGraph.jpg

We found some jobs were stuck in KILL_WAIT for days on end. The RM shows them as RUNNING. When you go to the AM, it shows it in the KILL_WAIT state, and a few maps running. All these maps were scheduled on nodes which are now in the RM's Lost nodes list. The running maps are in the FAIL_CONTAINER_CLEANUP state.

--
This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators.
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-139) Interrupted Exception within AsyncDispatcher leads to user confusion
[ https://issues.apache.org/jira/browse/YARN-139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13483313#comment-13483313 ]

Hudson commented on YARN-139:
-----------------------------

Integrated in Hadoop-trunk-Commit #2923 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/2923/])
YARN-139. Interrupted Exception within AsyncDispatcher leads to user confusion. Contributed by Vinod Kumar Vavilapalli (Revision 1401726)

Result = SUCCESS
jlowe : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1401726
Files :
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/MRAppMaster.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/TestStagingCleanup.java
* /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/event/AsyncDispatcher.java

Interrupted Exception within AsyncDispatcher leads to user confusion
---------------------------------------------------------------------
Key: YARN-139
URL: https://issues.apache.org/jira/browse/YARN-139
Project: Hadoop YARN
Issue Type: Bug
Components: api
Affects Versions: 2.0.2-alpha, 0.23.4
Reporter: Nathan Roberts
Assignee: Vinod Kumar Vavilapalli
Fix For: 2.0.3-alpha, 0.23.5
Attachments: YARN-139-20121019.1.txt, YARN-139-20121019.txt, YARN-139-20121023.txt, YARN-139.txt

Successful applications tend to get InterruptedExceptions during shutdown. The exception is harmless but it leads to lots of user confusion and therefore could be cleaned up.

2012-09-28 14:50:12,477 WARN [AsyncDispatcher event handler] org.apache.hadoop.yarn.event.AsyncDispatcher: Interrupted Exception while stopping
java.lang.InterruptedException
        at java.lang.Object.wait(Native Method)
        at java.lang.Thread.join(Thread.java:1143)
        at java.lang.Thread.join(Thread.java:1196)
        at org.apache.hadoop.yarn.event.AsyncDispatcher.stop(AsyncDispatcher.java:105)
        at org.apache.hadoop.yarn.service.CompositeService.stop(CompositeService.java:99)
        at org.apache.hadoop.yarn.service.CompositeService.stop(CompositeService.java:89)
        at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$JobFinishEventHandler.handle(MRAppMaster.java:437)
        at org.apache.hadoop.mapreduce.v2.app.MRAppMaster$JobFinishEventHandler.handle(MRAppMaster.java:402)
        at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:126)
        at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:75)
        at java.lang.Thread.run(Thread.java:619)
2012-09-28 14:50:12,477 INFO [AsyncDispatcher event handler] org.apache.hadoop.yarn.service.AbstractService: Service:Dispatcher is stopped.
2012-09-28 14:50:12,477 INFO [AsyncDispatcher event handler] org.apache.hadoop.yarn.service.AbstractService: Service:org.apache.hadoop.mapreduce.v2.app.MRAppMaster is stopped.
2012-09-28 14:50:12,477 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Exiting MR AppMaster..GoodBye

--
This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators.
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-167) AM stuck in KILL_WAIT for days
[ https://issues.apache.org/jira/browse/YARN-167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13483339#comment-13483339 ]

Ravi Prakash commented on YARN-167:
-----------------------------------

bq. This is fine. Job waits for all tasks and taskAttempts to 'finish', not just killed. In this case, TA will succeed and inform the job about the same, so that the job doesn't wait for this task anymore.

Vinod! I'm sorry, I might not be understanding how this happens. In TaskImpl:

{noformat}
    // Ignore-able transitions.
    .addTransition(
        TaskStateInternal.KILL_WAIT,
        TaskStateInternal.KILL_WAIT,
        EnumSet.of(TaskEventType.T_KILL,
            TaskEventType.T_ATTEMPT_LAUNCHED,
            TaskEventType.T_ATTEMPT_COMMIT_PENDING,
            TaskEventType.T_ATTEMPT_FAILED,
            TaskEventType.T_ATTEMPT_SUCCEEDED,
            TaskEventType.T_ADD_SPEC_ATTEMPT))
{noformat}

So when the TaskAttemptImpl does indeed send T_ATTEMPT_SUCCEEDED, it is ignored by the TaskImpl, and its state stays KILL_WAIT. Am I missing something? Can you please point me to the code path?

AM stuck in KILL_WAIT for days
------------------------------
Key: YARN-167
URL: https://issues.apache.org/jira/browse/YARN-167
Project: Hadoop YARN
Issue Type: Bug
Affects Versions: 0.23.3
Reporter: Ravi Prakash
Assignee: Vinod Kumar Vavilapalli
Attachments: TaskAttemptStateGraph.jpg

We found some jobs were stuck in KILL_WAIT for days on end. The RM shows them as RUNNING. When you go to the AM, it shows it in the KILL_WAIT state, and a few maps running. All these maps were scheduled on nodes which are now in the RM's Lost nodes list. The running maps are in the FAIL_CONTAINER_CLEANUP state.

--
This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators.
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-167) AM stuck in KILL_WAIT for days
[ https://issues.apache.org/jira/browse/YARN-167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13483413#comment-13483413 ]

Robert Joseph Evans commented on YARN-167:
------------------------------------------

Looking at the UI for one of the jobs that is stuck in this state and a heap dump for that AM, I can see that the JOB is in KILL_WAIT and so are many of its tasks. But for all of the tasks in KILL_WAIT that I looked at, the task attempts are all in FAILED, and none of them failed because of a node that disappeared. It looks very much like TaskImpl just needs to be able to handle T_ATTEMPT_FAILED and T_ATTEMPT_SUCCEEDED in the KILL_WAIT state, instead of ignoring them. I will look to see if this also exists in 2.0.

I think all we need to do to reproduce this is to launch a large job that will have most of its tasks fail, and then try to kill it before the job fails on its own. This particular job had 2645 map tasks, 634 of them got stuck in KILL_WAIT, 1347 of them were successfully killed and 623 of the tasks finished with a SUCCESS. This was running on a 2,000 node cluster. The failed tasks appeared to take about 20 seconds before they failed, but the last attempts to fail all ended within a second of each other.

AM stuck in KILL_WAIT for days
------------------------------
Key: YARN-167
URL: https://issues.apache.org/jira/browse/YARN-167
Project: Hadoop YARN
Issue Type: Bug
Affects Versions: 0.23.3
Reporter: Ravi Prakash
Assignee: Vinod Kumar Vavilapalli
Attachments: TaskAttemptStateGraph.jpg

We found some jobs were stuck in KILL_WAIT for days on end. The RM shows them as RUNNING. When you go to the AM, it shows it in the KILL_WAIT state, and a few maps running. All these maps were scheduled on nodes which are now in the RM's Lost nodes list. The running maps are in the FAIL_CONTAINER_CLEANUP state.

--
This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators.
For more information on JIRA, see: http://www.atlassian.com/software/jira
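A hedged sketch of the behavior described in the comment above, with illustrative names rather than the real TaskImpl state machine: while in KILL_WAIT, count every attempt that finishes for any reason (killed, failed, or succeeded) and leave KILL_WAIT only once no attempts remain outstanding.

{code}
import java.util.HashSet;
import java.util.Set;

// Sketch only: track live attempts and signal when the last one has finished,
// so KILL_WAIT is not held open by a FAILED or SUCCEEDED attempt that was ignored.
class KillWaitTrackerSketch {
  private final Set<String> liveAttempts = new HashSet<>();

  void attemptLaunched(String attemptId) {
    liveAttempts.add(attemptId);
  }

  /** Called for T_ATTEMPT_KILLED, T_ATTEMPT_FAILED and T_ATTEMPT_SUCCEEDED alike. */
  boolean attemptFinished(String attemptId) {
    liveAttempts.remove(attemptId);
    return liveAttempts.isEmpty();  // true => the task may transition out of KILL_WAIT
  }
}
{code}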
[jira] [Commented] (YARN-3) Add support for CPU isolation/monitoring of containers
[ https://issues.apache.org/jira/browse/YARN-3?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13483412#comment-13483412 ]

Andrew Ferguson commented on YARN-3:
------------------------------------

thanks for the review [~vinodkv]. I'll post an updated patch on YARN-147. there's a lot of food for thought here (design questions), so here are some comments:

bq. yarn.nodemanager.linux-container-executor.cgroups.mount has different defaults in code and in yarn-default.xml

yeah, personally I think the default should be false since it's not clear what a sensible default mount path is. I had changed the line in the code in response to Tucu's comment [1], but I'm changing it back to false since true doesn't seem sensible to me. if anyone in the community has a sensible default mount path, then we can surely change the default to true in both the code and yarn-default.xml :-/

bq. Can you explain this? Is this sleep necessary? Depending on its importance, we'll need to fix the following Id check, AMs don't always have ID equaling one.

the sleep is necessary as sometimes the LCE reports that the container has exited, even though the AM process has not terminated. hence, because the process is still running, we can't remove the cgroup yet; therefore, the code sleeps briefly. since the AM doesn't always have the ID of 1, what do you suggest I do to determine whether the container has the AM or not? if there isn't a good rule, the code can just always sleep before removing the cgroup.

bq. container-executor.c: If a mount-point is already mounted, mount gives an EBUSY error, mount_cgroup() will need to be fixed to support remounts (for e.g. on NM restarts). We could unmount cgroup fs on shutdown but that isn't always guaranteed.

great catch! thanks! I've made this non-fatal. now, the NM will attempt to re-mount the cgroup, will print a message that it can't do that because it's mounted, and everything will work, because it will simply work as in the case where the cluster admin has already mounted the cgroups.

bq. Not sure of the benefit of configurable yarn.nodemanager.linux-container-executor.cgroups.mount-path. Couldn't NM just always mount to a path that it creates and owns? Similar comment for the hierarchy-prefix.

for the hierarchy-prefix, this needs to be configurable since, in the scenario where the admin creates the cgroups in advance, the NM doesn't have privileges to create its own hierarchy. for the mount-path, this is a good question. Linux distributions mount the cgroup controllers in various locations, so I thought it was better to keep it configurable, since I figured it would be confusing if the OS had already mounted some of the cgroup controllers on /cgroup/ or /sys/fs/cgroup/, and then the NM started mounting additional controllers in /path/nm/owns/cgroup/.

bq. CgroupsLCEResourcesHandler is swallowing exceptions and errors in multiple places - updateCgroup() and createCgroup(). In the latter, if cgroups are enabled, and we can't create the file, it is a critical error?

I'm fine either way. what would people prefer to see? is it better to launch a container even if we can't enforce the limits? or is it better to prevent the container from launching? happy to make the necessary quick change.

bq. Make ResourcesHandler top level. I'd like to merge the ContainersMonitor functionality with this so as to monitor/enforce memory limits also. ContainersMonitor is top-level, we should make ResourcesHandler also top-level so that other platforms don't need to create this type-hierarchy all over again when they wish to implement some or all of this functionality.

if I'm reading this correctly, yes, that is what I first wanted to do when I started this patch (see discussions at the top of this YARN-3 thread, the early patches for MAPREDUCE-4334, and the current YARN-4). however, it seems we have decided to go another way.

thank you,
Andrew

[1] https://issues.apache.org/jira/browse/YARN-147?focusedCommentId=13470926&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13470926

Add support for CPU isolation/monitoring of containers
-------------------------------------------------------
Key: YARN-3
URL: https://issues.apache.org/jira/browse/YARN-3
Project: Hadoop YARN
Issue Type: Sub-task
Reporter: Arun C Murthy
Assignee: Andrew Ferguson
Attachments: mapreduce-4334-design-doc.txt, mapreduce-4334-design-doc-v2.txt, MAPREDUCE-4334-executor-v1.patch, MAPREDUCE-4334-executor-v2.patch, MAPREDUCE-4334-executor-v3.patch, MAPREDUCE-4334-executor-v4.patch, MAPREDUCE-4334-pre1.patch, MAPREDUCE-4334-pre2.patch, MAPREDUCE-4334-pre2-with_cpu.patch, MAPREDUCE-4334-pre3.patch, MAPREDUCE-4334-pre3-with_cpu.patch, MAPREDUCE-4334-v1.patch, MAPREDUCE-4334-v2.patch, YARN-3-lce_only-v1.patch

--
This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators.
For more information on JIRA, see: http://www.atlassian.com/software/jira
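A hedged sketch of the "sleep before removing the cgroup" logic discussed in the comment above. This is illustrative, not the actual CgroupsLCEResourcesHandler: the container-exit event can arrive slightly before the last process in the cgroup is gone, so removal of the (empty-only) cgroup directory is retried briefly instead of failing outright. The method, parameters, and retry counts below are all assumptions for illustration.

{code}
import java.io.File;

// Sketch only: rmdir on a cgroup directory succeeds only once it has no tasks left,
// so retry a few times with a short sleep before giving up.
class CgroupCleanupSketch {
  boolean deleteCgroup(String cgroupPath, int maxAttempts, long sleepMillis)
      throws InterruptedException {
    File dir = new File(cgroupPath);
    for (int i = 0; i < maxAttempts; i++) {
      if (dir.delete()) {          // succeeds once the container's processes have exited
        return true;
      }
      Thread.sleep(sleepMillis);   // give the processes a moment to terminate
    }
    return false;                  // caller decides whether this is fatal (see discussion above)
  }
}
{code}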
[jira] [Updated] (YARN-147) Add support for CPU isolation/monitoring of containers
[ https://issues.apache.org/jira/browse/YARN-147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrew Ferguson updated YARN-147:
---------------------------------
Attachment: YARN-147-v6.patch

updated as per reviews on comments here and on YARN-3.

Add support for CPU isolation/monitoring of containers
-------------------------------------------------------
Key: YARN-147
URL: https://issues.apache.org/jira/browse/YARN-147
Project: Hadoop YARN
Issue Type: Bug
Components: nodemanager
Affects Versions: 2.0.3-alpha
Reporter: Alejandro Abdelnur
Assignee: Andrew Ferguson
Fix For: 2.0.3-alpha
Attachments: YARN-147-v1.patch, YARN-147-v2.patch, YARN-147-v3.patch, YARN-147-v4.patch, YARN-147-v5.patch, YARN-147-v6.patch, YARN-3.patch

This is a clone for YARN-3 to be able to submit the patch as YARN-3 does not show the SUBMIT PATCH button.

--
This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators.
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-167) AM stuck in KILL_WAIT for days
[ https://issues.apache.org/jira/browse/YARN-167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13483417#comment-13483417 ]

Robert Joseph Evans commented on YARN-167:
------------------------------------------

Yes, it looks very much like this can also happen in branch-2 and trunk.

I also wanted to mention that the stack traces showed more or less nothing. All of the threads were waiting on I/O or event queues. Nothing was actually processing any data or deadlocked holding some locks.

AM stuck in KILL_WAIT for days
------------------------------
Key: YARN-167
URL: https://issues.apache.org/jira/browse/YARN-167
Project: Hadoop YARN
Issue Type: Bug
Affects Versions: 0.23.3
Reporter: Ravi Prakash
Assignee: Vinod Kumar Vavilapalli
Attachments: TaskAttemptStateGraph.jpg

We found some jobs were stuck in KILL_WAIT for days on end. The RM shows them as RUNNING. When you go to the AM, it shows it in the KILL_WAIT state, and a few maps running. All these maps were scheduled on nodes which are now in the RM's Lost nodes list. The running maps are in the FAIL_CONTAINER_CLEANUP state.

--
This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators.
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-3) Add support for CPU isolation/monitoring of containers
[ https://issues.apache.org/jira/browse/YARN-3?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13483467#comment-13483467 ]

Alejandro Abdelnur commented on YARN-3:
---------------------------------------

bq. CgroupsLCEResourcesHandler is swallowing exceptions

The user expectation is that if Hadoop is configured to use cgroups, then Hadoop is using cgroups. IMO, if we configure Hadoop to use cgroups, and for some reason it cannot, it should be treated as fatal.

bq. Make ResourcesHandler top level

I'd defer this to a follow up patch.

Add support for CPU isolation/monitoring of containers
-------------------------------------------------------
Key: YARN-3
URL: https://issues.apache.org/jira/browse/YARN-3
Project: Hadoop YARN
Issue Type: Sub-task
Reporter: Arun C Murthy
Assignee: Andrew Ferguson
Attachments: mapreduce-4334-design-doc.txt, mapreduce-4334-design-doc-v2.txt, MAPREDUCE-4334-executor-v1.patch, MAPREDUCE-4334-executor-v2.patch, MAPREDUCE-4334-executor-v3.patch, MAPREDUCE-4334-executor-v4.patch, MAPREDUCE-4334-pre1.patch, MAPREDUCE-4334-pre2.patch, MAPREDUCE-4334-pre2-with_cpu.patch, MAPREDUCE-4334-pre3.patch, MAPREDUCE-4334-pre3-with_cpu.patch, MAPREDUCE-4334-v1.patch, MAPREDUCE-4334-v2.patch, YARN-3-lce_only-v1.patch

--
This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators.
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-129) Simplify classpath construction for mini YARN tests
[ https://issues.apache.org/jira/browse/YARN-129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13483874#comment-13483874 ]

Hadoop QA commented on YARN-129:
--------------------------------

{color:red}-1 overall{color}. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12550482/YARN-129.patch
against trunk revision .

{color:green}+1 @author{color}. The patch does not contain any @author tags.

{color:green}+1 tests included{color}. The patch appears to include 2 new or modified test files.

{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.

{color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages.

{color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse.

{color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.

{color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-common hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-unmanaged-am-launcher:

org.apache.hadoop.mapred.TestClusterMRNotification

{color:green}+1 contrib tests{color}. The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-YARN-Build/124//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/124//console

This message is automatically generated.

Simplify classpath construction for mini YARN tests
----------------------------------------------------
Key: YARN-129
URL: https://issues.apache.org/jira/browse/YARN-129
Project: Hadoop YARN
Issue Type: Improvement
Components: client
Reporter: Tom White
Assignee: Tom White
Attachments: YARN-129.patch, YARN-129.patch, YARN-129.patch

The test classpath includes a special file called 'mrapp-generated-classpath' (or similar in distributed shell) that is constructed at build time, and whose contents are a classpath with all the dependencies needed to run the tests. When the classpath for a container (e.g. the AM) is constructed, the contents of mrapp-generated-classpath are read and added to the classpath, and the file itself is then added to the classpath so that later, when the AM constructs a classpath for a task container, it can propagate the test classpath correctly.

This mechanism can be drastically simplified by propagating the system classpath of the current JVM (read from the java.class.path property) to a launched JVM, but only if running in the context of the mini YARN cluster. Any tests that use the mini YARN cluster will automatically work with this change, although any that explicitly deal with mrapp-generated-classpath can be simplified further.

--
This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators.
For more information on JIRA, see: http://www.atlassian.com/software/jira
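A hedged sketch of the simplification described in the issue above. This is not Tom's patch; the class, method, and environment-variable handling are illustrative. The only grounded detail is reading the current JVM's classpath from the java.class.path system property and forwarding it to the launched container, only when running under the mini YARN cluster.

{code}
import java.io.File;
import java.util.Map;

// Sketch only: forward the test JVM's classpath to the container environment instead of
// reading a build-time mrapp-generated-classpath file.
class MiniClusterClasspathSketch {
  static void addTestClasspath(Map<String, String> containerEnv, boolean isMiniYarnCluster) {
    if (!isMiniYarnCluster) {
      return;  // production containers keep the normal classpath construction
    }
    String testClasspath = System.getProperty("java.class.path");
    String existing = containerEnv.get("CLASSPATH");
    containerEnv.put("CLASSPATH",
        existing == null ? testClasspath : existing + File.pathSeparator + testClasspath);
  }
}
{code}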