[jira] [Commented] (YARN-569) CapacityScheduler: support for preemption (using a capacity monitor)
[ https://issues.apache.org/jira/browse/YARN-569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13691072#comment-13691072 ] Bikas Saha commented on YARN-569: - Changes look good overall. Didn't look deeply at the preemption heuristics since we shall probably be working on them as we do more experimentation on real workloads. +1. CapacityScheduler: support for preemption (using a capacity monitor) Key: YARN-569 URL: https://issues.apache.org/jira/browse/YARN-569 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler Reporter: Carlo Curino Assignee: Carlo Curino Attachments: 3queues.pdf, CapScheduler_with_preemption.pdf, preemption.2.patch, YARN-569.1.patch, YARN-569.2.patch, YARN-569.3.patch, YARN-569.4.patch, YARN-569.5.patch, YARN-569.6.patch, YARN-569.8.patch, YARN-569.9.patch, YARN-569.patch, YARN-569.patch There is a tension between the fast-paced, reactive role of the CapacityScheduler, which needs to respond quickly to application resource requests and node updates, and the more introspective, time-based considerations needed to observe and correct for capacity balance. For this reason, instead of modifying the delicate mechanisms of the CapacityScheduler directly, we opted to add support for preemption by means of a Capacity Monitor, which can optionally be run as a separate service (much like the NMLivelinessMonitor). The capacity monitor (similar to equivalent functionality in the fair scheduler) runs at fixed intervals (e.g., every 3 seconds), observes the current assignment of resources to queues by the capacity scheduler, performs an off-line computation to determine whether preemption is needed and how best to edit the current schedule to restore capacity balance, and generates events that produce four possible actions: # Container de-reservations # Resource-based preemptions # Container-based preemptions # Container killing The actions listed above are progressively more costly, and it is up to the policy to use them as desired to achieve the rebalancing goals. Note that, due to the lag in the effect of these actions, the policy should operate at the macroscopic level (e.g., preempt tens of containers from a queue) and not try to tightly and continuously micromanage container allocations. - Preemption policy (ProportionalCapacityPreemptionPolicy): - Preemption policies are pluggable by design; in the following we present an initial policy (ProportionalCapacityPreemptionPolicy) we have been experimenting with.
The ProportionalCapacityPreemptionPolicy behaves as follows: # it gathers from the scheduler the state of the queues, in particular their current capacity, guaranteed capacity and pending requests (*) # if there are pending requests from queues that are under capacity, it computes a new ideal balanced state (**) # it computes the set of preemptions needed to repair the current schedule and achieve capacity balance (accounting for natural completion rates, and respecting bounds on the amount of preemption we allow for each round) # it selects which applications to preempt from each over-capacity queue (the last one in the FIFO order) # it removes reservations from the most recently assigned app until the amount of resources to reclaim is obtained, or until no more reservations exist # (if not enough) it issues preemptions for containers from the same application (in reverse chronological order, last assigned container first), again until the target is met or until no containers except the AM container are left # (if not enough) it moves on to unreserve and preempt from the next application # containers that have been asked to be preempted are tracked across executions; if a container remains among those to be preempted for more than a configurable time, it is moved to the list of containers to be forcibly killed. Notes: (*) at the moment, in order to avoid double-counting of the requests, we only look at the ANY part of pending resource requests, which means we might not preempt on behalf of AMs that ask only for specific locations but not ANY. (**) The ideal balanced state is one in which each queue has at least its guaranteed capacity, and the spare capacity is distributed among the queues that want it as a weighted fair share, where the weighting is based on the guaranteed capacity of a queue and the computation runs to a fixed point (a sketch of this computation is shown below). Tunables of the ProportionalCapacityPreemptionPolicy: # observe-only mode (i.e., log the actions it would take, but behave as read-only) # how frequently to run the policy # how long to wait between preemption and kill of a container #
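To make the weighted fair-share computation in (**) concrete, here is a minimal sketch of how spare capacity could be redistributed to a fixed point and per-queue preemption targets derived. The QueueSnapshot type and its fields are illustrative placeholders, not the actual ProportionalCapacityPreemptionPolicy internals, and the real policy additionally bounds per-round preemption and accounts for natural container completion as described above.
{code:java}
import java.util.ArrayList;
import java.util.List;

/** Illustrative queue snapshot; field names are placeholders, not CapacityScheduler types. */
class QueueSnapshot {
  final String name;
  final double guaranteed; // guaranteed capacity, e.g. in MB
  final double used;       // capacity currently used by the queue
  final double pending;    // outstanding (ANY) resource requests
  double ideal;            // computed ideal assignment
  QueueSnapshot(String name, double guaranteed, double used, double pending) {
    this.name = name; this.guaranteed = guaranteed; this.used = used; this.pending = pending;
  }
  double demand() { return used + pending; }
}

public class ProportionalPreemptionSketch {

  /** Give each queue up to its guarantee, then hand out spare capacity as a weighted fair
   *  share (weights = guaranteed capacity), iterating until a fixed point is reached. */
  static void computeIdealAssignment(List<QueueSnapshot> queues, double totalCapacity) {
    double remaining = totalCapacity;
    for (QueueSnapshot q : queues) {
      q.ideal = Math.min(q.guaranteed, q.demand());
      remaining -= q.ideal;
    }
    List<QueueSnapshot> unsatisfied = new ArrayList<>(queues);
    while (remaining > 1e-6 && !unsatisfied.isEmpty()) {
      double totalWeight = 0;
      for (QueueSnapshot q : unsatisfied) totalWeight += q.guaranteed;
      if (totalWeight <= 0) break;
      double distributed = 0;
      List<QueueSnapshot> still = new ArrayList<>();
      for (QueueSnapshot q : unsatisfied) {
        double grant = Math.min(remaining * (q.guaranteed / totalWeight), q.demand() - q.ideal);
        q.ideal += grant;
        distributed += grant;
        if (q.demand() - q.ideal > 1e-6) still.add(q); // still wants more spare capacity
      }
      remaining -= distributed;
      unsatisfied = still;
      if (distributed < 1e-6) break; // fixed point: nobody could absorb more
    }
  }

  /** A queue should be asked to give back whatever it uses beyond its ideal assignment. */
  static double preemptionTarget(QueueSnapshot q) {
    return Math.max(0, q.used - q.ideal);
  }
}
{code}
Queues whose preemptionTarget is positive are then the ones from which the policy picks applications, reservations and containers in the reverse chronological order described above.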
[jira] [Updated] (YARN-827) Need to make Resource arithmetic methods accessible
[ https://issues.apache.org/jira/browse/YARN-827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bikas Saha updated YARN-827: Attachment: YARN-827.MiniYARNFix.patch Committing minor fix in config file that was missed in the original patch. MiniYARNCluster fails to start without it. Patch attached. Need to make Resource arithmetic methods accessible --- Key: YARN-827 URL: https://issues.apache.org/jira/browse/YARN-827 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.1.0-beta Reporter: Bikas Saha Assignee: Jian He Priority: Critical Fix For: 2.1.0-beta Attachments: YARN-827.1.patch, YARN-827.2.patch, YARN-827.MiniYARNFix.patch, YARN-827.patch org.apache.hadoop.yarn.server.resourcemanager.resource has stuff like Resources and Calculators that help compare/add resources etc. Without these users will be forced to replicate the logic, potentially incorrectly. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
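For context on what these helpers provide, the sketch below shows the kind of arithmetic and calculator-based comparison the Resources/ResourceCalculator classes expose once moved into hadoop-yarn-common (package org.apache.hadoop.yarn.util.resource, per the file list in this patch series); treat the exact method set as per the committed code rather than this snippet.
{code:java}
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.util.resource.DefaultResourceCalculator;
import org.apache.hadoop.yarn.util.resource.ResourceCalculator;
import org.apache.hadoop.yarn.util.resource.Resources;

public class ResourceArithmeticExample {
  public static void main(String[] args) {
    ResourceCalculator rc = new DefaultResourceCalculator(); // memory-only comparisons
    Resource cluster = Resources.createResource(16 * 1024, 16);
    Resource ask = Resources.createResource(2 * 1024, 2);
    Resource used = Resources.createResource(6 * 1024, 4);

    // Arithmetic helpers: add/subtract return new Resource instances.
    Resource afterAllocation = Resources.add(used, ask);
    Resource headroom = Resources.subtract(cluster, afterAllocation);

    // Comparison is delegated to the pluggable calculator (Default vs. DominantResource).
    boolean fits = Resources.lessThanOrEqual(rc, cluster, afterAllocation, cluster);

    System.out.println("headroom=" + headroom + " fits=" + fits);
  }
}
{code}
Without these being accessible, an application writer would have to re-implement exactly this arithmetic, which is the risk the issue description points out.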
[jira] [Commented] (YARN-827) Need to make Resource arithmetic methods accessible
[ https://issues.apache.org/jira/browse/YARN-827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13691081#comment-13691081 ] Hudson commented on YARN-827: - Integrated in Hadoop-trunk-Commit #4003 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/4003/]) MiniYARNCluster broken after YARN-827 (bikas) (Revision 1495684) Result = SUCCESS bikas : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1495684 Files : * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-tests/src/test/resources/capacity-scheduler.xml Need to make Resource arithmetic methods accessible --- Key: YARN-827 URL: https://issues.apache.org/jira/browse/YARN-827 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.1.0-beta Reporter: Bikas Saha Assignee: Jian He Priority: Critical Fix For: 2.1.0-beta Attachments: YARN-827.1.patch, YARN-827.2.patch, YARN-827.MiniYARNFix.patch, YARN-827.patch org.apache.hadoop.yarn.server.resourcemanager.resource has stuff like Resources and Calculators that help compare/add resources etc. Without these users will be forced to replicate the logic, potentially incorrectly. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-827) Need to make Resource arithmetic methods accessible
[ https://issues.apache.org/jira/browse/YARN-827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13691104#comment-13691104 ] Hudson commented on YARN-827: - Integrated in Hadoop-Yarn-trunk #248 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/248/]) MiniYARNCluster broken after YARN-827 (bikas) (Revision 1495684) Deleting files missed for YARN-827 (Revision 1495633) YARN-827. Need to make Resource arithmetic methods accessible (Jian He via bikas) (Revision 1495533) Result = FAILURE bikas : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1495684 Files : * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-tests/src/test/resources/capacity-scheduler.xml bikas : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1495633 Files : * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/resource/DefaultResourceCalculator.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/resource/DominantResourceCalculator.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/resource/ResourceCalculator.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/resource/Resources.java bikas : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1495533 Files : * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/util/resource * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/util/resource/DefaultResourceCalculator.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/util/resource/DominantResourceCalculator.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/util/resource/ResourceCalculator.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/util/resource/Resources.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/conf/capacity-scheduler.xml * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/RMAppImpl.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/attempt/RMAppAttemptImpl.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/AppSchedulingInfo.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/QueueMetrics.java * 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/SchedulerUtils.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CSAssignment.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CSQueueUtils.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacityScheduler.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacitySchedulerConfiguration.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacitySchedulerContext.java *
[jira] [Commented] (YARN-866) Add test for class ResourceWeights
[ https://issues.apache.org/jira/browse/YARN-866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13691136#comment-13691136 ] Hudson commented on YARN-866: - Integrated in Hadoop-Hdfs-trunk #1438 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1438/]) YARN-866. Add test for class ResourceWeights. (ywskycn via tucu) (Revision 1495494) Result = FAILURE tucu : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1495494 Files : * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/resource/TestResourceWeights.java Add test for class ResourceWeights -- Key: YARN-866 URL: https://issues.apache.org/jira/browse/YARN-866 Project: Hadoop YARN Issue Type: Test Affects Versions: 2.1.0-beta Reporter: Wei Yan Assignee: Wei Yan Fix For: 2.2.0 Attachments: Yarn-866.patch, Yarn-866.patch, YARN-866.patch Add test case for the class ResourceWeights -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-866) Add test for class ResourceWeights
[ https://issues.apache.org/jira/browse/YARN-866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13691151#comment-13691151 ] Hudson commented on YARN-866: - Integrated in Hadoop-Mapreduce-trunk #1465 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1465/]) YARN-866. Add test for class ResourceWeights. (ywskycn via tucu) (Revision 1495494) Result = FAILURE tucu : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1495494 Files : * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/resource/TestResourceWeights.java Add test for class ResourceWeights -- Key: YARN-866 URL: https://issues.apache.org/jira/browse/YARN-866 Project: Hadoop YARN Issue Type: Test Affects Versions: 2.1.0-beta Reporter: Wei Yan Assignee: Wei Yan Fix For: 2.2.0 Attachments: Yarn-866.patch, Yarn-866.patch, YARN-866.patch Add test case for the class ResourceWeights -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
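As a rough idea of what such a test covers, a minimal sketch is below; it assumes ResourceWeights exposes a single-weight constructor plus per-ResourceType getWeight/setWeight accessors, so check TestResourceWeights.java in the commit above for the actual API and assertions.
{code:java}
import static org.junit.Assert.assertEquals;

import org.apache.hadoop.yarn.server.resourcemanager.resource.ResourceType;
import org.apache.hadoop.yarn.server.resourcemanager.resource.ResourceWeights;
import org.junit.Test;

public class ResourceWeightsSketchTest {

  @Test
  public void testWeights() {
    // Assumption: the single-argument constructor applies the same weight to every type.
    ResourceWeights weights = new ResourceWeights(1.0f);
    assertEquals(1.0f, weights.getWeight(ResourceType.MEMORY), 0.001f);
    assertEquals(1.0f, weights.getWeight(ResourceType.CPU), 0.001f);

    // Assumption: a per-type setter is reflected by the corresponding getter.
    weights.setWeight(ResourceType.CPU, 1.5f);
    assertEquals(1.0f, weights.getWeight(ResourceType.MEMORY), 0.001f);
    assertEquals(1.5f, weights.getWeight(ResourceType.CPU), 0.001f);
  }
}
{code}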
[jira] [Commented] (YARN-827) Need to make Resource arithmetic methods accessible
[ https://issues.apache.org/jira/browse/YARN-827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13691150#comment-13691150 ] Hudson commented on YARN-827: - Integrated in Hadoop-Mapreduce-trunk #1465 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1465/]) MiniYARNCluster broken after YARN-827 (bikas) (Revision 1495684) Deleting files missed for YARN-827 (Revision 1495633) YARN-827. Need to make Resource arithmetic methods accessible (Jian He via bikas) (Revision 1495533) Result = FAILURE bikas : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1495684 Files : * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-tests/src/test/resources/capacity-scheduler.xml bikas : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1495633 Files : * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/resource/DefaultResourceCalculator.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/resource/DominantResourceCalculator.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/resource/ResourceCalculator.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/resource/Resources.java bikas : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1495533 Files : * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/util/resource * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/util/resource/DefaultResourceCalculator.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/util/resource/DominantResourceCalculator.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/util/resource/ResourceCalculator.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/util/resource/Resources.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/conf/capacity-scheduler.xml * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/RMAppImpl.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/attempt/RMAppAttemptImpl.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/AppSchedulingInfo.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/QueueMetrics.java * 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/SchedulerUtils.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CSAssignment.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CSQueueUtils.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacityScheduler.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacitySchedulerConfiguration.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacitySchedulerContext.java *
[jira] [Updated] (YARN-862) ResourceManager and NodeManager versions should match on node registration or error out
[ https://issues.apache.org/jira/browse/YARN-862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Parker updated YARN-862: --- Attachment: YARN-862-b0.23-v2.patch Correct unit test failures ResourceManager and NodeManager versions should match on node registration or error out --- Key: YARN-862 URL: https://issues.apache.org/jira/browse/YARN-862 Project: Hadoop YARN Issue Type: Bug Components: nodemanager, resourcemanager Affects Versions: 0.23.8 Reporter: Robert Parker Assignee: Robert Parker Attachments: YARN-862-b0.23-v1.patch, YARN-862-b0.23-v2.patch For branch-0.23 the versions of the node manager and the resource manager should match to complete a successful registration. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-862) ResourceManager and NodeManager versions should match on node registration or error out
[ https://issues.apache.org/jira/browse/YARN-862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13691171#comment-13691171 ] Hadoop QA commented on YARN-862: {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12589256/YARN-862-b0.23-v2.patch against trunk revision . {color:red}-1 patch{color}. The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-YARN-Build/1384//console This message is automatically generated. ResourceManager and NodeManager versions should match on node registration or error out --- Key: YARN-862 URL: https://issues.apache.org/jira/browse/YARN-862 Project: Hadoop YARN Issue Type: Bug Components: nodemanager, resourcemanager Affects Versions: 0.23.8 Reporter: Robert Parker Assignee: Robert Parker Attachments: YARN-862-b0.23-v1.patch, YARN-862-b0.23-v2.patch For branch-0.23 the versions of the node manager and the resource manager should match to complete a successful registration. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
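The requested behaviour boils down to a guard during node registration; the sketch below is purely illustrative of that check, with a hypothetical class and exception choice rather than the actual branch-0.23 ResourceTrackerService code in the attached patches.
{code:java}
/** Illustrative only: a version guard of the kind YARN-862 proposes for node registration.
 *  The class name, method, and exception below are hypothetical, not the patch itself. */
public class NodeVersionGuard {

  /** Reject registration when the NM and RM report different versions. */
  public static void checkVersionMatch(String nmVersion, String rmVersion) {
    if (nmVersion == null || !nmVersion.equals(rmVersion)) {
      throw new IllegalStateException(
          "NodeManager version " + nmVersion
              + " does not match ResourceManager version " + rmVersion
              + "; refusing registration");
    }
  }

  public static void main(String[] args) {
    checkVersionMatch("0.23.8", "0.23.8"); // matching versions: registration proceeds
    checkVersionMatch("0.23.7", "0.23.8"); // mismatch: throws, i.e. the NM errors out
  }
}
{code}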
[jira] [Commented] (YARN-864) YARN NM leaking containers with CGroups
[ https://issues.apache.org/jira/browse/YARN-864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13691294#comment-13691294 ] Chris Riccomini commented on YARN-864: -- Hey Guys, Container leaking still seems to be happening. [~ojoshi], here's the logs you asked for: {noformat} 10:28:38,753 INFO NodeStatusUpdaterImpl:365 - Node is out of sync with ResourceManager, hence rebooting. 10:28:40,306 INFO NMAuditLogger:89 - USER=criccomi IP=172.18.146.129 OPERATION=Stop Container RequestTARGET=ContainerManageImpl RESULT=SUCCESS APPID=application_1371849977601_0001 CONTAINERID=container_1371849977601_0001_02_01 10:28:40,345 INFO NodeManager:229 - Containers still running on shutdown: [container_1371849977601_0001_02_01, container_1371849977601_0001_02_03, container_1371849977601_0002_02_03, container_1371849977601_0003_01_04, container_1371849977601_0004_01_02] 10:28:40,355 INFO Container:835 - Container container_1371849977601_0001_02_01 transitioned from RUNNING to KILLING 10:28:40,375 INFO ContainerLaunch:300 - Cleaning up container container_1371849977601_0001_02_01 10:28:40,376 INFO NodeManager:236 - Waiting for containers to be killed 10:28:40,377 INFO NodeStatusUpdaterImpl:265 - Sending out status for container: container_id {, app_attempt_id {, application_id {, id: 1, cluster_timestamp: 1371849977601, }, attemptId: 2, }, id: 1, }, state: C_RUNNING, diagnostics: Container killed by the ApplicationMaster.\n, exit_status: -1000, 10:28:40,377 INFO NodeStatusUpdaterImpl:265 - Sending out status for container: container_id {, app_attempt_id {, application_id {, id: 1, cluster_timestamp: 1371849977601, }, attemptId: 2, }, id: 3, }, state: C_RUNNING, diagnostics: , exit_status: -1000, 10:28:40,377 INFO NodeStatusUpdaterImpl:265 - Sending out status for container: container_id {, app_attempt_id {, application_id {, id: 2, cluster_timestamp: 1371849977601, }, attemptId: 2, }, id: 3, }, state: C_RUNNING, diagnostics: , exit_status: -1000, 10:28:40,377 INFO NodeStatusUpdaterImpl:265 - Sending out status for container: container_id {, app_attempt_id {, application_id {, id: 3, cluster_timestamp: 1371849977601, }, attemptId: 1, }, id: 4, }, state: C_RUNNING, diagnostics: , exit_status: -1000, 10:28:40,377 INFO NodeStatusUpdaterImpl:265 - Sending out status for container: container_id {, app_attempt_id {, application_id {, id: 4, cluster_timestamp: 1371849977601, }, attemptId: 1, }, id: 2, }, state: C_RUNNING, diagnostics: , exit_status: -1000, 10:28:41,378 INFO NodeStatusUpdaterImpl:265 - Sending out status for container: container_id {, app_attempt_id {, application_id {, id: 1, cluster_timestamp: 1371849977601, }, attemptId: 2, }, id: 1, }, state: C_RUNNING, diagnostics: Container killed by the ApplicationMaster.\n, exit_status: -1000, 10:28:41,378 INFO NodeStatusUpdaterImpl:265 - Sending out status for container: container_id {, app_attempt_id {, application_id {, id: 1, cluster_timestamp: 1371849977601, }, attemptId: 2, }, id: 3, }, state: C_RUNNING, diagnostics: , exit_status: -1000, 10:28:41,378 INFO NodeStatusUpdaterImpl:265 - Sending out status for container: container_id {, app_attempt_id {, application_id {, id: 2, cluster_timestamp: 1371849977601, }, attemptId: 2, }, id: 3, }, state: C_RUNNING, diagnostics: , exit_status: -1000, 10:28:41,378 INFO NodeStatusUpdaterImpl:265 - Sending out status for container: container_id {, app_attempt_id {, application_id {, id: 3, cluster_timestamp: 1371849977601, }, attemptId: 1, }, id: 4, }, state: C_RUNNING, 
diagnostics: , exit_status: -1000, 10:28:41,379 INFO NodeStatusUpdaterImpl:265 - Sending out status for container: container_id {, app_attempt_id {, application_id {, id: 4, cluster_timestamp: 1371849977601, }, attemptId: 1, }, id: 2, }, state: C_RUNNING, diagnostics: , exit_status: -1000, 10:28:41,555 INFO ContainersMonitorImpl:399 - Memory usage of ProcessTree 4230 for container-id container_1371849977601_0001_02_01: 161.0 MB of 512 MB physical memory used; 726.2 MB of 4 GB virtual memory used 10:28:41,802 INFO ContainersMonitorImpl:399 - Memory usage of ProcessTree 4324 for container-id container_1371849977601_0001_02_03: 522.9 MB of 768 MB physical memory used; 1.1 GB of 6 GB virtual memory used 10:28:41,844 INFO ContainersMonitorImpl:399 - Memory usage of ProcessTree 5717 for container-id container_1371849977601_0002_02_03: 608.3 MB of 1.3 GB physical memory used; 1.6 GB of 10 GB virtual memory used10:28:41,869 INFO ContainersMonitorImpl:399 - Memory usage of ProcessTree 26908 for container-id container_1371849977601_0004_01_02: 16.4 GB of 19.3 GB physical memory used; 17.0 GB of 154 GB virtual memory used 10:28:41,896 INFO ContainersMonitorImpl:399 - Memory usage of ProcessTree 27868 for container-id
[jira] [Commented] (YARN-864) YARN NM leaking containers with CGroups
[ https://issues.apache.org/jira/browse/YARN-864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13691295#comment-13691295 ] Chris Riccomini commented on YARN-864: -- Wondering if this is because of YARN-495? I applied YARN-688, but YARN-495 didn't apply easily to the 2.0.5-alpha branch, so I didn't use it. YARN NM leaking containers with CGroups --- Key: YARN-864 URL: https://issues.apache.org/jira/browse/YARN-864 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.0.5-alpha Environment: YARN 2.0.5-alpha with patches applied for YARN-799 and YARN-600. Reporter: Chris Riccomini Attachments: rm-log Hey Guys, I'm running YARN 2.0.5-alpha with CGroups and stateful RM turned on, and I'm seeing containers getting leaked by the NMs. I'm not quite sure what's going on -- has anyone seen this before? I'm concerned that maybe it's a mis-understanding on my part about how YARN's lifecycle works. When I look in my AM logs for my app (not an MR app master), I see: 2013-06-19 05:34:22 AppMasterTaskManager [INFO] Got an exit code of -100. This means that container container_1371141151815_0008_03_02 was killed by YARN, either due to being released by the application master or being 'lost' due to node failures etc. 2013-06-19 05:34:22 AppMasterTaskManager [INFO] Released container container_1371141151815_0008_03_02 was assigned task ID 0. Requesting a new container for the task. The AM has been running steadily the whole time. Here's what the NM logs say: {noformat} 05:34:59,783 WARN AsyncDispatcher:109 - Interrupted Exception while stopping java.lang.InterruptedException at java.lang.Object.wait(Native Method) at java.lang.Thread.join(Thread.java:1143) at java.lang.Thread.join(Thread.java:1196) at org.apache.hadoop.yarn.event.AsyncDispatcher.stop(AsyncDispatcher.java:107) at org.apache.hadoop.yarn.service.CompositeService.stop(CompositeService.java:99) at org.apache.hadoop.yarn.service.CompositeService.stop(CompositeService.java:89) at org.apache.hadoop.yarn.server.nodemanager.NodeManager.stop(NodeManager.java:209) at org.apache.hadoop.yarn.server.nodemanager.NodeManager.handle(NodeManager.java:336) at org.apache.hadoop.yarn.server.nodemanager.NodeManager.handle(NodeManager.java:61) at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:130) at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:77) at java.lang.Thread.run(Thread.java:619) 05:35:00,314 WARN ContainersMonitorImpl:463 - org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl is interrupted. Exiting. 05:35:00,434 WARN CgroupsLCEResourcesHandler:166 - Unable to delete cgroup at: /cgroup/cpu/hadoop-yarn/container_1371141151815_0006_01_001598 05:35:00,434 WARN CgroupsLCEResourcesHandler:166 - Unable to delete cgroup at: /cgroup/cpu/hadoop-yarn/container_1371141151815_0008_03_02 05:35:00,434 WARN ContainerLaunch:247 - Failed to launch container. 
java.io.IOException: java.lang.InterruptedException at org.apache.hadoop.util.Shell.runCommand(Shell.java:205) at org.apache.hadoop.util.Shell.run(Shell.java:129) at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:322) at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.launchContainer(LinuxContainerExecutor.java:230) at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:242) at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:68) at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) at java.util.concurrent.FutureTask.run(FutureTask.java:138) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) at java.lang.Thread.run(Thread.java:619) 05:35:00,434 WARN ContainerLaunch:247 - Failed to launch container. java.io.IOException: java.lang.InterruptedException at org.apache.hadoop.util.Shell.runCommand(Shell.java:205) at org.apache.hadoop.util.Shell.run(Shell.java:129) at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:322) at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.launchContainer(LinuxContainerExecutor.java:230) at
[jira] [Updated] (YARN-864) YARN NM leaking containers with CGroups
[ https://issues.apache.org/jira/browse/YARN-864?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jian He updated YARN-864: - Attachment: YARN-864.1.patch patch for NM clean up containers on SHUTDOWN and REBOOT event. YARN NM leaking containers with CGroups --- Key: YARN-864 URL: https://issues.apache.org/jira/browse/YARN-864 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.0.5-alpha Environment: YARN 2.0.5-alpha with patches applied for YARN-799 and YARN-600. Reporter: Chris Riccomini Attachments: rm-log, YARN-864.1.patch Hey Guys, I'm running YARN 2.0.5-alpha with CGroups and stateful RM turned on, and I'm seeing containers getting leaked by the NMs. I'm not quite sure what's going on -- has anyone seen this before? I'm concerned that maybe it's a mis-understanding on my part about how YARN's lifecycle works. When I look in my AM logs for my app (not an MR app master), I see: 2013-06-19 05:34:22 AppMasterTaskManager [INFO] Got an exit code of -100. This means that container container_1371141151815_0008_03_02 was killed by YARN, either due to being released by the application master or being 'lost' due to node failures etc. 2013-06-19 05:34:22 AppMasterTaskManager [INFO] Released container container_1371141151815_0008_03_02 was assigned task ID 0. Requesting a new container for the task. The AM has been running steadily the whole time. Here's what the NM logs say: {noformat} 05:34:59,783 WARN AsyncDispatcher:109 - Interrupted Exception while stopping java.lang.InterruptedException at java.lang.Object.wait(Native Method) at java.lang.Thread.join(Thread.java:1143) at java.lang.Thread.join(Thread.java:1196) at org.apache.hadoop.yarn.event.AsyncDispatcher.stop(AsyncDispatcher.java:107) at org.apache.hadoop.yarn.service.CompositeService.stop(CompositeService.java:99) at org.apache.hadoop.yarn.service.CompositeService.stop(CompositeService.java:89) at org.apache.hadoop.yarn.server.nodemanager.NodeManager.stop(NodeManager.java:209) at org.apache.hadoop.yarn.server.nodemanager.NodeManager.handle(NodeManager.java:336) at org.apache.hadoop.yarn.server.nodemanager.NodeManager.handle(NodeManager.java:61) at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:130) at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:77) at java.lang.Thread.run(Thread.java:619) 05:35:00,314 WARN ContainersMonitorImpl:463 - org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl is interrupted. Exiting. 05:35:00,434 WARN CgroupsLCEResourcesHandler:166 - Unable to delete cgroup at: /cgroup/cpu/hadoop-yarn/container_1371141151815_0006_01_001598 05:35:00,434 WARN CgroupsLCEResourcesHandler:166 - Unable to delete cgroup at: /cgroup/cpu/hadoop-yarn/container_1371141151815_0008_03_02 05:35:00,434 WARN ContainerLaunch:247 - Failed to launch container. 
java.io.IOException: java.lang.InterruptedException at org.apache.hadoop.util.Shell.runCommand(Shell.java:205) at org.apache.hadoop.util.Shell.run(Shell.java:129) at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:322) at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.launchContainer(LinuxContainerExecutor.java:230) at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:242) at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:68) at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) at java.util.concurrent.FutureTask.run(FutureTask.java:138) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) at java.lang.Thread.run(Thread.java:619) 05:35:00,434 WARN ContainerLaunch:247 - Failed to launch container. java.io.IOException: java.lang.InterruptedException at org.apache.hadoop.util.Shell.runCommand(Shell.java:205) at org.apache.hadoop.util.Shell.run(Shell.java:129) at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:322) at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.launchContainer(LinuxContainerExecutor.java:230) at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:242) at
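To illustrate the idea behind the attached patch (cleaning up containers on both SHUTDOWN and REBOOT rather than only on one of them), here is a simplified, self-contained stand-in for the NodeManager event handling; the event names mirror the stack trace above, but the helper methods are placeholders, not the actual patch code.
{code:java}
import java.util.ArrayList;
import java.util.List;

/** Self-contained illustration of the fix idea in YARN-864.1.patch: clean up running
 *  containers on both SHUTDOWN and REBOOT before the NM stops or restarts. Names here
 *  are simplified stand-ins for the real NodeManager internals. */
public class NodeManagerShutdownSketch {

  enum NodeManagerEventType { SHUTDOWN, REBOOT }

  private final List<String> runningContainers = new ArrayList<>();

  void handle(NodeManagerEventType type) {
    switch (type) {
      case SHUTDOWN:
        cleanupContainers();  // kill/clean running containers before stopping services
        stopServices();
        break;
      case REBOOT:
        cleanupContainers();  // the missing step: rebooting without cleanup leaves container
                              // processes (and their cgroups) behind, i.e. the observed leak
        stopServices();
        restartServices();
        break;
    }
  }

  private void cleanupContainers() {
    for (String containerId : runningContainers) {
      System.out.println("Killing and cleaning up " + containerId);
    }
    runningContainers.clear();
  }

  private void stopServices()    { System.out.println("Stopping NM services"); }
  private void restartServices() { System.out.println("Restarting NM services"); }
}
{code}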
[jira] [Commented] (YARN-864) YARN NM leaking containers with CGroups
[ https://issues.apache.org/jira/browse/YARN-864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13691344#comment-13691344 ] Jian He commented on YARN-864: -- Hi Chris Yes, the log shows its on REBOOT event. The earlier patch only takes care of SHUTDOWN event, uploaded a new patch for that. YARN NM leaking containers with CGroups --- Key: YARN-864 URL: https://issues.apache.org/jira/browse/YARN-864 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.0.5-alpha Environment: YARN 2.0.5-alpha with patches applied for YARN-799 and YARN-600. Reporter: Chris Riccomini Attachments: rm-log, YARN-864.1.patch Hey Guys, I'm running YARN 2.0.5-alpha with CGroups and stateful RM turned on, and I'm seeing containers getting leaked by the NMs. I'm not quite sure what's going on -- has anyone seen this before? I'm concerned that maybe it's a mis-understanding on my part about how YARN's lifecycle works. When I look in my AM logs for my app (not an MR app master), I see: 2013-06-19 05:34:22 AppMasterTaskManager [INFO] Got an exit code of -100. This means that container container_1371141151815_0008_03_02 was killed by YARN, either due to being released by the application master or being 'lost' due to node failures etc. 2013-06-19 05:34:22 AppMasterTaskManager [INFO] Released container container_1371141151815_0008_03_02 was assigned task ID 0. Requesting a new container for the task. The AM has been running steadily the whole time. Here's what the NM logs say: {noformat} 05:34:59,783 WARN AsyncDispatcher:109 - Interrupted Exception while stopping java.lang.InterruptedException at java.lang.Object.wait(Native Method) at java.lang.Thread.join(Thread.java:1143) at java.lang.Thread.join(Thread.java:1196) at org.apache.hadoop.yarn.event.AsyncDispatcher.stop(AsyncDispatcher.java:107) at org.apache.hadoop.yarn.service.CompositeService.stop(CompositeService.java:99) at org.apache.hadoop.yarn.service.CompositeService.stop(CompositeService.java:89) at org.apache.hadoop.yarn.server.nodemanager.NodeManager.stop(NodeManager.java:209) at org.apache.hadoop.yarn.server.nodemanager.NodeManager.handle(NodeManager.java:336) at org.apache.hadoop.yarn.server.nodemanager.NodeManager.handle(NodeManager.java:61) at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:130) at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:77) at java.lang.Thread.run(Thread.java:619) 05:35:00,314 WARN ContainersMonitorImpl:463 - org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl is interrupted. Exiting. 05:35:00,434 WARN CgroupsLCEResourcesHandler:166 - Unable to delete cgroup at: /cgroup/cpu/hadoop-yarn/container_1371141151815_0006_01_001598 05:35:00,434 WARN CgroupsLCEResourcesHandler:166 - Unable to delete cgroup at: /cgroup/cpu/hadoop-yarn/container_1371141151815_0008_03_02 05:35:00,434 WARN ContainerLaunch:247 - Failed to launch container. 
java.io.IOException: java.lang.InterruptedException at org.apache.hadoop.util.Shell.runCommand(Shell.java:205) at org.apache.hadoop.util.Shell.run(Shell.java:129) at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:322) at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.launchContainer(LinuxContainerExecutor.java:230) at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:242) at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:68) at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) at java.util.concurrent.FutureTask.run(FutureTask.java:138) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) at java.lang.Thread.run(Thread.java:619) 05:35:00,434 WARN ContainerLaunch:247 - Failed to launch container. java.io.IOException: java.lang.InterruptedException at org.apache.hadoop.util.Shell.runCommand(Shell.java:205) at org.apache.hadoop.util.Shell.run(Shell.java:129) at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:322) at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.launchContainer(LinuxContainerExecutor.java:230) at
[jira] [Assigned] (YARN-771) AMRMClient support for resource blacklisting
[ https://issues.apache.org/jira/browse/YARN-771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Junping Du reassigned YARN-771: --- Assignee: Junping Du AMRMClient support for resource blacklisting - Key: YARN-771 URL: https://issues.apache.org/jira/browse/YARN-771 Project: Hadoop YARN Issue Type: Sub-task Reporter: Bikas Saha Assignee: Junping Du After YARN-750, AMRMClient should support blacklisting via the new YARN APIs -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
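Since the work is only just assigned, the snippet below is one plausible shape for such an API rather than a committed interface: an updateBlacklist-style call on AMRMClient that forwards the blacklist additions/removals carried by the ResourceBlacklistRequest introduced in YARN-750.
{code:java}
import java.util.Arrays;
import java.util.List;

import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;

/** Hypothetical usage of the blacklisting support this JIRA asks for; the updateBlacklist
 *  method shown here is an assumed shape, not yet part of AMRMClient at this point. */
public class BlacklistUsageSketch {

  static void avoidFlakyNodes(AMRMClient<ContainerRequest> amrmClient) {
    // Ask the RM (via the blacklist fields added to AllocateRequest by YARN-750) to stop
    // allocating containers on these nodes...
    List<String> badNodes = Arrays.asList("badnode1.example.com", "badnode2.example.com");
    amrmClient.updateBlacklist(badNodes, null);

    // ...and later lift the restriction once the nodes recover.
    amrmClient.updateBlacklist(null, badNodes);
  }
}
{code}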
[jira] [Updated] (YARN-873) YARNClient.getApplicationReport(unknownAppId) returns a null report
[ https://issues.apache.org/jira/browse/YARN-873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bikas Saha updated YARN-873: Assignee: Xuan Gong YARNClient.getApplicationReport(unknownAppId) returns a null report --- Key: YARN-873 URL: https://issues.apache.org/jira/browse/YARN-873 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.1.0-beta Reporter: Bikas Saha Assignee: Xuan Gong How can the client find out that app does not exist? -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
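Given the behaviour described here, the only defensive check a client can make today is for a null report, roughly as below (the ApplicationId values are made up for the example); surfacing a dedicated "application not found" error is the improvement this sub-task is after.
{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.client.api.YarnClient;

public class UnknownAppCheckSketch {
  public static void main(String[] args) throws Exception {
    YarnClient yarnClient = YarnClient.createYarnClient();
    yarnClient.init(new Configuration());
    yarnClient.start();

    // An appId the RM has never seen (cluster timestamp and id are invented for the example).
    ApplicationId unknownAppId = ApplicationId.newInstance(123456789L, 42);

    // With the behaviour this JIRA describes, the caller only gets a null report back,
    // so it has to be checked explicitly.
    ApplicationReport report = yarnClient.getApplicationReport(unknownAppId);
    if (report == null) {
      System.err.println("Application " + unknownAppId + " is unknown to the ResourceManager");
    }

    yarnClient.stop();
  }
}
{code}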
[jira] [Updated] (YARN-654) AMRMClient: Perform sanity checks for parameters of public methods
[ https://issues.apache.org/jira/browse/YARN-654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bikas Saha updated YARN-654: Assignee: Xuan Gong AMRMClient: Perform sanity checks for parameters of public methods -- Key: YARN-654 URL: https://issues.apache.org/jira/browse/YARN-654 Project: Hadoop YARN Issue Type: Bug Reporter: Bikas Saha Assignee: Xuan Gong Fix For: 2.1.0-beta -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
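For illustration, the kind of argument validation being asked of AMRMClient's public methods looks roughly like the following; the specific checks, method name and messages are examples, not the eventual patch.
{code:java}
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;

/** Illustrative only: example sanity checks of the kind YARN-654 asks AMRMClient's public
 *  methods to perform before accepting a container request. */
public class SanityCheckSketch {

  static void checkContainerRequest(Resource capability, Priority priority, int numContainers) {
    if (capability == null) {
      throw new IllegalArgumentException("The requested Resource capability must not be null");
    }
    if (priority == null) {
      throw new IllegalArgumentException("The request priority must not be null");
    }
    if (numContainers <= 0) {
      throw new IllegalArgumentException("numContainers must be positive, got " + numContainers);
    }
  }
}
{code}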
[jira] [Updated] (YARN-763) AMRMClientAsync should stop heartbeating after receiving shutdown from RM
[ https://issues.apache.org/jira/browse/YARN-763?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bikas Saha updated YARN-763: Assignee: Xuan Gong AMRMClientAsync should stop heartbeating after receiving shutdown from RM - Key: YARN-763 URL: https://issues.apache.org/jira/browse/YARN-763 Project: Hadoop YARN Issue Type: Bug Reporter: Bikas Saha Assignee: Xuan Gong -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
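A simplified stand-in for the heartbeat thread shows the intended behaviour, assuming the AllocateResponse#getAMCommand/AMCommand API of this YARN line; this is not the actual AMRMClientAsync implementation.
{code:java}
import java.util.concurrent.atomic.AtomicBoolean;

import org.apache.hadoop.yarn.api.protocolrecords.AllocateResponse;
import org.apache.hadoop.yarn.api.records.AMCommand;
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;

/** Simplified stand-in for the AMRMClientAsync heartbeat thread, showing the behaviour this
 *  JIRA asks for: once the RM answers with AM_SHUTDOWN, stop issuing further allocate calls. */
public class HeartbeatLoopSketch {

  private final AtomicBoolean keepRunning = new AtomicBoolean(true);

  void heartbeatLoop(AMRMClient<ContainerRequest> client, long intervalMs) throws Exception {
    while (keepRunning.get()) {
      AllocateResponse response = client.allocate(0.0f); // progress value omitted for brevity
      if (response.getAMCommand() == AMCommand.AM_SHUTDOWN) {
        // The RM told this AM to shut down: stop heartbeating instead of looping forever.
        keepRunning.set(false);
        break;
      }
      Thread.sleep(intervalMs);
    }
  }
}
{code}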