[jira] [Updated] (YARN-2881) Implement PlanFollower for FairScheduler
[ https://issues.apache.org/jira/browse/YARN-2881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anubhav Dhoot updated YARN-2881: Attachment: YARN-2881.001.patch Implements the rest of the Fair Scheduler reservation system. Tested with the unit tests added in the patch, and by using the sample from YARN-2609 and validating in the UI from YARN-2664. Implement PlanFollower for FairScheduler Key: YARN-2881 URL: https://issues.apache.org/jira/browse/YARN-2881 Project: Hadoop YARN Issue Type: Sub-task Components: fairscheduler Reporter: Anubhav Dhoot Assignee: Anubhav Dhoot Attachments: YARN-2881.001.patch, YARN-2881.prelim.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-1206) as
[ https://issues.apache.org/jira/browse/YARN-1206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mukesh Jha updated YARN-1206: - Summary: as (was: AM container log link broken on NM web page even though local container logs are available) as -- Key: YARN-1206 URL: https://issues.apache.org/jira/browse/YARN-1206 Project: Hadoop YARN Issue Type: Bug Reporter: Jian He Assignee: Rohith Priority: Blocker Fix For: 2.4.0 Attachments: YARN-1206.1.patch, YARN-1206.patch With log aggregation disabled, the container's logs link works properly while the container is running, but after the application is finished, the link shows 'Container does not exist.' -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-1206) AM container log link broken on NM web page
[ https://issues.apache.org/jira/browse/YARN-1206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mukesh Jha updated YARN-1206: - Summary: AM container log link broken on NM web page (was: as) AM container log link broken on NM web page --- Key: YARN-1206 URL: https://issues.apache.org/jira/browse/YARN-1206 Project: Hadoop YARN Issue Type: Bug Reporter: Jian He Assignee: Rohith Priority: Blocker Fix For: 2.4.0 Attachments: YARN-1206.1.patch, YARN-1206.patch With log aggregation disabled, the container's logs link works properly while the container is running, but after the application is finished, the link shows 'Container does not exist.' -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2340) NPE thrown when RM restarts after queue is STOPPED. Thereafter RM cannot recover applications and remains in standby
[ https://issues.apache.org/jira/browse/YARN-2340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1429#comment-1429 ] Rohith commented on YARN-2340: -- [~jianhe] Kindly review the patch NPE thrown when RM restarts after queue is STOPPED. Thereafter RM cannot recover applications and remains in standby -- Key: YARN-2340 URL: https://issues.apache.org/jira/browse/YARN-2340 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager, scheduler Affects Versions: 2.4.1 Environment: CapacityScheduler with queues a, b Reporter: Nishan Shetty Assignee: Rohith Priority: Critical Attachments: 0001-YARN-2340.patch While a job is in progress, change the queue state to STOPPED and then restart the RM. Observe that the standby RM fails to come up as active, throwing the NPE below: 2014-07-23 18:43:24,432 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1406116264351_0014_02 State change from NEW to SUBMITTED 2014-07-23 18:43:24,433 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in handling event type APP_ATTEMPT_ADDED to the scheduler java.lang.NullPointerException at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.addApplicationAttempt(CapacityScheduler.java:568) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:916) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:101) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:602) at java.lang.Thread.run(Thread.java:662) 2014-07-23 18:43:24,434 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Exiting, bbye.. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
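The trace shows CapacityScheduler.addApplicationAttempt dereferencing an application that was never added, because its queue was STOPPED during recovery. A sketch of the kind of defensive guard this calls for is below; the lookup mirrors the CapacityScheduler code named in the trace, but 0001-YARN-2340.patch may shape the fix differently.

{code}
// Hypothetical guard inside CapacityScheduler#addApplicationAttempt:
// if the application was rejected at submission (e.g. its queue was
// STOPPED), there is nothing to attach the attempt to, so bail out
// instead of dereferencing null.
SchedulerApplication<FiCaSchedulerApp> application =
    applications.get(applicationAttemptId.getApplicationId());
if (application == null) {
  LOG.warn("Application " + applicationAttemptId.getApplicationId()
      + " not found; skipping attempt " + applicationAttemptId);
  return;
}
CSQueue queue = (CSQueue) application.getQueue();
{code}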
[jira] [Commented] (YARN-2881) Implement PlanFollower for FairScheduler
[ https://issues.apache.org/jira/browse/YARN-2881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14255589#comment-14255589 ] Hadoop QA commented on YARN-2881: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12688624/YARN-2881.001.patch against trunk revision ecf1469. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 7 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:red}-1 findbugs{color}. The patch appears to introduce 17 new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/6167//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/6167//artifact/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-resourcemanager.html Console output: https://builds.apache.org/job/PreCommit-YARN-Build/6167//console This message is automatically generated. Implement PlanFollower for FairScheduler Key: YARN-2881 URL: https://issues.apache.org/jira/browse/YARN-2881 Project: Hadoop YARN Issue Type: Sub-task Components: fairscheduler Reporter: Anubhav Dhoot Assignee: Anubhav Dhoot Attachments: YARN-2881.001.patch, YARN-2881.prelim.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2939) Fix new findbugs warnings in hadoop-yarn-common
[ https://issues.apache.org/jira/browse/YARN-2939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14255646#comment-14255646 ] Hudson commented on YARN-2939: -- SUCCESS: Integrated in Hadoop-trunk-Commit #6774 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/6774/]) YARN-2939. Fix new findbugs warnings in hadoop-yarn-common. (Li Lu via junping_du) (junping_du: rev a696fbb001b946ae75f3b8e962839c2fd3decfa1) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/logaggregation/AggregatedLogFormat.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/nodelabels/CommonNodeLabelsManager.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/client/api/impl/TestTimelineClient.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/state/Graph.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/util/WindowsBasedProcessTree.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/util/ProcfsBasedProcessTree.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/state/VisualizeStateMachine.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/api/impl/TimelineClientImpl.java * hadoop-yarn-project/hadoop-yarn/dev-support/findbugs-exclude.xml * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/util/LinuxResourceCalculatorPlugin.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/api/records/impl/pb/PriorityPBImpl.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/nodelabels/NodeLabelsStore.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/nodelabels/RMNodeLabelsManager.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/util/ResourceCalculatorPlugin.java Fix new findbugs warnings in hadoop-yarn-common --- Key: YARN-2939 URL: https://issues.apache.org/jira/browse/YARN-2939 Project: Hadoop YARN Issue Type: Improvement Reporter: Varun Saxena Assignee: Li Lu Labels: findbugs Fix For: 2.7.0 Attachments: YARN-2939-120914.patch, YARN-2939-121614.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2939) Fix new findbugs warnings in hadoop-yarn-common
[ https://issues.apache.org/jira/browse/YARN-2939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14255763#comment-14255763 ] Hudson commented on YARN-2939: -- FAILURE: Integrated in Hadoop-Hdfs-trunk #1981 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1981/]) YARN-2939. Fix new findbugs warnings in hadoop-yarn-common. (Li Lu via junping_du) (junping_du: rev a696fbb001b946ae75f3b8e962839c2fd3decfa1) * hadoop-yarn-project/hadoop-yarn/dev-support/findbugs-exclude.xml * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/nodelabels/NodeLabelsStore.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/logaggregation/AggregatedLogFormat.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/util/WindowsBasedProcessTree.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/state/VisualizeStateMachine.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/client/api/impl/TestTimelineClient.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/nodelabels/RMNodeLabelsManager.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/util/ProcfsBasedProcessTree.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/util/ResourceCalculatorPlugin.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/nodelabels/CommonNodeLabelsManager.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/util/LinuxResourceCalculatorPlugin.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/api/records/impl/pb/PriorityPBImpl.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/api/impl/TimelineClientImpl.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/state/Graph.java Fix new findbugs warnings in hadoop-yarn-common --- Key: YARN-2939 URL: https://issues.apache.org/jira/browse/YARN-2939 Project: Hadoop YARN Issue Type: Improvement Reporter: Varun Saxena Assignee: Li Lu Labels: findbugs Fix For: 2.7.0 Attachments: YARN-2939-120914.patch, YARN-2939-121614.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2939) Fix new findbugs warnings in hadoop-yarn-common
[ https://issues.apache.org/jira/browse/YARN-2939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14255770#comment-14255770 ] Hudson commented on YARN-2939: -- FAILURE: Integrated in Hadoop-Hdfs-trunk-Java8 #46 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/46/]) YARN-2939. Fix new findbugs warnings in hadoop-yarn-common. (Li Lu via junping_du) (junping_du: rev a696fbb001b946ae75f3b8e962839c2fd3decfa1) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/nodelabels/CommonNodeLabelsManager.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/util/ProcfsBasedProcessTree.java * hadoop-yarn-project/hadoop-yarn/dev-support/findbugs-exclude.xml * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/api/records/impl/pb/PriorityPBImpl.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/nodelabels/RMNodeLabelsManager.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/util/ResourceCalculatorPlugin.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/logaggregation/AggregatedLogFormat.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/state/Graph.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/util/LinuxResourceCalculatorPlugin.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/api/impl/TimelineClientImpl.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/util/WindowsBasedProcessTree.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/client/api/impl/TestTimelineClient.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/nodelabels/NodeLabelsStore.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/state/VisualizeStateMachine.java Fix new findbugs warnings in hadoop-yarn-common --- Key: YARN-2939 URL: https://issues.apache.org/jira/browse/YARN-2939 Project: Hadoop YARN Issue Type: Improvement Reporter: Varun Saxena Assignee: Li Lu Labels: findbugs Fix For: 2.7.0 Attachments: YARN-2939-120914.patch, YARN-2939-121614.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2939) Fix new findbugs warnings in hadoop-yarn-common
[ https://issues.apache.org/jira/browse/YARN-2939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14255808#comment-14255808 ] Hudson commented on YARN-2939: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk-Java8 #50 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/50/]) YARN-2939. Fix new findbugs warnings in hadoop-yarn-common. (Li Lu via junping_du) (junping_du: rev a696fbb001b946ae75f3b8e962839c2fd3decfa1) * hadoop-yarn-project/hadoop-yarn/dev-support/findbugs-exclude.xml * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/api/records/impl/pb/PriorityPBImpl.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/api/impl/TimelineClientImpl.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/nodelabels/CommonNodeLabelsManager.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/util/ProcfsBasedProcessTree.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/client/api/impl/TestTimelineClient.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/state/Graph.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/nodelabels/NodeLabelsStore.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/state/VisualizeStateMachine.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/nodelabels/RMNodeLabelsManager.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/util/WindowsBasedProcessTree.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/logaggregation/AggregatedLogFormat.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/util/ResourceCalculatorPlugin.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/util/LinuxResourceCalculatorPlugin.java Fix new findbugs warnings in hadoop-yarn-common --- Key: YARN-2939 URL: https://issues.apache.org/jira/browse/YARN-2939 Project: Hadoop YARN Issue Type: Improvement Reporter: Varun Saxena Assignee: Li Lu Labels: findbugs Fix For: 2.7.0 Attachments: YARN-2939-120914.patch, YARN-2939-121614.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2939) Fix new findbugs warnings in hadoop-yarn-common
[ https://issues.apache.org/jira/browse/YARN-2939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14255838#comment-14255838 ] Hudson commented on YARN-2939: -- SUCCESS: Integrated in Hadoop-Mapreduce-trunk #2000 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/2000/]) YARN-2939. Fix new findbugs warnings in hadoop-yarn-common. (Li Lu via junping_du) (junping_du: rev a696fbb001b946ae75f3b8e962839c2fd3decfa1) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/state/VisualizeStateMachine.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/nodelabels/RMNodeLabelsManager.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/util/WindowsBasedProcessTree.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/api/records/impl/pb/PriorityPBImpl.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/util/ProcfsBasedProcessTree.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/client/api/impl/TestTimelineClient.java * hadoop-yarn-project/hadoop-yarn/dev-support/findbugs-exclude.xml * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/api/impl/TimelineClientImpl.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/nodelabels/NodeLabelsStore.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/logaggregation/AggregatedLogFormat.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/nodelabels/CommonNodeLabelsManager.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/state/Graph.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/util/LinuxResourceCalculatorPlugin.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/util/ResourceCalculatorPlugin.java Fix new findbugs warnings in hadoop-yarn-common --- Key: YARN-2939 URL: https://issues.apache.org/jira/browse/YARN-2939 Project: Hadoop YARN Issue Type: Improvement Reporter: Varun Saxena Assignee: Li Lu Labels: findbugs Fix For: 2.7.0 Attachments: YARN-2939-120914.patch, YARN-2939-121614.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2946) DeadLocks in RMStateStore-ZKRMStateStore
[ https://issues.apache.org/jira/browse/YARN-2946?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rohith updated YARN-2946: - Attachment: 0003-YARN-2946.patch DeadLocks in RMStateStore-ZKRMStateStore -- Key: YARN-2946 URL: https://issues.apache.org/jira/browse/YARN-2946 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.7.0 Reporter: Rohith Assignee: Rohith Priority: Blocker Attachments: 0001-YARN-2946.patch, 0001-YARN-2946.patch, 0002-YARN-2946.patch, 0003-YARN-2946.patch, RM_BeforeFix_Deadlock_cycle_1.png, RM_BeforeFix_Deadlock_cycle_2.png, TestYARN2946.java Found one deadlock in ZKRMStateStore. # In the initial stage, zkClient is null because of a ZK Disconnected event. # While ZKRMStateStore#runWithCheck() calls wait(zkSessionTimeout) for zkClient to re-establish the ZooKeeper connection via either a SyncConnected or an Expired event, it is highly possible that another thread obtains the lock on {{ZKRMStateStore.this}} through state-machine transition events. This causes a deadlock in ZKRMStateStore. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
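For readers following along, below is a generic, self-contained Java illustration of the lock-ordering deadlock pattern described here; the two monitors are stand-ins, not the actual ZKRMStateStore code, and the actual cycle is the one documented in the attached RM_BeforeFix_Deadlock_cycle images.

{code}
/** Generic illustration of a lock-ordering deadlock: one thread takes the
 *  store monitor and then needs the dispatcher's, while another thread
 *  takes them in the opposite order; with the pauses, both block forever. */
public class DeadlockSketch {
  static final Object STORE = new Object();      // stands in for ZKRMStateStore.this
  static final Object DISPATCHER = new Object(); // stands in for the state-machine lock

  public static void main(String[] args) {
    new Thread(() -> {
      synchronized (STORE) {
        pause();                    // simulate waiting for the ZK session to recover
        synchronized (DISPATCHER) { // then try to notify the dispatcher
          System.out.println("zk thread done");
        }
      }
    }).start();
    new Thread(() -> {
      synchronized (DISPATCHER) {
        pause();                    // simulate a state-machine transition in flight
        synchronized (STORE) {      // the transition wants to write to the store
          System.out.println("event thread done");
        }
      }
    }).start();
  }

  static void pause() {
    try { Thread.sleep(200); } catch (InterruptedException ignored) { }
  }
}
{code}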
[jira] [Commented] (YARN-2946) DeadLocks in RMStateStore-ZKRMStateStore
[ https://issues.apache.org/jira/browse/YARN-2946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14255864#comment-14255864 ] Rohith commented on YARN-2946: -- Thanks [~jianhe] for the review. I updated the patch fixing the above review comment. Kindly review the updated patch. DeadLocks in RMStateStore-ZKRMStateStore -- Key: YARN-2946 URL: https://issues.apache.org/jira/browse/YARN-2946 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.7.0 Reporter: Rohith Assignee: Rohith Priority: Blocker Attachments: 0001-YARN-2946.patch, 0001-YARN-2946.patch, 0002-YARN-2946.patch, 0003-YARN-2946.patch, RM_BeforeFix_Deadlock_cycle_1.png, RM_BeforeFix_Deadlock_cycle_2.png, TestYARN2946.java Found one deadlock in ZKRMStateStore. # In the initial stage, zkClient is null because of a ZK Disconnected event. # While ZKRMStateStore#runWithCheck() calls wait(zkSessionTimeout) for zkClient to re-establish the ZooKeeper connection via either a SyncConnected or an Expired event, it is highly possible that another thread obtains the lock on {{ZKRMStateStore.this}} through state-machine transition events. This causes a deadlock in ZKRMStateStore. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2939) Fix new findbugs warnings in hadoop-yarn-common
[ https://issues.apache.org/jira/browse/YARN-2939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14255888#comment-14255888 ] Varun Saxena commented on YARN-2939: [~djp], [~jianhe], if you have time, could you have a look at the other 3 similar issues as well: YARN-2940, YARN-2937 and YARN-2938. Fix new findbugs warnings in hadoop-yarn-common --- Key: YARN-2939 URL: https://issues.apache.org/jira/browse/YARN-2939 Project: Hadoop YARN Issue Type: Improvement Reporter: Varun Saxena Assignee: Li Lu Labels: findbugs Fix For: 2.7.0 Attachments: YARN-2939-120914.patch, YARN-2939-121614.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-868) YarnClient should set the service address in tokens returned by getRMDelegationToken()
[ https://issues.apache.org/jira/browse/YARN-868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14255891#comment-14255891 ] Varun Saxena commented on YARN-868: --- Findbugs warnings to be fixed by YARN-2937 to YARN-2940. YarnClient should set the service address in tokens returned by getRMDelegationToken() -- Key: YARN-868 URL: https://issues.apache.org/jira/browse/YARN-868 Project: Hadoop YARN Issue Type: Bug Reporter: Hitesh Shah Assignee: Varun Saxena Attachments: YARN-868.patch Either the client should set this information into the token or the client layer should expose an API that returns the service address. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
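A minimal sketch of the first option, the client stamping the service address into the token at fetch time; SecurityUtil.setTokenService is the standard Hadoop helper for this, though the eventual patch may take the other route.

{code}
import java.net.InetSocketAddress;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.security.SecurityUtil;
import org.apache.hadoop.security.token.Token;

public final class RmTokenServiceSketch {
  /** Stamp the RM address into a freshly fetched delegation token so that
   *  downstream code can locate the renewer without extra configuration. */
  static Text stampService(Token<?> rmDelegationToken, String rmHost, int rmPort) {
    SecurityUtil.setTokenService(rmDelegationToken,
        new InetSocketAddress(rmHost, rmPort));
    return rmDelegationToken.getService(); // now encodes "host:port"
  }
}
{code}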
[jira] [Commented] (YARN-2946) DeadLocks in RMStateStore-ZKRMStateStore
[ https://issues.apache.org/jira/browse/YARN-2946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14255964#comment-14255964 ] Hadoop QA commented on YARN-2946: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12688659/0003-YARN-2946.patch against trunk revision a696fbb. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 2 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:red}-1 findbugs{color}. The patch appears to introduce 15 new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestContainerAllocation The following test timeouts occurred in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart Test results: https://builds.apache.org/job/PreCommit-YARN-Build/6168//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/6168//artifact/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-resourcemanager.html Console output: https://builds.apache.org/job/PreCommit-YARN-Build/6168//console This message is automatically generated. DeadLocks in RMStateStore-ZKRMStateStore -- Key: YARN-2946 URL: https://issues.apache.org/jira/browse/YARN-2946 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.7.0 Reporter: Rohith Assignee: Rohith Priority: Blocker Attachments: 0001-YARN-2946.patch, 0001-YARN-2946.patch, 0002-YARN-2946.patch, 0003-YARN-2946.patch, RM_BeforeFix_Deadlock_cycle_1.png, RM_BeforeFix_Deadlock_cycle_2.png, TestYARN2946.java Found one deadlock in ZKRMStateStore. # In the initial stage, zkClient is null because of a ZK Disconnected event. # While ZKRMStateStore#runWithCheck() calls wait(zkSessionTimeout) for zkClient to re-establish the ZooKeeper connection via either a SyncConnected or an Expired event, it is highly possible that another thread obtains the lock on {{ZKRMStateStore.this}} through state-machine transition events. This causes a deadlock in ZKRMStateStore. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2971) RM uses conf instead of token service address to renew timeline delegation tokens
[ https://issues.apache.org/jira/browse/YARN-2971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14255968#comment-14255968 ] Zhijie Shen commented on YARN-2971: --- [~jeagles], one question: why would serviceURI be different from restURI? I suppose the service will be the [host:port] of restURI if the DT is obtained and renewed at the same ATS? RM uses conf instead of token service address to renew timeline delegation tokens - Key: YARN-2971 URL: https://issues.apache.org/jira/browse/YARN-2971 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Affects Versions: 2.6.0 Reporter: Jonathan Eagles Assignee: Jonathan Eagles Attachments: YARN-2971-v1.patch The TimelineClientImpl renewDelegationToken uses the incorrect web address to renew Timeline DelegationTokens. It should read the service address out of the token to renew the delegation token. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
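A minimal sketch of the direction the description proposes, reading the renewer address out of the token via SecurityUtil rather than out of the conf; the REST path below is illustrative only, not the client's actual constant.

{code}
import java.net.InetSocketAddress;
import java.net.URI;
import java.net.URISyntaxException;
import org.apache.hadoop.security.SecurityUtil;
import org.apache.hadoop.security.token.Token;

public final class TimelineRenewAddressSketch {
  /** Derive the renewer endpoint from the token's own service field rather
   *  than from whatever timeline address the local conf happens to hold. */
  static URI renewAddress(Token<?> timelineDt, boolean https)
      throws URISyntaxException {
    InetSocketAddress addr = SecurityUtil.getTokenServiceAddr(timelineDt);
    return new URI(https ? "https" : "http", null, addr.getHostName(),
        addr.getPort(), "/ws/v1/timeline/", null, null);
  }
}
{code}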
[jira] [Updated] (YARN-2933) Capacity Scheduler preemption policy should only consider capacity without labels temporarily
[ https://issues.apache.org/jira/browse/YARN-2933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mayank Bansal updated YARN-2933: Attachment: YARN-2933-2.patch Thanks [~wangda] for the review. Makes sense. Updating the patch. Thanks, Mayank Capacity Scheduler preemption policy should only consider capacity without labels temporarily - Key: YARN-2933 URL: https://issues.apache.org/jira/browse/YARN-2933 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler Reporter: Wangda Tan Assignee: Mayank Bansal Attachments: YARN-2933-1.patch, YARN-2933-2.patch Currently, we have capacity enforcement on each queue for each label in CapacityScheduler, but we don't have a preemption policy to support that. YARN-2498 is targeting preemption that respects node labels, but we have some gaps in the code base; for example, queues/FiCaScheduler should be able to get usedResource/pendingResource, etc. by label. These items potentially require refactoring CS, which we need to spend some time thinking about carefully. For now, what we can do immediately is calculate ideal_allocation and preempt containers only for resources on nodes without labels, to avoid a regression like the following: a cluster has some nodes with labels and some without; assume queueA isn't satisfied for resources without labels, but the preemption policy may preempt resources from nodes with labels for queueA, which is not correct. Again, this is just a short-term enhancement; YARN-2498 will consider preemption respecting node labels for Capacity Scheduler, which is our final target. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
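A minimal sketch of the short-term arithmetic described in the issue, assuming the policy can already obtain the aggregate capacity of labeled nodes; how labeledCapacity is computed is an assumption here, not part of the patch.

{code}
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.util.resource.Resources;

public final class NoLabelPartitionSketch {
  /** Restrict the preemption math to the unlabeled partition by deducting
   *  labeled-node capacity from the cluster total before computing
   *  ideal_allocation. */
  static Resource preemptableTotal(Resource clusterTotal, Resource labeledCapacity) {
    return Resources.subtract(clusterTotal, labeledCapacity);
  }
}
{code}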
[jira] [Commented] (YARN-2970) NodeLabel operations in RMAdmin CLI are missing in the help command.
[ https://issues.apache.org/jira/browse/YARN-2970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14256052#comment-14256052 ] Varun Saxena commented on YARN-2970: [~djp], kindly review. NodeLabel operations in RMAdmin CLI are missing in the help command. Key: YARN-2970 URL: https://issues.apache.org/jira/browse/YARN-2970 Project: Hadoop YARN Issue Type: Sub-task Components: api, client, resourcemanager Affects Versions: 2.6.0 Reporter: Junping Du Assignee: Varun Saxena Priority: Minor Attachments: YARN-2970.patch NodeLabel operations in RMAdmin CLI are missing in the help command, which I noticed while debugging YARN-313; we should add them like the other cmds: {noformat} yarn rmadmin [-refreshQueues] [-refreshNodes] [-refreshResources] [-refreshSuperUserGroupsConfiguration] [-refreshUserToGroupsMappings] [-refreshAdminAcls] [-refreshServiceAcl] [-getGroup [username]] [-updateNodeResource [NodeID] [MemSize] [vCores] ([OvercommitTimeout]) [-help [cmd]] -refreshQueues: Reload the queues' acls, states and scheduler specific properties. ResourceManager will reload the mapred-queues configuration file. -refreshNodes: Refresh the hosts information at the ResourceManager. -refreshResources: Refresh resources of NodeManagers at the ResourceManager. -refreshSuperUserGroupsConfiguration: Refresh superuser proxy groups mappings -refreshUserToGroupsMappings: Refresh user-to-groups mappings -refreshAdminAcls: Refresh acls for administration of ResourceManager -refreshServiceAcl: Reload the service-level authorization policy file. ResoureceManager will reload the authorization policy file. -getGroups [username]: Get the groups which given user belongs to. -updateNodeResource [NodeID] [MemSize] [vCores] ([OvercommitTimeout]): Update resource on specific node. -help [cmd]: Displays help for the given command or all commands if none is specified. -addToClusterNodeLabels [label1,label2,label3] (label splitted by ,): add to cluster node labels -removeFromClusterNodeLabels [label1,label2,label3] (label splitted by ,): remove from cluster node labels -replaceLabelsOnNode [node1:port,label1,label2 node2:port,label1,label2]: replace labels on nodes -directlyAccessNodeLabelStore: Directly access node label store, with this option, all node label related operations will not connect RM. Instead, they will access/modify stored node labels directly. By default, it is false (access via RM). AND PLEASE NOTE: if you configured yarn.node-labels.fs-store.root-dir to a local directory (instead of NFS or HDFS), this option will only work when the command run on the machine where RM is running. {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2420) Fair Scheduler: dynamically update yarn.scheduler.fair.max.assign based on cluster load
[ https://issues.apache.org/jira/browse/YARN-2420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14256070#comment-14256070 ] Karthik Kambatla commented on YARN-2420: Should we resolve this as Won't Fix since we want to move to a world with only continuous scheduling? Fair Scheduler: dynamically update yarn.scheduler.fair.max.assign based on cluster load --- Key: YARN-2420 URL: https://issues.apache.org/jira/browse/YARN-2420 Project: Hadoop YARN Issue Type: Improvement Reporter: Wei Yan Assignee: Wei Yan Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (YARN-2420) Fair Scheduler: dynamically update yarn.scheduler.fair.max.assign based on cluster load
[ https://issues.apache.org/jira/browse/YARN-2420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei Yan resolved YARN-2420. --- Resolution: Won't Fix Fair Scheduler: dynamically update yarn.scheduler.fair.max.assign based on cluster load --- Key: YARN-2420 URL: https://issues.apache.org/jira/browse/YARN-2420 Project: Hadoop YARN Issue Type: Improvement Reporter: Wei Yan Assignee: Wei Yan Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2420) Fair Scheduler: dynamically update yarn.scheduler.fair.max.assign based on cluster load
[ https://issues.apache.org/jira/browse/YARN-2420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14256072#comment-14256072 ] Wei Yan commented on YARN-2420: --- Sure, I'll close it. Fair Scheduler: dynamically update yarn.scheduler.fair.max.assign based on cluster load --- Key: YARN-2420 URL: https://issues.apache.org/jira/browse/YARN-2420 Project: Hadoop YARN Issue Type: Improvement Reporter: Wei Yan Assignee: Wei Yan Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2933) Capacity Scheduler preemption policy should only consider capacity without labels temporarily
[ https://issues.apache.org/jira/browse/YARN-2933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14256111#comment-14256111 ] Hadoop QA commented on YARN-2933: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12688695/YARN-2933-2.patch against trunk revision a696fbb. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:red}-1 javadoc{color}. The javadoc tool appears to have generated 1 warning messages. See https://builds.apache.org/job/PreCommit-YARN-Build/6169//artifact/patchprocess/diffJavadocWarnings.txt for details. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:red}-1 findbugs{color}. The patch appears to introduce 16 new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/6169//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/6169//artifact/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-resourcemanager.html Console output: https://builds.apache.org/job/PreCommit-YARN-Build/6169//console This message is automatically generated. Capacity Scheduler preemption policy should only consider capacity without labels temporarily - Key: YARN-2933 URL: https://issues.apache.org/jira/browse/YARN-2933 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler Reporter: Wangda Tan Assignee: Mayank Bansal Attachments: YARN-2933-1.patch, YARN-2933-2.patch Currently, we have capacity enforcement on each queue for each label in CapacityScheduler, but we don't have a preemption policy to support that. YARN-2498 is targeting preemption that respects node labels, but we have some gaps in the code base; for example, queues/FiCaScheduler should be able to get usedResource/pendingResource, etc. by label. These items potentially require refactoring CS, which we need to spend some time thinking about carefully. For now, what we can do immediately is calculate ideal_allocation and preempt containers only for resources on nodes without labels, to avoid a regression like the following: a cluster has some nodes with labels and some without; assume queueA isn't satisfied for resources without labels, but the preemption policy may preempt resources from nodes with labels for queueA, which is not correct. Again, this is just a short-term enhancement; YARN-2498 will consider preemption respecting node labels for Capacity Scheduler, which is our final target. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-2984) Metrics for container's actual memory usage
Karthik Kambatla created YARN-2984: -- Summary: Metrics for container's actual memory usage Key: YARN-2984 URL: https://issues.apache.org/jira/browse/YARN-2984 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Affects Versions: 2.6.0 Reporter: Karthik Kambatla Assignee: Karthik Kambatla It would be nice to capture resource usage per container, for a variety of reasons. This JIRA is to track memory usage. YARN-2965 tracks the resource usage on the node, and the two implementations should reuse code as much as possible. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2984) Metrics for container's actual memory usage
[ https://issues.apache.org/jira/browse/YARN-2984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla updated YARN-2984: --- Attachment: yarn-2984-prelim.patch Posted a prelim patch (yarn-2984-prelim.patch) that captures my intention. I am thinking of adding a config to enable/disable this tracking. [~rgrandl] - can you take a look and see if this can co-exist with YARN-2965? Metrics for container's actual memory usage --- Key: YARN-2984 URL: https://issues.apache.org/jira/browse/YARN-2984 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Affects Versions: 2.6.0 Reporter: Karthik Kambatla Assignee: Karthik Kambatla Attachments: yarn-2984-prelim.patch It would be nice to capture resource usage per container, for a variety of reasons. This JIRA is to track memory usage. YARN-2965 tracks the resource usage on the node, and the two implementations should reuse code as much as possible. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
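As a rough illustration of what per-container memory sampling involves, here is a minimal, Linux-only sketch that reads a process's resident set size from /proc. It is not the prelim patch itself, which presumably builds on YARN's existing procfs-based process-tree plumbing.

{code}
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public final class ContainerRssSketch {
  /** Return the resident set size of a process in bytes by parsing
   *  /proc/<pid>/status (Linux only), or -1 if the field is absent,
   *  e.g. because the process has already exited. */
  static long rssBytes(int pid) throws IOException {
    for (String line : Files.readAllLines(
        Paths.get("/proc/" + pid + "/status"), StandardCharsets.UTF_8)) {
      if (line.startsWith("VmRSS:")) {
        // Line format: "VmRSS:     123456 kB"
        String[] parts = line.trim().split("\\s+");
        return Long.parseLong(parts[1]) * 1024L;
      }
    }
    return -1;
  }
}
{code}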
[jira] [Commented] (YARN-2938) Fix new findbugs warnings in hadoop-yarn-resourcemanager and hadoop-yarn-applicationhistoryservice
[ https://issues.apache.org/jira/browse/YARN-2938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14256132#comment-14256132 ] Zhijie Shen commented on YARN-2938: --- [~varun_saxena], the explanation is very detailed and sounds good. Would you please rebase the patch? bq. But as it is a rank 18 findbugs issue, didn't bother changing it. If you want, can fix it. Yes, please. Fix new findbugs warnings in hadoop-yarn-resourcemanager and hadoop-yarn-applicationhistoryservice -- Key: YARN-2938 URL: https://issues.apache.org/jira/browse/YARN-2938 Project: Hadoop YARN Issue Type: Improvement Reporter: Varun Saxena Assignee: Varun Saxena Fix For: 2.7.0 Attachments: FindBugs Report.html, YARN-2938.001.patch, YARN-2938.002.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2937) Fix new findbugs warnings in hadoop-yarn-nodemanager
[ https://issues.apache.org/jira/browse/YARN-2937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14256229#comment-14256229 ] Zhijie Shen commented on YARN-2937: --- [~varun_saxena], thanks for your explanation, which makes sense to me. Again, this patch doesn't apply either. Please kindly update it. And there's one additional comment. The [document|http://docs.oracle.com/javase/7/docs/api/java/io/PrintWriter.html] says {{Methods in this class never throw I/O exceptions, although some of its constructors may. The client may inquire as to whether any errors have occurred by invoking checkError().}} Previously, if an I/O error occurred, we would always throw an exception. Now some errors will just be recorded as WARN logs. IMHO, we should throw IOException here. {code} +if (pw.checkError()) { + LOG.warn("Error while closing cgroup file " + path); +} {code} Fix new findbugs warnings in hadoop-yarn-nodemanager Key: YARN-2937 URL: https://issues.apache.org/jira/browse/YARN-2937 Project: Hadoop YARN Issue Type: Improvement Reporter: Varun Saxena Assignee: Varun Saxena Fix For: 2.7.0 Attachments: YARN-2937.001.patch, YARN-2937.002.patch, YARN-2937.003.patch, YARN-2937.004.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
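A minimal sketch of the semantics being asked for, checking the PrintWriter error flag after the write and promoting it to an IOException instead of a WARN log; the method name and placement are assumptions, not the shape of the final patch.

{code}
import java.io.IOException;
import java.io.PrintWriter;

public final class StrictPrintWriterSketch {
  /** Preserve the old failure semantics: surface PrintWriter's swallowed
   *  write errors as an IOException instead of a WARN log. */
  static void writeAndClose(PrintWriter pw, String path, String value)
      throws IOException {
    pw.write(value);
    pw.close();                 // PrintWriter never throws; it sets a flag
    if (pw.checkError()) {      // so check the flag and promote it
      throw new IOException("Error while writing cgroup file " + path);
    }
  }
}
{code}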
[jira] [Assigned] (YARN-314) Schedulers should allow resource requests of different sizes at the same priority and location
[ https://issues.apache.org/jira/browse/YARN-314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla reassigned YARN-314: - Assignee: (was: Karthik Kambatla) I haven't been able to spend more time to clean this up. Marking it unassigned in case anyone wants to work on this. Schedulers should allow resource requests of different sizes at the same priority and location -- Key: YARN-314 URL: https://issues.apache.org/jira/browse/YARN-314 Project: Hadoop YARN Issue Type: Sub-task Components: scheduler Affects Versions: 2.0.2-alpha Reporter: Sandy Ryza Fix For: 2.7.0 Attachments: yarn-314-prelim.patch Currently, resource requests for the same container and locality are expected to all be the same size. While it doesn't look like it's needed for apps currently, and can be circumvented by specifying different priorities if absolutely necessary, it seems to me that the ability to request containers with different resource requirements at the same priority level should be there for the future and for completeness' sake. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2340) NPE thrown when RM restarts after queue is STOPPED. Thereafter RM cannot recover applications and remains in standby
[ https://issues.apache.org/jira/browse/YARN-2340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14256362#comment-14256362 ] Jian He commented on YARN-2340: --- looks good, +1 NPE thrown when RM restarts after queue is STOPPED. Thereafter RM cannot recover applications and remains in standby -- Key: YARN-2340 URL: https://issues.apache.org/jira/browse/YARN-2340 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager, scheduler Affects Versions: 2.4.1 Environment: CapacityScheduler with queues a, b Reporter: Nishan Shetty Assignee: Rohith Priority: Critical Attachments: 0001-YARN-2340.patch While a job is in progress, change the queue state to STOPPED and then restart the RM. Observe that the standby RM fails to come up as active, throwing the NPE below: 2014-07-23 18:43:24,432 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1406116264351_0014_02 State change from NEW to SUBMITTED 2014-07-23 18:43:24,433 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in handling event type APP_ATTEMPT_ADDED to the scheduler java.lang.NullPointerException at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.addApplicationAttempt(CapacityScheduler.java:568) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:916) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:101) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:602) at java.lang.Thread.run(Thread.java:662) 2014-07-23 18:43:24,434 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Exiting, bbye.. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2920) CapacityScheduler should be notified when labels on nodes changed
[ https://issues.apache.org/jira/browse/YARN-2920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14256368#comment-14256368 ] Jian He commented on YARN-2920: --- looks good, +1 CapacityScheduler should be notified when labels on nodes changed - Key: YARN-2920 URL: https://issues.apache.org/jira/browse/YARN-2920 Project: Hadoop YARN Issue Type: Sub-task Reporter: Wangda Tan Assignee: Wangda Tan Attachments: YARN-2920.1.patch, YARN-2920.2.patch, YARN-2920.3.patch, YARN-2920.4.patch, YARN-2920.5.patch, YARN-2920.6.patch Currently, changes to labels on nodes are handled only by RMNodeLabelsManager, but that is not enough: - The scheduler should be able to take actions on running containers (like kill/preempt/do-nothing). - Used/available capacity in the scheduler should be updated for future planning. We need to add a new event to pass such updates to the scheduler. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
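The Hudson notification that follows shows the committed fix adding a NodeLabelsUpdateSchedulerEvent; a plausible minimal shape for such an event is sketched below. The class body and the NODE_LABELS_UPDATE enum value are assumptions based on the file names in the commit, not the committed code itself.

{code}
import java.util.Map;
import java.util.Set;
import org.apache.hadoop.yarn.api.records.NodeId;
import org.apache.hadoop.yarn.server.resourcemanager.scheduler.event.SchedulerEvent;
import org.apache.hadoop.yarn.server.resourcemanager.scheduler.event.SchedulerEventType;

/** Hypothetical shape for the new event: it carries the updated
 *  node-to-labels mapping to the scheduler's handle() method, assuming a
 *  NODE_LABELS_UPDATE value was added to SchedulerEventType. */
public class NodeLabelsUpdateSchedulerEventSketch extends SchedulerEvent {
  private final Map<NodeId, Set<String>> updatedNodeToLabels;

  public NodeLabelsUpdateSchedulerEventSketch(
      Map<NodeId, Set<String>> updatedNodeToLabels) {
    super(SchedulerEventType.NODE_LABELS_UPDATE);
    this.updatedNodeToLabels = updatedNodeToLabels;
  }

  public Map<NodeId, Set<String>> getUpdatedNodeToLabels() {
    return updatedNodeToLabels;
  }
}
{code}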
[jira] [Commented] (YARN-2920) CapacityScheduler should be notified when labels on nodes changed
[ https://issues.apache.org/jira/browse/YARN-2920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14256391#comment-14256391 ] Hudson commented on YARN-2920: -- FAILURE: Integrated in Hadoop-trunk-Commit #6776 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/6776/]) YARN-2920. Changed CapacityScheduler to kill containers on nodes where node labels are changed. Contributed by Wangda Tan (jianhe: rev fdf042dfffa4d2474e3cac86cfb8fe9ee4648beb) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/event/SchedulerEventType.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/TestCapacitySchedulerNodeLabelUpdate.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/event/NodeLabelsUpdateSchedulerEvent.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/ResourceManager.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmnode/RMNodeImpl.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/SchedulerNode.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/ParentQueue.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CSQueue.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/MockNodes.java * hadoop-tools/hadoop-sls/src/main/java/org/apache/hadoop/yarn/sls/nodemanager/NodeInfo.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/LeafQueue.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/common/fica/FiCaSchedulerNode.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/nodelabels/RMNodeLabelsManager.java * hadoop-tools/hadoop-sls/src/main/java/org/apache/hadoop/yarn/sls/scheduler/RMNodeWrapper.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacityScheduler.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/AbstractCSQueue.java CapacityScheduler should be notified when labels on nodes changed - Key: YARN-2920 URL: https://issues.apache.org/jira/browse/YARN-2920 Project: Hadoop YARN Issue Type: Sub-task Reporter: Wangda Tan Assignee: Wangda Tan Fix For: 2.7.0 Attachments: 
YARN-2920.1.patch, YARN-2920.2.patch, YARN-2920.3.patch, YARN-2920.4.patch, YARN-2920.5.patch, YARN-2920.6.patch Currently, changes to labels on nodes are handled only by RMNodeLabelsManager, but that is not enough: - The scheduler should be able to take actions on running containers (like kill/preempt/do-nothing). - Used/available capacity in the scheduler should be updated for future planning. We need to add a new event to pass such updates to the scheduler. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2929) Adding separator ApplicationConstants.FILE_PATH_SEPARATOR for better Windows support
[ https://issues.apache.org/jira/browse/YARN-2929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14256390#comment-14256390 ] Chris Nauroth commented on YARN-2929: - [~ozawa], thank you for providing the additional details. Right now, the typical workflow is that the application submission context controls the Java classpath by setting the {{CLASSPATH}} environment variable. If we take the example of MapReduce, the relevant code is in {{YARNRunner}} and {{MRApps}}. There is support in there for handling environment variables in cross-platform application submission. However, even putting that aside, I believe there is no problem with using '/' for a Windows job submission if you use this technique. The NodeManager ultimately translates the classpath through {{Path}} and bundles the whole classpath into a jar file manifest to be referenced by the running container. I believe the only problem shown in the example is that the application submission is trying to set classpath by command-line argument. Is it possible to switch to using the {{CLASSPATH}} environment variable technique, similar to how the MapReduce code does it? If yes, then there is no need for the proposed patch. There are also some other potential issues with trying to pass classpath on the command line in Windows. It's very easy to hit the Windows maximum command line length limitation of 8191 characters. NodeManager already has the logic to work around this by bundling the classpath into a jar file manifest, and you'd get that for free by using the environment variable technique. Adding separator ApplicationConstants.FILE_PATH_SEPARATOR for better Windows support Key: YARN-2929 URL: https://issues.apache.org/jira/browse/YARN-2929 Project: Hadoop YARN Issue Type: Improvement Reporter: Tsuyoshi OZAWA Assignee: Tsuyoshi OZAWA Attachments: YARN-2929.001.patch Some frameworks like Spark is tackling to run jobs on Windows(SPARK-1825). For better multiple platform support, we should introduce ApplicationConstants.FILE_PATH_SEPARATOR for making filepath platform-independent. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
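A minimal sketch of the environment-variable technique described above, assuming a made-up jar directory layout; the real MRApps/YARNRunner code handles many more cases, but the point is that the classpath travels in the CLASSPATH variable with cross-platform placeholders rather than on the command line.

{code}
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.yarn.api.ApplicationConstants;
import org.apache.hadoop.yarn.api.ApplicationConstants.Environment;

public final class CrossPlatformClasspathSketch {
  /** Build the container environment with a cross-platform classpath:
   *  Environment.PWD.$$() and CLASS_PATH_SEPARATOR are expanded on the
   *  NodeManager side, so neither '/' vs '\' nor ':' vs ';' is baked in
   *  at submission time. The jar-dir layout is a made-up example. */
  static Map<String, String> buildEnv(String appJarDir) {
    Map<String, String> env = new HashMap<String, String>();
    String cp = Environment.PWD.$$()
        + ApplicationConstants.CLASS_PATH_SEPARATOR
        + Environment.PWD.$$() + "/" + appJarDir + "/*";
    env.put(Environment.CLASSPATH.name(), cp);
    return env;
  }
}
{code}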
[jira] [Updated] (YARN-810) Support CGroup ceiling enforcement on CPU
[ https://issues.apache.org/jira/browse/YARN-810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei Yan updated YARN-810: - Attachment: YARN-810-3.patch Rebase a new patch for review. Support CGroup ceiling enforcement on CPU - Key: YARN-810 URL: https://issues.apache.org/jira/browse/YARN-810 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Affects Versions: 2.1.0-beta, 2.0.5-alpha Reporter: Chris Riccomini Assignee: Sandy Ryza Attachments: YARN-810-3.patch, YARN-810.patch, YARN-810.patch Problem statement: YARN currently lets you define an NM's pcore count, and a pcore:vcore ratio. Containers are then allowed to request vcores between the minimum and maximum defined in the yarn-site.xml. In the case where a single-threaded container requests 1 vcore, with a pcore:vcore ratio of 1:4, the container is still allowed to use up to 100% of the core it's using, provided that no other container is also using it. This happens, even though the only guarantee that YARN/CGroups is making is that the container will get at least 1/4th of the core. If a second container then comes along, the second container can take resources from the first, provided that the first container is still getting at least its fair share (1/4th). There are certain cases where this is desirable. There are also certain cases where it might be desirable to have a hard limit on CPU usage, and not allow the process to go above the specified resource requirement, even if it's available. Here's an RFC that describes the problem in more detail: http://lwn.net/Articles/336127/ Solution: As it happens, when CFS is used in combination with CGroups, you can enforce a ceiling using two files in cgroups: {noformat} cpu.cfs_quota_us cpu.cfs_period_us {noformat} The usage of these two files is documented in more detail here: https://access.redhat.com/site/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Resource_Management_Guide/sec-cpu.html Testing: I have tested YARN CGroups using the 2.0.5-alpha implementation. By default, it behaves as described above (it is a soft cap, and allows containers to use more than they asked for). I then tested CFS CPU quotas manually with YARN. First, you can see that CFS is in use in the CGroup, based on the file names: {noformat} [criccomi@eat1-qa464 ~]$ sudo -u app ls -l /cgroup/cpu/hadoop-yarn/ total 0 -r--r--r-- 1 app app 0 Jun 13 16:46 cgroup.procs drwxr-xr-x 2 app app 0 Jun 13 17:08 container_1371141151815_0004_01_02 -rw-r--r-- 1 app app 0 Jun 13 16:46 cpu.cfs_period_us -rw-r--r-- 1 app app 0 Jun 13 16:46 cpu.cfs_quota_us -rw-r--r-- 1 app app 0 Jun 13 16:46 cpu.rt_period_us -rw-r--r-- 1 app app 0 Jun 13 16:46 cpu.rt_runtime_us -rw-r--r-- 1 app app 0 Jun 13 16:46 cpu.shares -r--r--r-- 1 app app 0 Jun 13 16:46 cpu.stat -rw-r--r-- 1 app app 0 Jun 13 16:46 notify_on_release -rw-r--r-- 1 app app 0 Jun 13 16:46 tasks [criccomi@eat1-qa464 ~]$ sudo -u app cat /cgroup/cpu/hadoop-yarn/cpu.cfs_period_us 100000 [criccomi@eat1-qa464 ~]$ sudo -u app cat /cgroup/cpu/hadoop-yarn/cpu.cfs_quota_us -1 {noformat} Oddly, it appears that the cfs_period_us is set to .1s, not 1s. We can place processes in hard limits. I have process 4370 running YARN container container_1371141151815_0003_01_03 on a host. By default, it's running at ~300% cpu usage. {noformat} CPU 4370 criccomi 20 0 1157m 551m 14m S 240.3 0.8 87:10.91 ... 
{noformat} When I set the CFS quota: {noformat} echo 1000 > /cgroup/cpu/hadoop-yarn/container_1371141151815_0003_01_03/cpu.cfs_quota_us CPU 4370 criccomi 20 0 1157m 563m 14m S 1.0 0.8 90:08.39 ... {noformat} It drops to 1% usage, and you can see the box has room to spare: {noformat} Cpu(s): 2.4%us, 1.0%sy, 0.0%ni, 92.2%id, 4.2%wa, 0.0%hi, 0.1%si, 0.0%st {noformat} Turning the quota back to -1: {noformat} echo -1 > /cgroup/cpu/hadoop-yarn/container_1371141151815_0003_01_03/cpu.cfs_quota_us {noformat} Burns the cores again: {noformat} Cpu(s): 11.1%us, 1.7%sy, 0.0%ni, 83.9%id, 3.1%wa, 0.0%hi, 0.2%si, 0.0%st CPU 4370 criccomi 20 0 1157m 563m 14m S 253.9 0.8 89:32.31 ... {noformat} On my dev box, I was testing CGroups by running a python process eight times, to burn through all the cores, since it was doing as described above (giving extra CPU to the process, even with a cpu.shares limit). Toggling the cfs_quota_us seems to enforce a hard limit. Implementation: What do you guys think about introducing a variable to
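A minimal Java sketch of the ceiling mechanism exercised by hand above, deriving cpu.cfs_quota_us from cpu.cfs_period_us for a container cgroup; the path handling and the fraction policy are illustrative, not the proposed implementation.

{code}
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public final class CfsCeilingSketch {
  /** Cap a container cgroup at the given fraction of one CPU's bandwidth
   *  by writing cpu.cfs_period_us and a quota derived from it. */
  static void setCeiling(String containerCgroup, long periodUs, double fraction)
      throws IOException {
    Files.write(Paths.get(containerCgroup, "cpu.cfs_period_us"),
        Long.toString(periodUs).getBytes(StandardCharsets.UTF_8));
    long quotaUs = (long) (periodUs * fraction); // e.g. 0.25 for 1 vcore of 4
    Files.write(Paths.get(containerCgroup, "cpu.cfs_quota_us"),
        Long.toString(quotaUs).getBytes(StandardCharsets.UTF_8));
  }
}
{code}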
[jira] [Updated] (YARN-810) Support CGroup ceiling enforcement on CPU
[ https://issues.apache.org/jira/browse/YARN-810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei Yan updated YARN-810: - Attachment: YARN-810-4.patch Support CGroup ceiling enforcement on CPU - Key: YARN-810 URL: https://issues.apache.org/jira/browse/YARN-810 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Affects Versions: 2.1.0-beta, 2.0.5-alpha Reporter: Chris Riccomini Assignee: Sandy Ryza Attachments: YARN-810-3.patch, YARN-810-4.patch, YARN-810.patch, YARN-810.patch Problem statement: YARN currently lets you define an NM's pcore count, and a pcore:vcore ratio. Containers are then allowed to request vcores between the minimum and maximum defined in the yarn-site.xml. In the case where a single-threaded container requests 1 vcore, with a pcore:vcore ratio of 1:4, the container is still allowed to use up to 100% of the core it's using, provided that no other container is also using it. This happens, even though the only guarantee that YARN/CGroups is making is that the container will get at least 1/4th of the core. If a second container then comes along, the second container can take resources from the first, provided that the first container is still getting at least its fair share (1/4th). There are certain cases where this is desirable. There are also certain cases where it might be desirable to have a hard limit on CPU usage, and not allow the process to go above the specified resource requirement, even if it's available. Here's an RFC that describes the problem in more detail: http://lwn.net/Articles/336127/ Solution: As it happens, when CFS is used in combination with CGroups, you can enforce a ceiling using two files in cgroups: {noformat} cpu.cfs_quota_us cpu.cfs_period_us {noformat} The usage of these two files is documented in more detail here: https://access.redhat.com/site/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Resource_Management_Guide/sec-cpu.html Testing: I have tested YARN CGroups using the 2.0.5-alpha implementation. By default, it behaves as described above (it is a soft cap, and allows containers to use more than they asked for). I then tested CFS CPU quotas manually with YARN. First, you can see that CFS is in use in the CGroup, based on the file names: {noformat} [criccomi@eat1-qa464 ~]$ sudo -u app ls -l /cgroup/cpu/hadoop-yarn/ total 0 -r--r--r-- 1 app app 0 Jun 13 16:46 cgroup.procs drwxr-xr-x 2 app app 0 Jun 13 17:08 container_1371141151815_0004_01_02 -rw-r--r-- 1 app app 0 Jun 13 16:46 cpu.cfs_period_us -rw-r--r-- 1 app app 0 Jun 13 16:46 cpu.cfs_quota_us -rw-r--r-- 1 app app 0 Jun 13 16:46 cpu.rt_period_us -rw-r--r-- 1 app app 0 Jun 13 16:46 cpu.rt_runtime_us -rw-r--r-- 1 app app 0 Jun 13 16:46 cpu.shares -r--r--r-- 1 app app 0 Jun 13 16:46 cpu.stat -rw-r--r-- 1 app app 0 Jun 13 16:46 notify_on_release -rw-r--r-- 1 app app 0 Jun 13 16:46 tasks [criccomi@eat1-qa464 ~]$ sudo -u app cat /cgroup/cpu/hadoop-yarn/cpu.cfs_period_us 100000 [criccomi@eat1-qa464 ~]$ sudo -u app cat /cgroup/cpu/hadoop-yarn/cpu.cfs_quota_us -1 {noformat} Oddly, it appears that the cfs_period_us is set to .1s, not 1s. We can place processes in hard limits. I have process 4370 running YARN container container_1371141151815_0003_01_03 on a host. By default, it's running at ~300% cpu usage. {noformat} CPU 4370 criccomi 20 0 1157m 551m 14m S 240.3 0.8 87:10.91 ... 
{noformat} When I set the CFS quota: {noformat} echo 1000 > /cgroup/cpu/hadoop-yarn/container_1371141151815_0003_01_03/cpu.cfs_quota_us PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 4370 criccomi 20 0 1157m 563m 14m S 1.0 0.8 90:08.39 ... {noformat} It drops to 1% usage, and you can see the box has room to spare: {noformat} Cpu(s): 2.4%us, 1.0%sy, 0.0%ni, 92.2%id, 4.2%wa, 0.0%hi, 0.1%si, 0.0%st {noformat} Turning the quota back to -1: {noformat} echo -1 > /cgroup/cpu/hadoop-yarn/container_1371141151815_0003_01_03/cpu.cfs_quota_us {noformat} Burns the cores again: {noformat} Cpu(s): 11.1%us, 1.7%sy, 0.0%ni, 83.9%id, 3.1%wa, 0.0%hi, 0.2%si, 0.0%st PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 4370 criccomi 20 0 1157m 563m 14m S 253.9 0.8 89:32.31 ... {noformat} On my dev box, I was testing CGroups by running a Python process eight times, to burn through all the cores, since it was behaving as described above (giving extra CPU to the process, even with a cpu.shares limit). Toggling the cfs_quota_us seems to enforce a hard limit. Implementation: What do you guys think about introducing a variable to
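For readers following the YARN-810 thread above: the quota/period pair means "at most quota microseconds of CPU time per period, summed across all cores." A minimal, hypothetical Java sketch of how a hard ceiling could be derived from a container's vcore share; the class and method names here are invented for illustration and are not the NodeManager's actual code:
{noformat}
// Hypothetical sketch: derive a CFS hard ceiling from a container's vcore share.
// None of these names are YARN APIs; the arithmetic is the point.
public final class CfsCeilingSketch {
  // CFS accounting window; 100000 us (0.1 s) matches the default observed above.
  private static final long CFS_PERIOD_US = 100_000L;

  /**
   * Microseconds of CPU time the container may use per period, across all
   * cores. Returns -1 to disable the ceiling (the soft-cap default, where
   * only cpu.shares applies).
   */
  static long cfsQuotaUs(int containerVcores, int nodeVcores, int nodePcores,
                         boolean strictCeiling) {
    if (!strictCeiling) {
      return -1L;
    }
    return CFS_PERIOD_US * nodePcores * containerVcores / nodeVcores;
  }

  public static void main(String[] args) {
    // 1 vcore on a 4-pcore node with a 1:4 pcore:vcore ratio (16 vcores total):
    // prints 25000, i.e. 25000 us per 100000 us period = a hard cap of 1/4 core.
    System.out.println(cfsQuotaUs(1, 16, 4, true));
  }
}
{noformat}
Writing the returned value into the container's cpu.cfs_quota_us (with 100000 in cpu.cfs_period_us) would reproduce the manual echo test above; writing -1 restores the soft-cap behavior.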
[jira] [Commented] (YARN-2970) NodeLabel operations in RMAdmin CLI are missing from the help command.
[ https://issues.apache.org/jira/browse/YARN-2970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14256419#comment-14256419 ] Junping Du commented on YARN-2970: -- Thanks [~varun_saxena] for the patch! I think we should keep the sequence of commands here consistent with the usage info below. I would prefer [help] to show up last, as is the general convention, so we should probably change RMAdminCLI.ADMIN_USAGE. The rest looks fine. NodeLabel operations in RMAdmin CLI are missing from the help command. Key: YARN-2970 URL: https://issues.apache.org/jira/browse/YARN-2970 Project: Hadoop YARN Issue Type: Sub-task Components: api, client, resourcemanager Affects Versions: 2.6.0 Reporter: Junping Du Assignee: Varun Saxena Priority: Minor Attachments: YARN-2970.patch NodeLabel operations in RMAdmin CLI are missing from the help command. While debugging YARN-313, I noticed we should add them like the other cmds: {noformat} yarn rmadmin [-refreshQueues] [-refreshNodes] [-refreshResources] [-refreshSuperUserGroupsConfiguration] [-refreshUserToGroupsMappings] [-refreshAdminAcls] [-refreshServiceAcl] [-getGroup [username]] [-updateNodeResource [NodeID] [MemSize] [vCores] ([OvercommitTimeout]) [-help [cmd]] -refreshQueues: Reload the queues' acls, states and scheduler specific properties. ResourceManager will reload the mapred-queues configuration file. -refreshNodes: Refresh the hosts information at the ResourceManager. -refreshResources: Refresh resources of NodeManagers at the ResourceManager. -refreshSuperUserGroupsConfiguration: Refresh superuser proxy groups mappings -refreshUserToGroupsMappings: Refresh user-to-groups mappings -refreshAdminAcls: Refresh acls for administration of ResourceManager -refreshServiceAcl: Reload the service-level authorization policy file. ResoureceManager will reload the authorization policy file. -getGroups [username]: Get the groups which given user belongs to. -updateNodeResource [NodeID] [MemSize] [vCores] ([OvercommitTimeout]): Update resource on specific node. -help [cmd]: Displays help for the given command or all commands if none is specified. -addToClusterNodeLabels [label1,label2,label3] (label splitted by ,): add to cluster node labels -removeFromClusterNodeLabels [label1,label2,label3] (label splitted by ,): remove from cluster node labels -replaceLabelsOnNode [node1:port,label1,label2 node2:port,label1,label2]: replace labels on nodes -directlyAccessNodeLabelStore: Directly access node label store, with this option, all node label related operations will not connect RM. Instead, they will access/modify stored node labels directly. By default, it is false (access via RM). AND PLEASE NOTE: if you configured yarn.node-labels.fs-store.root-dir to a local directory (instead of NFS or HDFS), this option will only work when the command run on the machine where RM is running. {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
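On the ADMIN_USAGE point above: one way to keep the one-line usage summary and the per-command help in the same order, with -help appended last, is to drive both from a single insertion-ordered map. A hedged sketch only (not the actual RMAdminCLI code; command list abbreviated):
{noformat}
// Illustrative only: one ordered map drives both the summary line and the
// detailed help, so the two can never drift out of sync, and -help is kept
// last simply by inserting it last.
import java.util.LinkedHashMap;
import java.util.Map;

public class AdminUsageSketch {
  public static void main(String[] args) {
    Map<String, String> usage = new LinkedHashMap<>();
    usage.put("-refreshQueues", "Reload the queues' acls, states and scheduler specific properties.");
    usage.put("-refreshNodes", "Refresh the hosts information at the ResourceManager.");
    usage.put("-addToClusterNodeLabels [label1,label2,label3]", "add to cluster node labels");
    usage.put("-help [cmd]", "Displays help for the given command or all commands if none is specified.");

    StringBuilder summary = new StringBuilder("Usage: yarn rmadmin");
    for (String cmd : usage.keySet()) {
      summary.append(" [").append(cmd).append(']');
    }
    System.out.println(summary);
    usage.forEach((cmd, desc) -> System.out.printf("   %s: %s%n", cmd, desc));
  }
}
{noformat}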
[jira] [Commented] (YARN-2939) Fix new findbugs warnings in hadoop-yarn-common
[ https://issues.apache.org/jira/browse/YARN-2939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14256442#comment-14256442 ] Junping Du commented on YARN-2939: -- bq. Junping Du, Jian He, if you have time, can you have a look at other 3 similar issues as well. YARN-2940, YARN-2937 and YARN-2938 Sure. I will review them today. Fix new findbugs warnings in hadoop-yarn-common --- Key: YARN-2939 URL: https://issues.apache.org/jira/browse/YARN-2939 Project: Hadoop YARN Issue Type: Improvement Reporter: Varun Saxena Assignee: Li Lu Labels: findbugs Fix For: 2.7.0 Attachments: YARN-2939-120914.patch, YARN-2939-121614.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-810) Support CGroup ceiling enforcement on CPU
[ https://issues.apache.org/jira/browse/YARN-810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14256508#comment-14256508 ] Hadoop QA commented on YARN-810: {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12688757/YARN-810-3.patch against trunk revision fdf042d. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 8 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:red}-1 findbugs{color}. The patch appears to introduce 52 new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-tools/hadoop-sls hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/6170//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/6170//artifact/patchprocess/newPatchFindbugsWarningshadoop-sls.html Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/6170//artifact/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-common.html Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/6170//artifact/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-nodemanager.html Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/6170//artifact/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-resourcemanager.html Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/6170//artifact/patchprocess/newPatchFindbugsWarningshadoop-yarn-applications-distributedshell.html Console output: https://builds.apache.org/job/PreCommit-YARN-Build/6170//console This message is automatically generated. Support CGroup ceiling enforcement on CPU - Key: YARN-810 URL: https://issues.apache.org/jira/browse/YARN-810 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Affects Versions: 2.1.0-beta, 2.0.5-alpha Reporter: Chris Riccomini Assignee: Sandy Ryza Attachments: YARN-810-3.patch, YARN-810-4.patch, YARN-810.patch, YARN-810.patch Problem statement: YARN currently lets you define an NM's pcore count, and a pcore:vcore ratio. Containers are then allowed to request vcores between the minimum and maximum defined in the yarn-site.xml. In the case where a single-threaded container requests 1 vcore, with a pcore:vcore ratio of 1:4, the container is still allowed to use up to 100% of the core it's using, provided that no other container is also using it. This happens, even though the only guarantee that YARN/CGroups is making is that the container will get at least 1/4th of the core.
If a second container then comes along, the second container can take resources from the first, provided that the first container is still getting at least its fair share (1/4th). There are certain cases where this is desirable. There are also certain cases where it might be desirable to have a hard limit on CPU usage, and not allow the process to go above the specified resource requirement, even if it's available. Here's an RFC that describes the problem in more detail: http://lwn.net/Articles/336127/ Solution: As it happens, when CFS is used in combination with CGroups, you can enforce a ceiling using two files in cgroups: {noformat} cpu.cfs_quota_us cpu.cfs_period_us {noformat} The usage of these two files is documented in more detail here: https://access.redhat.com/site/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Resource_Management_Guide/sec-cpu.html Testing: I have tested YARN CGroups using the 2.0.5-alpha implementation. By default, it behaves as described above (it is a soft cap, and allows containers to use more than they asked for). I then tested CFS CPU quotas manually with YARN. First, you can see that CFS is in use in the CGroup, based on the file names: {noformat}
[jira] [Commented] (YARN-810) Support CGroup ceiling enforcement on CPU
[ https://issues.apache.org/jira/browse/YARN-810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14256529#comment-14256529 ] Hadoop QA commented on YARN-810: {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12688764/YARN-810-4.patch against trunk revision fdf042d. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 8 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:red}-1 findbugs{color}. The patch appears to introduce 52 new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-tools/hadoop-sls hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart Test results: https://builds.apache.org/job/PreCommit-YARN-Build/6171//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/6171//artifact/patchprocess/newPatchFindbugsWarningshadoop-sls.html Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/6171//artifact/patchprocess/newPatchFindbugsWarningshadoop-yarn-applications-distributedshell.html Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/6171//artifact/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-common.html Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/6171//artifact/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-nodemanager.html Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/6171//artifact/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-resourcemanager.html Console output: https://builds.apache.org/job/PreCommit-YARN-Build/6171//console This message is automatically generated. Support CGroup ceiling enforcement on CPU - Key: YARN-810 URL: https://issues.apache.org/jira/browse/YARN-810 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Affects Versions: 2.1.0-beta, 2.0.5-alpha Reporter: Chris Riccomini Assignee: Sandy Ryza Attachments: YARN-810-3.patch, YARN-810-4.patch, YARN-810.patch, YARN-810.patch Problem statement: YARN currently lets you define an NM's pcore count, and a pcore:vcore ratio. Containers are then allowed to request vcores between the minimum and maximum defined in the yarn-site.xml. In the case where a single-threaded container requests 1 vcore, with a pcore:vcore ratio of 1:4, the container is still allowed to use up to 100% of the core it's using, provided that no other container is also using it. This happens, even though the only guarantee that YARN/CGroups is making is that the container will get at least 1/4th of the core.
If a second container then comes along, the second container can take resources from the first, provided that the first container is still getting at least its fair share (1/4th). There are certain cases where this is desirable. There are also certain cases where it might be desirable to have a hard limit on CPU usage, and not allow the process to go above the specified resource requirement, even if it's available. Here's an RFC that describes the problem in more detail: http://lwn.net/Articles/336127/ Solution: As it happens, when CFS is used in combination with CGroups, you can enforce a ceiling using two files in cgroups: {noformat} cpu.cfs_quota_us cpu.cfs_period_us {noformat} The usage of these two files is documented in more detail here: https://access.redhat.com/site/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Resource_Management_Guide/sec-cpu.html Testing: I have tested YARN CGroups using the 2.0.5-alpha implementation. By default, it behaves as described above (it is a soft cap, and allows containers to use more than they asked for). I then tested CFS CPU quotas manually with YARN. First, you
[jira] [Commented] (YARN-914) Support graceful decommission of nodemanager
[ https://issues.apache.org/jira/browse/YARN-914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14256547#comment-14256547 ] Junping Du commented on YARN-914: - Hi [~mingma], Thanks for the comments here. bq. So YARN will reduce the capacity of the nodes as part of the decommission process until all its map output are fetched or until all the applications the node touches have completed? Yes. I am not sure it is necessary for YARN to additionally mark the node as decommissioned, since the node's resource is already updated to 0 and no container will get a chance to be allocated on it. The auxiliary service should still be running, which shouldn't consume much resource if there are no service requests. bq. In addition, it will be interesting to understand how you handle long running jobs. Do you mean long-running services? First, I think we should support a timeout when draining the node's resources (ResourceOption already has a timeout in the design), so running containers are preempted if they run out of time. Second, we should support a special container tag for long-running services (some discussion in YARN-1039) so we don't have to waste time waiting for those containers to finish until the timeout. Third, from an operations perspective, we could add a long-running label to specific nodes and try not to decommission nodes with the long-running tag. Let me know if this makes sense to you. Support graceful decommission of nodemanager Key: YARN-914 URL: https://issues.apache.org/jira/browse/YARN-914 Project: Hadoop YARN Issue Type: Improvement Affects Versions: 2.0.4-alpha Reporter: Luke Lu Assignee: Junping Du When NMs are decommissioned for non-fault reasons (capacity change etc.), it's desirable to minimize the impact to running applications. Currently if a NM is decommissioned, all running containers on the NM need to be rescheduled on other NMs. Furthermore, for finished map tasks, if their map output has not been fetched by the reducers of the job, these map tasks will need to be rerun as well. We propose to introduce a mechanism to optionally gracefully decommission a node manager. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
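To make the drain-with-timeout idea above concrete, here is a hedged sketch of the flow being described: zero the node's capacity, wait for it to drain (or for the timeout to expire), preempt whatever is left, then decommission. Every type and method name below is invented for illustration; none of it is the YARN API.
{noformat}
// Invented types; only the control flow mirrors the proposal above.
interface NodeHandle {
  void setCapacityToZero();           // update ResourceOption to 0: no new containers
  boolean busy();                     // running containers, or map output not yet fetched
  void preemptRemainingContainers();  // applied once the drain timeout expires
  void markDecommissioned();
}

class GracefulDecommissioner {
  /** Drain the node, preempt leftovers after timeoutMs, then decommission. */
  void decommission(NodeHandle node, long timeoutMs) throws InterruptedException {
    node.setCapacityToZero();
    long deadline = System.currentTimeMillis() + timeoutMs;
    while (node.busy() && System.currentTimeMillis() < deadline) {
      Thread.sleep(1_000L);           // poll until drained or timed out
    }
    if (node.busy()) {
      node.preemptRemainingContainers();
    }
    node.markDecommissioned();
  }
}
{noformat}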
[jira] [Commented] (YARN-2340) NPE thrown when RM restarts after queue is STOPPED. Thereafter RM cannot recover applications and remains in standby
[ https://issues.apache.org/jira/browse/YARN-2340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14256584#comment-14256584 ] Hadoop QA commented on YARN-2340: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12687993/0001-YARN-2340.patch against trunk revision fdf042d. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:red}-1 findbugs{color}. The patch appears to introduce 15 new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The following test timeouts occurred in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.yarn.server.resourcemanager.ahs.TestRMApplicationHistoryWriter Test results: https://builds.apache.org/job/PreCommit-YARN-Build/6173//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/6173//artifact/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-resourcemanager.html Console output: https://builds.apache.org/job/PreCommit-YARN-Build/6173//console This message is automatically generated. NPE thrown when RM restarts after queue is STOPPED. Thereafter RM cannot recover applications and remains in standby -- Key: YARN-2340 URL: https://issues.apache.org/jira/browse/YARN-2340 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager, scheduler Affects Versions: 2.4.1 Environment: CapacityScheduler with queues a, b Reporter: Nishan Shetty Assignee: Rohith Priority: Critical Attachments: 0001-YARN-2340.patch While a job is in progress, set the queue state to STOPPED and then restart the RM. Observe that the standby RM fails to come up as active, throwing the NPE below: 2014-07-23 18:43:24,432 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1406116264351_0014_02 State change from NEW to SUBMITTED 2014-07-23 18:43:24,433 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in handling event type APP_ATTEMPT_ADDED to the scheduler java.lang.NullPointerException at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.addApplicationAttempt(CapacityScheduler.java:568) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:916) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:101) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:602) at java.lang.Thread.run(Thread.java:662) 2014-07-23 18:43:24,434 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Exiting, bbye.. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2946) DeadLocks in RMStateStore-ZKRMStateStore
[ https://issues.apache.org/jira/browse/YARN-2946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14256589#comment-14256589 ] Rohith commented on YARN-2946: -- Checked the test failures: TestContainerAllocation fails intermittently in trunk, and I am looking into the TestRMRestart test failure. DeadLocks in RMStateStore-ZKRMStateStore -- Key: YARN-2946 URL: https://issues.apache.org/jira/browse/YARN-2946 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.7.0 Reporter: Rohith Assignee: Rohith Priority: Blocker Attachments: 0001-YARN-2946.patch, 0001-YARN-2946.patch, 0002-YARN-2946.patch, 0003-YARN-2946.patch, RM_BeforeFix_Deadlock_cycle_1.png, RM_BeforeFix_Deadlock_cycle_2.png, TestYARN2946.java Found one deadlock in ZKRMStateStore. # In the initial stage, zkClient is null because of a zk disconnected event. # While ZKRMStateStore#runWithCheck() does wait(zkSessionTimeout) for zkClient to re-establish the zookeeper connection via either a synconnected or expired event, it is highly possible that some other thread obtains the lock on {{ZKRMStateStore.this}} from state machine transition events. This causes a deadlock in ZKRMStateStore. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
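For readers unfamiliar with the failure mode: the report above describes a lock cycle between the store's monitor and the locks taken during state-machine transitions. The following is a deliberately simplified, self-contained demo of that general pattern, not the RM code; it shows only the shape of such a cycle, with two threads acquiring the same two monitors in opposite order.
{noformat}
// Both threads block forever on the second monitor. Run it and take a
// jstack dump to see a deadlock cycle like the attached screenshots describe.
public class LockCycleDemo {
  private static final Object storeLock = new Object();        // stands in for ZKRMStateStore.this
  private static final Object stateMachineLock = new Object(); // stands in for a transition lock

  public static void main(String[] args) {
    Thread storeOp = new Thread(() -> {
      synchronized (storeLock) {
        pause();                          // e.g. waiting out a ZK reconnect window
        synchronized (stateMachineLock) { /* never reached */ }
      }
    });
    Thread transition = new Thread(() -> {
      synchronized (stateMachineLock) {
        pause();
        synchronized (storeLock) { /* never reached */ }
      }
    });
    storeOp.start();
    transition.start();
  }

  private static void pause() {
    try { Thread.sleep(100); } catch (InterruptedException ignored) { }
  }
}
{noformat}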
[jira] [Commented] (YARN-2946) DeadLocks in RMStateStore-ZKRMStateStore
[ https://issues.apache.org/jira/browse/YARN-2946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14256592#comment-14256592 ] Hadoop QA commented on YARN-2946: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12688659/0003-YARN-2946.patch against trunk revision fdf042d. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 2 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:red}-1 findbugs{color}. The patch appears to introduce 15 new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The following test timeouts occurred in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart Test results: https://builds.apache.org/job/PreCommit-YARN-Build/6172//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/6172//artifact/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-resourcemanager.html Console output: https://builds.apache.org/job/PreCommit-YARN-Build/6172//console This message is automatically generated. DeadLocks in RMStateStore-ZKRMStateStore -- Key: YARN-2946 URL: https://issues.apache.org/jira/browse/YARN-2946 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.7.0 Reporter: Rohith Assignee: Rohith Priority: Blocker Attachments: 0001-YARN-2946.patch, 0001-YARN-2946.patch, 0002-YARN-2946.patch, 0003-YARN-2946.patch, RM_BeforeFix_Deadlock_cycle_1.png, RM_BeforeFix_Deadlock_cycle_2.png, TestYARN2946.java Found one deadlock in ZKRMStateStore. # In the initial stage, zkClient is null because of a zk disconnected event. # While ZKRMStateStore#runWithCheck() does wait(zkSessionTimeout) for zkClient to re-establish the zookeeper connection via either a synconnected or expired event, it is highly possible that some other thread obtains the lock on {{ZKRMStateStore.this}} from state machine transition events. This causes a deadlock in ZKRMStateStore. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2946) DeadLocks in RMStateStore-ZKRMStateStore
[ https://issues.apache.org/jira/browse/YARN-2946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14256618#comment-14256618 ] Rohith commented on YARN-2946: -- Since handleStoreEvent() is called from the event dispatcher for RMApp store events and synchronously for DT store events, TestRMRestart was overriding handleStoreEvent() to simulate the test scenario, which was causing the startup failure. Will correct the test case and update the patch. DeadLocks in RMStateStore-ZKRMStateStore -- Key: YARN-2946 URL: https://issues.apache.org/jira/browse/YARN-2946 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.7.0 Reporter: Rohith Assignee: Rohith Priority: Blocker Attachments: 0001-YARN-2946.patch, 0001-YARN-2946.patch, 0002-YARN-2946.patch, 0003-YARN-2946.patch, RM_BeforeFix_Deadlock_cycle_1.png, RM_BeforeFix_Deadlock_cycle_2.png, TestYARN2946.java Found one deadlock in ZKRMStateStore. # In the initial stage, zkClient is null because of a zk disconnected event. # While ZKRMStateStore#runWithCheck() does wait(zkSessionTimeout) for zkClient to re-establish the zookeeper connection via either a synconnected or expired event, it is highly possible that some other thread obtains the lock on {{ZKRMStateStore.this}} from state machine transition events. This causes a deadlock in ZKRMStateStore. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2340) NPE thrown when RM restarts after queue is STOPPED. Thereafter RM cannot recover applications and remains in standby
[ https://issues.apache.org/jira/browse/YARN-2340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14256631#comment-14256631 ] Rohith commented on YARN-2340: -- There are so many tests failing randomly in trunk!! NPE thrown when RM restarts after queue is STOPPED. Thereafter RM cannot recover applications and remains in standby -- Key: YARN-2340 URL: https://issues.apache.org/jira/browse/YARN-2340 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager, scheduler Affects Versions: 2.4.1 Environment: CapacityScheduler with queues a, b Reporter: Nishan Shetty Assignee: Rohith Priority: Critical Attachments: 0001-YARN-2340.patch While a job is in progress, set the queue state to STOPPED and then restart the RM. Observe that the standby RM fails to come up as active, throwing the NPE below: 2014-07-23 18:43:24,432 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1406116264351_0014_02 State change from NEW to SUBMITTED 2014-07-23 18:43:24,433 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in handling event type APP_ATTEMPT_ADDED to the scheduler java.lang.NullPointerException at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.addApplicationAttempt(CapacityScheduler.java:568) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:916) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:101) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:602) at java.lang.Thread.run(Thread.java:662) 2014-07-23 18:43:24,434 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Exiting, bbye.. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2340) NPE thrown when RM restarts after queue is STOPPED. Thereafter RM cannot recover applications and remains in standby
[ https://issues.apache.org/jira/browse/YARN-2340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14256637#comment-14256637 ] Jian He commented on YARN-2340: --- right, we should spend time fixing these.. NPE thrown when RM restarts after queue is STOPPED. Thereafter RM cannot recover applications and remains in standby -- Key: YARN-2340 URL: https://issues.apache.org/jira/browse/YARN-2340 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager, scheduler Affects Versions: 2.4.1 Environment: CapacityScheduler with queues a, b Reporter: Nishan Shetty Assignee: Rohith Priority: Critical Attachments: 0001-YARN-2340.patch While a job is in progress, set the queue state to STOPPED and then restart the RM. Observe that the standby RM fails to come up as active, throwing the NPE below: 2014-07-23 18:43:24,432 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1406116264351_0014_02 State change from NEW to SUBMITTED 2014-07-23 18:43:24,433 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in handling event type APP_ATTEMPT_ADDED to the scheduler java.lang.NullPointerException at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.addApplicationAttempt(CapacityScheduler.java:568) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:916) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:101) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:602) at java.lang.Thread.run(Thread.java:662) 2014-07-23 18:43:24,434 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Exiting, bbye.. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2340) NPE thrown when RM restarts after queue is STOPPED. Thereafter RM cannot recover applications and remains in standby
[ https://issues.apache.org/jira/browse/YARN-2340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14256641#comment-14256641 ] Hudson commented on YARN-2340: -- FAILURE: Integrated in Hadoop-trunk-Commit #6777 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/6777/]) YARN-2340. Fixed NPE when queue is stopped during RM restart. Contributed by Rohith Sharmaks (jianhe: rev 0d89859b51157078cc504ac81dc8aa75ce6b1782) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestWorkPreservingRMRestart.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacityScheduler.java NPE thrown when RM restarts after queue is STOPPED. Thereafter RM cannot recover applications and remains in standby -- Key: YARN-2340 URL: https://issues.apache.org/jira/browse/YARN-2340 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager, scheduler Affects Versions: 2.4.1 Environment: CapacityScheduler with queues a, b Reporter: Nishan Shetty Assignee: Rohith Priority: Critical Fix For: 2.7.0 Attachments: 0001-YARN-2340.patch While a job is in progress, set the queue state to STOPPED and then restart the RM. Observe that the standby RM fails to come up as active, throwing the NPE below: 2014-07-23 18:43:24,432 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1406116264351_0014_02 State change from NEW to SUBMITTED 2014-07-23 18:43:24,433 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in handling event type APP_ATTEMPT_ADDED to the scheduler java.lang.NullPointerException at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.addApplicationAttempt(CapacityScheduler.java:568) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:916) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:101) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:602) at java.lang.Thread.run(Thread.java:662) 2014-07-23 18:43:24,434 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Exiting, bbye.. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
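The commit above changes CapacityScheduler.java on the NPE path shown in the stack trace. A hedged sketch of the defensive pattern such a fix implies, with invented stand-in types rather than the actual CapacityScheduler code: when the application is absent because its queue was STOPPED before restart, reject the recovered attempt instead of dereferencing null and killing the scheduler's event-handling thread.
{noformat}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class AddAttemptSketch {
  private final Map<String, Object> applications = new ConcurrentHashMap<>();

  void addApplicationAttempt(String appId, String attemptId) {
    Object app = applications.get(appId);
    if (app == null) {
      // Recovered attempt for an app that was never re-added (e.g. its queue
      // was STOPPED before restart): log and bail out rather than letting an
      // unguarded dereference throw an NPE in the event dispatcher.
      System.err.println("Application " + appId + " not found; ignoring attempt " + attemptId);
      return;
    }
    // ... normal attempt registration would continue from here ...
  }
}
{noformat}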