[jira] [Commented] (YARN-3424) Reduce log for ContainerMonitorImpl resoure monitoring from info to debug
[ https://issues.apache.org/jira/browse/YARN-3424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14390052#comment-14390052 ] Hadoop QA commented on YARN-3424: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12708505/YARN-3424.001.patch against trunk revision 2daa478. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/7186//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/7186//console This message is automatically generated. Reduce log for ContainerMonitorImpl resoure monitoring from info to debug - Key: YARN-3424 URL: https://issues.apache.org/jira/browse/YARN-3424 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Reporter: Anubhav Dhoot Assignee: Anubhav Dhoot Attachments: YARN-3424.001.patch Today we log the memory usage of process at info level which spams the log with hundreds of log lines {noformat} 2015-03-27 09:32:48,905 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 9215 for container-id container_1427462602546_0002_01_08: 189.8 MB of 1 GB physical memory used; 2.6 GB of 2.1 GB virtual memory used {noformat} Proposing changing this to debug level -- This message was sent by Atlassian JIRA (v6.3.4#6332)
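For reference, the change being proposed amounts to the usual guarded debug-logging pattern: build and emit the per-container usage line only when DEBUG is enabled, so steady-state monitoring no longer floods the NM log at INFO. The class below is a standalone sketch of that pattern only, not the actual ContainersMonitorImpl patch; names and units are illustrative.

{noformat}
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;

public class ContainerUsageLogger {
  private static final Log LOG = LogFactory.getLog(ContainerUsageLogger.class);

  // Sketch: the per-container usage line is only built and emitted at DEBUG,
  // so routine monitoring no longer spams the NodeManager log at INFO.
  void logUsage(String containerId, String pId,
      long pmemUsed, long pmemLimit, long vmemUsed, long vmemLimit) {
    if (LOG.isDebugEnabled()) {
      LOG.debug("Memory usage of ProcessTree " + pId + " for container-id "
          + containerId + ": " + pmemUsed + " of " + pmemLimit
          + " physical memory used; " + vmemUsed + " of " + vmemLimit
          + " virtual memory used");
    }
  }
}
{noformat}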
[jira] [Commented] (YARN-2962) ZKRMStateStore: Limit the number of znodes under a znode
[ https://issues.apache.org/jira/browse/YARN-2962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14390076#comment-14390076 ] Arun Suresh commented on YARN-2962: --- .. alternative to starting the index from the front. ZKRMStateStore: Limit the number of znodes under a znode Key: YARN-2962 URL: https://issues.apache.org/jira/browse/YARN-2962 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Affects Versions: 2.6.0 Reporter: Karthik Kambatla Assignee: Varun Saxena Priority: Critical Attachments: YARN-2962.01.patch We ran into this issue where we were hitting the default ZK server message size configs, primarily because the message had too many znodes even though individually they were all small. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3429) TestAMRMTokens.testTokenExpiry fails Intermittently with error message:Invalid AMRMToken
[ https://issues.apache.org/jira/browse/YARN-3429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14390156#comment-14390156 ] Hadoop QA commented on YARN-3429: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12708614/YARN-3429.000.patch against trunk revision 2daa478. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/7187//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/7187//console This message is automatically generated. TestAMRMTokens.testTokenExpiry fails Intermittently with error message:Invalid AMRMToken Key: YARN-3429 URL: https://issues.apache.org/jira/browse/YARN-3429 Project: Hadoop YARN Issue Type: Bug Components: test Reporter: zhihai xu Assignee: zhihai xu Attachments: YARN-3429.000.patch TestAMRMTokens.testTokenExpiry fails Intermittently with error message:Invalid AMRMToken from appattempt_1427804754787_0001_01 The error logs is at https://builds.apache.org/job/PreCommit-YARN-Build/7172//testReport/org.apache.hadoop.yarn.server.resourcemanager.security/TestAMRMTokens/testTokenExpiry_1_/ -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3424) Change logs for ContainerMonitorImpl's resourse monitoring from info to debug
[ https://issues.apache.org/jira/browse/YARN-3424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi Ozawa updated YARN-3424: - Summary: Change logs for ContainerMonitorImpl's resourse monitoring from info to debug (was: Reduce log for ContainerMonitorImpl resoure monitoring from info to debug) Change logs for ContainerMonitorImpl's resourse monitoring from info to debug - Key: YARN-3424 URL: https://issues.apache.org/jira/browse/YARN-3424 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Reporter: Anubhav Dhoot Assignee: Anubhav Dhoot Attachments: YARN-3424.001.patch Today we log the memory usage of process at info level which spams the log with hundreds of log lines {noformat} 2015-03-27 09:32:48,905 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 9215 for container-id container_1427462602546_0002_01_08: 189.8 MB of 1 GB physical memory used; 2.6 GB of 2.1 GB virtual memory used {noformat} Proposing changing this to debug level -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3429) TestAMRMTokens.testTokenExpiry fails Intermittently with error message:Invalid AMRMToken
[ https://issues.apache.org/jira/browse/YARN-3429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu updated YARN-3429: Attachment: YARN-3429.000.patch TestAMRMTokens.testTokenExpiry fails Intermittently with error message:Invalid AMRMToken Key: YARN-3429 URL: https://issues.apache.org/jira/browse/YARN-3429 Project: Hadoop YARN Issue Type: Bug Components: test Reporter: zhihai xu Assignee: zhihai xu Attachments: YARN-3429.000.patch TestAMRMTokens.testTokenExpiry fails Intermittently with error message:Invalid AMRMToken from appattempt_1427804754787_0001_01 The error logs is at https://builds.apache.org/job/PreCommit-YARN-Build/7172//testReport/org.apache.hadoop.yarn.server.resourcemanager.security/TestAMRMTokens/testTokenExpiry_1_/ -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3410) YARN admin should be able to remove individual application records from RMStateStore
[ https://issues.apache.org/jira/browse/YARN-3410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14390128#comment-14390128 ] Rohith commented on YARN-3410: -- Just like the way YARN-2131 is handled, I think there is a choice between a start-up option vs. admin support. If both are in sync, then it would be better. YARN admin should be able to remove individual application records from RMStateStore Key: YARN-3410 URL: https://issues.apache.org/jira/browse/YARN-3410 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager, yarn Reporter: Wangda Tan Assignee: Rohith Priority: Critical When the RM state store enters an unexpected state (one example is YARN-2340, where an attempt is not in a final state but the app has already completed), the RM can never come up unless the RMStateStore is formatted. I think we should support removing individual application records from the RMStateStore to unblock the RM admin to make the choice of either waiting for a fix or formatting the state store. In addition, the RM should be able to report all fatal errors (which will shut down the RM) during app recovery; this can save the admin some time in removing apps in a bad state. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
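For the ZK-backed store, "removing an individual application record" amounts to deleting that application's znode subtree so recovery can proceed past it. The sketch below is a hand-rolled illustration of that idea only: the znode layout (/rmstore/ZKRMStateRoot/RMAppRoot/&lt;appId&gt;) is assumed from the default configuration, the class and argument handling are hypothetical, and this is not the admin CLI this JIRA proposes. As YARN-2268 below points out, touching the store while an RM is running is unsafe.

{noformat}
import org.apache.zookeeper.ZKUtil;
import org.apache.zookeeper.ZooKeeper;

public class RemoveAppRecordSketch {
  // Usage (illustrative): RemoveAppRecordSketch <zk-quorum> <application-id>
  // Deletes the application's subtree from the ZK state store; the parent
  // path below is the default layout and is an assumption, not read from
  // configuration.
  public static void main(String[] args) throws Exception {
    String quorum = args[0];                       // e.g. zk1:2181,zk2:2181
    String appId = args[1];                        // e.g. application_1427462602546_0002
    ZooKeeper zk = new ZooKeeper(quorum, 10000, event -> { });
    try {
      ZKUtil.deleteRecursive(zk, "/rmstore/ZKRMStateRoot/RMAppRoot/" + appId);
    } finally {
      zk.close();
    }
  }
}
{noformat}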
[jira] [Updated] (YARN-3424) Change logs for ContainerMonitorImpl's resourse monitoring from info to debug
[ https://issues.apache.org/jira/browse/YARN-3424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi Ozawa updated YARN-3424: - Affects Version/s: 2.7.0 Change logs for ContainerMonitorImpl's resourse monitoring from info to debug - Key: YARN-3424 URL: https://issues.apache.org/jira/browse/YARN-3424 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Affects Versions: 2.7.0 Reporter: Anubhav Dhoot Assignee: Anubhav Dhoot Attachments: YARN-3424.001.patch Today we log the memory usage of process at info level which spams the log with hundreds of log lines {noformat} 2015-03-27 09:32:48,905 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 9215 for container-id container_1427462602546_0002_01_08: 189.8 MB of 1 GB physical memory used; 2.6 GB of 2.1 GB virtual memory used {noformat} Proposing changing this to debug level -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2624) Resource Localization fails on a cluster due to existing cache directories
[ https://issues.apache.org/jira/browse/YARN-2624?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] vishal.rajan updated YARN-2624: --- Target Version/s: (was: 2.6.0) Affects Version/s: 2.6.0 Resource Localization fails on a cluster due to existing cache directories -- Key: YARN-2624 URL: https://issues.apache.org/jira/browse/YARN-2624 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.6.0, 2.5.1 Reporter: Anubhav Dhoot Assignee: Anubhav Dhoot Priority: Blocker Fix For: 2.6.0 Attachments: YARN-2624.001.patch, YARN-2624.001.patch We have found resource localization fails on a cluster with following error in certain cases. {noformat} INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Failed to download rsrc { { hdfs://blahhostname:8020/tmp/hive-hive/hive_2014-09-29_14-55-45_184_6531377394813896912-12/-mr-10004/95a07b90-2448-48fc-bcda-cdb7400b4975/map.xml, 1412027745352, FILE, null },pending,[(container_1411670948067_0009_02_01)],443533288192637,DOWNLOADING} java.io.IOException: Rename cannot overwrite non empty destination directory /data/yarn/nm/filecache/27 at org.apache.hadoop.fs.AbstractFileSystem.renameInternal(AbstractFileSystem.java:716) at org.apache.hadoop.fs.FilterFs.renameInternal(FilterFs.java:228) at org.apache.hadoop.fs.AbstractFileSystem.rename(AbstractFileSystem.java:659) at org.apache.hadoop.fs.FileContext.rename(FileContext.java:906) at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:366) at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:59) {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2624) Resource Localization fails on a cluster due to existing cache directories
[ https://issues.apache.org/jira/browse/YARN-2624?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14390319#comment-14390319 ] vishal.rajan commented on YARN-2624: please verify and reopen the jira Resource Localization fails on a cluster due to existing cache directories -- Key: YARN-2624 URL: https://issues.apache.org/jira/browse/YARN-2624 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.6.0, 2.5.1 Reporter: Anubhav Dhoot Assignee: Anubhav Dhoot Priority: Blocker Fix For: 2.6.0 Attachments: YARN-2624.001.patch, YARN-2624.001.patch We have found resource localization fails on a cluster with following error in certain cases. {noformat} INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Failed to download rsrc { { hdfs://blahhostname:8020/tmp/hive-hive/hive_2014-09-29_14-55-45_184_6531377394813896912-12/-mr-10004/95a07b90-2448-48fc-bcda-cdb7400b4975/map.xml, 1412027745352, FILE, null },pending,[(container_1411670948067_0009_02_01)],443533288192637,DOWNLOADING} java.io.IOException: Rename cannot overwrite non empty destination directory /data/yarn/nm/filecache/27 at org.apache.hadoop.fs.AbstractFileSystem.renameInternal(AbstractFileSystem.java:716) at org.apache.hadoop.fs.FilterFs.renameInternal(FilterFs.java:228) at org.apache.hadoop.fs.AbstractFileSystem.rename(AbstractFileSystem.java:659) at org.apache.hadoop.fs.FileContext.rename(FileContext.java:906) at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:366) at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:59) {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3424) Change logs for ContainerMonitorImpl's resourse monitoring from info to debug
[ https://issues.apache.org/jira/browse/YARN-3424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi Ozawa updated YARN-3424: - Issue Type: Improvement (was: Bug) Change logs for ContainerMonitorImpl's resourse monitoring from info to debug - Key: YARN-3424 URL: https://issues.apache.org/jira/browse/YARN-3424 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Affects Versions: 2.7.0 Reporter: Anubhav Dhoot Assignee: Anubhav Dhoot Attachments: YARN-3424.001.patch Today we log the memory usage of process at info level which spams the log with hundreds of log lines {noformat} 2015-03-27 09:32:48,905 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 9215 for container-id container_1427462602546_0002_01_08: 189.8 MB of 1 GB physical memory used; 2.6 GB of 2.1 GB virtual memory used {noformat} Proposing changing this to debug level -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3225) New parameter or CLI for decommissioning node gracefully in RMAdmin CLI
[ https://issues.apache.org/jira/browse/YARN-3225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Devaraj K updated YARN-3225: Attachment: YARN-3225-3.patch New parameter or CLI for decommissioning node gracefully in RMAdmin CLI --- Key: YARN-3225 URL: https://issues.apache.org/jira/browse/YARN-3225 Project: Hadoop YARN Issue Type: Sub-task Reporter: Junping Du Assignee: Devaraj K Attachments: YARN-3225-1.patch, YARN-3225-2.patch, YARN-3225-3.patch, YARN-3225.patch, YARN-914.patch New CLI (or existing CLI with parameters) should put each node on decommission list to decommissioning status and track timeout to terminate the nodes that haven't get finished. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2624) Resource Localization fails on a cluster due to existing cache directories
[ https://issues.apache.org/jira/browse/YARN-2624?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14390305#comment-14390305 ] vishal.rajan commented on YARN-2624: seems like this issue still persists in YARN 2.6.0 under certain conditions. Dump of the log relating to this issue. 15/04/01 12:13:20 ERROR test.Job: Task error: Rename cannot overwrite non empty destination directory /grid/6/yarn/local/usercache/azkaban/filecache/344860 java.io.IOException: Rename cannot overwrite non empty destination directory /grid/6/yarn/local/usercache/azkaban/filecache/344860 at org.apache.hadoop.fs.AbstractFileSystem.renameInternal(AbstractFileSystem.java:716) at org.apache.hadoop.fs.FilterFs.renameInternal(FilterFs.java:228) at org.apache.hadoop.fs.AbstractFileSystem.rename(AbstractFileSystem.java:659) at org.apache.hadoop.fs.FileContext.rename(FileContext.java:909) at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:364) at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:60) at java.util.concurrent.FutureTask.run(FutureTask.java:262) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) at java.util.concurrent.FutureTask.run(FutureTask.java:262) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) = yarn version : hadoop-2-2-0-0-2041-yarn 2.6.0.2.2.0.0-2041 = This node was taken OOR for maintenance, and when it was added back to the cluster, it seems this 344860 directory was not removed before it was assigned to the new container. Resource Localization fails on a cluster due to existing cache directories -- Key: YARN-2624 URL: https://issues.apache.org/jira/browse/YARN-2624 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.5.1 Reporter: Anubhav Dhoot Assignee: Anubhav Dhoot Priority: Blocker Fix For: 2.6.0 Attachments: YARN-2624.001.patch, YARN-2624.001.patch We have found resource localization fails on a cluster with following error in certain cases. {noformat} INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Failed to download rsrc { { hdfs://blahhostname:8020/tmp/hive-hive/hive_2014-09-29_14-55-45_184_6531377394813896912-12/-mr-10004/95a07b90-2448-48fc-bcda-cdb7400b4975/map.xml, 1412027745352, FILE, null },pending,[(container_1411670948067_0009_02_01)],443533288192637,DOWNLOADING} java.io.IOException: Rename cannot overwrite non empty destination directory /data/yarn/nm/filecache/27 at org.apache.hadoop.fs.AbstractFileSystem.renameInternal(AbstractFileSystem.java:716) at org.apache.hadoop.fs.FilterFs.renameInternal(FilterFs.java:228) at org.apache.hadoop.fs.AbstractFileSystem.rename(AbstractFileSystem.java:659) at org.apache.hadoop.fs.FileContext.rename(FileContext.java:906) at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:366) at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:59) {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
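The failure mode in both reports is FSDownload renaming a freshly downloaded resource onto a cache directory that already exists and is non-empty (left behind by an earlier attempt or an unclean node restart). The snippet below is a minimal sketch of one possible mitigation, deleting the stale destination before the rename, written against the public FileContext API; the class and method names are illustrative and this is not necessarily how the committed YARN-2624 patch resolves it.

{noformat}
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileContext;
import org.apache.hadoop.fs.Options;
import org.apache.hadoop.fs.Path;

public class LocalCacheRename {
  // Sketch: if a stale, non-empty destination directory is left behind in the
  // local cache, remove it before moving the freshly downloaded resource into
  // place, since rename() refuses to overwrite a non-empty directory.
  static void moveIntoCache(Configuration conf, Path downloaded, Path destination)
      throws IOException {
    FileContext lfs = FileContext.getLocalFSFileContext(conf);
    if (lfs.util().exists(destination)) {
      // Left over from a previous localization attempt or an unclean restart.
      lfs.delete(destination, true);
    }
    lfs.rename(downloaded, destination, Options.Rename.OVERWRITE);
  }
}
{noformat}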
[jira] [Commented] (YARN-3261) rewrite resourcemanager restart doc to remove roadmap bits
[ https://issues.apache.org/jira/browse/YARN-3261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14390258#comment-14390258 ] Rohith commented on YARN-3261: -- Thanks [~gururaj] for the patch. +1 (non-binding) for the change. rewrite resourcemanager restart doc to remove roadmap bits --- Key: YARN-3261 URL: https://issues.apache.org/jira/browse/YARN-3261 Project: Hadoop YARN Issue Type: Bug Reporter: Allen Wittenauer Assignee: Gururaj Shetty Attachments: YARN-3261.01.patch Another mixture of roadmap and instruction manual that seems to be ever present in a lot of the recently written documentation. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3424) Change logs for ContainerMonitorImpl's resourse monitoring from info to debug
[ https://issues.apache.org/jira/browse/YARN-3424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14390243#comment-14390243 ] Hudson commented on YARN-3424: -- FAILURE: Integrated in Hadoop-trunk-Commit #7482 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/7482/]) YARN-3424. Change logs for ContainerMonitorImpl's resourse monitoring from info to debug. Contributed by Anubhav Dhoot. (ozawa: rev c69ba81497ae4da329ddb34ba712a64a7eec479f) * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/monitor/ContainersMonitorImpl.java Change logs for ContainerMonitorImpl's resourse monitoring from info to debug - Key: YARN-3424 URL: https://issues.apache.org/jira/browse/YARN-3424 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Affects Versions: 2.7.0 Reporter: Anubhav Dhoot Assignee: Anubhav Dhoot Fix For: 2.8.0 Attachments: YARN-3424.001.patch Today we log the memory usage of process at info level which spams the log with hundreds of log lines {noformat} 2015-03-27 09:32:48,905 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 9215 for container-id container_1427462602546_0002_01_08: 189.8 MB of 1 GB physical memory used; 2.6 GB of 2.1 GB virtual memory used {noformat} Proposing changing this to debug level -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3286) Cleanup RMNode#ReconnectNodeTransition
[ https://issues.apache.org/jira/browse/YARN-3286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rohith updated YARN-3286: - Attachment: 0001-YARN-3286.patch Cleanup RMNode#ReconnectNodeTransition -- Key: YARN-3286 URL: https://issues.apache.org/jira/browse/YARN-3286 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Reporter: Rohith Assignee: Rohith Attachments: 0001-YARN-3286.patch, YARN-3286-test-only.patch RMNode#ReconnectNodeTransition has messed up for every ReconnectedEvent. This part of the code can be clean up where we do not require to remove node and add new node every time. Supporting to above point, in the issue discussion YARN-3222 mentioned in the comment [link1|https://issues.apache.org/jira/browse/YARN-3222?focusedCommentId=14339799page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14339799] and [link2|https://issues.apache.org/jira/browse/YARN-3222?focusedCommentId=14344739page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14344739] Clean up can do the following things # It always remove an old node and add a new node. This is not really required, instead old node can be updated with new values. # RMNode#totalCapability has stale capability after NM is reconnected. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (YARN-3430) RMAppAttempt headroom data is missing in RM Web UI
[ https://issues.apache.org/jira/browse/YARN-3430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuan Gong reassigned YARN-3430: --- Assignee: Xuan Gong RMAppAttempt headroom data is missing in RM Web UI -- Key: YARN-3430 URL: https://issues.apache.org/jira/browse/YARN-3430 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager, webapp, yarn Reporter: Xuan Gong Assignee: Xuan Gong Priority: Blocker Attachments: YARN-3430.1.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3430) RMAppAttempt headroom data is missing in RM Web UI
[ https://issues.apache.org/jira/browse/YARN-3430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuan Gong updated YARN-3430: Attachment: YARN-3430.1.patch Trivial patch without a test case. RMAppAttempt headroom data is missing in RM Web UI -- Key: YARN-3430 URL: https://issues.apache.org/jira/browse/YARN-3430 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager, webapp, yarn Reporter: Xuan Gong Assignee: Xuan Gong Priority: Blocker Attachments: YARN-3430.1.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3286) Cleanup RMNode#ReconnectNodeTransition
[ https://issues.apache.org/jira/browse/YARN-3286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rohith updated YARN-3286: - Description: RMNode#ReconnectNodeTransition has messed up for every ReconnectedEvent. This part of the code can be clean up where we do not require to remove node and add new node every time. Supporting to above point, in the issue discussion YARN-3222 mentioned in the comment [link1|https://issues.apache.org/jira/browse/YARN-3222?focusedCommentId=14339799page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14339799] and [link2|https://issues.apache.org/jira/browse/YARN-3222?focusedCommentId=14344739page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14344739] Clean up can do the following things # It always remove an old node and add a new node. This is not really required, instead old node can be updated with new values. # RMNode#totalCapability has stale capability after NM is reconnected. was: This is found while fixing YARN-3222 mentioned in the comment [link1|https://issues.apache.org/jira/browse/YARN-3222?focusedCommentId=14339799page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14339799] and [link2|https://issues.apache.org/jira/browse/YARN-3222?focusedCommentId=14344739page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14344739] And RMNode#ReconnectNodeTransition clean up : It always remove an old node and add a new node. This need to be examined whether this is really required. Target Version/s: 2.8.0 Issue Type: Improvement (was: Bug) Summary: Cleanup RMNode#ReconnectNodeTransition (was: RMNode#totalCapability has stale capability after NM is reconnected.) Cleanup RMNode#ReconnectNodeTransition -- Key: YARN-3286 URL: https://issues.apache.org/jira/browse/YARN-3286 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Reporter: Rohith Assignee: Rohith Attachments: YARN-3286-test-only.patch RMNode#ReconnectNodeTransition has messed up for every ReconnectedEvent. This part of the code can be clean up where we do not require to remove node and add new node every time. Supporting to above point, in the issue discussion YARN-3222 mentioned in the comment [link1|https://issues.apache.org/jira/browse/YARN-3222?focusedCommentId=14339799page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14339799] and [link2|https://issues.apache.org/jira/browse/YARN-3222?focusedCommentId=14344739page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14344739] Clean up can do the following things # It always remove an old node and add a new node. This is not really required, instead old node can be updated with new values. # RMNode#totalCapability has stale capability after NM is reconnected. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
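A minimal illustration of the cleanup direction described in the summary: on reconnect, mutate the fields of the existing node that can legitimately change instead of removing the RMNode and adding a new one, so totalCapability cannot go stale. The classes and fields below are hypothetical stand-ins, not the actual RMNodeImpl code.

{noformat}
import org.apache.hadoop.yarn.api.records.NodeId;
import org.apache.hadoop.yarn.api.records.Resource;

// Hypothetical stand-ins for RMNodeImpl state; not the actual YARN classes.
class ReconnectSketch {
  static class TrackedNode {
    NodeId nodeId;              // identity stays the same across reconnects
    Resource totalCapability;   // the field that goes stale today
    int httpPort;
  }

  // The cleanup direction: update the existing node in place on reconnect
  // instead of the current remove-and-add of a brand-new node.
  static void onReconnect(TrackedNode existing, Resource newCapability,
      int newHttpPort) {
    existing.totalCapability = newCapability;
    existing.httpPort = newHttpPort;
  }
}
{noformat}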
[jira] [Updated] (YARN-3416) deadlock in a job between map and reduce cores allocation
[ https://issues.apache.org/jira/browse/YARN-3416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] mai shurong updated YARN-3416: -- Attachment: queue_with_max333cores.png queue_with_max263cores.png queue_with_max163cores.png queue_with_max163cores.png : submit a job to a queue with max 163 cores queue_with_max263cores.png : submit a job to a queue with max 263 cores queue_with_max333cores.png : submit a job to a queue with max 333 cores deadlock in a job between map and reduce cores allocation -- Key: YARN-3416 URL: https://issues.apache.org/jira/browse/YARN-3416 Project: Hadoop YARN Issue Type: Bug Components: fairscheduler Affects Versions: 2.6.0 Reporter: mai shurong Priority: Critical Attachments: AM_log_head10.txt.gz, AM_log_tail10.txt.gz, queue_with_max163cores.png, queue_with_max263cores.png, queue_with_max333cores.png I submit a big job, which has 500 maps and 350 reduce, to a queue(fairscheduler) with 300 max cores. When the big mapreduce job is running 100% maps, the 300 reduces have occupied 300 max cores in the queue. And then, a map fails and retry, waiting for a core, while the 300 reduces are waiting for failed map to finish. So a deadlock occur. As a result, the job is blocked, and the later job in the queue cannot run because no available cores in the queue. I think there is the similar issue for memory of a queue . -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3416) deadlock in a job between map and reduce cores allocation
[ https://issues.apache.org/jira/browse/YARN-3416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14390466#comment-14390466 ] mai shurong commented on YARN-3416: --- I found a new case today. I submitted a larger job with 5800 maps and 380 reduces to a queue which has a max of 263 cores. Even though no map failed, a deadlock between map and reduce cores allocation always occurred when I tried several times. I also tried submitting to other queues: as long as the number of reduces of a job is more than the max cores of the queue, the deadlock always happened. I attach the screenshots of the deadlocked jobs, and attach the head 10 lines (AM_log_head10.txt.gz) and tail 10 lines (AM_log_tail10.txt.gz) of the AM log of one deadlocked job. The parameter mapreduce.job.reduce.slowstart.completedmaps is 0.5. deadlock in a job between map and reduce cores allocation -- Key: YARN-3416 URL: https://issues.apache.org/jira/browse/YARN-3416 Project: Hadoop YARN Issue Type: Bug Components: fairscheduler Affects Versions: 2.6.0 Reporter: mai shurong Priority: Critical Attachments: AM_log_head10.txt.gz, AM_log_tail10.txt.gz, queue_with_max163cores.png, queue_with_max263cores.png, queue_with_max333cores.png I submit a big job, which has 500 maps and 350 reduce, to a queue(fairscheduler) with 300 max cores. When the big mapreduce job is running 100% maps, the 300 reduces have occupied 300 max cores in the queue. And then, a map fails and retry, waiting for a core, while the 300 reduces are waiting for failed map to finish. So a deadlock occur. As a result, the job is blocked, and the later job in the queue cannot run because no available cores in the queue. I think there is the similar issue for memory of a queue . -- This message was sent by Atlassian JIRA (v6.3.4#6332)
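Since the deadlock appears whenever the reduce count exceeds the queue's core cap and reduces start ramping up at 50% map completion, one job-level workaround (not a fix for the scheduler behavior itself) is to delay reduce ramp-up until the maps are done, using the standard MRJobConfig key mentioned above. The class and method names below are illustrative.

{noformat}
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.MRJobConfig;

public class SlowstartWorkaround {
  // Sketch of a job-level workaround: with slowstart at 0.5, 380 reduces can
  // occupy all 263 cores of the queue before the maps finish. Starting the
  // reduces only after all maps complete avoids the reported starvation for
  // this job shape, at the cost of a later shuffle start.
  public static Job configure(Configuration conf) throws IOException {
    conf.setFloat(MRJobConfig.COMPLETED_MAPS_FOR_REDUCE_SLOWSTART, 1.0f);
    return Job.getInstance(conf);
  }
}
{noformat}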
[jira] [Created] (YARN-3430) RMAppAttempt headroom data is missing in RM Web UI
Xuan Gong created YARN-3430: --- Summary: RMAppAttempt headroom data is missing in RM Web UI Key: YARN-3430 URL: https://issues.apache.org/jira/browse/YARN-3430 Project: Hadoop YARN Issue Type: Sub-task Reporter: Xuan Gong Priority: Blocker -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2962) ZKRMStateStore: Limit the number of znodes under a znode
[ https://issues.apache.org/jira/browse/YARN-2962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14390074#comment-14390074 ] Arun Suresh commented on YARN-2962: --- Yup.. agreed, start index from the end is a better ZKRMStateStore: Limit the number of znodes under a znode Key: YARN-2962 URL: https://issues.apache.org/jira/browse/YARN-2962 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Affects Versions: 2.6.0 Reporter: Karthik Kambatla Assignee: Varun Saxena Priority: Critical Attachments: YARN-2962.01.patch We ran into this issue where we were hitting the default ZK server message size configs, primarily because the message had too many znodes even though individually they were all small. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
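The discussion here is about fanning application znodes out under intermediate parents by splitting the znode name, taking the split from the end of the application id rather than the front, so a single parent never lists every application. The snippet below only illustrates that path layout; the split width and the root path are illustrative assumptions, not the patch's actual scheme.

{noformat}
public class ZnodeSplit {
  // Sketch of the layout being discussed: split an application's znode name
  // so that apps fan out under intermediate znodes instead of all sitting
  // under a single parent. The split is taken from the end of the app id.
  static String hierarchicalPath(String rootNode, String appId, int splitWidth) {
    // e.g. application_1427462602546_0002 ->
    //      <root>/application_1427462602546_/0002
    int cut = appId.length() - splitWidth;
    return rootNode + "/" + appId.substring(0, cut) + "/" + appId.substring(cut);
  }

  public static void main(String[] args) {
    System.out.println(hierarchicalPath("/rmstore/ZKRMStateRoot/RMAppRoot",
        "application_1427462602546_0002", 4));
  }
}
{noformat}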
[jira] [Commented] (YARN-2268) Disallow formatting the RMStateStore when there is an RM running
[ https://issues.apache.org/jira/browse/YARN-2268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14390122#comment-14390122 ] Rohith commented on YARN-2268: -- Thinking about this JIRA, I have a couple of questions. # How do we identify that an RM is running, since the state store can be formatted from anywhere in the cluster? # In HA, each of the rm-ids has to be checked for its serviceState. This would be time-consuming, since each host's retry takes time. And if a switchover happens in the middle of checking the rm-ids, it could give the wrong result that all RMs are in standby. I think if admin support is there, the first question can be solved easily. Disallow formatting the RMStateStore when there is an RM running Key: YARN-2268 URL: https://issues.apache.org/jira/browse/YARN-2268 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.6.0 Reporter: Karthik Kambatla Assignee: Rohith YARN-2131 adds a way to format the RMStateStore. However, it can be a problem if we format the store while an RM is actively using it. It would be nice to fail the format if there is an RM running and using this store. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
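A rough sketch of the check the second question describes: query each configured rm-id for its HA service state and refuse to format if any of them reports ACTIVE. It uses HAUtil, RMHAServiceTarget, and HAServiceProtocol from the existing client/common modules, but the guard itself, the timeout, and the error handling are assumptions, and (as the comment notes) the check is inherently racy across failovers and slow when hosts are unreachable.

{noformat}
import java.io.IOException;
import org.apache.hadoop.ha.HAServiceProtocol;
import org.apache.hadoop.ha.HAServiceProtocol.HAServiceState;
import org.apache.hadoop.ha.HAServiceTarget;
import org.apache.hadoop.yarn.client.RMHAServiceTarget;
import org.apache.hadoop.yarn.conf.HAUtil;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class FormatGuardSketch {
  // Sketch: before formatting the state store, ask every configured rm-id for
  // its HA service state and bail out if any RM is ACTIVE. Racy across
  // failovers; unreachable hosts make it slow.
  static void failIfAnyRMActive(YarnConfiguration conf) throws IOException {
    for (String rmId : HAUtil.getRMHAIds(conf)) {
      YarnConfiguration perRm = new YarnConfiguration(conf);
      perRm.set(YarnConfiguration.RM_HA_ID, rmId);
      HAServiceTarget target = new RMHAServiceTarget(perRm);
      HAServiceProtocol proxy = target.getProxy(perRm, 5000);
      if (proxy.getServiceStatus().getState() == HAServiceState.ACTIVE) {
        throw new IOException("ResourceManager " + rmId
            + " is active; refusing to format the RMStateStore");
      }
    }
  }
}
{noformat}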
[jira] [Commented] (YARN-3424) Reduce log for ContainerMonitorImpl resoure monitoring from info to debug
[ https://issues.apache.org/jira/browse/YARN-3424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14390226#comment-14390226 ] Tsuyoshi Ozawa commented on YARN-3424: -- +1 with minor indentation fix on my local. Committing this shortly. Reduce log for ContainerMonitorImpl resoure monitoring from info to debug - Key: YARN-3424 URL: https://issues.apache.org/jira/browse/YARN-3424 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Reporter: Anubhav Dhoot Assignee: Anubhav Dhoot Attachments: YARN-3424.001.patch Today we log the memory usage of process at info level which spams the log with hundreds of log lines {noformat} 2015-03-27 09:32:48,905 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 9215 for container-id container_1427462602546_0002_01_08: 189.8 MB of 1 GB physical memory used; 2.6 GB of 2.1 GB virtual memory used {noformat} Proposing changing this to debug level -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3405) FairScheduler's preemption cannot happen between sibling in some case
[ https://issues.apache.org/jira/browse/YARN-3405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14390471#comment-14390471 ] Peng Zhang commented on YARN-3405: -- bq. 2. If the parent's usage has reached its fair share, it will not propagate the preemption request upward again. So a preemption request stored in a parent queue means preemption is needed among its children. To make the above statement clearer: as long as the requests from the children, added to the current usage, stay below the fair share, the parent queue will propagate the request upward. This means the current queue is under its fair share and needs to preempt from a sibling that is over-scheduled. Once that amount reaches the current queue's fair share, the remaining request amount will be stored on the current queue, meaning preemption for that amount needs to happen among the current queue's children. FairScheduler's preemption cannot happen between sibling in some case - Key: YARN-3405 URL: https://issues.apache.org/jira/browse/YARN-3405 Project: Hadoop YARN Issue Type: Bug Components: fairscheduler Affects Versions: 2.7.0 Reporter: Peng Zhang Assignee: Peng Zhang Priority: Critical Queue hierarchy described as below: {noformat} root / \ queue-1 queue-2 / \ queue-1-1 queue-1-2 {noformat} Assume cluster resource is 100 # queue-1-1 and queue-2 has app. Each get 50 usage and 50 fairshare. # When queue-1-2 is active, and it cause some new preemption request for fairshare 25. # When preemption from root, it has possibility to find preemption candidate is queue-2. If so preemptContainerPreCheck for queue-2 return false because it's equal to its fairshare. # Finally queue-1-2 will be waiting for resource release from queue-1-1 itself. What I expect here is that queue-1-2 preempt from queue-1-1. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3248) Display count of nodes blacklisted by apps in the web UI
[ https://issues.apache.org/jira/browse/YARN-3248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14390407#comment-14390407 ] Xuan Gong commented on YARN-3248: - +1 lgtm. Will commit Display count of nodes blacklisted by apps in the web UI Key: YARN-3248 URL: https://issues.apache.org/jira/browse/YARN-3248 Project: Hadoop YARN Issue Type: Improvement Components: capacityscheduler, resourcemanager Reporter: Varun Vasudev Assignee: Varun Vasudev Attachments: All applications.png, App page.png, Screenshot.jpg, apache-yarn-3248.0.patch, apache-yarn-3248.1.patch, apache-yarn-3248.2.patch, apache-yarn-3248.3.patch, apache-yarn-3248.4.patch It would be really useful when debugging app performance and failure issues to get a count of the nodes blacklisted by individual apps displayed in the web UI. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3424) Change logs for ContainerMonitorImpl's resourse monitoring from info to debug
[ https://issues.apache.org/jira/browse/YARN-3424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14390442#comment-14390442 ] Hudson commented on YARN-3424: -- FAILURE: Integrated in Hadoop-Yarn-trunk #884 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/884/]) YARN-3424. Change logs for ContainerMonitorImpl's resourse monitoring from info to debug. Contributed by Anubhav Dhoot. (ozawa: rev c69ba81497ae4da329ddb34ba712a64a7eec479f) * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/monitor/ContainersMonitorImpl.java Change logs for ContainerMonitorImpl's resourse monitoring from info to debug - Key: YARN-3424 URL: https://issues.apache.org/jira/browse/YARN-3424 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Affects Versions: 2.7.0 Reporter: Anubhav Dhoot Assignee: Anubhav Dhoot Fix For: 2.8.0 Attachments: YARN-3424.001.patch Today we log the memory usage of process at info level which spams the log with hundreds of log lines {noformat} 2015-03-27 09:32:48,905 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 9215 for container-id container_1427462602546_0002_01_08: 189.8 MB of 1 GB physical memory used; 2.6 GB of 2.1 GB virtual memory used {noformat} Proposing changing this to debug level -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3304) ResourceCalculatorProcessTree#getCpuUsagePercent default return value is inconsistent with other getters
[ https://issues.apache.org/jira/browse/YARN-3304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14390436#comment-14390436 ] Hudson commented on YARN-3304: -- FAILURE: Integrated in Hadoop-Yarn-trunk #884 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/884/]) YARN-3304. Addendum patch. Cleaning up ResourceCalculatorProcessTree APIs for public use and removing inconsistencies in the default values. (Junping Du and Karthik Kambatla via vinodkv) (vinodkv: rev 7610925e90155dfe5edce05da31574e4fb81b948) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/util/TestWindowsBasedProcessTree.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/util/TestResourceCalculatorProcessTree.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/util/WindowsBasedProcessTree.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/util/ProcfsBasedProcessTree.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/util/ResourceCalculatorProcessTree.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/util/TestProcfsBasedProcessTree.java ResourceCalculatorProcessTree#getCpuUsagePercent default return value is inconsistent with other getters Key: YARN-3304 URL: https://issues.apache.org/jira/browse/YARN-3304 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Reporter: Junping Du Assignee: Junping Du Priority: Blocker Fix For: 2.7.0 Attachments: YARN-3304-appendix-v2.patch, YARN-3304-appendix-v3.patch, YARN-3304-appendix-v4.patch, YARN-3304-appendix.patch, YARN-3304-v2.patch, YARN-3304-v3.patch, YARN-3304-v4-boolean-way.patch, YARN-3304-v4-negative-way-MR.patch, YARN-3304-v4-negtive-value-way.patch, YARN-3304-v6-no-rename.patch, YARN-3304-v6-with-rename.patch, YARN-3304-v7.patch, YARN-3304-v8.patch, YARN-3304.patch, yarn-3304-5.patch Per discussions in YARN-3296, getCpuUsagePercent() will return -1 for unavailable case while other resource metrics are return 0 in the same case which sounds inconsistent. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
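In practice the consistency being settled here matters to callers: with this change a caller checks an explicit "unavailable" sentinel instead of guessing whether 0 or -1 means idle. A small hedged example follows, assuming the UNAVAILABLE constant this patch introduces on ResourceCalculatorProcessTree; the reporting method itself is illustrative.

{noformat}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.util.ResourceCalculatorProcessTree;

public class CpuUsageCheck {
  // Sketch: treat "unavailable" explicitly rather than assuming 0 means idle.
  static void report(String pid, Configuration conf) {
    ResourceCalculatorProcessTree tree =
        ResourceCalculatorProcessTree.getResourceCalculatorProcessTree(pid, null, conf);
    if (tree == null) {
      return;  // no process-tree implementation available on this platform
    }
    tree.updateProcessTree();
    float cpu = tree.getCpuUsagePercent();
    if (cpu == ResourceCalculatorProcessTree.UNAVAILABLE) {
      System.out.println("CPU usage not available for process " + pid);
    } else {
      System.out.println("CPU usage of process " + pid + ": " + cpu + "%");
    }
  }
}
{noformat}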
[jira] [Commented] (YARN-3412) RM tests should use MockRM where possible
[ https://issues.apache.org/jira/browse/YARN-3412?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14390443#comment-14390443 ] Hudson commented on YARN-3412: -- FAILURE: Integrated in Hadoop-Yarn-trunk #884 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/884/]) YARN-3412. RM tests should use MockRM where possible. (kasha) (kasha: rev 79f7f2aabfd7a69722748850f4d3b1ff54af7556) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/TestFairSchedulerEventLog.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fifo/TestFifoScheduler.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestResourceManager.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/recovery/TestZKRMStateStore.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/TestFairScheduler.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/monitor/TestSchedulingMonitor.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestMoveApplication.java RM tests should use MockRM where possible - Key: YARN-3412 URL: https://issues.apache.org/jira/browse/YARN-3412 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager, test Affects Versions: 2.7.0 Reporter: Karthik Kambatla Assignee: Karthik Kambatla Fix For: 2.8.0 Attachments: yarn-3412-1.patch Noticed TestZKRMStateStore and TestMoveApplication fail when running on a mac, due to not being able to start the webapp. There are a few other tests that could use MockRM. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
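For context, the kind of conversion this change makes looks roughly like the sketch below: the test drives the RM through MockRM rather than constructing a full ResourceManager (and its webapp, which is what fails on a mac). Assertions and scheduler setup are elided; submitApp(200) is MockRM's convenience helper for submitting an app with a 200 MB AM request.

{noformat}
import org.apache.hadoop.yarn.conf.YarnConfiguration;
import org.apache.hadoop.yarn.server.resourcemanager.MockRM;
import org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMApp;
import org.junit.Test;

public class TestWithMockRM {
  // Sketch: exercise RM behaviour through MockRM instead of a real RM.
  @Test
  public void testAppSubmission() throws Exception {
    MockRM rm = new MockRM(new YarnConfiguration());
    rm.start();
    try {
      RMApp app = rm.submitApp(200);   // 200 MB AM container request
      System.out.println("Submitted " + app.getApplicationId());
      // ... assertions on app/attempt state would go here ...
    } finally {
      rm.stop();
    }
  }
}
{noformat}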
[jira] [Updated] (YARN-3416) deadlock in a job between map and reduce cores allocation
[ https://issues.apache.org/jira/browse/YARN-3416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] mai shurong updated YARN-3416: -- Attachment: AM_log_head10.txt.gz AM_log_tail10.txt.gz head 10 lines and tail 10 lines of AM log of a deadlock job. deadlock in a job between map and reduce cores allocation -- Key: YARN-3416 URL: https://issues.apache.org/jira/browse/YARN-3416 Project: Hadoop YARN Issue Type: Bug Components: fairscheduler Affects Versions: 2.6.0 Reporter: mai shurong Priority: Critical Attachments: AM_log_head10.txt.gz, AM_log_tail10.txt.gz I submit a big job, which has 500 maps and 350 reduce, to a queue(fairscheduler) with 300 max cores. When the big mapreduce job is running 100% maps, the 300 reduces have occupied 300 max cores in the queue. And then, a map fails and retry, waiting for a core, while the 300 reduces are waiting for failed map to finish. So a deadlock occur. As a result, the job is blocked, and the later job in the queue cannot run because no available cores in the queue. I think there is the similar issue for memory of a queue . -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3301) Fix the format issue of the new RM web UI and AHS web UI
[ https://issues.apache.org/jira/browse/YARN-3301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuan Gong updated YARN-3301: Target Version/s: 2.8.0 Fix the format issue of the new RM web UI and AHS web UI Key: YARN-3301 URL: https://issues.apache.org/jira/browse/YARN-3301 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Xuan Gong Assignee: Xuan Gong -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3412) RM tests should use MockRM where possible
[ https://issues.apache.org/jira/browse/YARN-3412?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14390431#comment-14390431 ] Hudson commented on YARN-3412: -- FAILURE: Integrated in Hadoop-Yarn-trunk-Java8 #150 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/150/]) YARN-3412. RM tests should use MockRM where possible. (kasha) (kasha: rev 79f7f2aabfd7a69722748850f4d3b1ff54af7556) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/TestFairSchedulerEventLog.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/TestFairScheduler.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/recovery/TestZKRMStateStore.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestResourceManager.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestMoveApplication.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fifo/TestFifoScheduler.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/monitor/TestSchedulingMonitor.java RM tests should use MockRM where possible - Key: YARN-3412 URL: https://issues.apache.org/jira/browse/YARN-3412 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager, test Affects Versions: 2.7.0 Reporter: Karthik Kambatla Assignee: Karthik Kambatla Fix For: 2.8.0 Attachments: yarn-3412-1.patch Noticed TestZKRMStateStore and TestMoveApplication fail when running on a mac, due to not being able to start the webapp. There are a few other tests that could use MockRM. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3304) ResourceCalculatorProcessTree#getCpuUsagePercent default return value is inconsistent with other getters
[ https://issues.apache.org/jira/browse/YARN-3304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14390424#comment-14390424 ] Hudson commented on YARN-3304: -- FAILURE: Integrated in Hadoop-Yarn-trunk-Java8 #150 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/150/]) YARN-3304. Addendum patch. Cleaning up ResourceCalculatorProcessTree APIs for public use and removing inconsistencies in the default values. (Junping Du and Karthik Kambatla via vinodkv) (vinodkv: rev 7610925e90155dfe5edce05da31574e4fb81b948) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/util/ResourceCalculatorProcessTree.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/util/ProcfsBasedProcessTree.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/util/TestResourceCalculatorProcessTree.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/util/TestProcfsBasedProcessTree.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/util/WindowsBasedProcessTree.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/util/TestWindowsBasedProcessTree.java ResourceCalculatorProcessTree#getCpuUsagePercent default return value is inconsistent with other getters Key: YARN-3304 URL: https://issues.apache.org/jira/browse/YARN-3304 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Reporter: Junping Du Assignee: Junping Du Priority: Blocker Fix For: 2.7.0 Attachments: YARN-3304-appendix-v2.patch, YARN-3304-appendix-v3.patch, YARN-3304-appendix-v4.patch, YARN-3304-appendix.patch, YARN-3304-v2.patch, YARN-3304-v3.patch, YARN-3304-v4-boolean-way.patch, YARN-3304-v4-negative-way-MR.patch, YARN-3304-v4-negtive-value-way.patch, YARN-3304-v6-no-rename.patch, YARN-3304-v6-with-rename.patch, YARN-3304-v7.patch, YARN-3304-v8.patch, YARN-3304.patch, yarn-3304-5.patch Per discussions in YARN-3296, getCpuUsagePercent() will return -1 for unavailable case while other resource metrics are return 0 in the same case which sounds inconsistent. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3424) Change logs for ContainerMonitorImpl's resourse monitoring from info to debug
[ https://issues.apache.org/jira/browse/YARN-3424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14390430#comment-14390430 ] Hudson commented on YARN-3424: -- FAILURE: Integrated in Hadoop-Yarn-trunk-Java8 #150 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/150/]) YARN-3424. Change logs for ContainerMonitorImpl's resourse monitoring from info to debug. Contributed by Anubhav Dhoot. (ozawa: rev c69ba81497ae4da329ddb34ba712a64a7eec479f) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/monitor/ContainersMonitorImpl.java * hadoop-yarn-project/CHANGES.txt Change logs for ContainerMonitorImpl's resourse monitoring from info to debug - Key: YARN-3424 URL: https://issues.apache.org/jira/browse/YARN-3424 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Affects Versions: 2.7.0 Reporter: Anubhav Dhoot Assignee: Anubhav Dhoot Fix For: 2.8.0 Attachments: YARN-3424.001.patch Today we log the memory usage of process at info level which spams the log with hundreds of log lines {noformat} 2015-03-27 09:32:48,905 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 9215 for container-id container_1427462602546_0002_01_08: 189.8 MB of 1 GB physical memory used; 2.6 GB of 2.1 GB virtual memory used {noformat} Proposing changing this to debug level -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3225) New parameter or CLI for decommissioning node gracefully in RMAdmin CLI
[ https://issues.apache.org/jira/browse/YARN-3225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14390379#comment-14390379 ] Hadoop QA commented on YARN-3225: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12708647/YARN-3225-3.patch against trunk revision c69ba81. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 5 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestCapacityScheduler Test results: https://builds.apache.org/job/PreCommit-YARN-Build/7188//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/7188//console This message is automatically generated. New parameter or CLI for decommissioning node gracefully in RMAdmin CLI --- Key: YARN-3225 URL: https://issues.apache.org/jira/browse/YARN-3225 Project: Hadoop YARN Issue Type: Sub-task Reporter: Junping Du Assignee: Devaraj K Attachments: YARN-3225-1.patch, YARN-3225-2.patch, YARN-3225-3.patch, YARN-3225.patch, YARN-914.patch New CLI (or existing CLI with parameters) should put each node on decommission list to decommissioning status and track timeout to terminate the nodes that haven't get finished. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3248) Display count of nodes blacklisted by apps in the web UI
[ https://issues.apache.org/jira/browse/YARN-3248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14390412#comment-14390412 ] Xuan Gong commented on YARN-3248: - Committed into trunk/branch-2 Display count of nodes blacklisted by apps in the web UI Key: YARN-3248 URL: https://issues.apache.org/jira/browse/YARN-3248 Project: Hadoop YARN Issue Type: Improvement Components: capacityscheduler, resourcemanager Reporter: Varun Vasudev Assignee: Varun Vasudev Attachments: All applications.png, App page.png, Screenshot.jpg, apache-yarn-3248.0.patch, apache-yarn-3248.1.patch, apache-yarn-3248.2.patch, apache-yarn-3248.3.patch, apache-yarn-3248.4.patch It would be really useful when debugging app performance and failure issues to get a count of the nodes blacklisted by individual apps displayed in the web UI. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3248) Display count of nodes blacklisted by apps in the web UI
[ https://issues.apache.org/jira/browse/YARN-3248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14390422#comment-14390422 ] Hudson commented on YARN-3248: -- FAILURE: Integrated in Hadoop-trunk-Commit #7483 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/7483/]) YARN-3248. Display count of nodes blacklisted by apps in the web UI. (xgong: rev 4728bdfa15809db4b8b235faa286c65de4a48cf6) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/webapp/AppsBlock.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/webapp/AppBlock.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/AppSchedulingInfo.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/dao/AppAttemptInfo.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/RMAppsBlock.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/RMAppAttemptBlock.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/RMWebServices.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/webapp/AppAttemptBlock.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/RMAppBlock.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/AppsBlockWithMetrics.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/SchedulerApplicationAttempt.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/TestRMWebServicesApps.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/CapacitySchedulerPage.java Display count of nodes blacklisted by apps in the web UI Key: YARN-3248 URL: https://issues.apache.org/jira/browse/YARN-3248 Project: Hadoop YARN Issue Type: Improvement Components: capacityscheduler, resourcemanager Reporter: Varun Vasudev Assignee: Varun Vasudev Fix For: 2.8.0 Attachments: All applications.png, App page.png, Screenshot.jpg, apache-yarn-3248.0.patch, apache-yarn-3248.1.patch, apache-yarn-3248.2.patch, apache-yarn-3248.3.patch, apache-yarn-3248.4.patch It would be really useful when debugging app performance and failure issues to get a count of the nodes blacklisted by individual apps displayed in the web UI. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3428) Debug log resources to be localized for a container
[ https://issues.apache.org/jira/browse/YARN-3428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14390428#comment-14390428 ] Hudson commented on YARN-3428: -- FAILURE: Integrated in Hadoop-Yarn-trunk-Java8 #150 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/150/]) YARN-3428. Debug log resources to be localized for a container. (kasha) (kasha: rev 2daa478a6420585dc13cea2111580ed5fe347bc1) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/ResourceLocalizationService.java * hadoop-yarn-project/CHANGES.txt Debug log resources to be localized for a container --- Key: YARN-3428 URL: https://issues.apache.org/jira/browse/YARN-3428 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Affects Versions: 2.7.0 Reporter: Karthik Kambatla Assignee: Karthik Kambatla Priority: Trivial Fix For: 2.8.0 Attachments: yarn-3428-1.patch For each container, we log the resources going through INIT - LOCALIZING - DOWNLOADED transitions. These logs do not have container-id itself. It would be nice to add debug logs to capture the resources being localized for a container. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3248) Display count of nodes blacklisted by apps in the web UI
[ https://issues.apache.org/jira/browse/YARN-3248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14390452#comment-14390452 ] Hudson commented on YARN-3248: -- FAILURE: Integrated in Hadoop-trunk-Commit #7484 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/7484/]) YARN-3248. Correct fix version from branch-2.7 to branch-2.8 in the change log. (xgong: rev 2e79f1c2125517586c165a84e99d3c4d38ca0938) * hadoop-yarn-project/CHANGES.txt Display count of nodes blacklisted by apps in the web UI Key: YARN-3248 URL: https://issues.apache.org/jira/browse/YARN-3248 Project: Hadoop YARN Issue Type: Improvement Components: capacityscheduler, resourcemanager Reporter: Varun Vasudev Assignee: Varun Vasudev Fix For: 2.8.0 Attachments: All applications.png, App page.png, Screenshot.jpg, apache-yarn-3248.0.patch, apache-yarn-3248.1.patch, apache-yarn-3248.2.patch, apache-yarn-3248.3.patch, apache-yarn-3248.4.patch It would be really useful when debugging app performance and failure issues to get a count of the nodes blacklisted by individual apps displayed in the web UI. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3405) FairScheduler's preemption cannot happen between sibling in some case
[ https://issues.apache.org/jira/browse/YARN-3405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14390455#comment-14390455 ] Peng Zhang commented on YARN-3405: -- I have a preliminary idea to fix this and YARN-3414 under the current preemption architecture. 1. When calculating a preemption request, also update the parent's preemption request. 2. If the parent's usage has reached its fair share, do not propagate the preemption request upward again; a preemption request in a parent queue therefore means preemption is needed among its children. 3. During the preemption phase, walk from the root downward: a. if a parent queue has a preemption request, preempt among its children for that request (process as today: find the child most over its fair share and preempt recursively); b. then (both after doing 3.a and in the case where no preemption among children is needed), traverse its children and repeat 3.a. This introduces a traversal of the queue tree, but I do not think it will affect performance severely because there is usually only a small number of queues. FairScheduler's preemption cannot happen between sibling in some case - Key: YARN-3405 URL: https://issues.apache.org/jira/browse/YARN-3405 Project: Hadoop YARN Issue Type: Bug Components: fairscheduler Affects Versions: 2.7.0 Reporter: Peng Zhang Assignee: Peng Zhang Priority: Critical Queue hierarchy described as below: {noformat} root / \ queue-1 queue-2 / \ queue-1-1 queue-1-2 {noformat} Assume the cluster resource is 100. # queue-1-1 and queue-2 each have an app; each gets 50 usage and 50 fair share. # When queue-1-2 becomes active, it causes a new preemption request for its fair share of 25. # When preempting from the root, the preemption candidate may turn out to be queue-2; if so, preemptContainerPreCheck for queue-2 returns false because queue-2 is at its fair share. # Finally, queue-1-2 ends up waiting for resources to be released by queue-1-1 itself. What I expect here is that queue-1-2 preempts from queue-1-1. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
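A minimal sketch of the root-to-leaf walk proposed in the comment above, assuming hypothetical names (Queue, getPreemptionDemand, preemptAmongChildren); it only illustrates the shape of step 3, not actual FairScheduler code.
{code:java}
import java.util.List;

/** Illustration only: a stand-in for the scheduler's queue hierarchy. */
interface Queue {
  long getPreemptionDemand();     // demand aggregated in steps 1-2 of the comment
  List<Queue> getChildQueues();
}

class TopDownPreemptionSketch {
  /** Step 3: walk from the root downward. */
  static void preemptTopDown(Queue queue) {
    long demand = queue.getPreemptionDemand();
    if (demand > 0) {
      // Step 3.a: preempt among this queue's children "as today":
      // repeatedly pick the child most over its fair share.
      preemptAmongChildren(queue, demand);
    }
    // Step 3.b: recurse so each subtree resolves its own sibling preemption,
    // e.g. queue-1-2 taking from queue-1-1 inside their common parent queue-1.
    for (Queue child : queue.getChildQueues()) {
      preemptTopDown(child);
    }
  }

  static void preemptAmongChildren(Queue parent, long demand) {
    // Placeholder for the existing most-over-fair-share preemption logic.
  }
}
{code}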
[jira] [Commented] (YARN-3430) RMAppAttempt headroom data is missing in RM Web UI
[ https://issues.apache.org/jira/browse/YARN-3430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14390542#comment-14390542 ] Rohith commented on YARN-3430: -- +1 lgtm (non-binding) RMAppAttempt headroom data is missing in RM Web UI -- Key: YARN-3430 URL: https://issues.apache.org/jira/browse/YARN-3430 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager, webapp, yarn Reporter: Xuan Gong Assignee: Xuan Gong Priority: Blocker Attachments: YARN-3430.1.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3286) Cleanup RMNode#ReconnectNodeTransition
[ https://issues.apache.org/jira/browse/YARN-3286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14390553#comment-14390553 ] Hadoop QA commented on YARN-3286: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12708671/0001-YARN-3286.patch against trunk revision 2e79f1c. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/7189//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/7189//console This message is automatically generated. Cleanup RMNode#ReconnectNodeTransition -- Key: YARN-3286 URL: https://issues.apache.org/jira/browse/YARN-3286 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Affects Versions: 2.6.0, 2.7.0 Reporter: Rohith Assignee: Rohith Attachments: 0001-YARN-3286.patch, YARN-3286-test-only.patch RMNode#ReconnectNodeTransition has become messy: it runs for every ReconnectedEvent, and this part of the code can be cleaned up so that we do not need to remove the node and add a new node every time. Supporting this point, see the comments in the YARN-3222 discussion: [link1|https://issues.apache.org/jira/browse/YARN-3222?focusedCommentId=14339799page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14339799] and [link2|https://issues.apache.org/jira/browse/YARN-3222?focusedCommentId=14344739page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14344739] The cleanup can address the following: # The transition always removes the old node and adds a new node. This is not really required; instead, the old node can be updated with the new values. # RMNode#totalCapability holds a stale capability after the NM is reconnected. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
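A minimal sketch of the "update the old node instead of remove-and-add" idea from the list above, using a hypothetical NodeRecord type rather than the real RMNodeImpl; the point is only that refreshing the mutable fields on reconnect also keeps the capability from going stale.
{code:java}
/** Illustration only: a stand-in for the node state the RM keeps per NM. */
class NodeRecord {
  int httpPort;
  long memoryMB;
  int vcores;

  /** Reconnect handling sketch: refresh in place instead of remove + re-add. */
  void onReconnect(NodeRecord reported) {
    this.httpPort = reported.httpPort;
    // Refreshing the capability here is what would prevent the stale
    // RMNode#totalCapability described in item 2 above.
    this.memoryMB = reported.memoryMB;
    this.vcores = reported.vcores;
  }
}
{code}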
[jira] [Updated] (YARN-3301) Fix the format issue of the new RM web UI and AHS web UI
[ https://issues.apache.org/jira/browse/YARN-3301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuan Gong updated YARN-3301: Attachment: YARN-3301.1.patch Simple fix Fix the format issue of the new RM web UI and AHS web UI Key: YARN-3301 URL: https://issues.apache.org/jira/browse/YARN-3301 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Xuan Gong Assignee: Xuan Gong Attachments: YARN-3301.1.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3304) ResourceCalculatorProcessTree#getCpuUsagePercent default return value is inconsistent with other getters
[ https://issues.apache.org/jira/browse/YARN-3304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14390616#comment-14390616 ] Hudson commented on YARN-3304: -- FAILURE: Integrated in Hadoop-Hdfs-trunk #2082 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/2082/]) YARN-3304. Addendum patch. Cleaning up ResourceCalculatorProcessTree APIs for public use and removing inconsistencies in the default values. (Junping Du and Karthik Kambatla via vinodkv) (vinodkv: rev 7610925e90155dfe5edce05da31574e4fb81b948) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/util/TestWindowsBasedProcessTree.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/util/TestProcfsBasedProcessTree.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/util/WindowsBasedProcessTree.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/util/TestResourceCalculatorProcessTree.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/util/ResourceCalculatorProcessTree.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/util/ProcfsBasedProcessTree.java ResourceCalculatorProcessTree#getCpuUsagePercent default return value is inconsistent with other getters Key: YARN-3304 URL: https://issues.apache.org/jira/browse/YARN-3304 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Reporter: Junping Du Assignee: Junping Du Priority: Blocker Fix For: 2.7.0 Attachments: YARN-3304-appendix-v2.patch, YARN-3304-appendix-v3.patch, YARN-3304-appendix-v4.patch, YARN-3304-appendix.patch, YARN-3304-v2.patch, YARN-3304-v3.patch, YARN-3304-v4-boolean-way.patch, YARN-3304-v4-negative-way-MR.patch, YARN-3304-v4-negtive-value-way.patch, YARN-3304-v6-no-rename.patch, YARN-3304-v6-with-rename.patch, YARN-3304-v7.patch, YARN-3304-v8.patch, YARN-3304.patch, yarn-3304-5.patch Per discussions in YARN-3296, getCpuUsagePercent() will return -1 for unavailable case while other resource metrics are return 0 in the same case which sounds inconsistent. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
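A hedged caller-side illustration of the inconsistency described in this issue: before the fix, getCpuUsagePercent() reported -1 when a sample was not yet available while the memory getters reported 0, so callers had to special-case the CPU value. This is a sketch of such handling, not code from the patch.
{code:java}
import org.apache.hadoop.yarn.util.ResourceCalculatorProcessTree;

class CpuSampleSketch {
  /** Returns the CPU usage, treating the -1 "unavailable" sentinel as 0
   *  so it lines up with how the memory getters behaved. */
  static float sampleCpuOrZero(ResourceCalculatorProcessTree pTree) {
    pTree.updateProcessTree();
    float cpuPercent = pTree.getCpuUsagePercent();
    return cpuPercent >= 0 ? cpuPercent : 0f;
  }
}
{code}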
[jira] [Commented] (YARN-3248) Display count of nodes blacklisted by apps in the web UI
[ https://issues.apache.org/jira/browse/YARN-3248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14390622#comment-14390622 ] Hudson commented on YARN-3248: -- FAILURE: Integrated in Hadoop-Hdfs-trunk #2082 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/2082/]) YARN-3248. Display count of nodes blacklisted by apps in the web UI. (xgong: rev 4728bdfa15809db4b8b235faa286c65de4a48cf6) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/TestRMWebServicesApps.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/SchedulerApplicationAttempt.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/dao/AppAttemptInfo.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/AppSchedulingInfo.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/RMAppBlock.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/RMWebServices.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/CapacitySchedulerPage.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/webapp/AppBlock.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/webapp/AppsBlock.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/RMAppAttemptBlock.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/RMAppsBlock.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/AppsBlockWithMetrics.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/webapp/AppAttemptBlock.java Display count of nodes blacklisted by apps in the web UI Key: YARN-3248 URL: https://issues.apache.org/jira/browse/YARN-3248 Project: Hadoop YARN Issue Type: Improvement Components: capacityscheduler, resourcemanager Reporter: Varun Vasudev Assignee: Varun Vasudev Fix For: 2.8.0 Attachments: All applications.png, App page.png, Screenshot.jpg, apache-yarn-3248.0.patch, apache-yarn-3248.1.patch, apache-yarn-3248.2.patch, apache-yarn-3248.3.patch, apache-yarn-3248.4.patch It would be really useful when debugging app performance and failure issues to get a count of the nodes blacklisted by individual apps displayed in the web UI. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3424) Change logs for ContainerMonitorImpl's resourse monitoring from info to debug
[ https://issues.apache.org/jira/browse/YARN-3424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14390623#comment-14390623 ] Hudson commented on YARN-3424: -- FAILURE: Integrated in Hadoop-Hdfs-trunk #2082 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/2082/]) YARN-3424. Change logs for ContainerMonitorImpl's resourse monitoring from info to debug. Contributed by Anubhav Dhoot. (ozawa: rev c69ba81497ae4da329ddb34ba712a64a7eec479f) * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/monitor/ContainersMonitorImpl.java Change logs for ContainerMonitorImpl's resourse monitoring from info to debug - Key: YARN-3424 URL: https://issues.apache.org/jira/browse/YARN-3424 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Affects Versions: 2.7.0 Reporter: Anubhav Dhoot Assignee: Anubhav Dhoot Fix For: 2.8.0 Attachments: YARN-3424.001.patch Today we log the memory usage of process at info level which spams the log with hundreds of log lines {noformat} 2015-03-27 09:32:48,905 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 9215 for container-id container_1427462602546_0002_01_08: 189.8 MB of 1 GB physical memory used; 2.6 GB of 2.1 GB virtual memory used {noformat} Proposing changing this to debug level -- This message was sent by Atlassian JIRA (v6.3.4#6332)
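The change here is just a log-level switch; a paraphrased sketch of the resulting pattern in ContainersMonitorImpl (not the exact patch hunk, and the variable names are placeholders) looks like:
{code:java}
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;

class MonitorLogSketch {
  private static final Log LOG = LogFactory.getLog(MonitorLogSketch.class);

  /** The per-iteration usage line is now emitted at DEBUG instead of INFO,
   *  guarded so the message string is not built when debug is off. */
  static void logUsage(String pId, String containerId, String usage) {
    if (LOG.isDebugEnabled()) {
      LOG.debug("Memory usage of ProcessTree " + pId + " for container-id "
          + containerId + ": " + usage);
    }
  }
}
{code}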
[jira] [Commented] (YARN-3428) Debug log resources to be localized for a container
[ https://issues.apache.org/jira/browse/YARN-3428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14390620#comment-14390620 ] Hudson commented on YARN-3428: -- FAILURE: Integrated in Hadoop-Hdfs-trunk #2082 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/2082/]) YARN-3428. Debug log resources to be localized for a container. (kasha) (kasha: rev 2daa478a6420585dc13cea2111580ed5fe347bc1) * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/ResourceLocalizationService.java Debug log resources to be localized for a container --- Key: YARN-3428 URL: https://issues.apache.org/jira/browse/YARN-3428 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Affects Versions: 2.7.0 Reporter: Karthik Kambatla Assignee: Karthik Kambatla Priority: Trivial Fix For: 2.8.0 Attachments: yarn-3428-1.patch For each container, we log the resources going through INIT - LOCALIZING - DOWNLOADED transitions. These logs do not have container-id itself. It would be nice to add debug logs to capture the resources being localized for a container. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3412) RM tests should use MockRM where possible
[ https://issues.apache.org/jira/browse/YARN-3412?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14390624#comment-14390624 ] Hudson commented on YARN-3412: -- FAILURE: Integrated in Hadoop-Hdfs-trunk #2082 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/2082/]) YARN-3412. RM tests should use MockRM where possible. (kasha) (kasha: rev 79f7f2aabfd7a69722748850f4d3b1ff54af7556) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestMoveApplication.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fifo/TestFifoScheduler.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/monitor/TestSchedulingMonitor.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/recovery/TestZKRMStateStore.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/TestFairScheduler.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/TestFairSchedulerEventLog.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestResourceManager.java * hadoop-yarn-project/CHANGES.txt RM tests should use MockRM where possible - Key: YARN-3412 URL: https://issues.apache.org/jira/browse/YARN-3412 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager, test Affects Versions: 2.7.0 Reporter: Karthik Kambatla Assignee: Karthik Kambatla Fix For: 2.8.0 Attachments: yarn-3412-1.patch Noticed TestZKRMStateStore and TestMoveApplication fail when running on a mac, due to not being able to start the webapp. There are a few other tests that could use MockRM. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3248) Display count of nodes blacklisted by apps in the web UI
[ https://issues.apache.org/jira/browse/YARN-3248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14390638#comment-14390638 ] Hudson commented on YARN-3248: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk-Java8 #150 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/150/]) YARN-3248. Display count of nodes blacklisted by apps in the web UI. (xgong: rev 4728bdfa15809db4b8b235faa286c65de4a48cf6) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/webapp/AppAttemptBlock.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/webapp/AppBlock.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/dao/AppAttemptInfo.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/RMAppBlock.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/SchedulerApplicationAttempt.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/AppSchedulingInfo.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/RMAppAttemptBlock.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/TestRMWebServicesApps.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/webapp/AppsBlock.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/CapacitySchedulerPage.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/RMAppsBlock.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/AppsBlockWithMetrics.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/RMWebServices.java YARN-3248. Correct fix version from branch-2.7 to branch-2.8 in the change log. (xgong: rev 2e79f1c2125517586c165a84e99d3c4d38ca0938) * hadoop-yarn-project/CHANGES.txt Display count of nodes blacklisted by apps in the web UI Key: YARN-3248 URL: https://issues.apache.org/jira/browse/YARN-3248 Project: Hadoop YARN Issue Type: Improvement Components: capacityscheduler, resourcemanager Reporter: Varun Vasudev Assignee: Varun Vasudev Fix For: 2.8.0 Attachments: All applications.png, App page.png, Screenshot.jpg, apache-yarn-3248.0.patch, apache-yarn-3248.1.patch, apache-yarn-3248.2.patch, apache-yarn-3248.3.patch, apache-yarn-3248.4.patch It would be really useful when debugging app performance and failure issues to get a count of the nodes blacklisted by individual apps displayed in the web UI. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3428) Debug log resources to be localized for a container
[ https://issues.apache.org/jira/browse/YARN-3428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14390636#comment-14390636 ] Hudson commented on YARN-3428: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk-Java8 #150 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/150/]) YARN-3428. Debug log resources to be localized for a container. (kasha) (kasha: rev 2daa478a6420585dc13cea2111580ed5fe347bc1) * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/ResourceLocalizationService.java Debug log resources to be localized for a container --- Key: YARN-3428 URL: https://issues.apache.org/jira/browse/YARN-3428 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Affects Versions: 2.7.0 Reporter: Karthik Kambatla Assignee: Karthik Kambatla Priority: Trivial Fix For: 2.8.0 Attachments: yarn-3428-1.patch For each container, we log the resources going through INIT - LOCALIZING - DOWNLOADED transitions. These logs do not have container-id itself. It would be nice to add debug logs to capture the resources being localized for a container. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3304) ResourceCalculatorProcessTree#getCpuUsagePercent default return value is inconsistent with other getters
[ https://issues.apache.org/jira/browse/YARN-3304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14390632#comment-14390632 ] Hudson commented on YARN-3304: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk-Java8 #150 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/150/]) YARN-3304. Addendum patch. Cleaning up ResourceCalculatorProcessTree APIs for public use and removing inconsistencies in the default values. (Junping Du and Karthik Kambatla via vinodkv) (vinodkv: rev 7610925e90155dfe5edce05da31574e4fb81b948) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/util/WindowsBasedProcessTree.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/util/TestResourceCalculatorProcessTree.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/util/TestProcfsBasedProcessTree.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/util/ProcfsBasedProcessTree.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/util/TestWindowsBasedProcessTree.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/util/ResourceCalculatorProcessTree.java ResourceCalculatorProcessTree#getCpuUsagePercent default return value is inconsistent with other getters Key: YARN-3304 URL: https://issues.apache.org/jira/browse/YARN-3304 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Reporter: Junping Du Assignee: Junping Du Priority: Blocker Fix For: 2.7.0 Attachments: YARN-3304-appendix-v2.patch, YARN-3304-appendix-v3.patch, YARN-3304-appendix-v4.patch, YARN-3304-appendix.patch, YARN-3304-v2.patch, YARN-3304-v3.patch, YARN-3304-v4-boolean-way.patch, YARN-3304-v4-negative-way-MR.patch, YARN-3304-v4-negtive-value-way.patch, YARN-3304-v6-no-rename.patch, YARN-3304-v6-with-rename.patch, YARN-3304-v7.patch, YARN-3304-v8.patch, YARN-3304.patch, yarn-3304-5.patch Per discussions in YARN-3296, getCpuUsagePercent() will return -1 for unavailable case while other resource metrics are return 0 in the same case which sounds inconsistent. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3412) RM tests should use MockRM where possible
[ https://issues.apache.org/jira/browse/YARN-3412?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14390640#comment-14390640 ] Hudson commented on YARN-3412: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk-Java8 #150 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/150/]) YARN-3412. RM tests should use MockRM where possible. (kasha) (kasha: rev 79f7f2aabfd7a69722748850f4d3b1ff54af7556) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/monitor/TestSchedulingMonitor.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestMoveApplication.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/TestFairSchedulerEventLog.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/TestFairScheduler.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestResourceManager.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/recovery/TestZKRMStateStore.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fifo/TestFifoScheduler.java RM tests should use MockRM where possible - Key: YARN-3412 URL: https://issues.apache.org/jira/browse/YARN-3412 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager, test Affects Versions: 2.7.0 Reporter: Karthik Kambatla Assignee: Karthik Kambatla Fix For: 2.8.0 Attachments: yarn-3412-1.patch Noticed TestZKRMStateStore and TestMoveApplication fail when running on a mac, due to not being able to start the webapp. There are a few other tests that could use MockRM. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3424) Change logs for ContainerMonitorImpl's resourse monitoring from info to debug
[ https://issues.apache.org/jira/browse/YARN-3424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14390639#comment-14390639 ] Hudson commented on YARN-3424: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk-Java8 #150 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/150/]) YARN-3424. Change logs for ContainerMonitorImpl's resourse monitoring from info to debug. Contributed by Anubhav Dhoot. (ozawa: rev c69ba81497ae4da329ddb34ba712a64a7eec479f) * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/monitor/ContainersMonitorImpl.java Change logs for ContainerMonitorImpl's resourse monitoring from info to debug - Key: YARN-3424 URL: https://issues.apache.org/jira/browse/YARN-3424 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Affects Versions: 2.7.0 Reporter: Anubhav Dhoot Assignee: Anubhav Dhoot Fix For: 2.8.0 Attachments: YARN-3424.001.patch Today we log the memory usage of process at info level which spams the log with hundreds of log lines {noformat} 2015-03-27 09:32:48,905 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 9215 for container-id container_1427462602546_0002_01_08: 189.8 MB of 1 GB physical memory used; 2.6 GB of 2.1 GB virtual memory used {noformat} Proposing changing this to debug level -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3412) RM tests should use MockRM where possible
[ https://issues.apache.org/jira/browse/YARN-3412?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14390662#comment-14390662 ] Hudson commented on YARN-3412: -- FAILURE: Integrated in Hadoop-Hdfs-trunk-Java8 #141 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/141/]) YARN-3412. RM tests should use MockRM where possible. (kasha) (kasha: rev 79f7f2aabfd7a69722748850f4d3b1ff54af7556) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestResourceManager.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/TestFairScheduler.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/recovery/TestZKRMStateStore.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestMoveApplication.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fifo/TestFifoScheduler.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/monitor/TestSchedulingMonitor.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/TestFairSchedulerEventLog.java RM tests should use MockRM where possible - Key: YARN-3412 URL: https://issues.apache.org/jira/browse/YARN-3412 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager, test Affects Versions: 2.7.0 Reporter: Karthik Kambatla Assignee: Karthik Kambatla Fix For: 2.8.0 Attachments: yarn-3412-1.patch Noticed TestZKRMStateStore and TestMoveApplication fail when running on a mac, due to not being able to start the webapp. There are a few other tests that could use MockRM. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3248) Display count of nodes blacklisted by apps in the web UI
[ https://issues.apache.org/jira/browse/YARN-3248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14390660#comment-14390660 ] Hudson commented on YARN-3248: -- FAILURE: Integrated in Hadoop-Hdfs-trunk-Java8 #141 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/141/]) YARN-3248. Display count of nodes blacklisted by apps in the web UI. (xgong: rev 4728bdfa15809db4b8b235faa286c65de4a48cf6) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/SchedulerApplicationAttempt.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/AppsBlockWithMetrics.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/RMAppBlock.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/webapp/AppAttemptBlock.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/RMWebServices.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/TestRMWebServicesApps.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/webapp/AppsBlock.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/dao/AppAttemptInfo.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/RMAppAttemptBlock.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/AppSchedulingInfo.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/webapp/AppBlock.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/CapacitySchedulerPage.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/RMAppsBlock.java Display count of nodes blacklisted by apps in the web UI Key: YARN-3248 URL: https://issues.apache.org/jira/browse/YARN-3248 Project: Hadoop YARN Issue Type: Improvement Components: capacityscheduler, resourcemanager Reporter: Varun Vasudev Assignee: Varun Vasudev Fix For: 2.8.0 Attachments: All applications.png, App page.png, Screenshot.jpg, apache-yarn-3248.0.patch, apache-yarn-3248.1.patch, apache-yarn-3248.2.patch, apache-yarn-3248.3.patch, apache-yarn-3248.4.patch It would be really useful when debugging app performance and failure issues to get a count of the nodes blacklisted by individual apps displayed in the web UI. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3428) Debug log resources to be localized for a container
[ https://issues.apache.org/jira/browse/YARN-3428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14390658#comment-14390658 ] Hudson commented on YARN-3428: -- FAILURE: Integrated in Hadoop-Hdfs-trunk-Java8 #141 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/141/]) YARN-3428. Debug log resources to be localized for a container. (kasha) (kasha: rev 2daa478a6420585dc13cea2111580ed5fe347bc1) * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/ResourceLocalizationService.java Debug log resources to be localized for a container --- Key: YARN-3428 URL: https://issues.apache.org/jira/browse/YARN-3428 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Affects Versions: 2.7.0 Reporter: Karthik Kambatla Assignee: Karthik Kambatla Priority: Trivial Fix For: 2.8.0 Attachments: yarn-3428-1.patch For each container, we log the resources going through INIT - LOCALIZING - DOWNLOADED transitions. These logs do not have container-id itself. It would be nice to add debug logs to capture the resources being localized for a container. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3304) ResourceCalculatorProcessTree#getCpuUsagePercent default return value is inconsistent with other getters
[ https://issues.apache.org/jira/browse/YARN-3304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14390654#comment-14390654 ] Hudson commented on YARN-3304: -- FAILURE: Integrated in Hadoop-Hdfs-trunk-Java8 #141 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/141/]) YARN-3304. Addendum patch. Cleaning up ResourceCalculatorProcessTree APIs for public use and removing inconsistencies in the default values. (Junping Du and Karthik Kambatla via vinodkv) (vinodkv: rev 7610925e90155dfe5edce05da31574e4fb81b948) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/util/ResourceCalculatorProcessTree.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/util/ProcfsBasedProcessTree.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/util/TestResourceCalculatorProcessTree.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/util/WindowsBasedProcessTree.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/util/TestWindowsBasedProcessTree.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/util/TestProcfsBasedProcessTree.java ResourceCalculatorProcessTree#getCpuUsagePercent default return value is inconsistent with other getters Key: YARN-3304 URL: https://issues.apache.org/jira/browse/YARN-3304 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Reporter: Junping Du Assignee: Junping Du Priority: Blocker Fix For: 2.7.0 Attachments: YARN-3304-appendix-v2.patch, YARN-3304-appendix-v3.patch, YARN-3304-appendix-v4.patch, YARN-3304-appendix.patch, YARN-3304-v2.patch, YARN-3304-v3.patch, YARN-3304-v4-boolean-way.patch, YARN-3304-v4-negative-way-MR.patch, YARN-3304-v4-negtive-value-way.patch, YARN-3304-v6-no-rename.patch, YARN-3304-v6-with-rename.patch, YARN-3304-v7.patch, YARN-3304-v8.patch, YARN-3304.patch, yarn-3304-5.patch Per discussions in YARN-3296, getCpuUsagePercent() will return -1 for unavailable case while other resource metrics are return 0 in the same case which sounds inconsistent. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3424) Change logs for ContainerMonitorImpl's resourse monitoring from info to debug
[ https://issues.apache.org/jira/browse/YARN-3424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14390661#comment-14390661 ] Hudson commented on YARN-3424: -- FAILURE: Integrated in Hadoop-Hdfs-trunk-Java8 #141 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/141/]) YARN-3424. Change logs for ContainerMonitorImpl's resourse monitoring from info to debug. Contributed by Anubhav Dhoot. (ozawa: rev c69ba81497ae4da329ddb34ba712a64a7eec479f) * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/monitor/ContainersMonitorImpl.java Change logs for ContainerMonitorImpl's resourse monitoring from info to debug - Key: YARN-3424 URL: https://issues.apache.org/jira/browse/YARN-3424 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Affects Versions: 2.7.0 Reporter: Anubhav Dhoot Assignee: Anubhav Dhoot Fix For: 2.8.0 Attachments: YARN-3424.001.patch Today we log the memory usage of process at info level which spams the log with hundreds of log lines {noformat} 2015-03-27 09:32:48,905 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 9215 for container-id container_1427462602546_0002_01_08: 189.8 MB of 1 GB physical memory used; 2.6 GB of 2.1 GB virtual memory used {noformat} Proposing changing this to debug level -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2467) Add SpanReceiverHost to ResourceManager
[ https://issues.apache.org/jira/browse/YARN-2467?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Masatake Iwasaki updated YARN-2467: --- Summary: Add SpanReceiverHost to ResourceManager (was: Add SpanReceiverHost to YARN daemons ) Add SpanReceiverHost to ResourceManager --- Key: YARN-2467 URL: https://issues.apache.org/jira/browse/YARN-2467 Project: Hadoop YARN Issue Type: Sub-task Components: api, resourcemanager Reporter: Masatake Iwasaki Assignee: Masatake Iwasaki -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2467) Add SpanReceiverHost to ResourceManager
[ https://issues.apache.org/jira/browse/YARN-2467?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Masatake Iwasaki updated YARN-2467: --- Component/s: (was: nodemanager) Add SpanReceiverHost to ResourceManager --- Key: YARN-2467 URL: https://issues.apache.org/jira/browse/YARN-2467 Project: Hadoop YARN Issue Type: Sub-task Components: api, resourcemanager Reporter: Masatake Iwasaki Assignee: Masatake Iwasaki -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3430) RMAppAttempt headroom data is missing in RM Web UI
[ https://issues.apache.org/jira/browse/YARN-3430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14390766#comment-14390766 ] Hadoop QA commented on YARN-3430: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12708678/YARN-3430.1.patch against trunk revision 2e79f1c. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestContainerAllocation Test results: https://builds.apache.org/job/PreCommit-YARN-Build/7190//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/7190//console This message is automatically generated. RMAppAttempt headroom data is missing in RM Web UI -- Key: YARN-3430 URL: https://issues.apache.org/jira/browse/YARN-3430 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager, webapp, yarn Reporter: Xuan Gong Assignee: Xuan Gong Priority: Blocker Attachments: YARN-3430.1.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3334) [Event Producers] NM TimelineClient life cycle handling and container metrics posting to new timeline service.
[ https://issues.apache.org/jira/browse/YARN-3334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Junping Du updated YARN-3334: - Attachment: YARN-3334-v5.patch Uploaded a v5 patch that addresses all of the review comments above. Regarding using ContainerEntity to replace TimelineEntity: there is a bug where an UnrecognizedPropertyException gets thrown when serializing/deserializing the children element while consuming the entity as the base class (TimelineEntity). I have commented that element's annotation out until we find a better solution (this will not be addressed in this JIRA). [Event Producers] NM TimelineClient life cycle handling and container metrics posting to new timeline service. -- Key: YARN-3334 URL: https://issues.apache.org/jira/browse/YARN-3334 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Affects Versions: YARN-2928 Reporter: Junping Du Assignee: Junping Du Attachments: YARN-3334-demo.patch, YARN-3334-v1.patch, YARN-3334-v2.patch, YARN-3334-v3.patch, YARN-3334-v4.patch, YARN-3334-v5.patch After YARN-3039, we have a service discovery mechanism to pass the app-collector service address among collectors, NMs and the RM. In this JIRA, we will handle the service address setting for TimelineClients in the NodeManager, and post container metrics to the backend storage. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2467) Add SpanReceiverHost to ResourceManager
[ https://issues.apache.org/jira/browse/YARN-2467?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Masatake Iwasaki updated YARN-2467: --- Attachment: YARN-2467.001.patch I would like to narrow down the focus of this sub-task to ResourceManager only. Attached patch adds SpanReceiverHost to RM and moves some testing utils from hadoop-hdfs to hadoop-common. Add SpanReceiverHost to ResourceManager --- Key: YARN-2467 URL: https://issues.apache.org/jira/browse/YARN-2467 Project: Hadoop YARN Issue Type: Sub-task Components: api, resourcemanager Reporter: Masatake Iwasaki Assignee: Masatake Iwasaki Attachments: YARN-2467.001.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3391) Clearly define flow ID/ flow run / flow version in API and storage
[ https://issues.apache.org/jira/browse/YARN-3391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14390900#comment-14390900 ] Sangjin Lee commented on YARN-3391: --- Hi [~djp], The flow id identifies a distinct flow application that can be run repeatedly over time. The flow run id identifies one instance (or specific execution) of that flow. Finally, the flow version keeps track of the changes made to the flow (e.g. changes to the source code). Let me give you a concrete example. Suppose you have a pig script you run repeatedly, named tracking.pig. The flow id in this case may be tracking.pig (or al...@tracking.pig to denote the fact that user alice runs this script). The tracking.pig script will be run repeatedly many times. If I run it today, that specific run may have the flow run id of 1427846400 (the timestamp when the pig script started). If I run it again tomorrow, the run id of that run would be 1427932800, and so on. Multiple run ids for the same flow id form a series of runs of the same script. The flow version identifies changes made to the flow (the user application). One scheme may be to use some kind of a hash of the pig script. Another scheme may be to use the git commit hash. Or some real versions if the user application has well-defined versions. A flow run is *NOT* a subset of YARN apps run inside a flow. A flow is a template of runs if you will, and a flow run is an actual run instance of that flow. These are described in some detail in the original design doc in YARN-2928. I hope this helps. Clearly define flow ID/ flow run / flow version in API and storage -- Key: YARN-3391 URL: https://issues.apache.org/jira/browse/YARN-3391 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Zhijie Shen Assignee: Zhijie Shen Attachments: YARN-3391.1.patch To continue the discussion in YARN-3040, let's figure out the best way to describe the flow. Some key issues that we need to conclude on: - How do we include the flow version in the context so that it gets passed into the collector and to the storage eventually? - Flow run id should be a number as opposed to a generic string? - Default behavior for the flow run id if it is missing (i.e. client did not set it) - How do we handle flow attributes in case of nested levels of flows? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
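A tiny illustration of the three concepts as described above, with made-up values and a hypothetical holder class (the real context object in the timeline service code may differ):
{code:java}
/** Hypothetical example values for the flow concepts discussed above. */
class FlowContextExample {
  public static void main(String[] args) {
    String user = "alice";                 // the user who owns the flow
    String flowId = "tracking.pig";        // the repeatedly-run flow application
    long flowRunId = 1427846400L;          // one execution, e.g. its start timestamp
    String flowVersion = "9f3c2b1";        // made-up hash tracking changes to the script
    System.out.println(user + "/" + flowId + " run=" + flowRunId
        + " version=" + flowVersion);
  }
}
{code}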
[jira] [Commented] (YARN-3301) Fix the format issue of the new RM web UI and AHS web UI
[ https://issues.apache.org/jira/browse/YARN-3301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14390791#comment-14390791 ] Hadoop QA commented on YARN-3301: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12708686/YARN-3301.1.patch against trunk revision 2e79f1c. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.yarn.server.resourcemanager.TestApplicationACLs org.apache.hadoop.yarn.server.resourcemanager.TestClientRMService org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestCapacityScheduler org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.TestAllocationFileLoaderService Test results: https://builds.apache.org/job/PreCommit-YARN-Build/7191//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/7191//console This message is automatically generated. Fix the format issue of the new RM web UI and AHS web UI Key: YARN-3301 URL: https://issues.apache.org/jira/browse/YARN-3301 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Xuan Gong Assignee: Xuan Gong Attachments: YARN-3301.1.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3391) Clearly define flow ID/ flow run / flow version in API and storage
[ https://issues.apache.org/jira/browse/YARN-3391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14390844#comment-14390844 ] Junping Du commented on YARN-3391: -- Thanks [~zjshen] for delivering the patch! To be honest, I am getting more confused about these concepts from some of the discussion above. From what I understood, a flow is a group of applications that get run (sequentially or in parallel) as a batch, and a flow_run is one run branch for a subset of the flow's applications (apps in a flow_run only run in sequence, but different flow_runs under one flow can run in parallel). Does the flow version then sound like a timestamp concept (from an HBase perspective) that represents a specific run time of the flow? I quickly went through the attached patch but did not find the answer there. I think we should document the concept/definition of flow, flow run and flow version clearly in the Javadoc (web documentation can come later when we finish the feature), which would help reviewers and developers understand better. Clearly define flow ID/ flow run / flow version in API and storage -- Key: YARN-3391 URL: https://issues.apache.org/jira/browse/YARN-3391 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Zhijie Shen Assignee: Zhijie Shen Attachments: YARN-3391.1.patch To continue the discussion in YARN-3040, let's figure out the best way to describe the flow. Some key issues that we need to conclude on: - How do we include the flow version in the context so that it gets passed into the collector and to the storage eventually? - Flow run id should be a number as opposed to a generic string? - Default behavior for the flow run id if it is missing (i.e. client did not set it) - How do we handle flow attributes in case of nested levels of flows? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3391) Clearly define flow ID/ flow run / flow version in API and storage
[ https://issues.apache.org/jira/browse/YARN-3391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14390941#comment-14390941 ] Zhijie Shen commented on YARN-3391: --- I'll put some description in the javadoc somewhere, but I think eventually we need to describe it clearly in the documentation of YTS v2. Clearly define flow ID/ flow run / flow version in API and storage -- Key: YARN-3391 URL: https://issues.apache.org/jira/browse/YARN-3391 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Zhijie Shen Assignee: Zhijie Shen Attachments: YARN-3391.1.patch To continue the discussion in YARN-3040, let's figure out the best way to describe the flow. Some key issues that we need to conclude on: - How do we include the flow version in the context so that it gets passed into the collector and to the storage eventually? - Flow run id should be a number as opposed to a generic string? - Default behavior for the flow run id if it is missing (i.e. client did not set it) - How do we handle flow attributes in case of nested levels of flows? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3225) New parameter or CLI for decommissioning node gracefully in RMAdmin CLI
[ https://issues.apache.org/jira/browse/YARN-3225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14390670#comment-14390670 ] Devaraj K commented on YARN-3225: - This failed test is not related to the patch. {code:xml} org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestCapacityScheduler {code} New parameter or CLI for decommissioning node gracefully in RMAdmin CLI --- Key: YARN-3225 URL: https://issues.apache.org/jira/browse/YARN-3225 Project: Hadoop YARN Issue Type: Sub-task Reporter: Junping Du Assignee: Devaraj K Attachments: YARN-3225-1.patch, YARN-3225-2.patch, YARN-3225-3.patch, YARN-3225.patch, YARN-914.patch A new CLI (or the existing CLI with new parameters) should put each node on the decommission list into the decommissioning state and track a timeout to terminate the nodes that haven't finished. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3391) Clearly define flow ID/ flow run / flow version in API and storage
[ https://issues.apache.org/jira/browse/YARN-3391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14390936#comment-14390936 ] Junping Du commented on YARN-3391: -- Thanks [~sjlee0] for replying quickly! That helps a lot. I initially thought a flow run was a run instance (maybe from the YARN-2928 design doc or somewhere) but got confused by something else when I saw flow version. Thanks for bringing me back. :) Given that other contributors could miss the discussion here, I would suggest we add Javadoc to explain these somewhere, e.g. in TimelineCollectorContext.java. Clearly define flow ID/ flow run / flow version in API and storage -- Key: YARN-3391 URL: https://issues.apache.org/jira/browse/YARN-3391 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Zhijie Shen Assignee: Zhijie Shen Attachments: YARN-3391.1.patch To continue the discussion in YARN-3040, let's figure out the best way to describe the flow. Some key issues that we need to conclude on: - How do we include the flow version in the context so that it gets passed into the collector and to the storage eventually? - Flow run id should be a number as opposed to a generic string? - Default behavior for the flow run id if it is missing (i.e. client did not set it) - How do we handle flow attributes in case of nested levels of flows? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
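The following is a minimal sketch of the kind of Javadoc that could capture these definitions in a context class such as TimelineCollectorContext. The field names and wording are illustrative assumptions based on the discussion above, not the committed API.
{code:java}
// Illustrative sketch only: hypothetical fields and Javadoc wording, not the
// actual TimelineCollectorContext class.
public class TimelineCollectorContextSketch {

  /**
   * Flow name: identifies a logical workflow, i.e. a group of YARN
   * applications launched (sequentially or in parallel) to accomplish one
   * task, e.g. "daily-user-aggregation".
   */
  private String flowName;

  /**
   * Flow version: identifies a particular variant of the flow's code or
   * configuration, so runs of different versions can be distinguished.
   */
  private String flowVersion;

  /**
   * Flow run id: a number (often a timestamp) that distinguishes one
   * execution of the flow from another.
   */
  private long flowRunId;
}
{code}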
[jira] [Commented] (YARN-3415) Non-AM containers can be counted towards amResourceUsage of a fairscheduler queue
[ https://issues.apache.org/jira/browse/YARN-3415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14391738#comment-14391738 ] Rohit Agarwal commented on YARN-3415: - +1 Non-AM containers can be counted towards amResourceUsage of a fairscheduler queue - Key: YARN-3415 URL: https://issues.apache.org/jira/browse/YARN-3415 Project: Hadoop YARN Issue Type: Bug Components: fairscheduler Affects Versions: 2.6.0 Reporter: Rohit Agarwal Assignee: zhihai xu Priority: Critical Attachments: YARN-3415.000.patch, YARN-3415.001.patch, YARN-3415.002.patch We encountered this problem while running a spark cluster. The amResourceUsage for a queue became artificially high and then the cluster got deadlocked because the maxAMShare constrain kicked in and no new AM got admitted to the cluster. I have described the problem in detail here: https://github.com/apache/spark/pull/5233#issuecomment-87160289 In summary - the condition for adding the container's memory towards amResourceUsage is fragile. It depends on the number of live containers belonging to the app. We saw that the spark AM went down without explicitly releasing its requested containers and then one of those containers memory was counted towards amResource. cc - [~sandyr] -- This message was sent by Atlassian JIRA (v6.3.4#6332)
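The description above notes that the fragile part is inferring "this container is the AM" from the number of live containers. Below is a minimal sketch of the alternative idea, charging the AM resource exactly once via an explicit flag; all names here are illustrative, not the actual FairScheduler code.
{code:java}
// Illustrative sketch (not the actual FairScheduler fix): charge the queue's
// amResourceUsage exactly once per managed-AM application, keyed off an
// explicit flag instead of the live-container count.
public class AppAmAccountingSketch {
  private boolean amResourceCounted = false; // hypothetical per-app flag
  private long queueAmResourceUsageMB = 0;   // stands in for the queue metric

  void onContainerAllocated(long containerMemMB, boolean appHasUnmanagedAM) {
    if (!amResourceCounted && !appHasUnmanagedAM) {
      // The first allocated container of a managed-AM app is its AM container.
      queueAmResourceUsageMB += containerMemMB;
      amResourceCounted = true;
    }
    // Containers granted later (e.g. after the AM exited without releasing its
    // requests) are never charged to amResourceUsage again.
  }
}
{code}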
[jira] [Assigned] (YARN-3432) Cluster metrics have wrong Total Memory when there is reserved memory on CS
[ https://issues.apache.org/jira/browse/YARN-3432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Brahma Reddy Battula reassigned YARN-3432: -- Assignee: Brahma Reddy Battula Cluster metrics have wrong Total Memory when there is reserved memory on CS --- Key: YARN-3432 URL: https://issues.apache.org/jira/browse/YARN-3432 Project: Hadoop YARN Issue Type: Bug Components: capacityscheduler, resourcemanager Affects Versions: 2.6.0 Reporter: Thomas Graves Assignee: Brahma Reddy Battula I noticed that when reservations happen when using the Capacity Scheduler, the UI and web services report the wrong total memory. For example. I have a 300GB of total memory in my cluster. I allocate 50 and I reserve 10. The cluster metrics for total memory get reported as 290GB. This was broken by https://issues.apache.org/jira/browse/YARN-656 so perhaps there is a difference between fair scheduler and capacity scheduler. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-1572) Low chance to hit NPE issue in AppSchedulingInfo#allocateNodeLocal
[ https://issues.apache.org/jira/browse/YARN-1572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kareem El Gebaly updated YARN-1572: --- Attachment: YARN-1572-branch-2.3.0.001.patch Low chance to hit NPE issue in AppSchedulingInfo#allocateNodeLocal -- Key: YARN-1572 URL: https://issues.apache.org/jira/browse/YARN-1572 Project: Hadoop YARN Issue Type: Bug Components: scheduler Affects Versions: 2.3.0 Reporter: Wenwu Peng Assignee: Wenwu Peng Attachments: YARN-1572-branch-2.3.0.001.patch, YARN-1572-log.tar.gz, conf.tar.gz, log.tar.gz we have lower chance to hit NPE in allocateNodeLocal when run benchmark(hit 4 in 20 times). {code} 2014-07-31 04:18:19,653 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerNode: Assigned container container_1406794589275_0001_01_21 of capacity memory:1024, vCores:1 on host datanode10:57281, which has 6 containers, memory:6144, vCores:6 used and memory:2048, vCores:2 available after allocation 2014-07-31 04:18:19,654 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in handling event type NODE_UPDATE to the scheduler java.lang.NullPointerException at org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo.allocateNodeLocal(AppSchedulingInfo.java:311) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo.allocate(AppSchedulingInfo.java:268) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.allocate(FiCaSchedulerApp.java:136) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fifo.FifoScheduler.assignContainer(FifoScheduler.java:683) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fifo.FifoScheduler.assignNodeLocalContainers(FifoScheduler.java:602) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fifo.FifoScheduler.assignContainersOnNode(FifoScheduler.java:560) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fifo.FifoScheduler.assignContainers(FifoScheduler.java:488) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fifo.FifoScheduler.nodeUpdate(FifoScheduler.java:729) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fifo.FifoScheduler.handle(FifoScheduler.java:774) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fifo.FifoScheduler.handle(FifoScheduler.java:101) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:599) at java.lang.Thread.run(Thread.java:662) 2014-07-31 04:18:19,655 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Exiting, bbye.. {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3391) Clearly define flow ID/ flow run / flow version in API and storage
[ https://issues.apache.org/jira/browse/YARN-3391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14391864#comment-14391864 ] Zhijie Shen commented on YARN-3391: --- Sangjin, thanks for your comments, too. According to your and Joep's comments, I can see the benefit of showing application aggregation information by application (type). However, IMHO, it's orthogonal to the flow definition. Isn't the more straightforward approach to provide this feature by aggregating on the application name/type dimension, instead of letting flow name = application name? On the other hand, flow should semantically stand for *workflow* (correct me if I'm wrong about the flow concept), which contains a group of applications that work together to solve a problem. Making flow name == application name changes the semantics, so that a flow of applications would mean the applications of the same type. {quote} If a user is running TestDFSIO over and over, they should be recognized as different instances of the same thing. {quote} I guess the same thing you had in mind is not the same workflow, but the same application type, right? How about we decouple the two concepts? One step back: when users set the flow explicitly, are they going to tell the application that it belongs to workflow abc, or that it belongs to job type xyz? I think it will be the former. Clearly define flow ID/ flow run / flow version in API and storage -- Key: YARN-3391 URL: https://issues.apache.org/jira/browse/YARN-3391 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Zhijie Shen Assignee: Zhijie Shen Attachments: YARN-3391.1.patch To continue the discussion in YARN-3040, let's figure out the best way to describe the flow. Some key issues that we need to conclude on: - How do we include the flow version in the context so that it gets passed into the collector and to the storage eventually? - Flow run id should be a number as opposed to a generic string? - Default behavior for the flow run id if it is missing (i.e. client did not set it) - How do we handle flow attributes in case of nested levels of flows? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3413) Node label attributes (like exclusive or not) should be able to set when addToClusterNodeLabels and shouldn't be changed during runtime
[ https://issues.apache.org/jira/browse/YARN-3413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14391866#comment-14391866 ] Hadoop QA commented on YARN-3413: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12708821/YARN-3413.1.patch against trunk revision c94d594. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 17 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:red}-1 findbugs{color}. The patch appears to introduce 1 new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/7194//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/7194//artifact/patchprocess/newPatchFindbugsWarningshadoop-yarn-common.html Console output: https://builds.apache.org/job/PreCommit-YARN-Build/7194//console This message is automatically generated. Node label attributes (like exclusive or not) should be able to set when addToClusterNodeLabels and shouldn't be changed during runtime --- Key: YARN-3413 URL: https://issues.apache.org/jira/browse/YARN-3413 Project: Hadoop YARN Issue Type: Sub-task Components: api, client, resourcemanager Reporter: Wangda Tan Assignee: Wangda Tan Attachments: YARN-3413.1.patch As mentioned in : https://issues.apache.org/jira/browse/YARN-3345?focusedCommentId=14384947page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14384947. Changing node label exclusivity and/or other attributes may not be a real use case, and also we should support setting node label attributes whiling adding them to cluster. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
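If the exclusivity attribute ends up being set only at label-creation time, the admin-side usage might look roughly like the sketch below. The flag syntax is an assumption about the end state of this JIRA, not the committed CLI.
{noformat}
# Hypothetical usage: exclusivity fixed once, when the label is added
yarn rmadmin -addToClusterNodeLabels "gpu(exclusive=true),batch(exclusive=false)"
# A later attempt to flip exclusivity at runtime would be rejected
{noformat}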
[jira] [Commented] (YARN-2729) Support script based NodeLabelsProvider Interface in Distributed Node Label Configuration Setup
[ https://issues.apache.org/jira/browse/YARN-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14391921#comment-14391921 ] Hadoop QA commented on YARN-2729: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12708788/YARN-2729.20150402-1.patch against trunk revision c94d594. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-hdfs-project/hadoop-hdfs: org.apache.hadoop.hdfs.server.namenode.ha.TestPipelinesFailover org.apache.hadoop.hdfs.server.namenode.ha.TestRetryCacheWithHA The following test timeouts occurred in hadoop-hdfs-project/hadoop-hdfs: org.apache.hadoop.hdfs.server.namenode.ha.TestPipelinesFailover org.apache.hadoop.hdfs.TestLeaseRecovery2 Test results: https://builds.apache.org/job/PreCommit-YARN-Build/7193//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/7193//console This message is automatically generated. Support script based NodeLabelsProvider Interface in Distributed Node Label Configuration Setup --- Key: YARN-2729 URL: https://issues.apache.org/jira/browse/YARN-2729 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Reporter: Naganarasimha G R Assignee: Naganarasimha G R Fix For: 2.8.0 Attachments: YARN-2729.20141023-1.patch, YARN-2729.20141024-1.patch, YARN-2729.20141031-1.patch, YARN-2729.20141120-1.patch, YARN-2729.20141210-1.patch, YARN-2729.20150309-1.patch, YARN-2729.20150322-1.patch, YARN-2729.20150401-1.patch, YARN-2729.20150402-1.patch Support script based NodeLabelsProvider Interface in Distributed Node Label Configuration Setup . -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2740) ResourceManager side should properly handle node label modifications when distributed node label configuration enabled
[ https://issues.apache.org/jira/browse/YARN-2740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14392066#comment-14392066 ] Naganarasimha G R commented on YARN-2740: - Thanks for the review [~wangda]. bq. Beyond CommonNodeLabelsManager shouldn't persist labels on nodes when NM do heartbeat., it shouldn't recover labels on nodes when RM restart. This is because RM configured centralized config, add some labels to nodes and change config to distributed then restart. Good catch! I can achieve this in a couple of ways: * Modify {{NodeLabelsStore.recover()}} to accept a boolean parameter like {{boolean skipNodeToLabelsMappings}} and leave the responsibility to the store (FileSystemNodeLabelsStore would need to take care of the skipping) * Add a method in CommonNodeLabelsManager like {{recoverLabelsOnNode}}, let the store use this instead of {{replaceLabelsOnNode}}, and handle the skipping in the new method, i.e. {{CommonNodeLabelsManager.recoverLabelsOnNode}}. If we need to further ensure that NodeLabelsStore does not call replaceLabelsOnNode, we can extract an interface for the methods used by the NodeLabelsStore and make CommonNodeLabelsManager implement it. Please share your opinion on the suggested approaches, and also any other alternatives you have in mind. I will handle the 2nd point in the next patch. ResourceManager side should properly handle node label modifications when distributed node label configuration enabled -- Key: YARN-2740 URL: https://issues.apache.org/jira/browse/YARN-2740 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Wangda Tan Assignee: Naganarasimha G R Fix For: 2.8.0 Attachments: YARN-2740-20141024-1.patch, YARN-2740.20150320-1.patch, YARN-2740.20150327-1.patch According to YARN-2495, when distributed node label configuration is enabled: - RMAdmin / REST API should reject change labels on node operations. - CommonNodeLabelsManager shouldn't persist labels on nodes when NM do heartbeat. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
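A minimal sketch of the second approach described in the comment above. Only the method names {{recoverLabelsOnNode}} and {{replaceLabelsOnNode}} come from the comment; the field name and method bodies are assumptions.
{code:java}
import java.io.IOException;
import java.util.Map;
import java.util.Set;
import org.apache.hadoop.yarn.api.records.NodeId;

// Illustrative sketch of approach 2: the store calls a recovery-specific hook
// instead of replaceLabelsOnNode, so the distributed-configuration case can
// skip restoring persisted node-to-label mappings.
public class CommonNodeLabelsManagerSketch {
  private boolean isDistributedNodeLabelConfiguration; // read from yarn-site.xml

  // Called by NodeLabelsStore during recovery instead of replaceLabelsOnNode.
  protected void recoverLabelsOnNode(Map<NodeId, Set<String>> nodeToLabels)
      throws IOException {
    if (isDistributedNodeLabelConfiguration) {
      // Labels now come from the NMs at heartbeat time; ignore mappings that
      // were persisted while centralized configuration was in effect.
      return;
    }
    replaceLabelsOnNode(nodeToLabels);
  }

  public void replaceLabelsOnNode(Map<NodeId, Set<String>> nodeToLabels)
      throws IOException {
    // existing behavior: validate, update in-memory structures, persist
  }
}
{code}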
[jira] [Commented] (YARN-2261) YARN should have a way to run post-application cleanup
[ https://issues.apache.org/jira/browse/YARN-2261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14391872#comment-14391872 ] Vinod Kumar Vavilapalli commented on YARN-2261: --- MAPREDUCE-4099 originally facilitated this for MapReduce in a not so ideal way. YARN should have a way to run post-application cleanup -- Key: YARN-2261 URL: https://issues.apache.org/jira/browse/YARN-2261 Project: Hadoop YARN Issue Type: New Feature Components: resourcemanager Reporter: Vinod Kumar Vavilapalli Assignee: Vinod Kumar Vavilapalli See MAPREDUCE-5956 for context. Specific options are at https://issues.apache.org/jira/browse/MAPREDUCE-5956?focusedCommentId=14054562page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14054562. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3432) Cluster metrics have wrong Total Memory when there is reserved memory on CS
[ https://issues.apache.org/jira/browse/YARN-3432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14391875#comment-14391875 ] Brahma Reddy Battula commented on YARN-3432: I think reverting YARN-656 should be fine.. Cluster metrics have wrong Total Memory when there is reserved memory on CS --- Key: YARN-3432 URL: https://issues.apache.org/jira/browse/YARN-3432 Project: Hadoop YARN Issue Type: Bug Components: capacityscheduler, resourcemanager Affects Versions: 2.6.0 Reporter: Thomas Graves Assignee: Brahma Reddy Battula I noticed that when reservations happen when using the Capacity Scheduler, the UI and web services report the wrong total memory. For example. I have a 300GB of total memory in my cluster. I allocate 50 and I reserve 10. The cluster metrics for total memory get reported as 290GB. This was broken by https://issues.apache.org/jira/browse/YARN-656 so perhaps there is a difference between fair scheduler and capacity scheduler. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
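A worked example with the numbers from the description above (300 GB total, 50 GB allocated, 10 GB reserved); variable names are illustrative.
{code:java}
public class TotalMemoryMetricSketch {
  public static void main(String[] args) {
    long allocatedGB = 50;
    long reservedGB = 10;
    long availableGB = 300 - allocatedGB - reservedGB;             // 240

    // Buggy total: reserved memory silently vanishes from the cluster metric.
    long buggyTotalGB = availableGB + allocatedGB;                 // 290
    // Expected total: reserved memory is still part of the cluster.
    long expectedTotalGB = availableGB + allocatedGB + reservedGB; // 300
    System.out.println(buggyTotalGB + " GB reported vs " + expectedTotalGB + " GB expected");
  }
}
{code}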
[jira] [Commented] (YARN-3391) Clearly define flow ID/ flow run / flow version in API and storage
[ https://issues.apache.org/jira/browse/YARN-3391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14391959#comment-14391959 ] Vrushali C commented on YARN-3391: -- For default values, workflow = appname is much more user friendly and intuitive than workflow name = flow_number_number. Setting the flow name to flow_number_number per run will mean the UI will have a lengthy list of flow_number_number (similar to JT/RM). This will not be a step up from current JT / RM UI experience. Clearly define flow ID/ flow run / flow version in API and storage -- Key: YARN-3391 URL: https://issues.apache.org/jira/browse/YARN-3391 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Zhijie Shen Assignee: Zhijie Shen Attachments: YARN-3391.1.patch To continue the discussion in YARN-3040, let's figure out the best way to describe the flow. Some key issues that we need to conclude on: - How do we include the flow version in the context so that it gets passed into the collector and to the storage eventually? - Flow run id should be a number as opposed to a generic string? - Default behavior for the flow run id if it is missing (i.e. client did not set it) - How do we handle flow attributes in case of nested levels of flows? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3415) Non-AM containers can be counted towards amResourceUsage of a fairscheduler queue
[ https://issues.apache.org/jira/browse/YARN-3415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14391972#comment-14391972 ] Hadoop QA commented on YARN-3415: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12708850/YARN-3415.002.patch against trunk revision 4d14816. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/7196//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/7196//console This message is automatically generated. Non-AM containers can be counted towards amResourceUsage of a fairscheduler queue - Key: YARN-3415 URL: https://issues.apache.org/jira/browse/YARN-3415 Project: Hadoop YARN Issue Type: Bug Components: fairscheduler Affects Versions: 2.6.0 Reporter: Rohit Agarwal Assignee: zhihai xu Priority: Critical Attachments: YARN-3415.000.patch, YARN-3415.001.patch, YARN-3415.002.patch We encountered this problem while running a spark cluster. The amResourceUsage for a queue became artificially high and then the cluster got deadlocked because the maxAMShare constrain kicked in and no new AM got admitted to the cluster. I have described the problem in detail here: https://github.com/apache/spark/pull/5233#issuecomment-87160289 In summary - the condition for adding the container's memory towards amResourceUsage is fragile. It depends on the number of live containers belonging to the app. We saw that the spark AM went down without explicitly releasing its requested containers and then one of those containers memory was counted towards amResource. cc - [~sandyr] -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1572) Low chance to hit NPE issue in AppSchedulingInfo#allocateNodeLocal
[ https://issues.apache.org/jira/browse/YARN-1572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14391976#comment-14391976 ] Hadoop QA commented on YARN-1572: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12708878/YARN-1572-branch-2.3.0.001.patch against trunk revision f383fd9. {color:red}-1 patch{color}. The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-YARN-Build/7197//console This message is automatically generated. Low chance to hit NPE issue in AppSchedulingInfo#allocateNodeLocal -- Key: YARN-1572 URL: https://issues.apache.org/jira/browse/YARN-1572 Project: Hadoop YARN Issue Type: Bug Components: scheduler Affects Versions: 2.3.0 Reporter: Wenwu Peng Assignee: Wenwu Peng Attachments: YARN-1572-branch-2.3.0.001.patch, YARN-1572-log.tar.gz, conf.tar.gz, log.tar.gz we have lower chance to hit NPE in allocateNodeLocal when run benchmark(hit 4 in 20 times). {code} 2014-07-31 04:18:19,653 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerNode: Assigned container container_1406794589275_0001_01_21 of capacity memory:1024, vCores:1 on host datanode10:57281, which has 6 containers, memory:6144, vCores:6 used and memory:2048, vCores:2 available after allocation 2014-07-31 04:18:19,654 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in handling event type NODE_UPDATE to the scheduler java.lang.NullPointerException at org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo.allocateNodeLocal(AppSchedulingInfo.java:311) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo.allocate(AppSchedulingInfo.java:268) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.allocate(FiCaSchedulerApp.java:136) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fifo.FifoScheduler.assignContainer(FifoScheduler.java:683) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fifo.FifoScheduler.assignNodeLocalContainers(FifoScheduler.java:602) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fifo.FifoScheduler.assignContainersOnNode(FifoScheduler.java:560) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fifo.FifoScheduler.assignContainers(FifoScheduler.java:488) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fifo.FifoScheduler.nodeUpdate(FifoScheduler.java:729) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fifo.FifoScheduler.handle(FifoScheduler.java:774) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fifo.FifoScheduler.handle(FifoScheduler.java:101) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:599) at java.lang.Thread.run(Thread.java:662) 2014-07-31 04:18:19,655 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Exiting, bbye.. {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3410) YARN admin should be able to remove individual application records from RMStateStore
[ https://issues.apache.org/jira/browse/YARN-3410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14392173#comment-14392173 ] Rohith commented on YARN-3410: -- For the state store format in YARN-2131, there was a discussion on whether to format the state using the admin service or a ResourceManager start-up option [comment link|https://issues.apache.org/jira/browse/YARN-2131?focusedCommentId=14032694page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14032694]. Similarly, I am thinking of the following options for application state deletion: # ./yarn resourcemanager -delete-from-state-store app-id OR # ./yarn rmadmin -delete-from-state-store app-id The 1st choice is a pretty straightforward deletion, regardless of whether the app state is finished or running. I would like to choose the 2nd option. YARN admin should be able to remove individual application records from RMStateStore Key: YARN-3410 URL: https://issues.apache.org/jira/browse/YARN-3410 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager, yarn Reporter: Wangda Tan Assignee: Rohith Priority: Critical When RM state store entered an unexpected state, one example is YARN-2340, when an attempt is not in final state but app already completed, RM can never get up unless format RMStateStore. I think we should support remove individual application records from RMStateStore to unblock RM admin make choice of either waiting for a fix or format state store. In addition, RM should be able to report all fatal errors (which will shutdown RM) when doing app recovery, this can save admin some time to remove apps in bad state. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
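A usage sketch for the second option; the flag name comes from the proposal above, and the application id is only a placeholder.
{noformat}
# Hypothetical usage of the proposed admin command (option 2 above):
# remove a single bad application record, then restart the RM normally.
yarn rmadmin -delete-from-state-store application_1427462602546_0002
{noformat}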
[jira] [Commented] (YARN-1572) Low chance to hit NPE issue in AppSchedulingInfo#allocateNodeLocal
[ https://issues.apache.org/jira/browse/YARN-1572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14391710#comment-14391710 ] Hadoop QA commented on YARN-1572: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12708801/0001-Fix-for-YARN-1572.patch against trunk revision 3c7adaa. {color:red}-1 patch{color}. The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-YARN-Build/7195//console This message is automatically generated. Low chance to hit NPE issue in AppSchedulingInfo#allocateNodeLocal -- Key: YARN-1572 URL: https://issues.apache.org/jira/browse/YARN-1572 Project: Hadoop YARN Issue Type: Bug Components: scheduler Affects Versions: 2.3.0 Reporter: Wenwu Peng Assignee: Wenwu Peng Attachments: 0001-Fix-for-YARN-1572.patch, YARN-1572-log.tar.gz, conf.tar.gz, log.tar.gz we have lower chance to hit NPE in allocateNodeLocal when run benchmark(hit 4 in 20 times). {code} 2014-07-31 04:18:19,653 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerNode: Assigned container container_1406794589275_0001_01_21 of capacity memory:1024, vCores:1 on host datanode10:57281, which has 6 containers, memory:6144, vCores:6 used and memory:2048, vCores:2 available after allocation 2014-07-31 04:18:19,654 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in handling event type NODE_UPDATE to the scheduler java.lang.NullPointerException at org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo.allocateNodeLocal(AppSchedulingInfo.java:311) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo.allocate(AppSchedulingInfo.java:268) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.allocate(FiCaSchedulerApp.java:136) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fifo.FifoScheduler.assignContainer(FifoScheduler.java:683) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fifo.FifoScheduler.assignNodeLocalContainers(FifoScheduler.java:602) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fifo.FifoScheduler.assignContainersOnNode(FifoScheduler.java:560) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fifo.FifoScheduler.assignContainers(FifoScheduler.java:488) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fifo.FifoScheduler.nodeUpdate(FifoScheduler.java:729) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fifo.FifoScheduler.handle(FifoScheduler.java:774) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fifo.FifoScheduler.handle(FifoScheduler.java:101) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:599) at java.lang.Thread.run(Thread.java:662) 2014-07-31 04:18:19,655 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Exiting, bbye.. {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3415) Non-AM containers can be counted towards amResourceUsage of a fairscheduler queue
[ https://issues.apache.org/jira/browse/YARN-3415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14391728#comment-14391728 ] zhihai xu commented on YARN-3415: - [~ragarwal], thanks for the review. I uploaded a new patch YARN-3415.002.patch which addressed your comment. Non-AM containers can be counted towards amResourceUsage of a fairscheduler queue - Key: YARN-3415 URL: https://issues.apache.org/jira/browse/YARN-3415 Project: Hadoop YARN Issue Type: Bug Components: fairscheduler Affects Versions: 2.6.0 Reporter: Rohit Agarwal Assignee: zhihai xu Priority: Critical Attachments: YARN-3415.000.patch, YARN-3415.001.patch, YARN-3415.002.patch We encountered this problem while running a spark cluster. The amResourceUsage for a queue became artificially high and then the cluster got deadlocked because the maxAMShare constrain kicked in and no new AM got admitted to the cluster. I have described the problem in detail here: https://github.com/apache/spark/pull/5233#issuecomment-87160289 In summary - the condition for adding the container's memory towards amResourceUsage is fragile. It depends on the number of live containers belonging to the app. We saw that the spark AM went down without explicitly releasing its requested containers and then one of those containers memory was counted towards amResource. cc - [~sandyr] -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-1572) Low chance to hit NPE issue in AppSchedulingInfo#allocateNodeLocal
[ https://issues.apache.org/jira/browse/YARN-1572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kareem El Gebaly updated YARN-1572: --- Attachment: (was: 0001-Fix-for-YARN-1572.patch) Low chance to hit NPE issue in AppSchedulingInfo#allocateNodeLocal -- Key: YARN-1572 URL: https://issues.apache.org/jira/browse/YARN-1572 Project: Hadoop YARN Issue Type: Bug Components: scheduler Affects Versions: 2.3.0 Reporter: Wenwu Peng Assignee: Wenwu Peng Attachments: YARN-1572-log.tar.gz, conf.tar.gz, log.tar.gz we have lower chance to hit NPE in allocateNodeLocal when run benchmark(hit 4 in 20 times). {code} 2014-07-31 04:18:19,653 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerNode: Assigned container container_1406794589275_0001_01_21 of capacity memory:1024, vCores:1 on host datanode10:57281, which has 6 containers, memory:6144, vCores:6 used and memory:2048, vCores:2 available after allocation 2014-07-31 04:18:19,654 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in handling event type NODE_UPDATE to the scheduler java.lang.NullPointerException at org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo.allocateNodeLocal(AppSchedulingInfo.java:311) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.AppSchedulingInfo.allocate(AppSchedulingInfo.java:268) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.allocate(FiCaSchedulerApp.java:136) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fifo.FifoScheduler.assignContainer(FifoScheduler.java:683) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fifo.FifoScheduler.assignNodeLocalContainers(FifoScheduler.java:602) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fifo.FifoScheduler.assignContainersOnNode(FifoScheduler.java:560) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fifo.FifoScheduler.assignContainers(FifoScheduler.java:488) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fifo.FifoScheduler.nodeUpdate(FifoScheduler.java:729) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fifo.FifoScheduler.handle(FifoScheduler.java:774) at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fifo.FifoScheduler.handle(FifoScheduler.java:101) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:599) at java.lang.Thread.run(Thread.java:662) 2014-07-31 04:18:19,655 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Exiting, bbye.. {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-796) Allow for (admin) labels on nodes and resource-requests
[ https://issues.apache.org/jira/browse/YARN-796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14391719#comment-14391719 ] Jian Fang commented on YARN-796: JIRA MAPREDUCE-6304 has been created for this purpose. Allow for (admin) labels on nodes and resource-requests --- Key: YARN-796 URL: https://issues.apache.org/jira/browse/YARN-796 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.4.1 Reporter: Arun C Murthy Assignee: Wangda Tan Attachments: LabelBasedScheduling.pdf, Node-labels-Requirements-Design-doc-V1.pdf, Node-labels-Requirements-Design-doc-V2.pdf, Non-exclusive-Node-Partition-Design.pdf, YARN-796-Diagram.pdf, YARN-796.node-label.consolidate.1.patch, YARN-796.node-label.consolidate.10.patch, YARN-796.node-label.consolidate.11.patch, YARN-796.node-label.consolidate.12.patch, YARN-796.node-label.consolidate.13.patch, YARN-796.node-label.consolidate.14.patch, YARN-796.node-label.consolidate.2.patch, YARN-796.node-label.consolidate.3.patch, YARN-796.node-label.consolidate.4.patch, YARN-796.node-label.consolidate.5.patch, YARN-796.node-label.consolidate.6.patch, YARN-796.node-label.consolidate.7.patch, YARN-796.node-label.consolidate.8.patch, YARN-796.node-label.demo.patch.1, YARN-796.patch, YARN-796.patch4 It will be useful for admins to specify labels for nodes. Examples of labels are OS, processor architecture etc. We should expose these labels and allow applications to specify labels on resource-requests. Obviously we need to support admin operations on adding/removing node labels. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3382) Some of UserMetricsInfo metrics are incorrectly set to root queue metrics
[ https://issues.apache.org/jira/browse/YARN-3382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rohit Agarwal updated YARN-3382: Target Version/s: 2.7.0 Affects Version/s: 2.2.0 2.3.0 2.4.0 2.5.0 2.6.0 Some of UserMetricsInfo metrics are incorrectly set to root queue metrics - Key: YARN-3382 URL: https://issues.apache.org/jira/browse/YARN-3382 Project: Hadoop YARN Issue Type: Bug Components: webapp Affects Versions: 2.2.0, 2.3.0, 2.4.0, 2.5.0, 2.6.0 Reporter: Rohit Agarwal Assignee: Rohit Agarwal Attachments: YARN-3382.patch {{appsCompleted}}, {{appsPending}}, {{appsRunning}} etc. in {{UserMetricsInfo}} are incorrectly set to the root queue's value instead of the user's value. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (YARN-3391) Clearly define flow ID/ flow run / flow version in API and storage
[ https://issues.apache.org/jira/browse/YARN-3391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14391864#comment-14391864 ] Zhijie Shen edited comment on YARN-3391 at 4/2/15 12:39 AM: Sangjin, thanks for your comments, too. According to your and Joep's comments, I can see the benefit to show application aggregation information by application (type). However, IMHO, it's orthogonal to flow definition. Isn't the straightforward approach to provide this feature via aggregating on application name/type dimension instead of let flow name = application name. On the other side, flow should semantically stand for *workflow* (correct me if I'm wrong about flow concept), which contains a group of applications that work together to resolve a problem. Making flow name == application name changes the semantics That said, a flow of applications means the applications of the same type. {quote} If a user is running TestDFSIO over and over, they should be recognized as different instances of the same thing. {quote} I guess the same thing you had in mind is not the same workflow, but the same application type, right? And back to Joep's web UI example, it's better to be described as getting sum(cost) from apps where app_name(type) = sleep. Therefore, how about we decoupling the two concepts? One step back, when users set the flow explicitly, are they going to tell the application that it belongs to workflow ABC, or that it belongs to job type XYZ? I think it will be the former. was (Author: zjshen): Sangjin, thanks for your comments, too. According to your and Joep's comments, I can see the benefit to show application aggregation information by application (type). However, IMHO, it's orthogonal to flow definition. Isn't the straightforward approach to provide this feature via aggregating on application name/type dimension instead of let flow name = application name. On the other side, flow should semantically stand for *workflow* (correct me if I'm wrong about flow concept), which contains a group of applications that work together to resolve a problem. Making flow name == application name changes the semantics That said, a flow of applications means the applications of the same type. {quote} If a user is running TestDFSIO over and over, they should be recognized as different instances of the same thing. {quote} I guess the same thing you had in mind is not the same workflow, but the same application type, right? How about we decoupling the two concepts? One step back, when users set the flow explicitly, are they going to tell the application that you belong to workflow abc, or that you belong to job type xyz? I think it will be the former. Clearly define flow ID/ flow run / flow version in API and storage -- Key: YARN-3391 URL: https://issues.apache.org/jira/browse/YARN-3391 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Zhijie Shen Assignee: Zhijie Shen Attachments: YARN-3391.1.patch To continue the discussion in YARN-3040, let's figure out the best way to describe the flow. Some key issues that we need to conclude on: - How do we include the flow version in the context so that it gets passed into the collector and to the storage eventually? - Flow run id should be a number as opposed to a generic string? - Default behavior for the flow run id if it is missing (i.e. client did not set it) - How do we handle flow attributes in case of nested levels of flows? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2729) Support script based NodeLabelsProvider Interface in Distributed Node Label Configuration Setup
[ https://issues.apache.org/jira/browse/YARN-2729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14392025#comment-14392025 ] Naganarasimha G R commented on YARN-2729: - Testcase failures not related to my jira Support script based NodeLabelsProvider Interface in Distributed Node Label Configuration Setup --- Key: YARN-2729 URL: https://issues.apache.org/jira/browse/YARN-2729 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Reporter: Naganarasimha G R Assignee: Naganarasimha G R Fix For: 2.8.0 Attachments: YARN-2729.20141023-1.patch, YARN-2729.20141024-1.patch, YARN-2729.20141031-1.patch, YARN-2729.20141120-1.patch, YARN-2729.20141210-1.patch, YARN-2729.20150309-1.patch, YARN-2729.20150322-1.patch, YARN-2729.20150401-1.patch, YARN-2729.20150402-1.patch Support script based NodeLabelsProvider Interface in Distributed Node Label Configuration Setup . -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3415) Non-AM containers can be counted towards amResourceUsage of a fairscheduler queue
[ https://issues.apache.org/jira/browse/YARN-3415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu updated YARN-3415: Attachment: YARN-3415.002.patch Non-AM containers can be counted towards amResourceUsage of a fairscheduler queue - Key: YARN-3415 URL: https://issues.apache.org/jira/browse/YARN-3415 Project: Hadoop YARN Issue Type: Bug Components: fairscheduler Affects Versions: 2.6.0 Reporter: Rohit Agarwal Assignee: zhihai xu Priority: Critical Attachments: YARN-3415.000.patch, YARN-3415.001.patch, YARN-3415.002.patch We encountered this problem while running a spark cluster. The amResourceUsage for a queue became artificially high and then the cluster got deadlocked because the maxAMShare constrain kicked in and no new AM got admitted to the cluster. I have described the problem in detail here: https://github.com/apache/spark/pull/5233#issuecomment-87160289 In summary - the condition for adding the container's memory towards amResourceUsage is fragile. It depends on the number of live containers belonging to the app. We saw that the spark AM went down without explicitly releasing its requested containers and then one of those containers memory was counted towards amResource. cc - [~sandyr] -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3391) Clearly define flow ID/ flow run / flow version in API and storage
[ https://issues.apache.org/jira/browse/YARN-3391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14391711#comment-14391711 ] Sangjin Lee commented on YARN-3391: --- OK, just to clarify, we're talking about a case where one flow (run) is one YARN app. The only debate is whether the repeated runs of the (essentially) same YARN app should be grouped as different runs of the same flow, or all different flows altogether. In other words, *if it ran 100 times, should we have 100 flow runs of one flow, or 100 flows each of which has exactly one flow run?* To me it seems a no brainer (thanks [~vrushalic] for reminding me) that we do want to group the runs of the same YARN app. If a user is running TestDFSIO over and over, they should be recognized as different instances of the same thing. One mitigating factor is we would modify the mapreduce code to provide the flow name/id in case it's not set. Then the default behavior won't kick in for the most part. But I think it is important enough to group them and surface them as instances of the same flow. Clearly define flow ID/ flow run / flow version in API and storage -- Key: YARN-3391 URL: https://issues.apache.org/jira/browse/YARN-3391 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Zhijie Shen Assignee: Zhijie Shen Attachments: YARN-3391.1.patch To continue the discussion in YARN-3040, let's figure out the best way to describe the flow. Some key issues that we need to conclude on: - How do we include the flow version in the context so that it gets passed into the collector and to the storage eventually? - Flow run id should be a number as opposed to a generic string? - Default behavior for the flow run id if it is missing (i.e. client did not set it) - How do we handle flow attributes in case of nested levels of flows? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3430) RMAppAttempt headroom data is missing in RM Web UI
[ https://issues.apache.org/jira/browse/YARN-3430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14391794#comment-14391794 ] Zhijie Shen commented on YARN-3430: --- I temporarily removed this commit from branch-2.7 to keep the branch compilable. It's pending whether we can pull YARN-3273 into 2.7. RMAppAttempt headroom data is missing in RM Web UI -- Key: YARN-3430 URL: https://issues.apache.org/jira/browse/YARN-3430 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager, webapp, yarn Reporter: Xuan Gong Assignee: Xuan Gong Priority: Blocker Fix For: 2.7.0 Attachments: YARN-3430.1.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2369) Environment variable handling assumes values should be appended
[ https://issues.apache.org/jira/browse/YARN-2369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dustin Cote updated YARN-2369: -- Attachment: YARN-2369-1.patch I like the second idea where the user should explicitly append to the variable. I think we can do this just by removing the code to append and just replace the entire variable every time we get an update. I'm going to try this out, but figured I'd attach the code change in case I'm missing something obvious. Environment variable handling assumes values should be appended --- Key: YARN-2369 URL: https://issues.apache.org/jira/browse/YARN-2369 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.2.0 Reporter: Jason Lowe Assignee: Dustin Cote Attachments: YARN-2369-1.patch When processing environment variables for a container context the code assumes that the value should be appended to any pre-existing value in the environment. This may be desired behavior for handling path-like environment variables such as PATH, LD_LIBRARY_PATH, CLASSPATH, etc. but it is a non-intuitive and harmful way to handle any variable that does not have path-like semantics. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
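A minimal sketch of the behavioral difference being discussed; the method and variable names are illustrative, not the NodeManager's actual code.
{code:java}
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch of the two behaviors discussed above.
public class EnvHandlingSketch {

  // Current behavior (simplified): a user-supplied value is appended to any
  // pre-existing value, which only makes sense for path-like variables.
  static void putAppending(Map<String, String> env, String key, String value) {
    String existing = env.get(key);
    env.put(key, existing == null ? value : existing + ":" + value);
  }

  // Proposed behavior: the update simply replaces the variable; users who want
  // appending must say so explicitly in the value they provide.
  static void putReplacing(Map<String, String> env, String key, String value) {
    env.put(key, value);
  }

  public static void main(String[] args) {
    Map<String, String> env = new HashMap<String, String>();
    env.put("JAVA_OPTS", "-Xmx1g");
    putAppending(env, "JAVA_OPTS", "-Xmx2g");  // "-Xmx1g:-Xmx2g" -- surprising
    env.put("JAVA_OPTS", "-Xmx1g");
    putReplacing(env, "JAVA_OPTS", "-Xmx2g");  // "-Xmx2g" -- intended
    System.out.println(env.get("JAVA_OPTS"));
  }
}
{code}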
[jira] [Commented] (YARN-3428) Debug log resources to be localized for a container
[ https://issues.apache.org/jira/browse/YARN-3428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14391186#comment-14391186 ] Hudson commented on YARN-3428: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk #2100 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/2100/]) YARN-3428. Debug log resources to be localized for a container. (kasha) (kasha: rev 2daa478a6420585dc13cea2111580ed5fe347bc1) * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/localizer/ResourceLocalizationService.java Debug log resources to be localized for a container --- Key: YARN-3428 URL: https://issues.apache.org/jira/browse/YARN-3428 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Affects Versions: 2.7.0 Reporter: Karthik Kambatla Assignee: Karthik Kambatla Priority: Trivial Fix For: 2.8.0 Attachments: yarn-3428-1.patch For each container, we log the resources going through INIT - LOCALIZING - DOWNLOADED transitions. These logs do not have container-id itself. It would be nice to add debug logs to capture the resources being localized for a container. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
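A minimal sketch of the kind of debug-guarded log statement this change adds, assuming commons-logging as used elsewhere in the NodeManager; the message wording and method are illustrative.
{code:java}
import java.util.List;
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;

// Illustrative sketch: include the container id when debug-logging the
// resources queued for localization, and guard it so production logs at INFO
// are unaffected.
public class LocalizationDebugLogSketch {
  private static final Log LOG =
      LogFactory.getLog(LocalizationDebugLogSketch.class);

  void logPendingResources(String containerId, List<String> resources) {
    if (LOG.isDebugEnabled()) {
      LOG.debug("Resources to be localized for container " + containerId
          + ": " + resources);
    }
  }
}
{code}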
[jira] [Commented] (YARN-3412) RM tests should use MockRM where possible
[ https://issues.apache.org/jira/browse/YARN-3412?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14391190#comment-14391190 ] Hudson commented on YARN-3412: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk #2100 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/2100/]) YARN-3412. RM tests should use MockRM where possible. (kasha) (kasha: rev 79f7f2aabfd7a69722748850f4d3b1ff54af7556) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestMoveApplication.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/monitor/TestSchedulingMonitor.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/TestFairScheduler.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestResourceManager.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fifo/TestFifoScheduler.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/recovery/TestZKRMStateStore.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/TestFairSchedulerEventLog.java RM tests should use MockRM where possible - Key: YARN-3412 URL: https://issues.apache.org/jira/browse/YARN-3412 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager, test Affects Versions: 2.7.0 Reporter: Karthik Kambatla Assignee: Karthik Kambatla Fix For: 2.8.0 Attachments: yarn-3412-1.patch Noticed TestZKRMStateStore and TestMoveApplication fail when running on a mac, due to not being able to start the webapp. There are a few other tests that could use MockRM. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3248) Display count of nodes blacklisted by apps in the web UI
[ https://issues.apache.org/jira/browse/YARN-3248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14391188#comment-14391188 ] Hudson commented on YARN-3248: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk #2100 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/2100/]) YARN-3248. Display count of nodes blacklisted by apps in the web UI. (xgong: rev 4728bdfa15809db4b8b235faa286c65de4a48cf6) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/SchedulerApplicationAttempt.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/RMAppAttemptBlock.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/dao/AppAttemptInfo.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/RMWebServices.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/CapacitySchedulerPage.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/RMAppsBlock.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/webapp/AppsBlock.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/webapp/AppBlock.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/AppsBlockWithMetrics.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/TestRMWebServicesApps.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/webapp/AppAttemptBlock.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/webapp/RMAppBlock.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/AppSchedulingInfo.java YARN-3248. Correct fix version from branch-2.7 to branch-2.8 in the change log. (xgong: rev 2e79f1c2125517586c165a84e99d3c4d38ca0938) * hadoop-yarn-project/CHANGES.txt Display count of nodes blacklisted by apps in the web UI Key: YARN-3248 URL: https://issues.apache.org/jira/browse/YARN-3248 Project: Hadoop YARN Issue Type: Improvement Components: capacityscheduler, resourcemanager Reporter: Varun Vasudev Assignee: Varun Vasudev Fix For: 2.8.0 Attachments: All applications.png, App page.png, Screenshot.jpg, apache-yarn-3248.0.patch, apache-yarn-3248.1.patch, apache-yarn-3248.2.patch, apache-yarn-3248.3.patch, apache-yarn-3248.4.patch It would be really useful when debugging app performance and failure issues to get a count of the nodes blacklisted by individual apps displayed in the web UI. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3301) Fix the format issue of the new RM web UI and AHS web UI
[ https://issues.apache.org/jira/browse/YARN-3301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14391012#comment-14391012 ] Junping Du commented on YARN-3301: -- Thanks [~xgong] for delivering a patch. Is the test failure here related to your patch? Fix the format issue of the new RM web UI and AHS web UI Key: YARN-3301 URL: https://issues.apache.org/jira/browse/YARN-3301 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Xuan Gong Assignee: Xuan Gong Attachments: YARN-3301.1.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3424) Change logs for ContainerMonitorImpl's resourse monitoring from info to debug
[ https://issues.apache.org/jira/browse/YARN-3424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14391189#comment-14391189 ] Hudson commented on YARN-3424: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk #2100 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/2100/]) YARN-3424. Change logs for ContainerMonitorImpl's resourse monitoring from info to debug. Contributed by Anubhav Dhoot. (ozawa: rev c69ba81497ae4da329ddb34ba712a64a7eec479f) * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/monitor/ContainersMonitorImpl.java Change logs for ContainerMonitorImpl's resourse monitoring from info to debug - Key: YARN-3424 URL: https://issues.apache.org/jira/browse/YARN-3424 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Affects Versions: 2.7.0 Reporter: Anubhav Dhoot Assignee: Anubhav Dhoot Fix For: 2.8.0 Attachments: YARN-3424.001.patch Today we log the memory usage of process at info level which spams the log with hundreds of log lines {noformat} 2015-03-27 09:32:48,905 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 9215 for container-id container_1427462602546_0002_01_08: 189.8 MB of 1 GB physical memory used; 2.6 GB of 2.1 GB virtual memory used {noformat} Proposing changing this to debug level -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3304) ResourceCalculatorProcessTree#getCpuUsagePercent default return value is inconsistent with other getters
[ https://issues.apache.org/jira/browse/YARN-3304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14391181#comment-14391181 ] Hudson commented on YARN-3304: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk #2100 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/2100/])
YARN-3304. Addendum patch. Cleaning up ResourceCalculatorProcessTree APIs for public use and removing inconsistencies in the default values. (Junping Du and Karthik Kambatla via vinodkv) (vinodkv: rev 7610925e90155dfe5edce05da31574e4fb81b948)
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/util/TestWindowsBasedProcessTree.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/util/ResourceCalculatorProcessTree.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/util/ProcfsBasedProcessTree.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/util/TestProcfsBasedProcessTree.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/util/TestResourceCalculatorProcessTree.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/util/WindowsBasedProcessTree.java
ResourceCalculatorProcessTree#getCpuUsagePercent default return value is inconsistent with other getters Key: YARN-3304 URL: https://issues.apache.org/jira/browse/YARN-3304 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Reporter: Junping Du Assignee: Junping Du Priority: Blocker Fix For: 2.7.0 Attachments: YARN-3304-appendix-v2.patch, YARN-3304-appendix-v3.patch, YARN-3304-appendix-v4.patch, YARN-3304-appendix.patch, YARN-3304-v2.patch, YARN-3304-v3.patch, YARN-3304-v4-boolean-way.patch, YARN-3304-v4-negative-way-MR.patch, YARN-3304-v4-negtive-value-way.patch, YARN-3304-v6-no-rename.patch, YARN-3304-v6-with-rename.patch, YARN-3304-v7.patch, YARN-3304-v8.patch, YARN-3304.patch, yarn-3304-5.patch Per discussions in YARN-3296, getCpuUsagePercent() returns -1 in the unavailable case while the other resource metrics return 0 in the same case, which is inconsistent. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
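The inconsistency is between getters that report 0 when a metric cannot be measured and getCpuUsagePercent(), which reports -1, so callers cannot tell "zero usage" from "unknown" in a uniform way. One common way to make this uniform is a single "unavailable" sentinel shared by every getter. The sketch below illustrates that pattern only; the class, method names, and constant here are assumptions for the example, not necessarily the API the patch settled on.

{noformat}
// Illustrative sketch of a shared "unavailable" sentinel across metric getters.
class ProcessTreeMetricsExample {
  // One sentinel for "could not be measured", used by every getter so callers
  // only ever have to check a single value.
  public static final int UNAVAILABLE = -1;

  public long getRssMemorySize() {
    return UNAVAILABLE; // platform-specific subclasses would override this
  }

  public float getCpuUsagePercent() {
    return UNAVAILABLE;
  }
}

class MetricsConsumer {
  // Caller side: check the sentinel instead of treating 0 as "unknown".
  void report(ProcessTreeMetricsExample tree) {
    float cpu = tree.getCpuUsagePercent();
    if (cpu != ProcessTreeMetricsExample.UNAVAILABLE) {
      System.out.println("CPU usage: " + cpu + "%");
    }
  }
}
{noformat}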
[jira] [Commented] (YARN-3334) [Event Producers] NM TimelineClient life cycle handling and container metrics posting to new timeline service.
[ https://issues.apache.org/jira/browse/YARN-3334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14391059#comment-14391059 ] Zhijie Shen commented on YARN-3334: --- bq. For using ContainerEntity to replace TimelineEntity, there is a bug where UnrecognizedPropertyException gets thrown when serializing/deserializing the children element while consuming it as the base class (TimelineEntity). I probably know the problem. I'll fix it separately: YARN-3431. Let's leave this issue in this jira. [Event Producers] NM TimelineClient life cycle handling and container metrics posting to new timeline service. -- Key: YARN-3334 URL: https://issues.apache.org/jira/browse/YARN-3334 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Affects Versions: YARN-2928 Reporter: Junping Du Assignee: Junping Du Attachments: YARN-3334-demo.patch, YARN-3334-v1.patch, YARN-3334-v2.patch, YARN-3334-v3.patch, YARN-3334-v4.patch, YARN-3334-v5.patch After YARN-3039, we have a service discovery mechanism to pass the app-collector service address among collectors, NMs and RM. In this JIRA, we will handle service address setting for TimelineClients in NodeManager, and put container metrics into the backend storage. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
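The failure mode quoted above is a standard Jackson one: JSON serialized from a richer subclass (here, something like ContainerEntity) carries fields the base class (TimelineEntity) does not declare, and deserializing that JSON into the base type then fails with UnrecognizedPropertyException. The demo below reproduces the pattern with plain Jackson and made-up entity names, and shows one common mitigation (tolerating unknown properties); it is not the fix that went into YARN-3431.

{noformat}
import com.fasterxml.jackson.databind.DeserializationFeature;
import com.fasterxml.jackson.databind.ObjectMapper;

// Base type the reading side expects.
class BaseEntity {
  public String id;
}

// The writing side serializes a subclass with an extra field.
class ChildEntity extends BaseEntity {
  public String children;
}

public class UnknownPropertyDemo {
  public static void main(String[] args) throws Exception {
    ChildEntity child = new ChildEntity();
    child.id = "entity-1";
    child.children = "child-entity-1";
    String json = new ObjectMapper().writeValueAsString(child);

    // Default (strict) mapper: reading the subclass JSON as the base type
    // throws UnrecognizedPropertyException because "children" is unknown.
    try {
      new ObjectMapper().readValue(json, BaseEntity.class);
    } catch (com.fasterxml.jackson.databind.exc.UnrecognizedPropertyException e) {
      System.out.println("strict read failed: " + e.getMessage());
    }

    // One common mitigation: configure the mapper to ignore unknown properties.
    ObjectMapper lenient = new ObjectMapper()
        .configure(DeserializationFeature.FAIL_ON_UNKNOWN_PROPERTIES, false);
    BaseEntity base = lenient.readValue(json, BaseEntity.class);
    System.out.println("lenient read id: " + base.id);
  }
}
{noformat}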