[jira] [Updated] (YARN-796) Allow for (admin) labels on nodes and resource-requests
[ https://issues.apache.org/jira/browse/YARN-796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-796: Attachment: YARN-796.node-label.consolidate.11.patch Attached ver.11, which fixes the javac warnings, findbugs warnings, and test failures. Allow for (admin) labels on nodes and resource-requests --- Key: YARN-796 URL: https://issues.apache.org/jira/browse/YARN-796 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.4.1 Reporter: Arun C Murthy Assignee: Wangda Tan Attachments: LabelBasedScheduling.pdf, Node-labels-Requirements-Design-doc-V1.pdf, Node-labels-Requirements-Design-doc-V2.pdf, YARN-796-Diagram.pdf, YARN-796.node-label.consolidate.1.patch, YARN-796.node-label.consolidate.10.patch, YARN-796.node-label.consolidate.11.patch, YARN-796.node-label.consolidate.2.patch, YARN-796.node-label.consolidate.3.patch, YARN-796.node-label.consolidate.4.patch, YARN-796.node-label.consolidate.5.patch, YARN-796.node-label.consolidate.6.patch, YARN-796.node-label.consolidate.7.patch, YARN-796.node-label.consolidate.8.patch, YARN-796.node-label.demo.patch.1, YARN-796.patch, YARN-796.patch4 It will be useful for admins to specify labels for nodes. Examples of labels are OS, processor architecture etc. We should expose these labels and allow applications to specify labels on resource-requests. Obviously we need to support admin operations on adding/removing node labels. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
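For illustration, a minimal sketch of what a label on a resource-request could look like from the application side. {{setNodeLabelExpression}} is the accessor this line of work exposes on ResourceRequest, but treat the exact API shape here as an assumption rather than the final patch:

{code}
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.api.records.ResourceRequest;

public class LabelRequestSketch {
  // Ask for one 1 GB / 1 vcore container, restricted to nodes carrying
  // the admin-assigned label "GPU" (the label name is just an example).
  public static ResourceRequest gpuRequest() {
    ResourceRequest req = ResourceRequest.newInstance(
        Priority.newInstance(1),       // request priority
        ResourceRequest.ANY,           // no locality constraint
        Resource.newInstance(1024, 1), // memory MB, vcores
        1);                            // number of containers
    req.setNodeLabelExpression("GPU"); // assumed API from this JIRA's design
    return req;
  }
}
{code}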
[jira] [Commented] (YARN-796) Allow for (admin) labels on nodes and resource-requests
[ https://issues.apache.org/jira/browse/YARN-796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14160141#comment-14160141 ] Hadoop QA commented on YARN-796: {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12673059/YARN-796.node-label.consolidate.11.patch against trunk revision 16333b4. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 42 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:red}-1 findbugs{color}. The patch appears to introduce 5 new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-common hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient hadoop-tools/hadoop-sls hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.mapred.pipes.TestPipeApplication org.apache.hadoop.yarn.api.TestPBImplRecords org.apache.hadoop.yarn.nodelabels.TestFileSystemNodeLabelsStore org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesNodeLabels org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestReservationQueue org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.TestProportionalCapacityPreemptionPolicy org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestReservations {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5272//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/5272//artifact/patchprocess/newPatchFindbugsWarningshadoop-yarn-common.html Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5272//console This message is automatically generated. 
Allow for (admin) labels on nodes and resource-requests --- Key: YARN-796 URL: https://issues.apache.org/jira/browse/YARN-796 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.4.1 Reporter: Arun C Murthy Assignee: Wangda Tan Attachments: LabelBasedScheduling.pdf, Node-labels-Requirements-Design-doc-V1.pdf, Node-labels-Requirements-Design-doc-V2.pdf, YARN-796-Diagram.pdf, YARN-796.node-label.consolidate.1.patch, YARN-796.node-label.consolidate.10.patch, YARN-796.node-label.consolidate.11.patch, YARN-796.node-label.consolidate.2.patch, YARN-796.node-label.consolidate.3.patch, YARN-796.node-label.consolidate.4.patch, YARN-796.node-label.consolidate.5.patch, YARN-796.node-label.consolidate.6.patch, YARN-796.node-label.consolidate.7.patch, YARN-796.node-label.consolidate.8.patch, YARN-796.node-label.demo.patch.1, YARN-796.patch, YARN-796.patch4 It will be useful for admins to specify labels for nodes. Examples of labels are OS, processor architecture etc. We should expose these labels and allow applications to specify labels on resource-requests. Obviously we need to support admin operations on adding/removing node labels. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2625) Problems with CLASSPATH in Job Submission REST API
[ https://issues.apache.org/jira/browse/YARN-2625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14160170#comment-14160170 ] Rohith commented on YARN-2625: -- While submitting an application from the REST API, the CLASSPATH is expected to be added by the application client; ContainerLaunchContextInfo#setEnvironment() is provided so the client can add classpath variables. I think adding default values at the server side may not be feasible. bq. 2) If any environment variables are used in the CLASSPATH environment 'value' field, they are evaluated when the values are NULL resulting in bad values in the CLASSPATH This is because the client JVM does not resolve the $HADOOP_COMMON_HOME environment variable. The client is expected to provide the full path when submitting. Problems with CLASSPATH in Job Submission REST API -- Key: YARN-2625 URL: https://issues.apache.org/jira/browse/YARN-2625 Project: Hadoop YARN Issue Type: Bug Components: api Affects Versions: 2.5.1 Reporter: Doug Haigh There are a couple of issues I have found specifying the CLASSPATH environment variable using the REST API. 1) In the Java client, the CLASSPATH environment is usually made up of either the value of yarn.application.classpath in yarn-site.xml or the default YARN classpath value as defined by YarnConfiguration.DEFAULT_YARN_CROSS_PLATFORM_APPLICATION_CLASSPATH. REST API consumers have no method of telling the resource manager to use the default unless they hardcode the default value themselves. If the default ever changes, the code would need to change. 2) If any environment variables are used in the CLASSPATH environment 'value' field, they are evaluated when the values are NULL, resulting in bad values in the CLASSPATH. For example, if I had hardcoded the CLASSPATH value to the default of $HADOOP_CONF_DIR, $HADOOP_COMMON_HOME/share/hadoop/common/*, $HADOOP_COMMON_HOME/share/hadoop/common/lib/*, $HADOOP_HDFS_HOME/share/hadoop/hdfs/*, $HADOOP_HDFS_HOME/share/hadoop/hdfs/lib/*, $HADOOP_YARN_HOME/share/hadoop/yarn/*, $HADOOP_YARN_HOME/share/hadoop/yarn/lib/* the classpath passed to the application master is :/share/hadoop/common/*:/share/hadoop/common/lib/*:/share/hadoop/hdfs/*:/share/hadoop/hdfs/lib/*:/share/hadoop/yarn/*:/share/hadoop/yarn/lib/* These two problems require REST API consumers to always have the fully resolved path defined in the yarn.application.classpath value. If the property is missing or contains environment variables, the application created by the REST API will fail due to the CLASSPATH being incorrect. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
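To make the client-side expectation concrete, here is a minimal sketch (in the style of the Java clients such as MRApps or distributed shell) of how an application client assembles the AM CLASSPATH from {{yarn.application.classpath}}, falling back to the compiled-in default. This is the logic a REST consumer currently has to replicate by hand:

{code}
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.ApplicationConstants;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class AmClasspathSketch {
  public static Map<String, String> buildAmEnvironment(Configuration conf) {
    StringBuilder classpath = new StringBuilder(
        ApplicationConstants.Environment.PWD.$$()
            + ApplicationConstants.CLASS_PATH_SEPARATOR + "*");
    // Fall back to the compiled-in default when yarn.application.classpath
    // is unset; the REST API offers no equivalent fallback (issue 1 above).
    for (String entry : conf.getTrimmedStrings(
        YarnConfiguration.YARN_APPLICATION_CLASSPATH,
        YarnConfiguration.DEFAULT_YARN_CROSS_PLATFORM_APPLICATION_CLASSPATH)) {
      classpath.append(ApplicationConstants.CLASS_PATH_SEPARATOR)
               .append(entry.trim());
    }
    Map<String, String> env = new HashMap<String, String>();
    env.put(ApplicationConstants.Environment.CLASSPATH.name(),
        classpath.toString());
    return env;
  }
}
{code}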
[jira] [Commented] (YARN-2625) Problems with CLASSPATH in Job Submission REST API
[ https://issues.apache.org/jira/browse/YARN-2625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14160255#comment-14160255 ] Doug Haigh commented on YARN-2625: -- Yes, I agree the classpath should be set by the client, but the REST client should not have to know the default classpath just as the Java client does not need to know it. Just as the REST API resolves {{CPS}} to either {{:}} or {{;}} based on the underlying operating system, the REST API could look for {{DEFAULT_YARN_CROSS_PLATFORM_APPLICATION_CLASSPATH}} and resolve it to {{YarnConfiguration.DEFAULT_YARN_CROSS_PLATFORM_APPLICATION_CLASSPATH}}. As for environment variables being resolved, when running a Java client against a CDH 5.0.0 cluster, I am able to set the environment to {{./*:$HADOOP_CLIENT_CONF_DIR:$HADOOP_CONF_DIR:$HADOOP_COMMON_HOME/*:$HADOOP_COMMON_HOME/lib/*:$HADOOP_HDFS_HOME/*:$HADOOP_HDFS_HOME/lib/*:$HADOOP_YARN_HOME/*:$HADOOP_YARN_HOME/lib/*}} and it works - the environment variables are resolved. Maybe it is the way CDH is setting things up, but the path is not fully resolved when the client specifies it. Problems with CLASSPATH in Job Submission REST API -- Key: YARN-2625 URL: https://issues.apache.org/jira/browse/YARN-2625 Project: Hadoop YARN Issue Type: Bug Components: api Affects Versions: 2.5.1 Reporter: Doug Haigh There are a couple of issues I have found specifying the CLASSPATH environment variable using the REST API. 1) In the Java client, the CLASSPATH environment is usually made up of either the value of yarn.application.classpath in yarn-site.xml or the default YARN classpath value as defined by YarnConfiguration.DEFAULT_YARN_CROSS_PLATFORM_APPLICATION_CLASSPATH. REST API consumers have no method of telling the resource manager to use the default unless they hardcode the default value themselves. If the default ever changes, the code would need to change. 2) If any environment variables are used in the CLASSPATH environment 'value' field, they are evaluated when the values are NULL, resulting in bad values in the CLASSPATH. For example, if I had hardcoded the CLASSPATH value to the default of $HADOOP_CONF_DIR, $HADOOP_COMMON_HOME/share/hadoop/common/*, $HADOOP_COMMON_HOME/share/hadoop/common/lib/*, $HADOOP_HDFS_HOME/share/hadoop/hdfs/*, $HADOOP_HDFS_HOME/share/hadoop/hdfs/lib/*, $HADOOP_YARN_HOME/share/hadoop/yarn/*, $HADOOP_YARN_HOME/share/hadoop/yarn/lib/* the classpath passed to the application master is :/share/hadoop/common/*:/share/hadoop/common/lib/*:/share/hadoop/hdfs/*:/share/hadoop/hdfs/lib/*:/share/hadoop/yarn/*:/share/hadoop/yarn/lib/* These two problems require REST API consumers to always have the fully resolved path defined in the yarn.application.classpath value. If the property is missing or contains environment variables, the application created by the REST API will fail due to the CLASSPATH being incorrect. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
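A sketch of the substitution Doug is proposing, by analogy with the existing {{<CPS>}} handling; the placeholder token and the method below are hypothetical, not an existing YARN API:

{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.ApplicationConstants;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class ClasspathTokenSketch {
  // Hypothetical placeholder the RM would expand server side, the way it
  // already expands <CPS> into ':' or ';'.
  static final String DEFAULT_TOKEN = "<DEFAULT_YARN_APPLICATION_CLASSPATH>";

  public static String expand(String rawValue, Configuration conf) {
    StringBuilder defaults = new StringBuilder();
    for (String entry : conf.getTrimmedStrings(
        YarnConfiguration.YARN_APPLICATION_CLASSPATH,
        YarnConfiguration.DEFAULT_YARN_CROSS_PLATFORM_APPLICATION_CLASSPATH)) {
      if (defaults.length() > 0) {
        // Join with <CPS> so the separator is resolved later, per platform.
        defaults.append(ApplicationConstants.CLASS_PATH_SEPARATOR);
      }
      defaults.append(entry);
    }
    return rawValue.replace(DEFAULT_TOKEN, defaults.toString());
  }
}
{code}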
[jira] [Updated] (YARN-1879) Mark Idempotent/AtMostOnce annotations to ApplicationMasterProtocol
[ https://issues.apache.org/jira/browse/YARN-1879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi OZAWA updated YARN-1879: - Attachment: YARN-1879.22.patch Made registerApplicationMaster Idempotent and marked registerApplicationMaster/finishApplicationMaster Idempotent. Mark Idempotent/AtMostOnce annotations to ApplicationMasterProtocol --- Key: YARN-1879 URL: https://issues.apache.org/jira/browse/YARN-1879 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Jian He Assignee: Tsuyoshi OZAWA Priority: Critical Attachments: YARN-1879.1.patch, YARN-1879.1.patch, YARN-1879.11.patch, YARN-1879.12.patch, YARN-1879.13.patch, YARN-1879.14.patch, YARN-1879.15.patch, YARN-1879.16.patch, YARN-1879.17.patch, YARN-1879.18.patch, YARN-1879.19.patch, YARN-1879.2-wip.patch, YARN-1879.2.patch, YARN-1879.20.patch, YARN-1879.21.patch, YARN-1879.22.patch, YARN-1879.3.patch, YARN-1879.4.patch, YARN-1879.5.patch, YARN-1879.6.patch, YARN-1879.7.patch, YARN-1879.8.patch, YARN-1879.9.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
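For readers unfamiliar with these annotations, a schematic sketch of what the patch describes. The real interface is ApplicationMasterProtocol; the signatures below mirror it, but verify details against the patch itself:

{code}
import java.io.IOException;
import org.apache.hadoop.io.retry.Idempotent;
import org.apache.hadoop.yarn.api.protocolrecords.FinishApplicationMasterRequest;
import org.apache.hadoop.yarn.api.protocolrecords.FinishApplicationMasterResponse;
import org.apache.hadoop.yarn.api.protocolrecords.RegisterApplicationMasterRequest;
import org.apache.hadoop.yarn.api.protocolrecords.RegisterApplicationMasterResponse;
import org.apache.hadoop.yarn.exceptions.YarnException;

// Sketch: @Idempotent tells the RPC retry machinery the call may be safely
// re-sent (e.g., after an RM failover) without corrupting state.
public interface ApplicationMasterProtocolSketch {
  @Idempotent
  RegisterApplicationMasterResponse registerApplicationMaster(
      RegisterApplicationMasterRequest request) throws YarnException, IOException;

  @Idempotent
  FinishApplicationMasterResponse finishApplicationMaster(
      FinishApplicationMasterRequest request) throws YarnException, IOException;
}
{code}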
[jira] [Commented] (YARN-1879) Mark Idempotent/AtMostOnce annotations to ApplicationMasterProtocol
[ https://issues.apache.org/jira/browse/YARN-1879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14160419#comment-14160419 ] Hadoop QA commented on YARN-1879: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12673093/YARN-1879.22.patch against trunk revision ed841dd. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 5 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5273//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5273//console This message is automatically generated. Mark Idempotent/AtMostOnce annotations to ApplicationMasterProtocol --- Key: YARN-1879 URL: https://issues.apache.org/jira/browse/YARN-1879 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Jian He Assignee: Tsuyoshi OZAWA Priority: Critical Attachments: YARN-1879.1.patch, YARN-1879.1.patch, YARN-1879.11.patch, YARN-1879.12.patch, YARN-1879.13.patch, YARN-1879.14.patch, YARN-1879.15.patch, YARN-1879.16.patch, YARN-1879.17.patch, YARN-1879.18.patch, YARN-1879.19.patch, YARN-1879.2-wip.patch, YARN-1879.2.patch, YARN-1879.20.patch, YARN-1879.21.patch, YARN-1879.22.patch, YARN-1879.3.patch, YARN-1879.4.patch, YARN-1879.5.patch, YARN-1879.6.patch, YARN-1879.7.patch, YARN-1879.8.patch, YARN-1879.9.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2625) Problems with CLASSPATH in Job Submission REST API
[ https://issues.apache.org/jira/browse/YARN-2625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14160455#comment-14160455 ] Doug Haigh commented on YARN-2625: -- To be honest, I never knew about that option - but that does not get me anything more than what I can read from the yarn-site.xml file (although I like not having to have those files around). It still has the two problems described above because 1) If the {{yarn.application.classpath}} value is not specified, I still have no way to know the default classpath 2) If the {{yarn.application.classpath}} value has environment variables in it, they still need to be resolved somehow. If the value returned by that URL was the *resolved* classpath, that would work. Problems with CLASSPATH in Job Submission REST API -- Key: YARN-2625 URL: https://issues.apache.org/jira/browse/YARN-2625 Project: Hadoop YARN Issue Type: Bug Components: api Affects Versions: 2.5.1 Reporter: Doug Haigh There are a couple of issues I have found specifying the CLASSPATH environment variable using the REST API. 1) In the Java client, the CLASSPATH environment is usually made up of either the value of yarn.application.classpath in yarn-site.xml or the default YARN classpath value as defined by YarnConfiguration.DEFAULT_YARN_CROSS_PLATFORM_APPLICATION_CLASSPATH. REST API consumers have no method of telling the resource manager to use the default unless they hardcode the default value themselves. If the default ever changes, the code would need to change. 2) If any environment variables are used in the CLASSPATH environment 'value' field, they are evaluated when the values are NULL, resulting in bad values in the CLASSPATH. For example, if I had hardcoded the CLASSPATH value to the default of $HADOOP_CONF_DIR, $HADOOP_COMMON_HOME/share/hadoop/common/*, $HADOOP_COMMON_HOME/share/hadoop/common/lib/*, $HADOOP_HDFS_HOME/share/hadoop/hdfs/*, $HADOOP_HDFS_HOME/share/hadoop/hdfs/lib/*, $HADOOP_YARN_HOME/share/hadoop/yarn/*, $HADOOP_YARN_HOME/share/hadoop/yarn/lib/* the classpath passed to the application master is :/share/hadoop/common/*:/share/hadoop/common/lib/*:/share/hadoop/hdfs/*:/share/hadoop/hdfs/lib/*:/share/hadoop/yarn/*:/share/hadoop/yarn/lib/* These two problems require REST API consumers to always have the fully resolved path defined in the yarn.application.classpath value. If the property is missing or contains environment variables, the application created by the REST API will fail due to the CLASSPATH being incorrect. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2312) Marking ContainerId#getId as deprecated
[ https://issues.apache.org/jira/browse/YARN-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14160481#comment-14160481 ] Jason Lowe commented on YARN-2312: -- Patch looks good overall. Just two minor nits: The {{containerId.getContainerId() & 0xffffffffffL}} construct shows up a number of times. Wondering if there should be a utility method on ContainerId to provide this value or if the masking constant should be obtainable from ContainerId. TestTaskAttemptListenerImpl.java has no changes related to this JIRA, including wrapping of lines that are already less than 80 columns. Not a must-fix; it just seemed like unnecessary extra changes. [~jianhe] do you have any additional comments? Marking ContainerId#getId as deprecated --- Key: YARN-2312 URL: https://issues.apache.org/jira/browse/YARN-2312 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Tsuyoshi OZAWA Assignee: Tsuyoshi OZAWA Attachments: YARN-2312-wip.patch, YARN-2312.1.patch, YARN-2312.2-2.patch, YARN-2312.2-3.patch, YARN-2312.2.patch, YARN-2312.4.patch, YARN-2312.5.patch {{ContainerId#getId}} will only return partial value of containerId, only sequence number of container id without epoch, after YARN-2229. We should mark {{ContainerId#getId}} as deprecated and use {{ContainerId#getContainerId}} instead. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
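A sketch of the utility being suggested. Class and method names here are hypothetical, and the 40-bit mask is an assumption based on the post-YARN-2229 container-id layout (low bits carry the sequence number, high bits the epoch); the final patch may expose this differently:

{code}
// Hypothetical helper so callers stop hand-rolling the mask expression.
public final class ContainerIdMaskSketch {
  // Assumed layout after YARN-2229: the low 40 bits of the 64-bit container
  // id carry the per-application sequence number, the high bits the RM epoch.
  public static final long CONTAINER_ID_BITMASK = 0xffffffffffL; // assumption

  private ContainerIdMaskSketch() {}

  public static long sequenceNumberOf(long fullContainerId) {
    return fullContainerId & CONTAINER_ID_BITMASK;
  }
}
{code}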
[jira] [Comment Edited] (YARN-2646) distributed shell tests to use registry
[ https://issues.apache.org/jira/browse/YARN-2646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14159835#comment-14159835 ] Karthik Kambatla edited comment on YARN-2646 at 10/6/14 4:48 PM: - This is the part of the YARN-913 patch which modified distributed shell to (optionally) register itself, and tests to verify that it does this, and that app-attempt-id persistent entries are purged when the app finishes was (Author: ste...@apache.org): This is the part of the YARn-913 patch which modified distributed shell to (optionally) register itself, and tests to verify that it does this, and that app-attempt-id persistent entries are purged when the app finishes distributed shell tests to use registry - Key: YARN-2646 URL: https://issues.apache.org/jira/browse/YARN-2646 Project: Hadoop YARN Issue Type: Sub-task Components: api, resourcemanager Reporter: Steve Loughran Assignee: Steve Loughran Attachments: YARN-2646-001.patch for testing and for an example, the Distributed Shell should create a record for itself in the service registry ... the tests can look for this. This will act as a test for the RM integration. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2644) Recalculate headroom more frequently to keep it accurate
[ https://issues.apache.org/jira/browse/YARN-2644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14160492#comment-14160492 ] Hadoop QA commented on YARN-2644: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12672890/YARN-2644.14.patch against trunk revision ed841dd. {color:red}-1 patch{color}. The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5274//console This message is automatically generated. Recalculate headroom more frequently to keep it accurate Key: YARN-2644 URL: https://issues.apache.org/jira/browse/YARN-2644 Project: Hadoop YARN Issue Type: Sub-task Reporter: Craig Welch Assignee: Craig Welch Attachments: YARN-2644.11.patch, YARN-2644.14.patch See parent (1198) for more detail - this specifically covers calculating the headroom more frequently, to cover the cases where changes have occurred which impact headroom but which are not reflected due to an application not being updated. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
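Schematically, headroom is a minimum over per-user and per-queue limits. The formula below is a deliberate simplification of the CapacityScheduler's computation, shown only to illustrate why a value cached per application goes stale as other apps and nodes change:

{code}
// Simplified illustration only, not the CapacityScheduler's exact formula.
public final class HeadroomSketch {
  public static int headroomMb(int userLimitMb, int userUsedMb,
                               int queueMaxMb, int queueUsedMb) {
    int userHeadroom = Math.max(0, userLimitMb - userUsedMb);
    int queueHeadroom = Math.max(0, queueMaxMb - queueUsedMb);
    // Recomputing this at allocation time (rather than caching it per app)
    // keeps it accurate as other users/apps consume or release resources.
    return Math.min(userHeadroom, queueHeadroom);
  }
}
{code}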
[jira] [Updated] (YARN-796) Allow for (admin) labels on nodes and resource-requests
[ https://issues.apache.org/jira/browse/YARN-796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-796: Attachment: YARN-796.node-label.consolidate.12.patch Allow for (admin) labels on nodes and resource-requests --- Key: YARN-796 URL: https://issues.apache.org/jira/browse/YARN-796 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.4.1 Reporter: Arun C Murthy Assignee: Wangda Tan Attachments: LabelBasedScheduling.pdf, Node-labels-Requirements-Design-doc-V1.pdf, Node-labels-Requirements-Design-doc-V2.pdf, YARN-796-Diagram.pdf, YARN-796.node-label.consolidate.1.patch, YARN-796.node-label.consolidate.10.patch, YARN-796.node-label.consolidate.11.patch, YARN-796.node-label.consolidate.2.patch, YARN-796.node-label.consolidate.3.patch, YARN-796.node-label.consolidate.4.patch, YARN-796.node-label.consolidate.5.patch, YARN-796.node-label.consolidate.6.patch, YARN-796.node-label.consolidate.7.patch, YARN-796.node-label.consolidate.8.patch, YARN-796.node-label.demo.patch.1, YARN-796.patch, YARN-796.patch4 It will be useful for admins to specify labels for nodes. Examples of labels are OS, processor architecture etc. We should expose these labels and allow applications to specify labels on resource-requests. Obviously we need to support admin operations on adding/removing node labels. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-796) Allow for (admin) labels on nodes and resource-requests
[ https://issues.apache.org/jira/browse/YARN-796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-796: Attachment: (was: YARN-796.node-label.consolidate.12.patch) Allow for (admin) labels on nodes and resource-requests --- Key: YARN-796 URL: https://issues.apache.org/jira/browse/YARN-796 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.4.1 Reporter: Arun C Murthy Assignee: Wangda Tan Attachments: LabelBasedScheduling.pdf, Node-labels-Requirements-Design-doc-V1.pdf, Node-labels-Requirements-Design-doc-V2.pdf, YARN-796-Diagram.pdf, YARN-796.node-label.consolidate.1.patch, YARN-796.node-label.consolidate.10.patch, YARN-796.node-label.consolidate.11.patch, YARN-796.node-label.consolidate.2.patch, YARN-796.node-label.consolidate.3.patch, YARN-796.node-label.consolidate.4.patch, YARN-796.node-label.consolidate.5.patch, YARN-796.node-label.consolidate.6.patch, YARN-796.node-label.consolidate.7.patch, YARN-796.node-label.consolidate.8.patch, YARN-796.node-label.demo.patch.1, YARN-796.patch, YARN-796.patch4 It will be useful for admins to specify labels for nodes. Examples of labels are OS, processor architecture etc. We should expose these labels and allow applications to specify labels on resource-requests. Obviously we need to support admin operations on adding/removing node labels. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-1051) YARN Admission Control/Planner: enhancing the resource allocation model with time.
[ https://issues.apache.org/jira/browse/YARN-1051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris Douglas updated YARN-1051: Fix Version/s: (was: 3.0.0) 2.6.0 YARN Admission Control/Planner: enhancing the resource allocation model with time. -- Key: YARN-1051 URL: https://issues.apache.org/jira/browse/YARN-1051 Project: Hadoop YARN Issue Type: Improvement Components: capacityscheduler, resourcemanager, scheduler Reporter: Carlo Curino Assignee: Carlo Curino Fix For: 2.6.0 Attachments: YARN-1051-design.pdf, YARN-1051.1.patch, YARN-1051.patch, curino_MSR-TR-2013-108.pdf, socc14-paper15.pdf, techreport.pdf In this umbrella JIRA we propose to extend the YARN RM to handle time explicitly, allowing users to reserve capacity over time. This is an important step towards SLAs, long-running services, and workflows, and it helps with gang scheduling. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1051) YARN Admission Control/Planner: enhancing the resource allocation model with time.
[ https://issues.apache.org/jira/browse/YARN-1051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14160565#comment-14160565 ] Hudson commented on YARN-1051: -- FAILURE: Integrated in Hadoop-trunk-Commit #6197 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/6197/]) Move YARN-1051 to 2.6 (cdouglas: rev 8380ca37237a21638e1bcad0dd0e4c7e9f1a1786) * hadoop-yarn-project/CHANGES.txt YARN Admission Control/Planner: enhancing the resource allocation model with time. -- Key: YARN-1051 URL: https://issues.apache.org/jira/browse/YARN-1051 Project: Hadoop YARN Issue Type: Improvement Components: capacityscheduler, resourcemanager, scheduler Reporter: Carlo Curino Assignee: Carlo Curino Fix For: 2.6.0 Attachments: YARN-1051-design.pdf, YARN-1051.1.patch, YARN-1051.patch, curino_MSR-TR-2013-108.pdf, socc14-paper15.pdf, techreport.pdf In this umbrella JIRA we propose to extend the YARN RM to handle time explicitly, allowing users to reserve capacity over time. This is an important step towards SLAs, long-running services, and workflows, and it helps with gang scheduling. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2615) ClientToAMTokenIdentifier and DelegationTokenIdentifier should allow extended fields
[ https://issues.apache.org/jira/browse/YARN-2615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14160576#comment-14160576 ] Hudson commented on YARN-2615: -- FAILURE: Integrated in Hadoop-trunk-Commit #6198 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/6198/]) YARN-2615. Changed ClientToAMTokenIdentifier/RM(Timeline)DelegationTokenIdentifier to use protobuf as payload. Contributed by Junping Du (jianhe: rev ea26cc0b4ac02b8af686dfda80f540e5aa70c358) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/security/AMRMTokenIdentifier.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/proto/test_client_tokens.proto * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/security/client/YARNDelegationTokenIdentifier.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/recovery/FileSystemRMStateStore.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestClientRMTokens.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/RMDelegationTokenIdentifierForTest.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/security/ClientToAMTokenIdentifierForTest.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/recovery/ZKRMStateStore.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/security/client/RMDelegationTokenIdentifier.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/security/client/TimelineDelegationTokenIdentifier.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/security/client/ClientToAMTokenIdentifier.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/security/ContainerTokenIdentifier.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/security/NMTokenIdentifier.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/security/TestClientToAMTokens.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/security/TestYARNTokenIdentifier.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/proto/server/yarn_security_token.proto * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/pom.xml ClientToAMTokenIdentifier and DelegationTokenIdentifier should allow extended fields Key: YARN-2615 URL: https://issues.apache.org/jira/browse/YARN-2615 Project: Hadoop YARN Issue Type: Sub-task Reporter: Junping Du Assignee: Junping Du Priority: Blocker Attachments: YARN-2615-v2.patch, YARN-2615-v3.patch, YARN-2615-v4.patch, YARN-2615-v5.patch, YARN-2615.patch As three TokenIdentifiers get updated in YARN-668, ClientToAMTokenIdentifier and DelegationTokenIdentifier should also be updated in the same way to allow fields 
to get extended in the future. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2544) [YARN-796] Common server side PB changes (not include user API PB changes)
[ https://issues.apache.org/jira/browse/YARN-2544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-2544: - Attachment: YARN-2544.patch [YARN-796] Common server side PB changes (not include user API PB changes) -- Key: YARN-2544 URL: https://issues.apache.org/jira/browse/YARN-2544 Project: Hadoop YARN Issue Type: Sub-task Components: api, client, resourcemanager Reporter: Wangda Tan Assignee: Wangda Tan Attachments: YARN-2544.patch, YARN-2544.patch, YARN-2544.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2494) [YARN-796] Node label manager API and storage implementations
[ https://issues.apache.org/jira/browse/YARN-2494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-2494: - Attachment: YARN-2494.patch [YARN-796] Node label manager API and storage implementations - Key: YARN-2494 URL: https://issues.apache.org/jira/browse/YARN-2494 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Wangda Tan Assignee: Wangda Tan Attachments: YARN-2494.patch, YARN-2494.patch, YARN-2494.patch, YARN-2494.patch, YARN-2494.patch, YARN-2494.patch, YARN-2494.patch, YARN-2494.patch This JIRA includes APIs and storage implementations of the node label manager. NodeLabelManager is an abstract class used to manage labels of nodes in the cluster; it has APIs to query/modify: - Nodes with a given label - Labels of a given hostname - Add/remove labels - Set labels of nodes in the cluster - Persist/recover changes of labels/labels-on-nodes to/from storage It has two implementations to store modifications: - Memory-based storage: does not persist changes, so all labels will be lost when the RM restarts - FileSystem-based storage: persists/recovers to/from a FileSystem (like HDFS), so all labels and labels-on-nodes will be recovered upon RM restart -- This message was sent by Atlassian JIRA (v6.3.4#6332)
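Condensed into a sketch, the API surface described above could look as follows; the method names are paraphrases of the description, not the patch's actual signatures:

{code}
import java.io.IOException;
import java.util.Set;

// Sketch of the abstract manager described in this JIRA (names assumed).
public abstract class NodeLabelManagerSketch {
  public abstract Set<String> getNodesWithLabel(String label);  // nodes for a label
  public abstract Set<String> getLabelsOnNode(String host);     // labels for a host
  public abstract void addLabels(Set<String> labels) throws IOException;
  public abstract void removeLabels(Set<String> labels) throws IOException;
  public abstract void setLabelsOnNode(String host, Set<String> labels)
      throws IOException;
  // Memory-backed impl: no persistence, labels lost on RM restart.
  // FileSystem-backed impl: persists edits to e.g. HDFS, replays on recovery.
}
{code}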
[jira] [Created] (YARN-2647) [YARN-796] Add yarn queue CLI to get queue info including labels of such queue
Wangda Tan created YARN-2647: Summary: [YARN-796] Add yarn queue CLI to get queue info including labels of such queue Key: YARN-2647 URL: https://issues.apache.org/jira/browse/YARN-2647 Project: Hadoop YARN Issue Type: Sub-task Components: client Reporter: Wangda Tan -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2544) [YARN-796] Common server side PB changes (not include user API PB changes)
[ https://issues.apache.org/jira/browse/YARN-2544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14160603#comment-14160603 ] Wangda Tan commented on YARN-2544: -- bq. We will also need a GetLabelsOfQueueRequest? I don't think we need this. Instead, we need to add a yarn queue CLI to get all the related info of a queue. Opened YARN-2647 to track that change. [YARN-796] Common server side PB changes (not include user API PB changes) -- Key: YARN-2544 URL: https://issues.apache.org/jira/browse/YARN-2544 Project: Hadoop YARN Issue Type: Sub-task Components: api, client, resourcemanager Reporter: Wangda Tan Assignee: Wangda Tan Attachments: YARN-2544.patch, YARN-2544.patch, YARN-2544.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1051) YARN Admission Control/Planner: enhancing the resource allocation model with time.
[ https://issues.apache.org/jira/browse/YARN-1051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14160613#comment-14160613 ] Subru Krishnan commented on YARN-1051: -- Thanks [~chris.douglas] for shepherding us all the way through. Thanks to all others (you know who you are :)) who took the time to review and whose insightful feedback helped us get this into a much better shape. YARN Admission Control/Planner: enhancing the resource allocation model with time. -- Key: YARN-1051 URL: https://issues.apache.org/jira/browse/YARN-1051 Project: Hadoop YARN Issue Type: Improvement Components: capacityscheduler, resourcemanager, scheduler Reporter: Carlo Curino Assignee: Carlo Curino Fix For: 2.6.0 Attachments: YARN-1051-design.pdf, YARN-1051.1.patch, YARN-1051.patch, curino_MSR-TR-2013-108.pdf, socc14-paper15.pdf, techreport.pdf In this umbrella JIRA we propose to extend the YARN RM to handle time explicitly, allowing users to reserve capacity over time. This is an important step towards SLAs, long-running services, and workflows, and it helps with gang scheduling. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-1707) Making the CapacityScheduler more dynamic
[ https://issues.apache.org/jira/browse/YARN-1707?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Subru Krishnan updated YARN-1707: - Fix Version/s: 2.6.0 Making the CapacityScheduler more dynamic - Key: YARN-1707 URL: https://issues.apache.org/jira/browse/YARN-1707 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler Reporter: Carlo Curino Assignee: Carlo Curino Labels: capacity-scheduler Fix For: 2.6.0 Attachments: YARN-1707.10.patch, YARN-1707.2.patch, YARN-1707.3.patch, YARN-1707.4.patch, YARN-1707.5.patch, YARN-1707.6.patch, YARN-1707.7.patch, YARN-1707.8.patch, YARN-1707.9.patch, YARN-1707.patch The CapacityScheduler is rather static at the moment, and refreshqueue provides a rather heavy-handed way to reconfigure it. Moving towards long-running services (tracked in YARN-896) and to enable more advanced admission control and resource parcelling, we need to make the CapacityScheduler more dynamic. This is instrumental to the umbrella JIRA YARN-1051. Concretely this requires the following changes: * create queues dynamically * destroy queues dynamically * dynamically change queue parameters (e.g., capacity) * modify refreshqueue validation to enforce sum(child.getCapacity()) <= 100% instead of == 100% We limit this to LeafQueues. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
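The last bullet is the key enabler for dynamically created queues: validation only needs children not to oversubscribe the parent. A minimal sketch of the relaxed check, with illustrative names:

{code}
import java.util.List;

public final class CapacityValidationSketch {
  // Relaxed rule from this JIRA: sum(child.getCapacity()) <= 100% (was == 100%),
  // leaving slack that dynamically created queues can later claim.
  public static void validate(List<Float> childCapacities) {
    float sum = 0f;
    for (float c : childCapacities) {
      sum += c;
    }
    if (sum > 100.0f + 1e-3f) { // epsilon absorbs float rounding
      throw new IllegalArgumentException(
          "child capacities sum to " + sum + "%, which exceeds 100%");
    }
  }
}
{code}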
[jira] [Updated] (YARN-2496) [YARN-796] Changes for capacity scheduler to support allocate resource respect labels
[ https://issues.apache.org/jira/browse/YARN-2496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-2496: - Attachment: YARN-2496.patch Attached an updated patch. [YARN-796] Changes for capacity scheduler to support allocate resource respect labels - Key: YARN-2496 URL: https://issues.apache.org/jira/browse/YARN-2496 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Wangda Tan Assignee: Wangda Tan Attachments: YARN-2496.patch, YARN-2496.patch, YARN-2496.patch, YARN-2496.patch, YARN-2496.patch This JIRA includes: - Add/parse a labels option in {{capacity-scheduler.xml}}, similar to other queue options like capacity/maximum-capacity, etc. - Include a default-label-expression option in the queue config; if an app doesn't specify a label-expression, the queue's default-label-expression will be used. - Check whether labels can be accessed by the queue when an app is submitted to the queue with a label-expression, or when a ResourceRequest is updated with a label-expression - Check labels on the NM when trying to allocate a ResourceRequest with a label-expression on that NM - Respect labels when calculating headroom/user-limit -- This message was sent by Atlassian JIRA (v6.3.4#6332)
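A sketch of the queue access check from the third bullet above; the "&&"-separated expression syntax and all names here are assumptions drawn from the design discussion, not necessarily the committed patch:

{code}
import java.util.Set;

public final class QueueLabelAccessSketch {
  // A queue can satisfy a label expression only if every label in the
  // expression is in the queue's accessible-labels set.
  public static boolean canAccess(Set<String> queueAccessibleLabels,
                                  String labelExpression) {
    if (labelExpression == null || labelExpression.trim().isEmpty()) {
      return true; // no labels requested: any queue can run it
    }
    for (String label : labelExpression.split("&&")) { // syntax assumed
      if (!queueAccessibleLabels.contains(label.trim())) {
        return false;
      }
    }
    return true;
  }
}
{code}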
[jira] [Updated] (YARN-2566) IOException happen in startLocalizer of DefaultContainerExecutor due to not enough disk space for the first localDir.
[ https://issues.apache.org/jira/browse/YARN-2566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu updated YARN-2566: Attachment: (was: YARN-2566.003.patch) IOException happen in startLocalizer of DefaultContainerExecutor due to not enough disk space for the first localDir. - Key: YARN-2566 URL: https://issues.apache.org/jira/browse/YARN-2566 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.5.0 Reporter: zhihai xu Assignee: zhihai xu Attachments: YARN-2566.000.patch, YARN-2566.001.patch, YARN-2566.002.patch, YARN-2566.003.patch startLocalizer in DefaultContainerExecutor will only use the first localDir to copy the token file; if the copy fails for the first localDir due to not enough disk space, the localization will fail even if there is plenty of disk space in other localDirs. We see the following error for this case: {code} 2014-09-13 23:33:25,171 WARN org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Unable to create app directory /hadoop/d1/usercache/cloudera/appcache/application_1410663092546_0004 java.io.IOException: mkdir of /hadoop/d1/usercache/cloudera/appcache/application_1410663092546_0004 failed at org.apache.hadoop.fs.FileSystem.primitiveMkdir(FileSystem.java:1062) at org.apache.hadoop.fs.DelegateToFileSystem.mkdir(DelegateToFileSystem.java:157) at org.apache.hadoop.fs.FilterFs.mkdir(FilterFs.java:189) at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:721) at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:717) at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90) at org.apache.hadoop.fs.FileContext.mkdir(FileContext.java:717) at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.createDir(DefaultContainerExecutor.java:426) at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.createAppDirs(DefaultContainerExecutor.java:522) at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.startLocalizer(DefaultContainerExecutor.java:94) at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:987) 2014-09-13 23:33:25,185 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Localizer failed java.io.FileNotFoundException: File file:/hadoop/d1/usercache/cloudera/appcache/application_1410663092546_0004 does not exist at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:511) at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:724) at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:501) at org.apache.hadoop.fs.DelegateToFileSystem.getFileStatus(DelegateToFileSystem.java:111) at org.apache.hadoop.fs.DelegateToFileSystem.createInternal(DelegateToFileSystem.java:76) at org.apache.hadoop.fs.ChecksumFs$ChecksumFSOutputSummer.init(ChecksumFs.java:344) at org.apache.hadoop.fs.ChecksumFs.createInternal(ChecksumFs.java:390) at org.apache.hadoop.fs.AbstractFileSystem.create(AbstractFileSystem.java:577) at org.apache.hadoop.fs.FileContext$3.next(FileContext.java:677) at org.apache.hadoop.fs.FileContext$3.next(FileContext.java:673) at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90) at org.apache.hadoop.fs.FileContext.create(FileContext.java:673) at org.apache.hadoop.fs.FileContext$Util.copy(FileContext.java:2021) at
org.apache.hadoop.fs.FileContext$Util.copy(FileContext.java:1963) at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.startLocalizer(DefaultContainerExecutor.java:102) at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:987) 2014-09-13 23:33:25,186 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_1410663092546_0004_01_01 transitioned from LOCALIZING to LOCALIZATION_FAILED 2014-09-13 23:33:25,187 WARN org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=cloudera OPERATION=Container Finished - Failed TARGET=ContainerImpl RESULT=FAILURE DESCRIPTION=Container failed with state: LOCALIZATION_FAILED APPID=application_1410663092546_0004 CONTAINERID=container_1410663092546_0004_01_01 2014-09-13 23:33:25,187 INFO
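A hedged sketch of the fix direction the report implies: try each localDir in turn for the token-file copy instead of failing outright on the first. Method and variable names are illustrative, not the actual patch:

{code}
import java.io.IOException;
import java.util.List;
import org.apache.hadoop.fs.FileContext;
import org.apache.hadoop.fs.Path;

public final class TokenCopySketch {
  // Fall through to the next localDir when a copy fails (e.g., disk full),
  // rather than aborting localization on the first directory.
  public static Path copyTokenFile(FileContext lfs, Path tokenSrc,
                                   List<String> localDirs) throws IOException {
    IOException lastFailure = null;
    for (String dir : localDirs) {
      Path dst = new Path(dir, tokenSrc.getName());
      try {
        lfs.util().copy(tokenSrc, dst); // FileContext.Util.copy, as in the trace
        return dst;
      } catch (IOException e) {
        lastFailure = e; // remember and try the next localDir
      }
    }
    throw lastFailure != null ? lastFailure
        : new IOException("no local dirs configured");
  }
}
{code}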
[jira] [Updated] (YARN-2566) IOException happen in startLocalizer of DefaultContainerExecutor due to not enough disk space for the first localDir.
[ https://issues.apache.org/jira/browse/YARN-2566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu updated YARN-2566: Attachment: YARN-2566.003.patch IOException happen in startLocalizer of DefaultContainerExecutor due to not enough disk space for the first localDir. - Key: YARN-2566 URL: https://issues.apache.org/jira/browse/YARN-2566 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.5.0 Reporter: zhihai xu Assignee: zhihai xu Attachments: YARN-2566.000.patch, YARN-2566.001.patch, YARN-2566.002.patch, YARN-2566.003.patch startLocalizer in DefaultContainerExecutor will only use the first localDir to copy the token file; if the copy fails for the first localDir due to not enough disk space, the localization will fail even if there is plenty of disk space in other localDirs. We see the following error for this case: {code} 2014-09-13 23:33:25,171 WARN org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Unable to create app directory /hadoop/d1/usercache/cloudera/appcache/application_1410663092546_0004 java.io.IOException: mkdir of /hadoop/d1/usercache/cloudera/appcache/application_1410663092546_0004 failed at org.apache.hadoop.fs.FileSystem.primitiveMkdir(FileSystem.java:1062) at org.apache.hadoop.fs.DelegateToFileSystem.mkdir(DelegateToFileSystem.java:157) at org.apache.hadoop.fs.FilterFs.mkdir(FilterFs.java:189) at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:721) at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:717) at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90) at org.apache.hadoop.fs.FileContext.mkdir(FileContext.java:717) at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.createDir(DefaultContainerExecutor.java:426) at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.createAppDirs(DefaultContainerExecutor.java:522) at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.startLocalizer(DefaultContainerExecutor.java:94) at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:987) 2014-09-13 23:33:25,185 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Localizer failed java.io.FileNotFoundException: File file:/hadoop/d1/usercache/cloudera/appcache/application_1410663092546_0004 does not exist at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:511) at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:724) at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:501) at org.apache.hadoop.fs.DelegateToFileSystem.getFileStatus(DelegateToFileSystem.java:111) at org.apache.hadoop.fs.DelegateToFileSystem.createInternal(DelegateToFileSystem.java:76) at org.apache.hadoop.fs.ChecksumFs$ChecksumFSOutputSummer.init(ChecksumFs.java:344) at org.apache.hadoop.fs.ChecksumFs.createInternal(ChecksumFs.java:390) at org.apache.hadoop.fs.AbstractFileSystem.create(AbstractFileSystem.java:577) at org.apache.hadoop.fs.FileContext$3.next(FileContext.java:677) at org.apache.hadoop.fs.FileContext$3.next(FileContext.java:673) at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90) at org.apache.hadoop.fs.FileContext.create(FileContext.java:673) at org.apache.hadoop.fs.FileContext$Util.copy(FileContext.java:2021) at
org.apache.hadoop.fs.FileContext$Util.copy(FileContext.java:1963) at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.startLocalizer(DefaultContainerExecutor.java:102) at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:987) 2014-09-13 23:33:25,186 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_1410663092546_0004_01_01 transitioned from LOCALIZING to LOCALIZATION_FAILED 2014-09-13 23:33:25,187 WARN org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=cloudera OPERATION=Container Finished - Failed TARGET=ContainerImpl RESULT=FAILURE DESCRIPTION=Container failed with state: LOCALIZATION_FAILED APPID=application_1410663092546_0004 CONTAINERID=container_1410663092546_0004_01_01 2014-09-13 23:33:25,187 INFO
[jira] [Updated] (YARN-1707) Making the CapacityScheduler more dynamic
[ https://issues.apache.org/jira/browse/YARN-1707?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Subru Krishnan updated YARN-1707: - Target Version/s: (was: 3.0.0) Making the CapacityScheduler more dynamic - Key: YARN-1707 URL: https://issues.apache.org/jira/browse/YARN-1707 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler Reporter: Carlo Curino Assignee: Carlo Curino Labels: capacity-scheduler Fix For: 2.6.0 Attachments: YARN-1707.10.patch, YARN-1707.2.patch, YARN-1707.3.patch, YARN-1707.4.patch, YARN-1707.5.patch, YARN-1707.6.patch, YARN-1707.7.patch, YARN-1707.8.patch, YARN-1707.9.patch, YARN-1707.patch The CapacityScheduler is rather static at the moment, and refreshqueue provides a rather heavy-handed way to reconfigure it. Moving towards long-running services (tracked in YARN-896) and to enable more advanced admission control and resource parcelling, we need to make the CapacityScheduler more dynamic. This is instrumental to the umbrella JIRA YARN-1051. Concretely this requires the following changes: * create queues dynamically * destroy queues dynamically * dynamically change queue parameters (e.g., capacity) * modify refreshqueue validation to enforce sum(child.getCapacity()) <= 100% instead of == 100% We limit this to LeafQueues. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2312) Marking ContainerId#getId as deprecated
[ https://issues.apache.org/jira/browse/YARN-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14160653#comment-14160653 ] Jian He commented on YARN-2312: --- Patch looks good to me too, thanks Jason for reviewing the patch. bq. Wondering if there should be a utility method on ContainerId to provide this value or if the masking constant should be obtainable from ContainerId. I prefer exposing the constant Marking ContainerId#getId as deprecated --- Key: YARN-2312 URL: https://issues.apache.org/jira/browse/YARN-2312 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Tsuyoshi OZAWA Assignee: Tsuyoshi OZAWA Attachments: YARN-2312-wip.patch, YARN-2312.1.patch, YARN-2312.2-2.patch, YARN-2312.2-3.patch, YARN-2312.2.patch, YARN-2312.4.patch, YARN-2312.5.patch {{ContainerId#getId}} will only return partial value of containerId, only sequence number of container id without epoch, after YARN-2229. We should mark {{ContainerId#getId}} as deprecated and use {{ContainerId#getContainerId}} instead. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2566) IOException happen in startLocalizer of DefaultContainerExecutor due to not enough disk space for the first localDir.
[ https://issues.apache.org/jira/browse/YARN-2566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14160654#comment-14160654 ] zhihai xu commented on YARN-2566: - I don't see the problem ({{-1 javac. The patch appears to cause the build to fail.}}) in my local build. Restarting the Jenkins test. IOException happen in startLocalizer of DefaultContainerExecutor due to not enough disk space for the first localDir. - Key: YARN-2566 URL: https://issues.apache.org/jira/browse/YARN-2566 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.5.0 Reporter: zhihai xu Assignee: zhihai xu Attachments: YARN-2566.000.patch, YARN-2566.001.patch, YARN-2566.002.patch, YARN-2566.003.patch startLocalizer in DefaultContainerExecutor will only use the first localDir to copy the token file; if the copy fails for the first localDir due to not enough disk space, the localization will fail even if there is plenty of disk space in other localDirs. We see the following error for this case: {code} 2014-09-13 23:33:25,171 WARN org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Unable to create app directory /hadoop/d1/usercache/cloudera/appcache/application_1410663092546_0004 java.io.IOException: mkdir of /hadoop/d1/usercache/cloudera/appcache/application_1410663092546_0004 failed at org.apache.hadoop.fs.FileSystem.primitiveMkdir(FileSystem.java:1062) at org.apache.hadoop.fs.DelegateToFileSystem.mkdir(DelegateToFileSystem.java:157) at org.apache.hadoop.fs.FilterFs.mkdir(FilterFs.java:189) at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:721) at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:717) at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90) at org.apache.hadoop.fs.FileContext.mkdir(FileContext.java:717) at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.createDir(DefaultContainerExecutor.java:426) at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.createAppDirs(DefaultContainerExecutor.java:522) at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.startLocalizer(DefaultContainerExecutor.java:94) at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:987) 2014-09-13 23:33:25,185 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Localizer failed java.io.FileNotFoundException: File file:/hadoop/d1/usercache/cloudera/appcache/application_1410663092546_0004 does not exist at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:511) at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:724) at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:501) at org.apache.hadoop.fs.DelegateToFileSystem.getFileStatus(DelegateToFileSystem.java:111) at org.apache.hadoop.fs.DelegateToFileSystem.createInternal(DelegateToFileSystem.java:76) at org.apache.hadoop.fs.ChecksumFs$ChecksumFSOutputSummer.init(ChecksumFs.java:344) at org.apache.hadoop.fs.ChecksumFs.createInternal(ChecksumFs.java:390) at org.apache.hadoop.fs.AbstractFileSystem.create(AbstractFileSystem.java:577) at org.apache.hadoop.fs.FileContext$3.next(FileContext.java:677) at org.apache.hadoop.fs.FileContext$3.next(FileContext.java:673) at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90) at
org.apache.hadoop.fs.FileContext.create(FileContext.java:673) at org.apache.hadoop.fs.FileContext$Util.copy(FileContext.java:2021) at org.apache.hadoop.fs.FileContext$Util.copy(FileContext.java:1963) at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.startLocalizer(DefaultContainerExecutor.java:102) at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:987) 2014-09-13 23:33:25,186 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_1410663092546_0004_01_01 transitioned from LOCALIZING to LOCALIZATION_FAILED 2014-09-13 23:33:25,187 WARN org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=cloudera OPERATION=Container Finished - Failed TARGET=ContainerImpl RESULT=FAILURE DESCRIPTION=Container failed with state: LOCALIZATION_FAILED
[jira] [Updated] (YARN-1708) Add a public API to reserve resources (part of YARN-1051)
[ https://issues.apache.org/jira/browse/YARN-1708?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Subru Krishnan updated YARN-1708: - Fix Version/s: 2.6.0 Add a public API to reserve resources (part of YARN-1051) - Key: YARN-1708 URL: https://issues.apache.org/jira/browse/YARN-1708 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Carlo Curino Assignee: Subru Krishnan Fix For: 2.6.0 Attachments: YARN-1708.patch, YARN-1708.patch, YARN-1708.patch, YARN-1708.patch This JIRA tracks the definition of a new public API for YARN, which allows users to reserve resources (think of time-bounded queues). This is part of the admission control enhancement proposed in YARN-1051. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-1709) Admission Control: Reservation subsystem
[ https://issues.apache.org/jira/browse/YARN-1709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Subru Krishnan updated YARN-1709: - Fix Version/s: 2.6.0 Admission Control: Reservation subsystem Key: YARN-1709 URL: https://issues.apache.org/jira/browse/YARN-1709 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Carlo Curino Assignee: Subru Krishnan Fix For: 2.6.0 Attachments: YARN-1709.patch, YARN-1709.patch, YARN-1709.patch, YARN-1709.patch, YARN-1709.patch, YARN-1709.patch, YARN-1709.patch This JIRA is about the key data structure used to track resources over time to enable YARN-1051. The Reservation subsystem is conceptually a plan of how the scheduler will allocate resources over time. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-1709) Admission Control: Reservation subsystem
[ https://issues.apache.org/jira/browse/YARN-1709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Subru Krishnan updated YARN-1709: - Target Version/s: (was: 3.0.0) Admission Control: Reservation subsystem Key: YARN-1709 URL: https://issues.apache.org/jira/browse/YARN-1709 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Carlo Curino Assignee: Subru Krishnan Fix For: 2.6.0 Attachments: YARN-1709.patch, YARN-1709.patch, YARN-1709.patch, YARN-1709.patch, YARN-1709.patch, YARN-1709.patch, YARN-1709.patch This JIRA is about the key data structure used to track resources over time to enable YARN-1051. The Reservation subsystem is conceptually a plan of how the scheduler will allocate resources over time. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-1710) Admission Control: agents to allocate reservation
[ https://issues.apache.org/jira/browse/YARN-1710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Subru Krishnan updated YARN-1710: - Fix Version/s: 2.6.0 Admission Control: agents to allocate reservation - Key: YARN-1710 URL: https://issues.apache.org/jira/browse/YARN-1710 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Carlo Curino Assignee: Carlo Curino Fix For: 2.6.0 Attachments: YARN-1710.1.patch, YARN-1710.2.patch, YARN-1710.3.patch, YARN-1710.4.patch, YARN-1710.patch This JIRA tracks the algorithms used to allocate a user ReservationRequest coming in from the new reservation API (YARN-1708), in the inventory subsystem (YARN-1709) maintaining the current plan for the cluster. The focus of these agents is to quickly find a solution that satisfies both the set of constraints provided by the user and the physical constraints of the plan. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
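As a toy illustration of the greedy flavor of such an agent, building on the PlanSketch above: scan candidate start times inside the user's arrival/deadline window and commit the first one where the plan has headroom. The real agents are considerably more sophisticated; this only shows the constraint-satisfaction shape of the problem:
{code}
// Toy greedy agent: find the earliest start in [arrival, deadline - duration]
// where the plan has headroom for the request. Illustrative only; the real
// agents in the patches use smarter search and placement strategies.
public class GreedyAgentSketch {
  /** Returns a feasible start time, or -1 if none fits. */
  static long place(PlanSketch plan, long planCapacityMb,
                    long arrival, long deadline,
                    long durationMs, long memMb, long stepMs) {
    for (long s = arrival; s + durationMs <= deadline; s += stepMs) {
      boolean fits = true;
      for (long t = s; t < s + durationMs; t += stepMs) {
        if (plan.valueAt(t) + memMb > planCapacityMb) {
          fits = false;   // physical constraint violated inside this window
          break;
        }
      }
      if (fits) {
        plan.add(s, s + durationMs, memMb);  // commit into the plan
        return s;
      }
    }
    return -1;  // user constraints cannot be met
  }
}
{code}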
[jira] [Updated] (YARN-1711) CapacityOverTimePolicy: a policy to enforce quotas over time for YARN-1709
[ https://issues.apache.org/jira/browse/YARN-1711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Subru Krishnan updated YARN-1711: - Fix Version/s: 2.6.0 CapacityOverTimePolicy: a policy to enforce quotas over time for YARN-1709 -- Key: YARN-1711 URL: https://issues.apache.org/jira/browse/YARN-1711 Project: Hadoop YARN Issue Type: Sub-task Reporter: Carlo Curino Assignee: Carlo Curino Labels: reservations Fix For: 2.6.0 Attachments: YARN-1711.1.patch, YARN-1711.2.patch, YARN-1711.3.patch, YARN-1711.4.patch, YARN-1711.5.patch, YARN-1711.patch This JIRA tracks the development of a policy that enforces user quotas (a time-extension of the notion of capacity) in the inventory subsystem discussed in YARN-1709. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
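Read as pseudocode, a capacity-over-time quota implies two checks: an instantaneous cap, plus a cap on the average (integral) usage over a sliding window. The sketch below, again reusing the illustrative PlanSketch, encodes that assumed semantics; it is not the policy class from the attached patches:
{code}
// Sketch of the two checks a capacity-over-time policy implies: an
// instantaneous cap and an average-over-window cap. Assumed semantics only.
public class OverTimePolicySketch {
  static void validate(PlanSketch userPlan, long start, long end,
                       long instantCapMb, long avgCapMb,
                       long windowMs, long stepMs) {
    for (long w = start; w < end; w += stepMs) {
      long sum = 0;
      for (long t = w; t < w + windowMs; t += stepMs) {
        long v = userPlan.valueAt(t);
        if (v > instantCapMb) {
          throw new IllegalStateException("instantaneous quota exceeded at " + t);
        }
        sum += v * stepMs;   // integrate this user's usage over the window
      }
      if (sum / windowMs > avgCapMb) {
        throw new IllegalStateException("average quota over window exceeded");
      }
    }
  }
}
{code}
The average check is what makes the quota a "time-extension" of capacity: a user may briefly burst above the average cap, but not sustain it across the window.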
[jira] [Updated] (YARN-1711) CapacityOverTimePolicy: a policy to enforce quotas over time for YARN-1709
[ https://issues.apache.org/jira/browse/YARN-1711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Subru Krishnan updated YARN-1711: - Target Version/s: (was: 3.0.0) CapacityOverTimePolicy: a policy to enforce quotas over time for YARN-1709 -- Key: YARN-1711 URL: https://issues.apache.org/jira/browse/YARN-1711 Project: Hadoop YARN Issue Type: Sub-task Reporter: Carlo Curino Assignee: Carlo Curino Labels: reservations Fix For: 2.6.0 Attachments: YARN-1711.1.patch, YARN-1711.2.patch, YARN-1711.3.patch, YARN-1711.4.patch, YARN-1711.5.patch, YARN-1711.patch This JIRA tracks the development of a policy that enforces user quotas (a time-extension of the notion of capacity) in the inventory subsystem discussed in YARN-1709. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2080) Admission Control: Integrate Reservation subsystem with ResourceManager
[ https://issues.apache.org/jira/browse/YARN-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Subru Krishnan updated YARN-2080: - Fix Version/s: 2.6.0 Admission Control: Integrate Reservation subsystem with ResourceManager --- Key: YARN-2080 URL: https://issues.apache.org/jira/browse/YARN-2080 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Subru Krishnan Assignee: Subru Krishnan Fix For: 2.6.0 Attachments: YARN-2080.patch, YARN-2080.patch, YARN-2080.patch, YARN-2080.patch, YARN-2080.patch, YARN-2080.patch, YARN-2080.patch This JIRA tracks the integration of the Reservation subsystem data structures introduced in YARN-1709 with the YARN RM. This is essentially the end-to-end wiring of YARN-1051. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-1712) Admission Control: plan follower
[ https://issues.apache.org/jira/browse/YARN-1712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Subru Krishnan updated YARN-1712: - Fix Version/s: 2.6.0 Admission Control: plan follower Key: YARN-1712 URL: https://issues.apache.org/jira/browse/YARN-1712 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler, resourcemanager Reporter: Carlo Curino Assignee: Carlo Curino Labels: reservations, scheduler Fix For: 2.6.0 Attachments: YARN-1712.1.patch, YARN-1712.2.patch, YARN-1712.3.patch, YARN-1712.4.patch, YARN-1712.5.patch, YARN-1712.patch This JIRA tracks a thread that continuously propagates the current state of the reservation subsystem to the scheduler. As the inventory subsystem stores the plan of how the resources should be subdivided, the work we propose in this JIRA realizes that plan by dynamically instructing the CapacityScheduler to add/remove/resize queues to follow the plan. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
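Schematically, the plan follower is a periodic task that maps the plan's currently active reservations onto dynamically sized scheduler queues. In the sketch below the SchedulerOps facade and the share map are invented for illustration; the actual patch wires into the CapacityScheduler directly:
{code}
// Schematic plan follower: periodically realize the plan "now" as dynamic
// queue sizes. Names are illustrative assumptions, not the patch's API.
import java.util.Map;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class PlanFollowerSketch {
  interface SchedulerOps {                 // assumed thin scheduler facade
    void setQueueCapacity(String queue, float fractionOfPlan);
    void removeQueue(String queue);
  }

  static void start(Map<String, Float> plannedShares, SchedulerOps sched,
                    long periodMs) {
    ScheduledExecutorService exec =
        Executors.newSingleThreadScheduledExecutor();
    exec.scheduleAtFixedRate(() -> {
      // One dynamic queue per active reservation, sized to its currently
      // planned share of the plan's capacity.
      for (Map.Entry<String, Float> e : plannedShares.entrySet()) {
        if (e.getValue() > 0f) {
          sched.setQueueCapacity(e.getKey(), e.getValue());
        } else {
          sched.removeQueue(e.getKey());   // reservation has expired
        }
      }
    }, 0, periodMs, TimeUnit.MILLISECONDS);
  }
}
{code}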
[jira] [Updated] (YARN-1712) Admission Control: plan follower
[ https://issues.apache.org/jira/browse/YARN-1712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Subru Krishnan updated YARN-1712: - Target Version/s: (was: 3.0.0) Admission Control: plan follower Key: YARN-1712 URL: https://issues.apache.org/jira/browse/YARN-1712 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler, resourcemanager Reporter: Carlo Curino Assignee: Carlo Curino Labels: reservations, scheduler Fix For: 2.6.0 Attachments: YARN-1712.1.patch, YARN-1712.2.patch, YARN-1712.3.patch, YARN-1712.4.patch, YARN-1712.5.patch, YARN-1712.patch This JIRA tracks a thread that continuously propagates the current state of the reservation subsystem to the scheduler. As the inventory subsystem stores the plan of how the resources should be subdivided, the work we propose in this JIRA realizes that plan by dynamically instructing the CapacityScheduler to add/remove/resize queues to follow the plan. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2080) Admission Control: Integrate Reservation subsystem with ResourceManager
[ https://issues.apache.org/jira/browse/YARN-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Subru Krishnan updated YARN-2080: - Target Version/s: (was: 3.0.0) Admission Control: Integrate Reservation subsystem with ResourceManager --- Key: YARN-2080 URL: https://issues.apache.org/jira/browse/YARN-2080 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Subru Krishnan Assignee: Subru Krishnan Fix For: 2.6.0 Attachments: YARN-2080.patch, YARN-2080.patch, YARN-2080.patch, YARN-2080.patch, YARN-2080.patch, YARN-2080.patch, YARN-2080.patch This JIRA tracks the integration of the Reservation subsystem data structures introduced in YARN-1709 with the YARN RM. This is essentially the end-to-end wiring of YARN-1051. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2378) Adding support for moving apps between queues in Capacity Scheduler
[ https://issues.apache.org/jira/browse/YARN-2378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Subru Krishnan updated YARN-2378: - Target Version/s: (was: 2.6.0) Adding support for moving apps between queues in Capacity Scheduler --- Key: YARN-2378 URL: https://issues.apache.org/jira/browse/YARN-2378 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler Reporter: Subru Krishnan Assignee: Subru Krishnan Labels: capacity-scheduler Fix For: 2.6.0 Attachments: YARN-2378-1.patch, YARN-2378.patch, YARN-2378.patch, YARN-2378.patch, YARN-2378.patch As discussed with [~leftnoteasy] and [~jianhe], we are breaking up YARN-1707 to smaller patches for manageability. This JIRA will address adding support for moving apps between queues in Capacity Scheduler. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
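Once the scheduler-side support lands, a move would presumably be driven through the existing client entry point. A sketch, assuming the YarnClient#moveApplicationAcrossQueues call that already exists for other schedulers:
{code}
// Sketch: moving a running app to another queue via the client API,
// assuming YarnClient#moveApplicationAcrossQueues (present for other
// schedulers; this JIRA adds the CapacityScheduler-side support).
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.util.ConverterUtils;

public class MoveAppSketch {
  public static void main(String[] args) throws Exception {
    YarnClient client = YarnClient.createYarnClient();
    client.init(new Configuration());
    client.start();
    try {
      // args[0]: an app id string such as application_1410663092546_0004
      ApplicationId appId = ConverterUtils.toApplicationId(args[0]);
      client.moveApplicationAcrossQueues(appId, args[1]);  // target queue
    } finally {
      client.stop();
    }
  }
}
{code}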
[jira] [Updated] (YARN-2385) Consider splitting getAppsInQueue to getRunningAppsInQueue + getPendingAppsInQueue
[ https://issues.apache.org/jira/browse/YARN-2385?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Subru Krishnan updated YARN-2385: - Fix Version/s: 2.6.0 Consider splitting getAppsInQueue to getRunningAppsInQueue + getPendingAppsInQueue -- Key: YARN-2385 URL: https://issues.apache.org/jira/browse/YARN-2385 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler, fairscheduler Reporter: Subru Krishnan Assignee: Subru Krishnan Labels: abstractyarnscheduler Fix For: 2.6.0 Currently getAppsInQueue returns both pending and running apps. The purpose of this JIRA is to explore splitting it into getRunningAppsInQueue + getPendingAppsInQueue, which would give callers more flexibility. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
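A hypothetical shape of the proposed split, with names mirroring the JIRA title (not committed API):
{code}
// Hypothetical shape of the proposed split; names mirror the JIRA title
// and are not a committed interface.
import java.util.List;
import org.apache.hadoop.yarn.api.records.ApplicationAttemptId;

interface QueueAppsView {
  // Today: one call returning pending and running apps mixed together.
  List<ApplicationAttemptId> getAppsInQueue(String queueName);

  // Proposed: let callers ask for exactly the subset they need.
  List<ApplicationAttemptId> getRunningAppsInQueue(String queueName);
  List<ApplicationAttemptId> getPendingAppsInQueue(String queueName);
}
{code}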
[jira] [Updated] (YARN-2385) Consider splitting getAppsInQueue to getRunningAppsInQueue + getPendingAppsInQueue
[ https://issues.apache.org/jira/browse/YARN-2385?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Subru Krishnan updated YARN-2385: - Fix Version/s: (was: 2.6.0) Consider splitting getAppsInQueue to getRunningAppsInQueue + getPendingAppsInQueue -- Key: YARN-2385 URL: https://issues.apache.org/jira/browse/YARN-2385 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler, fairscheduler Reporter: Subru Krishnan Assignee: Subru Krishnan Labels: abstractyarnscheduler Currently getAppsInQueue returns both pending and running apps. The purpose of this JIRA is to explore splitting it into getRunningAppsInQueue + getPendingAppsInQueue, which would give callers more flexibility. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2611) Fix jenkins findbugs warning and test case failures for trunk merge patch
[ https://issues.apache.org/jira/browse/YARN-2611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Subru Krishnan updated YARN-2611: - Fix Version/s: 2.6.0 Fix jenkins findbugs warning and test case failures for trunk merge patch - Key: YARN-2611 URL: https://issues.apache.org/jira/browse/YARN-2611 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler, resourcemanager, scheduler Reporter: Subru Krishnan Assignee: Subru Krishnan Fix For: 2.6.0 Attachments: YARN-2611.patch This JIRA is to fix the Jenkins findbugs warnings and test case failures for the trunk merge patch, as [reported | https://issues.apache.org/jira/browse/YARN-1051?focusedCommentId=14148506] in YARN-1051 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2611) Fix jenkins findbugs warning and test case failures for trunk merge patch
[ https://issues.apache.org/jira/browse/YARN-2611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Subru Krishnan updated YARN-2611: - Target Version/s: (was: 3.0.0) Fix jenkins findbugs warning and test case failures for trunk merge patch - Key: YARN-2611 URL: https://issues.apache.org/jira/browse/YARN-2611 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler, resourcemanager, scheduler Reporter: Subru Krishnan Assignee: Subru Krishnan Fix For: 2.6.0 Attachments: YARN-2611.patch This JIRA is to fix the Jenkins findbugs warnings and test case failures for the trunk merge patch, as [reported | https://issues.apache.org/jira/browse/YARN-1051?focusedCommentId=14148506] in YARN-1051 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2576) Prepare yarn-1051 branch for merging with trunk
[ https://issues.apache.org/jira/browse/YARN-2576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Subru Krishnan updated YARN-2576: - Target Version/s: (was: 3.0.0) Prepare yarn-1051 branch for merging with trunk --- Key: YARN-2576 URL: https://issues.apache.org/jira/browse/YARN-2576 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler, resourcemanager, scheduler Reporter: Subru Krishnan Assignee: Subru Krishnan Fix For: 2.6.0 Attachments: YARN-2576.patch, YARN-2576.patch This JIRA is to track the changes required to ensure branch yarn-1051 is ready to be merged with trunk. This includes fixing any compilation issues, findbugs and/or javadoc warnings, test case failures, etc., if any. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2544) [YARN-796] Common server side PB changes (not include user API PB changes)
[ https://issues.apache.org/jira/browse/YARN-2544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14160673#comment-14160673 ] Hadoop QA commented on YARN-2544: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12673138/YARN-2544.patch against trunk revision ea26cc0. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 3 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5275//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5275//console This message is automatically generated. [YARN-796] Common server side PB changes (not include user API PB changes) -- Key: YARN-2544 URL: https://issues.apache.org/jira/browse/YARN-2544 Project: Hadoop YARN Issue Type: Sub-task Components: api, client, resourcemanager Reporter: Wangda Tan Assignee: Wangda Tan Attachments: YARN-2544.patch, YARN-2544.patch, YARN-2544.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2566) IOException happens in startLocalizer of DefaultContainerExecutor due to not enough disk space for the first localDir.
[ https://issues.apache.org/jira/browse/YARN-2566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14160687#comment-14160687 ] Hadoop QA commented on YARN-2566: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12673150/YARN-2566.003.patch against trunk revision ea26cc0. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:red}-1 javac{color}. The patch appears to cause the build to fail. Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5276//console This message is automatically generated. IOException happens in startLocalizer of DefaultContainerExecutor due to not enough disk space for the first localDir. - Key: YARN-2566 URL: https://issues.apache.org/jira/browse/YARN-2566 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.5.0 Reporter: zhihai xu Assignee: zhihai xu Attachments: YARN-2566.000.patch, YARN-2566.001.patch, YARN-2566.002.patch, YARN-2566.003.patch startLocalizer in DefaultContainerExecutor will only use the first localDir to copy the token file; if the copy fails for the first localDir because it does not have enough disk space, localization fails even though there is plenty of disk space in the other localDirs. We see the following error in this case: {code} 2014-09-13 23:33:25,171 WARN org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Unable to create app directory /hadoop/d1/usercache/cloudera/appcache/application_1410663092546_0004 java.io.IOException: mkdir of /hadoop/d1/usercache/cloudera/appcache/application_1410663092546_0004 failed at org.apache.hadoop.fs.FileSystem.primitiveMkdir(FileSystem.java:1062) at org.apache.hadoop.fs.DelegateToFileSystem.mkdir(DelegateToFileSystem.java:157) at org.apache.hadoop.fs.FilterFs.mkdir(FilterFs.java:189) at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:721) at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:717) at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90) at org.apache.hadoop.fs.FileContext.mkdir(FileContext.java:717) at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.createDir(DefaultContainerExecutor.java:426) at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.createAppDirs(DefaultContainerExecutor.java:522) at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.startLocalizer(DefaultContainerExecutor.java:94) at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:987) 2014-09-13 23:33:25,185 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Localizer failed java.io.FileNotFoundException: File file:/hadoop/d1/usercache/cloudera/appcache/application_1410663092546_0004 does not exist at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:511) at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:724) at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:501) at org.apache.hadoop.fs.DelegateToFileSystem.getFileStatus(DelegateToFileSystem.java:111) at org.apache.hadoop.fs.DelegateToFileSystem.createInternal(DelegateToFileSystem.java:76) at org.apache.hadoop.fs.ChecksumFs$ChecksumFSOutputSummer.init(ChecksumFs.java:344) at org.apache.hadoop.fs.ChecksumFs.createInternal(ChecksumFs.java:390) at org.apache.hadoop.fs.AbstractFileSystem.create(AbstractFileSystem.java:577) at org.apache.hadoop.fs.FileContext$3.next(FileContext.java:677) at org.apache.hadoop.fs.FileContext$3.next(FileContext.java:673) at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90) at org.apache.hadoop.fs.FileContext.create(FileContext.java:673) at org.apache.hadoop.fs.FileContext$Util.copy(FileContext.java:2021) at org.apache.hadoop.fs.FileContext$Util.copy(FileContext.java:1963) at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.startLocalizer(DefaultContainerExecutor.java:102) at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:987) {code}
[jira] [Created] (YARN-2648) need mechanism for updating HDFS delegation tokens associated with container launch contexts
Jonathan Maron created YARN-2648: Summary: need mechanism for updating HDFS delegation tokens associated with container launch contexts Key: YARN-2648 URL: https://issues.apache.org/jira/browse/YARN-2648 Project: Hadoop YARN Issue Type: Bug Components: nodemanager, resourcemanager Reporter: Jonathan Maron During the launch of a container, the required delegation tokens (e.g. HDFS) are passed to the launch context. If those tokens expire and the container requires a restart, the restart attempt will fail. Sample log output: 2014-10-06 18:37:28,609 WARN ipc.Client (Client.java:run(675)) - Exception encountered while connecting to the server : org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken): token (HDFS_DELEGATION_TOKEN token 124 for hbase) can't be found in cache -- This message was sent by Atlassian JIRA (v6.3.4#6332)
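For context on why restarts fail: delegation tokens are typically fetched once when the launch context is built and serialized into it, roughly as below. This is a sketch using the standard Credentials plumbing (the surrounding class and renewer parameter are illustrative); after the HDFS token expires, this frozen blob is all the NM has for a restart:
{code}
// Why restarts fail: tokens are fetched once at launch-context build time
// and frozen into the context, roughly like this. Sketch only; the class
// and renewer argument are illustrative.
import java.nio.ByteBuffer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.DataOutputBuffer;
import org.apache.hadoop.security.Credentials;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.util.Records;

public class LaunchContextTokens {
  static ContainerLaunchContext withTokens(Configuration conf, String renewer)
      throws Exception {
    Credentials creds = new Credentials();
    // Fetch HDFS delegation tokens once, at submission time.
    FileSystem.get(conf).addDelegationTokens(renewer, creds);
    DataOutputBuffer dob = new DataOutputBuffer();
    creds.writeTokenStorageToStream(dob);
    ContainerLaunchContext clc = Records.newRecord(ContainerLaunchContext.class);
    // The serialized tokens never change after this point -- the mechanism
    // this JIRA asks for would refresh them after expiry.
    clc.setTokens(ByteBuffer.wrap(dob.getData(), 0, dob.getLength()));
    return clc;
  }
}
{code}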
[jira] [Updated] (YARN-2644) Recalculate headroom more frequently to keep it accurate
[ https://issues.apache.org/jira/browse/YARN-2644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Craig Welch updated YARN-2644: -- Attachment: YARN-2644.15.patch Update patch against latest trunk in new(er) git repo Recalculate headroom more frequently to keep it accurate Key: YARN-2644 URL: https://issues.apache.org/jira/browse/YARN-2644 Project: Hadoop YARN Issue Type: Sub-task Reporter: Craig Welch Assignee: Craig Welch Attachments: YARN-2644.11.patch, YARN-2644.14.patch, YARN-2644.15.patch See parent (1198) for more detail - this specifically covers calculating the headroom more frequently, to cover the cases where changes have occurred which impact headroom but which are not reflected due to an application not being updated. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-1857) CapacityScheduler headroom doesn't account for other AM's running
[ https://issues.apache.org/jira/browse/YARN-1857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Craig Welch updated YARN-1857: -- Attachment: YARN-1857.5.patch Updating to current trunk on new(er) repo CapacityScheduler headroom doesn't account for other AM's running - Key: YARN-1857 URL: https://issues.apache.org/jira/browse/YARN-1857 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler Affects Versions: 2.3.0 Reporter: Thomas Graves Assignee: Chen He Priority: Critical Attachments: YARN-1857.1.patch, YARN-1857.2.patch, YARN-1857.3.patch, YARN-1857.4.patch, YARN-1857.5.patch, YARN-1857.patch, YARN-1857.patch, YARN-1857.patch It's possible for an application to hang forever (or a long time) in a cluster with multiple users. The reason is that the headroom sent to the application is based on the user limit, but it doesn't account for other ApplicationMasters using space in that queue. So the headroom (user limit - user consumed) can be 0 even though the cluster is 100% full, because the remaining space is being used by application masters from other users. For instance, suppose you have a cluster with 1 queue, the user limit is 100%, and multiple users are submitting applications. One very large application by user 1 starts up, runs most of its maps, and starts running reducers. Other users try to start applications and get their application masters started, but no tasks. The very large application then gets to the point where it has consumed the rest of the cluster resources with reduces, but at this point it still needs to finish a few maps. The headroom being sent to this application is only based on the user limit (which is 100% of the cluster capacity); it is using, say, 95% of the cluster for reduces while the other 5% is being used by other users running application masters. The MRAppMaster thinks it still has 5%, so it doesn't know that it should kill a reduce in order to run a map. This can happen in other scenarios also. Generally in a large cluster with multiple queues this shouldn't cause a hang forever, but it could cause the application to take much longer. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2633) TestContainerLauncherImpl sometimes fails
[ https://issues.apache.org/jira/browse/YARN-2633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mit Desai updated YARN-2633: Attachment: YARN-2633.patch Attaching the patch. The caught exception was meant to be ignored, so still throwing YarnRuntimeException from the catch clause did not make sense. Deleting the line that throws the exception. TestContainerLauncherImpl sometimes fails - Key: YARN-2633 URL: https://issues.apache.org/jira/browse/YARN-2633 Project: Hadoop YARN Issue Type: Bug Reporter: Mit Desai Assignee: Mit Desai Attachments: YARN-2633.patch {noformat} org.apache.hadoop.yarn.exceptions.YarnRuntimeException: java.lang.NoSuchMethodException: org.apache.hadoop.yarn.api.ContainerManagementProtocol$$EnhancerByMockitoWithCGLIB$$25708415.close() at java.lang.Class.getMethod(Class.java:1665) at org.apache.hadoop.yarn.factories.impl.pb.RpcClientFactoryPBImpl.stopClient(RpcClientFactoryPBImpl.java:90) at org.apache.hadoop.yarn.ipc.HadoopYarnProtoRPC.stopProxy(HadoopYarnProtoRPC.java:54) at org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy.mayBeCloseProxy(ContainerManagementProtocolProxy.java:79) at org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl$Container.kill(ContainerLauncherImpl.java:225) at org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl.shutdownAllContainers(ContainerLauncherImpl.java:320) at org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl.serviceStop(ContainerLauncherImpl.java:331) at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221) at org.apache.hadoop.mapreduce.v2.app.launcher.TestContainerLauncherImpl.testMyShutdown(TestContainerLauncherImpl.java:315) {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
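The shape of that fix, as a sketch of the pattern rather than the exact diff:
{code}
// Sketch of the pattern the comment describes (not the exact diff). The
// stop path looks up close() reflectively; a Mockito-enhanced protocol
// object has no close(), and that lookup failure should be swallowed
// rather than rethrown.
import org.apache.hadoop.yarn.exceptions.YarnRuntimeException;

public class StopClientSketch {
  static void stopClient(Object proxy) {
    try {
      proxy.getClass().getMethod("close").invoke(proxy);
    } catch (NoSuchMethodException e) {
      // Mocked proxies have no close(); ignoring this (instead of wrapping
      // it in YarnRuntimeException) is what removes the test flake.
    } catch (Exception e) {
      throw new YarnRuntimeException(e);  // real close() failures still surface
    }
  }
}
{code}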
[jira] [Commented] (YARN-1857) CapacityScheduler headroom doesn't account for other AM's running
[ https://issues.apache.org/jira/browse/YARN-1857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14160819#comment-14160819 ] Hadoop QA commented on YARN-1857: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12673170/YARN-1857.5.patch against trunk revision ea26cc0. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:red}-1 javac{color}. The patch appears to cause the build to fail. Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5279//console This message is automatically generated. CapacityScheduler headroom doesn't account for other AM's running - Key: YARN-1857 URL: https://issues.apache.org/jira/browse/YARN-1857 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler Affects Versions: 2.3.0 Reporter: Thomas Graves Assignee: Chen He Priority: Critical Attachments: YARN-1857.1.patch, YARN-1857.2.patch, YARN-1857.3.patch, YARN-1857.4.patch, YARN-1857.5.patch, YARN-1857.patch, YARN-1857.patch, YARN-1857.patch It's possible for an application to hang forever (or a long time) in a cluster with multiple users. The reason is that the headroom sent to the application is based on the user limit, but it doesn't account for other ApplicationMasters using space in that queue. So the headroom (user limit - user consumed) can be 0 even though the cluster is 100% full, because the remaining space is being used by application masters from other users. For instance, suppose you have a cluster with 1 queue, the user limit is 100%, and multiple users are submitting applications. One very large application by user 1 starts up, runs most of its maps, and starts running reducers. Other users try to start applications and get their application masters started, but no tasks. The very large application then gets to the point where it has consumed the rest of the cluster resources with reduces, but at this point it still needs to finish a few maps. The headroom being sent to this application is only based on the user limit (which is 100% of the cluster capacity); it is using, say, 95% of the cluster for reduces while the other 5% is being used by other users running application masters. The MRAppMaster thinks it still has 5%, so it doesn't know that it should kill a reduce in order to run a map. This can happen in other scenarios also. Generally in a large cluster with multiple queues this shouldn't cause a hang forever, but it could cause the application to take much longer. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2500) [YARN-796] Miscellaneous changes in ResourceManager to support labels
[ https://issues.apache.org/jira/browse/YARN-2500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-2500: - Attachment: YARN-2500.patch [YARN-796] Miscellaneous changes in ResourceManager to support labels - Key: YARN-2500 URL: https://issues.apache.org/jira/browse/YARN-2500 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Wangda Tan Assignee: Wangda Tan Attachments: YARN-2500.patch, YARN-2500.patch, YARN-2500.patch, YARN-2500.patch, YARN-2500.patch This patch contains changes in the ResourceManager to support labels -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2500) [YARN-796] Miscellaneous changes in ResourceManager to support labels
[ https://issues.apache.org/jira/browse/YARN-2500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14160858#comment-14160858 ] Wangda Tan commented on YARN-2500: -- Regarding comments from [~vinodkv], bq. As with other patches, Labels -> NodeLabels. You'll need to change all of the following:... Addressed bq. ApplicationMasterService: There are multiple this.rmContext.getRMApps().get(appAttemptId.getApplicationId()) calls in the allocate method. Refactor to avoid dup calls. Addressed bq. TestSchedulerUtils: testValidateResourceRequestWithErrorLabelsPermission: Why are "" and " " accepted when only x and y are recognized labels? An empty label expression should be accepted by any queue, and will be trimmed to empty. bq. Given we don't yet support other features in ResourceRequest for the AM container, like priority and locality, shall we also hard-code them to AM_CONTAINER_PRIORITY and ResourceRequest.ANY, respectively? Agreed, now set to default values for priority/#container/resource-name/relax-locality bq. Can we add test-case for num-containers, priority, locality for AM container? Added test testScheduleTransitionReplaceAMContainerRequestWithDefaults in RMAppAttemptImpl. Please kindly review, Thanks, Wangda [YARN-796] Miscellaneous changes in ResourceManager to support labels - Key: YARN-2500 URL: https://issues.apache.org/jira/browse/YARN-2500 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Wangda Tan Assignee: Wangda Tan Attachments: YARN-2500.patch, YARN-2500.patch, YARN-2500.patch, YARN-2500.patch, YARN-2500.patch This patch contains changes in the ResourceManager to support labels -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2629) Make distributed shell use the domain-based timeline ACLs
[ https://issues.apache.org/jira/browse/YARN-2629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-2629: -- Attachment: YARN-2629.2.patch Added one more option, -create, which makes the client try to create a new domain only when this flag is set. In addition, fixed an existing problem in the DS AM: the AM should use the submitter UGI to put the entities. Created a patch with these changes. Make distributed shell use the domain-based timeline ACLs - Key: YARN-2629 URL: https://issues.apache.org/jira/browse/YARN-2629 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Zhijie Shen Assignee: Zhijie Shen Attachments: YARN-2629.1.patch, YARN-2629.2.patch To demonstrate the usage of this feature (YARN-2102), it's good to make the distributed shell create the domain and post its timeline entities into this private space. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
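For readers following along, the domain-based flow looks roughly like this with the timeline client (a sketch assuming 2.6-era signatures; the ids and ACL strings are illustrative):
{code}
// Sketch of "create the domain, then post entities into it" with the
// timeline client. Assumed 2.6-era signatures; ids/ACLs are illustrative.
import org.apache.hadoop.yarn.api.records.timeline.TimelineDomain;
import org.apache.hadoop.yarn.api.records.timeline.TimelineEntity;
import org.apache.hadoop.yarn.client.api.TimelineClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class DomainSketch {
  public static void main(String[] args) throws Exception {
    TimelineClient client = TimelineClient.createTimelineClient();
    client.init(new YarnConfiguration());
    client.start();
    try {
      TimelineDomain domain = new TimelineDomain();
      domain.setId("DS_DOMAIN_1");       // illustrative domain id
      domain.setReaders("alice,bob");    // who may read entities in it
      domain.setWriters("alice");        // who may post into it
      client.putDomain(domain);          // e.g. done once under -create

      TimelineEntity entity = new TimelineEntity();
      entity.setEntityType("DS_APP_ATTEMPT");
      entity.setEntityId("attempt-id-goes-here");  // illustrative
      entity.setDomainId("DS_DOMAIN_1"); // entity lands in the private space
      client.putEntities(entity);
    } finally {
      client.stop();
    }
  }
}
{code}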
[jira] [Updated] (YARN-796) Allow for (admin) labels on nodes and resource-requests
[ https://issues.apache.org/jira/browse/YARN-796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-796: Attachment: YARN-796.node-label.consolidate.12.patch Allow for (admin) labels on nodes and resource-requests --- Key: YARN-796 URL: https://issues.apache.org/jira/browse/YARN-796 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.4.1 Reporter: Arun C Murthy Assignee: Wangda Tan Attachments: LabelBasedScheduling.pdf, Node-labels-Requirements-Design-doc-V1.pdf, Node-labels-Requirements-Design-doc-V2.pdf, YARN-796-Diagram.pdf, YARN-796.node-label.consolidate.1.patch, YARN-796.node-label.consolidate.10.patch, YARN-796.node-label.consolidate.11.patch, YARN-796.node-label.consolidate.12.patch, YARN-796.node-label.consolidate.2.patch, YARN-796.node-label.consolidate.3.patch, YARN-796.node-label.consolidate.4.patch, YARN-796.node-label.consolidate.5.patch, YARN-796.node-label.consolidate.6.patch, YARN-796.node-label.consolidate.7.patch, YARN-796.node-label.consolidate.8.patch, YARN-796.node-label.demo.patch.1, YARN-796.patch, YARN-796.patch4 It will be useful for admins to specify labels for nodes. Examples of labels are OS, processor architecture etc. We should expose these labels and allow applications to specify labels on resource-requests. Obviously we need to support admin operations on adding/removing node labels. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2583) Modify the LogDeletionService to support Log aggregation for LRS
[ https://issues.apache.org/jira/browse/YARN-2583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuan Gong updated YARN-2583: Attachment: YARN-2583.2.patch Modify the LogDeletionService to support Log aggregation for LRS Key: YARN-2583 URL: https://issues.apache.org/jira/browse/YARN-2583 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager, resourcemanager Reporter: Xuan Gong Assignee: Xuan Gong Attachments: YARN-2583.1.patch, YARN-2583.2.patch Currently, AggregatedLogDeletionService will delete old logs from HDFS. It checks the cut-off time; if all logs for an application are older than this cut-off time, the app-log-dir is deleted from HDFS. This will not work for LRS: we expect an LRS application to keep running for a long time. Two different scenarios: 1) If we configured the rollingIntervalSeconds, new log files will keep being uploaded to HDFS, so the number of log files for this application will grow larger and larger, and no log files will ever be deleted. 2) If we did not configure the rollingIntervalSeconds, the log file can only be uploaded to HDFS after the application finishes. It is very possible that the logs are uploaded after the cut-off time, which causes a problem because by that time the app-log-dir for this application in HDFS has already been deleted. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
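The deletion rule being discussed is, roughly, "delete an app-log-dir only when every file in it is older than the cut-off." An illustrative sketch of that check shows both failure modes at once: a fresh rolled log keeps the directory alive forever (scenario 1), and once the directory is deleted a late upload has nowhere to land (scenario 2):
{code}
// Illustrative sketch of the current per-directory rule, not the patch.
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class LogDeletionSketch {
  static boolean shouldDeleteAppDir(FileSystem fs, Path appLogDir,
                                    long cutoffMs) throws Exception {
    for (FileStatus log : fs.listStatus(appLogDir)) {
      if (log.getModificationTime() >= cutoffMs) {
        return false;   // one fresh rolled log keeps the whole dir alive
      }
    }
    return true;        // all logs stale: the dir is removed wholesale, so
                        // a log uploaded after this point has nowhere to land
  }
}
{code}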
[jira] [Commented] (YARN-2629) Make distributed shell use the domain-based timeline ACLs
[ https://issues.apache.org/jira/browse/YARN-2629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14160963#comment-14160963 ] Hadoop QA commented on YARN-2629: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12673184/YARN-2629.2.patch against trunk revision 3affad9. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell: org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5281//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5281//console This message is automatically generated. Make distributed shell use the domain-based timeline ACLs - Key: YARN-2629 URL: https://issues.apache.org/jira/browse/YARN-2629 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Zhijie Shen Assignee: Zhijie Shen Attachments: YARN-2629.1.patch, YARN-2629.2.patch To demonstrate the usage of this feature (YARN-2102), it's good to make the distributed shell create the domain and post its timeline entities into this private space. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2583) Modify the LogDeletionService to support Log aggregation for LRS
[ https://issues.apache.org/jira/browse/YARN-2583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14160989#comment-14160989 ] Hadoop QA commented on YARN-2583: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12673189/YARN-2583.2.patch against trunk revision 3affad9. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 2 new or modified test files. {color:red}-1 javac{color}. The applied patch generated 1268 javac compiler warnings (more than the trunk's current 1267 warnings). {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:red}-1 findbugs{color}. The patch appears to introduce 1 new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager: org.apache.hadoop.yarn.logaggregation.TestAggregatedLogDeletionService org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.TestLogAggregationService {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5283//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/5283//artifact/patchprocess/newPatchFindbugsWarningshadoop-yarn-common.html Javac warnings: https://builds.apache.org/job/PreCommit-YARN-Build/5283//artifact/patchprocess/diffJavacWarnings.txt Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5283//console This message is automatically generated. Modify the LogDeletionService to support Log aggregation for LRS Key: YARN-2583 URL: https://issues.apache.org/jira/browse/YARN-2583 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager, resourcemanager Reporter: Xuan Gong Assignee: Xuan Gong Attachments: YARN-2583.1.patch, YARN-2583.2.patch Currently, AggregatedLogDeletionService will delete old logs from HDFS. It checks the cut-off time; if all logs for an application are older than this cut-off time, the app-log-dir is deleted from HDFS. This will not work for LRS: we expect an LRS application to keep running for a long time. Two different scenarios: 1) If we configured the rollingIntervalSeconds, new log files will keep being uploaded to HDFS, so the number of log files for this application will grow larger and larger, and no log files will ever be deleted. 2) If we did not configure the rollingIntervalSeconds, the log file can only be uploaded to HDFS after the application finishes. It is very possible that the logs are uploaded after the cut-off time, which causes a problem because by that time the app-log-dir for this application in HDFS has already been deleted. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2644) Recalculate headroom more frequently to keep it accurate
[ https://issues.apache.org/jira/browse/YARN-2644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Craig Welch updated YARN-2644: -- Attachment: YARN-2644.15.patch Reupload to see if jenkins can apply the patch now Recalculate headroom more frequently to keep it accurate Key: YARN-2644 URL: https://issues.apache.org/jira/browse/YARN-2644 Project: Hadoop YARN Issue Type: Sub-task Reporter: Craig Welch Assignee: Craig Welch Attachments: YARN-2644.11.patch, YARN-2644.14.patch, YARN-2644.15.patch, YARN-2644.15.patch See parent (1198) for more detail - this specifically covers calculating the headroom more frequently, to cover the cases where changes have occurred which impact headroom but which are not reflected due to an application not being updated. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2633) TestContainerLauncherImpl sometimes fails
[ https://issues.apache.org/jira/browse/YARN-2633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14161038#comment-14161038 ] Hadoop QA commented on YARN-2633: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12673174/YARN-2633.patch against trunk revision 687d83c. {color:red}-1 patch{color}. Trunk compilation may be broken. Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5286//console This message is automatically generated. TestContainerLauncherImpl sometimes fails - Key: YARN-2633 URL: https://issues.apache.org/jira/browse/YARN-2633 Project: Hadoop YARN Issue Type: Bug Reporter: Mit Desai Assignee: Mit Desai Attachments: YARN-2633.patch {noformat} org.apache.hadoop.yarn.exceptions.YarnRuntimeException: java.lang.NoSuchMethodException: org.apache.hadoop.yarn.api.ContainerManagementProtocol$$EnhancerByMockitoWithCGLIB$$25708415.close() at java.lang.Class.getMethod(Class.java:1665) at org.apache.hadoop.yarn.factories.impl.pb.RpcClientFactoryPBImpl.stopClient(RpcClientFactoryPBImpl.java:90) at org.apache.hadoop.yarn.ipc.HadoopYarnProtoRPC.stopProxy(HadoopYarnProtoRPC.java:54) at org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy.mayBeCloseProxy(ContainerManagementProtocolProxy.java:79) at org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl$Container.kill(ContainerLauncherImpl.java:225) at org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl.shutdownAllContainers(ContainerLauncherImpl.java:320) at org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl.serviceStop(ContainerLauncherImpl.java:331) at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221) at org.apache.hadoop.mapreduce.v2.app.launcher.TestContainerLauncherImpl.testMyShutdown(TestContainerLauncherImpl.java:315) {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2644) Recalculate headroom more frequently to keep it accurate
[ https://issues.apache.org/jira/browse/YARN-2644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14161148#comment-14161148 ] Hadoop QA commented on YARN-2644: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12673200/YARN-2644.15.patch against trunk revision 3affad9. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 2 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5285//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5285//console This message is automatically generated. Recalculate headroom more frequently to keep it accurate Key: YARN-2644 URL: https://issues.apache.org/jira/browse/YARN-2644 Project: Hadoop YARN Issue Type: Sub-task Reporter: Craig Welch Assignee: Craig Welch Attachments: YARN-2644.11.patch, YARN-2644.14.patch, YARN-2644.15.patch, YARN-2644.15.patch See parent (1198) for more detail - this specifically covers calculating the headroom more frequently, to cover the cases where changes have occurred which impact headroom but which are not reflected due to an application not being updated. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2644) Recalculate headroom more frequently to keep it accurate
[ https://issues.apache.org/jira/browse/YARN-2644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14161152#comment-14161152 ] Jian He commented on YARN-2644: --- looks good, committing Recalculate headroom more frequently to keep it accurate Key: YARN-2644 URL: https://issues.apache.org/jira/browse/YARN-2644 Project: Hadoop YARN Issue Type: Sub-task Reporter: Craig Welch Assignee: Craig Welch Attachments: YARN-2644.11.patch, YARN-2644.14.patch, YARN-2644.15.patch, YARN-2644.15.patch See parent (1198) for more detail - this specifically covers calculating the headroom more frequently, to cover the cases where changes have occurred which impact headroom but which are not reflected due to an application not being updated. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2496) [YARN-796] Changes for capacity scheduler to support allocate resource respect labels
[ https://issues.apache.org/jira/browse/YARN-2496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14161155#comment-14161155 ] Hadoop QA commented on YARN-2496: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12673224/YARN-2496.patch against trunk revision 8dc6abf. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 12 new or modified test files. {color:red}-1 javac{color}. The patch appears to cause the build to fail. Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5287//console This message is automatically generated. [YARN-796] Changes for capacity scheduler to support allocate resource respect labels - Key: YARN-2496 URL: https://issues.apache.org/jira/browse/YARN-2496 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Wangda Tan Assignee: Wangda Tan Attachments: YARN-2496.patch, YARN-2496.patch, YARN-2496.patch, YARN-2496.patch, YARN-2496.patch, YARN-2496.patch This JIRA includes: - Add/parse a labels option in {{capacity-scheduler.xml}}, similar to other queue options like capacity/maximum-capacity, etc. - Include a default-label-expression option in the queue config; if an app doesn't specify a label-expression, the queue's default-label-expression will be used. - Check whether the labels can be accessed by the queue when submitting an app with a label-expression to the queue or updating a ResourceRequest with a label-expression - Check labels on the NM when trying to allocate a ResourceRequest with a label-expression on that NM - Respect labels when calculating headroom/user-limit -- This message was sent by Atlassian JIRA (v6.3.4#6332)
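The matching rules in that list reduce to two toy predicates: a node must carry the label an expression asks for, and the queue must be allowed to use that label. An illustrative sketch, restricted to single-label expressions with an empty expression treated as "no constraint" for simplicity (the patch's real semantics are richer):
{code}
// Toy predicates for the matching rules described above. Illustrative only.
import java.util.Set;

public class LabelMatchSketch {
  /** Single-label expression: "" (no constraint here) or one label, e.g. "gpu". */
  static boolean nodeSatisfies(Set<String> nodeLabels, String expression) {
    if (expression == null || expression.trim().isEmpty()) {
      return true;                       // treated as unconstrained in this toy
    }
    return nodeLabels.contains(expression.trim());
  }

  /** A queue may only carry requests whose labels it is configured to access. */
  static boolean queueMayUse(Set<String> queueAccessibleLabels, String expr) {
    return expr == null || expr.trim().isEmpty()
        || queueAccessibleLabels.contains(expr.trim());
  }
}
{code}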
[jira] [Commented] (YARN-1061) NodeManager is indefinitely waiting for nodeHeartBeat() response from ResourceManager.
[ https://issues.apache.org/jira/browse/YARN-1061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14161160#comment-14161160 ] Wilfred Spiegelenburg commented on YARN-1061: - This is a dupe of YARN-2578. Writes do not time out, and they should. NodeManager is indefinitely waiting for nodeHeartBeat() response from ResourceManager. - Key: YARN-1061 URL: https://issues.apache.org/jira/browse/YARN-1061 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.0.5-alpha Reporter: Rohith It is observed that in one scenario the NodeManager waits indefinitely for the nodeHeartbeat() response from the ResourceManager when the ResourceManager is hung. The NodeManager should get a timeout exception instead of waiting indefinitely. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1857) CapacityScheduler headroom doesn't account for other AM's running
[ https://issues.apache.org/jira/browse/YARN-1857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14161164#comment-14161164 ] Jian He commented on YARN-1857: --- Could you please update the patch on top of YARN-2644? Comments in the meanwhile: - update the code comments about the new calculation of headroom {code} /** * Headroom is min((userLimit, queue-max-cap) - consumed) */ {code} - fix the indentation of this line: {{Resources.subtract(queueMaxCap, usedResources));}} CapacityScheduler headroom doesn't account for other AM's running - Key: YARN-1857 URL: https://issues.apache.org/jira/browse/YARN-1857 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler Affects Versions: 2.3.0 Reporter: Thomas Graves Assignee: Chen He Priority: Critical Attachments: YARN-1857.1.patch, YARN-1857.2.patch, YARN-1857.3.patch, YARN-1857.4.patch, YARN-1857.5.patch, YARN-1857.patch, YARN-1857.patch, YARN-1857.patch It's possible for an application to hang forever (or a long time) in a cluster with multiple users. The reason is that the headroom sent to the application is based on the user limit, but it doesn't account for other ApplicationMasters using space in that queue. So the headroom (user limit - user consumed) can be 0 even though the cluster is 100% full, because the remaining space is being used by application masters from other users. For instance, suppose you have a cluster with 1 queue, the user limit is 100%, and multiple users are submitting applications. One very large application by user 1 starts up, runs most of its maps, and starts running reducers. Other users try to start applications and get their application masters started, but no tasks. The very large application then gets to the point where it has consumed the rest of the cluster resources with reduces, but at this point it still needs to finish a few maps. The headroom being sent to this application is only based on the user limit (which is 100% of the cluster capacity); it is using, say, 95% of the cluster for reduces while the other 5% is being used by other users running application masters. The MRAppMaster thinks it still has 5%, so it doesn't know that it should kill a reduce in order to run a map. This can happen in other scenarios also. Generally in a large cluster with multiple queues this shouldn't cause a hang forever, but it could cause the application to take much longer. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
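The quoted headroom formula, sketched with the Resources helpers (a sketch of the commented formula, not the patch itself):
{code}
// Sketch of the formula quoted in the review comment above:
// headroom = min(userLimit, queueMaxCap) - consumed, floored at zero.
// Not the patch itself; parameter meanings follow the comment's wording.
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.util.resource.ResourceCalculator;
import org.apache.hadoop.yarn.util.resource.Resources;

public class HeadroomSketch {
  static Resource headroom(ResourceCalculator rc, Resource cluster,
                           Resource userLimit, Resource queueMaxCap,
                           Resource consumed) {
    // Capping by queue-max-cap means a 100% user limit can no longer
    // promise space that other users' AMs are already occupying.
    Resource cap = Resources.min(rc, cluster, userLimit, queueMaxCap);
    return Resources.max(rc, cluster,
        Resources.subtract(cap, consumed), Resources.none());
  }
}
{code}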
[jira] [Created] (YARN-2649) Flaky test TestAMRMRPCNodeUpdates
Ming Ma created YARN-2649: - Summary: Flaky test TestAMRMRPCNodeUpdates Key: YARN-2649 URL: https://issues.apache.org/jira/browse/YARN-2649 Project: Hadoop YARN Issue Type: Bug Reporter: Ming Ma Sometimes the test fails with the following error: testAMRMUnusableNodes(org.apache.hadoop.yarn.server.resourcemanager.applicationsmanager.TestAMRMRPCNodeUpdates) Time elapsed: 41.73 sec FAILURE! junit.framework.AssertionFailedError: AppAttempt state is not correct (timedout) expected:ALLOCATED but was:SCHEDULED at junit.framework.Assert.fail(Assert.java:50) at junit.framework.Assert.failNotEquals(Assert.java:287) at junit.framework.Assert.assertEquals(Assert.java:67) at org.apache.hadoop.yarn.server.resourcemanager.MockAM.waitForState(MockAM.java:82) at org.apache.hadoop.yarn.server.resourcemanager.MockRM.sendAMLaunched(MockRM.java:382) at org.apache.hadoop.yarn.server.resourcemanager.applicationsmanager.TestAMRMRPCNodeUpdates.testAMRMUnusableNodes(TestAMRMRPCNodeUpdates.java:125) When this happens, SchedulerEventType.NODE_UPDATE was processed before RMAppAttemptEvent.ATTEMPT_ADDED was processed. That is possible, given the test only waits for RMAppState.ACCEPTED before having the NM send its heartbeat. This can be reproduced using a custom AsyncDispatcher with a CountDownLatch. Here is the log when this happens. {noformat} App State is : ACCEPTED 2014-10-05 21:25:07,305 INFO [AsyncDispatcher event handler] attempt.RMAppAttemptImpl (RMAppAttemptImpl.java:handle(670)) - appattempt_1412569506932_0001_01 State change from NEW to SUBMITTED 2014-10-05 21:25:07,305 DEBUG [AsyncDispatcher event handler] event.AsyncDispatcher (AsyncDispatcher.java:dispatch(164)) - Dispatching the event org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeStatusEvent.EventType: STATUS_UPDATE 2014-10-05 21:25:07,305 DEBUG [AsyncDispatcher event handler] rmnode.RMNodeImpl (RMNodeImpl.java:handle(384)) - Processing 127.0.0.1:1234 of type STATUS_UPDATE AppAttempt : appattempt_1412569506932_0001_01 State is : SUBMITTED Waiting for state : ALLOCATED 2014-10-05 21:25:07,306 DEBUG [AsyncDispatcher event handler] event.AsyncDispatcher (AsyncDispatcher.java:dispatch(164)) - Dispatching the event org.apache.hadoop.yarn.server.resourcemanager.scheduler.event.AppAttemptAddedSchedulerEvent.EventType: APP_ATTEMPT_ADDED 2014-10-05 21:25:07,328 DEBUG [AsyncDispatcher event handler] event.AsyncDispatcher (AsyncDispatcher.java:dispatch(164)) - Dispatching the event org.apache.hadoop.yarn.server.resourcemanager.scheduler.event.NodeUpdateSchedulerEvent.EventType: NODE_UPDATE 2014-10-05 21:25:07,330 DEBUG [AsyncDispatcher event handler] event.AsyncDispatcher (AsyncDispatcher.java:dispatch(164)) - Dispatching the event org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptEvent.EventType: ATTEMPT_ADDED 2014-10-05 21:25:07,331 DEBUG [AsyncDispatcher event handler] attempt.RMAppAttemptImpl (RMAppAttemptImpl.java:handle(658)) - Processing event for appattempt_1412569506932_0001_000001 of type ATTEMPT_ADDED 2014-10-05 21:25:07,333 INFO [AsyncDispatcher event handler] attempt.RMAppAttemptImpl (RMAppAttemptImpl.java:handle(670)) - appattempt_1412569506932_0001_01 State change from SUBMITTED to SCHEDULED {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2649) Flaky test TestAMRMRPCNodeUpdates
[ https://issues.apache.org/jira/browse/YARN-2649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ming Ma updated YARN-2649: -- Attachment: YARN-2649.patch Fix the test code to wait until RMAppAttemptImpl gets to the RMAppAttemptState.SCHEDULED state before the NM heartbeats. Another way to fix it is to change MockRM.submitApp to waitForState on RMAppAttempt. That might address other test cases that use MockRM.submitApp. Flaky test TestAMRMRPCNodeUpdates - Key: YARN-2649 URL: https://issues.apache.org/jira/browse/YARN-2649 Project: Hadoop YARN Issue Type: Bug Reporter: Ming Ma Attachments: YARN-2649.patch Sometimes the test fails with the following error: testAMRMUnusableNodes(org.apache.hadoop.yarn.server.resourcemanager.applicationsmanager.TestAMRMRPCNodeUpdates) Time elapsed: 41.73 sec FAILURE! junit.framework.AssertionFailedError: AppAttempt state is not correct (timedout) expected:ALLOCATED but was:SCHEDULED at junit.framework.Assert.fail(Assert.java:50) at junit.framework.Assert.failNotEquals(Assert.java:287) at junit.framework.Assert.assertEquals(Assert.java:67) at org.apache.hadoop.yarn.server.resourcemanager.MockAM.waitForState(MockAM.java:82) at org.apache.hadoop.yarn.server.resourcemanager.MockRM.sendAMLaunched(MockRM.java:382) at org.apache.hadoop.yarn.server.resourcemanager.applicationsmanager.TestAMRMRPCNodeUpdates.testAMRMUnusableNodes(TestAMRMRPCNodeUpdates.java:125) When this happens, SchedulerEventType.NODE_UPDATE was processed before RMAppAttemptEvent.ATTEMPT_ADDED was processed. That is possible, given the test only waits for RMAppState.ACCEPTED before having the NM send a heartbeat. This can be reproduced using a custom AsyncDispatcher with a CountDownLatch. Here is the log when this happens. {noformat} App State is : ACCEPTED 2014-10-05 21:25:07,305 INFO [AsyncDispatcher event handler] attempt.RMAppAttemptImpl (RMAppAttemptImpl.java:handle(670)) - appattempt_1412569506932_0001_01 State change from NEW to SUBMITTED 2014-10-05 21:25:07,305 DEBUG [AsyncDispatcher event handler] event.AsyncDispatcher (AsyncDispatcher.java:dispatch(164)) - Dispatching the event org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeStatusEvent.EventType: STATUS_UPDATE 2014-10-05 21:25:07,305 DEBUG [AsyncDispatcher event handler] rmnode.RMNodeImpl (RMNodeImpl.java:handle(384)) - Processing 127.0.0.1:1234 of type STATUS_UPDATE AppAttempt : appattempt_1412569506932_0001_01 State is : SUBMITTED Waiting for state : ALLOCATED 2014-10-05 21:25:07,306 DEBUG [AsyncDispatcher event handler] event.AsyncDispatcher (AsyncDispatcher.java:dispatch(164)) - Dispatching the event org.apache.hadoop.yarn.server.resourcemanager.scheduler.event.AppAttemptAddedSchedulerEvent.EventType: APP_ATTEMPT_ADDED 2014-10-05 21:25:07,328 DEBUG [AsyncDispatcher event handler] event.AsyncDispatcher (AsyncDispatcher.java:dispatch(164)) - Dispatching the event org.apache.hadoop.yarn.server.resourcemanager.scheduler.event.NodeUpdateSchedulerEvent.EventType: NODE_UPDATE 2014-10-05 21:25:07,330 DEBUG [AsyncDispatcher event handler] event.AsyncDispatcher (AsyncDispatcher.java:dispatch(164)) - Dispatching the event org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptEvent.EventType: ATTEMPT_ADDED 2014-10-05 21:25:07,331 DEBUG [AsyncDispatcher event handler] attempt.RMAppAttemptImpl (RMAppAttemptImpl.java:handle(658)) - Processing event for appattempt_1412569506932_0001_000001 of type ATTEMPT_ADDED 2014-10-05 21:25:07,333 INFO [AsyncDispatcher event handler] attempt.RMAppAttemptImpl (RMAppAttemptImpl.java:handle(670)) - appattempt_1412569506932_0001_01 State change from SUBMITTED to SCHEDULED {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
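A minimal sketch of the ordering fix described above, assuming the MockRM/MockNM/MockAM test helpers named in the stack trace (a fragment that would slot into testAMRMUnusableNodes; exact signatures may differ):
{code:java}
// Sketch only: wait for the attempt to reach SCHEDULED before the NM
// heartbeats, so NODE_UPDATE can no longer be dispatched ahead of
// ATTEMPT_ADDED. Helper names follow MockRM/MockNM; details may differ.
RMApp app = rm.submitApp(2000);                  // waits only for ACCEPTED
RMAppAttempt attempt = app.getCurrentAppAttempt();

// the fix: block until the scheduler has processed ATTEMPT_ADDED
rm.waitForState(attempt.getAppAttemptId(), RMAppAttemptState.SCHEDULED);

nm1.nodeHeartbeat(true);   // NODE_UPDATE now arrives after ATTEMPT_ADDED
MockAM am = rm.sendAMLaunched(attempt.getAppAttemptId());
{code}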
[jira] [Updated] (YARN-2629) Make distributed shell use the domain-based timeline ACLs
[ https://issues.apache.org/jira/browse/YARN-2629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-2629: -- Attachment: YARN-2629.3.patch Uploaded a new patch: 1. Fix the test failure. 2. Remove two lines of unnecessary code in TimelineClientImpl. 3. Improve the code that publishes entities in the DS AM. Make distributed shell use the domain-based timeline ACLs - Key: YARN-2629 URL: https://issues.apache.org/jira/browse/YARN-2629 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Zhijie Shen Assignee: Zhijie Shen Attachments: YARN-2629.1.patch, YARN-2629.2.patch, YARN-2629.3.patch To demonstrate the usage of this feature (YARN-2102), it's good to have the distributed shell create the domain and post its timeline entities into this private space. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
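For reference, a hedged sketch of what domain-based publishing looks like with the TimelineClient API; the domain id and the reader/writer ACL strings below are illustrative values, not taken from the patch:
{code:java}
// Sketch: create a private timeline domain and post entities into it.
// conf and appAttemptId are assumed to exist in the surrounding AM code.
TimelineClient client = TimelineClient.createTimelineClient();
client.init(conf);
client.start();

TimelineDomain domain = new TimelineDomain();
domain.setId("DS_DOMAIN_" + appAttemptId.toString()); // illustrative id
domain.setReaders("reader_user");  // who may read entities in this domain
domain.setWriters("writer_user");  // who may write entities into it
client.putDomain(domain);

TimelineEntity entity = new TimelineEntity();
entity.setEntityType("DS_APP_ATTEMPT");
entity.setEntityId(appAttemptId.toString());
entity.setDomainId(domain.getId()); // scope the entity to the private domain
client.putEntities(entity);
{code}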
[jira] [Commented] (YARN-2649) Flaky test TestAMRMRPCNodeUpdates
[ https://issues.apache.org/jira/browse/YARN-2649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14161210#comment-14161210 ] Jian He commented on YARN-2649: --- [~mingma], thanks for working on this! bq. Another way to fix it is to change MockRM.submitApp to waitForState on RMAppAttempt. That might address other test cases that use MockRM.submitApp. I recently saw some other similar test failures, e.g. YARN-2483; maybe this is what we should do. Could you also run all tests locally, to make sure we don't introduce regression failures? Thanks. Flaky test TestAMRMRPCNodeUpdates - Key: YARN-2649 URL: https://issues.apache.org/jira/browse/YARN-2649 Project: Hadoop YARN Issue Type: Bug Reporter: Ming Ma Attachments: YARN-2649.patch Sometimes the test fails with the following error: testAMRMUnusableNodes(org.apache.hadoop.yarn.server.resourcemanager.applicationsmanager.TestAMRMRPCNodeUpdates) Time elapsed: 41.73 sec FAILURE! junit.framework.AssertionFailedError: AppAttempt state is not correct (timedout) expected:ALLOCATED but was:SCHEDULED at junit.framework.Assert.fail(Assert.java:50) at junit.framework.Assert.failNotEquals(Assert.java:287) at junit.framework.Assert.assertEquals(Assert.java:67) at org.apache.hadoop.yarn.server.resourcemanager.MockAM.waitForState(MockAM.java:82) at org.apache.hadoop.yarn.server.resourcemanager.MockRM.sendAMLaunched(MockRM.java:382) at org.apache.hadoop.yarn.server.resourcemanager.applicationsmanager.TestAMRMRPCNodeUpdates.testAMRMUnusableNodes(TestAMRMRPCNodeUpdates.java:125) When this happens, SchedulerEventType.NODE_UPDATE was processed before RMAppAttemptEvent.ATTEMPT_ADDED was processed. That is possible, given the test only waits for RMAppState.ACCEPTED before having the NM send a heartbeat. This can be reproduced using a custom AsyncDispatcher with a CountDownLatch. Here is the log when this happens.
{noformat} App State is : ACCEPTED 2014-10-05 21:25:07,305 INFO [AsyncDispatcher event handler] attempt.RMAppAttemptImpl (RMAppAttemptImpl.java:handle(670)) - appattempt_1412569506932_0001_01 State change from NEW to SUBMITTED 2014-10-05 21:25:07,305 DEBUG [AsyncDispatcher event handler] event.AsyncDispatcher (AsyncDispatcher.java:dispatch(164)) - Dispatching the event org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeStatusEvent.EventType: STATUS_UPDATE 2014-10-05 21:25:07,305 DEBUG [AsyncDispatcher event handler] rmnode.RMNodeImpl (RMNodeImpl.java:handle(384)) - Processing 127.0.0.1:1234 of type STATUS_UPDATE AppAttempt : appattempt_1412569506932_0001_01 State is : SUBMITTED Waiting for state : ALLOCATED 2014-10-05 21:25:07,306 DEBUG [AsyncDispatcher event handler] event.AsyncDispatcher (AsyncDispatcher.java:dispatch(164)) - Dispatching the event org.apache.hadoop.yarn.server.resourcemanager.scheduler.event.AppAttemptAddedSchedulerEvent.EventType: APP_ATTEMPT_ADDED 2014-10-05 21:25:07,328 DEBUG [AsyncDispatcher event handler] event.AsyncDispatcher (AsyncDispatcher.java:dispatch(164)) - Dispatching the event org.apache.hadoop.yarn.server.resourcemanager.scheduler.event.NodeUpdateSchedulerEvent.EventType: NODE_UPDATE 2014-10-05 21:25:07,330 DEBUG [AsyncDispatcher event handler] event.AsyncDispatcher (AsyncDispatcher.java:dispatch(164)) - Dispatching the event org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptEvent.EventType: ATTEMPT_ADDED 2014-10-05 21:25:07,331 DEBUG [AsyncDispatcher event handler] attempt.RMAppAttemptImpl (RMAppAttemptImpl.java:handle(658)) - Processing event for appattempt_1412569506932_0001_000001 of type ATTEMPT_ADDED 2014-10-05 21:25:07,333 INFO [AsyncDispatcher event handler] attempt.RMAppAttemptImpl (RMAppAttemptImpl.java:handle(670)) - appattempt_1412569506932_0001_01 State change from SUBMITTED to SCHEDULED {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2641) improve node decommission latency in RM.
[ https://issues.apache.org/jira/browse/YARN-2641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu updated YARN-2641: Attachment: YARN-2641.000.patch improve node decommission latency in RM. Key: YARN-2641 URL: https://issues.apache.org/jira/browse/YARN-2641 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Affects Versions: 2.5.0 Reporter: zhihai xu Assignee: zhihai xu Attachments: YARN-2641.000.patch improve node decommission latency in RM. Currently the node decommission only happens after the RM receives a nodeHeartbeat from the NodeManager. The node heartbeat interval is configurable; the default value is 1 second. It would be better to do the decommission during RM refresh (NodesListManager) instead of on nodeHeartbeat (ResourceTrackerService). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
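A rough sketch of the proposed change, under the assumption that NodesListManager fires the decommission event itself during refresh; the method and field names here are illustrative, not taken from the patch:
{code:java}
// Sketch: on refreshNodes(), immediately decommission nodes that are no
// longer valid, instead of waiting up to a heartbeat interval for
// ResourceTrackerService to notice. Names are illustrative.
public void refreshNodes(Configuration yarnConf) throws IOException {
  refreshHostsReader(yarnConf);  // reload the include/exclude host lists
  for (NodeId nodeId : rmContext.getRMNodes().keySet()) {
    if (!isValidNode(nodeId.getHost())) {
      // fire DECOMMISSION now rather than on the node's next STATUS_UPDATE
      rmContext.getDispatcher().getEventHandler().handle(
          new RMNodeEvent(nodeId, RMNodeEventType.DECOMMISSION));
    }
  }
}
{code}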
[jira] [Commented] (YARN-2583) Modify the LogDeletionService to support Log aggregation for LRS
[ https://issues.apache.org/jira/browse/YARN-2583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14161234#comment-14161234 ] Hadoop QA commented on YARN-2583: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12673234/YARN-2583.3.patch against trunk revision 519e5a7. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 2 new or modified test files. {color:red}-1 javac{color}. The applied patch generated 1268 javac compiler warnings (more than the trunk's current 1267 warnings). {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5289//testReport/ Javac warnings: https://builds.apache.org/job/PreCommit-YARN-Build/5289//artifact/patchprocess/diffJavacWarnings.txt Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5289//console This message is automatically generated. Modify the LogDeletionService to support Log aggregation for LRS Key: YARN-2583 URL: https://issues.apache.org/jira/browse/YARN-2583 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager, resourcemanager Reporter: Xuan Gong Assignee: Xuan Gong Attachments: YARN-2583.1.patch, YARN-2583.2.patch, YARN-2583.3.patch Currently, AggregatedLogDeletionService will delete old logs from HDFS. It checks the cut-off time: if all logs for this application are older than the cut-off time, the app-log-dir is deleted from HDFS. This will not work for LRS, since we expect an LRS application to keep running for a long time. Two different scenarios: 1) If we configure the rollingIntervalSeconds, new log files will always be uploaded to HDFS. The number of log files for this application will become larger and larger, and no log files will ever be deleted. 2) If we do not configure the rollingIntervalSeconds, the log file can only be uploaded to HDFS after the application is finished. It is very possible that the logs are uploaded after the cut-off time, which will cause problems because by that time the app-log-dir for this application in HDFS has already been deleted. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
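One way to make the deletion check LRS-friendly, sketched under the assumption that the service inspects per-file modification times inside the app-log-dir rather than the directory as a whole; isApplicationRunning() and allFilesDeleted() are hypothetical helpers:
{code:java}
// Sketch: delete only the rolled log files older than the cut-off, and keep
// the app-log-dir itself alive while the application is still running.
long cutoffMillis = System.currentTimeMillis() - retentionMillis;
for (FileStatus node : fs.listStatus(appLogDir)) {   // one file per NM upload
  if (node.getModificationTime() < cutoffMillis) {
    fs.delete(node.getPath(), true);                 // old rolled log
  }
}
// only remove the directory once the app has finished and fully aged out
if (!isApplicationRunning(appId) && allFilesDeleted(fs, appLogDir)) {
  fs.delete(appLogDir, true);
}
{code}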
[jira] [Updated] (YARN-2583) Modify the LogDeletionService to support Log aggregation for LRS
[ https://issues.apache.org/jira/browse/YARN-2583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuan Gong updated YARN-2583: Attachment: YARN-2583.3.1.patch Modify the LogDeletionService to support Log aggregation for LRS Key: YARN-2583 URL: https://issues.apache.org/jira/browse/YARN-2583 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager, resourcemanager Reporter: Xuan Gong Assignee: Xuan Gong Attachments: YARN-2583.1.patch, YARN-2583.2.patch, YARN-2583.3.1.patch, YARN-2583.3.patch Currently, AggregatedLogDeletionService will delete old logs from HDFS. It checks the cut-off time: if all logs for this application are older than the cut-off time, the app-log-dir is deleted from HDFS. This will not work for LRS, since we expect an LRS application to keep running for a long time. Two different scenarios: 1) If we configure the rollingIntervalSeconds, new log files will always be uploaded to HDFS. The number of log files for this application will become larger and larger, and no log files will ever be deleted. 2) If we do not configure the rollingIntervalSeconds, the log file can only be uploaded to HDFS after the application is finished. It is very possible that the logs are uploaded after the cut-off time, which will cause problems because by that time the app-log-dir for this application in HDFS has already been deleted. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2583) Modify the LogDeletionService to support Log aggregation for LRS
[ https://issues.apache.org/jira/browse/YARN-2583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14161238#comment-14161238 ] Hadoop QA commented on YARN-2583: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12673234/YARN-2583.3.patch against trunk revision 519e5a7. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 2 new or modified test files. {color:red}-1 javac{color}. The applied patch generated 1268 javac compiler warnings (more than the trunk's current 1267 warnings). {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5290//testReport/ Javac warnings: https://builds.apache.org/job/PreCommit-YARN-Build/5290//artifact/patchprocess/diffJavacWarnings.txt Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5290//console This message is automatically generated. Modify the LogDeletionService to support Log aggregation for LRS Key: YARN-2583 URL: https://issues.apache.org/jira/browse/YARN-2583 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager, resourcemanager Reporter: Xuan Gong Assignee: Xuan Gong Attachments: YARN-2583.1.patch, YARN-2583.2.patch, YARN-2583.3.1.patch, YARN-2583.3.patch Currently, AggregatedLogDeletionService will delete old logs from HDFS. It checks the cut-off time: if all logs for this application are older than the cut-off time, the app-log-dir is deleted from HDFS. This will not work for LRS, since we expect an LRS application to keep running for a long time. Two different scenarios: 1) If we configure the rollingIntervalSeconds, new log files will always be uploaded to HDFS. The number of log files for this application will become larger and larger, and no log files will ever be deleted. 2) If we do not configure the rollingIntervalSeconds, the log file can only be uploaded to HDFS after the application is finished. It is very possible that the logs are uploaded after the cut-off time, which will cause problems because by that time the app-log-dir for this application in HDFS has already been deleted. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1879) Mark Idempotent/AtMostOnce annotations to ApplicationMasterProtocol
[ https://issues.apache.org/jira/browse/YARN-1879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14161241#comment-14161241 ] Tsuyoshi OZAWA commented on YARN-1879: -- For now, I have no idea how to reconstruct the same response after failover. Currently the latest patch only returns an empty response. This is one discussion point of this design. Mark Idempotent/AtMostOnce annotations to ApplicationMasterProtocol --- Key: YARN-1879 URL: https://issues.apache.org/jira/browse/YARN-1879 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Jian He Assignee: Tsuyoshi OZAWA Priority: Critical Attachments: YARN-1879.1.patch, YARN-1879.1.patch, YARN-1879.11.patch, YARN-1879.12.patch, YARN-1879.13.patch, YARN-1879.14.patch, YARN-1879.15.patch, YARN-1879.16.patch, YARN-1879.17.patch, YARN-1879.18.patch, YARN-1879.19.patch, YARN-1879.2-wip.patch, YARN-1879.2.patch, YARN-1879.20.patch, YARN-1879.21.patch, YARN-1879.22.patch, YARN-1879.3.patch, YARN-1879.4.patch, YARN-1879.5.patch, YARN-1879.6.patch, YARN-1879.7.patch, YARN-1879.8.patch, YARN-1879.9.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2629) Make distributed shell use the domain-based timeline ACLs
[ https://issues.apache.org/jira/browse/YARN-2629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14161246#comment-14161246 ] Hadoop QA commented on YARN-2629: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12673235/YARN-2629.3.patch against trunk revision 519e5a7. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5291//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5291//console This message is automatically generated. Make distributed shell use the domain-based timeline ACLs - Key: YARN-2629 URL: https://issues.apache.org/jira/browse/YARN-2629 Project: Hadoop YARN Issue Type: Sub-task Components: timelineserver Reporter: Zhijie Shen Assignee: Zhijie Shen Attachments: YARN-2629.1.patch, YARN-2629.2.patch, YARN-2629.3.patch To demonstrate the usage of this feature (YARN-2102), it's good to have the distributed shell create the domain and post its timeline entities into this private space. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2649) Flaky test TestAMRMRPCNodeUpdates
[ https://issues.apache.org/jira/browse/YARN-2649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14161262#comment-14161262 ] Hadoop QA commented on YARN-2649: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12673231/YARN-2649.patch against trunk revision 519e5a7. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5288//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5288//console This message is automatically generated. Flaky test TestAMRMRPCNodeUpdates - Key: YARN-2649 URL: https://issues.apache.org/jira/browse/YARN-2649 Project: Hadoop YARN Issue Type: Bug Reporter: Ming Ma Attachments: YARN-2649.patch Sometimes the test fails with the following error: testAMRMUnusableNodes(org.apache.hadoop.yarn.server.resourcemanager.applicationsmanager.TestAMRMRPCNodeUpdates) Time elapsed: 41.73 sec FAILURE! junit.framework.AssertionFailedError: AppAttempt state is not correct (timedout) expected:ALLOCATED but was:SCHEDULED at junit.framework.Assert.fail(Assert.java:50) at junit.framework.Assert.failNotEquals(Assert.java:287) at junit.framework.Assert.assertEquals(Assert.java:67) at org.apache.hadoop.yarn.server.resourcemanager.MockAM.waitForState(MockAM.java:82) at org.apache.hadoop.yarn.server.resourcemanager.MockRM.sendAMLaunched(MockRM.java:382) at org.apache.hadoop.yarn.server.resourcemanager.applicationsmanager.TestAMRMRPCNodeUpdates.testAMRMUnusableNodes(TestAMRMRPCNodeUpdates.java:125) When this happens, SchedulerEventType.NODE_UPDATE was processed before RMAppAttemptEvent.ATTEMPT_ADDED was processed. That is possible, given the test only waits for RMAppState.ACCEPTED before having the NM send a heartbeat. This can be reproduced using a custom AsyncDispatcher with a CountDownLatch. Here is the log when this happens.
{noformat} App State is : ACCEPTED 2014-10-05 21:25:07,305 INFO [AsyncDispatcher event handler] attempt.RMAppAttemptImpl (RMAppAttemptImpl.java:handle(670)) - appattempt_1412569506932_0001_01 State change from NEW to SUBMITTED 2014-10-05 21:25:07,305 DEBUG [AsyncDispatcher event handler] event.AsyncDispatcher (AsyncDispatcher.java:dispatch(164)) - Dispatching the event org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeStatusEvent.EventType: STATUS_UPDATE 2014-10-05 21:25:07,305 DEBUG [AsyncDispatcher event handler] rmnode.RMNodeImpl (RMNodeImpl.java:handle(384)) - Processing 127.0.0.1:1234 of type STATUS_UPDATE AppAttempt : appattempt_1412569506932_0001_01 State is : SUBMITTED Waiting for state : ALLOCATED 2014-10-05 21:25:07,306 DEBUG [AsyncDispatcher event handler] event.AsyncDispatcher (AsyncDispatcher.java:dispatch(164)) - Dispatching the event org.apache.hadoop.yarn.server.resourcemanager.scheduler.event.AppAttemptAddedSchedulerEvent.EventType: APP_ATTEMPT_ADDED 2014-10-05 21:25:07,328 DEBUG [AsyncDispatcher event handler] event.AsyncDispatcher (AsyncDispatcher.java:dispatch(164)) - Dispatching the event org.apache.hadoop.yarn.server.resourcemanager.scheduler.event.NodeUpdateSchedulerEvent.EventType: NODE_UPDATE 2014-10-05 21:25:07,330 DEBUG [AsyncDispatcher event handler] event.AsyncDispatcher (AsyncDispatcher.java:dispatch(164)) - Dispatching the event org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptEvent.EventType: ATTEMPT_ADDED 2014-10-05 21:25:07,331 DEBUG [AsyncDispatcher event handler] attempt.RMAppAttemptImpl (RMAppAttemptImpl.java:handle(658)) - Processing event for appattempt_1412569506932_0001_000001 of type ATTEMPT_ADDED 2014-10-05 21:25:07,333 INFO [AsyncDispatcher event handler] attempt.RMAppAttemptImpl (RMAppAttemptImpl.java:handle(670)) - appattempt_1412569506932_0001_01 State change from SUBMITTED to SCHEDULED {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2583) Modify the LogDeletionService to support Log aggregation for LRS
[ https://issues.apache.org/jira/browse/YARN-2583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14161278#comment-14161278 ] Hadoop QA commented on YARN-2583: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12673242/YARN-2583.3.1.patch against trunk revision 519e5a7. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 2 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager: org.apache.hadoop.yarn.server.nodemanager.TestNodeStatusUpdater org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.TestContainerLaunch The following test timeouts occurred in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager: org.apache.hadoop.yarn.server.nodemanager.securTests {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5294//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5294//console This message is automatically generated. Modify the LogDeletionService to support Log aggregation for LRS Key: YARN-2583 URL: https://issues.apache.org/jira/browse/YARN-2583 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager, resourcemanager Reporter: Xuan Gong Assignee: Xuan Gong Attachments: YARN-2583.1.patch, YARN-2583.2.patch, YARN-2583.3.1.patch, YARN-2583.3.patch Currently, AggregatedLogDeletionService will delete old logs from HDFS. It checks the cut-off time: if all logs for this application are older than the cut-off time, the app-log-dir is deleted from HDFS. This will not work for LRS, since we expect an LRS application to keep running for a long time. Two different scenarios: 1) If we configure the rollingIntervalSeconds, new log files will always be uploaded to HDFS. The number of log files for this application will become larger and larger, and no log files will ever be deleted. 2) If we do not configure the rollingIntervalSeconds, the log file can only be uploaded to HDFS after the application is finished. It is very possible that the logs are uploaded after the cut-off time, which will cause problems because by that time the app-log-dir for this application in HDFS has already been deleted. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2583) Modify the LogDeletionService to support Log aggregation for LRS
[ https://issues.apache.org/jira/browse/YARN-2583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14161279#comment-14161279 ] Hadoop QA commented on YARN-2583: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12673242/YARN-2583.3.1.patch against trunk revision 519e5a7. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 2 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager: org.apache.hadoop.yarn.server.nodemanager.TestNodeStatusUpdater org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.TestLogAggregationService org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.TestContainerLaunch The following test timeouts occurred in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager: org.apache.hadoop.yarn.server.nodemanager.securTests {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5295//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5295//console This message is automatically generated. Modify the LogDeletionService to support Log aggregation for LRS Key: YARN-2583 URL: https://issues.apache.org/jira/browse/YARN-2583 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager, resourcemanager Reporter: Xuan Gong Assignee: Xuan Gong Attachments: YARN-2583.1.patch, YARN-2583.2.patch, YARN-2583.3.1.patch, YARN-2583.3.patch Currently, AggregatedLogDeletionService will delete old logs from HDFS. It checks the cut-off time: if all logs for this application are older than the cut-off time, the app-log-dir is deleted from HDFS. This will not work for LRS, since we expect an LRS application to keep running for a long time. Two different scenarios: 1) If we configure the rollingIntervalSeconds, new log files will always be uploaded to HDFS. The number of log files for this application will become larger and larger, and no log files will ever be deleted. 2) If we do not configure the rollingIntervalSeconds, the log file can only be uploaded to HDFS after the application is finished. It is very possible that the logs are uploaded after the cut-off time, which will cause problems because by that time the app-log-dir for this application in HDFS has already been deleted. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1879) Mark Idempotent/AtMostOnce annotations to ApplicationMasterProtocol
[ https://issues.apache.org/jira/browse/YARN-1879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14161280#comment-14161280 ] Jian He commented on YARN-1879: --- I think we are mixing two issues in this jira: 1. Mark annotations on the protocol for failover. (RM work-preserving failover won't work without proper protocol annotations. RetryCache won't help in this scenario, as the cache simply gets cleaned up after failover/restart.) 2. Change the API to return the same response for duplicate requests. I propose we do 1) first, which is what really affects work-preserving RM failover, and do 2) separately. Mark Idempotent/AtMostOnce annotations to ApplicationMasterProtocol --- Key: YARN-1879 URL: https://issues.apache.org/jira/browse/YARN-1879 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Jian He Assignee: Tsuyoshi OZAWA Priority: Critical Attachments: YARN-1879.1.patch, YARN-1879.1.patch, YARN-1879.11.patch, YARN-1879.12.patch, YARN-1879.13.patch, YARN-1879.14.patch, YARN-1879.15.patch, YARN-1879.16.patch, YARN-1879.17.patch, YARN-1879.18.patch, YARN-1879.19.patch, YARN-1879.2-wip.patch, YARN-1879.2.patch, YARN-1879.20.patch, YARN-1879.21.patch, YARN-1879.22.patch, YARN-1879.3.patch, YARN-1879.4.patch, YARN-1879.5.patch, YARN-1879.6.patch, YARN-1879.7.patch, YARN-1879.8.patch, YARN-1879.9.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2583) Modify the LogDeletionService to support Log aggregation for LRS
[ https://issues.apache.org/jira/browse/YARN-2583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14161282#comment-14161282 ] Xuan Gong commented on YARN-2583: - Those test cases fail because of a binding exception. I do not think they are related. Modify the LogDeletionService to support Log aggregation for LRS Key: YARN-2583 URL: https://issues.apache.org/jira/browse/YARN-2583 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager, resourcemanager Reporter: Xuan Gong Assignee: Xuan Gong Attachments: YARN-2583.1.patch, YARN-2583.2.patch, YARN-2583.3.1.patch, YARN-2583.3.patch Currently, AggregatedLogDeletionService will delete old logs from HDFS. It checks the cut-off time: if all logs for this application are older than the cut-off time, the app-log-dir is deleted from HDFS. This will not work for LRS, since we expect an LRS application to keep running for a long time. Two different scenarios: 1) If we configure the rollingIntervalSeconds, new log files will always be uploaded to HDFS. The number of log files for this application will become larger and larger, and no log files will ever be deleted. 2) If we do not configure the rollingIntervalSeconds, the log file can only be uploaded to HDFS after the application is finished. It is very possible that the logs are uploaded after the cut-off time, which will cause problems because by that time the app-log-dir for this application in HDFS has already been deleted. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-1879) Mark Idempotent/AtMostOnce annotations to ApplicationMasterProtocol for RM fail over
[ https://issues.apache.org/jira/browse/YARN-1879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jian He updated YARN-1879: -- Summary: Mark Idempotent/AtMostOnce annotations to ApplicationMasterProtocol for RM fail over (was: Mark Idempotent/AtMostOnce annotations to ApplicationMasterProtocol) Mark Idempotent/AtMostOnce annotations to ApplicationMasterProtocol for RM fail over Key: YARN-1879 URL: https://issues.apache.org/jira/browse/YARN-1879 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Jian He Assignee: Tsuyoshi OZAWA Priority: Critical Attachments: YARN-1879.1.patch, YARN-1879.1.patch, YARN-1879.11.patch, YARN-1879.12.patch, YARN-1879.13.patch, YARN-1879.14.patch, YARN-1879.15.patch, YARN-1879.16.patch, YARN-1879.17.patch, YARN-1879.18.patch, YARN-1879.19.patch, YARN-1879.2-wip.patch, YARN-1879.2.patch, YARN-1879.20.patch, YARN-1879.21.patch, YARN-1879.22.patch, YARN-1879.3.patch, YARN-1879.4.patch, YARN-1879.5.patch, YARN-1879.6.patch, YARN-1879.7.patch, YARN-1879.8.patch, YARN-1879.9.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1857) CapacityScheduler headroom doesn't account for other AM's running
[ https://issues.apache.org/jira/browse/YARN-1857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14161300#comment-14161300 ] Hadoop QA commented on YARN-1857: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12673238/YARN-1857.6.patch against trunk revision 519e5a7. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.yarn.server.resourcemanager.applicationsmanager.TestAMRestart {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5292//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5292//console This message is automatically generated. CapacityScheduler headroom doesn't account for other AM's running - Key: YARN-1857 URL: https://issues.apache.org/jira/browse/YARN-1857 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler Affects Versions: 2.3.0 Reporter: Thomas Graves Assignee: Chen He Priority: Critical Attachments: YARN-1857.1.patch, YARN-1857.2.patch, YARN-1857.3.patch, YARN-1857.4.patch, YARN-1857.5.patch, YARN-1857.6.patch, YARN-1857.patch, YARN-1857.patch, YARN-1857.patch It's possible to get an application to hang forever (or a long time) in a cluster with multiple users. The reason is that the headroom sent to the application is based on the user limit but doesn't account for other application masters using space in that queue. So the headroom (user limit - user consumed) can be 0 even though the cluster is 100% full, because the other space is being used by application masters from other users. For instance, suppose you have a cluster with 1 queue, the user limit is 100%, and multiple users are submitting applications. One very large application by user 1 starts up, runs most of its maps, and starts running reducers. Other users try to start applications and get their application masters started, but no tasks. The very large application then gets to the point where it has consumed the rest of the cluster resources with all reduces, but at this point it still needs to finish a few maps. The headroom being sent to this application is only based on the user limit (which is 100% of the cluster capacity): it's using, let's say, 95% of the cluster for reduces, and the other 5% is being used by other users running application masters. The MRAppMaster thinks it still has 5%, so it doesn't know that it should kill a reduce in order to run a map. This can happen in other scenarios also. Generally, in a large cluster with multiple queues this shouldn't cause a hang forever, but it could cause the application to take much longer.
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
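The gist of the fix being tested, as a hedged sketch: cap the headroom by what is actually still available in the queue, so space held by other users' AMs is no longer reported as free. Names approximate the CapacityScheduler's LeafQueue internals and are not verbatim from the patch:
{code:java}
// Sketch: headroom = min(userLimit - userConsumed, queueMax - queueUsed).
// queueUsedResources already includes every user's AM containers, which is
// exactly the space the old formula failed to subtract.
Resource userLimit = computeUserLimit(user, clusterResource);   // assumed helper
Resource userConsumed = user.getTotalConsumedResources();       // assumed helper
Resource queueAvailable = Resources.subtract(
    queueMaxCapacity, queueUsedResources);

Resource headroom = Resources.min(resourceCalculator, clusterResource,
    Resources.subtract(userLimit, userConsumed),  // old headroom
    queueAvailable);                              // new cap
{code}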
[jira] [Commented] (YARN-2641) improve node decommission latency in RM.
[ https://issues.apache.org/jira/browse/YARN-2641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14161299#comment-14161299 ] Hadoop QA commented on YARN-2641: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12673239/YARN-2641.000.patch against trunk revision 519e5a7. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5293//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5293//console This message is automatically generated. improve node decommission latency in RM. Key: YARN-2641 URL: https://issues.apache.org/jira/browse/YARN-2641 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Affects Versions: 2.5.0 Reporter: zhihai xu Assignee: zhihai xu Attachments: YARN-2641.000.patch improve node decommission latency in RM. Currently the node decommission only happens after the RM receives a nodeHeartbeat from the NodeManager. The node heartbeat interval is configurable; the default value is 1 second. It would be better to do the decommission during RM refresh (NodesListManager) instead of on nodeHeartbeat (ResourceTrackerService). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-913) Add a way to register long-lived services in a YARN cluster
[ https://issues.apache.org/jira/browse/YARN-913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14161353#comment-14161353 ] Sanjay Radia commented on YARN-913: --- Some feedback: # rename {{RegistryOperations.create()}} to {{bind()}} # rename {{org/apache/hadoop/yarn/registry/client/services}} to {{org/apache/hadoop/yarn/registry/client/impl}} # move all ZK classes under {{org/apache/hadoop/yarn/registry/client/impl/zk}}, i.e. the current implementations of the registry client # {{RegistryOperations}} implementations to remove declaration of exceptions other than IOE. # {{RegistryOperations.resolve()}} implementation should not mention record headers in exception text: that's an implementation detail # Add README under {{org.apache.hadoop.yarn.registry.server}} to emphasize this is server-side code # Allow {{ServiceRecord}} to support arbitrary key-values # remove the {{yarn_id}} and {{yarn_persistence}} fields from {{ServiceRecord}}, moving them to the set of arbitrary key-values. This ensures that the records don't explicitly hard-code the assumption that these are YARN apps. Add a way to register long-lived services in a YARN cluster --- Key: YARN-913 URL: https://issues.apache.org/jira/browse/YARN-913 Project: Hadoop YARN Issue Type: New Feature Components: api, resourcemanager Affects Versions: 2.5.0, 2.4.1 Reporter: Steve Loughran Assignee: Steve Loughran Attachments: 2014-09-03_Proposed_YARN_Service_Registry.pdf, 2014-09-08_YARN_Service_Registry.pdf, RegistrationServiceDetails.txt, YARN-913-001.patch, YARN-913-002.patch, YARN-913-003.patch, YARN-913-003.patch, YARN-913-004.patch, YARN-913-006.patch, YARN-913-007.patch, YARN-913-008.patch, YARN-913-009.patch, YARN-913-010.patch, YARN-913-011.patch, YARN-913-012.patch, YARN-913-013.patch, YARN-913-014.patch, YARN-913-015.patch, YARN-913-016.patch, YARN-913-017.patch, YARN-913-018.patch, yarnregistry.pdf, yarnregistry.pdf, yarnregistry.tla In a YARN cluster you can't predict where services will come up -or on what ports. The services need to work those things out as they come up and then publish them somewhere. Applications need to be able to find the service instance they are to bond to -and not any others in the cluster. Some kind of service registry -in the RM, in ZK, could do this. If the RM held the write access to the ZK nodes, it would be more secure than having apps register with ZK themselves. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
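A sketch of the last two suggestions, assuming {{ServiceRecord}} gains a generic attribute map and the YARN-specific fields move into it; the key names are illustrative:
{code:java}
// Sketch: ServiceRecord with arbitrary key-value attributes in place of the
// hard-coded yarn_id / yarn_persistence fields. Key names are illustrative.
import java.util.HashMap;
import java.util.Map;

public class ServiceRecord {
  private final Map<String, String> attributes = new HashMap<String, String>();

  public void set(String key, String value) {
    attributes.put(key, value);
  }

  public String get(String key, String defVal) {
    String v = attributes.get(key);
    return v == null ? defVal : v;
  }
}
{code}
A YARN app would then record its identity as ordinary attributes, e.g. {{record.set("yarn:id", applicationId.toString())}} and {{record.set("yarn:persistence", "application")}}, so nothing in the record format itself assumes the service is a YARN application.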
[jira] [Commented] (YARN-1879) Mark Idempotent/AtMostOnce annotations to ApplicationMasterProtocol for RM fail over
[ https://issues.apache.org/jira/browse/YARN-1879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14161368#comment-14161368 ] Karthik Kambatla commented on YARN-1879: Thanks for looking at it closely, Jian and Xuan. I had missed some of these points and spent a little more time thinking it through. bq. I think we are mixing two issues in this jira When we mark an API Idempotent or AtMostOnce, the retry-policies will end up re-invoking the API on the other RM in case of a failover. This is okay only if the RM handles these duplicate requests. Further, my understanding is that the behavior of Idempotent APIs should be the same on each invocation; i.e., the client should receive the exact same response too. If we handle duplicate requests but return a different response to the client on duplicate calls, we can mark it AtMostOnce. If we return the same response, we can go ahead and mark it Idempotent. Needless to say, the RM should definitely handle duplicate requests gracefully. Does that sound reasonable? Mark Idempotent/AtMostOnce annotations to ApplicationMasterProtocol for RM fail over Key: YARN-1879 URL: https://issues.apache.org/jira/browse/YARN-1879 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Jian He Assignee: Tsuyoshi OZAWA Priority: Critical Attachments: YARN-1879.1.patch, YARN-1879.1.patch, YARN-1879.11.patch, YARN-1879.12.patch, YARN-1879.13.patch, YARN-1879.14.patch, YARN-1879.15.patch, YARN-1879.16.patch, YARN-1879.17.patch, YARN-1879.18.patch, YARN-1879.19.patch, YARN-1879.2-wip.patch, YARN-1879.2.patch, YARN-1879.20.patch, YARN-1879.21.patch, YARN-1879.22.patch, YARN-1879.3.patch, YARN-1879.4.patch, YARN-1879.5.patch, YARN-1879.6.patch, YARN-1879.7.patch, YARN-1879.8.patch, YARN-1879.9.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
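Concretely, the annotation choice under discussion looks like this (a sketch against ApplicationMasterProtocol using the org.apache.hadoop.io.retry annotations; which annotation each method finally gets is exactly the open question, so the per-method assignment below is illustrative only):
{code:java}
// Sketch: retry annotations on ApplicationMasterProtocol. @Idempotent
// promises the same response on re-invocation after failover; @AtMostOnce
// only promises that a duplicate request is handled gracefully.
// The assignment per method is illustrative, not the patch's final choice.
public interface ApplicationMasterProtocol {
  @Idempotent
  RegisterApplicationMasterResponse registerApplicationMaster(
      RegisterApplicationMasterRequest request)
      throws YarnException, IOException;

  @AtMostOnce
  FinishApplicationMasterResponse finishApplicationMaster(
      FinishApplicationMasterRequest request)
      throws YarnException, IOException;

  @AtMostOnce
  AllocateResponse allocate(AllocateRequest request)
      throws YarnException, IOException;
}
{code}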
[jira] [Commented] (YARN-1857) CapacityScheduler headroom doesn't account for other AM's running
[ https://issues.apache.org/jira/browse/YARN-1857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14161373#comment-14161373 ] Hadoop QA commented on YARN-1857: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12673238/YARN-1857.6.patch against trunk revision 519e5a7. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.yarn.server.resourcemanager.applicationsmanager.TestAMRestart {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5296//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5296//console This message is automatically generated. CapacityScheduler headroom doesn't account for other AM's running - Key: YARN-1857 URL: https://issues.apache.org/jira/browse/YARN-1857 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler Affects Versions: 2.3.0 Reporter: Thomas Graves Assignee: Chen He Priority: Critical Attachments: YARN-1857.1.patch, YARN-1857.2.patch, YARN-1857.3.patch, YARN-1857.4.patch, YARN-1857.5.patch, YARN-1857.6.patch, YARN-1857.patch, YARN-1857.patch, YARN-1857.patch It's possible to get an application to hang forever (or a long time) in a cluster with multiple users. The reason is that the headroom sent to the application is based on the user limit but doesn't account for other application masters using space in that queue. So the headroom (user limit - user consumed) can be 0 even though the cluster is 100% full, because the other space is being used by application masters from other users. For instance, suppose you have a cluster with 1 queue, the user limit is 100%, and multiple users are submitting applications. One very large application by user 1 starts up, runs most of its maps, and starts running reducers. Other users try to start applications and get their application masters started, but no tasks. The very large application then gets to the point where it has consumed the rest of the cluster resources with all reduces, but at this point it still needs to finish a few maps. The headroom being sent to this application is only based on the user limit (which is 100% of the cluster capacity): it's using, let's say, 95% of the cluster for reduces, and the other 5% is being used by other users running application masters. The MRAppMaster thinks it still has 5%, so it doesn't know that it should kill a reduce in order to run a map. This can happen in other scenarios also. Generally, in a large cluster with multiple queues this shouldn't cause a hang forever, but it could cause the application to take much longer.
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2647) [YARN-796] Add yarn queue CLI to get queue info including labels of such queue
[ https://issues.apache.org/jira/browse/YARN-2647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14161395#comment-14161395 ] Sunil G commented on YARN-2647: --- Hi [~gp.leftnoteasy], I would like to take this up. Thank you. [YARN-796] Add yarn queue CLI to get queue info including labels of such queue -- Key: YARN-2647 URL: https://issues.apache.org/jira/browse/YARN-2647 Project: Hadoop YARN Issue Type: Sub-task Components: client Reporter: Wangda Tan -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (YARN-2647) [YARN-796] Add yarn queue CLI to get queue info including labels of such queue
[ https://issues.apache.org/jira/browse/YARN-2647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sunil G reassigned YARN-2647: - Assignee: Sunil G [YARN-796] Add yarn queue CLI to get queue info including labels of such queue -- Key: YARN-2647 URL: https://issues.apache.org/jira/browse/YARN-2647 Project: Hadoop YARN Issue Type: Sub-task Components: client Reporter: Wangda Tan Assignee: Sunil G -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1879) Mark Idempotent/AtMostOnce annotations to ApplicationMasterProtocol for RM fail over
[ https://issues.apache.org/jira/browse/YARN-1879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14161396#comment-14161396 ] Jian He commented on YARN-1879: --- bq. This is okay only if the RM handles these duplicate requests. In RM failover, even if the request is a duplicate from the client's perspective, from the RM's perspective these are just new requests, as the new RM doesn't have any cache of previous requests from the client. Just to unblock this, I suggest marking the annotations now so that the operations can be retried on failover, and discussing the internal implementation separately. Mark Idempotent/AtMostOnce annotations to ApplicationMasterProtocol for RM fail over Key: YARN-1879 URL: https://issues.apache.org/jira/browse/YARN-1879 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Jian He Assignee: Tsuyoshi OZAWA Priority: Critical Attachments: YARN-1879.1.patch, YARN-1879.1.patch, YARN-1879.11.patch, YARN-1879.12.patch, YARN-1879.13.patch, YARN-1879.14.patch, YARN-1879.15.patch, YARN-1879.16.patch, YARN-1879.17.patch, YARN-1879.18.patch, YARN-1879.19.patch, YARN-1879.2-wip.patch, YARN-1879.2.patch, YARN-1879.20.patch, YARN-1879.21.patch, YARN-1879.22.patch, YARN-1879.3.patch, YARN-1879.4.patch, YARN-1879.5.patch, YARN-1879.6.patch, YARN-1879.7.patch, YARN-1879.8.patch, YARN-1879.9.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2496) [YARN-796] Changes for capacity scheduler to support allocate resource respect labels
[ https://issues.apache.org/jira/browse/YARN-2496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-2496: - Attachment: YARN-2496.patch [YARN-796] Changes for capacity scheduler to support allocate resource respect labels - Key: YARN-2496 URL: https://issues.apache.org/jira/browse/YARN-2496 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Wangda Tan Assignee: Wangda Tan Attachments: YARN-2496.patch, YARN-2496.patch, YARN-2496.patch, YARN-2496.patch, YARN-2496.patch, YARN-2496.patch, YARN-2496.patch This JIRA includes: - Add/parse a labels option in {{capacity-scheduler.xml}}, similar to other queue options like capacity/maximum-capacity, etc. - Include a default-label-expression option in the queue config; if an app doesn't specify a label-expression, the queue's default-label-expression will be used. - Check if labels can be accessed by the queue when submitting an app with a label-expression to the queue or updating a ResourceRequest with a label-expression - Check labels on the NM when trying to allocate a ResourceRequest on an NM with a label-expression - Respect labels when calculating headroom/user-limit -- This message was sent by Atlassian JIRA (v6.3.4#6332)
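For illustration, queue label configuration along the lines described above, expressed as Configuration keys; the property names follow the naming pattern of existing capacity-scheduler queue options and are an assumption, not taken from the patch:
{code:java}
// Sketch: per-queue label settings as they might appear in
// capacity-scheduler.xml. Queue name and label values are illustrative.
Configuration csConf = new Configuration();
csConf.set("yarn.scheduler.capacity.root.engineering.capacity", "50");
csConf.set("yarn.scheduler.capacity.root.engineering.accessible-node-labels",
    "GPU,LARGE_MEM");                       // labels this queue may use
csConf.set(
    "yarn.scheduler.capacity.root.engineering.default-node-label-expression",
    "GPU");  // used when an app submits without a label-expression
{code}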
[jira] [Updated] (YARN-796) Allow for (admin) labels on nodes and resource-requests
[ https://issues.apache.org/jira/browse/YARN-796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-796: Attachment: YARN-796.node-label.consolidate.13.patch Allow for (admin) labels on nodes and resource-requests --- Key: YARN-796 URL: https://issues.apache.org/jira/browse/YARN-796 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.4.1 Reporter: Arun C Murthy Assignee: Wangda Tan Attachments: LabelBasedScheduling.pdf, Node-labels-Requirements-Design-doc-V1.pdf, Node-labels-Requirements-Design-doc-V2.pdf, YARN-796-Diagram.pdf, YARN-796.node-label.consolidate.1.patch, YARN-796.node-label.consolidate.10.patch, YARN-796.node-label.consolidate.11.patch, YARN-796.node-label.consolidate.12.patch, YARN-796.node-label.consolidate.13.patch, YARN-796.node-label.consolidate.2.patch, YARN-796.node-label.consolidate.3.patch, YARN-796.node-label.consolidate.4.patch, YARN-796.node-label.consolidate.5.patch, YARN-796.node-label.consolidate.6.patch, YARN-796.node-label.consolidate.7.patch, YARN-796.node-label.consolidate.8.patch, YARN-796.node-label.demo.patch.1, YARN-796.patch, YARN-796.patch4 It will be useful for admins to specify labels for nodes. Examples of labels are OS, processor architecture etc. We should expose these labels and allow applications to specify labels on resource-requests. Obviously we need to support admin operations on adding/removing node labels. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-796) Allow for (admin) labels on nodes and resource-requests
[ https://issues.apache.org/jira/browse/YARN-796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14161403#comment-14161403 ] Hadoop QA commented on YARN-796: {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12673284/YARN-796.node-label.consolidate.13.patch against trunk revision 519e5a7. {color:red}-1 patch{color}. The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5298//console This message is automatically generated. Allow for (admin) labels on nodes and resource-requests --- Key: YARN-796 URL: https://issues.apache.org/jira/browse/YARN-796 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.4.1 Reporter: Arun C Murthy Assignee: Wangda Tan Attachments: LabelBasedScheduling.pdf, Node-labels-Requirements-Design-doc-V1.pdf, Node-labels-Requirements-Design-doc-V2.pdf, YARN-796-Diagram.pdf, YARN-796.node-label.consolidate.1.patch, YARN-796.node-label.consolidate.10.patch, YARN-796.node-label.consolidate.11.patch, YARN-796.node-label.consolidate.12.patch, YARN-796.node-label.consolidate.13.patch, YARN-796.node-label.consolidate.2.patch, YARN-796.node-label.consolidate.3.patch, YARN-796.node-label.consolidate.4.patch, YARN-796.node-label.consolidate.5.patch, YARN-796.node-label.consolidate.6.patch, YARN-796.node-label.consolidate.7.patch, YARN-796.node-label.consolidate.8.patch, YARN-796.node-label.demo.patch.1, YARN-796.patch, YARN-796.patch4 It will be useful for admins to specify labels for nodes. Examples of labels are OS, processor architecture etc. We should expose these labels and allow applications to specify labels on resource-requests. Obviously we need to support admin operations on adding/removing node labels. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2496) [YARN-796] Changes for capacity scheduler to support allocate resource respect labels
[ https://issues.apache.org/jira/browse/YARN-2496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14161402#comment-14161402 ] Hadoop QA commented on YARN-2496: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12673283/YARN-2496.patch against trunk revision 519e5a7. {color:red}-1 patch{color}. The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5297//console This message is automatically generated. [YARN-796] Changes for capacity scheduler to support allocate resource respect labels - Key: YARN-2496 URL: https://issues.apache.org/jira/browse/YARN-2496 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Wangda Tan Assignee: Wangda Tan Attachments: YARN-2496.patch, YARN-2496.patch, YARN-2496.patch, YARN-2496.patch, YARN-2496.patch, YARN-2496.patch, YARN-2496.patch This JIRA includes: - Add/parse a labels option in {{capacity-scheduler.xml}}, similar to other queue options like capacity/maximum-capacity, etc. - Include a default-label-expression option in the queue config; if an app doesn't specify a label-expression, the queue's default-label-expression will be used. - Check if labels can be accessed by the queue when submitting an app with a label-expression to the queue or updating a ResourceRequest with a label-expression - Check labels on the NM when trying to allocate a ResourceRequest on an NM with a label-expression - Respect labels when calculating headroom/user-limit -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-796) Allow for (admin) labels on nodes and resource-requests
[ https://issues.apache.org/jira/browse/YARN-796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-796: Attachment: YARN-796.node-label.consolidate.13.patch Updated to trunk Allow for (admin) labels on nodes and resource-requests --- Key: YARN-796 URL: https://issues.apache.org/jira/browse/YARN-796 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.4.1 Reporter: Arun C Murthy Assignee: Wangda Tan Attachments: LabelBasedScheduling.pdf, Node-labels-Requirements-Design-doc-V1.pdf, Node-labels-Requirements-Design-doc-V2.pdf, YARN-796-Diagram.pdf, YARN-796.node-label.consolidate.1.patch, YARN-796.node-label.consolidate.10.patch, YARN-796.node-label.consolidate.11.patch, YARN-796.node-label.consolidate.12.patch, YARN-796.node-label.consolidate.2.patch, YARN-796.node-label.consolidate.3.patch, YARN-796.node-label.consolidate.4.patch, YARN-796.node-label.consolidate.5.patch, YARN-796.node-label.consolidate.6.patch, YARN-796.node-label.consolidate.7.patch, YARN-796.node-label.consolidate.8.patch, YARN-796.node-label.demo.patch.1, YARN-796.patch, YARN-796.patch4 It will be useful for admins to specify labels for nodes. Examples of labels are OS, processor architecture etc. We should expose these labels and allow applications to specify labels on resource-requests. Obviously we need to support admin operations on adding/removing node labels. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-796) Allow for (admin) labels on nodes and resource-requests
[ https://issues.apache.org/jira/browse/YARN-796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-796: Attachment: (was: YARN-796.node-label.consolidate.13.patch) Allow for (admin) labels on nodes and resource-requests --- Key: YARN-796 URL: https://issues.apache.org/jira/browse/YARN-796 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.4.1 Reporter: Arun C Murthy Assignee: Wangda Tan Attachments: LabelBasedScheduling.pdf, Node-labels-Requirements-Design-doc-V1.pdf, Node-labels-Requirements-Design-doc-V2.pdf, YARN-796-Diagram.pdf, YARN-796.node-label.consolidate.1.patch, YARN-796.node-label.consolidate.10.patch, YARN-796.node-label.consolidate.11.patch, YARN-796.node-label.consolidate.12.patch, YARN-796.node-label.consolidate.2.patch, YARN-796.node-label.consolidate.3.patch, YARN-796.node-label.consolidate.4.patch, YARN-796.node-label.consolidate.5.patch, YARN-796.node-label.consolidate.6.patch, YARN-796.node-label.consolidate.7.patch, YARN-796.node-label.consolidate.8.patch, YARN-796.node-label.demo.patch.1, YARN-796.patch, YARN-796.patch4 It will be useful for admins to specify labels for nodes. Examples of labels are OS, processor architecture etc. We should expose these labels and allow applications to specify labels on resource-requests. Obviously we need to support admin operations on adding/removing node labels. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-796) Allow for (admin) labels on nodes and resource-requests
[ https://issues.apache.org/jira/browse/YARN-796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-796: Attachment: YARN-796.node-label.consolidate.13.patch Allow for (admin) labels on nodes and resource-requests --- Key: YARN-796 URL: https://issues.apache.org/jira/browse/YARN-796 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.4.1 Reporter: Arun C Murthy Assignee: Wangda Tan Attachments: LabelBasedScheduling.pdf, Node-labels-Requirements-Design-doc-V1.pdf, Node-labels-Requirements-Design-doc-V2.pdf, YARN-796-Diagram.pdf, YARN-796.node-label.consolidate.1.patch, YARN-796.node-label.consolidate.10.patch, YARN-796.node-label.consolidate.11.patch, YARN-796.node-label.consolidate.12.patch, YARN-796.node-label.consolidate.13.patch, YARN-796.node-label.consolidate.2.patch, YARN-796.node-label.consolidate.3.patch, YARN-796.node-label.consolidate.4.patch, YARN-796.node-label.consolidate.5.patch, YARN-796.node-label.consolidate.6.patch, YARN-796.node-label.consolidate.7.patch, YARN-796.node-label.consolidate.8.patch, YARN-796.node-label.demo.patch.1, YARN-796.patch, YARN-796.patch4 It will be useful for admins to specify labels for nodes. Examples of labels are OS, processor architecture etc. We should expose these labels and allow applications to specify labels on resource-requests. Obviously we need to support admin operations on adding/removing node labels. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-796) Allow for (admin) labels on nodes and resource-requests
[ https://issues.apache.org/jira/browse/YARN-796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-796: Attachment: (was: YARN-796.node-label.consolidate.13.patch) Allow for (admin) labels on nodes and resource-requests --- Key: YARN-796 URL: https://issues.apache.org/jira/browse/YARN-796 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.4.1 Reporter: Arun C Murthy Assignee: Wangda Tan Attachments: LabelBasedScheduling.pdf, Node-labels-Requirements-Design-doc-V1.pdf, Node-labels-Requirements-Design-doc-V2.pdf, YARN-796-Diagram.pdf, YARN-796.node-label.consolidate.1.patch, YARN-796.node-label.consolidate.10.patch, YARN-796.node-label.consolidate.11.patch, YARN-796.node-label.consolidate.12.patch, YARN-796.node-label.consolidate.2.patch, YARN-796.node-label.consolidate.3.patch, YARN-796.node-label.consolidate.4.patch, YARN-796.node-label.consolidate.5.patch, YARN-796.node-label.consolidate.6.patch, YARN-796.node-label.consolidate.7.patch, YARN-796.node-label.consolidate.8.patch, YARN-796.node-label.demo.patch.1, YARN-796.patch, YARN-796.patch4 It will be useful for admins to specify labels for nodes. Examples of labels are OS, processor architecture etc. We should expose these labels and allow applications to specify labels on resource-requests. Obviously we need to support admin operations on adding/removing node labels. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-913) Add a way to register long-lived services in a YARN cluster
[ https://issues.apache.org/jira/browse/YARN-913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14161465#comment-14161465 ] Steve Loughran commented on YARN-913: - Sanjay, I can do most of these. * w.r.t. the README, we have a javadoc {{package-info.java}}; that's enough. * I propose restricting the custom values that a service record can have to string attributes (see the sketch after this message). Supporting arbitrary JSON opens things up to people embedding entire custom JSON docs in there, which could kill the notion of having semi-standardised records that other apps can work with, plus published API endpoints for any extra stuff *outside the registry*. * I'm going to rename the yarn fields back to {{yarn:id}} and {{yarn:persistence}} if Jersey+Jackson marshals them reliably once they aren't introspection-driven; it makes the yarn-nature of them clearer. Add a way to register long-lived services in a YARN cluster --- Key: YARN-913 URL: https://issues.apache.org/jira/browse/YARN-913 Project: Hadoop YARN Issue Type: New Feature Components: api, resourcemanager Affects Versions: 2.5.0, 2.4.1 Reporter: Steve Loughran Assignee: Steve Loughran Attachments: 2014-09-03_Proposed_YARN_Service_Registry.pdf, 2014-09-08_YARN_Service_Registry.pdf, RegistrationServiceDetails.txt, YARN-913-001.patch, YARN-913-002.patch, YARN-913-003.patch, YARN-913-003.patch, YARN-913-004.patch, YARN-913-006.patch, YARN-913-007.patch, YARN-913-008.patch, YARN-913-009.patch, YARN-913-010.patch, YARN-913-011.patch, YARN-913-012.patch, YARN-913-013.patch, YARN-913-014.patch, YARN-913-015.patch, YARN-913-016.patch, YARN-913-017.patch, YARN-913-018.patch, yarnregistry.pdf, yarnregistry.pdf, yarnregistry.tla In a YARN cluster you can't predict where services will come up -or on what ports. The services need to work those things out as they come up and then publish them somewhere. Applications need to be able to find the service instance they are to bond to -and not any others in the cluster. Some kind of service registry -in the RM, in ZK, could do this. If the RM held the write access to the ZK nodes, it would be more secure than having apps register with ZK themselves. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
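A minimal sketch of what a record with explicit, non-introspective Jackson bindings and string-only custom attributes could look like; the class and accessors here are illustrative, not the actual YARN-913 API:
{code}
import java.util.HashMap;
import java.util.Map;
import com.fasterxml.jackson.annotation.JsonProperty;

// Illustrative only: YARN-specific fields are bound to explicit JSON names
// (no introspection), and custom values are restricted to string attributes
// so apps cannot embed arbitrary JSON documents in the record.
public class SketchServiceRecord {

  @JsonProperty("yarn:id")
  public String yarnId;

  @JsonProperty("yarn:persistence")
  public String yarnPersistence;

  // Strings only: keeps records semi-standardised across applications.
  private final Map<String, String> attributes = new HashMap<>();

  public void setAttribute(String key, String value) {
    attributes.put(key, value);
  }

  public Map<String, String> attributes() {
    return attributes;
  }
}
{code}
Because the JSON names are declared explicitly, the colon in {{yarn:id}} never has to survive bean-introspection naming rules, which is what makes marshalling under Jersey+Jackson predictable.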
[jira] [Commented] (YARN-2056) Disable preemption at Queue level
[ https://issues.apache.org/jira/browse/YARN-2056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14161487#comment-14161487 ] Wangda Tan commented on YARN-2056: -- Hi [~eepayne], Sorry for the late response. I've carefully read and thought about what you mentioned, especially the algorithm. I think it can resolve most issues, but it cannot guarantee that every case is handled; I believe the following case comes out wrong under your algorithm:
{code}
total = 100
qA: used = 10, guaranteed = 10, pending = 100
qB: used = 25, guaranteed = 10, pending = 100 (non-preemptable)
qC: used = 0,  guaranteed = 80, pending = 0

1. At the start, unassigned = 100. It will first exclude qB, unassigned = 80.
2. It will try to fill qA: qA = qA + 80 * (10 / (10 + 80)) = 18, unassigned = 72.
   qC will be removed this turn.
3. qB will not be added back here, because ideal_assign(qA) = 18 and
   ideal_assign(qB) = 25.
4. All remaining resource will be used by qA.

The result should be ideal(qA) = 75, ideal(qB) = 25.
{code}
In addition, the remove-then-add-back algorithm doesn't seem very straightforward to me. In my mind, this problem is like pouring water into a tank, as below: parts of the tank floor are covered by stones, which make some sections higher than others. Because water flows, the result is naturally balanced: the water surface ends up at the same height everywhere, and some stones may stand above the surface.
{code}
(diagram: a tank with stone columns of different heights at positions 1-5;
water fills the gaps between them up to one common surface)
{code}
The algorithm may look like:
{code}
At the beginning, every queue sets ideal_assigned = its non-preemptable
resource, and that amount is deducted from the total remaining resource
(these are the stones). All queues are kept in qAlloc.

In each turn:
- All queues not completely satisfied (ideal_assigned < ideal_guaranteed,
  where ideal_guaranteed = min(maximum_capacity, used + pending)) will NOT
  be removed, like what we have today (the water hasn't reached the ceiling
  of the tank).
- Get the normalized weight of each queue.
- Get the queue with the minimum water level, ideal_assigned / guaranteed,
  say Q_min.
- Compute target_height =
    (Q_min.ideal_assigned + remained * Q_min.normalized_guarantee) / Q_min.guaranteed
- For each queue, do TempQueue.offer like today. The TempQueue.offer method
  looks like:
  * If (q.ideal_assigned >= target_height * q.guaranteed): skip
  * If (q.ideal_assigned < target_height * q.guaranteed):
      accepted = min(q.maximum, q.used + q.pending,
                     target_height * q.guaranteed) - q.ideal_assigned
- If accepted becomes zero, remove the queue from qAlloc like today.

The loop exits when the total remaining resource becomes zero (resources are
exhausted) or qAlloc becomes empty (all queues are satisfied).
{code}
I think this algorithm can reach a more balanced result (a toy sketch of the loop follows this message). Does this make sense to you? Thanks, Wangda Disable preemption at Queue level - Key: YARN-2056 URL: https://issues.apache.org/jira/browse/YARN-2056 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.4.0 Reporter: Mayank Bansal Assignee: Eric Payne Attachments: YARN-2056.201408202039.txt, YARN-2056.201408260128.txt, YARN-2056.201408310117.txt, YARN-2056.201409022208.txt, YARN-2056.201409181916.txt, YARN-2056.201409210049.txt, YARN-2056.201409232329.txt, YARN-2056.201409242210.txt We need to be able to disable preemption at the individual queue level -- This message was sent by Atlassian JIRA (v6.3.4#6332)
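To make the level-raising loop above concrete, here is a small self-contained Java sketch under simplifying assumptions (a single resource dimension, guaranteed > 0 for every queue, and non-preemptable usage treated purely as a floor); all names are illustrative, and this is not the actual ProportionalCapacityPreemptionPolicy code:
{code}
import java.util.ArrayList;
import java.util.List;

// Toy sketch of the "water level" idea: repeatedly raise a common level,
// measured in resource per unit of guarantee, until the remaining resource
// is exhausted or every queue hits its ceiling.
public class WaterLevelSketch {

  static class Queue {
    final String name;
    final double guaranteed, used, pending, max;
    double ideal; // ideal_assigned

    Queue(String name, double guaranteed, double used, double pending,
          double max, double nonPreemptable) {
      this.name = name;
      this.guaranteed = guaranteed;
      this.used = used;
      this.pending = pending;
      this.max = max;
      this.ideal = nonPreemptable; // the "stones" on the tank floor
    }

    double ceiling() { // min(maximum_capacity, used + pending)
      return Math.min(max, used + pending);
    }
  }

  static void fill(List<Queue> queues, double total) {
    double remained = total;
    List<Queue> alloc = new ArrayList<>();
    for (Queue q : queues) {
      remained -= q.ideal;            // deduct the stones up front
      if (q.ideal < q.ceiling()) {
        alloc.add(q);                 // keep only queues with room left
      }
    }
    while (remained > 1e-9 && !alloc.isEmpty()) {
      // Find the lowest water level (ideal / guaranteed) and the total
      // guarantee of the still-unsatisfied queues.
      Queue qMin = alloc.get(0);
      double sumGuaranteed = 0;
      for (Queue q : alloc) {
        sumGuaranteed += q.guaranteed;
        if (q.ideal / q.guaranteed < qMin.ideal / qMin.guaranteed) {
          qMin = q;
        }
      }
      // Target level, in resource per unit of guarantee.
      double target =
          (qMin.ideal + remained * qMin.guaranteed / sumGuaranteed)
              / qMin.guaranteed;
      List<Queue> next = new ArrayList<>();
      for (Queue q : alloc) {
        double offer = Math.min(q.ceiling(), target * q.guaranteed) - q.ideal;
        double accepted = Math.max(0, Math.min(offer, remained));
        q.ideal += accepted;
        remained -= accepted;
        if (q.ideal < q.ceiling() - 1e-9) {
          next.add(q);                // still below its ceiling
        }
      }
      alloc = next;
    }
  }

  public static void main(String[] args) {
    List<Queue> queues = new ArrayList<>();
    queues.add(new Queue("q1", 10, 5, 100, 90, 0));
    queues.add(new Queue("q2", 20, 30, 100, 90, 15));
    queues.add(new Queue("q3", 30, 0, 0, 90, 0)); // idle, removed at once
    fill(queues, 90);
    for (Queue q : queues) {
      System.out.printf("%s: ideal_assigned = %.1f%n", q.name, q.ideal);
    }
  }
}
{code}
On these illustrative numbers the two active queues settle at the same height (ideal/guaranteed = 3.0 for both), which is the flat water surface the analogy aims for; the real policy additionally has to handle multi-dimensional resources and the exact non-preemptable semantics debated above.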
[jira] [Updated] (YARN-2496) [YARN-796] Changes for capacity scheduler to support allocate resource respect labels
[ https://issues.apache.org/jira/browse/YARN-2496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-2496: - Attachment: YARN-2496.patch Attached a new patch against the latest trunk. [YARN-796] Changes for capacity scheduler to support allocate resource respect labels - Key: YARN-2496 URL: https://issues.apache.org/jira/browse/YARN-2496 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Wangda Tan Assignee: Wangda Tan Attachments: YARN-2496.patch, YARN-2496.patch, YARN-2496.patch, YARN-2496.patch, YARN-2496.patch, YARN-2496.patch, YARN-2496.patch, YARN-2496.patch This JIRA includes: - Add/parse a labels option in {{capacity-scheduler.xml}}, similar to other queue options such as capacity/maximum-capacity, etc. - Include a default-label-expression option in the queue config; if an app doesn't specify a label-expression, the queue's default-label-expression will be used. - Check whether labels can be accessed by the queue when submitting an app with a label-expression to the queue or updating a ResourceRequest with a label-expression. - Check labels on the NM when trying to allocate a ResourceRequest on an NM with a label-expression. - Respect labels when calculating headroom/user-limit. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1857) CapacityScheduler headroom doesn't account for other AM's running
[ https://issues.apache.org/jira/browse/YARN-1857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14161496#comment-14161496 ] Jian He commented on YARN-1857: --- I found that, given {{queueUsedResources >= userConsumed}}, we can simplify the formula to {code}min(userLimit - userConsumed, queueMaxCap - queueUsedResources){code} Does this make sense? (a tiny sketch follows this message) CapacityScheduler headroom doesn't account for other AM's running - Key: YARN-1857 URL: https://issues.apache.org/jira/browse/YARN-1857 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler Affects Versions: 2.3.0 Reporter: Thomas Graves Assignee: Chen He Priority: Critical Attachments: YARN-1857.1.patch, YARN-1857.2.patch, YARN-1857.3.patch, YARN-1857.4.patch, YARN-1857.5.patch, YARN-1857.6.patch, YARN-1857.patch, YARN-1857.patch, YARN-1857.patch It's possible for an application to hang forever (or for a long time) in a cluster with multiple users. The reason is that the headroom sent to the application is based on the user limit, but it doesn't account for other Application Masters using space in that queue. So the headroom (user limit - user consumed) can be > 0 even though the cluster is 100% full, because the remaining space is being used by application masters from other users. For instance, take a cluster with one queue, a user limit of 100%, and multiple users submitting applications. One very large application by user 1 starts up, runs most of its maps, and starts running reducers. Other users try to start applications and get their application masters started, but no tasks. The very large application then gets to the point where it has consumed the rest of the cluster's resources with reduces, but at this point it still needs to finish a few maps. The headroom sent to this application is based only on the user limit (which is 100% of the cluster capacity); it is using, say, 95% of the cluster for reduces, and the other 5% is being used by other users running application masters. The MRAppMaster thinks it still has 5%, so it doesn't know that it should kill a reduce in order to run a map. This can happen in other scenarios as well. Generally, in a large cluster with multiple queues this shouldn't cause a hang forever, but it could cause the application to take much longer. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
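As a tiny illustration of the simplified formula, using plain longs for a single resource dimension (the real scheduler computes this on {{Resource}} objects through a resource calculator, so this is only a sketch with illustrative names):
{code}
// Sketch of the simplified headroom formula from the comment above;
// single resource dimension only.
public final class HeadroomSketch {

  static long headroom(long userLimit, long userConsumed,
                       long queueMaxCap, long queueUsedResources) {
    // min(userLimit - userConsumed, queueMaxCap - queueUsedResources)
    return Math.min(userLimit - userConsumed,
                    queueMaxCap - queueUsedResources);
  }

  public static void main(String[] args) {
    // The scenario described above: the user limit alone leaves 5% of
    // "headroom", but the queue itself is already full, so min() caps it at 0.
    System.out.println(headroom(100, 95, 100, 100)); // prints 0
  }
}
{code}
The second term is what accounts for the application masters of other users: their usage raises {{queueUsedResources}} without touching this user's {{userConsumed}}.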