[jira] [Resolved] (YARN-667) Data persisted in RM should be versioned
[ https://issues.apache.org/jira/browse/YARN-667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Junping Du resolved YARN-667.
-----------------------------
Resolution: Duplicate

Data persisted in RM should be versioned
----------------------------------------
Key: YARN-667
URL: https://issues.apache.org/jira/browse/YARN-667
Project: Hadoop YARN
Issue Type: Sub-task
Affects Versions: 2.0.4-alpha
Reporter: Siddharth Seth
Assignee: Junping Du

Includes data persisted for RM restart, the NodeManager directory structure, and the Aggregated Log Format.

-- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-667) Data persisted in RM should be versioned
[ https://issues.apache.org/jira/browse/YARN-667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14099907#comment-14099907 ] Junping Du commented on YARN-667:
---------------------------------
Agreed. Let's address them in separate JIRAs when they are needed in the future. As for versioning the RMState, it looks like YARN-1239 already addresses it, so I'm closing this as a duplicate.
[jira] [Updated] (YARN-1198) Capacity Scheduler headroom calculation does not work as expected
[ https://issues.apache.org/jira/browse/YARN-1198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Craig Welch updated YARN-1198:
------------------------------
Attachment: YARN-1198.6.patch

Fix for findbugs findings.

Capacity Scheduler headroom calculation does not work as expected
-----------------------------------------------------------------
Key: YARN-1198
URL: https://issues.apache.org/jira/browse/YARN-1198
Project: Hadoop YARN
Issue Type: Bug
Reporter: Omkar Vinit Joshi
Assignee: Craig Welch
Attachments: YARN-1198.1.patch, YARN-1198.2.patch, YARN-1198.3.patch, YARN-1198.4.patch, YARN-1198.5.patch, YARN-1198.6.patch

Today the headroom calculation (for the app) takes place only when:
* a new node is added to or removed from the cluster, or
* a new container is assigned to the application.

However, there are potentially a lot of situations that are not considered in this calculation:
* If a container finishes, the headroom for that application changes and should be reported to the AM accordingly.
* If a single user has submitted multiple applications (app1 and app2) to the same queue, then:
** If one of app1's containers finishes, not only app1's but also app2's AM should be notified of the change in headroom.
** Similarly, if a container is assigned to either app1 or app2, both AMs should be notified of their headroom.
** To simplify the whole communication process, it is ideal to keep headroom per user per LeafQueue, so that everyone gets the same picture (apps belonging to the same user, submitted to the same queue).
* If a new user submits an application to the queue, then all applications submitted by all users in that queue should be notified of the headroom change.
* Also, today headroom is an absolute number (I think it should be normalized, but that would not be backward compatible).
* Also, when an admin refreshes the queue, the headroom has to be updated.

These are all potential bugs in the headroom calculation.
[jira] [Commented] (YARN-1198) Capacity Scheduler headroom calculation does not work as expected
[ https://issues.apache.org/jira/browse/YARN-1198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14099927#comment-14099927 ] Hadoop QA commented on YARN-1198:
---------------------------------
{color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12662360/YARN-1198.6.patch against trunk revision.

{color:green}+1 @author{color}. The patch does not contain any @author tags.
{color:green}+1 tests included{color}. The patch appears to include 2 new or modified test files.
{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.
{color:green}+1 javadoc{color}. There were no new javadoc warning messages.
{color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse.
{color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings.
{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.
{color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager:
  org.apache.hadoop.yarn.server.resourcemanager.applicationsmanager.TestAMRestart
{color:green}+1 contrib tests{color}. The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4650//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4650//console

This message is automatically generated.
[jira] [Updated] (YARN-2411) [Capacity Scheduler] support simple user and group mappings to queues
[ https://issues.apache.org/jira/browse/YARN-2411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ram Venkatesh updated YARN-2411:
--------------------------------
Attachment: YARN-2411.4.patch

[~wangda.tan] thank you for your comments. I agree, it is better to check and reject the mapping upfront if it refers to a non-existent or non-leaf queue. Uploading a patch with this change.

[Capacity Scheduler] support simple user and group mappings to queues
---------------------------------------------------------------------
Key: YARN-2411
URL: https://issues.apache.org/jira/browse/YARN-2411
Project: Hadoop YARN
Issue Type: Improvement
Components: capacityscheduler
Reporter: Ram Venkatesh
Assignee: Ram Venkatesh
Attachments: YARN-2411-2.patch, YARN-2411.1.patch, YARN-2411.3.patch, YARN-2411.4.patch

YARN-2257 has a proposal to extend and share the queue placement rules of the fair scheduler and the capacity scheduler. This is a good long-term solution for streamlining queue placement in both schedulers, but it has core infra work that has to happen first and might require changes to current features in all schedulers, along with corresponding configuration changes, if any.

I would like to propose a change with a smaller scope in the capacity scheduler that addresses the core use cases: implicitly mapping jobs that have the default queue or no queue specified to specific queues, based on the submitting user and user groups. It will be useful in a number of real-world scenarios and can be migrated over to the unified scheme when YARN-2257 becomes available.

The proposal is to add two new configuration options:

yarn.scheduler.capacity.queue-mappings-override.enable
A boolean that controls whether user-specified queues can be overridden by the mapping; the default is false.

and

yarn.scheduler.capacity.queue-mappings
A string that specifies a list of mappings in the following format (the default is empty, which is the same as no mapping):

map_specifier:source_attribute:queue_name[,map_specifier:source_attribute:queue_name]*
map_specifier    := user (u) | group (g)
source_attribute := user | group | %user
queue_name       := the name of the mapped queue | %user | %primary_group

The mappings are evaluated left to right, and the first valid mapping is used. If the mapped queue does not exist, or the current user does not have permission to submit jobs to the mapped queue, the submission fails.

Example usages:
1. user1 is mapped to queue1, group1 is mapped to queue2: u:user1:queue1,g:group1:queue2
2. To map users to queues with the same name as the user: u:%user:%user

I am happy to volunteer to take this up.
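The two proposed properties could be wired up in capacity-scheduler.xml roughly as follows. This is a sketch based on the proposal above; the user, group, and queue names are made up for illustration:

```xml
<!-- Illustrative capacity-scheduler.xml fragment for the proposed
     queue-mapping options. Queue/user/group names are hypothetical. -->
<property>
  <name>yarn.scheduler.capacity.queue-mappings</name>
  <!-- user1 goes to queue1, members of group1 go to queue2,
       and everyone else to a queue named after themselves.
       Evaluated left to right; the first valid mapping wins. -->
  <value>u:user1:queue1,g:group1:queue2,u:%user:%user</value>
</property>
<property>
  <name>yarn.scheduler.capacity.queue-mappings-override.enable</name>
  <!-- false (the proposed default): a queue named explicitly at
       submission time is not overridden by the mapping -->
  <value>false</value>
</property>
```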
[jira] [Commented] (YARN-2033) Investigate merging generic-history into the Timeline Store
[ https://issues.apache.org/jira/browse/YARN-2033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14099974#comment-14099974 ] Junping Du commented on YARN-2033:
----------------------------------
[~zjshen], thanks for the comments above, which sound good to me. I just went through your latest patch; a couple of comments:
{code}
+  public static final String RM_METRICS_PUBLISHER_MULTI_THREADED_DISPATCHER_POOL_SIZE =
+      RM_PREFIX + "metrics-publisher.multi-threaded-dispatcher.pool-size";
+  public static final int DEFAULT_RM_METRICS_PUBLISHER_MULTI_THREADED_DISPATCHER_POOL_SIZE =
+      10;
{code}
The name of the config looks too long. Maybe we can rename it to something shorter, e.g. RM_PREFIX + "metrics-publisher.dispatcher.pool-size"?
{code}
-  optional string diagnostics = 5 [default = "N/A"];
-  optional YarnApplicationAttemptStateProto yarn_application_attempt_state = 6;
-  optional ContainerIdProto am_container_id = 7;
+  optional string original_tracking_url = 5;
+  optional string diagnostics = 6 [default = "N/A"];
+  optional YarnApplicationAttemptStateProto yarn_application_attempt_state = 7;
+  optional ContainerIdProto am_container_id = 8;
{code}
We shouldn't insert a new field, as doing so changes the numbers of the existing fields. In PB, encoded messages include only the field type and number, which are mapped to field names when decoding. Thus, changing the field numbers here breaks compatibility, which is unnecessary. Adding original_tracking_url with field number 8 should be fine.
{code}
-    if (conf.getBoolean(YarnConfiguration.APPLICATION_HISTORY_ENABLED,
-        YarnConfiguration.DEFAULT_APPLICATION_HISTORY_ENABLED)) {
-      historyServiceEnabled = true;
+    if (conf.get(YarnConfiguration.APPLICATION_HISTORY_STORE) == null
+        && conf.getBoolean(YarnConfiguration.RM_METRICS_PUBLISHER_ENABLED,
+            YarnConfiguration.DEFAULT_RM_METRICS_PUBLISHER_ENABLED)
+        || conf.get(YarnConfiguration.APPLICATION_HISTORY_STORE) != null
+        && conf.getBoolean(YarnConfiguration.APPLICATION_HISTORY_ENABLED,
+            YarnConfiguration.DEFAULT_APPLICATION_HISTORY_ENABLED)) {
+      yarnMetricsEnabled = true;
{code}
If the user's config is slightly wrong (let's assume YarnConfiguration.APPLICATION_HISTORY_STORE != null and YarnConfiguration.RM_METRICS_PUBLISHER_ENABLED = true), then here we disable yarnMetricsEnabled silently, which makes troubleshooting a little harder. I suggest logging a warn message when such a wrong configuration occurs. It would be better to move the logical operations inside the if() into a separate method and log the error for the wrong configuration.
{code}
+  <property>
+    <description>The setting that controls whether yarn metrics is published on
+    the timeline server or not by RM.</description>
+    <name>yarn.resourcemanager.metrics-publisher.enabled</name>
+    <value>false</value>
+  </property>
{code}
Indentation should be 2 white spaces instead of a tab.
In ApplicationHistoryManagerOnTimelineStore.java,
{code}
    } catch (YarnException e) {
+      throw new IOException(e);
+    }
{code}
This kind of exception translation seems unnecessary to me. We can remove it and let the YarnException be thrown here. If we decide to throw IOException only (please see my comments later), we can extend the block to cover more code that could throw YarnException and translate it to IOException. The convertToApplicationReport method seems a little too sophisticated in creating the ApplicationReport. Another option, which should be better, is to wrap it in a Builder pattern (please refer to MiniDFSCluster).
The same comments apply to convertToApplicationAttemptReport and convertToContainerReport. These are only optional comments; see if you want to address them here or in a separate JIRA in the future.
{code}
+  public ApplicationAttemptReport getApplicationAttempt(
+      ApplicationAttemptId appAttemptId) throws YarnException, IOException {
+    getApplication(appAttemptId.getApplicationId(), ApplicationReportField.NONE);
+    TimelineEntity entity = null;
...
{code}
Why do we need getApplication(appAttemptId.getApplicationId(), ApplicationReportField.NONE) here? IMO, the only work it does is to check whether the applicationId is valid, but we have a check on appAttemptId later, so we may consider removing it if it is unnecessary. In addition, maybe ApplicationReportField.NONE is not useful?
{code}
-    new Path(conf.get(YarnConfiguration.FS_APPLICATION_HISTORY_STORE_URI));
+    new Path(conf.get(YarnConfiguration.FS_APPLICATION_HISTORY_STORE_URI,
+        conf.get("hadoop.tmp.dir") + "/yarn/timeline/generic-history"));
{code}
We should replace "hadoop.tmp.dir" and "/yarn/timeline/generic-history" with constant strings in YarnConfiguration. BTW, maybe "hadoop.tmp.dir" is not necessary?
In ApplicationContext.java,
{code}
   * @return {@link ApplicationReport} for the ApplicationId.
+  * @throws YarnException
   * @throws IOException
   */
  @Public
  @Unstable
- ApplicationReport
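The protobuf field-numbering point raised above can be sketched as follows. The field names and types come from the quoted diff; the message name is an assumption for illustration:

```proto
// Hypothetical message based on the quoted diff. Existing fields keep
// their original numbers; the new field is appended with the next free
// number, so previously encoded messages still decode correctly.
message ApplicationAttemptHistoryDataProto {
  optional string diagnostics = 5 [default = "N/A"];
  optional YarnApplicationAttemptStateProto yarn_application_attempt_state = 6;
  optional ContainerIdProto am_container_id = 7;
  optional string original_tracking_url = 8;  // new field, appended at the end
}
```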
[jira] [Commented] (YARN-2411) [Capacity Scheduler] support simple user and group mappings to queues
[ https://issues.apache.org/jira/browse/YARN-2411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14099976#comment-14099976 ] Hadoop QA commented on YARN-2411:
---------------------------------
{color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12662363/YARN-2411.4.patch against trunk revision.

{color:green}+1 @author{color}. The patch does not contain any @author tags.
{color:green}+1 tests included{color}. The patch appears to include 1 new or modified test file.
{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.
{color:green}+1 javadoc{color}. There were no new javadoc warning messages.
{color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse.
{color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings.
{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.
{color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager:
  org.apache.hadoop.yarn.server.resourcemanager.TestRMEmbeddedElector
  org.apache.hadoop.yarn.server.resourcemanager.recovery.TestZKRMStateStoreZKClientConnections
{color:green}+1 contrib tests{color}. The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4651//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4651//console

This message is automatically generated.
[jira] [Commented] (YARN-2411) [Capacity Scheduler] support simple user and group mappings to queues
[ https://issues.apache.org/jira/browse/YARN-2411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14100029#comment-14100029 ] Jian He commented on YARN-2411:
-------------------------------
The test failures should be unrelated to the patch. Resubmitting the same patch with one more assertion on RMApp.getQueue in the test.
[jira] [Updated] (YARN-2411) [Capacity Scheduler] support simple user and group mappings to queues
[ https://issues.apache.org/jira/browse/YARN-2411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jian He updated YARN-2411:
--------------------------
Attachment: YARN-2411.5.patch
[jira] [Commented] (YARN-2424) LCE should support non-cgroups, non-secure mode
[ https://issues.apache.org/jira/browse/YARN-2424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14100134#comment-14100134 ] Alejandro Abdelnur commented on YARN-2424:
------------------------------------------
Please refer to the YARN-1253 comments; it was stated there that the old behavior had security issues.

LCE should support non-cgroups, non-secure mode
-----------------------------------------------
Key: YARN-2424
URL: https://issues.apache.org/jira/browse/YARN-2424
Project: Hadoop YARN
Issue Type: Bug
Components: nodemanager
Affects Versions: 2.3.0, 2.4.0, 2.5.0, 2.4.1
Reporter: Allen Wittenauer
Priority: Blocker
Labels: regression
Attachments: YARN-2424.patch

After YARN-1253, LCE no longer works for non-secure, non-cgroup scenarios. This is a fairly serious regression, as turning on LCE prior to turning on full-blown security is a fairly standard procedure.
[jira] [Commented] (YARN-2424) LCE should support non-cgroups, non-secure mode
[ https://issues.apache.org/jira/browse/YARN-2424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14100153#comment-14100153 ] Allen Wittenauer commented on YARN-2424:
----------------------------------------
This fix is all about ease of use and operability. I can certainly understand the desire to run cgroups without needing local users. But transitioning to security is not a binary process for most users (or, at least, it doesn't have to be...). The problem with the current code base is that someone moving to secure mode now has to either enable cgroups (which, as pointed out in YARN-1253, is irrelevant for security) or cut everything over at once. Enabling LCE prior to enabling security allows for a two-step transition and eases problem determination when doing the security upgrade: is that user missing from the system, or is Kerberos failing? Clearly the issues stemming from the former can be sorted out without security. This makes the operations side of the house much easier.

It's also worth pointing out that one of the key benefits of running tasks as the user who submitted them is that it makes troubleshooting much easier. When one hops on a node, it is evident which user's tasks one is looking at, even if those tasks aren't validated as that user. This is especially important in heavy multi-tenant scenarios.

But, again, the fix in YARN-1253 caused a regression. LCE without security was supported prior to Hadoop 2.3 and was definitely used by people. This change still sets the default to be LCE with either one user or security, but folks who want the prior behavior can now flip a flag and get it.
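As a sketch, the two-step transition described above might look like the following yarn-site.xml fragment. The executor class name is standard; the nonsecure-mode property name is an assumption based on this discussion and the attached patch, not confirmed against it:

```xml
<!-- Illustrative yarn-site.xml fragment: enable the
     LinuxContainerExecutor before Kerberos is turned on.
     The nonsecure-mode property name below is an assumption. -->
<property>
  <name>yarn.nodemanager.container-executor.class</name>
  <value>org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor</value>
</property>
<property>
  <!-- hypothetical flag: when disabled, containers run as the
       submitting user even without Kerberos, restoring the
       pre-2.3 behavior discussed above -->
  <name>yarn.nodemanager.linux-container-executor.nonsecure-mode.limit-users</name>
  <value>false</value>
</property>
```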
[jira] [Commented] (YARN-2424) LCE should support non-cgroups, non-secure mode
[ https://issues.apache.org/jira/browse/YARN-2424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14100158#comment-14100158 ] Alejandro Abdelnur commented on YARN-2424:
------------------------------------------
Please go over Todd's comment on the security issues of sudoing as a user without secure auth; you definitely don't want to do that in a multi-tenant cluster. BTW, fixing a security bug is not a regression.
[jira] [Commented] (YARN-2424) LCE should support non-cgroups, non-secure mode
[ https://issues.apache.org/jira/browse/YARN-2424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14100163#comment-14100163 ] Allen Wittenauer commented on YARN-2424:
----------------------------------------
I don't think you understood what I wrote.
[jira] [Updated] (YARN-1198) Capacity Scheduler headroom calculation does not work as expected
[ https://issues.apache.org/jira/browse/YARN-1198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Craig Welch updated YARN-1198:
------------------------------
Attachment: YARN-1198.7.patch

Can't repro the test failure; uploading a (better formatted) patch to trigger a new build.
[jira] [Commented] (YARN-2424) LCE should support non-cgroups, non-secure mode
[ https://issues.apache.org/jira/browse/YARN-2424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14100171#comment-14100171 ] Hadoop QA commented on YARN-2424: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12662313/YARN-2424.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4654//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4654//console This message is automatically generated. LCE should support non-cgroups, non-secure mode --- Key: YARN-2424 URL: https://issues.apache.org/jira/browse/YARN-2424 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.3.0, 2.4.0, 2.5.0, 2.4.1 Reporter: Allen Wittenauer Priority: Blocker Labels: regression Attachments: YARN-2424.patch After YARN-1253, LCE no longer works for non-secure, non-cgroup scenarios. 
This is a fairly serious regression, as turning on LCE prior to turning on full-blown security is a fairly standard procedure. -- This message was sent by Atlassian JIRA (v6.2#6252)
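For context, a minimal yarn-site.xml fragment for the non-secure LCE setup this issue describes. The property names are the standard ones from Hadoop 2.x (the nonsecure-mode.local-user knob was introduced by YARN-1253); the local-user value is a placeholder:

```xml
<!-- Enable the LinuxContainerExecutor on the NodeManager. -->
<property>
  <name>yarn.nodemanager.container-executor.class</name>
  <value>org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor</value>
</property>
<!-- After YARN-1253, in non-secure mode all containers run as this
     single local user instead of the submitting user. -->
<property>
  <name>yarn.nodemanager.linux-container-executor.nonsecure-mode.local-user</name>
  <value>nobody</value> <!-- placeholder -->
</property>
```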
[jira] [Commented] (YARN-1198) Capacity Scheduler headroom calculation does not work as expected
[ https://issues.apache.org/jira/browse/YARN-1198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14100190#comment-14100190 ] Hadoop QA commented on YARN-1198: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12662403/YARN-1198.7.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 2 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4655//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4655//console This message is automatically generated. 
Capacity Scheduler headroom calculation does not work as expected - Key: YARN-1198 URL: https://issues.apache.org/jira/browse/YARN-1198 Project: Hadoop YARN Issue Type: Bug Reporter: Omkar Vinit Joshi Assignee: Craig Welch Attachments: YARN-1198.1.patch, YARN-1198.2.patch, YARN-1198.3.patch, YARN-1198.4.patch, YARN-1198.5.patch, YARN-1198.6.patch, YARN-1198.7.patch Today headroom calculation (for the app) takes place only when * New node is added/removed from the cluster * New container is getting assigned to the application. However there are potentially lot of situations which are not considered for this calculation * If a container finishes then headroom for that application will change and should be notified to the AM accordingly. * If a single user has submitted multiple applications (app1 and app2) to the same queue then ** If app1's container finishes then not only app1's but also app2's AM should be notified about the change in headroom. ** Similarly if a container is assigned to any applications app1/app2 then both AM should be notified about their headroom. ** To simplify the whole communication process it is ideal to keep headroom per User per LeafQueue so that everyone gets the same picture (apps belonging to same user and submitted in same queue). * If a new user submits an application to the queue then all applications submitted by all users in that queue should be notified of the headroom change. * Also today headroom is an absolute number ( I think it should be normalized but then this is going to be not backward compatible..) * Also when admin user refreshes queue headroom has to be updated. These all are the potential bugs in headroom calculations -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1198) Capacity Scheduler headroom calculation does not work as expected
[ https://issues.apache.org/jira/browse/YARN-1198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14100204#comment-14100204 ] Chen He commented on YARN-1198: --- Thank you for the update, [~cwelch]. -- This message was sent by Atlassian JIRA (v6.2#6252)
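A minimal sketch of the per-user, per-leaf-queue headroom idea discussed in this issue: compute headroom once per (user, queue) so every AM belonging to that user sees the same value. The method name, MB-only units, and the exact limit formula are illustrative assumptions, not the CapacityScheduler's actual code:

```java
public class HeadroomSketch {
    // Headroom for one user in one leaf queue: how much more that user
    // may still allocate, bounded both by the user limit and by what the
    // queue itself has left. Simplified to MB; real headroom is a Resource.
    static long userHeadroomMb(long userLimitMb, long userConsumedMb,
                               long queueAvailableMb) {
        long remainingUserShare = userLimitMb - userConsumedMb;
        return Math.max(0, Math.min(remainingUserShare, queueAvailableMb));
    }
}
```

Because the value depends only on (user, queue) state, it changes whenever any container of that user starts or finishes, which is why every AM of the user would need to be re-notified in those cases.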
[jira] [Commented] (YARN-415) Capture memory utilization at the app-level for chargeback
[ https://issues.apache.org/jira/browse/YARN-415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14100253#comment-14100253 ] Jian He commented on YARN-415: -- bq. an attempt can be in the complete state before all of its containers are finished CapacityScheduler#doneApplicationAttempt (FairScheduler#removeApplicationAttempt) synchronously finishes all the live containers, so I think all containers should be guaranteed to finish before the attempt does. bq. charging the running containers to the current app until the containers finish will be seamless to the end user. Particularly in work-preserving AM restart, the current AM is actually the one managing the previous attempt's running containers. Running containers in the scheduler are already transferred to the current AM, so running-container metrics are transferred as well. I think it would be confusing if finished containers were still charged back against the previous dead attempt. Btw, YARN-1809 will add the attempt web page, where we could also show attempt-specific metrics. Regarding the problem of metrics persistence: agreed that it doesn't solve the problem for running apps in general. Maybe we can have the state store changes in a separate jira and discuss more there, so that we can get this in first. 
Capture memory utilization at the app-level for chargeback -- Key: YARN-415 URL: https://issues.apache.org/jira/browse/YARN-415 Project: Hadoop YARN Issue Type: New Feature Components: resourcemanager Affects Versions: 0.23.6 Reporter: Kendall Thrapp Assignee: Andrey Klochkov Attachments: YARN-415--n10.patch, YARN-415--n2.patch, YARN-415--n3.patch, YARN-415--n4.patch, YARN-415--n5.patch, YARN-415--n6.patch, YARN-415--n7.patch, YARN-415--n8.patch, YARN-415--n9.patch, YARN-415.201405311749.txt, YARN-415.201406031616.txt, YARN-415.201406262136.txt, YARN-415.201407042037.txt, YARN-415.201407071542.txt, YARN-415.201407171553.txt, YARN-415.201407172144.txt, YARN-415.201407232237.txt, YARN-415.201407242148.txt, YARN-415.201407281816.txt, YARN-415.201408062232.txt, YARN-415.201408080204.txt, YARN-415.201408092006.txt, YARN-415.201408132109.txt, YARN-415.201408150030.txt, YARN-415.patch For the purpose of chargeback, I'd like to be able to compute the cost of an application in terms of cluster resource usage. To start out, I'd like to get the memory utilization of an application. The unit should be MB-seconds or something similar and, from a chargeback perspective, the memory amount should be the memory reserved for the application, as even if the app didn't use all that memory, no one else was able to use it. (reserved ram for container 1 * lifetime of container 1) + (reserved ram for container 2 * lifetime of container 2) + ... + (reserved ram for container n * lifetime of container n) It'd be nice to have this at the app level instead of the job level because: 1. We'd still be able to get memory usage for jobs that crashed (and wouldn't appear on the job history server). 2. We'd be able to get memory usage for future non-MR jobs (e.g. Storm). This new metric should be available both through the RM UI and RM Web Services REST API. -- This message was sent by Atlassian JIRA (v6.2#6252)
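The chargeback formula in the issue description, (reserved ram for container 1 * lifetime of container 1) + ... + (reserved ram for container n * lifetime of container n), can be sketched as follows. The class and method names are illustrative, not actual ResourceManager code:

```java
public class ChargebackSketch {
    // MB-seconds cost of an application: the sum over all of its
    // containers of (reserved memory in MB * container lifetime in
    // seconds). Reserved memory is charged even if unused, since no
    // one else could use it while it was reserved.
    static long memoryMbSeconds(long[] reservedMb, long[] lifetimeSeconds) {
        long total = 0;
        for (int i = 0; i < reservedMb.length; i++) {
            total += reservedMb[i] * lifetimeSeconds[i];
        }
        return total;
    }
}
```

For example, two containers reserving 1024 MB for 60 s and 2048 MB for 30 s each contribute 61440 MB-seconds, for a total of 122880.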
[jira] [Commented] (YARN-2411) [Capacity Scheduler] support simple user and group mappings to queues
[ https://issues.apache.org/jira/browse/YARN-2411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14100278#comment-14100278 ] Hadoop QA commented on YARN-2411: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12662375/YARN-2411.5.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.yarn.server.resourcemanager.TestRMEmbeddedElector org.apache.hadoop.yarn.server.resourcemanager.recovery.TestZKRMStateStoreZKClientConnections {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4657//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4657//console This message is automatically generated. 
[Capacity Scheduler] support simple user and group mappings to queues - Key: YARN-2411 URL: https://issues.apache.org/jira/browse/YARN-2411 Project: Hadoop YARN Issue Type: Improvement Components: capacityscheduler Reporter: Ram Venkatesh Assignee: Ram Venkatesh Attachments: YARN-2411-2.patch, YARN-2411.1.patch, YARN-2411.3.patch, YARN-2411.4.patch, YARN-2411.5.patch YARN-2257 has a proposal to extend and share the queue placement rules for the fair scheduler and the capacity scheduler. This is a good long-term solution to streamline queue placement for both schedulers, but it has core infra work that has to happen first and might require changes to current features in all schedulers, along with corresponding configuration changes, if any. I would like to propose a change with a smaller scope in the capacity scheduler that addresses the core use cases for implicitly mapping jobs that have the default queue or no queue specified to specific queues, based on the submitting user and user groups. It will be useful in a number of real-world scenarios and can be migrated over to the unified scheme when YARN-2257 becomes available. The proposal is to add two new configuration options: yarn.scheduler.capacity.queue-mappings-override.enable A boolean that controls whether user-specified queues can be overridden by the mapping; default is false. and, yarn.scheduler.capacity.queue-mappings A string that specifies a list of mappings in the following format (default is empty, which is the same as no mapping): map_specifier:source_attribute:queue_name[,map_specifier:source_attribute:queue_name]* map_specifier := user (u) | group (g) source_attribute := user | group | %user queue_name := the name of the mapped queue | %user | %primary_group The mappings will be evaluated left to right, and the first valid mapping will be used. If the mapped queue does not exist, or the current user does not have permission to submit jobs to the mapped queue, the submission will fail. Example usages: 1. To map user1 to queue1 and group1 to queue2: u:user1:queue1,g:group1:queue2 2. To map users to queues with the same name as the user: u:%user:%user I am happy to volunteer to take this up. -- This message was sent by Atlassian JIRA (v6.2#6252)
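The left-to-right, first-match evaluation proposed above can be sketched like this. It is an illustration of the rule syntax only, not the actual CapacityScheduler implementation, and omits the override flag and queue-existence/ACL checks:

```java
public class QueueMappingSketch {
    // Evaluates mappings of the form (u|g):source:queue, left to right;
    // the first rule matching the submitter wins.
    static String mapQueue(String mappings, String user, String primaryGroup,
                           String requestedQueue) {
        if (mappings == null || mappings.isEmpty()) {
            return requestedQueue; // empty mapping list means no mapping
        }
        for (String rule : mappings.split(",")) {
            String[] parts = rule.split(":");
            String specifier = parts[0], source = parts[1], queue = parts[2];
            boolean matches =
                ("u".equals(specifier)
                    && ("%user".equals(source) || source.equals(user)))
             || ("g".equals(specifier) && source.equals(primaryGroup));
            if (matches) {
                // Placeholders resolve per submitter.
                return queue.replace("%user", user)
                            .replace("%primary_group", primaryGroup);
            }
        }
        return requestedQueue; // no rule matched
    }
}
```

With the example mapping u:user1:queue1,g:group1:queue2, user1 lands in queue1, any member of group1 lands in queue2, and u:%user:%user sends each user to a queue named after them.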
[jira] [Commented] (YARN-2424) LCE should support non-cgroups, non-secure mode
[ https://issues.apache.org/jira/browse/YARN-2424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14100302#comment-14100302 ] Alejandro Abdelnur commented on YARN-2424: -- I think I did; if I'm reading correctly, you are stating that it is better for troubleshooting, especially in multi-tenant scenarios: bq. It's also worth pointing out that one of the key benefits of running tasks as the user who submitted them is that it makes troubleshooting much easier. When one hops on a node, it is evident as to which user's tasks one is looking at it, even if those tasks aren't validated as that user. This is especially important in heavy multi-tenant scenarios. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2077) JobImpl#makeUberDecision doesn't log that Uber mode is disabled because of too much CPUs
[ https://issues.apache.org/jira/browse/YARN-2077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi OZAWA updated YARN-2077: - Affects Version/s: 2.5.0 JobImpl#makeUberDecision doesn't log that Uber mode is disabled because of too much CPUs Key: YARN-2077 URL: https://issues.apache.org/jira/browse/YARN-2077 Project: Hadoop YARN Issue Type: Bug Components: client Affects Versions: 2.4.0, 2.5.0 Reporter: Tsuyoshi OZAWA Assignee: Tsuyoshi OZAWA Priority: Trivial Attachments: YARN-2077.1.patch JobImpl#makeUberDecision usually logs why a job cannot be launched in Uber mode (e.g. too much RAM). For CPUs, however, no reason is currently logged. We should log one when too many CPUs are requested. -- This message was sent by Atlassian JIRA (v6.2#6252)
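A hedged sketch of the requested fix: log a reason for each resource that blocks uberization, so CPU is reported the same way RAM already is. The method signature, messages, and thresholds are illustrative assumptions, not the actual JobImpl code:

```java
public class UberDecisionSketch {
    // Returns true if the job fits within the uber limits; logs a
    // reason for every resource that exceeds its limit, including CPU.
    static boolean makeUberDecision(long requestMb, int requestVcores,
                                    long maxMb, int maxVcores) {
        boolean smallMemory = requestMb <= maxMb;
        boolean smallCpu = requestVcores <= maxVcores;
        if (!smallMemory) {
            System.out.println("Not uberizing job: too much RAM requested");
        }
        if (!smallCpu) {
            System.out.println("Not uberizing job: too many CPUs requested");
        }
        return smallMemory && smallCpu;
    }
}
```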
[jira] [Updated] (YARN-1919) Log yarn.resourcemanager.cluster-id is required for HA instead of throwing NPE
[ https://issues.apache.org/jira/browse/YARN-1919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi OZAWA updated YARN-1919: - Affects Version/s: 2.5.0 Log yarn.resourcemanager.cluster-id is required for HA instead of throwing NPE -- Key: YARN-1919 URL: https://issues.apache.org/jira/browse/YARN-1919 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.3.0, 2.4.0, 2.5.0 Reporter: Devaraj K Assignee: Tsuyoshi OZAWA Priority: Minor Attachments: YARN-1919.1.patch {code:xml} 2014-04-09 16:14:16,392 WARN org.apache.hadoop.service.AbstractService: When stopping the service org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService : java.lang.NullPointerException java.lang.NullPointerException at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.serviceStop(EmbeddedElectorService.java:108) at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221) at org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:52) at org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:80) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:171) at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.serviceInit(AdminService.java:122) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceInit(ResourceManager.java:232) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1038) {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
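The NPE in the stack trace above stems from enabling RM HA without supplying a cluster id. A minimal yarn-site.xml fragment that provides it (the property name is the standard YARN HA setting; the value is a placeholder):

```xml
<!-- Required when yarn.resourcemanager.ha.enabled is true; without it,
     EmbeddedElectorService fails during init as shown above. -->
<property>
  <name>yarn.resourcemanager.cluster-id</name>
  <value>rm-cluster-1</value> <!-- placeholder id -->
</property>
```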