[jira] [Commented] (YARN-2411) [Capacity Scheduler] support simple user and group mappings to queues
[ https://issues.apache.org/jira/browse/YARN-2411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14100316#comment-14100316 ] Wangda Tan commented on YARN-2411: -- Ram, Thanks for updating, LGTM, +1. Wangda [Capacity Scheduler] support simple user and group mappings to queues - Key: YARN-2411 URL: https://issues.apache.org/jira/browse/YARN-2411 Project: Hadoop YARN Issue Type: Improvement Components: capacityscheduler Reporter: Ram Venkatesh Assignee: Ram Venkatesh Attachments: YARN-2411-2.patch, YARN-2411.1.patch, YARN-2411.3.patch, YARN-2411.4.patch, YARN-2411.5.patch YARN-2257 has a proposal to extend and share the queue placement rules for the fair scheduler and the capacity scheduler. This is a good long term solution to streamline queue placement of both schedulers but it has core infra work that has to happen first and might require changes to current features in all schedulers along with corresponding configuration changes, if any. I would like to propose a change with a smaller scope in the capacity scheduler that addresses the core use cases for implicitly mapping jobs that have the default queue or no queue specified to specific queues based on the submitting user and user groups. It will be useful in a number of real-world scenarios and can be migrated over to the unified scheme when YARN-2257 becomes available. The proposal is to add two new configuration options: yarn.scheduler.capacity.queue-mappings-override.enable A boolean that controls if user-specified queues can be overridden by the mapping, default is false. and, yarn.scheduler.capacity.queue-mappings A string that specifies a list of mappings in the following format (default is which is the same as no mapping) map_specifier:source_attribute:queue_name[,map_specifier:source_attribute:queue_name]* map_specifier := user (u) | group (g) source_attribute := user | group | %user queue_name := the name of the mapped queue | %user | %primary_group The mappings will be evaluated left to right, and the first valid mapping will be used. If the mapped queue does not exist, or the current user does not have permissions to submit jobs to the mapped queue, the submission will fail. Example usages: 1. user1 is mapped to queue1, group1 is mapped to queue2 u:user1:queue1,g:group1:queue2 2. To map users to queues with the same name as the user: u:%user:%user I am happy to volunteer to take this up. -- This message was sent by Atlassian JIRA (v6.2#6252)
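For illustration of the two properties proposed above, here is a minimal sketch of setting them programmatically with the standard Hadoop Configuration API. The property names come from the proposal and the mapping string reuses the first example usage; the snippet itself is hypothetical and not part of the attached patches (in practice the values would live in capacity-scheduler.xml).
{code}
import org.apache.hadoop.conf.Configuration;

public class QueueMappingExample {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Allow the mapping to override a queue the user explicitly asked for.
    conf.setBoolean("yarn.scheduler.capacity.queue-mappings-override.enable", true);
    // user1 -> queue1, members of group1 -> queue2 (first matching rule wins).
    conf.set("yarn.scheduler.capacity.queue-mappings",
        "u:user1:queue1,g:group1:queue2");
  }
}
{code}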
[jira] [Updated] (YARN-1919) Log yarn.resourcemanager.cluster-id is required for HA instead of throwing NPE
[ https://issues.apache.org/jira/browse/YARN-1919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi OZAWA updated YARN-1919: - Attachment: YARN-1919.2.patch Refreshed a patch on trunk. Log yarn.resourcemanager.cluster-id is required for HA instead of throwing NPE -- Key: YARN-1919 URL: https://issues.apache.org/jira/browse/YARN-1919 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.3.0, 2.4.0, 2.5.0 Reporter: Devaraj K Assignee: Tsuyoshi OZAWA Priority: Minor Attachments: YARN-1919.1.patch, YARN-1919.2.patch {code:xml} 2014-04-09 16:14:16,392 WARN org.apache.hadoop.service.AbstractService: When stopping the service org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService : java.lang.NullPointerException java.lang.NullPointerException at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.serviceStop(EmbeddedElectorService.java:108) at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221) at org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:52) at org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:80) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:171) at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.serviceInit(AdminService.java:122) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceInit(ResourceManager.java:232) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1038) {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
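A minimal sketch of the kind of guard this issue asks for is shown below: fail fast with a readable message when yarn.resourcemanager.cluster-id is missing, instead of hitting the NPE later in serviceStop. The helper is illustrative only and not taken from the attached patches.
{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.conf.YarnConfiguration;
import org.apache.hadoop.yarn.exceptions.YarnRuntimeException;

public class ClusterIdCheck {
  static String getClusterId(Configuration conf) {
    String clusterId = conf.get(YarnConfiguration.RM_CLUSTER_ID);
    if (clusterId == null || clusterId.trim().isEmpty()) {
      // Report a clear configuration error instead of throwing an NPE later on.
      throw new YarnRuntimeException(YarnConfiguration.RM_CLUSTER_ID
          + " is required when HA is enabled");
    }
    return clusterId;
  }
}
{code}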
[jira] [Commented] (YARN-2411) [Capacity Scheduler] support simple user and group mappings to queues
[ https://issues.apache.org/jira/browse/YARN-2411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14100354#comment-14100354 ] Hudson commented on YARN-2411: -- FAILURE: Integrated in Hadoop-trunk-Commit #6084 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/6084/]) YARN-2411. Support simple user and group mappings to queues. Contributed by Ram Venkatesh (jianhe: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1618542) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/conf/capacity-scheduler.xml * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacityScheduler.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacitySchedulerConfiguration.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/TestQueueMappings.java [Capacity Scheduler] support simple user and group mappings to queues - Key: YARN-2411 URL: https://issues.apache.org/jira/browse/YARN-2411 Project: Hadoop YARN Issue Type: Improvement Components: capacityscheduler Reporter: Ram Venkatesh Assignee: Ram Venkatesh Fix For: 2.6.0 Attachments: YARN-2411-2.patch, YARN-2411.1.patch, YARN-2411.3.patch, YARN-2411.4.patch, YARN-2411.5.patch YARN-2257 has a proposal to extend and share the queue placement rules for the fair scheduler and the capacity scheduler. This is a good long term solution to streamline queue placement of both schedulers but it has core infra work that has to happen first and might require changes to current features in all schedulers along with corresponding configuration changes, if any. I would like to propose a change with a smaller scope in the capacity scheduler that addresses the core use cases for implicitly mapping jobs that have the default queue or no queue specified to specific queues based on the submitting user and user groups. It will be useful in a number of real-world scenarios and can be migrated over to the unified scheme when YARN-2257 becomes available. The proposal is to add two new configuration options: yarn.scheduler.capacity.queue-mappings-override.enable A boolean that controls if user-specified queues can be overridden by the mapping, default is false. and, yarn.scheduler.capacity.queue-mappings A string that specifies a list of mappings in the following format (default is which is the same as no mapping) map_specifier:source_attribute:queue_name[,map_specifier:source_attribute:queue_name]* map_specifier := user (u) | group (g) source_attribute := user | group | %user queue_name := the name of the mapped queue | %user | %primary_group The mappings will be evaluated left to right, and the first valid mapping will be used. If the mapped queue does not exist, or the current user does not have permissions to submit jobs to the mapped queue, the submission will fail. Example usages: 1. user1 is mapped to queue1, group1 is mapped to queue2 u:user1:queue1,g:group1:queue2 2. To map users to queues with the same name as the user: u:%user:%user I am happy to volunteer to take this up. 
-- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2033) Investigate merging generic-history into the Timeline Store
[ https://issues.apache.org/jira/browse/YARN-2033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14100356#comment-14100356 ] Zhijie Shen commented on YARN-2033: --- [~djp], thanks for your comments. I addressed most of your comments in the new patch, and fixed one bug I found locally. Below are some responses w.r.t. your concerns. bq. Why do we need getApplication(appAttemptId.getApplicationId(), ApplicationReportField.NONE) here? Because I want to check whether the application exists in the timeline store or not, before retrieving the application attempt information. If the application doesn't exist, we need to throw ApplicationNotFoundException. BTW, in YARN-1250, getting the app is going to be required for each API, because we need to check whether the user has access to this application or not. bq. If user's config is slightly wrong (let's assume: YarnConfiguration.APPLICATION_HISTORY_STORE != null, YarnConfiguration.RM_METRICS_PUBLISHER_ENABLED=true), then here we disable yarnMetricsEnabled sliently which make trouble-shooting effort a little harder. Suggest to log warn messages when user's wrong configuration happens. Better to move logic operations inside of if() to a separated method and log the error for wrong configuration. I rethought the backward compatibility, and I think it's not good to rely on checking APPLICATION_HISTORY_STORE, because its default is already the FS-based history store. The users may use this store without explicitly setting it in their config file. Instead, I think it's more reasonable to check APPLICATION_HISTORY_ENABLED to determine whether the user is using the old history store, because it is false by default. Unless the user sets it explicitly in the config file, he's not able to use the old history store. Therefore I changed the logic in YarnClientImpl, ApplicationHistoryServer and YarnMetricsPublisher to rely on APPLICATION_HISTORY_ENABLED for backward compatibility. Per the suggestion, if the old history service stack is used, a warn-level log will be recorded. In addition, when APPLICATION_HISTORY_ENABLED = true, YarnMetricsPublisher cannot be enabled, preventing RMApplicationHistoryWriter and YarnMetricsPublisher from writing the application history simultaneously. bq. The method of convertToApplicationReport seems a little too sophisticated in creating applicationReport. Another option is to wrapper it as Builder pattern (plz refer in MiniDFSCluster) should be better. I agree the builder pattern would be cleaner, but it seems to require changing the Report classes, which currently use newInstance to construct the instance. Let's file a separate Jira to deal with building a big record with quite a few fields. bq. We should replace hadoop.tmp.dir and /yarn/timeline/generic-history with constant string in YarnConfiguration. BTW, hadoop.tmp.dir may not be necessary? This is because conf.get(hadoop.tmp.dir) cannot be determined in advance. bq. For public API (although marked as unstable), adding a new exception will break compatibility of RPC as old version client don't know how to deal with new exception. ApplicationContext is actually not an RPC interface, but is used internally in the server daemons. We previously refactored the code and created this common interface for RM and GHS to source the application/attempt/container report(s) (although RM still pulls the information from RMContext directly), so that we could use the same CLI/webUI/service but hook onto different data sources.
Anyway, the annotations here are misleading, so I deleted them. bq. I am not sure if this change (and other changes in this class) is necessary. If not, we can remove it. I did this intentionally. In fact, I wanted to discard {code} protected int allocatedMB; protected int allocatedVCores; {code} because the history information doesn't include the runtime resource usage information. If we keep the two fields here, in the web services output we will always see allocatedMB=0 and allocatedVCores=0. bq. We already have the same implementation of MultiThreadedDispatcher in RMApplicationHistoryWriter.java. That's right. Again, it's duplicated on purpose. After this patch, I'm going to deprecate the classes of the old generic history read/write layer, including RMApplicationHistoryWriter (YARN-2320), so that in the next big release (e.g. Hadoop 3.0) we can remove the deprecated code. MultiThreadedDispatcher should be a sub-component of YarnMetricsPublisher unless it is going to be used by other components. If that happens, we can promote it to a first-class citizen. Investigate merging generic-history into the Timeline Store --- Key: YARN-2033 URL: https://issues.apache.org/jira/browse/YARN-2033 Project: Hadoop YARN Issue
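As a rough illustration of the backward-compatibility check discussed in the comment above, a minimal sketch follows. The property name is spelled out as a string literal and is an assumption about how the flag is named; the actual constants and wiring in the patch may differ.
{code}
import org.apache.hadoop.conf.Configuration;

public class HistoryConfigCheck {
  // Returns true when the old generic-history store should stay in charge.
  static boolean useOldHistoryStore(Configuration conf) {
    boolean oldHistoryEnabled = conf.getBoolean(
        "yarn.timeline-service.generic-application-history.enabled", false);  // assumed name
    if (oldHistoryEnabled) {
      // Old store explicitly requested: warn, and keep the timeline-based
      // publisher off so both writers never run at the same time.
      System.err.println("WARN: old application history store is enabled; "
          + "the timeline-based metrics publisher will be disabled");
    }
    return oldHistoryEnabled;
  }
}
{code}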
[jira] [Updated] (YARN-2033) Investigate merging generic-history into the Timeline Store
[ https://issues.apache.org/jira/browse/YARN-2033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-2033: -- Attachment: YARN-2033.6.patch Investigate merging generic-history into the Timeline Store --- Key: YARN-2033 URL: https://issues.apache.org/jira/browse/YARN-2033 Project: Hadoop YARN Issue Type: Sub-task Reporter: Vinod Kumar Vavilapalli Assignee: Zhijie Shen Attachments: ProposalofStoringYARNMetricsintotheTimelineStore.pdf, YARN-2033.1.patch, YARN-2033.2.patch, YARN-2033.3.patch, YARN-2033.4.patch, YARN-2033.5.patch, YARN-2033.6.patch, YARN-2033.Prototype.patch, YARN-2033_ALL.1.patch, YARN-2033_ALL.2.patch, YARN-2033_ALL.3.patch, YARN-2033_ALL.4.patch Having two different stores isn't amicable to generic insights on what's happening with applications. This is to investigate porting generic-history into the Timeline Store. One goal is to try and retain most of the client side interfaces as close to what we have today. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1514) Utility to benchmark ZKRMStateStore#loadState for ResourceManager-HA
[ https://issues.apache.org/jira/browse/YARN-1514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi OZAWA updated YARN-1514: - Attachment: YARN-1514.3.patch Refreshed the v2 patch on trunk. Utility to benchmark ZKRMStateStore#loadState for ResourceManager-HA Key: YARN-1514 URL: https://issues.apache.org/jira/browse/YARN-1514 Project: Hadoop YARN Issue Type: Sub-task Reporter: Tsuyoshi OZAWA Assignee: Tsuyoshi OZAWA Fix For: 2.6.0 Attachments: YARN-1514.1.patch, YARN-1514.2.patch, YARN-1514.3.patch, YARN-1514.wip-2.patch, YARN-1514.wip.patch ZKRMStateStore is very sensitive to ZNode-related operations as discussed in YARN-1307, YARN-1378 and so on. Especially, ZKRMStateStore#loadState is called when RM-HA cluster does failover. Therefore, its execution time impacts failover time of RM-HA. We need utility to benchmark time execution time of ZKRMStateStore#loadStore as development tool. -- This message was sent by Atlassian JIRA (v6.2#6252)
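For context on what such a benchmark measures, a minimal hypothetical sketch is shown below: it simply times a single RMStateStore#loadState call against a pre-populated store. The class and wiring are invented for illustration and do not reflect the attached patch.
{code}
import org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore;

public class LoadStateBenchmark {
  // Measures how long one loadState() call takes, in milliseconds.
  static long timeLoadState(RMStateStore store) throws Exception {
    long start = System.currentTimeMillis();
    store.loadState();
    return System.currentTimeMillis() - start;
  }
}
{code}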
[jira] [Resolved] (YARN-1348) Batching optimization for ZKRMStateStore
[ https://issues.apache.org/jira/browse/YARN-1348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi OZAWA resolved YARN-1348. -- Resolution: Fixed Batching optimization for ZKRMStateStore Key: YARN-1348 URL: https://issues.apache.org/jira/browse/YARN-1348 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Tsuyoshi OZAWA Assignee: Tsuyoshi OZAWA Labels: ha We rethought the znode structure in YARN-1307. We can reduce the number of znodes for DelegationKey and DelegationToken by batching their store operations. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1326) RM should log using RMStore at startup time
[ https://issues.apache.org/jira/browse/YARN-1326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi OZAWA updated YARN-1326: - Affects Version/s: 2.5.0 RM should log using RMStore at startup time --- Key: YARN-1326 URL: https://issues.apache.org/jira/browse/YARN-1326 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.5.0 Reporter: Tsuyoshi OZAWA Assignee: Tsuyoshi OZAWA Attachments: YARN-1326.1.patch Original Estimate: 3h Remaining Estimate: 3h Currently there are no way to know which RMStore RM uses. It's useful to log the information at RM's startup time. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1753) FileSystemApplicationHistoryStore#HistoryFileReader#next() should check return value of dis.read()
[ https://issues.apache.org/jira/browse/YARN-1753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-1753: -- Issue Type: Sub-task (was: Bug) Parent: YARN-321 FileSystemApplicationHistoryStore#HistoryFileReader#next() should check return value of dis.read() -- Key: YARN-1753 URL: https://issues.apache.org/jira/browse/YARN-1753 Project: Hadoop YARN Issue Type: Sub-task Reporter: Ted Yu Priority: Minor Attachments: YARN-1753.patch Here is related code: {code} byte[] value = new byte[entry.getValueLength()]; dis.read(value); {code} entry.getValueLength() bytes are expected to be read. The return value from dis.read() should be checked against value length. -- This message was sent by Atlassian JIRA (v6.2#6252)
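A conventional way to satisfy this check, sketched below, is to use readFully, which either fills the buffer completely or throws EOFException. This is a generic illustration, not necessarily what the attached patch does.
{code}
import java.io.DataInputStream;
import java.io.IOException;

public class ReadEntryValue {
  static byte[] readValue(DataInputStream dis, int valueLength) throws IOException {
    byte[] value = new byte[valueLength];
    // readFully reads exactly valueLength bytes or throws EOFException,
    // unlike read(), whose return value may be less than the buffer size.
    dis.readFully(value);
    return value;
  }
}
{code}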
[jira] [Commented] (YARN-1348) Batching optimization for ZKRMStateStore
[ https://issues.apache.org/jira/browse/YARN-1348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14100394#comment-14100394 ] Tsuyoshi OZAWA commented on YARN-1348: -- This ticket looks obsolete and has already been implemented. Closing this as resolved. Batching optimization for ZKRMStateStore Key: YARN-1348 URL: https://issues.apache.org/jira/browse/YARN-1348 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Tsuyoshi OZAWA Assignee: Tsuyoshi OZAWA Labels: ha We rethought the znode structure in YARN-1307. We can reduce the number of znodes for DelegationKey and DelegationToken by batching their store operations. -- This message was sent by Atlassian JIRA (v6.2#6252)
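For reference, batching znode writes of this kind is typically done with ZooKeeper's multi() API. The sketch below is a hypothetical illustration (paths and data are placeholders) rather than the actual ZKRMStateStore implementation.
{code}
import java.util.ArrayList;
import java.util.List;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.Op;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class BatchedZnodeStore {
  // Creates several znodes in one round trip instead of one call per znode.
  static void storeBatch(ZooKeeper zk, List<String> paths, byte[] data) throws Exception {
    List<Op> ops = new ArrayList<Op>();
    for (String path : paths) {
      ops.add(Op.create(path, data, ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT));
    }
    zk.multi(ops);  // all creates succeed or fail atomically
  }
}
{code}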
[jira] [Commented] (YARN-1753) FileSystemApplicationHistoryStore#HistoryFileReader#next() should check return value of dis.read()
[ https://issues.apache.org/jira/browse/YARN-1753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14100404#comment-14100404 ] Hadoop QA commented on YARN-1753: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12662438/YARN-1753.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4659//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4659//console This message is automatically generated. FileSystemApplicationHistoryStore#HistoryFileReader#next() should check return value of dis.read() -- Key: YARN-1753 URL: https://issues.apache.org/jira/browse/YARN-1753 Project: Hadoop YARN Issue Type: Sub-task Reporter: Ted Yu Priority: Minor Attachments: YARN-1753.patch Here is related code: {code} byte[] value = new byte[entry.getValueLength()]; dis.read(value); {code} entry.getValueLength() bytes are expected to be read. The return value from dis.read() should be checked against value length. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1919) Log yarn.resourcemanager.cluster-id is required for HA instead of throwing NPE
[ https://issues.apache.org/jira/browse/YARN-1919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14100406#comment-14100406 ] Hadoop QA commented on YARN-1919: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12662434/YARN-1919.2.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4658//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4658//console This message is automatically generated. Log yarn.resourcemanager.cluster-id is required for HA instead of throwing NPE -- Key: YARN-1919 URL: https://issues.apache.org/jira/browse/YARN-1919 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.3.0, 2.4.0, 2.5.0 Reporter: Devaraj K Assignee: Tsuyoshi OZAWA Priority: Minor Attachments: YARN-1919.1.patch, YARN-1919.2.patch {code:xml} 2014-04-09 16:14:16,392 WARN org.apache.hadoop.service.AbstractService: When stopping the service org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService : java.lang.NullPointerException java.lang.NullPointerException at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.serviceStop(EmbeddedElectorService.java:108) at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221) at org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:52) at org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:80) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:171) at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.serviceInit(AdminService.java:122) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceInit(ResourceManager.java:232) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1038) {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1514) Utility to benchmark ZKRMStateStore#loadState for ResourceManager-HA
[ https://issues.apache.org/jira/browse/YARN-1514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14100442#comment-14100442 ] Hadoop QA commented on YARN-1514: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12662441/YARN-1514.3.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 2 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.yarn.server.resourcemanager.recovery.TestZKRMStateStorePerf {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4661//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4661//console This message is automatically generated. Utility to benchmark ZKRMStateStore#loadState for ResourceManager-HA Key: YARN-1514 URL: https://issues.apache.org/jira/browse/YARN-1514 Project: Hadoop YARN Issue Type: Sub-task Reporter: Tsuyoshi OZAWA Assignee: Tsuyoshi OZAWA Fix For: 2.6.0 Attachments: YARN-1514.1.patch, YARN-1514.2.patch, YARN-1514.3.patch, YARN-1514.4.patch, YARN-1514.wip-2.patch, YARN-1514.wip.patch ZKRMStateStore is very sensitive to ZNode-related operations as discussed in YARN-1307, YARN-1378 and so on. Especially, ZKRMStateStore#loadState is called when RM-HA cluster does failover. Therefore, its execution time impacts failover time of RM-HA. We need utility to benchmark time execution time of ZKRMStateStore#loadStore as development tool. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1514) Utility to benchmark ZKRMStateStore#loadState for ResourceManager-HA
[ https://issues.apache.org/jira/browse/YARN-1514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi OZAWA updated YARN-1514: - Attachment: YARN-1514.4.patch Forgot to add YarnTestDriver.java. This patch includes it. Utility to benchmark ZKRMStateStore#loadState for ResourceManager-HA Key: YARN-1514 URL: https://issues.apache.org/jira/browse/YARN-1514 Project: Hadoop YARN Issue Type: Sub-task Reporter: Tsuyoshi OZAWA Assignee: Tsuyoshi OZAWA Fix For: 2.6.0 Attachments: YARN-1514.1.patch, YARN-1514.2.patch, YARN-1514.3.patch, YARN-1514.4.patch, YARN-1514.wip-2.patch, YARN-1514.wip.patch ZKRMStateStore is very sensitive to ZNode-related operations as discussed in YARN-1307, YARN-1378 and so on. Especially, ZKRMStateStore#loadState is called when RM-HA cluster does failover. Therefore, its execution time impacts failover time of RM-HA. We need utility to benchmark time execution time of ZKRMStateStore#loadStore as development tool. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1326) RM should log using RMStore at startup time
[ https://issues.apache.org/jira/browse/YARN-1326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14100444#comment-14100444 ] Tsuyoshi OZAWA commented on YARN-1326: -- Thanks for your review, Karthik and Vinod. I'll update it. RM should log using RMStore at startup time --- Key: YARN-1326 URL: https://issues.apache.org/jira/browse/YARN-1326 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.5.0 Reporter: Tsuyoshi OZAWA Assignee: Tsuyoshi OZAWA Attachments: YARN-1326.1.patch Original Estimate: 3h Remaining Estimate: 3h Currently there are no way to know which RMStore RM uses. It's useful to log the information at RM's startup time. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2033) Investigate merging generic-history into the Timeline Store
[ https://issues.apache.org/jira/browse/YARN-2033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14100450#comment-14100450 ] Hadoop QA commented on YARN-2033: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12662437/YARN-2033.6.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 17 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.yarn.client.TestResourceTrackerOnHA org.apache.hadoop.yarn.client.TestApplicationMasterServiceOnHA org.apache.hadoop.yarn.client.TestRMFailover org.apache.hadoop.yarn.client.TestApplicationClientProtocolOnHA org.apache.hadoop.yarn.server.resourcemanager.TestRMEmbeddedElector org.apache.hadoop.yarn.server.resourcemanager.recovery.TestZKRMStateStoreZKClientConnections {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4660//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4660//console This message is automatically generated. Investigate merging generic-history into the Timeline Store --- Key: YARN-2033 URL: https://issues.apache.org/jira/browse/YARN-2033 Project: Hadoop YARN Issue Type: Sub-task Reporter: Vinod Kumar Vavilapalli Assignee: Zhijie Shen Attachments: ProposalofStoringYARNMetricsintotheTimelineStore.pdf, YARN-2033.1.patch, YARN-2033.2.patch, YARN-2033.3.patch, YARN-2033.4.patch, YARN-2033.5.patch, YARN-2033.6.patch, YARN-2033.Prototype.patch, YARN-2033_ALL.1.patch, YARN-2033_ALL.2.patch, YARN-2033_ALL.3.patch, YARN-2033_ALL.4.patch Having two different stores isn't amicable to generic insights on what's happening with applications. This is to investigate porting generic-history into the Timeline Store. One goal is to try and retain most of the client side interfaces as close to what we have today. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1514) Utility to benchmark ZKRMStateStore#loadState for ResourceManager-HA
[ https://issues.apache.org/jira/browse/YARN-1514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi OZAWA updated YARN-1514: - Attachment: YARN-1514.4.patch Fixed the test failure. Utility to benchmark ZKRMStateStore#loadState for ResourceManager-HA Key: YARN-1514 URL: https://issues.apache.org/jira/browse/YARN-1514 Project: Hadoop YARN Issue Type: Sub-task Reporter: Tsuyoshi OZAWA Assignee: Tsuyoshi OZAWA Fix For: 2.6.0 Attachments: YARN-1514.1.patch, YARN-1514.2.patch, YARN-1514.3.patch, YARN-1514.4.patch, YARN-1514.4.patch, YARN-1514.wip-2.patch, YARN-1514.wip.patch ZKRMStateStore is very sensitive to ZNode-related operations as discussed in YARN-1307, YARN-1378 and so on. Especially, ZKRMStateStore#loadState is called when RM-HA cluster does failover. Therefore, its execution time impacts failover time of RM-HA. We need utility to benchmark time execution time of ZKRMStateStore#loadStore as development tool. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1514) Utility to benchmark ZKRMStateStore#loadState for ResourceManager-HA
[ https://issues.apache.org/jira/browse/YARN-1514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14100499#comment-14100499 ] Hadoop QA commented on YARN-1514: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12662456/YARN-1514.4.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 3 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.yarn.server.resourcemanager.recovery.TestZKRMStateStorePerf org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestCapacityScheduler {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4662//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4662//console This message is automatically generated. Utility to benchmark ZKRMStateStore#loadState for ResourceManager-HA Key: YARN-1514 URL: https://issues.apache.org/jira/browse/YARN-1514 Project: Hadoop YARN Issue Type: Sub-task Reporter: Tsuyoshi OZAWA Assignee: Tsuyoshi OZAWA Fix For: 2.6.0 Attachments: YARN-1514.1.patch, YARN-1514.2.patch, YARN-1514.3.patch, YARN-1514.4.patch, YARN-1514.4.patch, YARN-1514.wip-2.patch, YARN-1514.wip.patch ZKRMStateStore is very sensitive to ZNode-related operations as discussed in YARN-1307, YARN-1378 and so on. Especially, ZKRMStateStore#loadState is called when RM-HA cluster does failover. Therefore, its execution time impacts failover time of RM-HA. We need utility to benchmark time execution time of ZKRMStateStore#loadStore as development tool. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2411) [Capacity Scheduler] support simple user and group mappings to queues
[ https://issues.apache.org/jira/browse/YARN-2411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14100526#comment-14100526 ] Hudson commented on YARN-2411: -- FAILURE: Integrated in Hadoop-Yarn-trunk #650 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/650/]) YARN-2411. Support simple user and group mappings to queues. Contributed by Ram Venkatesh (jianhe: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1618542) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/conf/capacity-scheduler.xml * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacityScheduler.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacitySchedulerConfiguration.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/TestQueueMappings.java [Capacity Scheduler] support simple user and group mappings to queues - Key: YARN-2411 URL: https://issues.apache.org/jira/browse/YARN-2411 Project: Hadoop YARN Issue Type: Improvement Components: capacityscheduler Reporter: Ram Venkatesh Assignee: Ram Venkatesh Fix For: 2.6.0 Attachments: YARN-2411-2.patch, YARN-2411.1.patch, YARN-2411.3.patch, YARN-2411.4.patch, YARN-2411.5.patch YARN-2257 has a proposal to extend and share the queue placement rules for the fair scheduler and the capacity scheduler. This is a good long term solution to streamline queue placement of both schedulers but it has core infra work that has to happen first and might require changes to current features in all schedulers along with corresponding configuration changes, if any. I would like to propose a change with a smaller scope in the capacity scheduler that addresses the core use cases for implicitly mapping jobs that have the default queue or no queue specified to specific queues based on the submitting user and user groups. It will be useful in a number of real-world scenarios and can be migrated over to the unified scheme when YARN-2257 becomes available. The proposal is to add two new configuration options: yarn.scheduler.capacity.queue-mappings-override.enable A boolean that controls if user-specified queues can be overridden by the mapping, default is false. and, yarn.scheduler.capacity.queue-mappings A string that specifies a list of mappings in the following format (default is which is the same as no mapping) map_specifier:source_attribute:queue_name[,map_specifier:source_attribute:queue_name]* map_specifier := user (u) | group (g) source_attribute := user | group | %user queue_name := the name of the mapped queue | %user | %primary_group The mappings will be evaluated left to right, and the first valid mapping will be used. If the mapped queue does not exist, or the current user does not have permissions to submit jobs to the mapped queue, the submission will fail. Example usages: 1. user1 is mapped to queue1, group1 is mapped to queue2 u:user1:queue1,g:group1:queue2 2. To map users to queues with the same name as the user: u:%user:%user I am happy to volunteer to take this up. 
-- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (YARN-2425) When an application is submitted via the YARN RM web services, log aggregation does not happen
Karam Singh created YARN-2425: - Summary: When Application submitted by via Yarn RM WS, log aggregation does not happens Key: YARN-2425 URL: https://issues.apache.org/jira/browse/YARN-2425 Project: Hadoop YARN Issue Type: Bug Components: log-aggregation Affects Versions: 2.5.0, 2.6.0 Environment: Secure (Kerberos enabled) hadoop cluster. With SPNEGO for Yarn RM enabled Reporter: Karam Singh When submit App to Yarn RM using Web service we need to pass credentials/tokens in json object/xml object to webservice As HDFS namenode does not provides any DT over WS (base64 encoded) like webhdfs/timeline server does. (HDFS fetch dt commad fetch java writable object and writes it to target file, we we cannot forward via application Submission WS objects) Looks like there is not way to pass HDFS token to NodeManager. While starting Application container also tries to create Application log aggregation dir and fails with following type exception {code} java.io.IOException: Failed on local exception: java.io.IOException: org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[TOKEN, KERBEROS]; Host Details : local host is: hostname/ip; destination host is: NameNodeHost:FSPort; at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:764) at org.apache.hadoop.ipc.Client.call(Client.java:1415) at org.apache.hadoop.ipc.Client.call(Client.java:1364) at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206) at com.sun.proxy.$Proxy34.getFileInfo(Unknown Source) at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:725) at sun.reflect.GeneratedMethodAccessor9.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187) at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102) at com.sun.proxy.$Proxy35.getFileInfo(Unknown Source) at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:1781) at org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1069) at org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1065) at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1065) at org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.checkExists(LogAggregationService.java:240) at org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.access$100(LogAggregationService.java:64) at org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService$1.run(LogAggregationService.java:268) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614) at org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.createAppDir(LogAggregationService.java:253) at org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.initAppAggregator(LogAggregationService.java:344) at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.initApp(LogAggregationService.java:310) at org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.handle(LogAggregationService.java:421) at org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.handle(LogAggregationService.java:64) at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173) at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106) at java.lang.Thread.run(Thread.java:745) Caused by: java.io.IOException: org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[TOKEN, KERBEROS] at org.apache.hadoop.ipc.Client$Connection$1.run(Client.java:679) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614) at
[jira] [Commented] (YARN-160) nodemanagers should obtain cpu/memory values from underlying OS
[ https://issues.apache.org/jira/browse/YARN-160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14100593#comment-14100593 ] Junping Du commented on YARN-160: - Thanks [~vvasudev] for working on this. I just took a quick glance; a few comments: - The old way of configuring NM resources is still useful, especially when there are other agents running (like an HBase RegionServer). Thus, users need the flexibility to calculate resources themselves in some cases, so we should provide a new option instead of removing the old way completely. - Given this is a new feature, we shouldn't change a cluster's behavior with the old configuration from an upgrade perspective. We should keep the previous configuration working as usual, especially when users use the default settings. nodemanagers should obtain cpu/memory values from underlying OS --- Key: YARN-160 URL: https://issues.apache.org/jira/browse/YARN-160 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Affects Versions: 2.0.3-alpha Reporter: Alejandro Abdelnur Assignee: Varun Vasudev Fix For: 2.6.0 Attachments: apache-yarn-160.0.patch As mentioned in YARN-2 *NM memory and CPU configs* Currently these values are coming from the config of the NM, we should be able to obtain those values from the OS (ie, in the case of Linux from /proc/meminfo /proc/cpuinfo). As this is highly OS dependent we should have an interface that obtains this information. In addition implementations of this interface should be able to specify a mem/cpu offset (amount of mem/cpu not to be avail as YARN resource), this would allow to reserve mem/cpu for the OS and other services outside of YARN containers. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (YARN-2426) NodeManager is not able to use the WebHDFS token properly to talk to WebHDFS while localizing
Karam Singh created YARN-2426: - Summary: NodeManager is not able to use the WebHDFS token properly to talk to WebHDFS while localizing Key: YARN-2426 URL: https://issues.apache.org/jira/browse/YARN-2426 Project: Hadoop YARN Issue Type: Bug Components: nodemanager, resourcemanager, webapp Affects Versions: 2.6.0 Environment: Hadoop Kerberos (secure) cluster with LinuxContainerExecutor enabled, with SPNEGO on for the new YARN RM web services for application submission. While using kinit we pass -C (to specify the cache path), and before executing we set export KRB5CCNAME to the path provided with -C, so there is no Kerberos ticket in the default KRB5 cache path, which is /tmp. Reporter: Karam Singh Encountered this issue while using YARN's new RM WS for application submission, on a single-node cluster, while submitting a Distributed Shell application using the RM WS (web service). For this we need to pass the custom script and the AppMaster jar along with the webhdfs token to the NodeManager for localization. The Distributed Shell application was failing because the node was failing to localize the AppMaster jar. Following is the NM log while localizing the AppMaster jar: {code} 2014-08-18 01:53:52,434 INFO authorize.ServiceAuthorizationManager (ServiceAuthorizationManager.java:authorize(114)) - Authorization successful for testing (auth:TOKEN) for protocol=interface org.apache.hadoop.yarn.server.nodemanager.api.LocalizationProtocolPB 2014-08-18 01:53:52,757 INFO localizer.ResourceLocalizationService (ResourceLocalizationService.java:update(1011)) - DEBUG: FAILED { webhdfs://NAMENODEHOST:NAMENODEHTTPPORT/user/JARpPATH, 1408352019488, FILE, null }, Authentication required 2014-08-18 01:53:52,758 INFO localizer.LocalizedResource (LocalizedResource.java:handle(203)) - Resource webhdfs://NAMENODEHOST:NAMENODEHTTPPORT/user/JARPATH(-NM_LOCAL_DIR/usercache/APP_USER/appcache/application_1408351986532_0001/filecache/10/DshellAppMaster.jar) transitioned from DOWNLOADING to FAILED 2014-08-18 01:53:52,758 INFO container.Container (ContainerImpl.java:handle(999)) - Container container_1408351986532_0001_01_01 transitioned from LOCALIZING to LOCALIZATION_FAILED {code} This is similar to what we get when we try to access webhdfs in a secure (Kerberos) cluster without doing kinit. Whereas if we do curl -i -k -s 'http://NAMENODEHOST:NAMENODEHTTPPORT/webhdfs/v1/user/JAR_PATH?op=listStatusdelegation=same webhdfs token used in app submission structure, it works properly. I also tried using http://NAMENODEHOST:NAMENODEHTTPPORT/webhdfs/v1/user/hadoopqa/JAR_PATH in the app submission object instead of the webhdfs:// URI format. Then the NodeManager fails to localize because there is no FileSystem for the http scheme {code} 14-08-18 02:03:31,343 INFO authorize.ServiceAuthorizationManager (ServiceAuthorizationManager.java:authorize(114)) - Authorization successful for testing (auth:TOKEN) for protocol=interface org.apache. hadoop.yarn.server.nodemanager.api.LocalizationProtocolPB 2014-08-18 02:03:31,583 INFO localizer.ResourceLocalizationService (ResourceLocalizationService.java:update(1011)) - DEBUG: FAILED { http://NAMENODEHOST:NAMENODEHTTPPORT/webhdfs/v1/user/JAR_PATH 1408352576841, FILE, null }, No FileSystem for scheme: http 2014-08-18 02:03:31,583 INFO localizer.LocalizedResource (LocalizedResource.java:handle(203)) - Resource http://NAMENODEHOST:NAMENODEHTTPPORT/webhdfs/v1/user/JAR_PATH(-NM_LOCAL_DIR/usercache/APP_USER/appcache/application_1408352544163_0002/filecache/11/DshellAppMaster.jar) transitioned from DOWNLOADING to FAILED {code} Now do kinit without providing the -C option for the KRB5 cache path.
So the ticket goes to the default KRB5 cache, /tmp. Again submit the same application object to the YARN WS, with webhdfs:// URI format paths and the webhdfs token. This time the NM is able to download the jar and the custom shell script, and the application runs fine. It looks like the following is happening: webhdfs looks for a Kerberos ticket on the NM while localizing. 1. In the first case there was no Kerberos ticket in the default cache, so the application failed while localizing the AppMaster jar. 2. In the second case, kinit had already been done and a Kerberos ticket was present in /tmp (the default KRB5 cache), so the AppMaster got localized successfully. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-160) nodemanagers should obtain cpu/memory values from underlying OS
[ https://issues.apache.org/jira/browse/YARN-160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14100621#comment-14100621 ] Varun Vasudev commented on YARN-160: [~djp] {quote} The old way to configure resource of NM is still useful, especially when there are other agents running (like: HBase RegionServer). Thus, user need flexibility to calculate resource themselves in some cases, so we should provide another new option instead of removing old way completely. {quote} The patch supports the old way. If a user has set values for memory and vcores, they're used without looking at the underlying hardware. I've added test cases to verify that behaviour as well. Have I missed a use case? {quote} Given this is a new feature, we shouldn't change cluster's behavior with old configuration in upgrade prospective. We should keep previous configuration work as usual especially when user use some default settings. {quote} There are two scenarios here - 1. A configuration file with custom settings for memory and cpu - nothing will change for these users. 2. A configuration file with no settings for memory and cpu - in this case, the memory and cpu resources will be calculated based on the underlying hardware instead of them being set to 8192 and 8 respectively. Isn't calculating the values from the hardware a better option? If people feel strongly about sticking to 8192 and 8, I don't have any problems changing them but it seems a bit odd. nodemanagers should obtain cpu/memory values from underlying OS --- Key: YARN-160 URL: https://issues.apache.org/jira/browse/YARN-160 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Affects Versions: 2.0.3-alpha Reporter: Alejandro Abdelnur Assignee: Varun Vasudev Fix For: 2.6.0 Attachments: apache-yarn-160.0.patch As mentioned in YARN-2 *NM memory and CPU configs* Currently these values are coming from the config of the NM, we should be able to obtain those values from the OS (ie, in the case of Linux from /proc/meminfo /proc/cpuinfo). As this is highly OS dependent we should have an interface that obtains this information. In addition implementations of this interface should be able to specify a mem/cpu offset (amount of mem/cpu not to be avail as YARN resource), this would allow to reserve mem/cpu for the OS and other services outside of YARN containers. -- This message was sent by Atlassian JIRA (v6.2#6252)
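A rough sketch of the fallback behaviour described in scenario 2 above follows; the helper is hypothetical, and the detected values are assumed to come from whatever hardware probe the patch introduces.
{code}
import org.apache.hadoop.conf.Configuration;

public class NodeResourceDefaults {
  // Use the configured value when present; otherwise fall back to a
  // hardware-derived value supplied by the caller.
  static int memoryMb(Configuration conf, int detectedMemoryMb) {
    int configured = conf.getInt("yarn.nodemanager.resource.memory-mb", -1);
    return configured > 0 ? configured : detectedMemoryMb;
  }

  static int vcores(Configuration conf, int detectedVcores) {
    int configured = conf.getInt("yarn.nodemanager.resource.cpu-vcores", -1);
    return configured > 0 ? configured : detectedVcores;
  }
}
{code}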
[jira] [Commented] (YARN-2033) Investigate merging generic-history into the Timeline Store
[ https://issues.apache.org/jira/browse/YARN-2033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14100623#comment-14100623 ] Junping Du commented on YARN-2033: -- bq. Because I want to check whether the application exists in the timeline store or not, before retrieving the application attempt information. If the application doesn't exist, we need to throw ApplicationNotFoundException. IMO, this is not necessary, as the application should exist in most cases and we don't need to visit LevelDB twice. If the application doesn't exist, we can throw ApplicationNotFoundException when retrieving the app attempt info, can't we? bq. I rethought the backward compatibility, and I think it's not good to rely on checking APPLICATION_HISTORY_STORE, because its default is already the FS-based history store. The users may use this store without explicitly setting it in their config file. Instead, I think it's more reasonable to check APPLICATION_HISTORY_ENABLED to determine whether the user is using the old history store, because it is false by default. Backward compatibility is only one concern I had. Another concern here is the usability of these (old and new) configurations. I just listed one possible wrong configuration above, but didn't want to judge which wrong configuration is more likely to happen. The point is that we should check the combination of related configurations and warn on all wrong combinations. Any concern about doing this? bq. This is because conf.get(hadoop.tmp.dir) cannot be determined in advance. I meant to define hadoop.tmp.dir in YarnConfiguration as something like HADOOP_TMP_DIR, which sounds more uniform when dealing with config. Investigate merging generic-history into the Timeline Store --- Key: YARN-2033 URL: https://issues.apache.org/jira/browse/YARN-2033 Project: Hadoop YARN Issue Type: Sub-task Reporter: Vinod Kumar Vavilapalli Assignee: Zhijie Shen Attachments: ProposalofStoringYARNMetricsintotheTimelineStore.pdf, YARN-2033.1.patch, YARN-2033.2.patch, YARN-2033.3.patch, YARN-2033.4.patch, YARN-2033.5.patch, YARN-2033.6.patch, YARN-2033.Prototype.patch, YARN-2033_ALL.1.patch, YARN-2033_ALL.2.patch, YARN-2033_ALL.3.patch, YARN-2033_ALL.4.patch Having two different stores isn't amicable to generic insights on what's happening with applications. This is to investigate porting generic-history into the Timeline Store. One goal is to try and retain most of the client side interfaces as close to what we have today. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Assigned] (YARN-2425) When an application is submitted via the YARN RM web services, log aggregation does not happen
[ https://issues.apache.org/jira/browse/YARN-2425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Varun Vasudev reassigned YARN-2425: --- Assignee: Varun Vasudev When Application submitted by via Yarn RM WS, log aggregation does not happens -- Key: YARN-2425 URL: https://issues.apache.org/jira/browse/YARN-2425 Project: Hadoop YARN Issue Type: Bug Components: log-aggregation Affects Versions: 2.5.0, 2.6.0 Environment: Secure (Kerberos enabled) hadoop cluster. With SPNEGO for Yarn RM enabled Reporter: Karam Singh Assignee: Varun Vasudev When submit App to Yarn RM using Web service we need to pass credentials/tokens in json object/xml object to webservice As HDFS namenode does not provides any DT over WS (base64 encoded) like webhdfs/timeline server does. (HDFS fetch dt commad fetch java writable object and writes it to target file, we we cannot forward via application Submission WS objects) Looks like there is not way to pass HDFS token to NodeManager. While starting Application container also tries to create Application log aggregation dir and fails with following type exception {code} java.io.IOException: Failed on local exception: java.io.IOException: org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[TOKEN, KERBEROS]; Host Details : local host is: hostname/ip; destination host is: NameNodeHost:FSPort; at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:764) at org.apache.hadoop.ipc.Client.call(Client.java:1415) at org.apache.hadoop.ipc.Client.call(Client.java:1364) at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206) at com.sun.proxy.$Proxy34.getFileInfo(Unknown Source) at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:725) at sun.reflect.GeneratedMethodAccessor9.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187) at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102) at com.sun.proxy.$Proxy35.getFileInfo(Unknown Source) at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:1781) at org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1069) at org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1065) at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1065) at org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.checkExists(LogAggregationService.java:240) at org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.access$100(LogAggregationService.java:64) at org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService$1.run(LogAggregationService.java:268) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1614) at org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.createAppDir(LogAggregationService.java:253) at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.initAppAggregator(LogAggregationService.java:344) at org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.initApp(LogAggregationService.java:310) at org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.handle(LogAggregationService.java:421) at org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService.handle(LogAggregationService.java:64) at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173) at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106) at java.lang.Thread.run(Thread.java:745) Caused by: java.io.IOException: org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[TOKEN, KERBEROS] at
[jira] [Assigned] (YARN-2426) NodeManager is not able to use the WebHDFS token properly to talk to WebHDFS while localizing
[ https://issues.apache.org/jira/browse/YARN-2426?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Varun Vasudev reassigned YARN-2426: --- Assignee: Varun Vasudev NodeManager is not able to use the WebHDFS token properly to talk to WebHDFS while localizing --- Key: YARN-2426 URL: https://issues.apache.org/jira/browse/YARN-2426 Project: Hadoop YARN Issue Type: Bug Components: nodemanager, resourcemanager, webapp Affects Versions: 2.6.0 Environment: Hadoop Kerberos (secure) cluster with LinuxContainerExecutor enabled. With SPNEGO on for the new Yarn RM web services for application submission. While using kinit we are using -C (to specify the cache path), then executing export KRB5CCNAME = path provided with the -C option. There is no Kerberos ticket in the default KRB5 cache path, which is /tmp Reporter: Karam Singh Assignee: Varun Vasudev Encountered this issue while using the new YARN RM WS for application submission, on a single-node cluster while submitting a Distributed Shell application using the RM WS (webservice). For this we need to pass the custom script and AppMaster jar, along with the webhdfs token, to the NodeManager for localization. The Distributed Shell application was failing as the node was failing to localize the AppMaster jar. Following is the NM log while localizing the AppMaster jar: {code} 2014-08-18 01:53:52,434 INFO authorize.ServiceAuthorizationManager (ServiceAuthorizationManager.java:authorize(114)) - Authorization successful for testing (auth:TOKEN) for protocol=interface org.apache.hadoop.yarn.server.nodemanager.api.LocalizationProtocolPB 2014-08-18 01:53:52,757 INFO localizer.ResourceLocalizationService (ResourceLocalizationService.java:update(1011)) - DEBUG: FAILED { webhdfs://NAMENODEHOST:NAMENODEHTTPPORT/user/JARpPATH, 1408352019488, FILE, null }, Authentication required 2014-08-18 01:53:52,758 INFO localizer.LocalizedResource (LocalizedResource.java:handle(203)) - Resource webhdfs://NAMENODEHOST:NAMENODEHTTPPORT/user/JARPATH(-NM_LOCAL_DIR/usercache/APP_USER/appcache/application_1408351986532_0001/filecache/10/DshellAppMaster.jar) transitioned from DOWNLOADING to FAILED 2014-08-18 01:53:52,758 INFO container.Container (ContainerImpl.java:handle(999)) - Container container_1408351986532_0001_01_01 transitioned from LOCALIZING to LOCALIZATION_FAILED {code} This is similar to what we get when we try to access webhdfs in a secure (Kerberos) cluster without doing kinit. Whereas if we do curl -i -k -s 'http://NAMENODEHOST:NAMENODEHTTPPORT/webhdfs/v1/user/JAR_PATH?op=listStatus&delegation=<the same webhdfs token used in the app submission structure>' it works properly. I also tried using http://NAMENODEHOST:NAMENODEHTTPPORT/webhdfs/v1/user/hadoopqa/JAR_PATH in the app submission object instead of the webhdfs:// URI format; then the NodeManager fails to localize as there is no http filesystem scheme {code} 14-08-18 02:03:31,343 INFO authorize.ServiceAuthorizationManager (ServiceAuthorizationManager.java:authorize(114)) - Authorization successful for testing (auth:TOKEN) for protocol=interface org.apache.hadoop.yarn.server.nodemanager.api.LocalizationProtocolPB 2014-08-18 02:03:31,583 INFO localizer.ResourceLocalizationService (ResourceLocalizationService.java:update(1011)) - DEBUG: FAILED { http://NAMENODEHOST:NAMENODEHTTPPORT/webhdfs/v1/user/JAR_PATH 1408352576841, FILE, null }, No FileSystem for scheme: http 2014-08-18 02:03:31,583 INFO localizer.LocalizedResource (LocalizedResource.java:handle(203)) - Resource http://NAMENODEHOST:NAMENODEHTTPPORT/webhdfs/v1/user/JAR_PATH(-NM_LOCAL_DIR/usercache/APP_USER/appcache/application_1408352544163_0002/filecache/11/DshellAppMaster.jar) transitioned from DOWNLOADING to FAILED {code} Now do kinit without providing the -C option for the KRB5 cache path, so the ticket goes to the default KRB5 cache /tmp. Again submit the same application object to the Yarn WS, with webhdfs:// URI format paths and the webhdfs token. This time the NM is able to download the jar and the custom shell script, and the application runs fine. Looks like the following is happening: webhdfs tries to look for a Kerberos ticket on the NM while localizing. 1. In the first case there was no Kerberos ticket in the default cache, so the application failed while localizing the AppMaster jar. 2. In the second case kinit had already been done and a Kerberos ticket was present in /tmp (the default KRB5 cache), so the AppMaster got localized successfully. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Assigned] (YARN-2421) CapacityScheduler still allocates containers to an app in the FINISHING state
[ https://issues.apache.org/jira/browse/YARN-2421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] chang li reassigned YARN-2421: -- Assignee: chang li CapacityScheduler still allocates containers to an app in the FINISHING state - Key: YARN-2421 URL: https://issues.apache.org/jira/browse/YARN-2421 Project: Hadoop YARN Issue Type: Bug Components: scheduler Affects Versions: 2.4.1 Reporter: Thomas Graves Assignee: chang li I saw an instance of a bad application master where it unregistered with the RM but then continued to call into allocate. The RMAppAttempt went to the FINISHING state, but the capacity scheduler kept allocating it containers. We should probably have the capacity scheduler check that the application isn't in one of the terminal states before giving it containers. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-160) nodemanagers should obtain cpu/memory values from underlying OS
[ https://issues.apache.org/jira/browse/YARN-160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14100671#comment-14100671 ] Hadoop QA commented on YARN-160: {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12662475/apache-yarn-160.0.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 3 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-tools/hadoop-gridmix hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4664//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4664//console This message is automatically generated. nodemanagers should obtain cpu/memory values from underlying OS --- Key: YARN-160 URL: https://issues.apache.org/jira/browse/YARN-160 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Affects Versions: 2.0.3-alpha Reporter: Alejandro Abdelnur Assignee: Varun Vasudev Fix For: 2.6.0 Attachments: apache-yarn-160.0.patch As mentioned in YARN-2 *NM memory and CPU configs* Currently these values are coming from the config of the NM, we should be able to obtain those values from the OS (ie, in the case of Linux from /proc/meminfo /proc/cpuinfo). As this is highly OS dependent we should have an interface that obtains this information. In addition implementations of this interface should be able to specify a mem/cpu offset (amount of mem/cpu not to be avail as YARN resource), this would allow to reserve mem/cpu for the OS and other services outside of YARN containers. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-160) nodemanagers should obtain cpu/memory values from underlying OS
[ https://issues.apache.org/jira/browse/YARN-160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14100680#comment-14100680 ] Junping Du commented on YARN-160: - bq. The patch supports the old way. Thanks for the clarification. Yes, I saw the details of getYARNContainerMemoryMB(), which sounds like it honors the previous NM resource configuration. bq. Isn't calculating the values from the hardware a better option? Agree. But if the calculated result is not reasonable (like 0 or a negative value), shall we use the previous NM default value instead? At least, experienced users (especially those testing) already had some expectations even when they don't set any resource value here. nodemanagers should obtain cpu/memory values from underlying OS --- Key: YARN-160 URL: https://issues.apache.org/jira/browse/YARN-160 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Affects Versions: 2.0.3-alpha Reporter: Alejandro Abdelnur Assignee: Varun Vasudev Fix For: 2.6.0 Attachments: apache-yarn-160.0.patch As mentioned in YARN-2 *NM memory and CPU configs* Currently these values are coming from the config of the NM, we should be able to obtain those values from the OS (ie, in the case of Linux from /proc/meminfo /proc/cpuinfo). As this is highly OS dependent we should have an interface that obtains this information. In addition implementations of this interface should be able to specify a mem/cpu offset (amount of mem/cpu not to be avail as YARN resource), this would allow to reserve mem/cpu for the OS and other services outside of YARN containers. -- This message was sent by Atlassian JIRA (v6.2#6252)
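A minimal sketch of the fallback behaviour discussed in the comment above, assuming a hypothetical helper that derives the NM memory from the probed hardware value and falls back to the configured default when the probe is unusable; the method and constant names are illustrative, not the ones in the patch.
{code}
public class NodeMemoryProbeSketch {
  // Matches the historical default of yarn.nodemanager.resource.memory-mb (8 GB).
  static final int DEFAULT_NM_MEMORY_MB = 8192;

  /** Memory to advertise to the RM, in MB. */
  static int yarnContainerMemoryMb(long probedPhysicalMemoryBytes, int reservedMb) {
    long probedMb = probedPhysicalMemoryBytes / (1024 * 1024) - reservedMb;
    // If the hardware probe yields 0 or a negative value, keep the configured default
    // so existing setups (and test clusters) behave as before.
    return probedMb <= 0 ? DEFAULT_NM_MEMORY_MB : (int) Math.min(probedMb, Integer.MAX_VALUE);
  }

  public static void main(String[] args) {
    System.out.println(yarnContainerMemoryMb(64L * 1024 * 1024 * 1024, 2048)); // 63488
    System.out.println(yarnContainerMemoryMb(0, 2048));                        // 8192 (fallback)
  }
}
{code}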
[jira] [Commented] (YARN-2390) Investigating whether generic history service needs to support queue-acls
[ https://issues.apache.org/jira/browse/YARN-2390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14100724#comment-14100724 ] Sunil G commented on YARN-2390: --- Hi [~zjshen] bq. is the right fix to be correcting the ACLs on RM side? +1. Yes, I also feel it will be better if we remove the ACL checks on the RM side for those apps which are completed. If the rmApp state is not *FinalApplicationStatus.UNDEFINED*, such applications must have moved to FAILED/SUCCEEDED/KILLED, and queue ACLs for such applications need not be checked. *ClientRMService#checkAccess* can be modified with this change. If this approach is fine, I would like to take over this JIRA. Kindly let me know your suggestion. Investigating whether generic history service needs to support queue-acls -- Key: YARN-2390 URL: https://issues.apache.org/jira/browse/YARN-2390 Project: Hadoop YARN Issue Type: Sub-task Reporter: Zhijie Shen According to YARN-1250, it's arguable whether queue-acls should be applied to the generic history service as well, because the queue admin may not need access to a completed application that has been removed from the queue. Creating this ticket to tackle the discussion around it. -- This message was sent by Atlassian JIRA (v6.2#6252)
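As an illustration of the check described in the comment above, a minimal standalone sketch (the enum and class here are local stand-ins, not the actual ClientRMService code):
{code}
// Illustrative sketch of the proposal: skip the queue-ACL check once an application
// has reached a terminal final status, since it is no longer in any queue.
enum FinalApplicationStatus { UNDEFINED, SUCCEEDED, FAILED, KILLED }

class QueueAclCheckSketch {
  static boolean checkAccess(boolean queueAclAllows, FinalApplicationStatus finalStatus) {
    if (finalStatus != FinalApplicationStatus.UNDEFINED) {
      // Completed apps are no longer subject to queue ACLs.
      return true;
    }
    // Running apps still go through the regular queue ACL check.
    return queueAclAllows;
  }
}
{code}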
[jira] [Commented] (YARN-415) Capture memory utilization at the app-level for chargeback
[ https://issues.apache.org/jira/browse/YARN-415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14100749#comment-14100749 ] Karthik Kambatla commented on YARN-415: --- [~eepayne] - Sorry again for coming in so late. I am not completely sure myself (yet) how we can use the timeline server or if it makes sense to do that. I guess I need to first understand what we are trying to accomplish here. Could you please correct me/comment on the following items. # The goal is to capture memory utilization at the app-level for chargeback. I like the goal, but would like to understand the usecases we have in mind. Is the chargeback simply to track the usage and may be financially charge the users. Or, is to influence future scheduling decisions? I agree that the RM should facilitate collecting this information, but should the collected info be available to the RM for future use? If not, do we want the RM to serve this info? # Do we want to charge the app only for the resources used to do meaningful work or do we also want to include failed/preempted containers? If we don't charge the app for failed containers, who are they charged to? Are we okay with letting some resources go uncharged? # How soon do we want this usage information? It might make sense to collect/expose this once the app is finished for certain kinds of applications. What is our story for long-running applications? As Jian suggested, I would be up for getting in those parts that we are clear about and file follow-up JIRAs for those that need more discussion. Capture memory utilization at the app-level for chargeback -- Key: YARN-415 URL: https://issues.apache.org/jira/browse/YARN-415 Project: Hadoop YARN Issue Type: New Feature Components: resourcemanager Affects Versions: 0.23.6 Reporter: Kendall Thrapp Assignee: Andrey Klochkov Attachments: YARN-415--n10.patch, YARN-415--n2.patch, YARN-415--n3.patch, YARN-415--n4.patch, YARN-415--n5.patch, YARN-415--n6.patch, YARN-415--n7.patch, YARN-415--n8.patch, YARN-415--n9.patch, YARN-415.201405311749.txt, YARN-415.201406031616.txt, YARN-415.201406262136.txt, YARN-415.201407042037.txt, YARN-415.201407071542.txt, YARN-415.201407171553.txt, YARN-415.201407172144.txt, YARN-415.201407232237.txt, YARN-415.201407242148.txt, YARN-415.201407281816.txt, YARN-415.201408062232.txt, YARN-415.201408080204.txt, YARN-415.201408092006.txt, YARN-415.201408132109.txt, YARN-415.201408150030.txt, YARN-415.patch For the purpose of chargeback, I'd like to be able to compute the cost of an application in terms of cluster resource usage. To start out, I'd like to get the memory utilization of an application. The unit should be MB-seconds or something similar and, from a chargeback perspective, the memory amount should be the memory reserved for the application, as even if the app didn't use all that memory, no one else was able to use it. (reserved ram for container 1 * lifetime of container 1) + (reserved ram for container 2 * lifetime of container 2) + ... + (reserved ram for container n * lifetime of container n) It'd be nice to have this at the app level instead of the job level because: 1. We'd still be able to get memory usage for jobs that crashed (and wouldn't appear on the job history server). 2. We'd be able to get memory usage for future non-MR jobs (e.g. Storm). This new metric should be available both through the RM UI and RM Web Services REST API. -- This message was sent by Atlassian JIRA (v6.2#6252)
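For concreteness, a small worked example of the MB-seconds formula in the YARN-415 description above (container sizes and lifetimes are made up):
{code}
// Hypothetical numbers illustrating:
//   sum over containers of (reserved memory in MB) * (container lifetime in seconds)
public class MemorySecondsExample {
  public static void main(String[] args) {
    long[][] containers = {
        {2048, 600},   // AM container: 2 GB reserved for 10 minutes
        {1024, 300},   // task container: 1 GB reserved for 5 minutes
        {1024, 120}    // failed task container: 1 GB for 2 minutes (still charged)
    };
    long mbSeconds = 0;
    for (long[] c : containers) {
      mbSeconds += c[0] * c[1];
    }
    System.out.println(mbSeconds + " MB-seconds"); // 1658880
  }
}
{code}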
[jira] [Commented] (YARN-2411) [Capacity Scheduler] support simple user and group mappings to queues
[ https://issues.apache.org/jira/browse/YARN-2411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14100763#comment-14100763 ] Hudson commented on YARN-2411: -- FAILURE: Integrated in Hadoop-Hdfs-trunk #1841 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1841/]) YARN-2411. Support simple user and group mappings to queues. Contributed by Ram Venkatesh (jianhe: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1618542) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/conf/capacity-scheduler.xml * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacityScheduler.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacitySchedulerConfiguration.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/TestQueueMappings.java [Capacity Scheduler] support simple user and group mappings to queues - Key: YARN-2411 URL: https://issues.apache.org/jira/browse/YARN-2411 Project: Hadoop YARN Issue Type: Improvement Components: capacityscheduler Reporter: Ram Venkatesh Assignee: Ram Venkatesh Fix For: 2.6.0 Attachments: YARN-2411-2.patch, YARN-2411.1.patch, YARN-2411.3.patch, YARN-2411.4.patch, YARN-2411.5.patch YARN-2257 has a proposal to extend and share the queue placement rules for the fair scheduler and the capacity scheduler. This is a good long term solution to streamline queue placement of both schedulers but it has core infra work that has to happen first and might require changes to current features in all schedulers along with corresponding configuration changes, if any. I would like to propose a change with a smaller scope in the capacity scheduler that addresses the core use cases for implicitly mapping jobs that have the default queue or no queue specified to specific queues based on the submitting user and user groups. It will be useful in a number of real-world scenarios and can be migrated over to the unified scheme when YARN-2257 becomes available. The proposal is to add two new configuration options: yarn.scheduler.capacity.queue-mappings-override.enable A boolean that controls if user-specified queues can be overridden by the mapping, default is false. and, yarn.scheduler.capacity.queue-mappings A string that specifies a list of mappings in the following format (default is which is the same as no mapping) map_specifier:source_attribute:queue_name[,map_specifier:source_attribute:queue_name]* map_specifier := user (u) | group (g) source_attribute := user | group | %user queue_name := the name of the mapped queue | %user | %primary_group The mappings will be evaluated left to right, and the first valid mapping will be used. If the mapped queue does not exist, or the current user does not have permissions to submit jobs to the mapped queue, the submission will fail. Example usages: 1. user1 is mapped to queue1, group1 is mapped to queue2 u:user1:queue1,g:group1:queue2 2. To map users to queues with the same name as the user: u:%user:%user I am happy to volunteer to take this up. 
-- This message was sent by Atlassian JIRA (v6.2#6252)
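As a concrete illustration of the queue-mapping syntax quoted in the YARN-2411 messages above, a sketch of setting the two new properties on a Hadoop Configuration (the user, group, and queue names are made up):
{code}
import org.apache.hadoop.conf.Configuration;

public class QueueMappingExample {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Map user1 to queue1, members of group1 to queue2, and everyone else to a
    // queue named after the submitting user; mappings are evaluated left to right
    // and the first valid one wins.
    conf.set("yarn.scheduler.capacity.queue-mappings",
        "u:user1:queue1,g:group1:queue2,u:%user:%user");
    // Allow the mapping to override a queue the user explicitly specified.
    conf.setBoolean("yarn.scheduler.capacity.queue-mappings-override.enable", true);
    System.out.println(conf.get("yarn.scheduler.capacity.queue-mappings"));
  }
}
{code}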
[jira] [Created] (YARN-2427) Add support for moving apps between queues in RM web services
Varun Vasudev created YARN-2427: --- Summary: Add support for moving apps between queues in RM web services Key: YARN-2427 URL: https://issues.apache.org/jira/browse/YARN-2427 Project: Hadoop YARN Issue Type: New Feature Components: resourcemanager Reporter: Varun Vasudev Assignee: Varun Vasudev Support for moving apps from one queue to another is now present in CapacityScheduler and FairScheduler. We should expose the functionality via RM web services as well. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2411) [Capacity Scheduler] support simple user and group mappings to queues
[ https://issues.apache.org/jira/browse/YARN-2411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14100822#comment-14100822 ] Hudson commented on YARN-2411: -- SUCCESS: Integrated in Hadoop-Mapreduce-trunk #1867 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1867/]) YARN-2411. Support simple user and group mappings to queues. Contributed by Ram Venkatesh (jianhe: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1618542) * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/conf/capacity-scheduler.xml * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacityScheduler.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacitySchedulerConfiguration.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/TestQueueMappings.java [Capacity Scheduler] support simple user and group mappings to queues - Key: YARN-2411 URL: https://issues.apache.org/jira/browse/YARN-2411 Project: Hadoop YARN Issue Type: Improvement Components: capacityscheduler Reporter: Ram Venkatesh Assignee: Ram Venkatesh Fix For: 2.6.0 Attachments: YARN-2411-2.patch, YARN-2411.1.patch, YARN-2411.3.patch, YARN-2411.4.patch, YARN-2411.5.patch YARN-2257 has a proposal to extend and share the queue placement rules for the fair scheduler and the capacity scheduler. This is a good long term solution to streamline queue placement of both schedulers but it has core infra work that has to happen first and might require changes to current features in all schedulers along with corresponding configuration changes, if any. I would like to propose a change with a smaller scope in the capacity scheduler that addresses the core use cases for implicitly mapping jobs that have the default queue or no queue specified to specific queues based on the submitting user and user groups. It will be useful in a number of real-world scenarios and can be migrated over to the unified scheme when YARN-2257 becomes available. The proposal is to add two new configuration options: yarn.scheduler.capacity.queue-mappings-override.enable A boolean that controls if user-specified queues can be overridden by the mapping, default is false. and, yarn.scheduler.capacity.queue-mappings A string that specifies a list of mappings in the following format (default is which is the same as no mapping) map_specifier:source_attribute:queue_name[,map_specifier:source_attribute:queue_name]* map_specifier := user (u) | group (g) source_attribute := user | group | %user queue_name := the name of the mapped queue | %user | %primary_group The mappings will be evaluated left to right, and the first valid mapping will be used. If the mapped queue does not exist, or the current user does not have permissions to submit jobs to the mapped queue, the submission will fail. Example usages: 1. user1 is mapped to queue1, group1 is mapped to queue2 u:user1:queue1,g:group1:queue2 2. To map users to queues with the same name as the user: u:%user:%user I am happy to volunteer to take this up. 
-- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2033) Investigate merging generic-history into the Timeline Store
[ https://issues.apache.org/jira/browse/YARN-2033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14100873#comment-14100873 ] Zhijie Shen commented on YARN-2033: --- The test failures seem to be related to RM HA. bq. IMO, this is not necessary as application should exist in most cases and we don't need to visit LevelDB twice. If application doesn't exist, we can throw ApplicationNotFoundException in retrieving app attempt info. Isn't it? First of all, the two queries are not duplicates: one reads the application entity and the other reads the app attempt entity, and we previously distinguished ApplicationNotFoundException from ApplicationAttemptNotFoundException. It is always possible that App1 exists in the store with the only attempt AppAttempt1 while the user looks up AppAttempt2. In this case, we know App1 is there, but AppAttempt2 isn't, so we will throw ApplicationAttemptNotFoundException. Moreover, when we go on with generic history ACLs, we will anyway visit the app entity once to pull the user info for the access check. bq. The point here is we should check on the combination of related configurations and make all wrong combinations get warned. Any concern on doing this? Right, so in the new patch I've enhanced the configuration check logic to make sure either the old or the new history service stack will be used, but not both. However, I don't cover misconfiguration within the scope of the old history service stack itself, for example, a null ApplicationHistoryStore while the history service is enabled. That didn't work in the previous situation either. bq. I mean to define hadoop.tmp.dir in YarnConfiguration to be something like: HADOOP_TMP_DIR which sounds more uniform when dealing with config. hadoop.tmp.dir shouldn't be part of YarnConfiguration. If it really needs to be added, it should be placed in CommonConfigurationKeys. However, I'm afraid it's not a good idea to do that, either. Let's look into its default. {code} <property> <name>hadoop.tmp.dir</name> <value>/tmp/hadoop-${user.name}</value> <description>A base for other temporary directories.</description> </property> {code} The default comes with a parameter, which cannot be determined upfront either. AFAIK, such defaults are not contained in config classes. Investigate merging generic-history into the Timeline Store --- Key: YARN-2033 URL: https://issues.apache.org/jira/browse/YARN-2033 Project: Hadoop YARN Issue Type: Sub-task Reporter: Vinod Kumar Vavilapalli Assignee: Zhijie Shen Attachments: ProposalofStoringYARNMetricsintotheTimelineStore.pdf, YARN-2033.1.patch, YARN-2033.2.patch, YARN-2033.3.patch, YARN-2033.4.patch, YARN-2033.5.patch, YARN-2033.6.patch, YARN-2033.Prototype.patch, YARN-2033_ALL.1.patch, YARN-2033_ALL.2.patch, YARN-2033_ALL.3.patch, YARN-2033_ALL.4.patch Having two different stores isn't amenable to generic insights on what's happening with applications. This is to investigate porting generic-history into the Timeline Store. One goal is to try and retain most of the client side interfaces as close to what we have today. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2310) Revisit the APIs in RM web services where user information can make difference
[ https://issues.apache.org/jira/browse/YARN-2310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14100874#comment-14100874 ] Sunil G commented on YARN-2310: --- YARN-1867 has added queue ACL checks, and hasAccess is already invoked by the getApp and getApps APIs. If queue ACL access is available, then information of an application such as *start/finished/elapsed time* and *AM container information* will be filled into the AppInfo object. Do you mean some more extra information is taken from the customized yarn filter added in YARN-2247? Could you please give some more insight? Revisit the APIs in RM web services where user information can make difference -- Key: YARN-2310 URL: https://issues.apache.org/jira/browse/YARN-2310 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager, webapp Affects Versions: 3.0.0, 2.5.0 Reporter: Zhijie Shen After YARN-2247, RM web services can be sheltered by the authentication filter, which can help to identify who the user is. With this information, we should be able to fix the security problem of some existing APIs, such as getApp, getAppAttempts, getApps. We should use the user information to check the ACLs before returning the requested data to the user. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2390) Investigating whether generic history service needs to support queue-acls
[ https://issues.apache.org/jira/browse/YARN-2390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14100880#comment-14100880 ] Zhijie Shen commented on YARN-2390: --- [~sunilg], please feel free to assign the ticket to yourself. bq. If the rmApp state is not FinalApplicationStatus.UNDEFINED, Is this check necessary? The application can do unregistration without specifying a FinalApplicationStatus. I'm not sure whether the RM will conclude a FinalApplicationStatus on behalf of the app. Investigating whether generic history service needs to support queue-acls -- Key: YARN-2390 URL: https://issues.apache.org/jira/browse/YARN-2390 Project: Hadoop YARN Issue Type: Sub-task Reporter: Zhijie Shen According to YARN-1250, it's arguable whether queue-acls should be applied to the generic history service as well, because the queue admin may not need access to a completed application that has been removed from the queue. Creating this ticket to tackle the discussion around it. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-1514) Utility to benchmark ZKRMStateStore#loadState for ResourceManager-HA
[ https://issues.apache.org/jira/browse/YARN-1514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi OZAWA updated YARN-1514: - Attachment: YARN-1514.5.patch Utility to benchmark ZKRMStateStore#loadState for ResourceManager-HA Key: YARN-1514 URL: https://issues.apache.org/jira/browse/YARN-1514 Project: Hadoop YARN Issue Type: Sub-task Reporter: Tsuyoshi OZAWA Assignee: Tsuyoshi OZAWA Fix For: 2.6.0 Attachments: YARN-1514.1.patch, YARN-1514.2.patch, YARN-1514.3.patch, YARN-1514.4.patch, YARN-1514.4.patch, YARN-1514.5.patch, YARN-1514.wip-2.patch, YARN-1514.wip.patch ZKRMStateStore is very sensitive to ZNode-related operations as discussed in YARN-1307, YARN-1378 and so on. Especially, ZKRMStateStore#loadState is called when an RM-HA cluster does a failover. Therefore, its execution time impacts the failover time of RM-HA. We need a utility to benchmark the execution time of ZKRMStateStore#loadState as a development tool. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2310) Revisit the APIs in RM web services where user information can make difference
[ https://issues.apache.org/jira/browse/YARN-2310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14100883#comment-14100883 ] Zhijie Shen commented on YARN-2310: --- Thanks for notifying me of that. Would you please check the other app-related getter methods? For example, getAppAttempts. It seems that they can be accessed without any access control. Revisit the APIs in RM web services where user information can make difference -- Key: YARN-2310 URL: https://issues.apache.org/jira/browse/YARN-2310 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager, webapp Affects Versions: 3.0.0, 2.5.0 Reporter: Zhijie Shen After YARN-2247, RM web services can be sheltered by the authentication filter, which can help to identify who the user is. With this information, we should be able to fix the security problem of some existing APIs, such as getApp, getAppAttempts, getApps. We should use the user information to check the ACLs before returning the requested data to the user. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2190) Provide a Windows container executor that can limit memory and CPU
[ https://issues.apache.org/jira/browse/YARN-2190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chuan Liu updated YARN-2190: Attachment: YARN-2190.4.patch Attach a new patch to address the audit warning. Provide a Windows container executor that can limit memory and CPU -- Key: YARN-2190 URL: https://issues.apache.org/jira/browse/YARN-2190 Project: Hadoop YARN Issue Type: New Feature Components: nodemanager Reporter: Chuan Liu Assignee: Chuan Liu Attachments: YARN-2190-prototype.patch, YARN-2190.1.patch, YARN-2190.2.patch, YARN-2190.3.patch, YARN-2190.4.patch Yarn default container executor on Windows does not set the resource limit on the containers currently. The memory limit is enforced by a separate monitoring thread. The container implementation on Windows uses Job Object right now. The latest Windows (8 or later) API allows CPU and memory limits on the job objects. We want to create a Windows container executor that sets the limits on job objects thus provides resource enforcement at OS level. http://msdn.microsoft.com/en-us/library/windows/desktop/ms686216(v=vs.85).aspx -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Assigned] (YARN-2390) Investigating whether generic history service needs to support queue-acls
[ https://issues.apache.org/jira/browse/YARN-2390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sunil G reassigned YARN-2390: - Assignee: Sunil G Investigating whether generic history service needs to support queue-acls -- Key: YARN-2390 URL: https://issues.apache.org/jira/browse/YARN-2390 Project: Hadoop YARN Issue Type: Sub-task Reporter: Zhijie Shen Assignee: Sunil G According to YARN-1250, it's arguable whether queue-acls should be applied to the generic history service as well, because the queue admin may not need access to a completed application that has been removed from the queue. Creating this ticket to tackle the discussion around it. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2390) Investigating whether generic history service needs to support queue-acls
[ https://issues.apache.org/jira/browse/YARN-2390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14100949#comment-14100949 ] Sunil G commented on YARN-2390: --- Thank you, [~zjshen]. I have checked *RMAppImpl#getFinalApplicationStatus*. If *currentAttempt.getFinalApplicationStatus()* is null (cases where the AM has unregistered without specifying the final status), then the final status is created by the RM (by calling *RMAppImpl#createFinalApplicationStatus()*). How do you feel about this? Investigating whether generic history service needs to support queue-acls -- Key: YARN-2390 URL: https://issues.apache.org/jira/browse/YARN-2390 Project: Hadoop YARN Issue Type: Sub-task Reporter: Zhijie Shen Assignee: Sunil G According to YARN-1250, it's arguable whether queue-acls should be applied to the generic history service as well, because the queue admin may not need access to a completed application that has been removed from the queue. Creating this ticket to tackle the discussion around it. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2310) Revisit the APIs in RM web services where user information can make difference
[ https://issues.apache.org/jira/browse/YARN-2310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14100963#comment-14100963 ] Sunil G commented on YARN-2310: --- Yes, getAppAttempts and getAppState could also fall under this ACL check. The only problem is that *getAppAttempts* does not have an HttpServletRequest hsr @Context parameter. {code} public AppAttemptsInfo getAppAttempts(@PathParam("appid") String appId){code} Hence, getting the UGI information is not possible for the getAppAttempts API without an HttpServletRequest. Revisit the APIs in RM web services where user information can make difference -- Key: YARN-2310 URL: https://issues.apache.org/jira/browse/YARN-2310 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager, webapp Affects Versions: 3.0.0, 2.5.0 Reporter: Zhijie Shen After YARN-2247, RM web services can be sheltered by the authentication filter, which can help to identify who the user is. With this information, we should be able to fix the security problem of some existing APIs, such as getApp, getAppAttempts, getApps. We should use the user information to check the ACLs before returning the requested data to the user. -- This message was sent by Atlassian JIRA (v6.2#6252)
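A rough sketch of the signature change being discussed above: inject the servlet request with @Context so the caller's UGI can be derived, mirroring what getApp/getApps already do. This is a fragment for illustration only; the UGI/ACL helper calls are assumptions, not the exact RMWebServices methods.
{code}
public AppAttemptsInfo getAppAttempts(@Context HttpServletRequest hsr,
    @PathParam("appid") String appId) {
  // With the request available, the caller's identity can be resolved...
  UserGroupInformation callerUGI = getCallerUserGroupInformation(hsr, true);
  // ...and the response can be gated on hasAccess(app, callerUGI), the same check
  // getApp/getApps already perform, before building the AppAttemptsInfo response.
}
{code}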
[jira] [Commented] (YARN-2034) Description for yarn.nodemanager.localizer.cache.target-size-mb is incorrect
[ https://issues.apache.org/jira/browse/YARN-2034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14101024#comment-14101024 ] Jason Lowe commented on YARN-2034: -- The description looks OK, but the whitespace formatting of the other entries for this property was (inadvertently?) changed and the entry is now inconsistently indented. Could you please update the patch so that just the description line is modified? Thanks! Description for yarn.nodemanager.localizer.cache.target-size-mb is incorrect Key: YARN-2034 URL: https://issues.apache.org/jira/browse/YARN-2034 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 0.23.10, 2.4.0 Reporter: Jason Lowe Assignee: Chen He Priority: Minor Labels: documentation Attachments: YARN-2034.patch, YARN-2034.patch The description in yarn-default.xml for yarn.nodemanager.localizer.cache.target-size-mb says that it is a setting per local directory, but according to the code it's a setting for the entire node. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2385) Adding support for listing all applications in a queue
[ https://issues.apache.org/jira/browse/YARN-2385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14101034#comment-14101034 ] Subramaniam Krishnan commented on YARN-2385: [~sunilg], [~leftnoteasy], [~zjshen] I suggest we either open a new JIRA to discuss splitting of getAppsinQueue to getRunningAppsInQueue + getPendingAppsInQueue or update the current JIRA to reflect the discussion? Adding support for listing all applications in a queue -- Key: YARN-2385 URL: https://issues.apache.org/jira/browse/YARN-2385 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler, fairscheduler Reporter: Subramaniam Krishnan Assignee: Karthik Kambatla Labels: abstractyarnscheduler This JIRA proposes adding a method in AbstractYarnScheduler to get all the pending/active applications. Fair scheduler already supports moving a single application from one queue to another. Support for the same is being added to Capacity Scheduler as part of YARN-2378 and YARN-2248. So with the addition of this method, we can transparently add support for moving all applications from source queue to target queue and draining a queue, i.e. killing all applications in a queue as proposed by YARN-2389 -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2385) Adding support for listing all applications in a queue
[ https://issues.apache.org/jira/browse/YARN-2385?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Subramaniam Krishnan updated YARN-2385: --- Assignee: (was: Karthik Kambatla) Adding support for listing all applications in a queue -- Key: YARN-2385 URL: https://issues.apache.org/jira/browse/YARN-2385 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler, fairscheduler Reporter: Subramaniam Krishnan Labels: abstractyarnscheduler This JIRA proposes adding a method in AbstractYarnScheduler to get all the pending/active applications. Fair scheduler already supports moving a single application from one queue to another. Support for the same is being added to Capacity Scheduler as part of YARN-2378 and YARN-2248. So with the addition of this method, we can transparently add support for moving all applications from source queue to target queue and draining a queue, i.e. killing all applications in a queue as proposed by YARN-2389 -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2315) Should use setCurrentCapacity instead of setCapacity to configure used resource capacity for FairScheduler.
[ https://issues.apache.org/jira/browse/YARN-2315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14101045#comment-14101045 ] zhihai xu commented on YARN-2315: - Karthik, thanks for the review. I will implement a test case. Also, setCurrentCapacity should be getResourceUsage().getMemory()/getFairShare().getMemory() (current capacity is the percentage of your fair share that is used). I will make this change as well. Should use setCurrentCapacity instead of setCapacity to configure used resource capacity for FairScheduler. --- Key: YARN-2315 URL: https://issues.apache.org/jira/browse/YARN-2315 Project: Hadoop YARN Issue Type: Bug Reporter: zhihai xu Assignee: zhihai xu Attachments: YARN-2315.patch Should use setCurrentCapacity instead of setCapacity to configure used resource capacity for FairScheduler. In function getQueueInfo of FSQueue.java, we call setCapacity twice with different parameters, so the first call is overridden by the second call. queueInfo.setCapacity((float) getFairShare().getMemory() / scheduler.getClusterResource().getMemory()); queueInfo.setCapacity((float) getResourceUsage().getMemory() / scheduler.getClusterResource().getMemory()); We should change the second setCapacity call to setCurrentCapacity to configure the current used capacity. -- This message was sent by Atlassian JIRA (v6.2#6252)
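A sketch of what the corrected FSQueue#getQueueInfo fragment could look like, based on the snippet quoted in this issue rather than the final patch; the divide-by-zero guard is an addition here, not something from the discussion.
{code}
// Keep the fair share as the queue's "capacity"...
queueInfo.setCapacity((float) getFairShare().getMemory()
    / scheduler.getClusterResource().getMemory());
// ...and report usage relative to the fair share as "currentCapacity".
if (getFairShare().getMemory() == 0) {
  queueInfo.setCurrentCapacity(0.0f);
} else {
  queueInfo.setCurrentCapacity((float) getResourceUsage().getMemory()
      / getFairShare().getMemory());
}
{code}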
[jira] [Commented] (YARN-2190) Provide a Windows container executor that can limit memory and CPU
[ https://issues.apache.org/jira/browse/YARN-2190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14101097#comment-14101097 ] Ivan Mitic commented on YARN-2190: -- Thanks Chuan for the new patch. I have a few minor comments left: 1. {code} jcrci.CpuRate = max(1, vcores * 1 / sysinfo.dwNumberOfProcessors); {code} Did you want {{min}} here? 2. {{vcores * 1 / sysinfo.dwNumberOfProcessors}} Can you please add braces to signify that multiplication should be done before division? I think this is correct but I personally think it is better to be explicit. Provide a Windows container executor that can limit memory and CPU -- Key: YARN-2190 URL: https://issues.apache.org/jira/browse/YARN-2190 Project: Hadoop YARN Issue Type: New Feature Components: nodemanager Reporter: Chuan Liu Assignee: Chuan Liu Attachments: YARN-2190-prototype.patch, YARN-2190.1.patch, YARN-2190.2.patch, YARN-2190.3.patch, YARN-2190.4.patch Yarn default container executor on Windows does not set the resource limit on the containers currently. The memory limit is enforced by a separate monitoring thread. The container implementation on Windows uses Job Object right now. The latest Windows (8 or later) API allows CPU and memory limits on the job objects. We want to create a Windows container executor that sets the limits on job objects thus provides resource enforcement at OS level. http://msdn.microsoft.com/en-us/library/windows/desktop/ms686216(v=vs.85).aspx -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2190) Provide a Windows container executor that can limit memory and CPU
[ https://issues.apache.org/jira/browse/YARN-2190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chuan Liu updated YARN-2190: Attachment: YARN-2190.5.patch Attach a patch addressing latest comments. Thanks for review! Provide a Windows container executor that can limit memory and CPU -- Key: YARN-2190 URL: https://issues.apache.org/jira/browse/YARN-2190 Project: Hadoop YARN Issue Type: New Feature Components: nodemanager Reporter: Chuan Liu Assignee: Chuan Liu Attachments: YARN-2190-prototype.patch, YARN-2190.1.patch, YARN-2190.2.patch, YARN-2190.3.patch, YARN-2190.4.patch, YARN-2190.5.patch Yarn default container executor on Windows does not set the resource limit on the containers currently. The memory limit is enforced by a separate monitoring thread. The container implementation on Windows uses Job Object right now. The latest Windows (8 or later) API allows CPU and memory limits on the job objects. We want to create a Windows container executor that sets the limits on job objects thus provides resource enforcement at OS level. http://msdn.microsoft.com/en-us/library/windows/desktop/ms686216(v=vs.85).aspx -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2034) Description for yarn.nodemanager.localizer.cache.target-size-mb is incorrect
[ https://issues.apache.org/jira/browse/YARN-2034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chen He updated YARN-2034: -- Attachment: YARN-2034-2.patch Thank you for reviewing this, [~jlowe]. Patch updated. Description for yarn.nodemanager.localizer.cache.target-size-mb is incorrect Key: YARN-2034 URL: https://issues.apache.org/jira/browse/YARN-2034 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 0.23.10, 2.4.0 Reporter: Jason Lowe Assignee: Chen He Priority: Minor Labels: documentation Attachments: YARN-2034-2.patch, YARN-2034.patch, YARN-2034.patch The description in yarn-default.xml for yarn.nodemanager.localizer.cache.target-size-mb says that it is a setting per local directory, but according to the code it's a setting for the entire node. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Assigned] (YARN-2395) Fair Scheduler : implement fair share preemption at parent queue based on fairSharePreemptionTimeout
[ https://issues.apache.org/jira/browse/YARN-2395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei Yan reassigned YARN-2395: - Assignee: Wei Yan Fair Scheduler : implement fair share preemption at parent queue based on fairSharePreemptionTimeout Key: YARN-2395 URL: https://issues.apache.org/jira/browse/YARN-2395 Project: Hadoop YARN Issue Type: New Feature Components: fairscheduler Reporter: Ashwin Shankar Assignee: Wei Yan Currently in fair scheduler, the preemption logic considers fair share starvation only at the leaf queue level. This jira is created to implement it at the parent queue as well. It involves: 1. Making the check for fair share starvation and the amount of resource to preempt recursive, so that they traverse the queue hierarchy from root to leaf. 2. Currently fairSharePreemptionTimeout is a global config. We could make it configurable on a per-queue basis, so that we can specify different timeouts for parent queues. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Assigned] (YARN-2394) Fair Scheduler : ability to configure fairSharePreemptionThreshold per queue
[ https://issues.apache.org/jira/browse/YARN-2394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei Yan reassigned YARN-2394: - Assignee: Wei Yan Fair Scheduler : ability to configure fairSharePreemptionThreshold per queue Key: YARN-2394 URL: https://issues.apache.org/jira/browse/YARN-2394 Project: Hadoop YARN Issue Type: New Feature Components: fairscheduler Reporter: Ashwin Shankar Assignee: Wei Yan Preemption based on fair share starvation happens when usage of a queue is less than 50% of its fair share. This 50% is hardcoded. We'd like to make this configurable on a per queue basis, so that we can choose the threshold at which we want to preempt. Calling this config fairSharePreemptionThreshold. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Assigned] (YARN-2415) Expose MiniYARNCluster for use outside of YARN
[ https://issues.apache.org/jira/browse/YARN-2415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karthik Kambatla reassigned YARN-2415: -- Assignee: Wei Yan (was: Karthik Kambatla) Wei is looking into this. Expose MiniYARNCluster for use outside of YARN -- Key: YARN-2415 URL: https://issues.apache.org/jira/browse/YARN-2415 Project: Hadoop YARN Issue Type: New Feature Components: client Affects Versions: 2.5.0 Reporter: Hari Shreedharan Assignee: Wei Yan The MR/HDFS equivalents are available for applications to use in tests, but the YARN Mini cluster is not. It would be really useful to test applications that are written to run on YARN (like Spark) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2249) AM release request may be lost on RM restart
[ https://issues.apache.org/jira/browse/YARN-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14101250#comment-14101250 ] Zhijie Shen commented on YARN-2249: --- 1. Do the following in AbstractYarnScheduler.serviceInit? {code} +super.nmExpireInterval = +conf.getInt(YarnConfiguration.RM_NM_EXPIRY_INTERVAL_MS, + YarnConfiguration.DEFAULT_RM_NM_EXPIRY_INTERVAL_MS); {code} {code} +createReleaseCache(); {code} 2. Add RM_NM_EXPIRY_INTERVAL_MS in yarn-default.xml? 3. Not sure it's going to be an efficient data structure. Different apps' containers should not affect each other, right? A mutex on the whole collection seems too coarse a granularity (it blocks the allocate call). Should we use Map<AppAttemptId, List<ContainerId>> and give each app a separate mutex? {code} + private Set<ContainerId> pendingRelease = null; + private final Object mutex = new Object(); {code} AM release request may be lost on RM restart Key: YARN-2249 URL: https://issues.apache.org/jira/browse/YARN-2249 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Jian He Assignee: Jian He Attachments: YARN-2249.1.patch, YARN-2249.1.patch, YARN-2249.2.patch, YARN-2249.2.patch, YARN-2249.3.patch, YARN-2249.4.patch AM resync on RM restart will send outstanding container release requests back to the new RM. In the meantime, NMs report the container statuses back to RM to recover the containers. If RM receives the container release request before the container is actually recovered in scheduler, the container won't be released and the release request will be lost. -- This message was sent by Atlassian JIRA (v6.2#6252)
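To make point 3 above concrete, a standalone sketch of per-attempt pending-release bookkeeping, so that synchronization is scoped to one application attempt rather than one global collection; the type and method names are illustrative, not from the patch.
{code}
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class PendingReleaseSketch<AttemptId, ContainerId> {
  // One pending-release set per application attempt, so attempts do not block each other.
  private final Map<AttemptId, Set<ContainerId>> pendingRelease = new ConcurrentHashMap<>();

  /** Record containers the AM asked to release before the scheduler has recovered them. */
  public void addPending(AttemptId attempt, Set<ContainerId> containers) {
    pendingRelease.computeIfAbsent(attempt, a -> ConcurrentHashMap.<ContainerId>newKeySet())
        .addAll(containers);
  }

  /** Called when a container is recovered: release it immediately if the AM already asked. */
  public boolean shouldReleaseOnRecovery(AttemptId attempt, ContainerId container) {
    Set<ContainerId> pending = pendingRelease.get(attempt);
    return pending != null && pending.remove(container);
  }

  /** Drop the bookkeeping for a finished attempt. */
  public void clearAttempt(AttemptId attempt) {
    pendingRelease.remove(attempt);
  }
}
{code}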
[jira] [Commented] (YARN-2394) Fair Scheduler : ability to configure fairSharePreemptionThreshold per queue
[ https://issues.apache.org/jira/browse/YARN-2394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14101260#comment-14101260 ] Wei Yan commented on YARN-2394: --- I'll look into this. Fair Scheduler : ability to configure fairSharePreemptionThreshold per queue Key: YARN-2394 URL: https://issues.apache.org/jira/browse/YARN-2394 Project: Hadoop YARN Issue Type: New Feature Components: fairscheduler Reporter: Ashwin Shankar Assignee: Wei Yan Preemption based on fair share starvation happens when usage of a queue is less than 50% of its fair share. This 50% is hardcoded. We'd like to make this configurable on a per queue basis, so that we can choose the threshold at which we want to preempt. Calling this config fairSharePreemptionThreshold. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2422) yarn.scheduler.maximum-allocation-mb should not be hard-coded in yarn-default.xml
[ https://issues.apache.org/jira/browse/YARN-2422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14101274#comment-14101274 ] Sandy Ryza commented on YARN-2422: -- I think it's weird to have a nodemanager property impact what goes on in the ResourceManager. Using this property would be especially weird on heterogeneous clusters where resources vary from node to node. Preferable would be to, independently of yarn.scheduler.maximum-allocation-mb, make the ResourceManager reject any requests that are larger than the largest node in the cluster. And then default yarn.scheduler.maximum-allocation-mb to infinite. yarn.scheduler.maximum-allocation-mb should not be hard-coded in yarn-default.xml - Key: YARN-2422 URL: https://issues.apache.org/jira/browse/YARN-2422 Project: Hadoop YARN Issue Type: Bug Components: scheduler Affects Versions: 2.6.0 Reporter: Gopal V Priority: Minor Attachments: YARN-2422.1.patch A cluster with a 40Gb NM refuses to run containers larger than 8Gb. It was finally tracked down to yarn-default.xml hard-coding it to 8Gb. In the absence of a better override, it should default to ${yarn.nodemanager.resource.memory-mb} instead of a hard-coded 8Gb. -- This message was sent by Atlassian JIRA (v6.2#6252)
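To make the suggested alternative concrete, a small standalone sketch of the kind of validation being described, assuming the RM knows the size of the largest registered node; names and numbers are illustrative only.
{code}
public class RequestValidationSketch {
  // Reject requests that no node in the cluster could ever satisfy, instead of
  // relying on a hard-coded yarn.scheduler.maximum-allocation-mb.
  static void validateRequest(int requestedMb, int largestNodeMb) {
    if (requestedMb > largestNodeMb) {
      throw new IllegalArgumentException("Requested memory " + requestedMb
          + " MB exceeds the largest node in the cluster (" + largestNodeMb + " MB)");
    }
  }

  public static void main(String[] args) {
    validateRequest(8192, 40960);   // fine on a cluster with a 40 GB node
    validateRequest(65536, 40960);  // throws: no node can ever run this container
  }
}
{code}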
[jira] [Updated] (YARN-415) Capture memory utilization at the app-level for chargeback
[ https://issues.apache.org/jira/browse/YARN-415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Payne updated YARN-415: Attachment: YARN-415.201408181938.txt [~jianhe], thank you for your continuing reviews and comments. {quote} Particularly in work-preserving AM restart, current AM is actually the one who's managing previous running containers. Running containers in scheduler are already transferred to the current AM. So running containers metrics are transferred as well. I think it'll be confusing if finished containers are still charged back against the previous dead attempt. Btw, YARN-1809 will add the attempt web page where we could show attempt-specific metrics also. {quote} You are correct. In the work-preserving AM restart case, the live containers are transferred to the new attempt for the remaining lifetime of the container, and then when the container completes, the original attempt gets the CONTAINER_FINISHED event. But I see your point about being consistent in the work-preserving AM restart case. So, I have attached a patch which will charge container usage to the current attempt, whether the container is running or completed. {quote} Regarding the problem of metrics persistency. Agree that it doesn't solve the problem for running apps in general. Maybe we can have the state store changes in a separate jira and discuss more there, so that we can get this in first. {quote} Yes, I would appreciate it if we could continue this discussion on a separate JIRA. Capture memory utilization at the app-level for chargeback -- Key: YARN-415 URL: https://issues.apache.org/jira/browse/YARN-415 Project: Hadoop YARN Issue Type: New Feature Components: resourcemanager Affects Versions: 0.23.6 Reporter: Kendall Thrapp Assignee: Andrey Klochkov Attachments: YARN-415--n10.patch, YARN-415--n2.patch, YARN-415--n3.patch, YARN-415--n4.patch, YARN-415--n5.patch, YARN-415--n6.patch, YARN-415--n7.patch, YARN-415--n8.patch, YARN-415--n9.patch, YARN-415.201405311749.txt, YARN-415.201406031616.txt, YARN-415.201406262136.txt, YARN-415.201407042037.txt, YARN-415.201407071542.txt, YARN-415.201407171553.txt, YARN-415.201407172144.txt, YARN-415.201407232237.txt, YARN-415.201407242148.txt, YARN-415.201407281816.txt, YARN-415.201408062232.txt, YARN-415.201408080204.txt, YARN-415.201408092006.txt, YARN-415.201408132109.txt, YARN-415.201408150030.txt, YARN-415.201408181938.txt, YARN-415.patch For the purpose of chargeback, I'd like to be able to compute the cost of an application in terms of cluster resource usage. To start out, I'd like to get the memory utilization of an application. The unit should be MB-seconds or something similar and, from a chargeback perspective, the memory amount should be the memory reserved for the application, as even if the app didn't use all that memory, no one else was able to use it. (reserved ram for container 1 * lifetime of container 1) + (reserved ram for container 2 * lifetime of container 2) + ... + (reserved ram for container n * lifetime of container n) It'd be nice to have this at the app level instead of the job level because: 1. We'd still be able to get memory usage for jobs that crashed (and wouldn't appear on the job history server). 2. We'd be able to get memory usage for future non-MR jobs (e.g. Storm). This new metric should be available both through the RM UI and RM Web Services REST API. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-415) Capture memory utilization at the app-level for chargeback
[ https://issues.apache.org/jira/browse/YARN-415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14101341#comment-14101341 ] Eric Payne commented on YARN-415: - [~kkambatl], thank you for taking the time to review this patch. I would like to see if [~kthrapp] could comment on your use case questions, but here are my initial thoughts: {quote} 1. Is the chargeback simply to track the usage and may be financially charge the users. Or, is to influence future scheduling decisions? I agree that the RM should facilitate collecting this information, but should the collected info be available to the RM for future use? If not, do we want the RM to serve this info? {quote} Potential goals could be: # report (and charge for) grid usage # eventually limit job submission based on a users' budget {quote} 2. Do we want to charge the app only for the resources used to do meaningful work or do we also want to include failed/preempted containers? If we don't charge the app for failed containers, who are they charged to? Are we okay with letting some resources go uncharged? {quote} This implementation does charge the app for failed containers. This was debated somewhat previously in this JIRA, because if the failure was due to preemption or a bug that wasn't the app's fault, it may be unfair to charge the app for those. However, it is very unclear how one could programmatically determine whose fault the failure is. {quote} 3. How soon do we want this usage information? It might make sense to collect/expose this once the app is finished for certain kinds of applications. What is our story for long-running applications? {quote} There is a specific use case for determine the usage at runtime. Again, I would hope that [~kthrapp] could elaborate on this. Capture memory utilization at the app-level for chargeback -- Key: YARN-415 URL: https://issues.apache.org/jira/browse/YARN-415 Project: Hadoop YARN Issue Type: New Feature Components: resourcemanager Affects Versions: 0.23.6 Reporter: Kendall Thrapp Assignee: Andrey Klochkov Attachments: YARN-415--n10.patch, YARN-415--n2.patch, YARN-415--n3.patch, YARN-415--n4.patch, YARN-415--n5.patch, YARN-415--n6.patch, YARN-415--n7.patch, YARN-415--n8.patch, YARN-415--n9.patch, YARN-415.201405311749.txt, YARN-415.201406031616.txt, YARN-415.201406262136.txt, YARN-415.201407042037.txt, YARN-415.201407071542.txt, YARN-415.201407171553.txt, YARN-415.201407172144.txt, YARN-415.201407232237.txt, YARN-415.201407242148.txt, YARN-415.201407281816.txt, YARN-415.201408062232.txt, YARN-415.201408080204.txt, YARN-415.201408092006.txt, YARN-415.201408132109.txt, YARN-415.201408150030.txt, YARN-415.201408181938.txt, YARN-415.patch For the purpose of chargeback, I'd like to be able to compute the cost of an application in terms of cluster resource usage. To start out, I'd like to get the memory utilization of an application. The unit should be MB-seconds or something similar and, from a chargeback perspective, the memory amount should be the memory reserved for the application, as even if the app didn't use all that memory, no one else was able to use it. (reserved ram for container 1 * lifetime of container 1) + (reserved ram for container 2 * lifetime of container 2) + ... + (reserved ram for container n * lifetime of container n) It'd be nice to have this at the app level instead of the job level because: 1. We'd still be able to get memory usage for jobs that crashed (and wouldn't appear on the job history server). 2. 
We'd be able to get memory usage for future non-MR jobs (e.g. Storm). This new metric should be available both through the RM UI and RM Web Services REST API. -- This message was sent by Atlassian JIRA (v6.2#6252)
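To make the proposed metric concrete, here is a minimal sketch of how MB-seconds could be accumulated per application from reserved memory and container lifetimes. It is purely illustrative; the class and method names are invented and this is not the RM's actual accounting code.
{code:java}
// Illustrative accumulator for the proposed MB-seconds chargeback metric.
// "reservedMb" and the start/finish timestamps stand in for the memory reserved
// for a container and the interval during which it held that reservation.
public class MemorySecondsAccumulator {

  private long memorySeconds = 0;

  /** Charge one container: reserved MB multiplied by its lifetime in seconds. */
  public void chargeContainer(long reservedMb, long startTimeMs, long finishTimeMs) {
    long lifetimeSeconds = Math.max(0, (finishTimeMs - startTimeMs) / 1000);
    memorySeconds += reservedMb * lifetimeSeconds;
  }

  /** Total MB-seconds charged to the application so far. */
  public long getMemorySeconds() {
    return memorySeconds;
  }
}
{code}
Summing chargeContainer over all of an application's containers yields exactly the expression in the description: reserved RAM times lifetime, per container.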
[jira] [Commented] (YARN-1514) Utility to benchmark ZKRMStateStore#loadState for ResourceManager-HA
[ https://issues.apache.org/jira/browse/YARN-1514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14101381#comment-14101381 ] Hadoop QA commented on YARN-1514: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12662518/YARN-1514.5.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 3 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4666//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4666//console This message is automatically generated. Utility to benchmark ZKRMStateStore#loadState for ResourceManager-HA Key: YARN-1514 URL: https://issues.apache.org/jira/browse/YARN-1514 Project: Hadoop YARN Issue Type: Sub-task Reporter: Tsuyoshi OZAWA Assignee: Tsuyoshi OZAWA Fix For: 2.6.0 Attachments: YARN-1514.1.patch, YARN-1514.2.patch, YARN-1514.3.patch, YARN-1514.4.patch, YARN-1514.4.patch, YARN-1514.5.patch, YARN-1514.wip-2.patch, YARN-1514.wip.patch ZKRMStateStore is very sensitive to ZNode-related operations as discussed in YARN-1307, YARN-1378 and so on. In particular, ZKRMStateStore#loadState is called when an RM-HA cluster does a failover. Therefore, its execution time impacts the failover time of RM-HA. We need a utility to benchmark the execution time of ZKRMStateStore#loadState as a development tool. -- This message was sent by Atlassian JIRA (v6.2#6252)
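For readers outside the RM codebase, the core of such a benchmark is simply timing loadState against a populated store. A rough sketch, assuming a ZooKeeper ensemble reachable at localhost:2181 that already holds RM state under the configured znode (this is not the attached utility itself), might look like:
{code:java}
// Rough, illustrative timing of ZKRMStateStore#loadState; not the patch's utility.
import org.apache.hadoop.yarn.conf.YarnConfiguration;
import org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore;

public class LoadStateBenchmark {
  public static void main(String[] args) throws Exception {
    YarnConfiguration conf = new YarnConfiguration();
    // Assumption: a local ZooKeeper ensemble holding previously stored RM state.
    conf.set("yarn.resourcemanager.zk-address", "localhost:2181");

    ZKRMStateStore store = new ZKRMStateStore();
    store.init(conf);
    store.start();

    long start = System.nanoTime();
    store.loadState();  // the call whose latency drives RM-HA failover time
    long elapsedMs = (System.nanoTime() - start) / 1000000;
    System.out.println("loadState took " + elapsedMs + " ms");

    store.stop();
  }
}
{code}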
[jira] [Updated] (YARN-2249) AM release request may be lost on RM restart
[ https://issues.apache.org/jira/browse/YARN-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jian He updated YARN-2249: -- Attachment: YARN-2249.5.patch AM release request may be lost on RM restart Key: YARN-2249 URL: https://issues.apache.org/jira/browse/YARN-2249 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Jian He Assignee: Jian He Attachments: YARN-2249.1.patch, YARN-2249.1.patch, YARN-2249.2.patch, YARN-2249.2.patch, YARN-2249.3.patch, YARN-2249.4.patch, YARN-2249.5.patch AM resync on RM restart will send outstanding container release requests back to the new RM. In the meantime, NMs report the container statuses back to RM to recover the containers. If RM receives the container release request before the container is actually recovered in scheduler, the container won't be released and the release request will be lost. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2249) AM release request may be lost on RM restart
[ https://issues.apache.org/jira/browse/YARN-2249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14101443#comment-14101443 ] Jian He commented on YARN-2249: --- Thanks Zhijie for the review! bq. Do the following in AbstractYarnScheduler.serviceInit? fixed. bq. Add RM_NM_EXPIRY_INTERVAL_MS in yarn-default.xml? It is already present. bq. Not sure it's going to be an efficient data structure. Different apps' containers should not affect each other, right? A mutex on the whole collection seems to be too coarse a granularity (blocking the allocate call). Should we use Map<AppAttemptId, List<ContainerId>> and make each app have a separate mutex? I moved the pendingReleases to SchedulerApplicationAttempt and lock the attempt object instead. AM release request may be lost on RM restart Key: YARN-2249 URL: https://issues.apache.org/jira/browse/YARN-2249 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Jian He Assignee: Jian He Attachments: YARN-2249.1.patch, YARN-2249.1.patch, YARN-2249.2.patch, YARN-2249.2.patch, YARN-2249.3.patch, YARN-2249.4.patch, YARN-2249.5.patch AM resync on RM restart will send outstanding container release requests back to the new RM. In the meantime, NMs report the container statuses back to RM to recover the containers. If RM receives the container release request before the container is actually recovered in scheduler, the container won't be released and the release request will be lost. -- This message was sent by Atlassian JIRA (v6.2#6252)
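A bare-bones sketch of that per-attempt bookkeeping (names are illustrative, not the exact code in YARN-2249.5.patch): pending release requests are remembered on the attempt and consumed when the container is recovered from the NM report.
{code:java}
// Illustrative per-attempt buffer for release requests that arrive before the
// corresponding container has been recovered; not the actual patch.
import java.util.HashSet;
import java.util.Set;

import org.apache.hadoop.yarn.api.records.ContainerId;

public class AttemptPendingReleases {

  // Containers the AM asked to release that the scheduler has not recovered yet.
  private final Set<ContainerId> pendingRelease = new HashSet<ContainerId>();

  /** Called when the AM requests release of a container unknown to the scheduler. */
  public synchronized void recordPendingRelease(ContainerId containerId) {
    pendingRelease.add(containerId);
  }

  /** Called when the scheduler recovers a container from an NM status report. */
  public synchronized boolean shouldReleaseOnRecovery(ContainerId containerId) {
    // If the AM already asked for release, the recovered container can be released now.
    return pendingRelease.remove(containerId);
  }
}
{code}
Keeping the set on the attempt and synchronizing on the attempt object avoids a single lock across all applications, which was the granularity concern raised above.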
[jira] [Commented] (YARN-415) Capture memory utilization at the app-level for chargeback
[ https://issues.apache.org/jira/browse/YARN-415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14101445#comment-14101445 ] Kendall Thrapp commented on YARN-415: - {quote} 1. Is the chargeback simply to track the usage and maybe financially charge the users? Or is it to influence future scheduling decisions? I agree that the RM should facilitate collecting this information, but should the collected info be available to the RM for future use? If not, do we want the RM to serve this info? {quote} In addition to the goals [~eepayne] listed, another goal is to make it easier for users to compare how code changes to a particular recurring Hadoop job affect its resource usage. Assuming the input data size didn't significantly change, it'd be much more apparent to the user after a code change if there was a resulting significant change in the resource usage for their job. Even without charging, I'm hoping that having the resource usage shown to the user, without any extra work on their part, will make more people think about their overall grid resource usage, instead of just run times. Capture memory utilization at the app-level for chargeback -- Key: YARN-415 URL: https://issues.apache.org/jira/browse/YARN-415 Project: Hadoop YARN Issue Type: New Feature Components: resourcemanager Affects Versions: 0.23.6 Reporter: Kendall Thrapp Assignee: Andrey Klochkov Attachments: YARN-415--n10.patch, YARN-415--n2.patch, YARN-415--n3.patch, YARN-415--n4.patch, YARN-415--n5.patch, YARN-415--n6.patch, YARN-415--n7.patch, YARN-415--n8.patch, YARN-415--n9.patch, YARN-415.201405311749.txt, YARN-415.201406031616.txt, YARN-415.201406262136.txt, YARN-415.201407042037.txt, YARN-415.201407071542.txt, YARN-415.201407171553.txt, YARN-415.201407172144.txt, YARN-415.201407232237.txt, YARN-415.201407242148.txt, YARN-415.201407281816.txt, YARN-415.201408062232.txt, YARN-415.201408080204.txt, YARN-415.201408092006.txt, YARN-415.201408132109.txt, YARN-415.201408150030.txt, YARN-415.201408181938.txt, YARN-415.patch For the purpose of chargeback, I'd like to be able to compute the cost of an application in terms of cluster resource usage. To start out, I'd like to get the memory utilization of an application. The unit should be MB-seconds or something similar and, from a chargeback perspective, the memory amount should be the memory reserved for the application, as even if the app didn't use all that memory, no one else was able to use it. (reserved ram for container 1 * lifetime of container 1) + (reserved ram for container 2 * lifetime of container 2) + ... + (reserved ram for container n * lifetime of container n) It'd be nice to have this at the app level instead of the job level because: 1. We'd still be able to get memory usage for jobs that crashed (and wouldn't appear on the job history server). 2. We'd be able to get memory usage for future non-MR jobs (e.g. Storm). This new metric should be available both through the RM UI and RM Web Services REST API. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2424) LCE should support non-cgroups, non-secure mode
[ https://issues.apache.org/jira/browse/YARN-2424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14101509#comment-14101509 ] Ravi Prakash commented on YARN-2424: Thanks Tucu for pointing out the security implications of allowing unauthenticated users to run tasks as themselves (or impersonate others) on nodes. I agree that is not something we should turn on by default. That is why I think it is necessary for the default value of DEFAULT_NM_NONSECURE_MODE_LIMIT_USERS to be true. However, there is a use case, as pointed out by Allen (as a stepping stone towards turning on Kerberos), that we at Altiscale and presumably others also have (e.g. Jay's last comment on YARN-1253). Thanks for this patch Allen! I'll take a look at it. LCE should support non-cgroups, non-secure mode --- Key: YARN-2424 URL: https://issues.apache.org/jira/browse/YARN-2424 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.3.0, 2.4.0, 2.5.0, 2.4.1 Reporter: Allen Wittenauer Priority: Blocker Labels: regression Attachments: YARN-2424.patch After YARN-1253, LCE no longer works for non-secure, non-cgroup scenarios. This is a fairly serious regression, as turning on LCE prior to turning on full-blown security is a fairly standard procedure. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2394) Fair Scheduler : ability to configure fairSharePreemptionThreshold per queue
[ https://issues.apache.org/jira/browse/YARN-2394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei Yan updated YARN-2394: -- Attachment: YARN-2394-1.patch Fair Scheduler : ability to configure fairSharePreemptionThreshold per queue Key: YARN-2394 URL: https://issues.apache.org/jira/browse/YARN-2394 Project: Hadoop YARN Issue Type: New Feature Components: fairscheduler Reporter: Ashwin Shankar Assignee: Wei Yan Attachments: YARN-2394-1.patch Preemption based on fair share starvation happens when usage of a queue is less than 50% of its fair share. This 50% is hardcoded. We'd like to make this configurable on a per queue basis, so that we can choose the threshold at which we want to preempt. Calling this config fairSharePreemptionThreshold. -- This message was sent by Atlassian JIRA (v6.2#6252)
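To make the proposal concrete: the check being generalized compares a queue's usage against a fraction of its fair share. A simplified sketch with a configurable, per-queue threshold (the accessor and class names here are hypothetical; the current code effectively hardcodes 0.5) might look like:
{code:java}
// Simplified sketch of the fair-share starvation test with a configurable threshold.
// fairSharePreemptionThreshold is the proposed per-queue value; today it is a fixed 0.5.
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.util.resource.Resources;

public class StarvationCheck {

  /** A queue is starved for fair share when its usage falls below threshold * fairShare. */
  static boolean isStarvedForFairShare(Resource usage, Resource fairShare,
      double fairSharePreemptionThreshold) {
    Resource starvationLine = Resources.multiply(fairShare, fairSharePreemptionThreshold);
    return usage.getMemory() < starvationLine.getMemory();
  }
}
{code}
The per-queue configuration would then feed a different threshold into this comparison for each queue instead of the global 50%.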
[jira] [Commented] (YARN-2424) LCE should support non-cgroups, non-secure mode
[ https://issues.apache.org/jira/browse/YARN-2424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14101533#comment-14101533 ] Alejandro Abdelnur commented on YARN-2424: -- I really don't like it; it is not my business how you run your clusters, but this is dangerous, especially in a multi-tenancy scenario. From Allen's comment (the one I highlighted) it is not clear to me that this is meant only for setup/troubleshooting usage. I would not -1 this JIRA if... * the property has 'use-only-for-troubleshooting' in its name. * the NM logs print a WARN at startup and on every started container, stating the flag and its insecure nature. * the container stdout/stderr also print a WARN to alert the user of the cluster setup. LCE should support non-cgroups, non-secure mode --- Key: YARN-2424 URL: https://issues.apache.org/jira/browse/YARN-2424 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.3.0, 2.4.0, 2.5.0, 2.4.1 Reporter: Allen Wittenauer Priority: Blocker Labels: regression Attachments: YARN-2424.patch After YARN-1253, LCE no longer works for non-secure, non-cgroup scenarios. This is a fairly serious regression, as turning on LCE prior to turning on full-blown security is a fairly standard procedure. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (YARN-2424) LCE should support non-cgroups, non-secure mode
[ https://issues.apache.org/jira/browse/YARN-2424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ravi Prakash updated YARN-2424: --- Labels: (was: regression) LCE should support non-cgroups, non-secure mode --- Key: YARN-2424 URL: https://issues.apache.org/jira/browse/YARN-2424 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.3.0, 2.4.0, 2.5.0, 2.4.1 Reporter: Allen Wittenauer Priority: Blocker Attachments: YARN-2424.patch After YARN-1253, LCE no longer works for non-secure, non-cgroup scenarios. This is a fairly serious regression, as turning on LCE prior to turning on full-blown security is a fairly standard procedure. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (YARN-115) yarn commands shouldn't add m to the heapsize
[ https://issues.apache.org/jira/browse/YARN-115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allen Wittenauer resolved YARN-115. --- Resolution: Duplicate Between HADOOP-9902 and HADOOP-10950, this issue will be fully covered. Closing as a dupe. yarn commands shouldn't add m to the heapsize --- Key: YARN-115 URL: https://issues.apache.org/jira/browse/YARN-115 Project: Hadoop YARN Issue Type: Improvement Affects Versions: 0.23.3 Reporter: Thomas Graves Labels: usability The yarn commands add "m" to the heapsize. This is unlike the HDFS side and what the old JT/TT used to do. JAVA_HEAP_MAX=-Xmx$YARN_RESOURCEMANAGER_HEAPSIZEm JAVA_HEAP_MAX=-Xmx$YARN_NODEMANAGER_HEAPSIZEm We should not be adding in the "m"; we should allow the user to specify units. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1372) Ensure all completed containers are reported to the AMs across RM restart
[ https://issues.apache.org/jira/browse/YARN-1372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14101568#comment-14101568 ] Jian He commented on YARN-1372: --- bq. Not sure if there is an easier way to link the two right now as the application cleanup lifecycle also converts into a Container Kill just like any other container Kill. I meant: can we remove all the containers in NMContext once we receive the NodeHeartbeatResponse#getApplicationsToCleanup notification, instead of depending on expiration? Because applications are already completed by the time the applicationsToCleanUp is received, the containers kept in NMContext may not be needed any more. bq. This is to allow a separate set of justFinishedContainers that can be used for returning to AM and at the same time acknowledging the previously returned set to NM. Can the same justFinishedContainers set be used to return to the AM and ack the NMs? bq. DECOMMISSIONED/LOST state possible to receive the new event? Sorry for being unclear. I meant: is it possible for an NM in the DECOMMISSIONED/LOST state to receive the newly added CLEANEDUP_CONTAINER_NOTIFIED event? If so, we need to handle that too. The patch is not applying anymore. Can you update the patch please? Thanks. Ensure all completed containers are reported to the AMs across RM restart - Key: YARN-1372 URL: https://issues.apache.org/jira/browse/YARN-1372 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Bikas Saha Assignee: Anubhav Dhoot Attachments: YARN-1372.001.patch, YARN-1372.001.patch, YARN-1372.prelim.patch, YARN-1372.prelim2.patch Currently the NM informs the RM about completed containers and then removes those containers from the RM notification list. The RM passes on that completed container information to the AM and the AM pulls this data. If the RM dies before the AM pulls this data then the AM may not be able to get this information again. To fix this, the NM should maintain a separate list of such completed container notifications sent to the RM. After the AM has pulled the containers from the RM then the RM will inform the NM about it and the NM can remove the completed container from the new list. Upon re-register with the RM (after RM restart) the NM should send the entire list of completed containers to the RM along with any other containers that completed while the RM was dead. This ensures that the RM can inform the AMs about all completed containers. Some container completions may be reported more than once since the AM may have pulled the container but the RM may die before notifying the NM about the pull. -- This message was sent by Atlassian JIRA (v6.2#6252)
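The bookkeeping being described, reduced to its essentials (the class and method names below are invented for illustration; this is not the attached patch), is a map of completed-container statuses that survives until the RM confirms the AM has pulled them:
{code:java}
// Illustrative NM-side tracking of completed containers across an RM restart.
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.hadoop.yarn.api.records.ContainerId;
import org.apache.hadoop.yarn.api.records.ContainerStatus;

public class CompletedContainerTracker {

  // Completed containers reported to the RM but not yet acknowledged as pulled by the AM.
  private final Map<ContainerId, ContainerStatus> unacked =
      new HashMap<ContainerId, ContainerStatus>();

  public synchronized void containerCompleted(ContainerStatus status) {
    unacked.put(status.getContainerId(), status);
  }

  /** The RM (via a heartbeat response) confirmed the AM pulled these statuses. */
  public synchronized void ackPulledByAM(List<ContainerId> pulled) {
    for (ContainerId id : pulled) {
      unacked.remove(id);
    }
  }

  /** On re-register after an RM restart, resend everything never acknowledged. */
  public synchronized List<ContainerStatus> statusesToResend() {
    return new ArrayList<ContainerStatus>(unacked.values());
  }
}
{code}
As the issue description notes, resending unacknowledged statuses means some completions may be reported more than once, which the RM/AM side has to tolerate.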
[jira] [Commented] (YARN-1919) Log yarn.resourcemanager.cluster-id is required for HA instead of throwing NPE
[ https://issues.apache.org/jira/browse/YARN-1919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14101605#comment-14101605 ] Jian He commented on YARN-1919: --- looks good to me. Log yarn.resourcemanager.cluster-id is required for HA instead of throwing NPE -- Key: YARN-1919 URL: https://issues.apache.org/jira/browse/YARN-1919 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.3.0, 2.4.0, 2.5.0 Reporter: Devaraj K Assignee: Tsuyoshi OZAWA Priority: Minor Attachments: YARN-1919.1.patch, YARN-1919.2.patch {code:xml} 2014-04-09 16:14:16,392 WARN org.apache.hadoop.service.AbstractService: When stopping the service org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService : java.lang.NullPointerException java.lang.NullPointerException at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.serviceStop(EmbeddedElectorService.java:108) at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221) at org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:52) at org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:80) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:171) at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107) at org.apache.hadoop.yarn.server.resourcemanager.AdminService.serviceInit(AdminService.java:122) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceInit(ResourceManager.java:232) at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:1038) {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (YARN-2386) Refactor common scheduler configurations into a base ResourceSchedulerConfig class
[ https://issues.apache.org/jira/browse/YARN-2386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Subramaniam Krishnan resolved YARN-2386. Resolution: Invalid I took a look at both scheduler configs and, unfortunately, the configurations are so disparate that there isn't much in common to refactor out. Refactor common scheduler configurations into a base ResourceSchedulerConfig class -- Key: YARN-2386 URL: https://issues.apache.org/jira/browse/YARN-2386 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Subramaniam Krishnan Assignee: Subramaniam Krishnan As discussed with [~leftnoteasy], [~jianhe] and [~kasha], this JIRA proposes refactoring common configuration from the Capacity and Fair Schedulers into a common base class to avoid duplicating configs. Currently the Capacity and Fair Scheduler configs directly extend Configuration; adding a common base ResourceScheduler config class would also align with the ResourceScheduler hierarchy and enable other systems like the reservation system (YARN-2080) to be scheduler-implementation agnostic. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (YARN-2428) LCE default banned user list should have yarn
Allen Wittenauer created YARN-2428: -- Summary: LCE default banned user list should have yarn Key: YARN-2428 URL: https://issues.apache.org/jira/browse/YARN-2428 Project: Hadoop YARN Issue Type: Bug Reporter: Allen Wittenauer When task-controller was retrofitted to YARN, the default banned user list didn't add yarn. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (YARN-2429) LCE should blacklist based upon group
Allen Wittenauer created YARN-2429: -- Summary: LCE should blacklist based upon group Key: YARN-2429 URL: https://issues.apache.org/jira/browse/YARN-2429 Project: Hadoop YARN Issue Type: New Feature Reporter: Allen Wittenauer It should be possible to list a group to ban, not just individual users. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2424) LCE should support non-cgroups, non-secure mode
[ https://issues.apache.org/jira/browse/YARN-2424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14101677#comment-14101677 ] Ravi Prakash commented on YARN-2424: Hi Tucu! Thanks for your comment. There is currently the capability to blacklist/whitelist users in the container-executor.cfg file. Given this capability, do you think that, in a properly configured cluster, YARN tasks launching as different users could create problems? This is with the assumption that most clusters do not have NFS mounts on the slave nodes. As an aside, I think it would be good to add a blacklist + whitelist for groups as well. LCE should support non-cgroups, non-secure mode --- Key: YARN-2424 URL: https://issues.apache.org/jira/browse/YARN-2424 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.3.0, 2.4.0, 2.5.0, 2.4.1 Reporter: Allen Wittenauer Priority: Blocker Attachments: YARN-2424.patch After YARN-1253, LCE no longer works for non-secure, non-cgroup scenarios. This is a fairly serious regression, as turning on LCE prior to turning on full-blown security is a fairly standard procedure. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2430) FairShareComparator: cache the results of getResourceUsage()
[ https://issues.apache.org/jira/browse/YARN-2430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14101686#comment-14101686 ] Maysam Yabandeh commented on YARN-2430: --- Here are the current alternative solutions: 1. A simple, quick fix would be to cache the result of getResourceUsage in a field of Schedulable and invalidate the cache after each scheduling. The invalidation requires iteration over all schedulables, with cost O(n). 2. Alternatively, as suggested by Karthik, the cached result could be updated periodically as part of the UpdateThread. This approach would also encourage moving the sorting to the UpdateThread, since the sort algorithm would no longer be provided with the most up-to-date data. 3. Karthik also brought up the option of a bottom-up update of the resource usage when something gets updated: each Schedulable pushes up the change in its resource usage after each change. This would require invoking the push-up method at the right places. Care must be taken in future changes not to forget calling the push-up method. I would highly appreciate comments. FairShareComparator: cache the results of getResourceUsage() Key: YARN-2430 URL: https://issues.apache.org/jira/browse/YARN-2430 Project: Hadoop YARN Issue Type: Improvement Reporter: Maysam Yabandeh Assignee: Maysam Yabandeh The compare method of FairShareComparator has 3 invocations of getResourceUsage per comparable object. In the case of queues, the implementation of getResourceUsage requires iterating over the apps and adding up their current usage. The compare method can reuse the result of getResourceUsage to reduce the load to a third. However, to further reduce the load, the result of getResourceUsage can be cached in FSLeafQueue. This would be more efficient since the number of invocations of the compare method on each Comparable object is >= 1. -- This message was sent by Atlassian JIRA (v6.2#6252)
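To illustrate the simplest of these (option 1), a cached-and-invalidated usage field might look roughly like the following; the names are hypothetical and the real Schedulable/FSLeafQueue code differs:
{code:java}
// Sketch of option 1: cache the aggregated usage, recompute only after invalidation.
import org.apache.hadoop.yarn.api.records.Resource;

public abstract class CachedUsageSchedulable {

  private Resource cachedUsage;  // null means "stale, recompute on next read"

  /** Expensive aggregation over child apps/queues; provided by the concrete class. */
  protected abstract Resource computeResourceUsage();

  public synchronized Resource getResourceUsage() {
    if (cachedUsage == null) {
      cachedUsage = computeResourceUsage();
    }
    return cachedUsage;
  }

  /** Called after each scheduling pass (an O(n) sweep over schedulables). */
  public synchronized void invalidateUsage() {
    cachedUsage = null;
  }
}
{code}
Options 2 and 3 trade this per-pass invalidation for, respectively, a periodic refresh in the UpdateThread and incremental bottom-up propagation of usage deltas.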
[jira] [Created] (YARN-2430) FairShareComparator: cache the results of getResourceUsage()
Maysam Yabandeh created YARN-2430: - Summary: FairShareComparator: cache the results of getResourceUsage() Key: YARN-2430 URL: https://issues.apache.org/jira/browse/YARN-2430 Project: Hadoop YARN Issue Type: Improvement Reporter: Maysam Yabandeh Assignee: Maysam Yabandeh The compare method of FairShareComparator has 3 invocations of getResourceUsage per comparable object. In the case of queues, the implementation of getResourceUsage requires iterating over the apps and adding up their current usage. The compare method can reuse the result of getResourceUsage to reduce the load to a third. However, to further reduce the load, the result of getResourceUsage can be cached in FSLeafQueue. This would be more efficient since the number of invocations of the compare method on each Comparable object is >= 1. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2430) FairShareComparator: cache the results of getResourceUsage()
[ https://issues.apache.org/jira/browse/YARN-2430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14101732#comment-14101732 ] Sandy Ryza commented on YARN-2430: -- I believe #3 is the best approach, as it's more performant than #1, and #2 has correctness issues. I actually implemented it a little while ago as part of YARN-1297 and will try to get that in. FairShareComparator: cache the results of getResourceUsage() Key: YARN-2430 URL: https://issues.apache.org/jira/browse/YARN-2430 Project: Hadoop YARN Issue Type: Improvement Reporter: Maysam Yabandeh Assignee: Maysam Yabandeh The compare method of FairShareComparator has 3 invocations of getResourceUsage per comparable object. In the case of queues, the implementation of getResourceUsage requires iterating over the apps and adding up their current usage. The compare method can reuse the result of getResourceUsage to reduce the load to a third. However, to further reduce the load, the result of getResourceUsage can be cached in FSLeafQueue. This would be more efficient since the number of invocations of the compare method on each Comparable object is >= 1. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-2034) Description for yarn.nodemanager.localizer.cache.target-size-mb is incorrect
[ https://issues.apache.org/jira/browse/YARN-2034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14101822#comment-14101822 ] Hadoop QA commented on YARN-2034: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12662550/YARN-2034-2.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+0 tests included{color}. The patch appears to be a documentation patch that doesn't require tests. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4668//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4668//console This message is automatically generated. Description for yarn.nodemanager.localizer.cache.target-size-mb is incorrect Key: YARN-2034 URL: https://issues.apache.org/jira/browse/YARN-2034 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 0.23.10, 2.4.0 Reporter: Jason Lowe Assignee: Chen He Priority: Minor Labels: documentation Attachments: YARN-2034-2.patch, YARN-2034.patch, YARN-2034.patch The description in yarn-default.xml for yarn.nodemanager.localizer.cache.target-size-mb says that it is a setting per local directory, but according to the code it's a setting for the entire node. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1297) Miscellaneous Fair Scheduler speedups
[ https://issues.apache.org/jira/browse/YARN-1297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14101836#comment-14101836 ] Karthik Kambatla commented on YARN-1297: I can take a look at an updated patch. Miscellaneous Fair Scheduler speedups - Key: YARN-1297 URL: https://issues.apache.org/jira/browse/YARN-1297 Project: Hadoop YARN Issue Type: Improvement Components: scheduler Reporter: Sandy Ryza Assignee: Sandy Ryza Attachments: YARN-1297-1.patch, YARN-1297-2.patch, YARN-1297.patch, YARN-1297.patch I ran the Fair Scheduler's core scheduling loop through a profiler tool and identified a bunch of minimally invasive changes that can shave off a few milliseconds. The main one is demoting a couple INFO log messages to DEBUG, which brought my benchmark down from 16000 ms to 6000. A few others (which had way less of an impact) were * Most of the time in comparisons was being spent in Math.signum. I switched this to direct ifs and elses and it halved the percent of time spent in comparisons. * I removed some unnecessary instantiations of Resource objects * I made it so that queues' usage wasn't calculated from the applications up each time getResourceUsage was called. -- This message was sent by Atlassian JIRA (v6.2#6252)
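For readers curious about the Math.signum point above, the change amounts to something like the following simplified before/after; this is an illustration, not the exact diff in the patch.
{code:java}
// Simplified before/after of the comparator tweak described above.
public final class RatioCompare {

  /** Original style: subtract and take the sign of a double. */
  static int compareWithSignum(double useToWeightA, double useToWeightB) {
    return (int) Math.signum(useToWeightA - useToWeightB);
  }

  /** Cheaper style: direct comparisons avoid the floating-point signum call. */
  static int compareWithIfs(double useToWeightA, double useToWeightB) {
    if (useToWeightA < useToWeightB) {
      return -1;
    } else if (useToWeightA > useToWeightB) {
      return 1;
    } else {
      return 0;
    }
  }
}
{code}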